
Hey, Cortana – How can I integrate my app into you?

At the recent IFA 2014 show in Berlin, Microsoft announced some great new phones: the budget Lumia 730 and 735, and the ‘affordable premium’ Lumia 830. Exciting new hardware – but the standout new feature for many was the announcement of ‘Hey, Cortana’, a new accessibility feature for the popular voice-controlled assistant that lets you launch the app at any time just by saying ‘Hey, Cortana’, avoiding the need to press the search button or the Live Tile as you do currently. This feature will be available on the high-end devices with the Snapdragon 800 processor, such as the Lumia 930, 1520 and Icon, and will ship as part of the Lumia ‘Denim’ firmware update coming to phones in Q4 2014. Cortana has been getting many plaudits – rightly so. It is currently available in the US and in beta in the UK and China, will soon roll out as a beta to Australia, India and Canada, and will reach more countries next year.


Speech control is part of the next wave of improvements in how we interact with our devices. It belongs to a loose grouping of technologies often called NUI (Natural User Interface): human-to-device interaction methods that are effectively invisible, i.e. ‘natural’, as distinct from artificial ways of interacting with a device, such as pressing a button, which have to be learnt.

So having Cortana capable of passive listening – so you can wake it just by talking to it whenever the fancy takes you – is another small step towards making speech interaction with our devices easier and more natural. It removes one more piece of friction – the need to tap a button or tile to start the voice interaction – that could stop voice becoming users’ favoured means of interacting with their devices rather than one used only by enthusiasts.

Integrating Apps into Cortana

So that’s great: now we can say ‘Hey, Cortana’ to invoke the built-in digital assistant features of Cortana and get it to tell us upcoming events in our calendar, set new reminders, do searches and so on. But what if you want to allow users to interact with your app in a similar way, using Cortana as a conduit for speech interaction with your app? Say something like “Hey Cortana” to invoke Cortana, and then say “MSDN find ‘Windows Phone Voice Commands’” to invoke the MSDN Voice Search app.

Well, that’s not too hard! In fact, we’ve had the technology even in Windows Phone 8.0, before Cortana was a twinkle in Halo’s eye. Back in the Windows Phone 8.0 Jump Start videos, we recorded one session all about voice commands (http://channel9.msdn.com/Series/Building-Apps-for-Windows-Phone-8-Jump-Start/Building-Apps-for-Windows-Phone-8-Jump-Start-13-Speech-Input-in-Windows-Phone-8 ) which explained how to create a Voice Command definition file to allow you to launch apps with speech. It’s that same technology that is used to launch apps through speech from Cortana.

Although it was quite easy to implement, not many developers added voice command activation to their 8.0 apps, and very few users turned to voice as their favoured method of launching their favourite app. Why not? Discoverability! To activate voice recognition on a Windows Phone 8.0 device, you had to press and hold the Windows button. Most users only ever discovered this by accident, if at all! As a consequence, it never gained wide usage.

That’s why Cortana is so cool – everyone knows that it is the voice portal on a Windows Phone. It’s no longer down to knowing about some unnatural user interaction mechanism – a tap and hold on the Windows key – but now you invoke Cortana with a single tap on the Search button, by tapping on the Live Tile, or even by launching it from the Apps list. Or – better by far – by simply saying ‘Hey, Cortana’ on a top-end phone with the new Lumia Denim software. [Note that in countries that do not have Cortana yet, you can still launch apps by voice command using the phone’s built-in speech feature.]

Voice Command Recognition in Windows Phone 8.0

The Voice Commands implementation in Windows Phone 8.0 had another limitation apart from discoverability: it was a little too inflexible. With Windows Phone 8.0 Voice Commands, the full text of the commands themselves needed to be known ahead of time—this made them difficult to use for searching, social networking, or anything else with a large vocabulary.

For example, a Windows Phone 8.0 voice command definition looks like this:

    <CommandSet xml:lang="en-gb" Name="UKenglishCommands">
        <!-- The CommandPrefix provides an alternative to your full app name for invocation -->
        <CommandPrefix> MSDN </CommandPrefix>
        <!-- The CommandSet Example appears in the global help alongside your app name -->
        <Example> search </Example>

        <Command Name="MSDNSearch">
            <!-- The Command example appears in the drill-down help page for your app -->
            <Example> do a search </Example>

            <!-- ListenFor elements provide ways to say the command, including references to
            {PhraseLists} as well as [optional] words -->
            <ListenFor> [do a] Search [for] {*} </ListenFor>
            <ListenFor> Find {*} </ListenFor>
            <ListenFor> Look [for] {*} </ListenFor>

            <!--Feedback provides the displayed and spoken text when your command is triggered -->
            <Feedback> Searching MSDN... </Feedback>

            <!-- Navigate specifies the desired page or invocation destination for the Command-->
            <Navigate Target="MainPage.xaml" />
        </Command>
    </CommandSet>

Every command must start with the CommandPrefix – ‘MSDN’ in this case. Thereafter, the ListenFor elements define the different variants of a specific speech command, and these may include optional words enclosed in square brackets. There is also the ability to allow free speech, represented by {*}, but that still has to be preceded by some fixed speech component in the command. So for this example, all of the following will trigger the voice command:

  • Search xyz
  • Search for xyz
  • Find xyz
  • Look xyz
  • Look for xyz

But the following will *not* be recognised because it doesn’t fit into one of the ListenFor definitions:

  • I want to know about xyz
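
When a recognised command launches the app, the page named in the Navigate element (MainPage.xaml above) receives the details in its navigation query string. Here is a minimal sketch of the code-behind for a Windows Phone 8.0 Silverlight page – the parameter names voiceCommandName and reco are supplied by the OS; what you do with them is your own app logic:

        // In MainPage.xaml.cs of a Windows Phone 8.0 Silverlight app
        protected override void OnNavigatedTo(System.Windows.Navigation.NavigationEventArgs e)
        {
            base.OnNavigatedTo(e);

            string commandName;
            if (NavigationContext.QueryString.TryGetValue("voiceCommandName", out commandName))
            {
                // "reco" carries the recognizer's text for the whole utterance
                string recoText;
                NavigationContext.QueryString.TryGetValue("reco", out recoText);

                if (commandName == "MSDNSearch")
                {
                    // The MSDNSearch command fired via one of its ListenFor variants;
                    // kick off the app's search experience here
                }
            }
        }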

Recognizing natural language in Windows Phone 8.1

The Voice Command Definition for 8.1 allows more flexibility:

<?xml version="1.0" encoding="utf-8"?>

<!-- Be sure to use the new v1.1 namespace to utilize the new PhraseTopic feature -->
<VoiceCommands xmlns="http://schemas.microsoft.com/voicecommands/1.1">
    <!-- The CommandSet Name is used to programmatically access the CommandSet -->
    <CommandSet xml:lang="en-gb" Name="UKenglishCommands">
        <!-- The CommandPrefix provides an alternative to your full app name for invocation -->
        <CommandPrefix>MSDN</CommandPrefix>
        <!-- The CommandSet Example appears in the global help alongside your app name -->
        <Example> find 'Windows Phone Voice Commands' </Example>

        <Command Name="MSDNSearch">
            <!-- The Command example appears in the drill-down help page for your app -->
            <Example> find 'how to install CommandSets' </Example>

            <!-- ListenFor elements provide ways to say the command, including references to 
            {PhraseLists} and {PhraseTopics} as well as [optional] words -->
            <ListenFor> Search </ListenFor>
            <ListenFor> Search [for] {dictatedSearchTerms} </ListenFor>
            <ListenFor> Find {dictatedSearchTerms} </ListenFor>
            <ListenFor> Find </ListenFor>

            <!--Feedback provides the displayed and spoken text when your command is triggered -->
            <Feedback> Searching MSDN... </Feedback>

            <!-- Navigate specifies the desired page or invocation destination for the Command-->
            <Navigate Target="MainPage.xaml" />
        </Command>

        <Command Name="MSDNNaturalLanguage">
            <Example> I want to go to the Windows Phone Dev center </Example>
            <ListenFor> {naturalLanguage} </ListenFor>
            <Feedback> Starting MSDN... </Feedback>
            <Navigate Target="MainPage.xaml" />
        </Command>

        <PhraseTopic Label="dictatedSearchTerms" Scenario="Search">
            <Subject> MSDN </Subject>
        </PhraseTopic>

        <PhraseTopic Label="naturalLanguage" Scenario="Natural Language">
            <Subject> MSDN </Subject>
        </PhraseTopic>

    </CommandSet>
</VoiceCommands>

Notice the following ListenFor elements:

        <Command Name="MSDNSearch">
           …
            <ListenFor> Search [for] {dictatedSearchTerms} </ListenFor>
           …
        </Command>
        <Command Name="MSDNNaturalLanguage">
            …
            <ListenFor> {naturalLanguage} </ListenFor>
            …
        </Command>

Both of these elements contain a label inside curly brackets, which refers to one of the PhraseTopic elements defined towards the bottom of the file. First, dictatedSearchTerms is defined as:

        <PhraseTopic Label="dictatedSearchTerms" Scenario="Search">
            <Subject> MSDN </Subject>
        </PhraseTopic>

A PhraseTopic is a new element in 8.1 which specifies a topic for large-vocabulary recognition. What this means is that the voice input goes up to the cloud, where it is recognized against the same large vocabulary of words that Cortana uses for its built-in functions. The Scenario attribute is used to guide the recognition. Valid values are “Natural Language”, “Search”, “Short Message”, “Dictation” (the default), “Commands”, and “Form Filling”. The Subject child element specifies a subject specific to the parent PhraseTopic’s Scenario attribute, to further refine the relevance of speech recognition results.

So the command that uses this ListenFor definition allows free speech searches in MSDN.

The second ListenFor element has no fixed command terms at all:

<ListenFor> {naturalLanguage} </ListenFor>

Where naturalLanguage is defined as:

        <PhraseTopic Label="naturalLanguage" Scenario="Natural Language">
            <Subject> MSDN </Subject>
        </PhraseTopic>

Effectively, this defines an ‘anything goes’ command, so this can recognize a query such as “I want to know about xyz”. Consider this a free-form alternative to the other, more rigid, command labelled MSDNSearch where the ListenFor elements define fixed components of the speech command (‘Find’, ‘Search’) that must be present for recognition to succeed.

But how then does the app know what it should do when “Natural Language” speech input is recognized? There’s no magic here: you have to code your app to look for elements in the recognized speech that are relevant to your app’s problem domain. The MSDN Voice Search sample app (link in the ‘Further Reading’ section at the end) is coded to recognise variants on ‘Take me to the Windows Phone Dev Center’ and, if none of those is found, falls back to a default of a Bing search for whatever speech input was recognized. We’ll look at how this is done in just a moment.

Handling Speech Commands in your app

I’m not going to go into all the coding techniques used to handle voice activation here. Watch the BUILD video (link below in ‘Further Reading’) for an excellent overview of handling voice activation in both Silverlight and Windows Runtime apps, and download the sample app (link also below). But I will just call out how to handle Natural Language input.
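
Before looking at that, it helps to see roughly how the recognition result reaches your code in the first place. The following is a sketch for a Windows Runtime app rather than the sample’s exact code (the Silverlight flavour of the sample receives the same information through the target page’s navigation query string instead):

        // In App.xaml.cs of a Windows Runtime (XAML) app
        // using Windows.ApplicationModel.Activation;
        // using Windows.Media.SpeechRecognition;
        protected override void OnActivated(IActivatedEventArgs args)
        {
            if (args.Kind == ActivationKind.VoiceCommand)
            {
                var voiceArgs = (VoiceCommandActivatedEventArgs)args;
                SpeechRecognitionResult result = voiceArgs.Result;

                // RulePath[0] is the Name of the Command that fired,
                // e.g. "MSDNSearch" or "MSDNNaturalLanguage"
                string commandName = result.RulePath[0];

                // The full text that was recognized
                string spokenText = result.Text;

                // The content of the {dictatedSearchTerms} PhraseTopic, if the MSDNSearch command fired
                System.Collections.Generic.IReadOnlyList<string> searchTerms;
                if (result.SemanticInterpretation.Properties.TryGetValue("dictatedSearchTerms", out searchTerms))
                {
                    // e.g. navigate to a search page, passing searchTerms[0]
                }

                // For the MSDNNaturalLanguage command, spokenText is what you pass on
                // to free-form processing such as the sample's TryHandleNlQuery below.
            }
        }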

The sample app contains the following code for handling this:

        /// <summary>
        /// Given a query string from the user, attempts rudimentary "natural language" string processing in an
        /// effort to derive a specific, curated action from the text; if a match is found, that curated action
        /// is taken. If not, the string is unhandled and should be handled elsewhere.
        /// </summary>
        /// <param name="query"> the query to attempt processing and action upon </param>
        /// <param name="actSilently"> whether or not to only take actions without audio feedback </param>
        /// <returns> true if a curated action was found and taken; false if the query still needs handling </returns>
        private bool TryHandleNlQuery(string query, bool actSilently)
        {
            // There are a variety of ways to say things like "I want to go to Windows Phone Dev Center"; let's load
            // some alternatives for the key components of this.
            string[] intentMarkers = AppResources.NaturalLanguageCommandIntentMarkers.Split(new char[] { ';' });
            string[] wpDevCenterNames = AppResources.WPDevCenterNames.Split(new char[] { ';' });

            int intentIndex = -1;
            int destinationIndex = -1;

            Uri destinationUri = null;
            string confirmationTts = null;

            // First we'll try to find a match for the "intent marker," e.g. "go to"
            foreach (string marker in intentMarkers)
            {
                intentIndex = query.IndexOf(marker, StringComparison.InvariantCultureIgnoreCase);
                if (intentIndex >= 0)
                {
                    break;
                }
            }

            if (intentIndex >= 0)
            {
                // Now we'll try to figure out a destination--if it comes after the intent marker in the string, we'll
                // store the destination and spoken feedback.
                foreach (string wpDevCenterName in wpDevCenterNames)
                {
                    destinationIndex = query.IndexOf(wpDevCenterName, StringComparison.InvariantCultureIgnoreCase);
                    if (destinationIndex > intentIndex)
                    {
                        destinationUri = new Uri(AppResources.WPDevCenterURL);
                        confirmationTts = AppResources.SpokenWPDevCenterSsml;
                        break;
                    }
                }
            }

            // If we found a destination to go to, we'll go there and--if allowed--provide the corresponding spoken
            // feedback.
            if (destinationUri != null)
            {
                if (!actSilently && (confirmationTts != null))
                {
                    StartSpeakingSsml(confirmationTts);
                }

                StartBrowserNavigation(destinationUri);
            }

            // If we found a destination, we handled the query. Otherwise, it hasn't been handled yet.
            return (destinationUri != null);
        }

That’s a lot of code, and essentially all it does is look for a number of different variants on ‘I want to go to the Windows Phone Dev Center’. Key to understanding it is knowing that the command intents and the variations of the Dev Center name are defined in the app’s resources file:

NaturalLanguageCommandIntentMarkers: show me;go to;take me to;bring up;navigate to;launch
WPDevCenterNames: Phone Dev Center;Phone Developer Center

So all the code shown above actually does is search the recognized text for one of the NaturalLanguageCommandIntentMarkers followed by one of the WPDevCenterNames options. If there is a match, the app takes you to the Windows Phone Dev Center. If you follow through the sample code, you’ll see that if no match is found, the default fallback action is simply to invoke a Bing search for the recognized text.

Like I said – there’s no magic here. You just have to write code to analyse the recognized speech for strings that make sense for your app, and this could be quite a lot of code!

Parting thoughts…

Before I close, three further thoughts on Speech enabling your apps:

First, make sure you include Command definitions in your VCD file for all the languages you want to support. The sample app has commands defined only for US English:

    <!-- The CommandSet Name is used to programmatically access the CommandSet -->
    <CommandSet xml:lang="en-us" Name="englishCommands">

which means that when you run it on a UK English or Australian English device, the voice commands don’t work! You have to define a CommandSet for each individual language you want to support, which could get a bit tedious. But be warned – it caught me out!
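
If you want to check what has actually been registered on a given device, the Windows Runtime API lets you enumerate the installed command sets. A small diagnostic sketch – it just writes what it finds to the debugger output:

        // using System.Diagnostics;
        // using Windows.Media.SpeechRecognition;
        private static void DumpInstalledCommandSets()
        {
            // Lists the command sets the system has registered for this app, with their languages
            foreach (VoiceCommandSet set in VoiceCommandManager.InstalledCommandSets.Values)
            {
                Debug.WriteLine("CommandSet '{0}' ({1})", set.Name, set.Language);
            }
        }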

Secondly, make sure you detect when the user has typed into Cortana rather than spoken to it. If they typed, don’t go speaking back to them – quite possibly they typed because they were in a meeting, so it would not be helpful to start reading out sports results or whatever! The sample app and the BUILD video show you how to do this.
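
The check itself is small. In a Windows Runtime app it is a sketch along these lines – Cortana supplies a semantic property named commandMode whose value is “voice” or “text”:

        // Pass in the SpeechRecognitionResult obtained from VoiceCommandActivatedEventArgs.Result
        private static bool WasSpoken(Windows.Media.SpeechRecognition.SpeechRecognitionResult result)
        {
            // "voice" means the user spoke the command; "text" means they typed it into Cortana
            string commandMode = result.SemanticInterpretation.Properties["commandMode"][0];
            return commandMode == "voice";
        }

If WasSpoken returns false, keep any feedback on screen rather than speaking it.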

And thirdly, your user has to run the app at least once so that the code which registers the speech commands with the system gets a chance to run. Unfortunately, speech commands do not get activated just by installing the app. So the first time the user runs the app might be a good time to pop up a ‘Did you know?’ screen to tell them about the speech commands your app supports. It all helps towards users’ discovery of these kinds of features.
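
For reference, that registration code is tiny. A minimal sketch for a Windows Runtime app, assuming the VCD file is packaged in the project root as VoiceCommandDefinition.xml (the file name is just an example – use whatever yours is called):

        // using System;
        // using Windows.Media.SpeechRecognition;
        // using Windows.Storage;
        private async System.Threading.Tasks.Task InstallVoiceCommandsAsync()
        {
            // Load the VCD file shipped in the app package...
            StorageFile vcdFile = await StorageFile.GetFileFromApplicationUriAsync(
                new Uri("ms-appx:///VoiceCommandDefinition.xml"));

            // ...and register its command sets with the system (call this from OnLaunched, for example)
            await VoiceCommandManager.InstallCommandSetsFromStorageFileAsync(vcdFile);
        }

Silverlight apps do the equivalent with VoiceCommandService.InstallCommandSetsFromFileAsync.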

Further reading

· Video: Integrating Your App into the Windows Phone Speech Experience http://channel9.msdn.com/Events/Build/2014/2-530

· Sample: MSDN Voice Search for Windows Phone 8.1 http://code.msdn.microsoft.com/MSDN-Voice-Search-for-95c16d92

· Speech team blog post: Integrating Store Apps with Cortana in Windows Phone 8.1 http://blogs.bing.com/search/2014/04/14/integrating-store-apps-with-cortana-in-windows-phone-8-1/

· Voice command element and attribute reference (Windows Phone Store apps using C#/VB/C++ and XAML) http://msdn.microsoft.com/en-us/library/dn630431.aspx