/mceSALT


Media Center Edition and Speech Application Language Tags

http://www.brains-N-brawn.com/mceSALT 6/6/2005 casey chesnut

Introduction

this is my second article on integrating Speech with Media Center Edition (MCE). the 1st article (/mceSAPI) used the Speech API (SAPI) in a Media Center Add-In to control the MCE Shell with Speech (NOTE my 2nd MCE article was /mceState). /mceSAPI worked great, and people could use it to write their own speech controlled Add-In. in reality, i think that functionality should just get baked directly into the MCE OS, as well as pushing microphones out to the MCE remotes. the problem is that it can only send the basic remote commands to 3rd party applications, particularly Hosted HTML apps.

so this article will show how to add Speech control to Hosted HTML applications in MCE. since these apps are HTML, instead of using SAPI, we will use the Speech Application SDK (SASDK) to add SALT to the rendered HTML.

Hosted HTML

this is the development model for MCE that allows you to do a full screen UI. the MCE 2005 SDK comes with a number of sample HTML pages that use a combination of CSS, Javascript, Behaviors, and an ActiveX control for text entry. these pages can be run off the file system (e.g. the SDK sample pages), served by IIS running on the local MCE machine, or served from an external server (e.g. online spotlight). actually, i dont really consider the 2nd option (IIS running on the local machine) an option, because it requires your users to go to the extra effort of installing IIS on their MCE. i have my issues with the 1st option too (Hosted HTML pages on the local file system), because it forces you into a web based paradigm without providing any of its advantages (such as reach and maintenance). MCE sorely needs a thick client development paradigm. hasn't MS been pushing smart clients for the last # years? ... smart clients would make perfect sense for MCE. crap, i dont really like the 3rd option either (Hosted HTML from an external server), because the scaling of the web page elements is handled by CSS. what the MCE UI really requires is a web based vector graphics format ... cough ... SVG (Scalable Vector Graphics). if only IE7 were to support it ...

that's enough bitching (for now), time to make something work. this article will serve the pages from IIS. during development, the IIS server will run locally on my MCE dev machine; afterwards, i will push the pages out to my offsite server where this article is also hosted. i will be using ASP.NET 1.1 to serve them. although the sample pages are static HTML, i will move those static elements into ASP.NET web forms, which is necessary for the SASDK to render the SALT accordingly.

Speech Application SDK

the SASDK is for creating voice-only and multimodal applications using web technologies (previous dev articles are /noHands and /noHands2). at its lowest level it renders SALT elements that a voice browser can use to do both Speech Recognition (SR) and Text To Speech (TTS). voice-only applications are controlled entirely by your voice, e.g. calling an automated bank teller to check your account. multimodal applications are more closely related to the web applications most of us develop. they are multimodal because they can be controlled by your mouse and keyboard, along with your voice. plus the user can either read the output on the screen, or the app can also provide audible prompts. multimodal applications have yet to catch on in the standard web world, but they make perfect sense for kids and the disabled. they also make sense on devices where you dont have a keyboard and mouse in front of you, e.g. Tablet PCs, Pocket PCs, and now MCE.

the SASDK 1.1 was just released. its main addition seems to be the ability to work with the Speech Server language packs (North American Spanish and Canadian French) ... which is HUGE if you ask me. NOTE Speech Server is for developing voice-only apps and multimodal apps for PPCs (although you cant use the language packs with multimodal PPC apps). NOTE /speechMulti is a Speech Server multimodal PPC dev article. for this application, Speech Server is not required; we only need the SASDK. it is worth noting that the 1.1 ASP.NET speech controls will only install on Windows Server 2003. this doesn't really make sense to me, because i think all they have to do is render SALT? i cant imagine they require some advanced functionality that is built into Server 2003 but not into Windows 2000. because my home server is still Win2K (lame), i'll be using the Speech Application SDK 1.0 for this article.

so my dev environment has the SASDK installed, the server must have the ASP.NET speech controls, and the client must have the Speech Add-in for IE installed. the Speech Add-in for IE is installed with the SASDK, but if you try to run this on another MCE machine, then that machine must have the IE Speech Add-in installed for speech to work. if it doesn't, the page will still work normally with a remote ... speech just wont be enabled. the bad news is that the Speech Add-in is only available with the SASDK. stupid! stupid! i say stupid twice because we had this same issue with SASDK 1.0. now that SASDK 1.1 is out, we still dont have a separate MSDN download to get the 65 meg IE add-in. did i say stupid? fuck it, i'm full out pissed. if MS wants speech to take off at all, they need to at least help distribute the bits for it to run on people's machines. at a minimum, allow people to install it without having to download an SDK. better yet, push it out as an option through WindowsUpdate. and freaking bake it into IE7!

finally, i mentioned that i think the MCE Hosted HTML dev model should end up being SVG for web based applications. i believe there is a specification for making SALT work with SVG, so i would like the SASDK to support that as well. like anybody listens to me ... oh well

Speech Recognition

this sample app involves just 3 pages. the 1st page is a menu which links to the 2 other pages. one of those pages does SR and the other does TTS; the menu page ends up doing both SR and TTS. the first task was to get the basic remote commands working. looking at BasicFunctions.js shows that the script explicitly handles Up, Down, Left, Right, Page up, Page down, and Back in onRemoteEvent(). since a voice command is effectively the same as a remote button press, i created a corresponding function called onSpeechEvent() in a new script called BasicSpeech.js. if one of the commands is recognized, then it fires the associated remote command with onRemoteEvent(#). the only one that works differently is the Back command, which is handled by doing history.back().

function onSpeechEvent(smlNode)
{
	try
	{
		// give the page a chance to handle the recognition first
		if (doOnSpeech(smlNode) == true)
		{
			return true;
		}
	}
	catch(e)
	{
		// if the doOnSpeech function is not present on the page, ignore the error
	}
	try
	{
		// this switch tests which remote button the voice command maps to
		switch (smlNode.text)
		{
			case "Up": //38 Up button selected
				onRemoteEvent(38);
				break;
			case "Down": //40 Down button selected
				onRemoteEvent(40);
				break;
			case "Left": //37 Left button selected
				onRemoteEvent(37);
				break;
			case "Right": //39 Right button selected
				onRemoteEvent(39);
				break;
			case "Enter": //13 Enter button selected, execute link to content/page
				onRemoteEvent(13);
				break;
			case "Back": //166 Back button selected; Media Center will already perform a Back
				onRemoteEvent(166);
				history.back(); //to send the actual back command
				break;
			case "Page up": //33 Page up (plus) selected; page-up scrolling menu
				onRemoteEvent(33);
				break;
			case "Page down": //34 Page down (minus) selected; page-down scrolling menu
				onRemoteEvent(34);
				break;
			default:
				// ignore all other voice commands
				return false;
		}
	}
	catch(ex)
	{
		//ignore error
	}
	return true;
}

the other thing to notice is the doOnSpeech() method call; onRemoteEvent() has the same logic. this allows the page to handle the event before it even gets to one of the Basic scripts. for speech, it lets a page handle recognition events that are specific to that page, while every page still listens for the BasicSpeech commands. the SR page is dead simple and only listens for the user to speak Yes or No.

function doOnSpeech(smlNode)
{
	try
	{
		// this switch tests to see which voice command was spoken
		switch (smlNode.text)
		{
			case "yes":
				btnYes.focus();
				btnYes.click();
				return true; //to keep onSpeechEvent from moving focus
			case "no":
				btnNo.focus();
				btnNo.click();
				return true; //to keep onSpeechEvent from moving focus
			default:
				// ignore all other voice commands
				return false;
		}
	}
	catch(ex)
	{
		//ignore error
	}
	return true;
}

both of the grammars (yesNo and basicSpeech) are tied to the page with a Listen control. on recognition for either grammar, the control calls the client side method onSpeechEvent().

the next task was to provide some UI element to let the user know the page is speech enabled. this is typically done with a microphone icon, so i cooked up a quick one that looks similar to the arrow images provided with the SDK.

as per standard web apps, i made this element clickable. this enables the usage scenario called 'tap and talk' from the 2 foot experience. tap and talk is when the user clicks some element on the page to let the voice browser know it is time to start listening. so i made the element work the same: when you click on it, it calls Listen1.Start() and the page starts listening for one of the commands in the grammar (a minimal sketch of that click handler follows the next listing). this works great when you are using a mouse, but not when you have the remote, because you would constantly have to refocus on that element. instead, i think the remote should have a 'tap and talk' button. then all the user has to do is press the 'tap and talk' button, speak their command, and it is executed. i.e. imagine a remote that only had one button on it. you would just tap the button, say 'channel 24' or 'volume up', or whatever, and the command would be executed. ultimately there will be no button, and it will always be listening, per some standard voice ui specification. since we're not there yet, i just tied it to the 'clear' button on my current MCE remote. when that button is pressed, the page starts listening. i did this by implementing doOnFocus() for the page, which is called by onRemoteEvent()

function doOnFocus(keyChar)
{
	try
	{
		// this switch tests to see which button on the remote is pressed
		switch (keyChar)
		{
			case 27: // Clear
				Listen1.Start();
				return false; //to allow onRemoteEvent to move focus
			default:
				// ignore all other clicks
				return false;
		}
	}
	catch(ex)
	{
		//ignore error
	}
	return true;
}
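
for reference, here is a minimal sketch of the microphone icon's click handler mentioned above. the micIcon id and handler name are just illustrative (they are not from the SDK samples); Listen1 is the same Listen control the page already has.

function micIcon_onclick()
{
	// an <img id="micIcon" onclick="micIcon_onclick()"> would wire this up;
	// clicking has the same effect as the Clear button: start listening
	try
	{
		Listen1.Start();
	}
	catch(ex)
	{
		// ignore if the Speech Add-in is not installed
	}
}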

with that wiring, the user browses to the page on MCE, hits the clear button to tell it to start listening, and speaks either 'yes' or 'no'. the doSelect() method is called and whatever logic is tied to it is executed. then the user can hit the clear button again and say 'back', which exits the page and returns to the previous one. you could make this much more complicated with dynamic grammars that allow the user to enter a date, a large number, etc... which would be much easier with their voice than with the remote, especially compared to the TripleTap control. NOTE TripleTap is not the same as tap and talk. this provides for a great user experience, and the page even displays an audio meter to let the user know that speech recognition is being performed.

finally, i added a DialogEx to the page. the problem is that the API does not provide any hooks to programmatically work with the Dialog after it's been opened, meaning your only option is the remote control. it would be great if opening a Dialog returned a reference to that object that we could programmatically close. then we could speak Yes, No, Ok, or Cancel, catch that reco event, and call the appropriate method on the Dialog reference. right now the behavior differs depending on whether the Dialog is modal or modeless. for the modal Dialog, it just sits there; if you click the 'clear' button to activate the speech listening, nothing happens. for the modeless Dialog, when you hit clear, the Dialog is dismissed. er, um ... so dont use Dialogs to get input from the user. of course you can still use them to display output, as long as they timeout and go away on their own. maybe the Speech Add-in or the MCE Shell should handle the speech dialogs instead?

Text To Speech

the next part was to do TTS ... which is easy. all i did was add a Label to the form that has its Text set to DateTime.Now.ToString() as soon as the page loads. then i added a Prompt speech server control to the page. finally, pageLoadFunctions() calls Prompt1.Start(Label1.innerText); which speaks the current DateTime from my server. of course this could be extended to support more interesting speech, as well as prerecorded voice talent. it makes sense to add speech to your MCE web pages as our population ages and eyes arent what they used to be. you can see that the pic below also displays the microphone icon, which allows the user to hit the tap and talk button and say 'Back' to return to the Menu they came from.
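
as a sketch, the client side of that wiring could look something like the following, assuming the pageLoadFunctions() hook mentioned above and the Prompt1/Label1 controls on the page (the exact hook name on your page may differ).

function pageLoadFunctions()
{
	try
	{
		// speak the timestamp the server rendered into Label1
		Prompt1.Start(Label1.innerText);
	}
	catch(ex)
	{
		// no Speech Add-in installed ... the page still renders normally
	}
}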

now that we have both SR and TTS, i added the same functionality to the Menu page, so you can say 'Speech Recognition' or 'Text to Speech' to go to those pages, or 'brains-N-brawn' to go to my web site.
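
the menu page's doOnSpeech() hook ends up being just a navigation switch, something like the sketch below. the page names sr.aspx and tts.aspx are hypothetical; the recognized phrases are the ones mentioned above. window.navigate is IE specific, which is fine since the MCE shell browser is IE based.

function doOnSpeech(smlNode)
{
	// this switch tests to see which voice command was spoken
	switch (smlNode.text)
	{
		case "Speech Recognition":
			window.navigate("sr.aspx"); //hypothetical page name
			return true;
		case "Text to Speech":
			window.navigate("tts.aspx"); //hypothetical page name
			return true;
		case "brains-N-brawn":
			window.navigate("http://www.brains-N-brawn.com");
			return true;
		default:
			return false; //let BasicSpeech handle everything else
	}
}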

brains-N-brawn.com

so my web site is already pretty sick. mainly it can be viewed from a standard desktop PC. it also has the ability to render differently for both a Pocket PC and a Smartphone. the blog can also be listened to over a standard telephone line using VoiceXML, or viewed from a WAP browser. not that anybody does that ... but they can. plus comments can be inked on from a Tablet PC. it's only natural that i would customize it for MCE as well.

first, you have to recognize that the OS is MCE. second, you have to recognize that the user is browsing to your site from within the MCE shell. the distinction matters because they can also browse to your site from an MCE machine outside of the shell ... and then you want to render normally for the 2 foot experience.

string userAgent = Request.ServerVariables["HTTP_USER_AGENT"].ToLower();
//'media center pc' identifies the OS as MCE
if(userAgent.IndexOf("media center pc") != -1)
{
	//now we know it's a media center pc, but we only want to redirect if it's in the shell
	//'mediacenter' (no space) is only present when browsing from within the shell
	if(userAgent.IndexOf("mediacenter") != -1)
	{
		//now we know it's the shell
		Server.Transfer(mediaPage, true);
	}
}

the next part is to get the app registered in MCE More Programs. NewsMain.htm has the code to do this, along with grabbing the user's PostalCode for MCE 2005 users. the application and entry point registered fine, but the call to determine if it is already registered was returning false for me? the logic in the page is to not display the register button if the app is already registered, but mine was always displayed. yet if i tried to re-register the app, i got a dialog error about it already being registered. the problem was that i was running my app from a .MCL file, and not from 'More Programs'. per Michael Creasy (i recommend his blog), "Only a registered app can tell if an entrypoint is registered, and only an entrypoint it owns. This is to prevent Company A seeing if Company B's software is installed on your system". so when i ran my program from 'More Programs', it worked as expected ... and the 'Register' button was not displayed

so when you visit my website from MCE, it starts out with a main menu. right now it just allows you to view the blog or register; both commands can be spoken. ultimately i will try to add more compelling content. if you choose to register, it pops up a dialog which you have to control with your remote (speech wont work). if you choose blogs, it displays a list of my latest blog entries. you can speak 'title #' to read a specific blog entry, or say 'page up / down' to scroll through the titles. of course you can say 'back' as well. when you are reading a post you can say 'page up / down' to scroll the text. i should work TTS into this as well, so it can just read the blog entry to you.
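
here is a sketch of how the 'title #' recognition might be handled on the blog list page. the assumption that the title links have ids like title1, title2, etc... is illustrative, not the actual markup.

function doOnSpeech(smlNode)
{
	var text = smlNode.text;
	// the grammar returns phrases like 'title 3'; map that to an
	// anchor with an assumed id of title3
	if (text.indexOf("title ") == 0)
	{
		var num = parseInt(text.substring(6), 10); //1-based title number
		var link = document.getElementById("title" + num);
		if (link != null)
		{
			link.focus();
			link.click();
			return true; //handled; keep onSpeechEvent from moving focus
		}
	}
	return false; //let BasicSpeech handle 'back' and 'page up / down'
}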

Stand Alone HTML

so far we have been talking about Hosted HTML applications being served by IIS. if you want to actually install your application on a user's MCE (without IIS) and just run the Hosted HTML pages from their file system, then the above will not work, because the SASDK relies on the ASP.NET speech server controls. for the Pocket PC, there is a command line utility called PIECC that you can run against your pages to pregenerate the SALT into static HTML pages. you can only use this if you have a very basic speech app that uses inline grammars. anyway, i have used it successfully on a PPC, but have not tried it on the desktop ... it's something to be aware of.

alternatively, you can use the generic commands of /mceSAPI to interact with your Hosted HTML apps

Extenders

i dont actually have an extender to experiment with yet, but i'm aware of them. the Hosted HTML apps demonstrated here should work fine on them, because if a browser does not recognize SALT then it is just ignored. a user should be able to control these apps normally with their remote. ultimately i would like to see the extenders support speech input as well. hopefully they already planned for this in the xbox 360 ... or at least have the capability to add it later.

Video

here is a video of the speech enabled pages in action. NOTE it's a zip (of a .wmv) that you will have to download to play

Conclusion

that shows how to speech enable your Hosted HTML applications using the Speech Application SDK. in general, i was able to accomplish just about everything i wanted. the sample javascript files were set up in a manner that makes them easy to extend from a page, and i stole that model for the BasicSpeech script. the only real issue is not being able to interact with the Dialogs through speech, but that can be avoided as you build the app. there is also the problem of adding speech to Hosted HTML apps run off the file system, but i consider that more of a limitation of the SDK not providing a true windows application dev model. my guess is that will ultimately be replaced with the use of XAML and System.Speech, currently in Release Candidate for Beta ... dont ask ... i think they mean Alpha? finally, the speech guys really need to start promoting their bits. pushing out the current Speech Add-in for IE, or getting it integrated into IE7, would be the ultimate. i studied this problem for about a week, mainly beefing up my DHTML knowledge, and it took 2 days to code

now we can provide multimodal apps to our MCE users that let them use both their remote control and speech. this provides for a better user experience because they can use whichever is appropriate, speech being particularly useful to jump directly to an item, instead of having to cycle through a bunch of peer items to get to the correct one. this is really similar to a previous article (/tabletWeb) that shows how to write web apps that can use both the pen and speech on a Tablet PC. hopefully, we will start seeing some of the 'online spotlight' portals use speech as a way to differentiate themselves

Source

the source is provided below. remember that i developed this in SASDK 1.0, but my guess is that it should work in SASDK 1.1 as well.

if you just want to try running the application on your MCE device, it is hosted on my server. you will have to download the SASDK 1.0, but only install the Speech Add-in for IE. the Speech Add-in for IE 1.1 component might work as well ... but i haven't tried it. then you just have to run one of the MCL files below. NOTE if you have MCE but dont want to install the Speech Add-in, you should still be able to browse these pages from MCE, and the speech stuff will just be ignored

Updates

none regarding speech, but i do want to add more content to the brains-N-brawn MCE interface

Future

more experimental MCE stuff ... without speech. later