/mceSAPI


Speech API and Media Center Edition

http://www.brains-N-brawn.com/mceSAPI 5/23/2005 casey chesnut

Introduction

doing the /cfWorldWind article reminded me how much i like to play with new APIs (e.g. Direct3D). after that article was finished i brainstormed about what new topic to look at. then i remembered that i have been wanting to do Media Center Edition (MCE) development for some time. if you look back in the microsoft.public.windows.mediacenter newsgroup, then you'll see that i was one of the 1st people to ask for a mediacenter.developer newsgroup. of course, at that time, there wasnt even an SDK (lame). next i looked at the SDK when it 1st came out and it was entirely DHTML development. personally, i dislike the current state of the web, and dont consider DHTML to be "development". e.g. do you want to write a neural network in javascript? then when i saw an early version of MCE 2005 at last years MVP summit ... i got really excited. the reason i didnt jump on it though ... is that i dont watch TV (got rid of my TV about 5 years ago). of course MCE can handle other media types, but video is definitely the main draw. regardless, i decided to make the jump and give MCE dev a shot to see what it can do

now i just needed to pick an idea. already had like 5 to choose from, because i keep a running list of ideas to code. out of that list, i didnt have a preference of which to do 1st. that is, until Bill Ryan started talking publicly about the Atlanta code camp, in which he presented Speech Server and Michael Earls presented MCE. Bill even held his own feet to the fire by committing to do an MCE and Speech integration. well ... that made the decision for me. one of the ideas i had for MCE was to do some sort of speech integration. so i shot off an email to Bill to see what he was doing ... and make sure that we would be doing complementary work. ends up that our ideas were different, so i started to run with mine.

this article will show how to create a Voice User Interface (VUI) to be able to control your MCE with speech (instead of the remote). it will use the Microsoft (MS) Speech API (SAPI) to do both Speech Recognition (SR) and Text To Speech (TTS).

Installation

first problem is that i dont own a MCE PC. looked around at the stores to see about buying one. there were some 64 bit AMD machines on the shelves, but none of them were MCEs, and i'm partial to Intel. oh yeah, i dont buy desktops ... was only looking at notebooks. that left my current dev machine, a Toshiba P25-S509. when i bought this machine 1.5 years ago, they had a corresponding model (P25-S609) for about $500 more that was MCE. decided not to buy that one ... based on the not watching TV thing. hindsight being 20/20 ... i'm still not sure if i would have been better off with that machine. i've seen a couple newsgroup posts where other people have not successfully been able to upgrade that MCE 2004 machine to MCE 2005. regardless, i was just going to try and install the MCE 2005 OS onto my current devbox. i have the OEM installation disks and activation number from an MSDN universal subscription. luckily, i had also bought a $35 MCE remote from egghead.com about a year ago ... knowing that i would get to this sooner or later. started out trying to do an 'upgrade' install of the OS using the new activation code. this technique actually worked when i installed the Tablet OS on a different machine about 3 years ago, and the MCE install looked real similar. but it did not work. all i ended up with was a clean install of XP Pro SP2. did some more searching and found a post that said you had to do a 'clean' install to get the MCE bits installed. crap ... i'd already wasted a weekend doing that ... so decided to hold off. but when i went into work the next day, the clean install had hosed up my dev environment so much that IIS would not run. took off the rest of that day and backed everything up, paved the machine, and then reinstalled everything (feed the monkey). after everything was installed, i plugged in the remote and clicked the 'green button' ... and it came up! clicked around a little and got my 1st 10 foot experience. actually, i'm near sighted, so it was more like 6.
NOTE installing the MCE OS like this is not recommended. it was just the easiest way for me to get started

Media Center Edition

now it was time to test out the environment. the most obvious error was on startup of the eHome shell (ehshell.exe). ehshell is a managed application that runs full screen. its what you view for the 10 foot experience. from a quick scan of the newsgroup it seems to use Managed DirectX as its UI. when it started up, i would get an error dialog saying "your video card or drivers are not compatible with media center". NOTE this error is because i'm not running on proper MCE hardware.

found a post on this that explained a registry hack that would skip the DirectX check. ran that .reg file and the message box went away. that got rid of the symptom, but the underlying cause seems to be that my video card only has 32 megs of RAM. it looks like MCE has a hard requirement that it must have over 32 megs of video RAM. the ehshell.exe (same link as registry hack) can get around that, but it did not work for me due to .NET security. from reading the newsgroups, another common solution is to make sure you have the latest video card drivers installed, and those drivers might be specific to MCE. i tried the latest nVidia drivers that would install on my box, and had problems with my computer restarting randomly. i dont get random restarts when using the latest driver from Toshiba, but my box does periodically crash and freeze when i'm working with MCE. i.e. i have to restart at least once every development session. welcome to developing with an OS on incompatible hardware. the next problem was when i tried to play a DVD. the error was "files needed to display video are not installed or not working correctly. please restart media center and/or restart the computer". NOTE Robot Chicken! NOTE this error is usually because you dont have an MPEG2 decoder installed. in this case it is because i'm not running on proper hardware.

this problem was solved by installing a DVD decoding application. you can purchase a number of them for about $20. yes, i think its lame that the OS doesnt just have one with it. the real problem is that i hate having to install anything. that got rid of the error, but i still cant play DVDs or videos. actually, i can play DVDs and videos (outside of the MCE shell) using Windows Media Player 10 ... and they work great. but when i try to play them in the MCE shell, either i get a black screen with audio, or they dont play at all, or the screen will flash black and then display the title of the video and 'Finished', or it will lockup/restart my box. er, um ... even though those options show up in MCE ... i dont click them. i'll have to get a new devbox to do any applications that involve video. another problem i've got is with an nVidia settings page. it seems to have gotten a partial install by either the OS or the driver ... but some of the files did not make it. thats not a big deal. NOTE this error is because i'm not running on proper hardware

the final problem involves not being able to put MCE 2005 on a domain. this doesnt matter for most end users ... but it does for developers. the reason this limitation exists seems to be because Media Extenders require Fast User Switching. if you are on a domain, then you cant do FUS. even more, i had turned off FUS so that you have to login with CTRL-ALT-DEL instead of the welcome screen, because i periodically work in an environment that requires me to lock my screen every time i walk away from my desk. this limitation needs to be removed. let us (developers) put it on a domain, and we'll just have to pay the price for it to not work with the Media Extenders ... but it will be worth it. er, um ... what happened to security, security, security? most end user issues can be handled by a more descriptive installation procedure. if you really want your MCE 2005 on a domain, then you should install MCE 2004 1st, because it can be put on a domain. then you'll do an upgrade to MCE 2005 and it will remain on a domain.

that leaves my environment being able to work with My Pictures, My Music, Online Spotlight, and More Programs. DVDs and My Videos show up ... but i cant use them because of the crashing problems above. if video was working properly, then i would buy a USB/PCMCIA tv/fm tuner card ... and that would handle all of the hardware needs for My TV and FM Radio. but since video is not working, i'll just have to wait until some newer machines hit the market. come on 64 bit wintel MCE notebooks ... with xbox360 integration ...

MCE Development

now the OS is installed and at least partially working ... time to get to work. the first big let down i had was discovering that some of the development needed to be done in VS.NET 2002 for .NET 1.0. WTF! the OS is 2005, and VS.NET 1.1 has been out forever. Beta 2 for 2005 is even out now, and this expects me to go back 3 years. you've got to be kidding me [shaking my head in disgust]. so i decided to just use VS.NET 2003 to do the development and create build files to compile for .NET 1.0. NOTE after the fact, i would recommend installing VS 2002 to do MCE development. the problems that i've had so far have involved creating signed RCW COM wrappers for 1.0 and installation projects. the reason i'm not installing 2002 is that i've already got 2003 and 2005 B2 installed. i really dont want to have 2002, 2003, and 2005 on the same machine, plus i figure i'll have to reinstall 2003 and 2005 if i install 2002 now? anyway ... MCE dev needs to catch up ....hopefully with the xbox360 release!

the next pain point is with the development models, of which there are 3 provided by the MCE 2005 SDK ... but there really needs to be 4. the first dev model is Hosted HTML. Hosted HTML lets you create full screen UI views that can be controlled with the remote. this involves DHTML, Javascript, Behaviors, ActiveX, and the MCE object model. the HTML runs in MSHTML SP2 hosted within the ehshell and can either be run locally or externally (as for Online Spotlight). this model makes sense for external websites, but not for local apps. but its all we have for now, so i'll use it in some future articles.

the second model is AddIn assemblies that run in their own AppDomain within ehshell. these are class libraries that have to implement some lightweight interfaces and can either be long running background processes, or short running user executed apps. they are powerful dev environments and can access the MCE object model, but their main drawback is they can only expose small dialog boxes for UI. also, they are hard to debug

the third model is external applications that can subscribe to the state of MCE using the MediaState API. the API says this might be used by OEM hardware developers for secondary LCD displays and such. wish i could tell you more about this model, but i haven't been able to get this sample to run yet. well, it runs ... but doesnt work :( from the newsgroups, it looks like a number of developers are having problems with this sample.

its obvious ... we are missing a fourth model that should be for developing full screen windows applications. MS is doing this with Managed DirectX for the ehshell. they either need to expose this, or extend WindowsForms as a stop gap. ultimately, Avalon makes sense for filling in this void because of its ability to flow layout and scale the display ... but we need something yesterday. NOTE there are some 3rd party efforts that have started to do just that. a screenshot of a beta is below, but i dont think there is anything that is really usable yet :(

Media Center AddIn

for this application, i chose to use the second model : MediaCenter AddIns. the plan is to create a long running background process that is continually listening for voice commands. when a command is recognized, then it just calls the appropriate MCE object model. the limited UI isnt a problem here because this app is primarily about talking user input (as speech) and then controlling the existing MCE UI.

started out by creating Michael Creasy's HelloWorld AddIn that just starts up and displays a dialog box. it took a little bit to get the project setup, but was easy after that. the project took a while because i had to create the build scripts to compile for .NET 1.0. another requirement for these AddIns is that they have to be signed and in the GAC. so i had to create a key pair and get that in the build. also had to use gacutil.exe to get the assemblies installed. finally, there is a registration process that has to be done for MCE. this was done with an XML config file and a command line utility that executes off of it. that was tedious ... but not too bad ... and it was satisfying to see the dialog popup. the main problem i saw was how in the world do you debug this thing? its an assembly running in its own AppDomain of the ehshell process. you might be able to attach to the running process and step through your code ... but i didnt try it. anyway, seems far from an ideal debugging environment.
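the skeleton of a background AddIn looks something like this. NOTE this is a rough sketch from my reading of the MCE 2005 SDK; the class and namespace names are mine, and the method signatures are from memory, so treat them as approximate

```csharp
using System;
using System.Collections;
using Microsoft.MediaCenter;
using Microsoft.MediaCenter.AddIn;

namespace MceSapi
{
    // an AddIn implements 2 lightweight interfaces from the SDK
    public class HelloAddIn : IAddInModule, IAddInEntryPoint
    {
        public void Initialize(IDictionary appInfo, IDictionary entryPointInfo)
        {
            // called when ehshell loads the AddIn assembly
        }

        public void Uninitialize()
        {
            // called when the AddIn is being unloaded
        }

        public void Launch(AddInHost host)
        {
            // show a simple dialog through the MCE object model
            host.MediaCenterEnvironment.Dialog(
                "Hello World", "HelloAddIn", DialogButtons.Ok, 10, false);
            // NOTE for a background AddIn, the AddIn is unloaded
            // as soon as Launch returns
        }
    }
}
```

the assembly gets signed, GACed, and then registered with the shell via the XML config file mentioned above.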

Speech API

used the HelloWorld AddIn as the starting point for the Speech AddIn. the 1st thing to do was get Speech API 5.1 (SAPI) into the environment. SAPI is due for a refresh, because its still COM, and they've made a lot of advances with speech for the Speech SDK and Speech Server. started out using VS.NET 2003 to create the RCW to let me call COM from .NET, but then i realized that it was going to create an assembly that relied on .NET 1.1. knew that i was going to have to use tlbimp.exe from .NET 1.0. downloaded the SDK and got the following error upon install.

this really really PISSED me off. you can put checks like this in end user products, but dont do this to developers. the OS thinks that it has everything, but it doesnt. it has all the runtime bits, but none of the 1.0 SDK bits i need. to get around this i had to wait 5 hours while downloading 1.5 gigs of VS.NET 2002. didnt install it (yet), just unzipped it, and pulled out the 60K needed for tlbimp.exe 1.0. this allowed me to create the signed RCW wrapper of the Microsoft Speech Object Library. then i extended the scripts to also install this Interop.SpeechLib.dll into the GAC.
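for reference, generating the signed wrapper amounts to something like the following. NOTE the key file name is mine, and double check the sapi.dll path on your box

```shell
rem run the .NET 1.0 version of tlbimp.exe against the SAPI 5.1 type library,
rem signing the wrapper with the same key pair used for the AddIn assembly
tlbimp.exe "C:\Program Files\Common Files\Microsoft Shared\Speech\sapi.dll" /out:Interop.SpeechLib.dll /namespace:SpeechLib /keyfile:mceSapi.snk
```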

Speech Recognition

since this is only for educational purposes, i just did the most basic SAPI code possible. basically i reused the C# ListBox sample code that comes with the Speech SDK. it has a list of items that the user voice clicks by saying "select <item name>". 'select' is a keyword that tips off the recognizer that this is a valid command. it makes sense to have this same mechanism in place for the MCE so that you dont accidentally delete content through normal conversation. the keyword i chose was the acronym 'M C' for Media Center. ... all 'MC Hammer' jokes will be ignored :) now the user will have to speak 'M C <command>'. NOTE i'm using a Labtec directional USB desktop microphone. also, its running on a clean install of the OS so i have not trained it for my voice. this is mostly fine for command and control scenarios, but would not be acceptable for dictation

getting that code to work in the MCE environment was pretty easy. the AddIn interface has a Launch() command that is called by the shell. since this is registered as a background AddIn, the assembly's Launch method gets called immediately after the shell starts. the Launch method just creates the recognizer and the grammar, hooks the recognition event, and then enters an infinite loop to keep from exiting the method. when the Launch method ends, the assembly is unloaded and the AddIn will no longer function. the only trick with the loop was to have it sleep for a little bit and then call Application.DoEvents() to kick the message pump for the recognition event to occur.
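the Launch plumbing boils down to something like this. NOTE the SpeechLib names come from the tlbimp'd SAPI 5.1 type library; the grammar file path is hypothetical

```csharp
using System;
using System.Threading;
using System.Windows.Forms;
using SpeechLib; // the signed RCW created with tlbimp.exe 1.0

public class SpeechListener
{
    private SpSharedRecoContext recoContext;
    private ISpeechRecoGrammar grammar;

    // called from the AddIn's Launch method
    public void Run()
    {
        // shared recognizer, so it plays nice with other SAPI apps
        recoContext = new SpSharedRecoContext();
        recoContext.Recognition +=
            new _ISpeechRecoContextEvents_RecognitionEventHandler(OnRecognition);

        // load the static command and control grammar and activate its rule
        grammar = recoContext.CreateGrammar(0);
        grammar.CmdLoadFromFile(@"C:\mceSapi\grammar.xml",
            SpeechLoadOption.SLODynamic);
        grammar.CmdSetRuleIdState(0, SpeechRuleState.SGDSActive);

        // keep Launch from returning; DoEvents kicks the message pump
        // so the COM Recognition event can fire
        while (true)
        {
            Thread.Sleep(100);
            Application.DoEvents();
        }
    }

    private void OnRecognition(int streamNumber, object streamPosition,
        SpeechRecognitionType recognitionType, ISpeechRecoResult result)
    {
        // e.g. "M C my music"
        string phrase = result.PhraseInfo.GetText(0, -1, true);
    }
}
```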

the recognition delegate is where the work actually occurs. it determines which command was spoken and then has a huge switch statement (hack) to determine what to execute. the effect might be as simple as displaying a dialog box, navigating to a new shell page, or sending some key command. the grammar is entirely static and just listens for 85 different commands, and maps those to about 75 different actions. it would be pretty simple to extend this to support dynamic grammars based on context. e.g. right now i can tell it to 'start a slideshow'. but what it should really do is know what folders of images are available and use those to generate a dynamic grammar. then i could tell it to 'start a slideshow with <some folder>'. the one thing i did do to make it somewhat powerful was to give it the ability to listen for voice commands corresponding to the remote control key presses (see Charlie Kindel's open source MCE Controller). this allows the user to speak the remote control commands that they would have pressed. that gets it at least on par with the remote, but is far from optimal. anyway, i hope you still think its cool? the list of recognized commands follows
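the dispatch hack looks roughly like this. NOTE the page names and key mappings shown here are illustrative guesses, not the actual table from the AddIn

```csharp
using Microsoft.MediaCenter;
using Microsoft.MediaCenter.AddIn;

public class Dispatcher
{
    // hypothetical dispatch for a few of the ~85 recognized phrases;
    // the real AddIn maps them to ~75 actions
    public void Dispatch(string phrase, AddInHost host)
    {
        switch (phrase)
        {
            case "M C my music":
                // navigation commands go through the MCE object model
                host.MediaCenterEnvironment.NavigateToPage(PageId.MyMusic, null);
                break;
            case "M C play":
                // remote control commands are faked as key presses
                // (the key mapping here is a guess)
                System.Windows.Forms.SendKeys.SendWait("{F7}");
                break;
            default:
                // ... remaining commands elided
                break;
        }
    }
}
```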

Basic

Listen On, Listen Off, Speech On, Speech Off, What Can I Say?, Mute, Unmute, Mute On, Mute Off, Volume, Volume Up, Volume Down, Postal Code, Version, Parental Controls, Skip Forward, Skip Back, Skip Previous, Full Screen

Navigation

FM Radio, Internet Radio, Live TV, More Programs, Music Albums, Music Artists, Music Songs, My Music, My Pictures, My TV, My Videos, Recorded TV, Recorder Storage Settings, Scheduled TV Recordings, Slideshow, Start, TV Guide, Visualizations, Numbers (0-9)

Remote Commands

Alt Tab, Back, Channel Down, Channel Up, Close, Closed Caption, Delete, Down, DVD Audio, DVD Menu, DVD Subtitle, End, Enter, Escape, Forward, Green Button, Guide, Help, Home, Insert, Left, More Info, Next, OK, Page Down, Page Up, Pause, Play, Print, Prior, Record, Rewind, Right, Select, Snapshot, Stop, Tab, Up, Yes, Zoom

Text To Speech

speech reco allows it to listen for commands and take action, but i also wanted to make it respond. the first idea was to just display a DialogBox. the MCE API actually has 3 different dialogs. the most unobtrusive is the DialogNotification. chose to use this one with very small text messages to show the user the result of speech recognition for certain commands. this allows a user to see what happened in case speech reco gets a false match. the problem with the DialogNotification is that it has a parameter for an image path to display. tried passing null at 1st, assuming that it would just display some default image, but it exceptioned. NOTE the pic below shows the exception thrown from DialogNotification displayed in a regular Dialog, which will accept a null image path

ended up that i had to pass it an image path. the really ugly thing is that i could not pass it a stream from an embedded resource. my hack to get around this was to embed the resource image, then on init, pull out that stream and serialize to a temporary file on disk. pass that path to the DialogNotification during runtime, and then delete the temp file on exit. ugly, but effective.
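the hack in code looks like this. NOTE the resource and file names are made up for the example

```csharp
using System;
using System.IO;
using System.Reflection;

public class NotificationImage
{
    // pull the embedded image out of the assembly and serialize it to a
    // temp file, since DialogNotification only accepts a file path
    public static string Extract()
    {
        Assembly asm = Assembly.GetExecutingAssembly();
        string tempPath = Path.Combine(Path.GetTempPath(), "mceSapi.gif");
        using (Stream s = asm.GetManifestResourceStream("MceSapiAddIn.mceSapi.gif"))
        using (FileStream fs = new FileStream(tempPath, FileMode.Create))
        {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = s.Read(buffer, 0, buffer.Length)) > 0)
                fs.Write(buffer, 0, read);
        }
        // pass tempPath to DialogNotification, delete the file on exit
        return tempPath;
    }
}
```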

then it hit me that the user could use speech recognition to control their music without even having their TV on at all. so i decided to add TTS to give feedback to the user as they controlled the MCE through their speakers. now the AddIn optionally speaks back what control was just executed. NOTE you can go into the Speech Control Panel and change what voice it uses. also, there are some 3rd party voices you can purchase that sound much better than what is included with XP ... did i mention that SAPI really needs to be refreshed? to take this further i would want to tie this into the MediaState hooks. then the AddIn would get notifications such as a new track being played, and could speak the artist name and song title before playing the song. the problem is that i have not been able to get the MediaState stuff to work yet ... if you're reading this, and have gotten it to work ... a little help, please?
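the TTS side is just a couple lines of SpeechLib. speaking async keeps the prompt from blocking the recognition loop

```csharp
using SpeechLib;

public class Speaker
{
    private SpVoice voice = new SpVoice();

    // speak feedback through whatever voice is selected in the
    // Speech Control Panel
    public void Speak(string text)
    {
        voice.Speak(text, SpeechVoiceSpeakFlags.SVSFlagsAsync);
    }
}
```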

Video

here is a quick little demo showing it being used

Conclusion

this showed how to add basic speech recognition and text to speech capabilities to MCE. it could be extended to create an incredible user experience. just from working with this, i'm certain that speech should become a core part of MCE ... with developer hooks! especially when thinking about the home automation that people are doing. you could put omnidirectional mics around your house, and operate your lights, security, etc... from anywhere (without a keypad or remote). and speaker identification could be an additional check ... plus personalization. e.g. if i just said 'play music', then it would recognize my voice and play what i liked to listen to; but it could also learn the TS's voice and play what she likes. another option might involve the remotes getting wireless mics built into them. and possibly a tap-to-talk button. instead of keying the channel in, you would just hold the 'tap and talk' button and speak 'channel 69'. this could work on media extenders too, since xbox has multiple input devices that accept audio for trash talking capabilities. the TTS could be useful elsewhere too. it could be a personal assistant that read you the news headlines or emails in the morning. now if they just made a synthesized voice that sounded like HAL 9000 ... "I'm sorry Casey, I'm afraid I can't open the garage bay door"

this was my 1st foray into MCE development. started just over 1 week ago. took a couple days to get installed (properly), couple days to play with the OS, couple days to do the hello world samples and read over the SDK, and less than 1 day to implement this AddIn. plus another day to write the article

as for my overall MCE developer experience ... i like it, and am going to continue playing with it. the initial cost of entry is too high for most developers, but is not insurmountable. the MCE team needs to follow in the Tablet team's footsteps to make the platform more approachable. on a Tablet, you used to have to install the OS to get full recognition. now all you have to do is install the SDK, the recognizer pack, and with a $100 digitizer you get almost the full environment. the MCE team should try to do the same to make it so developers can just install the shell, the SDK, and get a USB tuner card to begin development. the developer community seems to be small but highly capable. the MCE team should also emulate the Tablet team in this regard and have some developer competitions. the SDK is in much better shape than when i looked at it last time, and should only get better. as for the platform, i think its phenomenal. just from a week, i already think it is inevitable that MS will own the living room. and i've got visions of how it will change your overall home computing experience. as soon as a 64 bit wintel MCE notebook hits the market, i'm upgrading. and my recommendation for anybody else that is buying a new computer for home use is to go with MCE. very cool stuff ...

NOTE i was originally really excited about this application. but then i saw that there is already a commercial ($150) version called MCE Communicator. it does some pretty cool things such as Voice Over IP and has dynamic grammars. but it doesnt have developer hooks, so this article still has value to let other developers add speech to their AddIns. i haven't actually tried MCE Communicator, but i did watch some of the demo videos on their website. anyway, i found out about it when listening to The Media Center Show. its a podcast dedicated to MCE ... the podcast is highly recommended

Source

the full source is included (VS.NET 2003 project) along with my build scripts. i would like to create an installer, but that would entail having to install VS.NET 2002 ... which i'm not going to do ... out of principle :) the script deploy.bat will build, register, and install the assemblies to the GAC. then you just have to start MCE and it should display a dialog saying "MCE SAPI" as well as speaking that out loud. NOTE you might have to install the Speech SDK 5.1 1st?
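the deploy.bat steps amount to something like the following. NOTE the config file name is a placeholder, and check the RegisterMceApp.exe switches on your box, this is from memory

```shell
rem compile against .NET 1.0, then install the assemblies to the GAC
gacutil /i Interop.SpeechLib.dll
gacutil /i MceSapiAddIn.dll
rem register the AddIn with the shell using the XML config file
RegisterMceApp.exe Registration.xml
```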

/mceSapiSource.zip (MCE 2005)
/mceSapiMcplSource.zip (Vista MCE)

if you dont have the build environment, then the build assemblies are also included. so you will just have to run the commands in the last section of the build script that adds the assemblies (Interop.SpeechLib.dll and MceSapiAddIn.dll) to the GAC and run RegisterMceApp.exe. NOTE this article was never intended to be for end users. this is for developers that want to learn how to add speech to their AddIns, as well as a proof of concept for where i would like to see the platform head.

Updates

none planned

Future

a lot of MCE ideas ... later