Multimodal Speech on a Pocket PC

http://www.brains-N-brawn.com/speechMulti 8/1/2004 casey chesnut


this article is an overview of speech-enabled web apps on a Pocket PC. the components to be discussed include Speech Server 2004, the Pocket IE Speech add-in, and a quick walk-through of developing with the Speech SDK. my main purpose for writing this was to gain some familiarity with Speech Server 2004 and the Pocket IE Speech add-in. this will also give a little info on how to use Speech Server to do Text-To-Speech (TTS) and Speech Recognition (SR) in Compact Framework apps

Speech Server 2004

... is exactly that. it is a SALT compatible server that is capable of doing SR (Speech Recognition) and TTS (Text-To-Speech). it is used in scenarios where a device is not powerful enough to do SR or TTS on its own. the 2 most prominent scenarios are 1) voice-only telephone apps, and 2) multimodal pocket pc apps.

1) you are probably already familiar with voice-only telephone apps, such as when you call up your automated bank teller to check your account balance. user input can be done through your voice, or by using the keypad (DTMF) on the phone. output is either a pre-recorded voice, or a computer synthesized voice. with these Speech technologies you can write really annoying phone apps as well.

2) multimodal apps are new. currently, they are web apps that are also speech-enabled. imagine the bank scenario again, but this time you are paying bills online. you have your bills in a stack in front of you, and the web page is listing your payees. the non speech-enabled scenario involves you using your mouse to click the payee, picking up the bill to see the amount, using the keyboard to enter the amount to be paid ... then repeating. a multimodal app would involve you speaking the name of the payee, and then speaking the amount to be paid ... with you never having to touch the mouse or keyboard, or put the bills down. so you can see the benefit there. but it becomes even more beneficial when introducing mobile devices that do not have great user input capabilities, such as the pocket pc and slate tablet pcs

to support voice-only telephony apps, the Speech Server also requires an additional hardware board (~$1000). this hardware is used to handle the call processing of incoming and outgoing calls. brains-N-brawn.com does not need its own call center (nor do i have $1000 lying around) (nor do i have a phone line), so i won't be going into this further. if you really want to play with this stuff, the SDK provides a Telephone Application Simulator (pictured below) that you can test with. another accessible alternative is to use a hosted provider. in this case, somebody else would provide the speech server setup, and all you would have to do is create (and possibly host) the speech-enabled web application. then you can call into their speech server and access your voice application. NOTE i've had a similar setup with VoiceXml (competitor to SALT), and i must say it is a lot of fun to call up and have a conversation with an application you have written. e.g. it is fun to say 'Hello' back to your own 'Hello World' app. but the real question is if you really need to say 'Goodbye'? NOTE i believe it is a lot cheaper to set up your own call center with the SALT offerings than it is with the VoiceXml route

a multimodal device is different. from the graphic above, the Pocket PC user browses to the web page 1st. the web server returns a web page with SALT. at this point, the Speech Add-in for Pocket Internet Explorer kicks in (it has to be installed on the Pocket PC), and initiates a session with the Speech Server by using SOAP commands. the web page could require voice input from the user, or need to do TTS output. for input, the user would hit a hardware button to start recording their answer through the PPC microphone. the PIE Speech add-in does some processing of the voice input, including compression, and then transmits that over TCP to the Speech Server. the Speech Server then processes the voice according to a grammar and returns the result as XML. the Speech add-in then displays that result in the web page, and possibly posts back to the web server. if the web page needed to do TTS output, then the Speech add-in would send the text to the Speech Server. Speech Server would do TTS and then return the synthesized voice over TCP to be played back on the PPC's speaker. the experience for the user is that they think they are only connected to the web server, but behind the scenes the Speech add-in is connecting to the Speech Server for additional processing. NOTE that the Speech add-in connects to the Speech Server through SOAP commands, as well as directly on a TCP port. also NOTE that the Speech Server must have a 'trusted' connection to the Web Server.

speech server is not necessary in all scenarios. tablet pc and desktop computers are powerful enough to do SR and TTS locally. they do not need to offload the processing to a server. in this case, they only need to access the web server that is hosting the speech application. but they do need an add-in to make their browsers speech-capable. the add-in for IE is called the 'Speech Add-in for Microsoft Internet Explorer'. sadly, the IE Speech add-in is only available through the 200 meg Speech SDK download. this really sucks. MS needs to push the Speech add-in out to clients, and then i would be compelled to develop speech-enabled web sites. at least make it a direct download (minus the SDK), or available through Windows Update, or a Service Pack. maybe it will happen since IE has been getting beat up lately

below are some screen shots of the main settings tabs for Speech Server. the 1st screen will let you change the default voice from a female voice to a male voice. the 2nd screen shows the TCP port over which the PIE Speech add-in can call the Speech Server. and the 3rd tab is for logging.

the following control panel is for the Speechify TTS synthesis engines. it shows the 2 voices that come with Speech Server Standard Edition. there is also an XML file that can be modified to tweak the voice synthesis

Speech SDK

the 2nd main component is the Speech SDK. the Speech SDK is used to develop speech applications that use SALT. SALT is the XML standard for creating voice applications (its competing standard being VoiceXml). the Speech SDK installs into Visual Studio .NET, adding project types, controls, web scripts, designers, samples, debugging tools, and documentation. the developer skill set for voice applications is rather broad. you need to be familiar with ASP.NET, XML, SALT, XPATH, client-side scripting, grammars, prompt databases, and VUI (voice UI) design. i've already written a couple of developer articles ( /noHands) about the Speech SDK (when it was called Speech .NET), to create a web site that you could operate entirely with your voice. NOTE that it was written during an early beta ... and there have been breaking changes since then. since my interest is really in what changes with the addition of Speech Server, i will only touch briefly on the Speech SDK later.

NOTE the Speech Server documentation is not much help when it comes to the Pocket IE Speech add-in, even though it includes the install. the docs for the Speech SDK have much more info about Pocket IE. the Speech SDK actually has a lot of documentation in general ... i had to read over it 2 (or 3) times before stuff started to click!

Speech Add-In for Pocket Internet Explorer

this is included as an installer with Speech Server. NOTE that the desktop Speech Add-in for Internet Explorer comes with the Speech SDK. when installed, it lets the user browse to speech-enabled web applications through Pocket IE, and adds a Speech settings control panel. under the covers it interprets the SALT within the web pages, communicates with the Speech Server through SOAP and TCP, displays an audio meter when recording input, does some processing and compression of voice input, and plays back synthesized output from Speech Server. it gracefully handles losing the connection to the Speech Server while you still have access to the web server, so that the web application will still work (just without voice). the other scenario is if you lose the web server.

i attempted to run a SALT page locally within PIE, but was not successful. from watching the trace of the SES\Lobby web service (below), it did not look like the Speech add-in was communicating with the server at all. my thought was that i could store the SALT pages locally on the device, and then just have the connection to the speech server. from the docs it looks like the Speech Server must have a trusted connection to the web server; meaning my idea might not be feasible. the follow-up idea is then to host the SALT pages on the device itself, and then the Speech Server could trust the web server running on the device. of course the SALT pages would have to be pre-rendered, and it would just be a basic web page server (not doing all that ASP.NET is doing).

UPDATE, i was able to get this to work using a command line tool included with the Speech SDK called PIECC.exe. it modifies the local salt+html page to add an ActiveX control for SALT capability, which you can open from the local file system on the device (without needing a web server). then all you need is the connection to the Speech Server. the files are included below

it will only work on Pocket PC 2003 devices, and only over a WiFi or ethernet connection. the PIE Speech add-in will not work on SmartPhones. it is a shame that SmartPhones do not get their own speech add-in. this goes against the speech 'vision' video that gets playtime at MS events. 1 of the 4 (or so) scenarios shows a guy walking through an airport doing multimodal browsing on a SmartPhone to arrange his flight. this is impossible without the Speech add-in for PIE on SP :( nor will the PIE Speech add-in work over GPRS. so the Pocket PC Phone Editions will also need to have a WiFi (or wired) connection. finally, the PIE Speech add-in is intended for controlled wireless networks. meaning it will be used in office situations and warehouses. it is not intended for you to be able to do voice browsing when you happen to be at a public WiFi hot spot (e.g. Starbucks). i won't be speech-enabling brains-N-brawn.com/pie until voice browsing is more accessible to roaming devices

some other oddities: the SpeechMap sample readme mentions that you should delete the \Windows\mscc.mtfp file to reduce recognition latency ... but that is all it says. ends up that was a bug from the Speech Server beta, and the file does NOT need to be removed for RTM. i did delete it at one point, and did not notice any different results. finally, if you look at the web.config of the Speech SDK samples, it mentions 'Pocket IE Professional'. what the h3ll is that ... i've never heard of it? from the text, it looks like it will do speech recognition (and possibly TTS) locally on the device. i'm assuming this would be incorporating the VoiceCommand technology? at MobileDevCon, i actually did see an alpha product where the Speech add-in did everything locally (without the use of a Speech Server). when that ends up on Pocket PCs, that will compel me to SALT'ify my mobile web sites


the pics above are of the Button settings and the audio meter in IE, collecting audio for speech recognition. the pics below are of the Speech settings control panel




the last screen (above) shows the debug port. there is a corresponding 'Speech Debugging Console for PIE' on the desktop. this provides for a great debugging experience. the pic below shows that a Prompt is being read through the PIE Speech add-in


this will be a quick walk-through of setting up the Speech Server environment and creating a simple speech app with the Speech SDK to test on your Pocket PC using the PIE Speech add-in. the web app will do text-to-speech and speech recognition on your PPC.

Setup Environment

1st thing to do was get Speech Server 2004 installed. it will only install on Windows Server 2003, so i had to set up another machine to run it on. dug out a 3-year-old notebook and put Windows Server 2003 Standard on it. next, i installed IIS. then, i did the Windows Update. the Speech Server 2004 disk comes with a folder of 3 additional hot fixes, so i installed those. then EnterpriseInstrumentation, and finally Speech Server 2004 itself. from the graphic below you can see that i did not have a TIM installed. a TIM is the interface to the telephony board mentioned above, which i do not have, and is not necessary for this demo. one of the following install screens asks if you want to install TAS (for telephony); you do not, which gets past not having a TIM

2nd thing was to get the Speech Add-in for Pocket IE installed. this install is included with Microsoft Speech Server. you might also need to set up a hardware button to be used with the Speech add-in. finally, i made sure my Pocket PC was set up to work over WiFi


3rd step was to create a simple speech app. this involved installing the Speech SDK on my development notebook, where i was going to host the speech application. then i created a new Speech Web Application. make sure you change the settings so that it is Multimodal (and not Voice-only). NOTE you cannot use Whidbey B1 to create a Speech web app. if you do have Whidbey B1 installed, then you need to check your IIS settings and make sure the speech apps are running under ASP.NET 1.1 (and not 2.0). Whidbey has no Speech capabilities built-in at all ... lame. eventually the Speech SDK will be updated to work with it as well

add a Prompt speech control to the page, along with an HtmlTextBox and an HtmlButton.

go to the HTML view, and add a click handler to the button to call Start() on the Prompt
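a sketch of what that markup might look like (the control ids here are illustrative; yours depend on what the designer generated):

```html
<!-- illustrative only: 'Prompt1' and 'textBox1' are assumed ids -->
<input type="text" id="textBox1" />
<!-- clicking the button kicks off TTS by starting the SALT prompt -->
<input type="button" id="btnSpeak" value="speak" onclick="Prompt1.Start()" />
```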

at this point, the speech app would work on desktop IE (with the Speech Add-in), but is not tied to Speech Server yet. to make it interact with Speech Server on the Pocket PC, you need to go to the code behind and add this code

4th step was to configure Speech Server to trust my web server. this involves going into the Microsoft Speech Server control panel. you have to right-click on the server, and it has an action to let you trust a web server. added the machine name and ip address of my web server and that was it. if you do not add this as a trusted site, then the Speech add-in might not work (especially when it comes to Speech Recognition). if the Speech add-in is failing, then check the Event Log on the Speech Server. you will probably see some error to this effect: The request to load a resource http://hpnotebook/speechmaps/Grammars/Location.grxml was denied because the host is not in the Trusted Sites list. To make this resource available to users, add http://hpnotebook to the Trusted Sites list.

the documentation also states that you might have to do some SADS setup. SADS is a separate install for doing speech application deployment. i messed around with this a little, but ended up not having to do it. you might also have to set some mime-types on the web server. ends up i did not have to do this either, but i'll put it here anyways

finally, i picked up my Pocket PC and fired up Pocket IE. entered in the URL of the speech app on my development web server. at this point the PIE Speech add-in kicked in and asked if i wanted to trust the URL to the Speech Server. i said yes, and that setting was stored on the device so that it does not keep asking you all the time. then all i did was click on the button, and the web page spoke to me! of course i made my Pocket PC say great things about me ... as well as cuss :) so that shows how to do TTS with Speech Server.


NOTE if you run this through desktop IE, it will go through Speech Server as well. i know i said that desktop IE does not need to use Speech Server, but you can force it to, for more realistic testing

Prompt Database

this can easily be extended to support prompt databases, which are pre-recorded audio files typically done by voice models. all i did was download the prompt databases from MSDN. there are 3 of them included: 2 female, and 1 male ... each including something like 500 words. the words fall into these categories:

the words from the Greetings category are displayed below. somebody REALLY needs to put together a HAL 9000 prompt database!

all you do is build one of those projects into a .prompts file (which will be something like 10 megs). then add that file to your project. next, right-click on the Prompt control and select 'Manage Prompt Databases'. select the .prompts file and add it as document relative. doing this adds a <speechControlSettings/> section to web.config. this is where i had problems running the app under ASP.NET 2.0. changing it to run under ASP.NET 1.1 made everything work as expected

finally, you need to surround the text to be spoken with this markup:

<peml:prompt_output><peml:database fname="Prompts/Kim Evans%208kHz%2016%20bit.prompts" />TEXT_TO_BE_SPOKEN</peml:prompt_output>

now when Speech Server receives a word to be spoken, it will check the prompt database for that word(s). if the word exists in the database, then it will play back the recording, else it will do speech synthesis as a backup. i ran this on my Pocket PC. it played back recordings for words that were in the database, and did TTS for ones that were not. if i provided a sentence that only had some of the words in the database, then it did TTS (instead of mixing TTS and recordings)

Speech Recognition

now let's add Speech Recognition real quick. go back to the page designer and add a Listen control and a new HtmlButton.

right-click the Listen control and go into its Property Builder. click on the Input \ General tree item, and then 'Add New Inline Grammar'. add <one-of/> and some <item/> to the <rule/>. this defines the commands that we can speak for the Speech add-in to return a match. the one below lets you pick a color
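for reference, an inline color grammar along those lines could look something like this SRGS XML (the rule name and items are just examples, not what the Property Builder emits verbatim):

```xml
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="Colors">
  <rule id="Colors" scope="public">
    <!-- the commands the user can speak -->
    <one-of>
      <item>blue</item>
      <item>green</item>
      <item>yellow</item>
    </one-of>
  </rule>
</grammar>
```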

now click the 'XPathTrigger Sample Sentence Tool' and type in an <item/> that you entered (e.g. blue). the XML fragment it returns is the semantic markup language (SML) result for speech recognition. above, change TargetElement to 'textBox1', change TargetAttribute to 'value', and change Value to '/SML'. this sets it so that on recognition, the textBox1 HtmlControl will have its value attribute set to the recognition result.

now go into the HTML view and create an onclick() for the HtmlControl to call .Start() on the Listen control. thus, when you click the button, the Speech add-in will start listening for your command
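the button markup is one line (again, 'Listen1' is whatever id your Listen control actually has):

```html
<!-- illustrative: starts speech recognition against the inline grammar -->
<input type="button" id="btnReco" value="reco" onclick="Listen1.Start()" />
```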

the final thing to do is tie it to Speech Server in the code behind (as we did for TTS).

browse to the page in Pocket IE again. click the Reco button and you will see the audio meter in the bottom right. if you do not see the audio meter, check your WiFi connection and also the event log of the Speech Server. speak one of the commands from the grammar you provided, and it will show up in the textbox. NOTE that words with multiple syllables are easier for the engine to recognize. a real world application would have a much more complex grammar, as well as provide a dialogue. the dialogue would handle failure and speak back its result to make sure it recognized what you really said. to see a slightly more complex sample that does Speech Recognition, look at the TapAndTalk sample that comes with the Speech SDK (NOTE that is the only sample in the Speech SDK that is pre-configured to work with Speech Server; the others would have to be modified). all you have to do is modify the web.config to get it to use Speech Server. Speech Server also comes with a larger sample that uses the MapPoint Web Service. you will have to apply for a MapPoint trial account, and modify the web.config before being able to run it.

when the speech page is rendered, it ends up being the HTML that we provided, with the Prompt control rendered as SALT (which is XML) and JScript. if the page is viewed in a browser that does not support SALT, then it is just ignored. the output from the simple page is provided below. the raw SALT element for TTS is <salt:prompt/>, and for SR it is <salt:listen/>
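a rough sketch of the kind of SALT such a page carries (the ids, prompt text, and grammar path are illustrative, not copied from actual rendered output):

```xml
<!-- TTS: started from script with Prompt1.Start() -->
<salt:prompt id="Prompt1">welcome to the sample page</salt:prompt>

<!-- SR: the grammar to match against, plus a bind that pushes
     the /SML recognition result into the textbox -->
<salt:listen id="Listen1">
  <salt:grammar src="Grammars/Colors.grxml" />
  <salt:bind targetelement="textBox1" targetattribute="value" value="/SML" />
</salt:listen>
```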

Compact Framework Wrapper

just so happens i like the Compact Framework. have done a number of articles doing some form of Speech Recognition with CF, but never Text-To-Speech, so i decided to trick this out a bit. first, i modified the sample above to work with query strings. if a query string was passed in, then the response would immediately read it as a prompt (without waiting for a user to press a button). then i created a Compact Framework app in Whidbey Beta 1 (NOTE i did not attempt this with the OpenNETCF HtmlViewer control in VS.NET 2003). i added a WebBrowser control (with 0 size), a textbox, and a button. when the button is pressed, it reads from the text box and passes it to the speech web form as a query string through the WebBrowser. the WebBrowser control can still rely on the Speech add-in to play back the prompt. this ends up letting you do TTS in the Compact Framework, by utilizing Speech Server. this is actually a viable solution for CF applications that run in a controlled WiFi network. as shown above with PIECC, the Web Server dependency can even be cut out of the equation entirely
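the query string plumbing here is trivial; a sketch in script (the page name 'speak.aspx' and the 'text' parameter are my own inventions for illustration, not from the sample):

```javascript
// build the URL the CF app points its WebBrowser control at;
// the speech page reads the 'text' query string and prompts it immediately
function buildTtsUrl(baseUrl, text) {
    return baseUrl + "?text=" + encodeURIComponent(text);
}

console.log(buildTtsUrl("http://hpnotebook/speechapp/speak.aspx", "hello world"));
// → http://hpnotebook/speechapp/speak.aspx?text=hello%20world
```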

once TTS was working, i attempted Speech Recognition. the 1st thing to do was make the recognition page accept a query string which had the grammar file. in this way the same page could be used to recognize commands from different grammars. then, in the <BODY/> onload event i made it start recognizing immediately. the unexpected thing is that the audio meter even showed up in the CF application! so what happens is you click a button on the CF form, it loads the page to start recognition. you speak, and the page posts back, and then displays the recognition result. the problem (BUG) is that the WebBrowser control in Whidbey B1 does not let you access the returned HTML. you can set the HTML on the DocumentText property, but you cannot access it. so there was no way for me to get the recognition result back into CF (without reflecting against private members). once that is fixed, then CF could use Speech Server to do the recognition, and get the result back into the context of the CF app. pretty bad ass if i do say so :)
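once the WebBrowser control does expose the returned HTML, pulling the result out could be as simple as a regex over the page. this assumes the recognition result gets echoed back as an <SML> element in the response, which is my assumption for illustration, not documented behavior:

```javascript
// grab the inner text of the first <SML> element in the returned HTML;
// returns null if no recognition result was echoed back
function extractSml(html) {
    var match = /<SML[^>]*>([\s\S]*?)<\/SML>/i.exec(html);
    return match ? match[1] : null;
}

console.log(extractSml('<SML confidence="0.9">blue</SML>'));
// → blue
```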


the pic above-left shows the CF designer with the WebBrowser control selected. the pic above-right shows the WebBrowser rendering the default start page of Pocket IE. the pic below-left shows the CF doing TTS. all you do is click the 'speak' button, and it reads whatever is in the text box to the left. the WebBrowser is shown here for debugging only ... it would be hidden otherwise. the pic below-right shows the CF app doing Speech Reco. you click the 'listen' button, and the audio meter shows up waiting for a response. once you speak, the page posts back and renders the result on the page. once the CF WebBrowser control is fixed, then you would just parse the returned HTML and display it as a MessageBox in CF. at that point, the WebBrowser control could be hidden away as well. i did not try this, but you could probably use PIECC to make the SR run locally on the device as well. NOTE the grammar would have to be inline. so now we have both TTS and SR for the Compact Framework!


now what i really want is for the dev groups to provide a class library (or the protocol) to call Speech Server directly from the Compact Framework (without the indirection of an embedded web browser). it could possibly provide an abstraction layer so we would not have to deal with SALT. better yet, SAPI running locally on the devices for simple TTS and command-and-control SR (like VoiceCommand) would be great. ... don't expect to be doing dictation on mobile devices for a while though

WS Tracing

when i was looking around to see what installing Speech Server had done, i saw that it had created an ASMX web service called SES, with a page called Lobby.asmx. i immediately opened that up in IE and got the following result.

i know a little bit about web services :) ... so decided to get a look at what was going on. created a TraceExtensionLib class library, and then modified the web.config to execute that SoapExtension. all the SoapExtension does is write out the requests from the client to the web service, and the response from the web service to the client. in this manner, we can see the SOAP calls made by the Pocket IE Speech add-in. the audio data is not in these calls, and is going over TCP instead. this is almost exactly the same model i followed in the /freeSpeech article! main differences are i was using SAPI grammars, used WS-Attachments (SOAP) instead of TCP, and made it work directly in the Compact Framework (and not PIE).

the sequence diagram above shows the gist of what happens in the trace files below. NOTE this does not reflect the messages that are being sent directly over TCP


so that is an overview (played around with this for 5 days) of Speech Server 2004 and the PIE Speech add-in, and how you can make it work with the Compact Framework. in general, you only really have to setup Speech Server once, and then you can run multiple speech applications off of it. the Pocket IE integration is well done and i found the experience to be responsive. the only delay i noticed was when browsing from one page to the next ... which is the nature of the web; and why i hope the Compact Framework gets direct access to speech capabilities soon

i think Speech Server's main target is telephony voice-only apps. i can see use for the PIE Speech add-in in targeted settings; but i don't think it will catch on until Speech Server can be used over GPRS or uncontrolled WiFi hot spots, or until devices can do SR and TTS locally. the main advantage being ease of entry. also showed how we could use this for non-web apps on devices as well. i still question multimodal apps on the desktop. there is nothing holding them back now, except for people being so used to the mouse and keyboard (and the IE Speech add-in not being pushed out, or directly downloadable). as my eyes get worse, i do 'see' where it could come in handy, or when i need to have my eyes directed elsewhere (e.g. driving). finally, i think speech applications on a Tablet PC make perfect sense now. Tablet PCs are currently being held back because of the delays with XP SP2. the updates for Tablet PC that SP2 includes make the experience much improved

specification-wise, VoiceXml and SALT are similar (SALT has better call control, among other things). SALT has the advantage of coming 2nd, to clean up mistakes. VoiceXml has the advantage of being to market 1st. teamed with Intel to provide cheap hardware (Dialogic) boards, i think SALT will be cost competitive against VoiceXml. SALT has the browser advantage (with MS) on both the desktop and devices. it would be more interesting if FireFox were to add multimodal VoiceXml browsing. but right now, owning the browser does not really matter ... since multimodal has to take off 1st. voice-only apps will work on any phone ... and there are a lot of phones! the big gun for VoiceXml is IBM, which has a lot of speech experts. SALT also has the advantage in development environment. Visual Studio is hard to beat, with a large number of developers being familiar with it. i'm not sure how the specifications are doing in the standards bodies?

sorry for going overboard with screen shots ... i've been doing so much crypto and web service stuff lately, i think i was just happy to have some UI :)


this is what i want to happen (pulled directly from my 4ss). first, Pocket PCs being able to do SR and TTS locally (we get a glimpse of this with VoiceCommand) (this is possibly Pocket IE Pro). next, SmartPhones being able to do multimodal with Speech Server. third, i want the PIE Speech add-in to stop using the TCP direct connect, and use MTOM (WS-Attachments replacement) so that it works through firewalls. fourth, i want Speech Server working over GPRS and uncontrolled WiFi hot spots. fifth, i want the Speech Server web service API documented so it is callable from whatever i want. sixth, i want the Speech Server web service API to evolve into a standard called WS-Speech. seventh, i want WS-Speech to get hooked into the Devices Profile (aka UPnP 2.0 Proposal)

ultimately this would allow for a Speech Server in every home. then you would be talking to your speech-aware appliances through standard prompts, instead of hunting for the remote. or the voice-activated toaster ... whatever. at most you would have a button that you would press to tell the device to listen

i would also like to see Speech Server get the ability to do speaker verification. speech recognition tells you what somebody said, while speaker verification tells you who said it. finally, my bet is that SALT ends up getting baked into XAML ... especially since i've seen no signs of a managed SAPI 6 for WinForms applications


no source code to give away with this one. the walk through (above) shows everything


i will speech-ify my (mobile) web sites once it is easier for end users to get the Speech Add-in to make IE a multimodal browser and/or when the Pocket IE Speech Add-in better supports roaming users


i've got a long list of different stuff to play with ... MapPoint, Tablet, AI, Whidbey, CF, WS ... not sure which? i'm supposed to be looking for a new contract too. later