/freeSpeech


CF .NET Voice Recorder + WSE DIME Web Service + SAPI Voice Recognition

(Mobile / XML Web Service / Speech = KillerApp)

http://www.brains-N-brawn.com/freeSpeech 1/27/2003 casey chesnut

This one shows how I created a voice-controlled fat client application for my PocketPC (i.e. my own implementation of MiPad from MS Research)


Introduction

So I'm checking out all these different technologies and Speech .NET (/noHands ) has a little blurb about a Pocket IE add-in. This would let you do voice-controlled web apps on a Pocket PC. Cool! I hate writing with the little stylus. When beta 2 comes out, I'm all over it ... and it's not there. H3ll! I see that they made the dev tools much better and improved standards support, but the really cool piece I wanted wasn't there. We will get it eventually, but no release date is set. The really interesting part is how are they going to do it? The Speech .NET add-in for IE is currently about 65 megs ... not PPC friendly. Maybe it's all debug code right now? CPU and memory requirements are probably high for small devices too. You could just wait for the devices to gain power OR you could offload the processing elsewhere. Another question is ... Speech .NET is only for the web, so what about fat clients? For the full .NET Framework you can use SAPI, currently 5.1. Rumor is there is a 6 beta underway, but nobody will freakin tell me. SAPI .NET maybe? I've said this before, MS should just auto-sign me up for every beta. They suck in general about getting the word out on (most) betas ... you can tell them I said that. It's easy enough if you have one or two niche technologies, it's tremendously difficult if you are all over the board (see menu tree to the right). Regardless, SAPI does not exist for CE. There supposedly are some SAPI bits for Auto CE but nothing on the PocketPC. From a newsgroup posting I saw mention of Car .NET ... might be bogus, but I can't wait to program my car, or at least a Clarion radio. Back to small devices, there are some 3rd party vendors for speech on PocketPCs. I know this because their marketing departments always contact me whenever I write an article that has anything to do with voice or speech technologies. elantts.com is one of them; don't know if they do voice recognition (like this) or only Text-To-Speech. I have never tried one of their products. So out of all that, the parts that interested me are offloading the speech recognition and how to support fat client apps on the PPC ... which leads to this article

the steps are ... record your voice to a wav file on a pocket pc, call a web service passing it the recorded file, the web service does speech recognition using a grammar or dictation, and then it returns what you said as text to display on the ppc
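the client half of those steps could be sketched roughly like this (a hedged sketch; FreeSpeechProxy is a hypothetical WSE-style proxy class name, and on the device the DIME plumbing is really the modified /cfWSE code, but the shape is the same):

```csharp
// rough sketch of the round trip: record a wav, attach it via DIME,
// get the recognized text back. wavToStringDime matches the web method
// in the sample SoapRequest below; the proxy class name is made up.
using System.IO;
using Microsoft.Web.Services;
using Microsoft.Web.Services.Dime;

public class Recognizer
{
    public string Recognize(string wavPath)
    {
        FreeSpeechProxy proxy = new FreeSpeechProxy();

        // attach the recorded wav as a DimeAttachment instead of
        // base64-encoding it into the SoapBody
        FileStream fs = new FileStream(wavPath, FileMode.Open, FileAccess.Read);
        DimeAttachment att =
            new DimeAttachment("audio/wav", TypeFormatEnum.MediaType, fs);
        proxy.RequestSoapContext.Attachments.Add(att);

        // 10 second reco timeout, command-and-control mode, "state" grammar
        return proxy.wavToStringDime(10, "c", "state");
    }
}
```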

This makes it what is called a multimodal app because it can accept stylus or voice input. Some controls are easier to use with the stylus, while voice would be preferable in many other scenarios. Of course, sometimes your environment makes speech input an impossibility. There are a couple different usage patterns that fall under what is called tap-and-talk. First, the user taps a button for the device to begin listening, speaks, and after a certain amount of silence it stops listening. Second, they tap a button, speak, and it stops listening after some timeout period. Third, they tap and hold the control to record, speak, and then release the control when they are done speaking. Finally, they tap to begin, speak, and then tap again when they are done speaking. The last 2 are the most accurate because the user helps the device know the start and stop points. Not much of an annoyance, and they work the best in noisy environments. With the VoiceRecorder Control on the PocketPC it is possible to implement the last 3 of these scenarios

PocketPC VoiceRecorder from CF .NET

Try to keep my eye out for cool .NET stuff. This involves checking endless newsgroups, web sites, listservs ... every once in a while I see a posting that makes me think "bet I could do something cool with that". One such posting involved an eC++ wrapper being called by C# code to popup the PocketPC voice recorder control. Couple iterations and it went fully managed with this posting by Neil Cowburn. So with that code you can programmatically bring up the voice recorder. Cool, but only the very 1st iteration brought it up in record mode, which is what I wanted for this. The VoiceCtl has a style to open up already recording but it does not work on the PPC. Another style makes it Modal. The VoiceCtl can also accept window messages telling it to record, play, stop, etc... except when it is being displayed Modal, so I modified Neil's code to not display Modal. This lets me SendMessage the VoiceCtl to make it open up and then start recording immediately without user interaction. Finally, I wanted to know when recording was over. The VoiceCtl handles this by sending notifications back to its parent's hWnd. Just so happens you can catch these messages in CF .NET using what is called a MessageWindow. Created a MessageWindow class, associated it to my main form, and passed its hWnd to the pInvoke call used to spawn the VoiceCtl. This causes messages from the VoiceCtl to be sent to the MessageWindow's WndProc. When a message arrives, it can then call a method in the main form, passing arguments from the message. I have it call a method in my form to look for the OK button press. This means a user recorded their command and that it was a good recording. It can determine this from the arguments that are passed. One of those arguments is a pointer to a structure saying how the user interacted with the control. Marshal that pointer into a similar .NET structure and it's ready to go. NOTE the only trick was that the VoiceCtl did not display when I 1st added the MessageWindow. Alex Feinman solved this by telling me to make a MoveWindow pInvoke because the MessageWindow has a size of 0. With that, we now have the full functionality of the VoiceCtl in .NET. We can open it, send messages to it, and receive notifications from it
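the notification plumbing could be sketched like so (hedged: the VRM_RECORD message ID is a placeholder, and MainForm/OnVoiceNotify are made-up names; the SendMessage/MoveWindow declarations are the standard coredll.dll pInvokes):

```csharp
// a CF .NET MessageWindow whose hWnd is handed to the VoiceCtl as its
// parent, so the control's notifications land in WndProc
using System;
using System.Runtime.InteropServices;
using Microsoft.WindowsCE.Forms;

public class VoiceSink : MessageWindow
{
    [DllImport("coredll.dll")]
    static extern IntPtr SendMessage(IntPtr hWnd, uint msg,
        IntPtr wParam, IntPtr lParam);

    [DllImport("coredll.dll")]
    static extern bool MoveWindow(IntPtr hWnd, int x, int y,
        int w, int h, bool repaint);

    // placeholder value; use the real VoiceCtl record message ID
    const uint VRM_RECORD = 0x0400 + 1;

    readonly MainForm form;

    public VoiceSink(MainForm form)
    {
        this.form = form;
        // the MessageWindow is created with a size of 0; give it one so
        // the VoiceCtl it parents actually displays (Alex Feinman's fix)
        MoveWindow(this.Hwnd, 0, 0, 240, 320, true);
    }

    public void StartRecording(IntPtr voiceCtlHwnd)
    {
        // tell the already-open VoiceCtl to start recording immediately
        SendMessage(voiceCtlHwnd, VRM_RECORD, IntPtr.Zero, IntPtr.Zero);
    }

    protected override void WndProc(ref Message m)
    {
        // hand the notification args to the form, which marshals the
        // pointed-to structure and looks for the OK button press
        form.OnVoiceNotify(m.WParam, m.LParam);
        base.WndProc(ref m);
    }
}
```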

The VoiceRecorder is how you record your notes, typically with a hardware button. It can record to a couple different formats. The lowest is GSM 6.10 8,000 Hz, Mono (2 KB/s). This is really low quality. For this app, to do speech recognition, I bumped it up to PCM 11,025 Hz, 16 Bit, Stereo (43 KB/s). There is a noticeable difference between these formats. No particular reason I chose this one, it might still be too low or could even be too high. Regardless, I got better than expected results at this setting and the file sizes were ~100K. File size is important because of the limited connections these devices have. Also, I have the cheapo Audiovox Maestro PocketPC. Even at the highest quality format settings the recordings are not that great because the microphone is not that good. Not to mention it is positioned in the same place I put my thumb when using the stylus against the screen. My guess is the Pocket PC Phone Edition hardware might have better microphones for their double duty usage? To change the recording format: Notes - Tools - Options - Global Input Options - Options - Voice recording format

WSE DIME Web Service

The above lets us record our voice on the PocketPC device. Not enough mojo to do speech recognition on the device, but if we have an internet connection then we can ship it off to a server and have it return text. The lame way to send it to a web service is to base64 the wav file bytes and then send that within the SoapBody. Once again ... lame. Small devices have weak CPUs and weak connections; this makes base64 doubly bad because the devices have to waste CPU to encode/decode and it makes for larger message bodies to send over the wire. And then the server has to do the same. The cool way to do it is use DIME. DIME is a transport protocol typically used for carrying a SoapEnvelope and a number of binary attachments. In this manner base64 encoding does not have to take place because the wav file will not be embedded in the SoapBody. Just so happens MS released the WSE to support DIME, and my last article (/cfWSE) showed how to make DIME calls from CF .NET. If you are cynical, you might be seeing a method to my madness; meaning the /cfWSE article was a premeditated stepping stone to this. The server can receive the DimeAttachment with 1 or 2 lines of code and then process it accordingly. NOTE early on in the web service game I saw a surprising # of posts for voice web services. I was like ... what are you stupid, don't you understand the concept ... and now I go off and write one. Doh! At this point we can record our voice and ship it off to the speech web service to be recognized

a sample SoapRequest looks like this. NOTE the lack of the wav file because it is a DimeAttachment. The request specifies the timeout for recognizing the speech on the server, the reco mode (command and control, dictation, or hybrid), and the grammar to use

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <soap:Header></soap:Header>
 <soap:Body>
  <wavToStringDime xmlns="http://brains-N-brawn.com/freeSpeech/">
   <timeout>10</timeout>
   <mode>c</mode>
   <grammar>state</grammar>
  </wavToStringDime>
 </soap:Body>
</soap:Envelope>
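on the server, receiving that request could be sketched like this (hedged: FreeSpeechService and RecognizeWav are made-up names; the attachment access is WSE 1.0's HttpSoapContext, which is the "1 or 2 lines of code" mentioned above):

```csharp
// hedged sketch of the server side: pull the wav out of the incoming
// DIME attachment and hand it to the SAPI recognition code
using System.IO;
using System.Web.Services;
using Microsoft.Web.Services;

public class FreeSpeechService : WebService
{
    [WebMethod]
    public string wavToStringDime(int timeout, string mode, string grammar)
    {
        // grab the wav file off the request's DIME attachments
        Stream wav = HttpSoapContext.RequestContext.Attachments[0].Stream;

        // recognize it using the requested timeout / mode / grammar
        return RecognizeWav(wav, timeout, mode, grammar);
    }

    string RecognizeWav(Stream wav, int timeout, string mode, string grammar)
    {
        // SAPI recognition happens here (next section)
        return "recognized text";
    }
}
```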

SAPI Voice Recognition

There are a couple different flavors of voice recognition. The 1st is dictation, meaning you can say just about anything and the engine will go against a huge dictionary and low-level phonemes to figure out what you said. This is the most flexible, and also the most error-prone without lots of training. Training does not make sense in a small-device client-server scenario, but I did incorporate (untrained) dictation into this as a feature ... if not just for laughs. The 2nd is command and control. C&C takes what you said and compares it to a list of possible values in what is called a grammar. Meaning if I have a grammar that only specifies the primary colors (red, blue, green) then the speech recognition engine is going to have a much easier time and do a much better job figuring out what was said, as opposed to dictation. For instance, if I am scheduling a flight on my PocketPC and there is a drop-down of airports, then I should be able to click that drop-down, say the airport I want, it will do speech recognition against the list of possible airports, and then auto-select the appropriate airport for me. It makes even more sense when the answer set is larger, such as destination cities, which might be hard to even store on the small device because of constant updates, size limitations, etc... but with voice you can get around these UI limitations, although the limitation of connectivity is introduced. So for this demo I set it up to allow for dictation and command and control scenarios against a couple different grammars. This lets the Pocket PC app support free form dictation, as well as associating different controls with different grammars. i.e. a control for selecting color would match what the user said against a grammar of possible colors. Finally, it supports a hybrid of both, so that if C&C fails then dictation can take its best shot
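the C&C path on the server could be sketched like this using the SAPI 5.1 SpeechLib COM interop (a hedged sketch recreated from the SAPI SDK sample style; file/rule names are placeholders, and the wait-for-recognition loop is elided):

```csharp
// server-side SAPI 5.1 recognition against a wav file instead of a mic,
// via the Microsoft Speech Object Library interop (SpeechLib)
using SpeechLib;

public class SapiReco
{
    string result = "";

    public string Recognize(string wavPath, string grammarFile, string rule)
    {
        // in-proc recognizer fed from a file stream
        SpInProcRecoContext ctx = new SpInProcRecoContext();
        ctx.Recognition +=
            new _ISpeechRecoContextEvents_RecognitionEventHandler(OnReco);

        SpFileStream wav = new SpFileStream();
        wav.Open(wavPath, SpeechStreamFileMode.SSFMOpenForRead, false);
        ctx.Recognizer.AudioInputStream = wav;

        // load the command-and-control grammar and activate its rule;
        // dictation mode would use DictationLoad/DictationSetState instead
        ISpeechRecoGrammar g = ctx.CreateGrammar(0);
        g.CmdLoadFromFile(grammarFile, SpeechLoadOption.SLOStatic);
        g.CmdSetRuleState(rule, SpeechRuleState.SGDSActive);

        // ... wait here until the Recognition event fires or the
        // request's timeout expires ...
        return result;
    }

    void OnReco(int streamNum, object streamPos,
        SpeechRecognitionType type, ISpeechRecoResult r)
    {
        // pull the recognized text out of the result
        result = r.PhraseInfo.GetText(0, -1, true);
    }
}
```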

Kick 4ss Demo

The simple demo app shows a couple different user scenarios detailed in the following table. Label is just a marker to show you what row of the form is being targeted. Each label has 2 mics associated with it to provide a little different user interaction. Start shows how the user starts recording and Stop shows how they end. VoiceControl shows how it is displayed, if at all. Mode shows whether the Speech Web Service is doing a command-and-control (C&C) or dictation recognition. Control shows the UI control that is then filled on the screen. In general it demos: tap-and-talk, tap-and-timeout, tap-and-hold/release. The videos show it in action, running on the device with a Wi-Fi connection, recorded using Camtasia and the Remote Display Control Host Powertoy

Label     Start            Stop           VoiceControl       Mode        Control
test      tap mic, record  tap stop, ok   full               C&C         textbox
          tap mic          tap stop, ok   full               C&C         textbox
state     tap mic          tap stop       no ok              C&C         textbox
          tap mic          tap stop       no ok or cancel    C&C         textbox
color     tap mic          timeout        no ok or cancel    C&C         combobox
          tap mic          timeout        not visible        C&C         combobox
contact   tap mic          tap mic        no ok or cancel    C&C         textbox
          tap mic          tap mic        not visible        C&C         textbox
dictate   tap and hold     release        no ok or cancel    dictation   multiline textbox
          tap and hold     release        not visible        dictation   multiline textbox

This next one is a larger video (2 megs). It shows the app in use on the device and you get to see and hear how I interact with it. You also get to see the speed, which is surprisingly fast. This could allow for entering data much faster than just with a stylus. Works much better than I thought it would; dictation for numbers was unbelievably good. Don't know about you, but I consider this almost as cool as /noSink

web cam demo

The true test was taking it to a Starbucks with a TMobile hotspot. Worked really well. It was almost freezing the night I went (in Texas), so the place was really crowded ... and loud. Even with the ambient noise it was still able to recognize my speech. Not as well as the clean room demos above, but still much better than I would have expected. Of course everybody in the place looked at me like I was crazy ... par for the course

Extensions

The obvious extensions are a system grammar and application-specific grammars. The system grammar would be universal, commonly said stuff stored on the speech recognition web service server. The consuming apps could use these generically by requesting which grammar rule to use with each wav file. For application-specific grammars, they could pass that grammar along with the wav file to the server. If the grammar file was too large, then they could pass a URL and arguments pointing to another web service that would return that grammar. That would require standardization and such. Next, instead of just returning the text, grammars support values associated with them. i.e. if I said October against a month grammar, it might return the number 10 instead of text. Then, this could support different languages. Security! Next, the reverse could be done for Text-to-Speech. The web service could be called with text, generate the wav file, and return it to the small device over DIME. Finally, this might all be useless as small devices gain in power and resources; possibly SAPI .NET for the PPC? Natural language recognition (NLR) is the next level up from Dictation (e.g. Hal 9000 and Dexter's computer), and would require even more processing power
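the month example would look something like this in SAPI 5's XML grammar format (a hedged sketch; the rule and property names are made up for illustration):

```xml
<GRAMMAR LANGID="409">
  <RULE NAME="month" TOPLEVEL="ACTIVE">
    <L>
      <P PROPNAME="month" VAL="10">October</P>
      <P PROPNAME="month" VAL="11">November</P>
      <P PROPNAME="month" VAL="12">December</P>
    </L>
  </RULE>
</GRAMMAR>
```

the web service would then return the VAL from the recognition result's properties instead of the raw spoken text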

Source

CF .NET client code (C#)

The voice recorder code was an extension to existing code. The CF .NET test app is included, but is too simplistic for anyone to care about. The code for doing DIME from CF .NET is at /cfWSE, which is modified code in the 1st place. The speech recognition code for the web service can be recreated from the samples that come with the SAPI SDK. That's all there is to it. Other than the pre-existing DIME code, I wrote this in 3 to 4 days

Future

Might have one more voice-related pocket pc article in me ... even without waiting for the Speech .NET Pocket IE Add-in. Been eyeing DirectX9 but if I wrote a game it would probably be too ugly to play. Maybe some more WindowsMedia, or VoiceXml, or WSE custom filters. Got a stack of books of other techs I have no experience with ... maybe they've got something for me? Later