/noHands (Part 2)

Speech.NET = ASP.NET + SALT

http://www.brains-N-brawn.com/noHands 7/8/2002 casey chesnut

Part 2 - This is a follow-up article to [/noHands Part 1], which is a prerequisite to this material. Will continue speech-enabling the same desktop web application: going over stuff that was glossed over earlier, trying out some new things, and seeing if I can trick the app up into something kind of cool. The format is basically a free-for-all, meaning there is no particular flow between sections, so just skip the boring parts.


screen shot of the voice-operated chat control to the right of live video feed

SALT vs VoiceXml/VXML

Since I had done both, I had gotten a couple of emails asking what I thought about VoiceXml vs SALT. Well, I had not really thought about it ... until I got an email a couple of nights ago from a guy pointing me to the W3C. Ends up they have a whole bunch of speech work going on:

And guess what, the W3C work is all about VoiceXml 2.0 and not SALT. So VoiceXml is far from giving up. Here is an excerpt from the 'Voice Browser' article above:

The SALT Forum is developing a set of extensions to XHTML to support multimodal and telephony-enabled applications. SALT exploits XHTML's events for triggering aural prompts and speech recognition, but relies on scripting to compensate for the lack of declarative support for rich speech dialogs as found in VoiceXML 2.0.

Alright, so what the h3ll are 'speech dialogs'? From the lengthy VoiceXml 2.0 spec, you find one mention of speech dialogs in Appendix K.

Reusable dialog components provide pre-packaged functionality "out-of-the-box" that enables developers to quickly build applications by providing standard default settings and behavior. They shield developers from having to worry about many of the intricacies associated with building a robust speech dialog

Ok, so I am not impressed with speech dialogs, because continued reading shows they only have 2 in the spec, and 1 of those doesn't really count. And the Speech .NET SDK does provide exactly this same functionality (although currently unsupported); they are called Speech Application Controls. From the Speech .NET help file:

Speech Application Controls are prebuilt dialogs that are packaged as user controls that you can drop on your ASP page.

I hope they meant ASP.NET pages :) There are big-name companies on both of the founder/contributor lists, with some of the companies showing up in both ... regardless, these efforts should not be duplicated. The browser wars were over, and might be starting back up because of mozilla/aol; but as a developer, I really do not want a voice browser war as well. NOTE this only concerns multimodal applications. VoiceOnly telephony apps are server-based, so the end user doesn't have to do anything or even care what the underlying markup language is. As an aside, there is no way VoiceXml is going to have near as nice a development tool as VS.NET with the Speech .NET integrated tools ... that makes a huge difference.

Client-side (De)Activation of Grammar Rules

Mentioned this in the last article but did not actually do it. The grammar rules could be inherently different based on their lifetime, so you want to break them up into static, dynamic, and user grammar files. This is another one of my made-up best practices:

So static grammar files are likely to be large and you only want them downloaded to the client browser once. After that initial hit, client-side script should be used to activate and deactivate rules accordingly. In my app, I have the need for this when the page is 1st loaded. At that point, the user can only select a media type, so the rules for paging, etc. should be turned off, and then turned on once a media type is selected. Luckily, there are DOM methods to do just this:

Object.Activate(grammarID, [ruleName]); 
Object.Deactivate(grammarID, [ruleName]);

These are methods off of the SALT <Listen/> element. Render my page, and see that its ID is "_ctl0". Go to the QA control, Reco property, and set its ID to "MyReco"; that is now the ID of the <Listen/> element. Go into my static grammar and set everything to 'Inactive', so only my dynamic grammar that listens for MediaType will be active when the page 1st loads. Then on the MediaAnswer, I set all of the other Rules to active like this:

MyReco.Activate("Grammar1","Browser");
MyReco.Activate("Grammar1","Scroll");
MyReco.Activate("Grammar1","Page");
MyReco.Activate("Grammar1","Step");
MyReco.Activate("Grammar1","Item");
MyReco.Start();

Turns out this does not work; wondering if it might be because of the postback nature of my page or if it just does not work yet. Decided on a simpler test: reactivated all those grammar rules, and then deactivated my media grammar. Then, in the page onLoad(), I activate the media grammar for Grammar2. Ends up this does work, although responsiveness seems slower; so the rule activation/deactivation must be lost during postbacks.

Inline Grammar

When tying a grammar to a page, you can specify a src file (xml or compiled) which gets downloaded along with the SALT-enabled page, OR tie an inline grammar to be rendered in the HTML. So set the Media grammar active again in the file, and commented out the onLoad() stuff from above. Then went into my code-behind's Page_Load, read in the XML grammar file, and set it to the InlineGrammar property of the 2nd grammar on my QA control like this:

string gramPath = Server.MapPath(".") + @"\mediaType.gram.xml";
XmlDocument xd = new XmlDocument();
xd.Load(gramPath);
qaHome.Reco.Grammars[1].Src = null;
qaHome.Reco.Grammars[1].InlineGrammar = xd.OuterXml;

NOTE inline grammars are supposed to take precedence in case a Src grammar file is specified. I'll trust them on that one. Looking at the SALT that renders:

<salt:listen id="MyReco" onreco="MyReco_obj.SysOnReco()" onerror="MyReco_obj.SysOnError()" onnoreco="MyReco_obj.SysOnNoReco()" onsilence="MyReco_obj.SysOnSilence()">
<salt:grammar id="Grammar1" src="http://localhost/noHands/noHands.gram"/>
<salt:grammar id="Grammar2"><grammar>
	<rule name="Media" toplevel="ACTIVE">
		<l propname="Media" description="Media">
			<p valstr="image">image</p>
			<p valstr="audio">audio</p>
			<p valstr="video">video</p>
		</l>
	</rule>
</grammar>
</salt:grammar>
</salt:listen>

This shows that Grammar1, my static grammar, is specified on the Listen element with a Src file, and Grammar2 is inline as set in the code-behind for that page. This would be used when a grammar could be different for each user that is running the application, e.g. if the system would recognize a name or some other personal information specific to that user.

ToolTip Idea

Not interested in coding this, but ... when I bind an Answer to a control based on a grammar rule, then when the page gets rendered, I would like for the grammar to be reflected over, the rule structure put into some sentence format, and then have that tied to the control as a ToolTip for multiModal instances. E.g. if I tie a US State rule to a dropDownList, and then hover over that dropDownList, it could show me "Say Alabama Alaska Arkansas ..." automatically.
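
In the meantime, a minimal code-behind sketch could fake it by hand, something like this (usStates.gram.xml, the 'State' rule, and ddlStates are made-up names just for illustration):

//rough sketch: reflect over a grammar rule and build a ToolTip by hand
//(usStates.gram.xml, the 'State' rule, and ddlStates are hypothetical names)
XmlDocument gram = new XmlDocument();
gram.Load(Server.MapPath("usStates.gram.xml"));
System.Text.StringBuilder sb = new System.Text.StringBuilder("Say ");
foreach (XmlNode phrase in gram.SelectNodes("//rule[@name='State']//p"))
	sb.Append(phrase.InnerText).Append(" ");
ddlStates.ToolTip = sb.ToString();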

Speech Server-side postback

Since I just wrote my 1st piece of server-side code that had to do with Speech a section back, might as well push my luck. Previously had done all speech processing client-side, but it is possible to take it server-side. This is also necessary for telephony apps and Pocket PC devices that only support a subset of the client-side functionality offered by desktop browsers. From the last article, had a whole lot of trouble with the 'select a Page' processing (e.g. "Page one-hundred and one"). To enable this, it took 2 client-side functions: 1 for pre-processing the SML that was returned (PageAnswerNormalization), and 1 for acting on that normalized value (PageAnswer) to actually go to that page #. Got a real headache trying to debug the Normalization code and was really missing the rich debugging features available server-side. So I am going to rip that stuff out, and try to do the same on the server.

Go to the control and take out the ClientNormalization and OnClientAnswer functions. Then set it to AutoPostback. Run the page and get an error about the 'name' attribute. Looking at the properties for Answers, there is no 'name' attribute. Rip open the docs and they show an OnTriggered attribute. This is nowhere to be seen on the properties for Answers either. So I go 'old school' into the HTML view and set it to this:

<speech:Answer OnTriggered="PageAnswer_Triggered" ID="AnswerPage" AutoPostBack="True" XpathTrigger="/SML/Page" TargetElement="qaHome"></speech:Answer>

Run the page, and it complains about PageAnswer_Triggered() not being defined in the code-behind. Obviously, because I don't know what its signature looks like. Bust open the sample (#7) to find the signature, set a breakpoint, and the page loads.

protected void PageAnswer_Triggered(object sender, Microsoft.Web.UI.SpeechControls.TriggeredEventArgs e)
{
	Microsoft.Web.UI.SpeechControls.Answer a = (Microsoft.Web.UI.SpeechControls.Answer) sender;
	string gramValue = e.Value;
}

But it's not that simple. I was expecting the SML to be returned and then I would not have to do the ClientNormalization. Ends up it just returned the value from the SML, which is useless to me without being normalized. Could not find any way to get a hold of the SML ... and I think that is a bad design decision; there should be a server-side normalization hook as well. Regardless, my grammar for natural numbers was kind of hacked in the 1st place, so I will leave this alone and see if the next section will fix it.

CmnRules.cfg

From the speech beta newsgroup, got tipped off to a file cmnrules.cfg. This can be found in the aspnet_speech directory that will get installed to the root of your default IIS web; the same directory that contains the script files for voiceOnly and multiModal operation. This file is a compiled grammar file with a large collection of commonly used rules, including: Alphanum, Credit Card, Currency, Date, Date Block, Duration, Free Form Dictation, Numeral, Time, Time Block, US Phone Numbers, US Social Security, Yes/No/Cancel. Numeral looks exactly like what is needed here; all I have to do is reference cmnrules.cfg from my existing grammar such as:

<ruleref name="natural_number" url="cmnrules.cfg" />

Crack my grammar file open and run a test by saying 'one hundred and one' ... works great! Even returns the value as 101, so the normalization I had to do with my hacked natural_number grammar from Part 1 is not even necessary. So revisited the server-side postback above and made it work like this. NOTE the Server.Transfer to avoid an unnecessary round trip.

protected void PageAnswer_Triggered(object sender, Microsoft.Web.UI.SpeechControls.TriggeredEventArgs e)
{
	Microsoft.Web.UI.SpeechControls.Answer a = (Microsoft.Web.UI.SpeechControls.Answer) sender;
	string url = "home.aspx?page=" + e.Value;
	Server.Transfer(url);
}

Jumping around pages works much better now; need to tie that to item selection as well. Finally, the SDK also comes with an uncompiled version of CmnRules.cfg as CmnRules.xml found at '\Program Files\Microsoft .NET Speech\SpeechControls\v1.0.2826.0\src', which has over 200 rules defined in it. This will be a great learning tool for seeing how to make more complicated grammar rules. FYI cmnrules.xml comes in at 323K, and the compiled version is a whopping 457K! Hoping that those files get installed with the browser add-in, and updated automagically when they are extended.

Unsupported

Digging around the Speech .NET installation, you can find some unsupported features:

Dictation

err, umm ... that word always reminds me of a tasteless joke :) Now I remember reading specifically that Speech .NET does not currently support dictation. Even remember reading it over a couple of times, because I was really upset by that ... but then I see 'Free Form Dictation' when cruising through the CmnRules.cfg ... what's up! Maybe I misread ... but who cares! Must give this a spin to see what it can do.

Crack open my noHands.gram. Add a new rule called Dictate and make it Active. Add a RuleRef to it pointing to the Rule 'Dictation' with the Url 'cmnrules.cfg' and give it a PropertyName of 'Dictate'. Then I give it a recognition string to test it, and ... it totally crashes VS.NET! Finally! I must be doing something cool! Really, I am not happy unless I crash my environment periodically ... otherwise I must not be pushing it hard enough. Of course, I apply the software methodology and 'try it again' ... this time it works flawlessly and returns this SML:

<SML text="kc kicks ass" confidence="1">
	<Dictate text="kc kicks ass" confidence="1" type="FreeFormDictation" name="FreeFormDictation"></Dictate>
</SML>
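
For reference, the Dictate rule I added to noHands.gram looks roughly like this (a sketch following the same grammar format shown earlier; the exact attributes the grammar editor spits out may differ slightly):

<rule name="Dictate" toplevel="ACTIVE">
	<ruleref name="Dictation" url="cmnrules.cfg" propname="Dictate" />
</rule>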

So far so good, but let's tie this to a page and have it dump out the text to a textbox or something. Create a new media folder called 'live chat' :) HEHE! If you can't see where I am going with this, then you are blind ... or should I say deaf! Also, if you can't tell, I'm in the zone now ... you have been warned.

Add a textbox to my Chat page, and bind the Answer to it and the new Dictate grammar rule ... nothing. It will not recognize my free-form dictation, although it would in the Answer dialog and in the Grammar editor ... suck! I try tricking it up by pre/post-pending 'breaker breaker' blah blah blah 'over and out' ... and it still does not work :( So maybe I did not misread that dictation would not work yet ... but the zone will not be denied.

So Speech .NET currently denies me from doing Dictation in a web app :( <Insert some multiple ways of skinning some poor creature cliche here/> Then, forget Speech .NET for now, let's .NET-ify some old-school SAPI.

----- BEGIN MEGA HACK -----

SAPI Dictation

Just so happens that SAPI will let you do Dictation. I think everybody with Windows XP already has the Speech Library installed too? Either way, I wrote the C# code for doing SAPI SR from a microphone about 6 months ago:

//shared recognizer context and a dictation grammar
SpeechLib.SpSharedRecoContextClass ssrc = null;
SpeechLib.ISpeechRecoGrammar isrg = null;
ssrc = new SpeechLib.SpSharedRecoContextClass();
//fire RecognitionEvent whenever SAPI recognizes something
ssrc.Recognition += new SpeechLib._ISpeechRecoContextEvents_RecognitionEventHandler(RecognitionEvent);
isrg = ssrc.CreateGrammar(1);
//load and activate the built-in dictation grammar (free-form speech, no rules)
isrg.DictationLoad(null, SpeechLib.SpeechLoadOption.SLOStatic);
isrg.DictationSetState(SpeechLib.SpeechRuleState.SGDSActive);

public void RecognitionEvent(int i, object o, SpeechLib.SpeechRecognitionType srt, SpeechLib.ISpeechRecoResult isrr)
{
	//(0, -1, true) = all phrase elements, with text replacements applied ... what the hell
	string strText = isrr.PhraseInfo.GetText(0, -1, true);
	System.Diagnostics.Debug.WriteLine("recognized: " + strText);
}

That, and add a reference to the COM Microsoft Speech Object Library. Granted it does a pretty sh1tty job of recognizing my words because it is not trained to my voice, or else I have no clue how anybody can understand a word I say? But I am not going for quality here, that can always be improved later. Dictation is taken care of, but how am I going to get it into a web app ...

IE Hosted WinForm Control

Just happens that you can run fat-client controls in IE. Now all I need is a Chat control ... an internet search later, and I find this: Socket Chat Part 2: Internet Explorer Control Client. Exactly what I wanted. Have looked at this guy's stuff before and really like the stuff he writes. Now the concept is: a user loads the 'live chat' page. This downloads the chat control and hosts it in IE. The control uses the Speech lib on the user's computer to grab what they say through the mic, converts that to text, and submits it to the chat server. While this occurs, the chat control is being updated with messages that others are speaking. If I could make it do TTS and speak the other messages that are coming in, that would be even cooler ... but not sure the 1st part is even feasible yet, because of security.
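
The hosting itself boils down to an <object> tag in the page whose classid points at the assembly and class; roughly like this (the DLL and class names here are placeholders, not the actual names from Saurabh's article):

<object id="chatControl" width="400" height="300"
	classid="ChatControl.dll#SocketChat.ChatControlClient">
</object>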

If we were still talking Speech .NET, then the grammar would be tied to the Dictate rule. Upon recognition, the result would be tied to a textbox. On change, that textbox would set a property of the IE-hosted control. When that property was changed, the control would read the text from that property and post it to the chat server. Then SAPI would no longer be a dependency. Either way, a hosted control is necessary to hold a TCP connection for this scenario; constant HTTP refreshes would just be plain annoying.

Ripped into the chat control code. Added some properties so that it could be tied to an external textBox once Speech .NET was dictation-friendly (if ever?). Added the SpeechLib reference and the speech recognition on control Load. Made the speech recognition event update the message text box and then send the message to the server. Also, made it recognize the 1st spoken command as your nickname, and then connect to the server, for the hands-free'ness (see the sketch below). Finally, changed the size and color scheme. Sorry Saurabh, but I would have to beat myself up if I ever used the color Olive in any one of my apps. Got that done, and then loaded the page to get an expected SecurityException.
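
The guts of that wiring look roughly like this (a sketch only; nickName, txtMessage, ConnectToServer, and SendMessage are stand-in names, not the actual members of the chat control):

//rough sketch of how the recognition event drives the chat control
//(nickName, txtMessage, ConnectToServer, SendMessage are made-up member names)
private string nickName = null;

public void RecognitionEvent(int i, object o, SpeechLib.SpeechRecognitionType srt, SpeechLib.ISpeechRecoResult isrr)
{
	string text = isrr.PhraseInfo.GetText(0, -1, true);
	if (nickName == null)
	{
		nickName = text;           //1st utterance becomes the nickname
		ConnectToServer(nickName); //then connect to the chat server
	}
	else
	{
		txtMessage.Text = text;    //show the recognized text
		SendMessage(text);         //and post it to the chat server
	}
}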

Security

This is how I got around the security exception, although there is a much better article by Chris Sells here: .NET Zero Deployment. I like all the stuff Chris does too. Even think Saurabh gives Chris some credit in one of his articles regarding the chat control. The 1st step gives full trust to trusted sites. The 2nd step adds the web control host page as a trusted site.
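
Going from memory, the command-line equivalent of that 1st step is something close to the following (double-check the exact syntax against the Chris Sells article or 'caspol -help' before running it):

caspol -machine -chggroup Trusted_Zone FullTrust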

Then opened up the page again ... no exception ... said my name and was logged into the chat server, and then was able to start chatting without typing a word! This is way too easy. NOTE this will only work if you have the .NET Framework installed on the client, a newer browser, and have the SAPI redistributable installed on your own computer. You might also need the Speech .NET add-in installed because some SALT is rendered to this page. Finally, speakers/headphones and a microphone would help

SAPI TTS

Went ahead and set it up to speak the messages that it received too. Had already written that code as well ... go figure:

//speak the incoming chat message (SVSFDefault = synchronous)
SpeechLib.SpeechVoiceSpeakFlags SpFlags = SpeechLib.SpeechVoiceSpeakFlags.SVSFDefault;
SpeechLib.SpVoice Voice = new SpeechLib.SpVoice();
Voice.Speak(msg, SpFlags);

Believe it or not, I started this at 2 pm today, and it's only 11:00 now. Loaded up the page, the live video feed started playing, and the chat control displayed to the right of it. Spoke my name and it connected to the Chat Server. Then began speaking free-form and the control recognized the text and sent the message to the chat server. As text messages were received back, it read those messages out loud. So I could chat without typing or having to read the chat board at all ... illiteracy at its finest ... not to mention the p0rn implications. Will deploy to my server tomorrow, to see what breaks (fingers crossed). NOTE when SAPI was active on that page, the Speech .NET control for choosing a media type did not work at all.

Wonder if this is the 1st voice-only web-based chat control ... seems like it could be, but I don't care to look; let me know if it isn't. From my /noHands Part 1 article release, got an email from a good friend about an article in Wired just the day before about some group using SAPI on their fat-client desktop DVD player to let people control virtual p0rn videos using their voice (you got to hand it to the p0rn coders). HAHA too easy ... might be cool when we get that Mira/FreeStyle/"whatever name MS marketing has changed it to by now" and then we get better home entertainment center/computer integration.

----- END MEGA HACK -----

Deployment (Day Two)

Have the ChatServer running on my server. It's an app, should be a winService, but I've got my own winForm app running right next to it listening for MSN messages ... so I shouldn't have said anything :) Listening on port 5151, same as the chat control, and have set up my router to do forwarding for that port to that server box. Have my chat control hosted on the same server, and set up that directory as an application (scripts only ... that's important, although it is the default). So the control gets deployed correctly without any exceptions, but when I try to connect with the control I get this:

Not sure why this is happening; it looks like the control is actually making the request, but my server is denying it. The weird thing is that if I run the ChatServer on my devbox, and then open the chat page from my server, the chat control will connect to the ChatServer on my devbox even though the control was loaded from my server!?!?! This is unexpected behavior. Not positive about .NET, but in the Java world I remember that applets could only talk back to the servers that served them. Double-checked my URLs and don't see it, but this definitely seems like a stupid mistake on my part ... but I have gone blind to it. On the off chance that you want to see if you get the same behavior, here is the ChatServer.exe for you to run on your own box, and then connect to my server's chat page ...

(hour later) Nevermind ... It was a build problem (suck). Had changed my output path for debug, but not for release ... it works as expected now! stupid stupid

Source

I really don't have any legit source to give away again. /noHands Part 1 has links to my grammars and client-side scripts, with the Part 2 updates. The minimal server code that I did write is represented in its entirety throughout the article for you to grab as needed.

Updates

8/10/2002 An MS guy handed this file over to me, to show that it is possible to do Dictation on a desktop browser web page: Desktop Dictation. Cool! To rewrite the chat control, I would add a textbox to the page that would be filled by SALT; when that was filled, javascript would set that text to the property of the control, and the control would then send that chat message to the server. Likewise, the control could send updates to the page to be read by SALT. At least, I think that will work?

10/13/2002 Found these links from an MS student at CMU (Micah) who gave a talk on VoiceXml and Speech .NET:
Powerpoint: http://radio.weblogs.com/0100168/ppt/SphinxTalk.ppt
HTML: http://radio.weblogs.com/0100168/ppt/SphinxTalk.htm

Future

So what is next? As far as Speech .NET, I am eagerly waiting to get my hands on the Pocket IE Speech Add-In at release. Also, want to hear more about the MS SALT Telephony gateway server that they are working on. Finally, I want to know how SAPI is going to be replaced with managed code for desktop and pocket devices.

Need to revisit /noMadMap. MS just released their own Compact .NET and MapPoint .NET integration: MapPoint .NET Sample Application for the .NET Compact Framework. It's VB.NET, so the losers that can't do language conversions from my C# app will like that. But that is not compelling; what is interesting is that it contains an update to the Web Service components ... which should fix the bugs that I found from my article.

The MapPoint .NET sample contains both a Visual Studio .NET project targeting the .NET Compact Framework, and an update to the Web Service components of the .NET Compact Framework.

Official announcement: I will no longer work on standard desktop ASP.NET web apps. Will only work on them to support Speech .NET, Mobile.NET, PocketPC or XML WebServices. My niche is now non-standard apps only ... have to see if I can pull this one off :) Contact me if you are doing something cool

Other than that, I have run out of beta software to play with, nor do I know of anything else on the horizon that sounds interesting. Guess I will just start reading ... so that I can go visit the cute Barnes and Noble girl more. Feeling belligerent, must go kill off the weak brain cells, later.