/noHands


Speech.NET = ASP.NET + SALT

http://www.brains-N-brawn.com/noHands 6/27/2002 casey chesnut

Part 1 - Initially planned on this being a single article, but it got to be so much info that I'm breaking it into two. Part 2 will cover the stuff that was glossed over in this article, the TODOs at the bottom, completing the content pages, etc...

Running the Live Demo

NOTE: you will need a microphone and the MS Internet Explorer Speech Add-In to interact with the demo using your voice. I also recommend turning down your speaker volume or wearing headphones to improve speech recognition. The Add-In can be found on the Speech .NET Beta CD that was mailed out

ReadMe.htm - In the future these will be named ListenToMe.htm and they will read themselves :) Thus ushering in a whole new era of illiteracy. "The computer can read it for me, so why should I have to learn how to read" ... the exact same logic that let my math skills slip once I had a calculator on my phone. Which brings up how my spelling has been getting worse, since what used to be 'pleasure reading' for me now involves listening to audio books while taking a walk or working out. So if you see some misspelled words, do not assume they are just typing mistakes; know that I can no longer spell - quoth kc# circa 2002

MicrosoftInternetExplorerSpeechAddInSetup.exe (65 megs!) Granted, it's probably full of debug code, but how are they going to shrink this down to Pocket PC size? Telephony ... who cares, it all happens on the server. Did find out that there is a version of SAPI that runs on WinCE for Autos, but there is no SAPI support for Pocket PCs; that does give hope that this is feasible, though. There are 3rd party apps that do speech recognition on PPC devices, but I've not had the chance to play with them ... yet

NOTE: it has JScript that should handle the case where you don't have the Speech Add-In installed, but this crashed IE on one of the computers I tested with. Granted, on the 2nd attempt it did not, and I was able to use the app normally with my mouse

Recorded Demos

If you don't want to install the Beta Speech Add-In to IE, you can just watch some recorded vids of me talking to it below:

1) demo.wmv  2) test.wmv  3) image.wmv

The videos look best with the player maximized, or full-screened. Make sure you have audio on to hear me giving commands. The recording of the video page does not show the video being played, although it does play. Same goes for the audio page: you don't hear the mp3s play, but they do in the real app ... just not picked up by the video screen grabber

NOTE the lack of a mouse cursor on the screen ... magic! The image pages actually work, the video just cheeses out on them; will have to try to record again. Tried recording these with Windows Media Encoder in Screen Capture mode, but out of the possible permutations of the 4 audio devices on my computer, I could not get it to record and still let the audio pass through to the Speech Add-In. So it basically recorded my spoken commands but nothing would happen. Ended up using a product called Camtasia that recorded to .avi, and then converted that to .wmv screen-capture format

NOTE performance is much better on my server. These were recorded on my dev box with a bunch of apps running, as well as being in debug mode ... sorry

Background

Somehow I have compiled a bunch of past work/play involving audio:

Was all jacked that the Speech .NET beta was coming out. There wasn't that much info about it beforehand, so I was hoping it would:

  1. Replace SAPI on fat clients with managed code (no support on desktop or compact devices)
  2. Replace VoiceXml on telephone devices (development support, but no telephony gateway support for testing)
  3. Allow for the new model of speech-enabling web pages (fully supported on IE, pocket IE soon)

Only got 1.5 out of 3. Speech .NET currently supports speech-enabled web pages (MultiModal) and pseudo-supports telephone (VoiceOnly) apps. Have not gotten a clear vision from MS of what they are going to do for speech on fat clients for the .NET Framework or the Compact Framework, which is the most compelling scenario to me. For fat client apps right now, you will have to wrap SAPI, which is trivial since you can gen RCWs easily; although this does not work for the Compact Framework. There is no way to get SAPI on the PPC and then call it with PInvoke. Also, for telephone apps there is not currently a telephony gateway which will allow you to test your app out over a phone, like you can currently do with VoiceXml through services like BeVocal and Voxeo. MS is developing their own server for this. You can create ASP.NET pages with no HTML and interact with them solely with your voice for immediate testing purposes and proofs of concept. Which brings us to the new model: MultiModal

NOTE: Speech .NET is also going to kick a55 on Tablet PCs, with their handwriting and voice recognition capabilities. My image is of them being used in business: dropping one on a table and having it auto-transcribe the meeting minutes. Yellow pads of paper are just not cool, Pocket PCs are too small, and notebooks get in the way of me checking out the breasts on the female(s) sitting across the boardroom table from me

MultiModal

It's called MultiModal because the concept is that you can interact with the web app using your voice or your current input device (mouse, stylus, ...). The voice interaction actually involves more than just voice too; it's called 'tap-and-talk', meaning you tap on the textBox or dropDownList that you want to populate, speak your answer, and then that textBox or dropDownList is filled in through speech recognition. For full-on desktop apps, I think this is useless. Give me a keyboard and a long form, and I'll have it filled out in seconds flat. Now give me a long form on my Pocket PC with a stylus, and I guarantee you the voice entry system would kick my butt (a friend keeps teasing me for pecking on the virtual keyboard, but my handwriting is unrecognizable). Furthermore, put me on a phone, and voice is the only option (although that is no longer a MultiModal app, but VoiceOnly instead). Definitely see how MultiModal interactions will be great on the PPC; the sad thing is that Pocket IE does not currently have the Speech Add-In which allows for voice interaction. This is supposed to be available at release. The MultiModal interaction has a couple different usage scenarios. One is that you tap and hold a button, speak your answer, and then release the button when you are done. This improves recognition because you are letting it know explicitly when you are done speaking, instead of it waiting for some timeout period. Especially useful in noisy environments. Another scenario is that you tap and release, start speaking, and then after some period of silence it recognizes that you are done speaking. There are some other scenarios you can get from the documentation

Idea

Ok, fat client desktop apps are out since it would just be wrapping SAPI. No clue how MS plans on doing voice-enabled Pocket PC embedded apps; they have been pretty quiet. Really hope that the development paradigm for fat client apps uses SALT or something similar. Telephony apps are out since there is no gateway for me to test with. Also, I had already done a couple VoiceXml apps, which is very similar. That leaves a MultiModal app as what I could develop. Meditated on this and what MultiModal is, and when it comes down to it ... MultiModal is the next generation of MS Agents, without the agent. Which is understandable for all the heat they took over Clippy. But it doesn't seem very glamorous, until you think of MultiModal on mobile devices ... which we cannot do until the Speech Add-In for Pocket IE is available at release. So that only leaves me writing a desktop MultiModal web app. More meditation at the gym, and I came up with 3 compelling ideas for voice in desktop apps:

Out of those 3, I opted to implement the last one (minus the pr0n). The demo is media from Dragonball Z, the best fighting cartoon ever (Trunks is my favorite)! The plan is that once you type in the domain name and get to the home page, you don't have to use the mouse or keyboard again, and can interact with the site solely through speech. Thus, a hybrid app: MultiModal using mouse or voice, with the capability to run as VoiceOnly

Beta

To get up to speed, had to see what you could do with this thing. The beta comes with tons of stuff. They have great tool integration with VS.NET for creating grammars and prompts and adding Speech controls to ASP.NET web pages; and the debugging tool is phenomenal. Also comes with tons of samples. There are 2 samples as text in the documentation, 1 VS.NET project with 10 small quickstart-style samples, and 1 VS.NET project that is a full-on application that will run in VoiceOnly or MultiModal mode; very well done ... did not think they had left much for me at 1st :(. Good documentation as well. Basically, I was overwhelmed, and even got lost in the jargon, although I'd picked up some of it from VoiceXml previously. Pushed through, even walking through the 4 detailed tutorials that were provided. Between the docs, samples, tutorials, and beta newsgroup help, I was able to reach critical knowledge mass in about a week and start developing on my own. A screenshot of the debugger follows. It shows the result of recognizing that I requested 'Page two hundred and one' of my app:

NOTE: the 1st VoiceOnly demo I tried would prompt me, and then it would pick up the output from my speakers as being my response, so it would get into a loop. Started developing with headphones on, or with my speakers off

NOTE: tried a couple different microphones. The initial recommendation was to get a USB mic, but after some frustrating speech recognition sessions (it kept thinking I said 'chat' instead of 'next'), I switched to an old-school analog one, and it works much better in some cases

Prelude

Before I could start adding the speech though ... had to write something to add the speech to:

Here is the JScript that gets rendered to the client before and after the SALT tags for a VoiceOnly app. NOTE this would look different for an app running in MultiModal mode

<object id="__sptags" CLASSID="clsid:DCF68E5B-84A1-4047-98A4-0A72276D19CC" VIEWASTEXT></object>
<?import namespace="salt" implementation="#__sptags" />
<script language="javascript" id="__RunSpeech" src="/aspnet_speech/v1.0.2826.0/client_script/RunSpeech.js"/>
<script language="javascript">

if (typeof(Page_SpeechScriptVer) == "undefined")
    alert("Unable to find script library '/aspnet_speech/v1.0.2826.0/client_script/RunSpeech.js'. Try placing this file manually, or reinstall Microsoft .NET Speech SDK.");
else if (Page_SpeechScriptVer != "1")
    alert("This page uses an incorrect version of /aspnet_speech/v1.0.2826.0/client_script/RunSpeech.js. The page expects version 1. The script library is " + Page_SpeechScriptVer + ".");

</script>

/// ... SALT goes here ... ///

<script language="javascript" id="Page_QAs_SpeechScript">
	Form1.onreset = SpeechCommon.AppendFunc( Form1.onreset, "RunSpeech.Reset();" );
	RunSpeech.form_submit = Form1.submit;	Form1.submit = new Function( "RunSpeech.EncodeState(); RunSpeech.form_submit();" );	SpeechCommon.Submit = function() { Form1.submit(); }
	var qaHome = new QA( "qaHome", "QaMenuActive", true, false, null, null, 0, 0, false, null, null, null );
	var Prompt1_obj = new Prompt( "Prompt1_obj", Prompt1, null, null, null );
	qaHome.prompt = Prompt1_obj;
	var _ctl0_obj = new Reco( "_ctl0_obj", _ctl0, null, null, null, null );
	qaHome.reco = _ctl0_obj;
	var AnswerMedia = new Answer( "AnswerMedia", "/SML/Media", null, "qaHome", null, 0, "MediaAnswer", false, "AnswerMedia_HiddenField", 0 );
	qaHome.answers["AnswerMedia"] = AnswerMedia;
	var AnswerStep = new Answer( "AnswerStep", "/SML/Step", null, "qaHome", null, 0, "StepAnswer", false, "AnswerStep_HiddenField", 0 );
	qaHome.answers["AnswerStep"] = AnswerStep;
	var AnswerItem = new Answer( "AnswerItem", "/SML/Item", null, "qaHome", null, 0, "ItemAnswer", false, "AnswerItem_HiddenField", 0 );
	qaHome.answers["AnswerItem"] = AnswerItem;
	var AnswerPage = new Answer( "AnswerPage", "/SML/Page", "PageAnswerNormalization", "qaHome", null, 0, "PageAnswer", false, "AnswerPage_HiddenField", 0 );
	qaHome.answers["AnswerPage"] = AnswerPage;

var Page_QAs=new QACollection(new Array(qaHome));
document.onreadystatechange=SpeechCommon.AppendFunc(document.onreadystatechange, 
" if(document.readyState==\"complete\" && typeof(Page_QAs)!=\"undefined\"){RunSpeech.StartUp();}" );
</script>

Took me a couple weeks to get this supporting stuff built and understood. That brings us to the end of my 2nd week with the Speech .NET beta (in my hands since 6/15/2002), so let the real fun begin

Speech Primer

In all applications, input and output are required. These are the scenarios for speech apps:

For input, the Speech .NET beta currently supports command-and-control scenarios only. Dictation is very hard to do well with a high level of accuracy. It typically requires hours of training for an app to learn your accent and the way you pronounce words. And over the phone, the quality of the audio is way too low for any reasonable accuracy. But in a command-and-control scenario, the app only has to listen for a limited number of phrases; so when the user speaks, it compares the utterance to the list and returns the most likely match. This provides much better accuracy than dictation

Output is quite fun. TTS can sound very HAL 9000'ish at times, and I love that. In general, although it sounds funny, synthesized speech is relatively easy to understand. Also, the text to be read can be put into a Speech XML format which allows for adding tags that emphasize certain words for a more human sound. On the other side, audio recordings can sound very Sport Talk Football'ish, like the early Nintendo systems. e.g.

'The'<pause>'Ball'<pause>'Is'<pause>'On'<pause>'The'<pause>'10'<pause>'Yard'<pause>'Line'<pause>'First'<pause>'Down' 

With every word spoken at a different volume level and emphasis. Probably will not be exploring this option since I think my voice sounds weak on answering machine recordings ... hence my abbreviated voicemail message. Also, you can read much faster than someone can speak, so it feels like internet lag when a prompt is being read

Speech Controls

For a web page to be speech enabled, it must render SALT along with the HTML output. SALT is what allows the speech recognition to happen on the browser client. To do this, the client must obviously support SALT ... IE does after the Speech Add-In has been installed. The SALT output from my page is below:

<salt:prompt id="Prompt1" onbookmark="Prompt1_obj.SysOnBookmark()" oncomplete="Prompt1_obj.SysOnComplete()" onerror="Prompt1_obj.SysOnError()" onbargein="Prompt1_obj.SysOnBargein()" style="display:none">
	<PROMPT_OUTPUT>
<DATABASE FNAME="noHands.prompts"/>

</PROMPT_OUTPUT>
</salt:prompt><salt:listen id="_ctl0" onreco="_ctl0_obj.SysOnReco()" onerror="_ctl0_obj.SysOnError()" onnoreco="_ctl0_obj.SysOnNoReco()" onsilence="_ctl0_obj.SysOnSilence()">
	<salt:grammar id="Grammar1" src="http://localhost/noHands/noHands.gram"/><salt:grammar id="Grammar2" src="http://localhost/noHands/mediaType.gram.xml"/>
</salt:listen>

NOTE: SALT is not an MS-only technology. It is backed by a whole group of organizations at saltforum.org. It should replace VoiceXml and address some of its limitations

Luckily, you don't have to learn another markup language. In the same model as ASP.NET server controls rendering HTML and JScript for uplevel browsers, and Mobile .NET controls rendering WML/HTML/CHTML for all sorts of devices ... all you have to do is drag and drop your speech control onto a page, set the appropriate properties, do some client-side script or server-side processing, and the appropriate SALT will be rendered automagically. The beta comes with 4 controls:

  1. QA - prompts the user, listens for a response according to a grammar and takes action based on that answer. This will be initiated through the user tapping on a control (MultiModal) or in a specified order (VoiceOnly)
  2. Command - certain commands that the user can say at any time, no prompt from the system. e.g. 'Main Menu' to start over
  3. CompareValidator - allows for validation in VoiceOnly scenarios, or when confidence in the recognition is low
  4. CustomValidator - more of the same

For this part, I will force a QA control into the user experience I'm after. NOTE: these controls behave differently based on whether they are in a VoiceOnly or MultiModal app. In VoiceOnly, the QA control prompts the user to speak some command (since the user has nothing to read on a phone). The user speaks, that answer is matched against a grammar, and if a match is found then an action is performed. In MultiModal, the user will not receive a prompt because he/she can just read what they are supposed to do on the screen. In VoiceOnly, the interaction is typically sequential, so that the user is directed from question to question, because it is easy for a user to get lost in the app without visual help. In MultiModal, the user can click on whichever control they want to fill in, in any order

Prompts

In a VoiceOnly app, the QA control will have an associated prompt. This asks the user a question, for which it will then listen for the answer. The easiest method is to just type an inline prompt as text. The IE Add-In will speak this as synthesized speech. Otherwise, a prompt can be tied to a recorded audio file to be played. More complex scenarios involve dynamically building the text to be synthesized. Even more complex involves dynamically piecing together voice recordings to be read as a complete sentence. For a simple welcome message, just add a QA control with a prompt and set it to be spoken one time only. There is a separate VS.NET project for creating a prompt database. It handles importing, editing, and recording audio files and spoken audio. NOTE: I recommend women with breathy voices for all your prompt recordings
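
The simplest dynamic case is just building the prompt text in client-side script. A minimal sketch (not from my app; the function name, parameters, and the way you would wire it to the QA's prompt are made up for illustration):

// hypothetical sketch: build the prompt text on the fly instead of hard-coding it
function BuildBrowsePrompt(mediaType, itemCount)
{
	return "You are browsing " + mediaType + ". There are " + itemCount +
		" items. Say next, previous, or an item number.";
}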

Grammars

Mentioned these already, but have glossed over them. Now that the user has been asked a question, the app has to know what to listen for. The grammar is a file that specifies the limited set of phrases a user can speak to control an application. It is composed of rules. A rule is likely to be tied to a specific control. e.g. If you have a drop-down list of states, then the rule for that control will have the name of every state. While if your page has a textbox for entering a zip code, it will be tied to a rule that only recognizes 5 digits spoken in succession; if the user says 'Alaska', no match will occur. For the QA control, you can specify an inline grammar or a file. An inline grammar is just an XML string that will be rendered to the client. The file can be in an XML format, or that XML file can be compiled into a binary format (.cfg) for smaller size and some proprietary hiding. VS.NET auto-adds the .gram extension to tie it to the grammar editor, but this is really just an XML file. I typically rename it to .xml to bring up the XML editor; when I want the grammar editor, I just right-click and do 'open with' - 'grammar editor'. When the rendered SALT specifies a grammar file, that file will be downloaded from the server for the client-side browser Add-In to recognize against. If that file changes on the server, then it will be refreshed by the client. This would be a dynamic grammar. It could be dynamic in a couple different ways: rules could be deactivated or activated, or the phrases that are valid could be dynamically created. Some of this can also happen on the client ... more on this later. An example of an XML grammar is below. It only contains one Rule, for selecting a media type out of a list:

<grammar>
  <rule name="Media" toplevel="ACTIVE">
    <l propname="Media" description="Media">
      <p valstr="audio">audio</p>
      <p valstr="chat">chat</p>
      <p valstr="image">image</p>
      <p valstr="photo">photo</p>
      <p valstr="story">story</p>
      <p valstr="video">video</p>
    </l>
  </rule>
</grammar>

Since the grammar above does not change much, I (re)generate it at application startup in the global.asax with the following code:

// runs in Application_Start in global.asax
// needs: using System.IO; using System.Xml; using System.Configuration;
string gramPath = Server.MapPath(".") + @"\mediaType.gram.xml";
XmlDocument xd = new XmlDocument();
xd.Load(gramPath);
string xPath = "//l";
XmlNode xn = xd.SelectSingleNode(xPath);
//clear out the old <p> phrase elements
//(removing inside a foreach over ChildNodes only removes the first one)
while(xn.HasChildNodes)
{
	xn.RemoveChild(xn.FirstChild);
}
//add a <p> phrase element for each media subdirectory
string mediaPath = ConfigurationSettings.AppSettings["mediaDir"];
DirectoryInfo mediaDir = new DirectoryInfo(mediaPath);
foreach(DirectoryInfo di in mediaDir.GetDirectories())
{
	XmlElement xe = xd.CreateElement("p");
	xe.InnerText = di.Name;
	XmlAttribute xa = xd.CreateAttribute("valstr");
	xa.Value = di.Name;
	xe.Attributes.Append(xa);
	xn.AppendChild(xe);
}
xd.Save(gramPath);
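
Note that this rewrites the same mediaType.gram.xml that the <salt:grammar/> element above points at, so the next time the client pulls that grammar down it will pick up any new media directories.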

Speech Markup Language

Once the user has spoken their answer and the client has recognized it, the client formats a response in an XML format called the Speech Markup Language (SML). This XML contains the phrase that the user spoke, the confidence of that recognition, as well as any value that was associated with that phrase in the grammar (e.g. February mapping to the int value 2). This XML fragment can then be processed by client-side JScript or posted back to the server. The SML for the grammar rule above follows

<SML text="image" confidence="0.8603525">
   <Media text="image" confidence="0.9969427">image</Media>
</SML>

It shows the text captured, as well as its value, which happens to be the same in this instance. It also shows the confidence level, so that if the confidence is low, you can take appropriate action, possibly by having the app ask "Did you say image?" and letting the user respond "Yes", because the difference between Yes/No is much easier to recognize. For more complex SML, XPath can be applied for very powerful results
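
As a sketch of acting on that confidence client-side (this is not in the demo app; the 0.5 threshold and the ConfirmMedia() helper are made up for illustration), a helper in the style of the normalization functions could inspect the SML before the value gets used:

// hypothetical sketch: check the confidence attribute from the SML before acting on the value
function CheckMediaConfidence(smlNode)
{
	var mediaNode = smlNode.selectSingleNode("//Media");
	var conf = parseFloat(mediaNode.getAttribute("confidence"));
	if (conf < 0.5)
	{
		//low confidence: fall back to a yes/no confirmation
		ConfirmMedia(mediaNode.text);
		return false;
	}
	return true;
}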

Binding

The value from that SML can then be associated back to the page. If the user said 'February', then client-side JScript could apply that value to the DropDownList, and that DropDownList would then be set ... without the user having to scroll down and select the value. In VS.NET this is made even simpler by the QA control's Answers collection. When the QA is bound to a grammar, you just have to right-click on the QA control to add an Answer. You then set the properties on the Answer so that it listens for the Months rule, and then binds the value from the SML to the MonthsDropDownList selectedItem attribute. The appropriate JScript to perform this action on the client is already provided. If you want some other action to occur, you can specify your own JavaScript function and do whatever you want ... or post back to the server to do more heavy lifting
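
If you do take over the binding yourself, the client-side function ends up being tiny. A sketch of the hand-rolled equivalent (the MonthsDropDownList id and MonthAnswer name are hypothetical; the built-in Answer binding does this for you when you set the control and attribute properties):

// hypothetical sketch of a hand-rolled bind to a rendered <select> element
function MonthAnswer(value)
{
	var list = document.all["MonthsDropDownList"];
	if (list != null)
	{
		list.value = value;	//e.g. 'February' comes back as 2 and selects that option
	}
}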

That should be enough background to start actually doing some work

Speech-Enabling Main Page

My main page has 3 controls on it.

Remember, the goal is for the user to get to this page and then be able to execute each of those controls with only their voice. So I dropped one Speech QA control onto the form, set its SpeechIndex to 1, and gave it an inline prompt of my de facto 'hello world' text of 'casey kicks ass'. Then ran that page and nothing happened! Reason being the control was not active. There are a couple ways to set it active. The expected way is that the QA control has a ClientActivationFunction which you can tie to a JavaScript function. If it returns true, then the control is active; and if none is specified it is true by default. So wrote one to return true, bound it to the control, and still nothing. Hell! Ok, read some more and see that forcing the app to VoiceOnly mode might do it, and that MultiModal requires some intervention by the user to get going ... although that could be done with JScript. Go into the web.config and set the defaultSpeechMode <appSettings/> key to VoiceOnly (as opposed to MultiModal). Load the page again, and my computer says 'casey kicks ass'! Thank you computer ... I couldn't do it without you :)

<appSettings>
   <add key="defaultSpeechMode" value="VoiceOnly" />
</appSettings>
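
For reference, the ClientActivationFunction I had wired up earlier was nothing fancier than the following (a minimal sketch; I'm glossing over whatever parameters the runtime actually passes in, since all that matters here is the return value):

// always keep the QA active; true is also the default when no ClientActivationFunction is specified
function AlwaysActive()
{
	return true;
}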

NOTE: the speech-specific <appSettings/> keys should get moved to their own configSection within web.config

Rules

The prompt got annoying real quick, so I ended up setting it to an empty string, so it did not have anything to say and just immediately started listening for what I would say. Then had to create a grammar to tie it to. For file maintenance, decided to put all of the static Rules into 1 grammar file, although you can add more than 1 grammar to a QA control. Did this because I had a bunch of static grammar Rules that would not change, so lumped all those together. Broke the dynamic grammars out into their own XML files so that I could manipulate them easily, and when they changed only the small XML file would have to be downloaded, and not the larger grammar, which would only be a 1-time hit. The grammar has multiple rules, outlined below.

The client side JScript for normalizing the SML returned follows:

function PageAnswerNormalization(smlNode) 
{ 
   var hundred = parseInt(smlNode.selectSingleNode("//HUNDREDS").text); 
   var one = parseInt(smlNode.selectSingleNode("//ONES").text); 
   var total = (hundred * 100) + one;
   return total;
}
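
For context, the SML that function parses looks roughly like the following. This is a hand-written approximation (the node names come from the Page rule of my grammar, and the confidence values are made up), but it shows where the //HUNDREDS and //ONES nodes come from when I say 'Page two hundred and one':

<SML text="page two hundred and one" confidence="0.85">
   <Page text="two hundred and one" confidence="0.90">
      <HUNDREDS>2</HUNDREDS>
      <ONES>1</ONES>
   </Page>
</SML>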

Here is a pic of my main grammar in the VS.NET grammar editor. It is displaying the rule for selecting an Item, in which the user can say 'Item (0 1 2 3 4 5 6 7 8)'. To the right, it shows the many Rules defined in the single grammar file. The lowercase ones are the MS ones I copied that are not topLevel, meaning there is another rule that references them; that parent rule is active for recognition, but the child is not

Answers

After these rules were defined, I added multiple Answers to the QA control. With multiple answers, the QA will start up, start listening, and can field a response to kick off each page action that can be performed. Tied each answer to the QA control, but not to any attributes. Each answer links specifically to one of the rules above, and each answer has its own client-side JScript function to kick off when it finds a match. So when a user is on the main page, he can speak any of the commands above. The control will recognize one of those answers and then kick off the appropriate client-side action. I should really enable and disable those rules as appropriate: e.g. if you are at the home page, you can only select a media type, so the other rules should be disabled, and then enabled once a media type is selected

A pic of my Answers in VS.NET follows. To define the client-side script that is kicked off upon recognition, you have to use the properties of the QA object:

NOTE: would love to see a large-scale voice application development program. It would be easy to tell who stuff was working for, and who was having problems. From my debugging sessions, you could tell from my voice that I was all excited when stuff started working ... but then when I had the JScript errors and had to keep saying the same stuff over and over, my voice got angrier and angrier. It really pissed me off too when it would periodically recognize one of my cuss words and jump to another page! Should make a grammar of the common cuss words, and then have it respond appropriately with likewise crude alerts (friend's idea)

OnClientAnswer

The OnClientAnswer method gets called when a grammar rule is recognized, to kick off client-side actions. NOTE: almost all of my code for dealing with speech is client-side, and all my server-side code is for handling the application logic. This makes sense since the recognition happens on the client, and you might not need to post back to the server until multiple client-side speech interactions have occurred, avoiding unnecessary round trips. The signature for OnClientAnswer looks like this:

function OnClientAnswer(value) { /* client side script */ }

The 'value' input parameter is the value that was specified in the Rule (e.g. February = 2). Started out by making this function do exactly the same thing my app did. e.g. Clicking on the 'next' page button posts back with a query string of ?step=next. So made my client-side answer logic like this:

function StepAnswer(value) { url = "home.aspx?step=" + value; navigate(url); }

value was going to be a string of 'next' or 'prev' as defined in the StepRule of the grammar. The problem here was that I was re-creating application logic: when my main page 1st loaded and the next or prev paging links were not even visible, you could say next or previous and it would post back. Had already written logic to avoid this on the server side, so that the links would not be rendered ... thus not clickable; but in this instance they could still be 'voice-clicked' (not sure if I made that up, or read it somewhere?). To handle this, made up a best practice: instead of recreating page logic with client-side JavaScript, just have the voice command execute the exact same button click, etc... that the user would have executed with the mouse. That way you don't have to recreate the same logic over and over again. Make sure to check that the page element is not null. So my rewritten function looked like this:

function StepAnswer(value)
{
	switch(value)
	{
		case 'next':
			if(document.all["nextLink"] != null)
			{
				//execute the same postback the mouse click would have
				nextLink.click();
				return;
			}
			break;
		case 'prev':
			if(document.all["prevLink"] != null)
			{
				prevLink.click();
				return;
			}
			break;
		default:
			break;
	}
	//if it falls through then make it post back nothing to reset
	url = "home.aspx?step=";
	navigate(url);
}

err, umm ... but as soon as I did that, my app broke because my QA control was not remaining active afterwards

Control Activation

A speech control must be active to prompt the user and receive voice input. This differs based on whether the app is running MultiModal or VoiceOnly. For VoiceOnly it relies on 3 things: SpeechIndex, ClientActivationFunction, and the Answers collection. Tried mucking with all of these and using JScript to reset values, but could not get the QA to remain active after the postback ... the postback, wait a sec ... why is it posting back? Did not have autoPostBack set anywhere and only had client-side functions set, but my page was refreshing. After a lot of playing around, it looks like VoiceOnly mode always posts back to the server, while this does not happen in MultiModal mode. Checking out the JScript that surrounds the SALT above for VoiceOnly apps, it has Form1.submit() all over the place. So maybe VoiceOnly mode was the wrong choice for my app, so I made a test page to see if I could get MultiModal to work the way I want. Ends up I can get it to start automatically by giving the <Body/> tag an id of Body1 and then setting qaHome.Reco.StartEvent = Body1.onload so it will automatically start listening. But I get another problem, in that if you don't speak immediately it recognizes Silence as an answer; and although it does not post back, it will not let me speak my answer afterwards. Further, I have not touched anything regarding Silence recognition, so I might be able to turn that off or at least extend it to infinity and beyond (Buzz quote). Figure you could write some tricked-up logic to get around this, but chose not to, and would rather deal with postbacks; so set it back to VoiceOnly

Where was I? Oh yeah, was finding a hack to work around my newly declared best practice, which does not work for my situation because I'm not using the controls as they were intended :) So my hack is to leave the null value checks in, and if it falls through, then I set an empty query string ?skip= and navigate to the page, and my QA is active again, listening for the next command.

(De)Activating Grammars

We are well versed in locking down user mouse interactions with a web app, but have just seen how the voice interface also has to be handled. The 3 techniques I have discovered so far are:

  1. Write more client- or server-side logic to handle a user executing a command through voice, even though that same action might not be possible with the mouse
  2. Make the voice commands execute the exact same action that a mouse command would, taking into account that the mouse command might not be available, in which case the voice command would also not execute
  3. Don't let the voice command even be recognized, by activating or deactivating the grammar rule client-side or server-side

Since I have already discussed the 1st two techniques, we will skip to the 3rd. Each rule of a grammar has an attribute that says whether it is active or not. An example of a rule not being active is a rule specifying digits (0 1 2 3 4 5 6 7 8 9). It is a rule by itself, but might not make sense in an app by itself. But it could be referenced by another rule, such as one for adding simple numbers. And that rule might be something to the effect of: "Add <digitRule/> plus <digitRule/>." For a calculator application, the Addition rule would be active and would reference digits, but digits would not be active; so if a user just said "nine", nothing would happen. On the server, if the rules don't change all that frequently, then they might be broken out to a separate XML file, and XPath could be used to select the active attribute and toggle it. If the grammar is on a per-user basis, then it is likely to be an inline grammar rendered directly with the page, so XPath could be used as well. For small-bandwidth devices and larger grammar files it is not a good idea to keep resending all that data just to flip a single attribute. Luckily there is a DOM model to flip the switch on the client side. The methods look like the following:

Object.Activate(grammarID, [ruleName]); 
Object.Deactivate(grammarID, [ruleName]);

In my example, when the page 1st loads, all Rules would be deactivated except for the Rule for selecting a media type. Once an answer is found for a media type, the other Rules would become active for paging and for choosing an item of that media type
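
A sketch of how that client-side flip might look for this app (the rule names and the object the methods hang off of are assumptions for illustration; Grammar1 is the <salt:grammar/> id rendered earlier):

// hypothetical sketch: once a media type has been recognized, activate the other rules;
// whether these get called on the rendered <salt:listen> element or on the framework's
// Reco wrapper is glossed over here
function ActivateMediaRules()
{
	_ctl0.Activate("Grammar1", "Step");
	_ctl0.Activate("Grammar1", "Item");
	_ctl0.Activate("Grammar1", "Page");
}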

Content Pages

Had already collected some images, and thought this would be a rather simple page, so decided to do the image content page next. When selecting a file, it posts back to the main page. Upon seeing from the query string that it is a media file, it redirects to that file's content display page (e.g. an image will redirect to image.aspx). All of the media pages use the same basic layout and codebehind for the menu and paging. They only differ in how they display the media. For part 2, I will make these much more distinct

Also, threw together some simple audio and video pages just by embedding the Windows Media Player IE control

Source

Not releasing the code for my server-side app or the thumbnail/video preprocessing apps. All of the speech stuff happens client-side, so you do not need any of it. Here are my speech-related files:

noHands.js - client side script

noHands.gram - static grammar file, it is really an xml file

mediaType.gram.xml - dynamic grammar file created at app startup

Also recommend checking out the Library.xml and Library_Extended.xml grammar files that come with the beta, just do a file search

TODO ... Part 2

Took me a while to put this much together, based on the learning curve and such, but dev should go much quicker now. Expect to have part 2 out just after the 4th of July holiday. Later