Telephony Voice Recognition Biometric and VoiceXml Record

http://www.brains-N-brawn.com/voiceBio 2/13/2003 casey chesnut

just playing with stuff i had not played with yet

Voice Recognition Biometric

Slight mod and combination of stuff from BeVocal's Verification Sample (v1.0) and the book VoiceXml : 10 Projects (v2.0). Demos some of the security possibilities for an IVR system. First, it checks the caller's phone number to make sure that phone can be used to access the system. Next, it asks the user to enter their pin # (using the touch pad is recommended, as opposed to speaking the pin out loud), which must match the pin # in the system for that phone #. Then, if the user has already registered their voice with the system, it asks them to speak their phone # and compares their voice against the stored voice samples. If there are no voice samples stored for that phone #, the app walks the user through collecting those voice samples to be used later. Finally, the user is granted access to the secure area. In the end, the caller must physically possess the correct phone, know a pin, and their voice must match ... seems pretty secure to me
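The order of those checks can be sketched in a few lines of C#. This is only a rough illustration of the decision logic, not the code from the demo: Account, NextStep, CallerGate and the in-memory dictionary are all made-up names, and the actual voice comparison (done by the gateway's extensions) is left out.

```csharp
using System.Collections.Generic;

// Hypothetical outcomes of the phone # and pin # checks.
public enum NextStep { Reject, VerifyVoice, EnrollVoice }

// Hypothetical account record, keyed by phone # in the dictionary below.
public class Account
{
    public string Pin;
    public bool HasVoiceSamples;

    public Account(string pin, bool hasVoiceSamples)
    {
        Pin = pin;
        HasVoiceSamples = hasVoiceSamples;
    }
}

public static class CallerGate
{
    // Decides what the call flow should do after the caller has entered a pin.
    public static NextStep Decide(IDictionary<string, Account> accounts, string callerPhone, string enteredPin)
    {
        Account account;

        // 1. the phone # must be one that is allowed to access the system
        if (callerPhone == null || !accounts.TryGetValue(callerPhone, out account))
            return NextStep.Reject;

        // 2. the pin # must match the one stored for that phone #
        if (account.Pin != enteredPin)
            return NextStep.Reject;

        // 3. compare against stored voice samples, or collect them first
        return account.HasVoiceSamples ? NextStep.VerifyVoice : NextStep.EnrollVoice;
    }
}
```

A caller only reaches the secure area after VerifyVoice succeeds, or after EnrollVoice has collected the samples.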

As far as biometrics go, voice recognition is used for verification and not identification. Unless the user base is small, the system won't be able to hear your voice (anonymously) and then tell you who you are by your voice alone (e.g. HAL 9000 style). It is better suited to the scenario where you say who you are, and then it matches that spoken voice against the voice print of who you said you were, to verify. Although it would be cool if I could say 'i am mojo jojo', and then the system would say 'shut up casey'

There are disadvantages to this approach. First, voice recognition is not part of the VoiceXml standard (currently 2.0); these are BeVocal enhancements being used. Other telephony gateways provide similar extensions. Second, the user is tied to a specific phone (not just because of the phone #). Even if they have a SIM card and switch the same number over to a new phone, the way their voice sounds through the new hardware is likely to be different enough that the system would not recognize their voice, and they would need to re-register voice samples with the new phone

demo recording: voiceBio.mp3 (1 meg) or voiceBio.wma (150K)

SALT is a competitor spec to VoiceXml (Speech .NET in the MS world). I do not believe it has a standard for this sort of functionality either

VoiceXml Record

NOTE Speech recognition and Voice recognition are different. Speech recognition (also called Speech-To-Text or STT) is the ability to hear what somebody said and then return that as text. Voice recognition (also called Speaker verification) is determining who said those words, i.e. it doesn't matter what words the user says; what matters is the voice used to say them. This usually returns a true or false for whether the voice matches

This is actually something I did about a year ago ... but I got stuck. Wrote it because I had tied a chat bot to MSN Messenger. This would let people chat with my Messenger Avatar named 'Evil Messenger'. Bored high school and college kids loved it, so I was going to port it to the phone. Meaning you would call the #, Evil Messenger would say Hi, and then you would speak. The problem is VoiceXml does not do dictation ... but it does offer the Record tag. With the Record tag you can record what the user says to a wav file (on the gateway), and then post that to a server. So I was going to record what they said, send it to my server to do speech recognition on the wav file, hand that text off to the chat bot to get its response, send that response to the telephony gateway, and then it would finally speak the chat bot's response using text-to-speech over the phone. Repeat. This would let you call up the chat bot and have a conversation with it. The part I got stuck on was doing dictation in a web app (easy in a win app using SAPI), but I just figured that out in my last article /freeSpeech. I could now tie the above scenario together, but I am not going to; instead I will just show how to use the Record tag in .NET

The VXML page for recording the user input looks like the following. It records what the user says, and then posts that wav file to the url specified

	<record name="requestWav" beep="true" maxtime="10s" finalsilence="2000ms" dtmfterm="true" type="audio/vnd.wave;codec=1">
		At the tone, speak to Evil Messenger.
		<noinput>I did not hear anything, please try again.</noinput>
		<filled>
			<submit method="post" namelist="requestWav" next="http://server/getResponse.aspx" />
		</filled>
	</record>

getResponse.aspx would then save off that wav file, do STT, get the chat bot's response, and then return a VXML page with that response for TTS over the phone. The code for saving off the wav file looks like the following (on Page_Load)

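if(Request["requestWav"] != null)
{
	// read the raw posted body: requestWav=<url encoded wav bytes>
	StreamReader sr = new StreamReader(Request.InputStream);
	string wavFile = sr.ReadToEnd();
	// strip the "requestWav=" prefix
	int nDex = wavFile.IndexOf('=');
	if(nDex != -1)
		wavFile = wavFile.Remove(0, nDex+1);
	// turn the url encoded text back into the wav bytes
	byte [] ba = HttpUtility.UrlDecodeToBytes(wavFile);
	// name the file after the current tick count
	string path = DateTime.Now.Ticks.ToString() + ".wav";
	string filePath = @"C:\Inetpub\wwwroot\voiceBio\wav\" + path;
	FileStream fs = new FileStream(filePath, FileMode.CreateNew);
	BinaryWriter bw = new BinaryWriter(fs);
	// write the decoded bytes out; Close also closes the FileStream
	bw.Write(ba);
	bw.Close();
}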
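
For the rest of the round trip (STT, chat bot, then a VXML page back to the gateway), here is a rough sketch of what getResponse.aspx could hand back. It is not the code from the app: SpeechToText, ChatBot, BuildReplyPage and the record page url are all made-up names for illustration, and the STT and chat bot pieces are just delegates you would plug your own code into.

```csharp
using System.Security;
using System.Text;

public static class ChatTurn
{
    // Hypothetical hooks: the STT engine and the chat bot are supplied by the caller.
    public delegate string SpeechToText(byte[] wavBytes);
    public delegate string ChatBot(string userText);

    // One turn of the conversation: wav bytes in, VXML page out.
    // The page speaks the chat bot's reply, then jumps back to the
    // recording page (recordPageUrl is an assumed url) for the next turn.
    public static string BuildReplyPage(byte[] wavBytes, SpeechToText stt, ChatBot bot, string recordPageUrl)
    {
        string heard = stt(wavBytes);   // speech recognition on the wav
        string reply = bot(heard);      // chat bot's response to that text

        StringBuilder sb = new StringBuilder();
        sb.Append("<?xml version=\"1.0\"?>\n");
        sb.Append("<vxml version=\"2.0\" xmlns=\"http://www.w3.org/2001/vxml\">\n");
        sb.Append("  <form>\n");
        sb.Append("    <block>\n");
        // escape the reply so characters like < and & cannot break the page
        sb.Append("      <prompt>" + SecurityElement.Escape(reply) + "</prompt>\n");
        sb.Append("      <goto next=\"" + SecurityElement.Escape(recordPageUrl) + "\" />\n");
        sb.Append("    </block>\n");
        sb.Append("  </form>\n");
        sb.Append("</vxml>\n");
        return sb.ToString();
    }
}
```

In Page_Load the returned string would be written to the response, with the bytes decoded above passed in as wavBytes.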


That was the only stuff left in VXML that I wanted to play with, so I will probably do all SALT stuff from here on out. Next article will probably be game related stuff for the Pocket PC, maybe with some voice stuff worked in. Also have a decent size Tablet PC article in mind