Speaker Verification Activity for Speech Server 2007

http://www.brains-N-brawn.com/speakerVerify 7/19/2006 casey chesnut



this is a real quick article about custom Speech Activities for Speech Server 2007. the end result will be a set of activities to perform very basic speaker verification. it will also develop some activities for performing dictation. a Speech Workflow using the speaker verification activities is pictured below.


Speech Activities

MS Speech Server 2007 has a new workflow development model. this workflow model is composed of Speech Activities. the core activities are Statement (for speech synthesis) and QuestionAnswer (for speech recognition). there are also higher-level activities which provide functionality such as support for Menus. to learn more about the new speech workflow dev model, see the /speechTextAdv article.

Speech Server 2007 also provides the ability to create custom speech activities. e.g. you could create a custom speech activity for collecting credit card info, and then use that same activity in many different applications. to create a custom speech activity, you first create a 'Workflow Activity Library' project. that project will then need to reference the Microsoft.SpeechServer public assembly. finally, your custom activity must inherit from SpeechSequenceActivity.

to be consistent with the standard speech activities, your custom activity should probably fire a TurnStarting event. it should also expose properties, each of which will need to be an InstanceDependencyProperty to properly store state.

Voice Biometric

biometrics are a way to add security to an application. for telephony applications, voice biometrics make a lot of sense. they are interesting in that they are influenced both physically and socially.

  1. physical - your voice is dependent upon the size/shape of your throat and mouth.
  2. behavior - speaking style is influenced by the region in which you live and social influences

voice biometrics are mostly tied to the device on which a person enrolled. this is because the audio can sound radically different between different phones. you might consider this hardware lock-in as additional security ('something you have'). also, your voice will change with age, so a person will have to re-enroll after some passage of time. finally, a voice biometric can be faked with recordings, so it's not a good idea to use it as your only security measure. it could easily be paired with entering a secret pin # using DTMF ('something you know').

Speaker Recognition

there are 2 types of Speaker Recognition (below). NOTE: do not confuse speaker recognition with speech recognition. speech recognition determines what was said, and speaker recognition determines who is speaking.

  1. speaker verification, in which a user identifies themselves, and the application verifies they are who they say they are.
  2. speaker identification, in which a user is identified from a group of enrolled users, without having to say who they are first.

this article will do speaker verification. 

Speaker Verification

there are 3 different types of Speaker Verification :

  1. text-dependent. the speaker is verified by comparing to previously spoken text.
  2. prompted. the speaker is verified by speaking text that the app prompts the user to speak. this makes it slightly harder to beat with a recording.
  3. text-independent. the speaker is verified by speaking freely.

this simple implementation is text-dependent. it is made up of 3 activities :

  1. VoicePrintExistenceCheck - checks to see if there is already a voice print saved for a user ID
  2. SpeakerVerifyTextDependRegister - if a user does not have a voice print, then they must enroll
  3. SpeakerVerifyTextDependVerify - checks a user's voice sample against their voice print to see if they match
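the way these 3 activities fit together can be sketched roughly as follows. this is only an illustration in Python, not the actual activity code: the file naming scheme (`<user ID>.voiceprint`), the `store_dir` parameter, and both function names are assumptions made up for the example.

```python
import os


def voice_print_path(store_dir, user_id):
    """Build the (hypothetical) file path holding the voice print for a user ID."""
    # Assumption: user IDs are alphanumeric, e.g. digits entered via DTMF.
    if not user_id.isalnum():
        raise ValueError("user ID must be alphanumeric")
    return os.path.join(store_dir, user_id + ".voiceprint")


def next_activity(store_dir, user_id):
    """Pick which activity should run next for this user ID."""
    # The existence check: a saved voice print means the user already enrolled.
    if os.path.isfile(voice_print_path(store_dir, user_id)):
        return "SpeakerVerifyTextDependVerify"
    return "SpeakerVerifyTextDependRegister"
```

a user with no saved voice print is routed to registration, and everybody else goes to verification.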

SpeakerVerifyTextDependRegister is pictured below.


it is composed of a RecordAudio activity and a Code activity. when it starts up, it fires a TurnStarting event which allows you to set a prompt and user ID. the prompt will ask the user to speak a pass phrase, which the RecordAudio activity will save to a wav file.


a 'pass phrase' might be speaking your name, telephone number, pin number, random phrase, etc... then, the code activity will do some basic audio processing to generate a voice print. first, it trims the silence.
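a minimal sketch of silence trimming, in Python. it assumes the audio is already a list of signed 16-bit sample values and that 'silence' simply means an amplitude below a fixed threshold. the function name and the threshold of 500 are made up for illustration, and the real activity may trim differently.

```python
def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples quieter than the threshold."""
    start = 0
    end = len(samples)
    # Skip quiet samples at the front.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Skip quiet samples at the back.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


samples = [0, 3, -2, 900, -1200, 40, 1500, 5, 0]
print(trim_silence(samples))  # [900, -1200, 40, 1500]
```

note that the quiet sample in the middle (40) is kept. only the leading and trailing silence is removed.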


this implementation parses the wav file and then performs a fourier transform to get frequency values.
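as a rough illustration of that step, here is a Python sketch using only the standard library. it assumes a 16-bit mono PCM wav file, and it uses a naive discrete fourier transform, which is far too slow for anything but short clips (a real implementation would use an FFT). the function names are hypothetical.

```python
import cmath
import struct
import wave


def read_mono_16bit(wav_file):
    """Return (sample_rate, samples) from a 16-bit mono PCM wav path or file object."""
    with wave.open(wav_file, "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        rate = w.getframerate()
        frames = w.readframes(w.getnframes())
    count = len(frames) // 2
    # wav PCM data is little-endian; 'h' is a signed 16-bit integer.
    return rate, list(struct.unpack("<%dh" % count, frames))


def dft_magnitudes(samples):
    """Naive O(n^2) DFT; returns the magnitude of bins 0 .. n/2 - 1."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        total = sum(
            samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(total))
    return magnitudes
```

bin k corresponds to a frequency of k * sample_rate / n Hz, so the list of magnitudes describes how much energy the recording has at each frequency.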


the frequency values are divided up into cells, averaged, and then saved as a series of numbers.

133 142 121 134 134 110 125 136 139 154 130 155 119 115 133 148 150 121 143 141 112 122 134 136 134 ...

this series of numbers becomes the voice print, and it is saved to a file associated with the user's ID.
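the cell averaging could look something like the Python sketch below. the number of cells (25), the rounding to whole numbers, and the space-separated text format are assumptions chosen to resemble the sample output above. they are not taken from the actual implementation.

```python
def make_voice_print(magnitudes, cells=25):
    """Split the frequency values into contiguous cells and average each cell."""
    n = len(magnitudes)
    if cells <= 0 or n < cells:
        raise ValueError("need at least one frequency value per cell")
    voice_print = []
    for i in range(cells):
        # Integer arithmetic keeps every cell non-empty and covers all values.
        chunk = magnitudes[i * n // cells:(i + 1) * n // cells]
        voice_print.append(round(sum(chunk) / len(chunk)))
    return voice_print


def format_voice_print(voice_print):
    """Render the voice print as a space-separated series of numbers."""
    return " ".join(str(value) for value in voice_print)


print(make_voice_print([10, 20, 30, 40, 50, 60], cells=3))  # [15, 35, 55]
```

averaging over cells throws away fine detail, which makes the print small and a bit more tolerant of the variation between two recordings of the same phrase.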

SpeakerVerifyTextDependVerify is very similar, but occurs when a user already has an existing voice print. this time it asks the user to repeat the pass phrase. it then does similar audio processing to generate a voice print for the current sample. finally, it compares the new voice print against the master voice print for the user. if they are similar, then it is considered a match and the speaker is verified.
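one simple way to decide whether two voice prints are 'similar' is the mean absolute difference between corresponding cells, compared against a tolerance. the Python sketch below does that. the metric, the function name, and the tolerance of 10 are assumptions for illustration only, since the article does not say which comparison is actually used.

```python
def prints_match(master, sample, tolerance=10.0):
    """Return True if the two voice prints differ by at most `tolerance` per cell on average."""
    # Prints of different lengths (or empty prints) cannot be compared.
    if not master or len(master) != len(sample):
        return False
    total = sum(abs(m - s) for m, s in zip(master, sample))
    return total / len(master) <= tolerance


master = [133, 142, 121, 134]
print(prints_match(master, [130, 145, 120, 139]))  # True  (mean difference 3.0)
print(prints_match(master, [90, 180, 160, 100]))   # False (mean difference 38.5)
```

the tolerance is the knob that trades false accepts against false rejects: looser lets more impostors in, tighter rejects more legitimate users.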

anyway, consider this a "poor man's" implementation. it is just a proof of concept. there are many things that could be done to make it more secure. maybe MS Research will cook up a proper implementation?


the following are audio recordings of the speaker verification activities in use :


just for kicks, i also cooked up some activities for performing dictation. they work by recording a user with a RecordAudio activity, and then using SAPI to perform dictation on the recorded file. i implemented it 2 different ways: using an RCW over COM and SAPI 5.1 (something like 5 years old), and also with System.Speech (.NET 3.0), which works with SAPI 5.1 on XP and SAPI 5.3 on Vista (currently in beta).

so i got this to work ... but it doesn't work very well. it's actually really bad with SAPI 5.1. i don't currently have Vista installed to try it with SAPI 5.3. speaker-independent dictation still needs some work.


it's actually very easy to create custom workflow activities, including custom speech activities for Speech Server 2007. i was initially disappointed that Speech Server 2007 did not implement a voice biometric out of the box, but i'm no longer concerned now that i see how easily a 3rd party could implement that functionality.




none planned


possibly more speech stuff. later