System.Speech code samples
.NET 3.0 (stupid versioning) and Windows Vista are now RTM'd. this is great news for speech developers. first, Vista brings developers SAPI 5.3 for adding speech synthesis and speech recognition to our applications. it also has integrated Speech UI capabilities to control your computer without using a mouse or keyboard. that one-two punch is paired with .NET 3.0, whose core pillars are WPF, WCF, W(W)F, and InfoCard. when thinking about WPF, most people think of graphically rich applications with animated spinning cubes playing videos on each face. but presentation is more than just what your eyes can see, it's also what you hear. just like user input is not just mouse/keyboard/stylus. our own voice, which we primarily use to communicate with other humans, is entirely underutilized when it comes to communicating with our computers. .NET 3.0 helps to rectify this by providing the System.Speech namespace under its umbrella. the System.Speech namespace sits on top of SAPI 5.3 on Vista (SAPI 5.1 on XP) to provide a managed wrapper for adding speech reco and synthesis to our .NET applications.
at this time, the documentation for System.Speech is a bit lacking, and the Windows SDK only provides minimal speech samples. instead of waiting for dev support to improve, i'm going to use this article as an excuse to explore the System.Speech APIs.
the Vista OS has speech recognition baked into the shell. this allows many applications to be controlled by voice, even if the application was never written to support speech control. it supports commands, dictation, key presses, punctuation marks, controls, windows, and mouse clicking. the user can start out by running the 'speech tutorial' to get used to the verbal commands Vista is listening for, as well as some of the supported dictation commands. one of the key take-aways is that a user can speak almost any text they see on the screen. e.g. if a button in a desktop application or a hyperlink in a webpage displays the word 'hello', then the user can just say 'hello' to press that button or click that link. if you have graphical buttons, then the user can say 'show numbers' and the OS will render numbers over the buttons, and the user can speak a number to click the corresponding button. failing that, the user can say 'mouse grid' to position the mouse anywhere on the screen and perform a mouse click at that location.
the pic above is the 'microphone bar'. it provides visual cues to see what mode speech recognition is in, whether there is any microphone input, what command was spoken, as well as what commands you could have spoken. right-clicking on the microphone bar brings up a context menu of settings that you can change to customize the experience. 3rd party Vista applications will benefit from the microphone bar by not having to provide their own speech-related settings dialogs and audio level meter.
but the main benefit for developers targeting Vista is they pretty much just have to build their app, and it will be user controllable in Vista. this includes all variations of apps: Win32, WinForms, WPF, XBAP, HTML, etc... the following pics show 'hello world' applications in WinForms, WPF EXE, WPF XBAP, and ASP.NET. when each app had focus, the Speech UI was able to recognize when i said the word 'hello' to execute the button click in each environment. there was no extra code on my part ... it was handled automagically.
some embedded controls in web pages work too! a couple of embedded ActiveX media players were controllable, as well as some embedded WinForms controls. of course, you first had to click on the controls in IE to activate them (security and all). NOTE i did not try Java applets. also, it'll be interesting to see if embedded WPF/e will be controllable, if that ever comes out. TODO need to see about hooking Direct3D apps by implementing accessibility hooks.
SAPI (Speech API) 5.3 is the underlying speech synthesis and speech reco engine used by Vista. it's Vista only, and is an upgrade from SAPI 5.1 on XP. NOTE SAPI 5.2 is a different engine, used by Speech Server 2004, that is tuned for voice-only telephony applications. SAPI 5.3 is COM based, so .NET developers can create an RCW around it and use it directly from their .NET apps, ignoring System.Speech altogether. e.g. if you have a WinForms app that doesn't need a .NET 3.0 dependency, then you can just wrap SAPI 5.3 directly. this is the same technique that has been used for calling SAPI 5.1 from .NET applications on XP. in VS.NET, just add a COM reference to the following library.
and the code for simple speech synthesis will be in the SpeechLib namespace :
SpVoice sv = new SpVoice();
sv.Speak("kc rocks", SpeechVoiceSpeakFlags.SVSFlagsAsync);
another big win for developers is that SAPI 5.3 has engines to support about 8 different languages out of the box! including english, chinese, spanish, french, japanese, and german ... with additional languages planned for future release.
System.Speech is a managed interface that sits on top of SAPI. for .NET 3.0 apps running on Vista, it will use SAPI 5.3; and for apps on XP, it will use SAPI 5.1. so it abstracts which underlying version of SAPI is being used, as well as provides a cleaner managed interface than generating an RCW over the Speech Object Library. just add a reference to the .NET assembly System.Speech.
and in the System.Speech.Synthesis namespace (same result as the SAPI code fragment above) :
SpeechSynthesizer synth = new SpeechSynthesizer();
synth.SpeakAsync("kc rocks");
but working with COM libraries from .NET is generally dead simple. so if you want to wrap the Speech Object Library directly, instead of using System.Speech ... then go right ahead. the COM wrapper does expose some functionality that System.Speech does not. e.g. the SAPI methods can be used to manipulate which profile is used during recognition (ISpRecognizer.SetRecoProfile), while this is not currently exposed through System.Speech. that said, most .NET 3.0 apps will probably get along just fine with System.Speech.
one negative against System.Speech is that it does not run in the internet security sandbox. i believe this was turned off to keep internet-deployed apps from eavesdropping on users' conversations. i'm all for security, but i'd like to see this changed in the future, especially since the next version of Speech Server is removing the ability to support multimodal web apps. at a minimum, speech synthesis should be allowed to work in partial trust WPF XBAP applications. for speech recognition, there could be some lockdowns to only allow command-and-control grammars, or to only allow the app to use the shared recognizer.
for speech synthesis, we get the System.Speech.Synthesis namespace. the main class is the SpeechSynthesizer, which runs in-process. you can set its output to be the default audio device, or a wav stream/file. then you simply call SpeakAsync() with the text to be spoken. for customization, you can specify the voice to be used. right now for Vista, we only get one voice: 'Microsoft Anna'. i've seen some other demos with a voice called 'Microsoft Lili', which i believe spoke Chinese. what was really interesting about that voice is it could also speak English, which made it sound like a native Chinese speaker speaking English ... very cool. supposedly you can get other voices by installing the MUI packs on Vista ... but i have yet to track any of these down to try it out. XP should have some other voices to mess around with, like 'Microsoft Mary' and 'Microsoft Sam'. for the synthesizer, you can also customize its volume and rate of speaking. for 'pitch', you can change a prompt's Emphasis, or do this through SSML using the <prosody/> tag.
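to make those knobs concrete, here's a quick sketch (the wav path is just a placeholder, and it assumes the stock 'Microsoft Anna' voice is installed) :

SpeechSynthesizer synth = new SpeechSynthesizer();

// pick a voice by name ('Microsoft Anna' is the stock Vista voice)
synth.SelectVoice("Microsoft Anna");

// volume is 0-100, rate runs from -10 (slowest) to 10 (fastest)
synth.Volume = 80;
synth.Rate = -2;

// render to a wav file instead of the speakers
synth.SetOutputToWaveFile(@"C:\temp\kc.wav"); // placeholder path
synth.Speak("kc rocks");

// switch back to the default audio device (this also closes the wav file)
synth.SetOutputToDefaultAudioDevice();
synth.SpeakAsync("kc rocks");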
instead of just passing in a string of text to be spoken, you can also use SSML (Speech Synthesis Markup Language) or the PromptBuilder class. SSML is a W3C standard for specifying how speech should be synthesized. so it has XML tags for emphasizing words, which voice to use, pauses in speech, how words should be pronounced, etc... this same functionality is exposed by the PromptBuilder object model. you can even use it to play .wav files in a prompt. what's great is that a PromptBuilder can be serialized to SSML, and SSML can be consumed by the PromptBuilder class.
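a short PromptBuilder sketch showing emphasis, a pause, an embedded wav (placeholder path), and the round-trip to SSML :

PromptBuilder pb = new PromptBuilder();
pb.AppendText("this word is");
pb.AppendText("important", PromptEmphasis.Strong);
pb.AppendBreak(TimeSpan.FromMilliseconds(500));
pb.AppendAudio(@"C:\temp\chime.wav"); // placeholder wav file
pb.AppendText("all done");

// a PromptBuilder serializes to SSML ...
string ssml = pb.ToXml();
Console.WriteLine(ssml);

// ... and a SpeechSynthesizer speaks the prompt directly
SpeechSynthesizer synth = new SpeechSynthesizer();
synth.Speak(pb);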
so the System.Speech.Synthesis namespace is really easy to work with. about my only complaint is that we need more voices out of the box. i also wish that the synthesis engine would let us know which phonemes it is going to use to speak a word. instead, i've had to hack around this limitation, detailed in the 'Phonemes' section below.
speech reco gives us the System.Speech.Recognition and System.Speech.Recognition.SrgsGrammar namespaces. the main classes are SpeechRecognizer and SpeechRecognitionEngine. the difference is that SpeechRecognizer is a shared speech recognition engine that can interact with the microphone bar, while SpeechRecognitionEngine is a recognizer that runs in-process in your application. most apps should share resources with Vista and use the SpeechRecognizer, while apps that are processing wav files or running full-screen might use the SpeechRecognitionEngine. another difference: the SpeechRecognizer always gets input from your default audio input device (probably a microphone), while the SpeechRecognitionEngine can get input from a mic or a wav file/stream.
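here's a sketch of both recognizers side by side (untested scaffolding ... the wav path is a placeholder, and the anonymous delegates are just the simplest way to see results) :

// shared recognizer : plugs into the Vista microphone bar,
// and always listens to the default audio input device
SpeechRecognizer shared = new SpeechRecognizer();
shared.LoadGrammar(new DictationGrammar());
shared.SpeechRecognized += delegate(object sender, SpeechRecognizedEventArgs e)
{
    Console.WriteLine("shared heard: " + e.Result.Text);
};

// in-proc engine : your app owns it, and it can take a wav file as input
SpeechRecognitionEngine inproc = new SpeechRecognitionEngine();
inproc.LoadGrammar(new DictationGrammar());
inproc.SetInputToWaveFile(@"C:\temp\input.wav"); // placeholder path
inproc.SpeechRecognized += delegate(object sender, SpeechRecognizedEventArgs e)
{
    Console.WriteLine("in-proc heard: " + e.Result.Text);
};
inproc.RecognizeAsync(RecognizeMode.Multiple);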
the hardest part about speech recognition is the 'grammar' of words that the speech recognizer is listening for. System.Speech has both dictation and command-and-control grammars. the dictation grammar is specified by using the DictationGrammar class. you can also use an overload of the DictationGrammar constructor to have it listen for spelling input. so instead of the user saying the word, they actually spell it. the spelling grammar seems to be able to handle the user speaking 'A as in apple'. on the command-and-control side, you'll use the GrammarBuilder class or an SRGS (Speech Recognition Grammar Specification) grammar. SRGS is a W3C standard for specifying grammars. for a simple grammar, you can just create a GrammarBuilder from a list of Choices, where the choices are just a list of words (as strings). so if i create a Grammar with the Choices red, green, blue ... then my SpeechRecognizer will be specifically listening for those words. and it can listen for words from more than one grammar at a time, with a priority applied to each. grammars get significantly more complex to handle such things as a user speaking an area code as 'two hundred and sixty two' vs 'two six two'. grammars also have to take into account wildcards, such as a speaker saying please at the beginning or end of a sentence. for complex grammars, i generally turn to an SRGS grammar library. Speech Server 2004 has a large SRGS library for collecting common info such as credit card numbers, dates, floating point numbers, etc... so i try to reuse those existing grammars as much as possible. TODO using an SRGS rule in a GrammarBuilder. there also needs to be a way to convert a GrammarBuilder into an SRGS doc.
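a sketch of the simple Choices grammar described above, plus loading an SRGS grammar from a file (the .grxml path is hypothetical) :

// simple command-and-control grammar from a list of choices
Choices colors = new Choices(new string[] { "red", "green", "blue" });
Grammar colorGrammar = new Grammar(new GrammarBuilder(colors));
colorGrammar.Name = "colors";

// an SRGS grammar loaded from a file (placeholder path)
Grammar srgsGrammar = new Grammar(@"C:\grammars\creditcard.grxml");

// a recognizer can listen to multiple grammars at once
SpeechRecognizer reco = new SpeechRecognizer();
reco.LoadGrammar(colorGrammar);
reco.LoadGrammar(srgsGrammar);
reco.SpeechRecognized += delegate(object sender, SpeechRecognizedEventArgs e)
{
    // report which grammar matched, and what was said
    Console.WriteLine(e.Result.Grammar.Name + " : " + e.Result.Text);
};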
the System.Speech.Recognition namespace is easy enough to use as well. but to make it easier, MS needs to provide a core grammar library for developers to use out of the box. they also need to provide better tool support to make it easier to create and debug grammars ... more on that later.
on the recognition side, the RecognitionResult class does provide the phonemes that were recognized from a user's speech. this actually provides a way for us to find out what phonemes the speech synthesis engine used to speak a textual word. the hack is to use the speech synthesis engine to synthesize a text word to a wav file. then use that wav file as input to a SpeechRecognitionEngine with a one-word grammar of the textual word. the word will be recognized, and from the recognition result you will get the phonemes that the synthesis engine used. granted, this isn't perfect, because the synthesis engine is using Microsoft Anna's voice, while the SpeechRecognitionEngine is using my trained voice profile. so you could get better results by doing this in SAPI 5.3, in which you can manipulate voice profiles.
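the hack looks roughly like this (the wav path is a placeholder, and Recognize() can return null if nothing matched) :

string word = "hello";
string wav = @"C:\temp\word.wav"; // placeholder path

// 1) synthesize the word to a wav file
SpeechSynthesizer synth = new SpeechSynthesizer();
synth.SetOutputToWaveFile(wav);
synth.Speak(word);
synth.SetOutputToDefaultAudioDevice(); // closes the wav file

// 2) recognize that wav with a one-word grammar
SpeechRecognitionEngine engine = new SpeechRecognitionEngine();
engine.LoadGrammar(new Grammar(new GrammarBuilder(word)));
engine.SetInputToWaveFile(wav);
RecognitionResult result = engine.Recognize();

// 3) pull the phonemes out of the result
if (result != null)
{
    foreach (RecognizedWordUnit unit in result.Words)
        Console.WriteLine(unit.Text + " -> " + unit.Pronunciation);
}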
this info is useful for a number of reasons. 1) language learning. with a MUI pack, it should be possible to create highly dynamic language learning applications. the app could show the phonemes that a user should be speaking vs what they actually spoke. 2) phonetic search. transcription isn't quite good enough to search podcasts and video directly as text, but there are phonetic search algorithms that are getting quite good. MS Research has been working on this, and you can see a channel9 video here. i hope that the speech reco and synthesis engines give us more access to lower level results in the future, so we can perform these sorts of searches in our own applications.
the System.Speech namespace and SAPI are great, but desktop speech developers need some tool support. that is one place Speech Server has been excelling, by providing great tools. first, we need some schemas in VS.NET. i should be able to add a new file of type SRGS or SSML to a project and get a schema, color coding, and intellisense. beyond that, Speech Server has a graphical tool for creating SRGS grammars ... we need that broken out of Speech Server to just be a part of Orcas proper. also, Speech Server has the concept of prompt databases, which are audio recordings of voice talent mapped to text. i'd like to see System.Speech.Synthesis support that too. next, there are conversational grammars (with tool support) in Speech Server ... break those out for System.Speech.Recognition. finally, i love the Speech debugging window! that's a great way for speech developers to test their apps in noisy or public settings by either providing text or recorded audio as input. and if i want to listen to music while coding, returning the output as text is a really nice touch.
so that's just a quick article working with System.Speech. the API can be really simple to use, as well as supporting advanced scenarios. it's also great that the synthesis and recognition engines have dramatically improved in Vista too. but a major problem is the System.Speech community ... because there really isn't one. MS needs to work on developing a System.Speech community that is comparable to the Speech Server community.