Text Adventure Game with Speech Server 2007 and IM
this article is about writing an application that hosts text adventure games to be played through speech or instant messaging. i came up with the idea for a text adventure game as an entry into the Robot Invaders contest. but i was also in the MS Speech Server 2007 beta, and realized i could use it to expose the games as a voice-only application over telephone or VOIP too. the end goal was to use the great development and hosting platform of Speech Server, and then create an IM-to-Speech bridge for the instant messaging front-end, so that the app would keep a common codebase for both voice and text.
there are a # of changes in the latest version of Microsoft Speech Server (MSS 2007). one of the main changes is that it now supports VoiceXML, in addition to SALT. the languages are similar, but the development environments are different. SALT applications are developed using speech server controls, while VoiceXML development is page based. i've developed with both, and think it's great to have options. but i think the best addition for developers is the new Managed Code API. it is integrated with Workflow Foundation (WF) through custom activities, and makes for a much simpler development environment. this article will cover this dev model. next, Speech Server is adding support for VOIP. this is another great thing, because it lowers the cost of hosting voice-only applications. the trade-off is that Speech Server is dropping support for multimodal applications. multimodal apps just haven't taken off like i wanted them to ... but i hope they make a comeback in the future. finally, MSS 2007 adds a lot of tool support for analyzing your voice-only apps.
the following are links to previous speech articles:
VoiceXML - /vxml, /myVoices, /voiceBio
SALT - /noHands, /speechMulti, /tabletWeb, /mceSalt
Windows Workflow Foundation (WF) is a new programming model coming in .NET 3.0 (previously WinFX, currently in beta). it supports developing workflow systems with both human and system interactions. it also has great tool support integrated into VS.NET. with these tools, you build your applications by dragging and dropping Activities from the toolbox onto a design surface. the Activities link to one another to show the valid paths of program execution. code or declarative rules can be associated with an Activity to specify its behavior. WF provides a base set of Activities, but the real power comes from creating custom activities specific to the type of application you are building. this provides greater visibility into the internals of an application. MSS 2007 provides custom speech activities for developing voice-only applications. at runtime, WF is a hosted engine: it can be hosted in console apps, WinForms, WebForms, etc. i believe SharePoint uses it, and BizTalk is having it integrated. about the only place we are missing WF is Compact Framework applications.
anyway, i'm very excited about WF. this is actually my first WF application, and i found it very intuitive. specifically for speech, it was much easier to figure out than the first VoiceXML or SALT applications that i wrote.
speech workflow applications are always sequential, and the SDK comes with a number of custom activities. NOTE that most of the icons are red circles, just because the graphics aren't done yet (remember, this is still beta).
these speech dialog activities can be used along with the base set of WF activities, and any custom activities you might create. some of the speech activities have custom editors, which make them even easier to work with. the following pic shows the editor for the QuestionAnswer activity, which allows you to specify multiple prompts as static text, as well as helping you pick which grammar and rule should be used for speech recognition.
to try the new speech workflow model, i decided to write a sample application. this sample application is going to be a speech-ified version of a text adventure game ... a speech adventure? for the content, i'm using text adventures written by Scott Adams and Brian Howarth in the late 1970s and early 1980s. the games are stored in a text file format and are run using an interpreter. nope, XML didn't exist back then :) for the runtime engine, i ported a C version (1.14) by Alan Cox to C#.
the text file format stores many things. Actions are triggered by verb + noun combinations. Comments generally seem to be ignored. Items have a text description and an initial starting room. Messages provide prompts to the user. Rooms have a text description and know which rooms they connect to. finally, there are verb and noun fragments. generally, there are about 75 verbs and 75 nouns per game. the data file only stores the first 3 to 4 characters of each noun and verb, so the word 'Listen' might be stored as 'LIS'. also, some words are marked (*) as synonyms, so the words SHIP and BOAT would trigger the same action.
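to make the fragment + synonym scheme concrete, here's a minimal C# sketch of how a lookup over those stored fragments could work. the class and member names are my own (the real interpreter port is structured differently); the assumption is a fragment length of 3 and '*' marking a synonym of the previous word, as described above.

```csharp
using System;
using System.Collections.Generic;

class Vocab
{
    // maps a stored fragment (e.g. "LIS") to its canonical word index
    readonly Dictionary<string, int> map = new Dictionary<string, int>();
    readonly int wordLength;

    public Vocab(int wordLength) { this.wordLength = wordLength; }

    public void Load(string[] fragments)
    {
        int canonical = 0;
        for (int i = 0; i < fragments.Length; i++)
        {
            string f = fragments[i];
            // '*' means this entry is a synonym of the previous real word
            if (!f.StartsWith("*")) canonical = i;
            else f = f.Substring(1);
            map[f.ToUpperInvariant()] = canonical;
        }
    }

    // look up a full spoken word by its first wordLength characters
    public int Lookup(string word)
    {
        string key = word.ToUpperInvariant();
        if (key.Length > wordLength) key = key.Substring(0, wordLength);
        int index;
        return map.TryGetValue(key, out index) ? index : -1;
    }
}
```

with this, 'Listen' truncates to 'LIS' and resolves to the same entry as the stored fragment, and 'SHIP' and 'BOAT' resolve to the same canonical index.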
to play these games, you start out in some room, and text describes your room, what items it contains, and the directions you can go. there might be items you can 'get item', or you can examine something by saying 'look item'. then you move from room to room, by saying (n)orth, (s)outh, (e)ast, (w)est, (u)p, (d)own. there can be something like 30+ rooms, so you'll probably need to create a map. and there will be 60+ items to interact with, so you'll probably want to keep notes too. you can check what you are carrying by saying '(i)nventory'. you will be limited by the # of items you can carry, and some items will interact with each other, so you have to think about what you carry where. also, some places are dark, but you can use a lamp to get through; just be careful not to use it too much. for one of the games, the goal is to retrieve a # of treasures and store them in a specific room. for another game, the goal is to get through the rooms to find a single item.
anyway, i consider these games to be MUCH harder than today's standard difficulty for games. these are seriously tough ...
this section will walk through the actual speech workflow for the text adventure game. it starts out by saying hello and asking what game you want to play.
MSS : Hello 123-4567. Which game do you want to play? Adventure Land, Pirate Adventure, The Golden Baton, The Time Machine
User : Adventure Land
when you create a speech workflow app, it starts out with the AnswerCall and DisconnectCall activities. after AnswerCall, i added a Statement activity. this just says 'Hello' to the user, and repeats their phone number. it grabs the phone # from the TelephonySession and reads it as a telephone # (as opposed to digits) by using a PromptBuilder. this code is executed in the TurnStarting event for the statement activity.
// grab the caller's number from the telephony session
ITelephonySession its = this.TelephonySession;
string address = its.CallInfo.CallingParty.Address;

// rebuild the welcome prompt for this turn
this.stateWelcome.MainPrompt.ClearContent();
PromptBuilder pb = new PromptBuilder();
pb.AppendText("Hello");
// read the number as a telephone number, not a string of digits
pb.AppendTextWithHint(address, SayAs.Telephone);
this.stateWelcome.MainPrompt.AppendPromptBuilder(pb);
the next activity asks the user which game they want to play. this involves a dynamic prompt and a dynamic grammar. it determines the available game files by reading the files in a directory. the name of the file is the name of the text adventure, and that is synthesized to the user. the recognized value is then mapped to the name of the actual file. NOTE this QuestionAnswer activity could be replaced with a Menu activity. the pic below shows a simplified static version of the grammar.
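the directory-scan half of that can be sketched like this. the class name, method name, and the '.dat' extension are my assumptions, not the actual codebase; the idea is just that each returned title gets synthesized in the prompt and added as a grammar phrase whose recognized value maps back to the file path.

```csharp
using System.Collections.Generic;
using System.IO;

static class GameCatalog
{
    // scan a directory for game data files; the file name (without
    // extension) becomes the spoken title, and the recognized title
    // maps back to the full path of the data file
    public static Dictionary<string, string> FindGames(string dir)
    {
        var games = new Dictionary<string, string>();
        foreach (string path in Directory.GetFiles(dir, "*.dat"))
            games[Path.GetFileNameWithoutExtension(path)] = path;
        return games;
    }
}
```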
the next couple activities are used for handling saved games.
MSS : Do you want to continue your saved game?
User : No
the IfElse activity has a Condition which checks if a saved game file exists (saved games are keyed by the name of the game and the user's id). if a saved game exists, it uses a QuestionAnswer to ask if the user wants to continue their saved game. this is a static grammar, tied to the Confirmation_YesNo rule in the standard grammar library. after that, a Code activity sets an internal flag based on the user's answer. if a saved game does not exist, then nothing is triggered and the game begins.
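the Condition itself is simple enough to sketch. everything here (class name, directory parameter, '.sav' extension) is a hypothetical stand-in for whatever the real code does; the point is just that the file name is keyed by game name plus user id, as described above.

```csharp
using System.IO;

static class SavedGames
{
    // saved games are keyed by game name + user id
    public static string SavePath(string saveDir, string game, string userId)
    {
        return Path.Combine(saveDir, game + "_" + userId + ".sav");
    }

    // the IfElse Condition: does a saved game exist for this user?
    public static bool Exists(string saveDir, string game, string userId)
    {
        return File.Exists(SavePath(saveDir, game, userId));
    }
}
```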
the next part initializes the game engine and starts game play.
MSS : You are in a forest. Obvious exits : North, South, East, West. A voice BOOMS out ...
User : East
first, a Code activity parses the game's data file. second, a Code activity uses the data file, a solution file, and some code to generate a grammar for the game. this is kind of tricky because the 60+ verbs and 60+ nouns are just stored as word fragments. mapping the noun fragments isn't too hard. most of the nouns can be found in the data file for the game; either in the message text, room text, or item text. the verbs are a bit trickier, because they generally do not appear in full form in the data file. to get around this, i use solution files provided by Jacob Gunness. these step-by-step solutions provide all the verbs (and nouns) that are needed to finish a game. so i map those full words to the partial verbs in the data file. also, i've hardcoded a list of verbs that generally occur in these types of games. this logic isn't perfect, because some of the partial verbs will not end up in the grammar; but at least all the necessary verbs will be mapped based on the solution file. the pic below shows a portion of the dynamically generated grammar. the grammar specifies that at least 1 verb must be spoken, with an optional noun. the verb and noun lists will each have something like 100 words ... making approximately 10,000 combinations. i must say that Speech Server has been doing an excellent job recognizing what's being spoken. honestly, with a USB mic, i don't think it's had a miss yet ... serious!
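the solution-file trick can be sketched like this. the names here are mine (the real grammar generation code does more), and the assumption is that the walkthrough has already been tokenized into full words; each full word is matched to a stored verb fragment by comparing its first few characters.

```csharp
using System.Collections.Generic;

static class VerbMapper
{
    // map each full word from the walkthrough to a stored verb fragment
    // by comparing its first fragmentLength characters
    public static Dictionary<string, string> MapVerbs(
        IEnumerable<string> solutionWords,
        IEnumerable<string> verbFragments,
        int fragmentLength)
    {
        var map = new Dictionary<string, string>();
        foreach (string word in solutionWords)
        {
            string w = word.ToUpperInvariant();
            string prefix = w.Length > fragmentLength
                ? w.Substring(0, fragmentLength) : w;
            foreach (string frag in verbFragments)
            {
                // '*' marks a synonym of the previous verb in the data file
                string f = frag.TrimStart('*').ToUpperInvariant();
                if (f == prefix) map[w] = frag;
            }
        }
        return map; // full word -> fragment; this feeds the dynamic grammar
    }
}
```

the keys of the returned map (the full words) are what end up as phrases in the grammar; the values tie each phrase back to the fragment the game engine understands.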
then a QuestionAnswer activity starts the main game loop. it gets the intro text for the text game, and prompts the user. then it starts listening for a response from the dynamically created grammar above. so if the user says 'go east', then that is recognized as a valid entry. the next Code activity takes the recognized text and passes it to the game engine, to trigger in-game actions, and determine the next prompt for the user.
the final set of activities figure out if the user won, died, or continues the game loop.
MSS : O.K. You are in a sunny meadow. Obvious exits: South, East, West. You can also see: Large sleeping dragon ...
User : East
the Condition for the IfElse checks the game engine to see if the game is over. if the game is over then the server says Good-bye and disconnects the call. if the game is not over, then a GoTo activity jumps back to the qaGameInput activity. at this point the user is prompted with text from the game, and the game continues.
i've added a couple of Commands that are always available.
User : What can i say?
MSS : Navigate by saying North, South, East, West, Up, or Down ...
the user can speak these at any time. the user can say 'what can i say' to get prompted with tips on how to play the game. the other command is for debugging: saying 'what did i say' will prompt the user with the text of the last recognized speech. i added this to debug speech reco results, but i've been having such good results that i haven't needed it :)
the SDK also provides a debugging window integrated into VS.NET. it can return output as text or play the prompts (recorded or synthesized). and it accepts input as text, wav files, mic, and DTMF.
now it's time to backtrack. my original intent for porting the text adventure game was an instant messaging bot, for the Robot Invaders contest. what surprised me about the contest is that there was no MS-provided SDK. instead, it pointed to three 3rd-party bot SDKs. i took a quick look at them, and had problems with all 3. one of them was based on C# code, but was missing functionality like winks and voice clips; it also lacked any support for creating dialogs. another had a proprietary scripting language for conversations and its own IDE. the third had over-the-top installation and administration requirements, but it did have VoiceXML support for dialogs. that's what made it click that Voice UIs and Text Messaging UIs are very similar. it made me think that at least basic voice-only apps would also make sense as text messaging apps.
just about everything maps nicely, since they are both challenge/response scenarios. the things that don't transition are winks, nudges, and the activity window from MSN. nor am i sure if call transfer makes sense in the IM world? IM could definitely add a human to the conversation, like a voice app transferring to a live operator.
and i could also think about scenarios when i would want a speech app to be text based as well.
but the IM bot tool support is weak. so i decided to write the app first using the MS Speech Server tools, which are excellent, plus they support standardized grammar and synthesis formats.
once the speech app was working, the goal was to use a bot SDK to create a lightweight bridge from IM to Speech. so that the bridge would receive input as text and pass it on to the app in speech server; and get output as text and pass that back to the user. we already know this is possible because the speech debugging window does this. the pic below shows a conceptual sequence diagram.
er, um ... so i got this to work, but i can't make it public because i used unsupported classes. it's actually pretty slick because you really just have to configure it with the bot SDK info for where to receive incoming text messages, as well as the url for the speech app to send outgoing messages. one limitation that it currently has (at least with the developer install) is the # of concurrent sessions it can handle. my setup can only handle 2 at a time. i'm sure this limitation could easily be overcome. NOTE because of this limitation, i had to create a stand-alone app for hosting the 'TextAdventure bot' for the contest. so that codebase is slightly different :(
one tweak i made was based on the context of the incoming message. i made it so the workflow app could determine if the user was interacting through speech or text, which allows it to change its behavior if necessary. specifically, for one of the text adventure games, treasures are marked with an '*'. the speech engine reads this aloud, so i strip it off for speech; but i let it pass through for text messages. knowing the context of the user, the app might also switch out the grammar being used. e.g. if the user is text based, then the grammar might look for typing shortcuts such as 'n' for 'no'.
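the treasure-marker tweak boils down to something like this sketch (the enum and method names are mine, not the app's): format the outgoing game text per channel before it is sent.

```csharp
static class PromptFormatter
{
    public enum Channel { Speech, Text }

    // strip the '*' treasure markers before synthesis so they aren't
    // read aloud, but keep them for IM output
    public static string Format(string gameText, Channel channel)
    {
        return channel == Channel.Speech
            ? gameText.Replace("*", "")
            : gameText;
    }
}
```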
as an aside, i hope to get programmatic access to interact with a speech app for another reason too ... automated testing. i'd really like to create unit tests to run through the different paths of a workflow, test dynamic grammars, and just make testing quicker, as opposed to having to manually use the speech debugging window.
now that the same app is exposed through speech and text, what are the different ways i see it being used?
NOTE i don't have this implemented yet, but it should be possible to start your game over MSN, save it, and then continue it with speech (or vice versa). the part i'm missing is that i'd have to add a user id that is agnostic to the device being used. right now it's using the MSN email address for text and the phone # for speech.
to play, you must add the bot as a contact to MSN Messenger : TextAdventureBot@hotmail.com
the pic below shows a version of the app displaying a map (from SolutionArchive.com) in the activity window.
for MSN Mobile, if you are disconnected, you can reconnect and pick up where you left off. this would be a good time to type 'save game', because the session is only kept around temporarily. when you type 'save game', it saves to disk, and you can continue from that point when restarting your game later on.
the following is a sample transcript from an MSN session : advLandTranscript.rtf
the following is a sample recording from a debugging session
this article showed how to create a basic speech workflow application, and then discussed how an IM-to-Speech bridge could also make that application available through text messaging. specifically for speech development, i'm really impressed with the speech workflow development model. i found it intuitive and easy to work with; much easier than working with either VoiceXML or SALT. the app just seems to be more robust too.
and i think this app works nicely over both speech and text, but i'm not ready to say this is the way it should be done, because i'm a big proponent of customizing an app for its target platform. i am ready to say that a lot of voice-only apps should be made text-based too, because there are scenarios when speech makes sense, and times when text would be better. at the least, i hope that MS ends up providing us with proper bot development tools. and if we could leverage our existing speech knowledge base ... even better.
might update this when Speech Server 2007 is released
probably more speech workflow. later