Homemade Text-To-Speech with .NET

http://www.brains-N-brawn.com/ttSpeech 10/18/2004 casey chesnut


wrote the /noReco article because i had used a number of speech-related products, but did not know how they really worked under the hood. that resulted in trainable Speech Recognition (SR) using the Compact Framework (CF). this article is its complement. it will show the steps to create a basic program for doing Speech Synthesis, otherwise known as Text-To-Speech (TTS). just to be clear about 'homemade' ... this will be using only managed code and it will not call into any other libraries (such as SAPI) to do the TTS. the only pInvoke call it makes is to PlaySound, to play the synthesized audio as output

pronouncing 'hello world' with diphones


Speech Synthesis is being able to take words as text and then produce spoken words that humans can recognize. the de facto example is Speak & Spell. that was done by Texas Instruments in 1978. today, speech synthesis technologies have advanced to where they sound more human than they do robotic. some of them can even speak the same text in multiple languages. others are tying speech synthesis to facial gestures of avatars (i.e. lip syncing) to make them more believable. future work involves producing speech synthesis programs that can sing songs and display emotion (e.g. HAL9000)

there are a number of software packages that provide TTS capabilities. from Microsoft alone: MS Agents, SAPI, the Speech SDK, Voice Command for Pocket PC, etc. IBM is also a HUGE player in the speech world, recently making some of their speech code open source. there are a number of other open source speech synthesis initiatives: Festival, FestVox, Flite, FreeTTS. Festival is a large application by the University of Edinburgh (C++). FestVox is by CMU for producing voices for speech synthesis applications. CMU Flite is a lightweight version of Festival for running on servers and embedded devices (C). FreeTTS is a port of Flite to Java

looking at the smaller codebases of Flite and FreeTTS, i still felt they had too much code for a beginner to understand. kept looking for some really lightweight code that could do speech synthesis. did not care if it sounded robotic or was not very robust ... i just wanted it to be simple ... and to speak! but i couldn't find any code (or articles) that fit the bill. so what i ended up doing was reading about how the systems above worked, as well as looking at their code, and came up with the steps to make the simplest speech synthesis app possible

TTS requires 3 basic steps (described below):

1) Text Analysis
2) Lexicon
3) Voice Concatenation

those are just the basics. there is a lot more processing that can be done to make the speech sound more realistic. to read more about those steps, i recommend the FreeTTS Programmer's Guide

pronouncing 'hello world' with phonemes

1) Text Analysis

the program starts out with the user entering text to be spoken (e.g. hello world). first, it looks for sentence breaks (.!?) in case a paragraph was entered. by splitting it up into sentences, then pauses can be entered between sentences. the question and exclamation mark can also be used later to modify the spoken words. e.g. a sentence with an exclamation mark could be spoken more loudly. the sentence is then searched for pause characters (,:;) so those can be added later. next, the sentence is scrubbed for special characters ({[@#$%&*/"']}). some of them are replaced with their word representation (e.g. '@' becomes the word 'at'). then the sentence is scrubbed for numerals (0123456789). those are replaced with their text values (e.g. '123' becomes 'one two three')
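these steps are easy to sketch out. the real program is C# on the .NET CF, but here is roughly the same pipeline in Python for brevity; the character sets, the word replacements, and the <pause> marker are just illustrative:

```python
import re

# illustrative replacements -- the real tables would be much larger
SPECIALS = {'@': ' at ', '&': ' and ', '%': ' percent ', '$': ' dollars '}
DIGITS = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
          '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}

def analyze(text):
    # 1) split the paragraph into sentences on . ! ?
    sentences = [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]
    cleaned = []
    for s in sentences:
        # 2) mark the pause characters so pauses can be added later
        s = re.sub(r'[,:;]', ' <pause> ', s)
        # 3) replace some special characters with words, drop the rest
        for ch, word in SPECIALS.items():
            s = s.replace(ch, word)
        s = re.sub(r'[{}\[\]#*/"\'()]', ' ', s)
        # 4) speak each digit individually ('123' -> 'one two three')
        s = re.sub(r'\d', lambda m: ' ' + DIGITS[m.group()] + ' ', s)
        cleaned.append(' '.join(s.split()))
    return cleaned

print(analyze('email me @ home. it costs 45!'))
# ['email me at home', 'it costs four five']
```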

if i were to take this further, i would make it replace 123 with 'one hundred twenty three' instead. it would also have to handle many other cases such as dates and acronyms. it also does not handle words that have different pronunciations: 'he read that' vs 'don't read this'. some of the more advanced implementations use natural language processing (NLP) to determine the part of speech of a word. knowing if it is a noun or a verb can help determine how to speak the word
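the fuller numeral expansion is only a few lines. a quick sketch (Python, handles 0-999 only, just to show the idea):

```python
ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
        'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
        'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy',
        'eighty', 'ninety']

def number_to_words(n):
    # 0..999 is enough to show the idea; recurse for the tens/ones
    if n < 20:
        return ONES[n]
    if n < 100:
        tens = TENS[n // 10]
        return tens + ' ' + ONES[n % 10] if n % 10 else tens
    words = ONES[n // 100] + ' hundred'
    return words + ' ' + number_to_words(n % 100) if n % 100 else words

print(number_to_words(123))  # one hundred twenty three
```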

2) Lexicon

the parsed text is now divided into individual words to be spoken. for each word, its pronunciation has to be determined. the easiest way to do this is to look the information up in a dictionary. just so happens that a 120K word pronouncing dictionary is provided by CMU. it is a 3.5 meg text file. for the hello world example, this is what it provides:

HELLO  HH AH0 L OW1
WORLD  W ER1 L D

for 'hello', HH AH L and OW represent the 4 phonemes. a phoneme is an individual sound. that combination of individual sounds is what lets us distinguish one word from another (i.e. Speech Recognition). the english language has around 40 phonemes. some languages only have around 20. the !Kung have around 100 phonemes (clicks and whistles like starvin marvin). the 0 and 1 after vowel sounds specify if the vowel should be stressed or not. right now my program just ignores that information
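the lookup itself is trivial. a Python sketch, with two inline entries standing in for the real 3.5 meg file, including the spell-it-out fallback for words that are not found:

```python
# two entries in the CMU dictionary's plain-text format:
# the word, then its phonemes; trailing digits on vowels mark stress
ENTRIES = """\
HELLO  HH AH0 L OW1
WORLD  W ER1 L D
"""

def load_lexicon(text):
    lex = {}
    for line in text.splitlines():
        if not line or line.startswith(';;;'):   # ;;; lines are comments
            continue
        word, *phones = line.split()
        # this program ignores stress, so strip the digits off
        lex[word] = [p.rstrip('012') for p in phones]
    return lex

def pronounce(word, lex):
    if word.upper() in lex:
        return lex[word.upper()]
    # not in the dictionary: spell the word out letter by letter
    return [ph for letter in word.upper() for ph in lex.get(letter, [])]

lex = load_lexicon(ENTRIES)
print(pronounce('hello', lex))  # ['HH', 'AH', 'L', 'OW']
```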

for words that are not in the dictionary, i just have it spell the word out letter by letter. surprisingly that has not happened much in testing. it can read my street address, most news stories, and even my misspelled last name! it will often read some word, and i'll be like there is no way that is in the dictionary! obviously the dictionary does not contain newer words such as 'blog'. NOTE would love to know the average number of words that a person understands / writes / speaks ...

a more advanced approach uses rule-based logic to derive the phonemes from words. the same phoneme will generally occur from the same letter combinations. because of this, you can write code to look for those letter combinations and come up with the correct phoneme. the Flite program (above) uses this technique because it does not make sense to have large dictionary files on embedded devices. it also uses the dictionary approach for words that do not follow the rules ... which the english language is prone to do. out of the 120K words in the complete dictionary, 60K of those words are considered exceptions ... 50% isn't great, but at least the rules will come up with some sort of phoneme output for all words. back on the AI kick, i believe Hidden Markov Models (HMM) can be used to determine phonemes as well as their durations
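to give a flavor of the rule-based approach, here is a toy letter-to-sound pass. the rule table is made up and tiny; real systems like Flite use rule sets learned from data that are far larger:

```python
# toy rule table: letter patterns -> phonemes, longest patterns first
# (made up for illustration; real letter-to-sound rules are learned from data)
RULES = [('ph', ['F']), ('sh', ['SH']), ('th', ['TH']), ('ee', ['IY']),
         ('oo', ['UW']), ('a', ['AE']), ('b', ['B']), ('d', ['D']),
         ('e', ['EH']), ('f', ['F']), ('l', ['L']), ('n', ['N']),
         ('o', ['AA']), ('s', ['S']), ('t', ['T'])]

def letters_to_phonemes(word):
    word, phones, i = word.lower(), [], 0
    while i < len(word):
        for pattern, phs in RULES:
            if word.startswith(pattern, i):
                phones += phs
                i += len(pattern)
                break
        else:
            i += 1        # no rule matched -- skip the letter
    return phones

print(letters_to_phonemes('sheet'))  # ['SH', 'IY', 'T']
```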

after figuring out the phonemes, the next step is to figure out how to pronounce them. by 'how' i mean their duration and pitch. there are databases that provide this information too, but i just ignored that info for now

for kicks, i parsed the CMU dictionary to see what the frequency of phonemes was
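counting the frequencies is a one-pass scan over the dictionary file. a Python sketch, with a few inline entries standing in for the real file:

```python
from collections import Counter

# a few inline entries stand in for the real cmudict file
ENTRIES = """\
HELLO  HH AH0 L OW1
WORLD  W ER1 L D
WORD  W ER1 D
"""

def phoneme_frequencies(dict_text):
    counts = Counter()
    for line in dict_text.splitlines():
        if not line or line.startswith(';;;'):
            continue
        word, *phones = line.split()
        counts.update(p.rstrip('012') for p in phones)
    return counts

print(phoneme_frequencies(ENTRIES).most_common())
```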

3) Voice Concatenation

after retrieving the phonemes, you can then retrieve each individual phoneme from a voice database and concatenate them together. with only 40 phonemes, this would be the most economical choice to save space on embedded devices. but i used the voice database from FreeTTS. instead of using phonemes it uses diphones. diphones are just pairs of partial phonemes. the diphone representation of 'hello' follows. the PAU diphone stands for 'pause'; it represents silence. NOTE the FreeTTS voice database uses an extra diphone AX. this might be recovered from the pronouncing dictionary by taking into account the 1 or 0 stress designation applied to vowels
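turning a phoneme string into diphones is just pairing up adjacent phonemes, with PAU padded onto both ends. a quick Python sketch (per the NOTE, the actual FreeTTS database would want hh-ax rather than hh-ah here):

```python
def phonemes_to_diphones(phonemes):
    # pad with PAU (silence) on both ends, then take overlapping pairs
    padded = ['pau'] + [p.lower() for p in phonemes] + ['pau']
    return [a + '-' + b for a, b in zip(padded, padded[1:])]

# 'hello' from the pronouncing dictionary, stress digits stripped
print(phonemes_to_diphones(['HH', 'AH', 'L', 'OW']))
# ['pau-hh', 'hh-ah', 'ah-l', 'l-ow', 'ow-pau']
```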

instead of representing a single phoneme, a diphone represents the end of one phoneme and the beginning of another. this is significant because there is less variation in the middle of a phoneme than there is at the beginning and ending edges, so the concatenated speech is more recognizable. the problem is that it greatly increases the size of the phoneme database, from around 40 phonemes to 1600 (40 * 40) diphones. nothing like a 40-times increase to eat up space on a small device. FreeTTS comes with 2 voice databases. one is a 14 meg text file for 16 bit audio. the other is a 7.5 meg text file for 8 bit audio. their format looks like the following

DIPHONE aa-ae 0 7 13     
FRAME 43223 18758 48823 20524 (... 12 more) 
RESIDUAL     185 250 125 252 124 253 250 255 251 (... until 185)     
FRAME 43212 18659 49399 19238 (... 12 more) 
RESIDUAL     185 252 254 251 253 252 254 253 254 (... until 185) 

this shows that it is the diphone AA-AE. 0 represents the middle of AA, 7 is where AA and AE meet, and 13 is the middle of AE (remember that a diphone is 2 half phonemes). the FRAME and RESIDUAL information is a compressed WAV representation. using an algorithm called LPC, you can use the FRAME and RESIDUAL info to recreate a WAV representation of each diphone. these WAV diphones can then be concatenated together to produce the spoken text. NOTE i ported the LPC code from the java implementation of FreeTTS
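parsing that text format into memory is straightforward. a Python sketch, assuming the layout is exactly as shown above (the LPC resynthesis step itself is omitted):

```python
# a trimmed-down entry in the voice database's text format
DB_TEXT = """\
DIPHONE aa-ae 0 7 13
FRAME 43223 18758 48823 20524
RESIDUAL 185 250 125 252 124
FRAME 43212 18659 49399 19238
RESIDUAL 185 252 254 251 253
"""

def load_voice_db(text):
    db, current = {}, None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == 'DIPHONE':
            # name, middle of 1st phoneme, boundary, middle of 2nd
            start, mid, end = (int(x) for x in parts[2:5])
            current = {'start': start, 'mid': mid, 'end': end,
                       'frames': [], 'residuals': []}
            db[parts[1]] = current
        elif parts[0] == 'FRAME':
            current['frames'].append([int(x) for x in parts[1:]])
        elif parts[0] == 'RESIDUAL':
            current['residuals'].append([int(x) for x in parts[1:]])
    return db

db = load_voice_db(DB_TEXT)
print(db['aa-ae']['mid'], len(db['aa-ae']['frames']))  # 7 2
```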

example phonemes    example diphones
b                   aa-b
d                   b-eh
f                   ch-d
g                   d-er
k                   eh-f

to repeat ... the diphones are looked up in the voice database. the textual representation of each diphone is then converted to WAV format using the LPC algorithm. finally the WAV pieces are concatenated together and played to the user as spoken text. NOTE the WAV concatenation step is also where i could be stressing syllables, adjusting pauses, and adjusting pitch
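writing the concatenated samples out is the easy part. in Python it looks something like this; the two generated tones are just stand-ins for the decoded diphone audio, and the file name and sample rate are made up:

```python
import math
import struct
import wave

# stand-in for decoded diphone audio: a short 16-bit mono tone
def tone(freq, ms, rate=16000):
    n = rate * ms // 1000
    return [int(8000 * math.sin(2 * math.pi * freq * i / rate))
            for i in range(n)]

def concat_to_wav(filename, chunks, rate=16000):
    samples = [s for chunk in chunks for s in chunk]   # concatenate
    with wave.open(filename, 'wb') as w:
        w.setnchannels(1)    # mono
        w.setsampwidth(2)    # 16-bit samples
        w.setframerate(rate)
        w.writeframes(struct.pack('<%dh' % len(samples), *samples))

# two 80ms pieces glued together and written out
concat_to_wav('hello.wav', [tone(220, 80), tone(330, 80)])
```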

below is the list of phonemes and diphones in the FreeTTS voice database. NOTE that the voice database does not contain all the possible diphone combinations. it is possible to try to make it pronounce a word with one of these diphones that does not exist ... at which point it will fail. this is most likely to happen when trying to pronounce a foreign word in english


here are some short 16-bit samples (from my desktop) of what it sounds like. it is robotic and monotone ... but definitely recognizable! the 2 'news' clips at the bottom are short headlines from a popular news site (as a real world test). they are also the longest audio clips of the bunch. of course i made it cuss a bunch too ... nothing like a cursing computer to make me smile :)

speech w/diphones
ABC 123 brains-N-brawn.com
casey chesnut HAL9000
hello world how about a nice game of chess
i kick ass shall we play a game
speak and spell speech synthesis
stephen hawking text to speech
will i dream 2 + 2 = 4
longer samples below
news1 news2

Compact Framework

i did this because the story for speech on devices is pitiful. PPC 2003 and SP 2003 devices follow the SAPI Lite interface. it is the MS standard interface for both Speech Recognition and Text-To-Speech (a lighter version of what SAPI is on the desktop). sadly, retail devices do not ship with SR engines or TTS voices. even crappier, SAPI implementations do come with Platform Builder ... so if you have a device that you can re-image, then you can image it to do SR and TTS, but the people with retail devices are SOL. there are 3rd party implementations for doing speech on devices, but they are rag-tag at best. one vendor might offer SR only, while another only has TTS. trying to find one that offers both AND follows the SAPI interface is a chore. when looking around, i had a hell of a time finding a vendor that offered a free evaluation as well. so the state of the industry for speech on devices is crap ...

... so i finally got sick of bitching and just decided to port the above to the Compact Framework (CF). it actually works better than i expected [there is a video of it below]. from the time text is entered to the time it 1st starts playing takes about a second. since it takes a while for spoken text to be played, you could queue up longer passages and do the processing in chunks over time. the real performance drag is loading the databases. on my HP4355, the 3.5 meg lexicon database took 2 minutes to load. the 7.5 meg 8 bit voice database takes 8 minutes. that is with CFv1 ... i'm not sure which service pack i have at the moment. luckily, the databases only have to be loaded once, then the app can speak large passages of text (or write to a file). would love to try this out on one of those Dell 600MHz beasts along with CFv2. other means could also be taken to improve performance. Flite uses the technique of compiling the databases into the actual codebase ... that would work too

to reduce the footprint and increase performance, you could also switch to using phonemes instead of diphones ... which i tried. wrote a program to chop up the WAV form of the diphones and then reassemble those parts into the 41 phonemes (40 phonemes + pause). instead of 7.5 megs for the voice database, it reduces down to ~200KB of raw WAV files. with the WAV files you don't have to load the voice database, so that saves 8 minutes. concatenation is also faster since the LPC algorithm, which was using floating point arithmetic, does not have to be performed. using phonemes instead of diphones definitely reduces quality ... but it is still recognizable, albeit more speak-and-spell'ish. it is also more robust because it does not have the problem of missing rare diphones like the FreeTTS voice database does. now the only performance problem is the 2 minutes it takes to load the pronouncing dictionary. attempted to get past this by dumping the 120K words into a SqlCE database. this gets rid of the 2 minute load time, but it takes a couple seconds to read from the database with that many entries. might be worth trying this with SqlMobile, to see what its performance is

speech w/phonemes
1 + 3 = 4
casey chesnut
hello world
subliminal message


so this showed how to create a dead simple speech synthesis program. was able to get it to run on my Pocket PC with decent performance. in each scenario, it ran much better than i expected (considering performance and quality). the speech is monotonous and robotic sounding, but it is definitely recognizable as english speech. these sort of apps make perfect sense in mobile devices. envision walking around with a bluetooth Pocket PC in your pocket. it could connect to a bluetooth GPS and then report the direction you should travel to a bluetooth headset you are wearing. that scenario also makes sense in a car. multi-language scenarios involve typing text in your language and having it spoken in an alternate language. you could also do that in a learning scenario to learn another language. also, if you've used VoiceCommand on your mobile device, wouldn't you like to provide similar functionality in your own apps? or games! how about being able to create diphones from your own voice, and then auto generate podcasts from your textual blog posts that would sound basically the same as your own speaking voice (somebody tell Scoble i said podcast). etc. etc. for more ideas, Richard Sprague recently posted about : Cool demos for the next SAPI. the next SAPI ... why am i just now hearing about this? would love to get an alpha ...

  (1.1 megs)
the video above is using diphones, and not phonemes (phonemes are faster)

need to Thank Alan W Black and Kevin Lenzo. Alan is a research professor at CMU who provided a lot of the material (found on the web) that helped me understand this topic enough to put it together. i also recommend their company Cepstral for high quality speech synthesis voices. (worked on this for 1 week. read for 4 days, coded 3)


not giving away the code. the steps to reproduce it are outlined above in gory detail. there is actually very little code (~1000 lines, including WAV file processing) since i relied mostly on the pronouncing dictionary and voice database to bootstrap the application. a lot more code could certainly be written to improve the quality of the audio output and make it more robust


no plans


something AI-ish. later