/tabletDic

Tablet PC Dictionary

http://www.brains-N-brawn.com/tabletDic 9/6/2004 casey chesnut

Drama 9/15/2004

originally posted this article 9/6/2004 and was asked by Microsoft to pull it on the same day so they could review it. the problem is that they consider the word list to be intellectual property (IP) based on license agreements. i am now reposting the article minus the words from the dictionary and with the addition of this section. it was OK that i derived the WordList so that i could add the words that were not available with lower case representations ... to improve my Tablet PC experience

if you want to get the dictionary on your own, here is what you do. download the WordList files from the external sites linked below. then all you have to do is read each word from the WordList file and make the Tablet PC API call (code fragment below) to see if it is in the dictionary. before calling it, make sure to convert the word from the WordList to all upper case characters. you should end up with about 145K words, and it should not take more than a day

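using Microsoft.Ink; // managed Tablet PC API

RecognizerContext rc = new RecognizerContext();
rc.Factoid = Factoid.SystemDictionary; // only consult the system dictionary
rc.RecognitionFlags = RecognitionModes.Coerce | RecognitionModes.WordMode; // coerce to the factoid, one word at a time
bool retVal = rc.IsStringSupported(upperCaseWord); // true if the word is in the dictionary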
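the loop around that call could look something like this. it is a minimal sketch in modern C#, not the original program: `WordListFilter`, `FindSupported`, and the `isSupported` delegate are hypothetical names, and the delegate stands in for `rc.IsStringSupported` so the loop can be tried without the Tablet PC SDK.

```csharp
using System;
using System.Collections.Generic;

public static class WordListFilter
{
    // Returns the upper-cased form of every word that the checker accepts.
    // 'isSupported' is a stand-in for rc.IsStringSupported (an assumption of this sketch).
    public static List<string> FindSupported(IEnumerable<string> words, Func<string, bool> isSupported)
    {
        var found = new List<string>();
        foreach (string raw in words)
        {
            string word = raw.Trim().ToUpperInvariant(); // the API call expects upper case
            if (word.Length == 0) continue;              // skip blank lines
            if (isSupported(word)) found.Add(word);
        }
        return found;
    }
}
```

on a Tablet PC this could be called as `WordListFilter.FindSupported(File.ReadLines(path), rc.IsStringSupported)`, where `path` points at one of the downloaded WordList files.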

Introduction

cursive is for girls ... so i print, and badly at that. my first C in school was for poor handwriting in 5th grade. my only other C was Calculus II in college (blame that on Quake and a football injury). that is how badly i print. and then just about 1.5 years ago i became a Tablet MVP without ever having used one. MS was really kind to give me an old 1st generation Tablet PC that they had (still using it today) ... and this showed how bad my handwriting really was. the 1st Tablet OS did a good job of recognizing cursive, but not so well with print. this made the Tablet PC mostly unusable for me, and is partially why i jumped over to the CF group. but now things have changed. SP2 updates the Tablet PC to the 2005 edition. it improves handwriting recognition and makes Tablet PCs much more usable. it even does a surprisingly good job at recognizing my print. this puts me back on the Tablet PC bandwagon, which is why i wrote the /tabletWeb and /tabletSign articles ... and now this one.

the Tablet PC ultimately recognizes what you write by matching it against words in a dictionary. the steps are: you write, the Tablet PC runtime does some processing of what you wrote (using some form of AI, possibly neural nets?), and then it matches those results to words contained in a couple of different dictionaries. when it gives you alternate matches to choose from, these are also from one of the dictionaries. to improve your experience even more, you need to add the words you use that do not appear in the dictionary. e.g. i cannot get it to recognize when i ink brains-N-brawn.com ... because it is not in the dictionary. before adding words, it would help to know what words are already in the dictionary. Microsoft will not publicly tell us ("an engine's lexicon is considered to be proprietary intellectual property") ... so i will make an effort to discover those words on my own, and that is what this article is about.

Dictionary

there are 3 types of dictionaries: System, User, and Application

you can use Factoids / Input Scopes to specify how these dictionaries are used by an application. you can use:

NOTE the more complicated scenarios above were provided by Stefan Wick [MS]. he is a great help on the Tablet PC newsgroup and has an article about the Tablet Dictionary

Attack

the only tool at our disposal to attack the dictionary is RecognizerContext.IsStringSupported(string). you can use this call to check whether a word is in the dictionary or not. capitalization does matter when you call this API. in general, most words are CAPITALIZED, then Single_capped, then lower_case. you will not find any words in the dictionary that are lower case and do not have a corresponding upper case representation. you will find many words that are upper case and do not have a lower case representation. this sucks for me ... because i use lower case almost exclusively. if a word is not recognized lower case, i usually write it upper case the 2nd time ... and it works. i hope that the next version of Tablet PC gives equal treatment to lower case users (see the Conclusion below)
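to see which capitalizations of a single word are supported, the three forms can be generated and tested one by one. this is only a sketch: `CaseProbe`, `Variants`, and `SupportedVariants` are hypothetical names, and `isSupported` stands in for `RecognizerContext.IsStringSupported`.

```csharp
using System;
using System.Collections.Generic;

public static class CaseProbe
{
    // Builds the three capitalizations discussed above: CAPITALIZED, Single_capped, lower_case.
    public static string[] Variants(string word)
    {
        string lower = word.ToLowerInvariant();
        string upper = word.ToUpperInvariant();
        string single = lower.Length == 0
            ? lower
            : char.ToUpperInvariant(lower[0]) + lower.Substring(1);
        return new[] { upper, single, lower };
    }

    // Returns the distinct variants that the checker accepts.
    // 'isSupported' stands in for rc.IsStringSupported (an assumption of this sketch).
    public static List<string> SupportedVariants(string word, Func<string, bool> isSupported)
    {
        var result = new List<string>();
        foreach (string v in Variants(word))
            if (!result.Contains(v) && isSupported(v)) result.Add(v);
        return result;
    }
}
```

a word that only comes back in its upper case form is one of the words that lower case writers will have trouble with.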

the attack will take place on 2 fronts. it will start out with a brute force attack guessing all possible character combinations. the second front will be a word list attack using free lists of English words found on the internet

1) Brute Force

this attack just starts out with a list of characters that might be used. the dictionary can use characters from code page 1252 ... which comes to a total of 255 characters, with about 190 of those being usable. it started out with words that were one character in length, testing every possible character (190). then it went to 2 characters and every possible character combination (190 * 190 = 36100) ... and so on. to get all 4 character words took >30 hours, so i stopped it there. at this point i attempted to make the app multi-threaded, but all that did was slow the app down. this seems reasonable since i cannot write on my tablet in 2 different places at once. NOTE one unique thing is that the character for number 9 had not appeared yet. 0-8 had (401K, M1, K2, 3D, M4, 4x5, MI6, G7, V-8) but not 9? quick, somebody come up with a <5 character word with the number 9!
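the enumeration itself can be sketched as an odometer over the character list. this is not the original program: `BruteForce`, `Candidates`, and `Attack` are hypothetical names, and `isSupported` stands in for `RecognizerContext.IsStringSupported`.

```csharp
using System;
using System.Collections.Generic;

public static class BruteForce
{
    // Lazily yields every string of exactly 'length' characters drawn from 'alphabet'.
    public static IEnumerable<string> Candidates(string alphabet, int length)
    {
        if (length <= 0 || alphabet.Length == 0) yield break;
        var idx = new int[length];   // odometer of positions into the alphabet
        var buf = new char[length];
        while (true)
        {
            for (int i = 0; i < length; i++) buf[i] = alphabet[idx[i]];
            yield return new string(buf);

            int pos = length - 1;    // advance the odometer from the right
            while (pos >= 0 && ++idx[pos] == alphabet.Length) idx[pos--] = 0;
            if (pos < 0) yield break; // every position wrapped around: done
        }
    }

    // Tests every candidate; 'isSupported' stands in for rc.IsStringSupported.
    public static List<string> Attack(string alphabet, int length, Func<string, bool> isSupported)
    {
        var found = new List<string>();
        foreach (string candidate in Candidates(alphabet, length))
            if (isSupported(candidate)) found.Add(candidate);
        return found;
    }
}
```

because the candidates are generated lazily, memory use stays flat no matter how large the search space gets; only the running time grows.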

for the 2nd run i removed the special characters that had not appeared yet. also removed lower case characters and added logic to compensate for this within the application. i.e. only test for a lower case word if the upper case word exists. this reduced the number of possible characters from 190 to about 56. this introduces some amount of error for the sake of speed. some words with the special characters that were removed will not be found. also, unique capitalization like 'SwF' will not be found. there is a slim chance to pick these missed words back up during the word list phase. only used this test for 5 character words, which took about 9 hours to run. NOTE at 5 chars the 1st word with a space was 'E. ON' ... i have no clue what that means? the 1st word with 9 finally appeared ... 'WIN95'

the 3rd run went after words 6 characters long. i removed most of the special symbols and numbers. this left the 26 letters and 6 special characters (space ' - . / É). this took about 18 hours to run.

the 4th run went after words 7 characters long. i removed all of the special symbols at this point and only went after the 26 letters. it would take 7 days to complete this test. got impatient and wrote this article before it completed
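the reason for shrinking the character set on every run shows up in a few lines of arithmetic. the sketch below only counts candidate strings (it ignores the follow-up lower case tests); `SearchSpace` and `Count` are hypothetical names, and the alphabet sizes come from the runs described above.

```csharp
using System;

public static class SearchSpace
{
    // Number of candidate strings of exactly 'length' characters over 'alphabetSize' symbols.
    public static long Count(int alphabetSize, int length)
    {
        long total = 1;
        for (int i = 0; i < length; i++)
            total = checked(total * alphabetSize); // throws instead of silently overflowing
        return total;
    }
}
```

`SearchSpace.Count(190, 4)` is 1,303,210,000 candidates for the 4 character pass, `Count(56, 5)` is 550,731,776, `Count(32, 6)` is 1,073,741,824, and `Count(26, 7)` is 8,031,810,176. for comparison, 5 characters over the full 190 character set would have been 247,609,900,000.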

i can only think of 2 ways to continue the brute force attack for 8 characters and above. the 1st way would be to still use a single computer. i could use the results from the WordList attack below and do trending to see which letters occur the most in certain combinations. then i could do the brute force attack to only go after character combinations that occurred the most frequently. the 2nd way would require a grid of Tablet PCs running in parallel. i could dole out small tasks for all the Tablets to run at the same time and push their results back to a controlling computer. anybody want to donate some of their unused Tablet PC CPU time? sort of a SETI@home dictionary attack
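one way the 1st idea could work is to count adjacent letter pairs in the words already found, and then only build candidates out of pairs that occur often enough. this is a sketch of that idea under stated assumptions, not something the article implements: `BigramPruning`, `BigramCounts`, `Candidates`, and the `minCount` threshold are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class BigramPruning
{
    // Counts every adjacent character pair in the known words.
    public static Dictionary<string, int> BigramCounts(IEnumerable<string> knownWords)
    {
        var counts = new Dictionary<string, int>();
        foreach (string w in knownWords)
            for (int i = 0; i + 1 < w.Length; i++)
            {
                string pair = w.Substring(i, 2);
                counts.TryGetValue(pair, out int n); // n is 0 when the pair is new
                counts[pair] = n + 1;
            }
        return counts;
    }

    // Yields strings of 'length' characters in which every adjacent pair
    // was seen at least 'minCount' times. The output order is not sorted.
    public static IEnumerable<string> Candidates(
        string alphabet, int length, Dictionary<string, int> counts, int minCount)
    {
        if (length <= 0) yield break;
        var stack = new Stack<string>();
        foreach (char c in alphabet) stack.Push(c.ToString());
        while (stack.Count > 0)
        {
            string prefix = stack.Pop();
            if (prefix.Length == length) { yield return prefix; continue; }
            string last = prefix[prefix.Length - 1].ToString();
            foreach (char c in alphabet)
                if (counts.TryGetValue(last + c, out int n) && n >= minCount)
                    stack.Push(prefix + c); // only extend along frequent pairs
        }
    }
}
```

the trade-off is the same as with the reduced character sets: a higher `minCount` means fewer candidates to test, but any word containing a rare pair is missed.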

2) Word Lists

this attack involved searching for free English Word Lists on the internet, and then seeing if those words were in the dictionary. the WordLists were from word-puzzle hobbyists (Scrabble), online dictionaries, hacker password lists, Project Gutenberg, census data, etc. the better ones could be found here

this attack was much more efficient than brute force. a single WordList of about 200K words would finish within an hour. the average collegiate dictionary has 250K words, the average unabridged dictionary has 475K words. from combining all the WordLists i could find, i came up with a mega WordList of 1.5 to 2 million words. i wish the Oxford English Dictionary provided their WordList ... they have 1 to 2 million English words, and millions more specialty words. the down side of the WordLists was that it was hard to find WordLists with recent words from the last couple of years, as well as technology specific terms. my assumption is that those words are missing from the results below
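combining the individual WordLists into one mega WordList mostly comes down to normalizing case and dropping duplicates. a minimal sketch, assuming one word per entry; `WordListMerge` and `Merge` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class WordListMerge
{
    // Combines several lists into one sorted list, treating words
    // that differ only by case as the same word.
    public static List<string> Merge(IEnumerable<IEnumerable<string>> lists)
    {
        var unique = new SortedSet<string>(StringComparer.Ordinal);
        foreach (IEnumerable<string> list in lists)
            foreach (string raw in list)
            {
                string word = raw.Trim().ToUpperInvariant();
                if (word.Length > 0) unique.Add(word); // the set ignores repeats
            }
        return new List<string>(unique);
    }
}
```

merging first means each distinct word is only sent to the recognizer once, no matter how many of the source lists contain it.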

# Chars   BruteForce   WordList   Percent
1                  2          2     100.0
2                463        460      99.3
3               2298       2266      98.6
4               4970       4744      95.4
5               9142       8816      96.4
6              14506      14116      97.3
7              19066      18993      99.6
TOTAL          50447      49397      97.9

to judge how well the WordList attack did, i compared its results to the brute force attack (above). so with a little confidence i can say that the WordList attack found about 95% of the words in the Tablet Dictionary ... and that is good enough for me :)
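a comparison like that can be sketched as a per-length overlap between the two result sets. how the article's percentages were actually computed is not stated, so this is an assumption: the sketch reports the share of brute force words that also show up in the WordList results. `Coverage` and `ByLength` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Coverage
{
    // For each word length, returns the percentage of brute force words
    // that the WordList attack also found.
    public static SortedDictionary<int, double> ByLength(
        IEnumerable<string> bruteForce, IEnumerable<string> wordList)
    {
        var listed = new HashSet<string>(wordList);
        var result = new SortedDictionary<int, double>();
        foreach (var group in bruteForce.Distinct().GroupBy(w => w.Length))
        {
            int hits = group.Count(w => listed.Contains(w));
            result[group.Key] = 100.0 * hits / group.Count();
        }
        return result;
    }
}
```

the words that fall out of the overlap are also the interesting ones, since they show what kind of entries the WordLists tend to miss.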

these are the words that are unique to the Brute Force attack (to see the diff between WordLists)

Statistics

then i combined the results from the BruteForce and WordList attacks to come up with my best guess of the words in the Tablet PC Dictionary.

word length distribution. # of words with N characters

# Chars   # Words   # Chars   # Words
1               2   18            228
2             463   19            122
3            2298   20             45
4            4970   21             22
5            9144   22              7
6           14530   23              6
7           19310   24              3
8           21232   25              2
9           20278   26              1
10          17452   27              1
11          13398   28              1
12           9401   29              2
13           6128   30              1
14           3596   31              1
15           2057   32              0
16           1095   33              0
17            615   34              1

starting letter distribution. # of words that start with this letter. never would have guessed SCPDA

Char # Char #
A 9188 N 3260
B 8886 O 3644
C 14415 P 10648
D 9363 Q 701
E 6333 R 7678
F 6475 S 14801
G 4946 T 6738
H 6019 U 3531
I 6260 V 1841
J 1573 W 3619
K 2222 X 154
L 4780 Y 603
M 8290 Z 402

character distribution. # of times a character appears. now you know where RSTNLE comes from

Char        #   Char        #   Char    #   Char      #
A      102644   N       88861   0       4   #         2
B       25574   O       82708   1      11   $         3
C       52773   P       34835   2      22   &        39
D       45656   Q        2203   3      13   '       268
E      142075   R       91255   4      11   +        10
F       17520   S      107723   5       7   -      4662
G       34610   T       84076   6       6   .      1140
H       31385   U       40366   7       6   /       184
I      110729   V       12645   8       4   :         1
J        2582   W       11588   9       3   \         2
K       13308   X        3862               `         1
L       70270   Y       21517
M       36128   Z        6348

these are the longest words (greater than 20 characters); the shortest words are: A, a, I
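the three distributions above are simple tallies over the combined word list. a minimal sketch, assuming the input has already been reduced to unique words; `WordStats` and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class WordStats
{
    // Adds one to the count stored under 'key' (missing keys start at zero).
    private static void Bump<TKey>(IDictionary<TKey, int> counts, TKey key)
    {
        counts.TryGetValue(key, out int n);
        counts[key] = n + 1;
    }

    // Number of words for each word length.
    public static SortedDictionary<int, int> LengthDistribution(IEnumerable<string> words)
    {
        var counts = new SortedDictionary<int, int>();
        foreach (string w in words) Bump(counts, w.Length);
        return counts;
    }

    // Number of words that start with each character, ignoring case.
    public static SortedDictionary<char, int> StartingLetters(IEnumerable<string> words)
    {
        var counts = new SortedDictionary<char, int>();
        foreach (string w in words)
            if (w.Length > 0) Bump(counts, char.ToUpperInvariant(w[0]));
        return counts;
    }

    // Number of times each character appears across all words, ignoring case.
    public static SortedDictionary<char, int> CharacterCounts(IEnumerable<string> words)
    {
        var counts = new SortedDictionary<char, int>();
        foreach (string w in words)
            foreach (char c in w) Bump(counts, char.ToUpperInvariant(c));
        return counts;
    }
}
```

the sorted dictionaries keep the output in the same order as the tables, by length or by character.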

Interesting

these words are unique in the Tablet dictionary, compared to a compiled word list of about 800K words

Conclusion

that gives us approximately 95% of the words (146K) in the Tablet dictionary. there are probably at least 5000 more words (that i did not find) putting the entire dictionary just over 150K words total. those words that are missing are probably technical words and newer words (last 5 years). compare this to the average collegiate dictionary being 250K words and unabridged English dictionaries being 470K. it was kind of a pain to get these words ... but it is fun to search through the end result. now i'm constantly thinking to myself ... i wonder if that word is in the dictionary :)

also corrected the lower case word problem. out of the 146K words, about 24K of them did not have an all lower case representation. i wrote a program to go through the list and find those, and then saved that list to a file. you can import that list using the Tablet PC Dictionary Tool in case you hate UPPER CASE CHARACTERS TOO! hope that lower-case writers become 1st class citizens in the future
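finding those words can be sketched as a set lookup over the list of supported words. this is not the original program: `LowerCaseGaps` and `MissingLowerCase` are hypothetical names, and the input is assumed to hold every supported capitalization that was found.

```csharp
using System;
using System.Collections.Generic;

public static class LowerCaseGaps
{
    // Returns the lower case form of every supported word whose
    // lower case form is not itself in the supported list.
    public static List<string> MissingLowerCase(IEnumerable<string> supportedWords)
    {
        var supported = new HashSet<string>(supportedWords, StringComparer.Ordinal);
        var missing = new SortedSet<string>(StringComparer.Ordinal);
        foreach (string w in supported)
        {
            string lower = w.ToLowerInvariant();
            if (!supported.Contains(lower)) missing.Add(lower);
        }
        return new List<string>(missing);
    }
}
```

the result could then be saved with `File.WriteAllLines`, one word per line. the exact file format the Tablet PC Dictionary Tool expects on import is not given here, so treat one-word-per-line as an assumption.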

now for what i would like to see happen! as people use their Tablet PCs, they are adding their own words to their User dictionary. this greatly improves recognition for words, so the better your dictionary, the better your user experience. i would like to see MS start pushing out periodic dictionary updates (to add such words as 'blog') instead of waiting for a new Tablet OS or Service Pack ... maybe quarterly? they could figure out which words to add by supplying a tool to end users, possibly extending the current Tablet PC Dictionary PowerToy. all it would do is anonymously (and securely) upload the words from a person's user dictionary into a MS web repository. then MS could use that data to see which words they should add based on frequency of occurrence from actual Tablet PC users. as long as it was secure and anonymous, i think end users would do this for periodic dictionary updates

Source

no interesting code to see. what you really want is the dictionary [LINK REMOVED] of the 146K unique words (capitalization ignored)

Updates

no planned updates, although it would be fun to make a smarter brute force algorithm (aka brains-N-brawn)

Future

will probably do something non-Tablet-related next