/aiTabletOCR


Tablet PC OCR with Neural Network AI

http://www.brains-N-brawn.com/aiTabletOCR 9/28/2004 casey chesnut

Introduction

this article will detail my transition into Artificial Intelligence programming. it will start out with an overview of neural networks and how i applied one to do custom OCR (optical character recognition) and stroke-based handwriting recognition of ink for the Tablet PC. the title is misleading. OCR did not work so well, but the stroke-based approach did. there is a video clip of it working below.

at the end of last year, right after finishing the 1st implementations of Crypto and WSE for CF (now /cfAES and /cfWSE2) ... started looking for the next problem to tackle. specifically wanted to code something WITHOUT a specification (did you get it? 'specifically' without a 'specification'). doing crypto and web services had gotten monotonous. from the crypto side, all you have to do is get the bytes in the proper order; and from the WS side, all you have to do is shape the XML according to the specs. boring ... what i needed was some chaos ... and this is where AI came into play. had no previous AI experience, so i started reading book after book. even attempted a couple of AI-esque projects to see what i was in for.

learned a lot from those projects, but none of them used any traditional AI techniques. they both just consisted of procedural code to process the data directly. in one case the input data was an image, and in the other case the input was sound. in the same spirit that led me to write /noReco, i decided to try ink recognition next because i had been doing a bunch of Tablet PC programming, and wanted to get a better idea of how that might work.

Artificial Intelligence

started out reading a bunch of AI books. these are mostly a sorry breed. on one end you have history books telling what types of applications people were building, and on the other end you have educational books with page after page of mind-numbing mathematics. there are very few books that explain the 'concepts' outside of the mathematical realm or show how to 'apply' these techniques. why is that? when i read about ADO.NET ... i don't get force-fed relational algebra / calculus.

the history of AI is that it started out around 1960, and everybody loved it because of early positive results. the thought was that we were going to build general-purpose thinking machines to solve all problems. then people started pointing out more problems than they were solving ... and everybody jumped ship. the realization was that what was hard for people is easy for computers; and what is easy for people is hard for computers. for about 20 years, only the crazy people did AI ... or you went underground and called it something else. everybody else went and worked on setting up the internet to better distribute free music and porn. during that time, the concept of AI gave us a bunch of mean hollywood robots (HAL 9000, Terminator, and now the Matrix and iRobot). finally, people had too much music and porn to be able to manage it all ... and the AI people popped their heads up and said 'we can help'. so since about 1990, AI has slowly come back on the scene. instead of trying to make a know-it-all machine, it has been applied to very specific problems ... such as kicking the russians' ass at chess (a continuation of the cold war ... because we can never have too many wars going on at the same time).

AI is starting to become commonplace. when speech recognition takes off ... it uses AI. if you have a Tablet PC ... the ink recognition is AI. if you've seen those image database applications that can find similar pictures and people's faces ... AI. your XBOX enemies, stopping spam, etc. take a look at Google's job boards ... they want AI. not to mention MS and IBM have a lot of AI guys in their research departments. some of the cooler features of Longhorn (natural language search) are using AI.

AI can be further divided based on the approach taken. the 2 main camps are symbolic processing and connectionist. the symbolic group (old school) is about writing the logic and knowledge representation to make machines think. this is most similar to everyday programming. their key successes are expert systems, knowledge bases, and fuzzy logic. the connectionists (new school) are more about modeling AI along the lines of how our own brain works. their key successes are neural networks and decision trees. more recently, the trends are towards genetic programming and the emergence of intelligent behavior from chaos.

each AI technique is better suited to solve specific problems. if i was an expert at something or had a knowledge base ... then i would write an expert system. i would do something else if i had a lot of business logic rules to work with. fuzzy logic is particularly good at working with imprecise data. for this particular problem, neural nets seemed like a good fit, although other techniques can certainly be used. i specifically remember seeing Hidden Markov Models being mentioned ... i'd heard of them before, but don't remember what they are.

Neural Networks

Neural Networks (NN) are modeled after the way our brain works. basically it is a bunch of neurons (something like 10 to 100 billion) that are connected together. as we learn, some of those connections become stronger. so if we see something with our eyes, that info will take a path through the neurons, and ultimately we recognize the object that we have just seen. when you drink, it slows down those connections and messes them up a bit ... thus slurred speech and beer goggles. anyways, we don't care about the brain ... we want to make some code work.

so the basic concepts of an NN are that the Network is made up of Layers. these Layers are for Input and Output. there also might be some Hidden Layers. each Layer is made up of Nodes. all a Node does is store a value called its Activation, and possibly an Error. each Node from a Layer will connect to all the Nodes of the next Layer. the simplest Networks only have an Input Layer and an Output Layer. so each Node from the Input Layer would Link to every Node of the Output Layer. each one of these Links stores a value which is the Weight of how strong that connection is. the most commonly used Networks have 3 layers (Input, Hidden, and Output). the Nodes from Input connect to Hidden, and the Nodes from Hidden connect to Output. so it is a FeedForward system, moving from Input to Output. you can have multiple Hidden Layers, but just having 1 seems to be the most common. NOTE there are some NNs that loop back. these are called Recurrent Networks and are not used all that often.
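to make this concrete, here is a bare-bones sketch of what that structure could look like in C# (these class names are just my own, not from any of the C# libraries mentioned later):

public class Node
{
    public double Activation;   // the value the Node stores
    public double Error;        // used later during training
}

public class Link
{
    public Node From;
    public Node To;
    public double Weight;       // how strong the connection is
}

public class Layer
{
    public Node[] Nodes;
    public Layer(int count)
    {
        Nodes = new Node[count];
        for (int i = 0; i < count; i++) Nodes[i] = new Node();
    }
}

public class Network
{
    public Layer Input;
    public Layer Hidden;
    public Layer Output;
    public Link[] InputToHidden;    // every Input Node links to every Hidden Node
    public Link[] HiddenToOutput;   // every Hidden Node links to every Output Node
}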

the number of Input Nodes will depend on your input data. the input data will be some array of values between 1 and -1. the pic above shows images of an X and O. to convert that to input data, you have to divide the image up (in this example 5x5). then if a cell is not red, its value is 0; and 1 if the cell is red. so the Input Layer for the NN would have 25 Nodes. and the value for each Node would be one of the characters from the strings below (read right-to-left and top-down)

in the real world, your input values will vary more widely, and you will have to use some function to normalize them. e.g. divide each value by the max value in the array 
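a minimal sketch of that encoding and normalization (assuming the 5x5 grid has already been reduced to booleans for red / not red):

using System;

static class InputEncoding
{
    // flatten a 5x5 grid into 25 input values (1 = red cell, 0 = not red)
    public static double[] ImageToInputs(bool[,] cells)
    {
        double[] inputs = new double[25];
        int i = 0;
        for (int row = 0; row < 5; row++)
            for (int col = 0; col < 5; col++)
                inputs[i++] = cells[row, col] ? 1.0 : 0.0;
        return inputs;
    }

    // normalize real-world values by dividing each one by the largest magnitude in the array
    public static double[] Normalize(double[] values)
    {
        double max = 0;
        foreach (double v in values) max = Math.Max(max, Math.Abs(v));
        if (max == 0) return values;
        double[] result = new double[values.Length];
        for (int i = 0; i < values.Length; i++) result[i] = values[i] / max;
        return result;
    }
}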

the number of Output Nodes will depend on how many patterns you need to recognize. for the X and O example, you would need only 1 Output Node. if its value was greater than .5 then it would be X, and if it was less than .5 then the answer was O. if you are trying to recognize 10 digits, then you would have 10 Output Nodes. each Output Node would correspond to 1 digit and the answer would be the Node that had the highest Activation after the NN was run.

the number of Hidden Nodes will vary based on your problem. i've seen samples that have more Hidden Nodes than Input Nodes, the same number of Hidden Nodes as Input Nodes, and fewer Hidden Nodes than Input Nodes. the rule of thumb seems to be # Hidden Nodes = (# Input Nodes + # Output Nodes) / 2. if that does not work, then slowly add more Hidden Nodes (and possibly more Hidden Layers).

that gives you the structure of the NN. just a bunch of Nodes with Activation values connected by a bunch of Links with Weight values. before we can actually use the NN, we have to train it. 1) the Network starts out with all random Link values between -.5 and .5 ... so at this point, it is stupid. (most of the code i have seen started out with random values between -1 and 1, but the books i read favored -.5 and .5). training involves feeding in input data with known output data (i've seen these pairs called Patterns). you set the input values on each Input Node. 2) then some math happens in 2 parts to compute the Hidden Node values from the Input Nodes and their Links. 2a) each Hidden Node has a Link to each Input Node. the Hidden Node's value is the total of each Link's Weight * the Activation of that Link's Input Node. 2b) second, that total is run through what is called a Transfer function. the default one is called a 'sigmoid' (1/(1+Math.Exp(-x))), pictured above, but other common ones are 'step' and 'gaussian'. so now each Hidden Node will have some positive Activation value between 0 and 1. 3) the same process then occurs to compute the Activation value of the Output Nodes from their Links to the Hidden Nodes and those Nodes' Activation values. the Output Nodes will now have some value between 0 and 1. if the NN were already trained, at this point you would just see which Output Node had the highest Activation ... and that would be your answer. if not, then you have to train it. NOTE the Transfer functions might also introduce the concept of Bias.
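here is a sketch of that feed-forward math (steps 2 and 3), using plain weight arrays instead of Link objects:

using System;

static class FeedForward
{
    // sigmoid transfer function: squashes any total into the 0..1 range
    public static double Sigmoid(double x) { return 1.0 / (1.0 + Math.Exp(-x)); }

    // computes the Activations of one Layer from the previous Layer's Activations.
    // weights[j, i] is the Weight of the Link from previous Node i to current Node j
    public static double[] ComputeLayer(double[] previous, double[,] weights)
    {
        int count = weights.GetLength(0);
        double[] activations = new double[count];
        for (int j = 0; j < count; j++)
        {
            double total = 0;
            for (int i = 0; i < previous.Length; i++)
                total += weights[j, i] * previous[i];   // step 2a: sum of Weight * Activation
            activations[j] = Sigmoid(total);            // step 2b: run the total through the Transfer function
        }
        return activations;
    }

    // step 2 then step 3: Input -> Hidden -> Output
    public static double[] Run(double[] inputs, double[,] inputToHidden, double[,] hiddenToOutput)
    {
        double[] hidden = ComputeLayer(inputs, inputToHidden);
        return ComputeLayer(hidden, hiddenToOutput);
    }
}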

4) the next step in training is to calculate the error value for each Output Node. the function to determine Output Node error is (output[x].Activation * (1 - output[x].Activation) * (desiredOutputs[x] - output[x].Activation)). the desiredOutput is what value the Output Node should have had. e.g. for the X and O example, if i passed in X, then the single Output Node should have returned 1. if i had 10 Output Nodes to recognize the digits 0 through 9 respectively, and passed in 0 for input, then the 1st Output Node should be 1 and all the rest should be 0. 5) then we have to calculate the error value for each Hidden Node. this is another 2 step process which involves totaling up each Link's Weight * its Output Node's Error. that total is then multiplied by the current Node's Activation * (1 - Activation), the same derivative term as before, to set the Hidden Node's error value. NOTE you can add up all the error values of the Output Nodes to get the total error of the NN for that run.
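and a sketch of those error calculations (steps 4 and 5):

static class BackPropErrors
{
    // step 4: error for each Output Node
    public static double[] OutputErrors(double[] output, double[] desired)
    {
        double[] errors = new double[output.Length];
        for (int x = 0; x < output.Length; x++)
            errors[x] = output[x] * (1 - output[x]) * (desired[x] - output[x]);
        return errors;
    }

    // step 5: error for each Hidden Node, working backwards from the Output errors.
    // weights[x, h] is the Weight of the Link from Hidden Node h to Output Node x
    public static double[] HiddenErrors(double[] hidden, double[,] weights, double[] outputErrors)
    {
        double[] errors = new double[hidden.Length];
        for (int h = 0; h < hidden.Length; h++)
        {
            double total = 0;
            for (int x = 0; x < outputErrors.Length; x++)
                total += weights[x, h] * outputErrors[x];
            // same Activation * (1 - Activation) derivative term as the Output error
            errors[h] = hidden[h] * (1 - hidden[h]) * total;
        }
        return errors;
    }
}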

6) with the errors calculated, the next part is to adjust the Weights of the Links. this also works backwards from the Output Node Links to the Input Node Links. for each Hidden-Output Link, its new Weight value is equal to (currentWeight + hiddenActivation * outputError * learningRate). learningRate is just a value between 0 and 1. it keeps the values from shifting too rapidly, so that training does not over- and undershoot the correct outputs. the lower the learningRate the longer training will take, but it gives you a better chance of getting good results. from what i've read, .2 seems to be a decent starting rate for learning. 7) this same calculation then happens to calculate the new Weights for each Input-Hidden Link. at this point, all the Weights have been adjusted, and the NN is slightly more intelligent. NOTE the calculations for Weights sometimes use the concept of Momentum.
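the weight adjustment (steps 6 and 7) is the same little loop applied to both sets of Links:

static class WeightUpdate
{
    // adjust the Weights of one set of Links.
    // weights[j, i] is the Link from 'from' Node i (Input or Hidden) to 'to' Node j (Hidden or Output)
    public static void AdjustWeights(double[,] weights, double[] fromActivations,
                                     double[] toErrors, double learningRate)
    {
        for (int j = 0; j < toErrors.Length; j++)
            for (int i = 0; i < fromActivations.Length; i++)
                weights[j, i] += fromActivations[i] * toErrors[j] * learningRate;
    }
}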

steps 1-7 above show the scenario of training for 1 input. but this has to happen many times for the Weights to be adjusted correctly. for the digit problem, you would want to train it using each of the 10 digits. doing a complete training cycle of all 10 inputs is called an Epoch. you will have to train it for many Epochs. the 1st step is to train it until it correctly recognizes all the Inputs within a single Epoch (that might take 500 Epochs). the 2nd step is to randomly order the inputs for each Epoch. i.e. don't submit 0-9 in the same order every time. it will take more Epochs to learn in this manner, but will have less error (this might take 750 Epochs). the 3rd step is to total the error for each Epoch and then keep running it until that error is below some threshold such as .1 (this might take 2000 Epochs to reach). NOTE you do not want to keep training it for training's sake. at some point it becomes overspecialized and cannot handle noise in the data. if it is not overtrained, then it can take input that is slightly off and has a chance of coming up with the correct output as its best guess. this ability to handle bad data is what makes AI so appealing to me. otherwise you have to write code to handle tons of special cases for bad data, which makes the code unmaintainable as more special cases appear.
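a sketch of that outer training loop might look like this (the per-pattern steps 1-7 are assumed to be wrapped up in a trainPattern callback that returns the error for that pattern):

using System;

static class Training
{
    public class Pattern
    {
        public double[] Inputs;
        public double[] DesiredOutputs;
    }

    public static void Train(Pattern[] patterns, Func<double[], double[], double> trainPattern,
                             double errorThreshold, int maxEpochs)
    {
        Random rand = new Random();
        for (int epoch = 0; epoch < maxEpochs; epoch++)
        {
            // randomly order the patterns so they are not submitted the same way every Epoch
            Pattern[] shuffled = (Pattern[])patterns.Clone();
            for (int i = shuffled.Length - 1; i > 0; i--)
            {
                int j = rand.Next(i + 1);
                Pattern temp = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = temp;
            }

            // one Epoch = one complete training cycle through all the patterns
            double epochError = 0;
            foreach (Pattern p in shuffled)
                epochError += trainPattern(p.Inputs, p.DesiredOutputs);

            // stop once the total Epoch error drops below the threshold,
            // instead of training forever and overspecializing
            if (epochError < errorThreshold) return;
        }
    }
}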

that is it ... the NN is now trained. now you would feed it real-world input data (without knowing what the output is yourself), run steps 1-3 above, and the NN would output what it thinks the answer is. you can also look at the error for that output to see how confident it is in that answer. finally, you can look at the other output values and their error to see what the next-best answers are (2nd best guess, and so on)

the above describes what is called a BackPropagation NN. it is the most common type and seems to be used 80% of the time. there are many other types. some more basic, but very similar, types are called Perceptrons and Adaline. these typically do not have a Hidden Layer and cannot solve problems that are not 'linearly separable', such as XOR (the Hidden Layer of BackPropagation allows it to solve these problems). another popular type is called Kohonen or SOM. this type does not have a Hidden Layer and does not use supervised training. instead of recognizing patterns, it is used for classifying input data. then you have to figure out what the different classifications mean. wild! the SOM type is probably the NN that most closely models our own brain. another type is called Hopfield, which has the ability to memorize patterns. they all have the same basic concepts of Layers, Nodes, and Links ... but can be used to solve entirely different problems. the rest of this article uses the BackProp mechanism described above. i have yet to work with the other types ... but there is some source code that i ported to look at below.

although i could ... i did not want to write my own BackProp neural network from scratch. there are already a couple C# implementations available and i just started with the simpler ones. the following list is in order of what i consider simplest to most complex. i've successfully used the 1st 2 with slight modifications. have yet to try the last two.

i think the results below will show NN goodness ... but there are some pain points. 1) getting the real world data into some meaningful form. in all likelihood you are going to have to massage the input data into an input array. this might involve numerous image-processing steps or mathematical transforms. for me, this has mostly been trial and error. 2) configuring the neural network. you have to choose what size the layers should be, the learning rate, bias, momentum, error tolerance, etc... just a bunch of places a beginner can do something stupid. 3) training for optimal learning can take a long time (although running it is fast). i kept changing some setting and retraining it, and then testing it to see how it did. sometimes it would perform a little better ... and sometimes worse. right now it is definitely not intuitive to me what changes might make the results better.

Tablet PC Recognition

now that we've got some idea about Neural Networks ... it is time to apply one to a moderately difficult problem. XOR seems to be the 'hello world' of NNs. but why would i want to write a lot of code to train my computer to perform a function that is built into my programming language? the 2nd favorite seems to be pattern recognition of the digits 0-9 (OCR, optical character recognition). this is better, but it is not every day that i come across pictures of numbers that are perfectly represented in pixel form ... and need a computer program to recognize them. to up the realism a little, along with the difficulty ... i chose handwriting recognition. this will perform a much simplified version of the handwriting recognition that occurs on a Tablet PC.

the Tablet PC recognizer is stroke-based. it uses the x,y coordinates as you write to determine what is being written. on top of that, the English recognizer is word-based. it uses that stroke information to go against a Dictionary of words (see /tabletDic), and then returns the word that is the closest match. it can also return the next closest matches. alternatively, the Chinese recognizer is character-based, and recognizes individual written characters. NOTE you can force the English recognizer to recognize single characters using a Factoid or InputScope. finally, the Tablet PC recognizer is not trainable. it works for everybody as soon as you turn it on, and it does not get better as you continue working with it. the only way to improve recognition is to add words to the Dictionary or improve your handwriting. if somebody else borrows it, it will work equally well for them. this is opposed to Speech Recognition on the Tablet PC. speech reco requires training for a particular person before use; and the more you train speech reco the more accurate it will become.

there are minimal resources for doing custom recognition for Tablet PCs

to be different than the Tablet PC, my recognizer will require training and be character-based. granted ... i am going to sort of cheat, meaning i'm going to let the Tablet API do the grunt work. before recognition can even occur, the stroke information has to be obtained and characters have to be segmented. the Tablet PC API does this wonderfully. it collects the stroke information as x,y coordinates ... so i will just use the data it provides. it also has an object called the InkDivider, which can be used to break up ink into paragraphs, lines, and words. if you set the RecognizerContext Factoid to Factoid.OneChar and its RecognitionFlags to Coerce, then it will divide the ink into strokes that make up individual characters. i'm really not interested in writing that code ... so will rely on the Tablet API for that info. but i will not use the Tablet PC to do the actual recognition of the strokes into a character string. my recognizer will use that stroke info to do its own recognition using a neural network.
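the setup for that boils down to a couple of lines (double-check the exact property names against the Tablet PC SDK docs ... this is from memory):

using Microsoft.Ink;

class CharacterSegmentation
{
    // tell the recognizer context to treat the ink as single characters,
    // so the divided strokes come back one character at a time
    void Setup(RecognizerContext context)
    {
        context.Factoid = Factoid.OneChar;                  // single-character factoid
        context.RecognitionFlags = RecognitionModes.Coerce; // force the result to match the factoid
    }
}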

attempted this 2 ways: OCR and Stroke-based. 1) the OCR method treated the characters as images, and then compared the pixel data from the images to determine a match. made this attempt just for the experience. 2) the Stroke-based method worked entirely with the x,y coordinates of character strokes, and then compared that to previously trained stroke data. this attempt is more similar to what the actual Tablet PC does.

Optical Character Recognition

my 1st attempt was to do OCR pattern recognition, much like this article. NOTE that the article does not provide the source code for the NN, although you can easily recreate that program using the NN source code that is linked above. basically that article took the character input as an image, and broke it into a grid of 10x10 cells. each cell's value was obtained by summing up the pixel values ranging from 0 to 255 (black to white, with gray in between). then those values were normalized to fall between -1 and 1. you then use those values to train the NN.
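a rough sketch of that image-to-input conversion (the mapping of the cell sums into the -1..1 range is just one way to do the normalization):

using System;
using System.Drawing;

static class OcrInput
{
    // break the character image into a 10x10 grid, sum the pixel brightness in each cell,
    // then normalize the sums to fall between -1 and 1
    public static double[] ToInputs(Bitmap image)
    {
        const int grid = 10;
        double[] cells = new double[grid * grid];
        int cellWidth = image.Width / grid;
        int cellHeight = image.Height / grid;

        for (int gy = 0; gy < grid; gy++)
            for (int gx = 0; gx < grid; gx++)
            {
                double sum = 0;
                for (int y = gy * cellHeight; y < (gy + 1) * cellHeight; y++)
                    for (int x = gx * cellWidth; x < (gx + 1) * cellWidth; x++)
                        sum += image.GetPixel(x, y).R;   // 0 = black, 255 = white
                cells[gy * grid + gx] = sum;
            }

        double max = 0;
        foreach (double c in cells) max = Math.Max(max, c);
        if (max == 0) return cells;
        for (int i = 0; i < cells.Length; i++)
            cells[i] = (cells[i] / max) * 2 - 1;
        return cells;
    }
}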

the pic above shows an early version of it working. in that training session, it was only trained for three characters: A, B, and C. the pic shows that it matched all 3 correctly, but that happened rarely. even for just 3 characters, the error rate was about 50%. this was because the character was drawn just a little differently each time due to slant of writing, size, etc. if i had it recognize more than just 3 characters, then its error rate was much higher. so this attempt was basically a complete failure, meaning OCR does not work well for handwriting recognition. going off of the pixel data is not a good way to handle noise. OCR does still make sense for reading scanned documents that use consistent fonts.

Stroke-based Handwriting Recognition

the 2nd attempt worked much better. what i did was collect the stroke information as x,y coordinates (thus moving from a raster world to a vector-based one). the Stroke class has a BezierPoints property that returns a smoothed curve of Points. the only problem with that collection is that it only has the most significant points. first, i took those points and did some stroke geometry. i compared each Point to the following Point and determined the length between them. i also determined the angle of the line between those two Points. so now i had a collection of line stroke segment lengths and the angles at which they were drawn. with that collection, i dumped it out to a wave form ... to see what it looked like visually.
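the stroke geometry boils down to something like this (the Points are the plain x,y pairs handed back by the BezierPoints property):

using System;
using System.Drawing;

static class StrokeGeometry
{
    // turn a stroke's bezier points into a collection of segment lengths and angles
    public static void ToSegments(Point[] points, out double[] lengths, out double[] angles)
    {
        int count = Math.Max(points.Length - 1, 0);
        lengths = new double[count];
        angles = new double[count];
        for (int i = 0; i < count; i++)
        {
            double dx = points[i + 1].X - points[i].X;
            double dy = points[i + 1].Y - points[i].Y;
            lengths[i] = Math.Sqrt(dx * dx + dy * dy);   // length between the two Points
            angles[i] = Math.Atan2(dy, dx);              // angle of the line between them
        }
    }
}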

the picture above shows me writing the letter A twice, then the letter B twice, and then the letter C twice. right then, i knew that was significant, because the pattern was repeatable for a specific letter, and there was a high degree of variability between letters. i was skeptical though, for 2 reasons. 1st, because i came up with that algorithm by accident. i found a couple of algorithms on the web for doing handwriting reco but either did not like them or could not see how to feed that input data into a neural net. 2nd, because i had tried to use the wave form data to do speech recognition for /noReco. for speech reco, i could not use the wave form, but doing a transform of the wave using an FFT into a spectrogram did allow me to do recognition. went ahead and converted the handwriting wave using an FFT to a spectrogram, but it did not look repeatable for the same letter or variable between letters. and some single-stroke letters like 1 and lower case 'L' did not have any spectrogram at all. i should have realized that before even attempting it ... but it was not obvious until i saw the picture :( that is interesting though: the wave form works for handwriting reco, but not for speech reco; and vice versa for the spectrogram form.

going back to the wave form, i chopped it up into columns. the graphic above does not show it, but the columns overlapped some, and there were more columns. for each column i summed all the y values within that range of x. in other words, it was adding all the line segment angles for a time slice. i used 50 column slices for no particular reason. this gave me an array of 50 numbers something like 1500 1300 1150 700 200 -300 -500 -200 100 .... then i normalized those based on the maximum number to give an array like .98 .90 .85 .70 .20 -.30 -.50 -.20 .10 ... ASIDE this is about as far as i got with the /noReco article. since i did not understand NNs at that time, i had to use the 'nearest neighbor' algorithm for doing speech reco. i could now go back and update that article to use an NN as well ... and it should get even better results :)
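a sketch of that column slicing (50 columns as above; the exact overlap amount is arbitrary):

using System;

static class WaveColumns
{
    // chop the wave of segment angles into overlapping column slices,
    // sum the values in each slice, then normalize by the largest magnitude
    public static double[] ToColumns(double[] angles, int columns, int overlap)
    {
        double[] sums = new double[columns];
        double width = (double)angles.Length / columns;
        for (int c = 0; c < columns; c++)
        {
            int start = Math.Max(0, (int)(c * width) - overlap);
            int end = Math.Min(angles.Length, (int)((c + 1) * width) + overlap);
            for (int i = start; i < end; i++)
                sums[c] += angles[i];   // total of the line segment angles for this time slice
        }

        // normalize based on the maximum magnitude, e.g. 1500 1300 ... becomes .98 .90 ...
        double max = 0;
        foreach (double s in sums) max = Math.Max(max, Math.Abs(s));
        if (max > 0)
            for (int c = 0; c < columns; c++) sums[c] /= max;
        return sums;
    }
}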

now back to the NN world. the array of 50 numbers became the values for the Input Layer of the Neural Net. so i had an Input Layer of 50 Nodes. if i was recognizing the 10 digits, then the Output Layer was 10 Nodes. and the Hidden Layer would be 30 Nodes ((50 + 10) / 2).

then i wrote the test application. the test app lets me enter an input character (as stroke data) and what the output character should be (as a char string). to train it to do numbers, i have to enter all 10 digits. then i can train the neural network to some degree of error tolerance. once trained, i can write in an InkPicture box. to do recognition, the InkDivider class splits the ink up into strokes and hands those back to me. i process those strokes the same as above, and then run them through the neural network. then i grab the best result from the output and display that next to the written character. the results can be seen below.
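pulling the pieces together, the recognition path looks roughly like this (it leans on the hypothetical helpers sketched above and treats the trained net as a simple callback):

using System;
using Microsoft.Ink;

static class Recognition
{
    // recognize one character from its divided stroke
    public static char Recognize(Stroke stroke, Func<double[], double[]> runNet, char[] trainedCharacters)
    {
        double[] lengths, angles;
        StrokeGeometry.ToSegments(stroke.BezierPoints, out lengths, out angles);
        double[] inputs = WaveColumns.ToColumns(angles, 50, 2);   // 50 column slices -> 50 Input Nodes

        double[] outputs = runNet(inputs);                        // feed forward: 50 -> 30 hidden -> N output

        // the answer is the Output Node with the highest Activation
        int best = 0;
        for (int i = 1; i < outputs.Length; i++)
            if (outputs[i] > outputs[best]) best = i;
        return trainedCharacters[best];
    }
}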

recognizing 10 digits. accuracy is about 90%

shape/symbol reco (square, triangle, circle). NOTE this is not the proper way to do shape reco

some character reco from the 26 character set. it thought 'I' was an 'L'. accuracy is more like 80% for this larger set

recognizing the same character at different sizes (got to love vectors)

so its accuracy is about 85+% for a small character set. i'm pretty excited about this, because the full Tablet Recognizer is not perfect with individual characters either (it does much better with words). anybody that plays the MSN crossword puzzle will know this :) it is different than the full Recognizer in that it requires training and is character-based. it can be quickly trained for different character sets (digits, shapes, emoticons, letters, symbols, etc). it can also recognize characters at different sizes. the main limitation is that it is based on the direction the stroke is drawn. so if i train it with an O drawn counter-clockwise, it will not recognize a clockwise O (to accomplish this, i could just train it with both versions of writing O). but there are some character sets that specify the starting point and direction ... such as Graffiti. this algorithm would be excellent for doing Graffiti and should have something like 95% accuracy or better. i can think of at least a couple of ways to improve this algorithm as well.

Conclusion

this showed how to create a reasonably accurate, trainable, character-based handwriting recognizer for the Tablet PC using a Neural Network. it worked out as well as i could have hoped. the NN is able to perform much better than any procedural code that i could have written to perform the same task ... and that would require a lot of code that i don't ever want to write. it is also highly adaptable, so i can quickly train it to recognize some character that it has never seen before ... again without having to write any more code. hopefully this will nudge some people to start writing more interesting Tablet PC applications as well. worked on this specific application for 2 days and took 1 day to write the article. touched my 1st neural network code about a week ago.

 video (2.7 megs)

now for my wish list. 1st) the Tablet PC group needs to provide a more complete Recognizer DLL sample. right now only a skeleton sample is included. a sample that was still basic but actually demonstrated some of the API would help the developer community a lot. eventually, if we could get that available to managed developers and have some debug support ... that would be great too. anyways, i asked for that stuff in the /tabletReco article years ago. 2nd) how about some MS-backed AI libraries in the .NET Base Class Libraries? i'm talking System.AI.NeuralNet and System.AI.FuzzyLogic, etc... maybe that is already in the pipe for Longhorn? a microsoft.public. newsgroup would be great too. 3rd) we saw the Machine Learning vid on Channel9 and the MSDN Genetic Algorithm article ... now could we get some MS evangelists to start blogging about this stuff? my new motto is less ADO.NET ... and more AI.NET

Source

not providing the source code for this application. it is explained in detail and links are provided to the relevant C# Neural Net libraries above.

i did port the C code from Neural Networks at your Fingertips to C#. they all work except for SOM. if somebody could figure that out ... then i would post the fix.

Updates

no planned updates. if somebody finds the bug in the SOM source code sample above, then i will fix that

Future

a lot more AI to play with