Video Gesture Recognition

http://www.brains-N-brawn.com/aiGesture 10/6/2004 casey chesnut


this is a quick article about putting together an application to recognize gestures through a video feed. a gesture is a movement of the hand or body that has some significance. e.g. in the real world, a well-known gesture is shrugging your shoulders; most people recognize that action as meaning you don't know. e.g. in the .NET programming world, the Tablet PC API has a gesture reco system, which might be used to edit text ... such as if you scribble back and forth over a word, then the word gets erased

chose this project for a couple of reasons. first, to try out some artificial intelligence techniques on a different media type ... specifically video, because i've already worked with images, speech, and handwriting. second, to try out some ideas of a friend i met in Pittsburgh, Micah Alpern. his iWave design is for a gesture recognition system to control your car radio using hand motions and a heads-up display. this keeps your eyes on the road (for safety) instead of on the radio controls. in that experiment, the gesture recognizer was prototyped with a 'man behind the curtain' setup ... so i wanted to make an attempt at filling in that piece. this same sort of system could be used to control a Windows Media Center Edition PC with a web cam. if the web cam could swivel and track your position, then you could walk around and control the media center without having to carry the remote control with you


the 1st step was to get a video feed into my .NET application from a web cam (pictured above). for this task, unmanaged applications can use a component of DirectX called DirectShow. DirectX is increasingly becoming managed, but DirectShow is mostly a holdout and remains native; about the only managed code DirectShow provides is for playback of videos. luckily, CodeProject has an article ( DirectShow.NET ) by NETMaster which provides a managed wrapper for DirectShow. it comes with samples that can be used to capture web cam images, capture video to a file, and play DVDs. for this project, i directly used the SampleGrabberNET demo. from there, all i did was set up a Timer to periodically grab an image. that sequence of images would then go through image processing and be used for gesture recognition. at this point, you can also do a simple diff of consecutive images to do Motion Detection
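the actual app is C# on DirectShow.NET, but the 'simple diff' idea is language-agnostic. here's an illustrative Python sketch (not the article's code): treat two grabbed frames as grayscale grids, count how many pixels changed by more than a threshold, and call it motion if enough did. the threshold names and values are made up for the example

```python
def motion_detected(prev, curr, pixel_thresh=30, count_thresh=50):
    """Compare two grayscale frames (2D lists of 0-255 ints) and report
    motion when more than count_thresh pixels changed by over pixel_thresh.
    Both thresholds are illustrative values, not tuned ones."""
    changed = 0
    for row_a, row_b in zip(prev, curr):
        for a, b in zip(row_a, row_b):
            if abs(a - b) > pixel_thresh:
                changed += 1
    return changed >= count_thresh

# two tiny 4x4 "frames": the second has a bright blob in one corner
frame1 = [[10] * 4 for _ in range(4)]
frame2 = [[10] * 4 for _ in range(4)]
frame2[0][0] = frame2[0][1] = 200
print(motion_detected(frame1, frame2, count_thresh=2))  # True
```

in the real app the frames come from the SampleGrabber on a Timer tick, so this comparison has to run faster than the capture interval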


2nd step was initial image processing. since this uses hand gestures, needed a way to separate my hand from the image background. to filter for my hand i used this simple skin recognition algorithm ( Messing with Skin Recognition ). it is a pixel-by-pixel method that also let me collect some other metadata along the way. that metadata allowed me to find the center-point of my hand, demonstrated in the picture above: the lower-right of that image shows my hand filtered out and a red dot marking the center-point. NOTE the app captures a screen shot about every quarter second and image processing is done synchronously, so it had to occur as quickly as possible. NOTE this algorithm will not work against a flesh-colored background
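to show the shape of a pixel-by-pixel skin filter that collects the center-point as it goes, here's a Python sketch. the RGB rule below is a commonly cited simple skin heuristic, not necessarily the exact rule from the linked article, and the function names are my own

```python
def is_skin(r, g, b):
    # a simple RGB skin rule (illustrative; the linked article's exact rule may differ)
    return (r > 95 and g > 40 and b > 20 and
            max(r, g, b) - min(r, g, b) > 15 and
            abs(r - g) > 15 and r > g and r > b)

def skin_centroid(pixels):
    """pixels: 2D list of (r, g, b) tuples, indexed [y][x].
    Returns the (x, y) center-point of all skin pixels, or None if no skin found.
    This is the 'metadata collected along the way' during the per-pixel pass."""
    xs = ys = n = 0
    for y, row in enumerate(pixels):
        for x, (r, g, b) in enumerate(row):
            if is_skin(r, g, b):
                xs += x
                ys += y
                n += 1
    if n == 0:
        return None
    return (xs / n, ys / n)

skin = (220, 150, 120)   # passes the rule above
bg = (10, 10, 10)        # fails it
img = [[bg, bg, bg],
       [skin, bg, skin],
       [bg, bg, bg]]
print(skin_centroid(img))  # (1.0, 1.0)
```

the centroid becomes the red dot; note the flesh-colored-background failure mode falls straight out of this, since background pixels would pass the same rule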


also used that metadata to find the bounding box of my hand (shown above), although i ended up not using it. also considered doing a different process to outline the shape of my hand and then using that for recognition, but did not. to make sure this actually took advantage of having a video feed, and not just static image recognition, i used the motion of my hand to represent a gesture ... similar to the movements recorded by this Mouse Recognizer, but using hand movements instead. it can recognize simple motions of my hand from frame to frame, such as moving left, right, up, down, and staying still. it can also recognize combinations of these motions, such as left-then-right, right-then-left, up-then-down, etc. other gestures might be a Z pattern, or (counter-)clockwise circular motions. these motions can then be mapped to controls for your media player
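the per-frame motion labels can be derived just from how the hand's center-point moved between frames. a minimal Python sketch of that classification (my own naming and thresholds, not the article's code):

```python
def classify_motion(prev_center, curr_center, dead_zone=5):
    """Label the hand's frame-to-frame movement as 'L', 'R', 'U', 'D',
    or 'S' (still). Uses image coordinates, so y grows downward.
    dead_zone is a hypothetical jitter threshold in pixels."""
    dx = curr_center[0] - prev_center[0]
    dy = curr_center[1] - prev_center[1]
    if abs(dx) <= dead_zone and abs(dy) <= dead_zone:
        return 'S'
    if abs(dx) >= abs(dy):           # dominant axis wins; ties go horizontal
        return 'R' if dx > 0 else 'L'
    return 'D' if dy > 0 else 'U'

print(classify_motion((0, 0), (20, 3)))   # R
print(classify_motion((0, 0), (2, 1)))    # S
```

the dead zone matters because the centroid of the skin mask jitters a few pixels even when the hand is held still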

the pic below shows it recognizing a gesture. my hand moved from left to right, captured in 6 frames of movement ('R'), and the recognized command was then displayed as 'RIGHT'. this could be used to tell your media player to advance to the next track
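turning a run of per-frame labels like those six 'R' frames into a command can be as simple as collapsing repeats and dropping stills. a Python sketch of one way to do it (the command names here are illustrative, not from the article's code):

```python
def recognize_gesture(frames):
    """frames: list of per-frame motion labels ('L', 'R', 'U', 'D', 'S').
    Collapses consecutive repeats, drops 'S' (still) frames, then names
    the resulting motion sequence, e.g. ['L', 'R'] -> 'LEFT-THEN-RIGHT'."""
    seq = []
    for m in frames:
        if m != 'S' and (not seq or seq[-1] != m):
            seq.append(m)
    if not seq:
        return 'STILL'
    names = {'L': 'LEFT', 'R': 'RIGHT', 'U': 'UP', 'D': 'DOWN'}
    return '-THEN-'.join(names[m] for m in seq)

print(recognize_gesture(['S', 'R', 'R', 'R', 'R', 'R', 'R']))  # RIGHT
print(recognize_gesture(['L', 'L', 'R', 'R']))                 # LEFT-THEN-RIGHT
```

a real version would probably also require a minimum run length per direction, so a single noisy frame doesn't register as a motion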


so this showed how to create a simple (worked on it for 1 day) video gesture recognizer based on hand motion. this sort of interface makes perfect sense today in auto and home scenarios. it particularly makes sense during audio playback, when you might not want to pause the audio to do accurate speech recognition. as computing becomes more pervasive, i expect different forms of gesture recognition will find their way into these systems for a more natural (and possibly safer) interface


the article above points to the relevant code pieces


might extend this to do more complex recognition, possibly a combination of hand motions and hand shape recognition. that would probably involve some sort of transformation between sequential images. then it could recognize when i flipped it the bird :)


more AI. later