this article is about creating an app to cluster and search for related pictures. i got the basic idea from a Longhorn demo in which they showed similar functionality. in the demo, they selected an image of a sunset, and the program was able to search the other images on the hard drive and return similar images. there are other photo library applications that offer similar functionality. honestly ... i thought that was pretty cool, and wanted to have some idea how they might be doing that. i do not know how they actually operate internally ... but this article will show one possibility. i'm also writing this article to continue my AI training
the AI technique to be used is a type of Neural Network (NN). i've written one other NN article : /aiTabletOcr. that article uses the most popular type of NN, called Back Propagation. i've seen estimates that, out of all the types of NN (and there are a lot), BackProp is used as much as 80% of the time. the problem with BackProp is that you have to train it with the correct answers (called supervised training). you have to have a set of input data and let it make a guess as to what the output should be, then it has to look at what the actual output should have been (the correct answer is supplied by you), and finally correct itself. eventually, after many iterations of this, it can learn the pattern and will start getting the answers correct. for a picture matching application, i would have to 1st go through and find matching pictures myself, and then it could eventually learn which ones go together. i dont know about you, but i have better things to do than organizing my photo collection
luckily there is a type of NN that works with unsupervised training. i'm guessing that it is the 2nd or 3rd most popular type of NN? all you do is give it the input data and it trains itself. you dont even have to know what the actual answer should be. i've read a couple times where people think that this type of NN might be the kind that most closely models how the human brain actually works. this type of NN is called a Self Organizing Map (SOM). it is sometimes called a Kohonen map as well, after the researcher who invented it. a SOM is useful for clustering data and for visualization purposes. with a little bit of work you can also sort of use it for classification, but that is not its intended purpose and there are better methods available. the pic above shows its basic structure. it has Input Nodes and Map Nodes. each one of the Input Nodes connects to each Map Node. none of the Map Nodes connect to each other. not pictured is the feature vector associated with each Map Node. the feature vector is an array the length of the number of Input Nodes. so for the pic above, each Map Node would have a feature vector of length 3. NOTE the SOM map can also be in a 2D honeycomb shape or only have 1 dimension. it could be 3D as well, but that would be hard to visualize
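that structure can be sketched in a few lines. my actual library was C#, but here is a minimal Python version (the names Som and SomNode are my own, not from any library) to make the layout concrete : a grid of Map Nodes, each holding a feature vector the length of the input

```python
import random

class SomNode:
    def __init__(self, x, y, dim):
        # grid position of this Map Node
        self.x, self.y = x, y
        # feature vector, one weight per Input Node, randomly initialized
        self.weights = [random.uniform(-1.0, 1.0) for _ in range(dim)]

class Som:
    def __init__(self, width, height, input_dim):
        self.width, self.height = width, height
        # Map Nodes do not connect to each other; they only hold weights
        self.nodes = [SomNode(x, y, input_dim)
                      for y in range(height)
                      for x in range(width)]

# the 4x4 map with 3 Input Nodes from the pic : 16 nodes, vectors of length 3
som = Som(4, 4, 3)
```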
the map above would be useful for grouping colors. the 3 input nodes (and the feature vector for each Map Node) would represent the RGB values of a color. the 4x4 map would represent 16 different color classifications. granted, you are not telling it what those 16 different colors should be. if you give it random input of colors from the entire spectrum, then an individual Map Node would ultimately contain high level colors such as ROYGBIV. ultimately, you would want Red, Green, and Blue to be in the corners, and the other colors to be in between, such as Yellow between Red and Green. on the other hand, if you only gave it input of different shades of Red, then it would only cluster different shades of Red. the left-top corner might have a very dark red, and the lower-right corner might have a light shade of pink, with lighter shades of red between those nodes.
so that is the ultimate goal, but let me start at the beginning. you 1st have to choose what your map structure should be. for the color problem, let's say 3 Input Nodes and 16 Map Nodes as pictured above. remember that each Map Node also has a feature vector of length 3. with an untrained map, the feature vectors would start out with random values between -1 and 1. NOTE another initialization technique is to use a gradient. the feature vectors in the left-top corner start at -1, and the other nodes gradually increase the initial values of their feature vectors to reach 1 in the right-bottom corner.
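a sketch of that gradient initialization (assuming a simple diagonal ramp based on grid position; the function name is mine) :

```python
def gradient_init(width, height, dim):
    # ramp feature vectors from -1 in the left-top corner to 1 in the
    # right-bottom corner, based on each node's position in the grid
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            t = (x + y) / float((width - 1) + (height - 1))
            row.append([-1.0 + 2.0 * t] * dim)
        grid.append(row)
    return grid
```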
with the map setup, we now have to train it. this involves numerous iterations of feeding the input data to the map. for the color problem, our input data is random colors. all we have to do is randomly choose 3 numbers between 0 and 255 and then normalize them to be between -1 and 1. these 3 numbers represent the RGB values of a color. the map then takes that as input and goes through each Map Node to find which Map Node has a feature vector that most closely matches the input. the Map Node with the closest feature vector is called the Best Matching Unit (BMU). the BMU's feature vector is then adjusted to more closely match the input. also, the neighbors of the BMU have their feature vectors adjusted as well. the closer a neighbor is to the BMU, the more its feature vector values are adjusted; the further away, the less adjustment it gets. this is represented in the picture above.
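here is that training step as a sketch (Python rather than my C#; nodes are plain dicts, and the gaussian falloff for neighbors is one common choice, not the only one) :

```python
import math

def find_bmu(nodes, sample):
    # the BMU is the Map Node whose feature vector is closest to the input
    return min(nodes, key=lambda n: sum((w - s) ** 2
                                        for w, s in zip(n["w"], sample)))

def train_step(nodes, sample, learn_rate, radius):
    bmu = find_bmu(nodes, sample)
    for n in nodes:
        dist = math.hypot(n["x"] - bmu["x"], n["y"] - bmu["y"])
        if dist <= radius:
            # closer neighbors get a stronger pull toward the input
            influence = math.exp(-(dist ** 2) / (2.0 * radius ** 2))
            n["w"] = [w + learn_rate * influence * (s - w)
                      for w, s in zip(n["w"], sample)]
    return bmu
```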
the training of the SOM, by applying the input set, happens for many iterations (called epochs). each time an epoch passes, 2 values are changed : the learning rate and the neighborhood size. you start out with a high learning rate, so that the Map adjusts quickly to the input set. over time, the learning rate decreases to let the feature vectors stabilize. the neighborhood size is the distance away from the BMU within which Map Nodes will have their feature vectors adjusted. this typically starts out at around half the size of the map. so if the map is 10 x 10, then the neighborhood size starts out at 5. as time passes the neighborhood size decreases to 1, so that only the BMU has its weights adjusted. this decreasing of the learning rate and neighborhood happens in 2 phases. the 1st phase does the most work. it might start its learning rate at .9 and reduce it to .1. concurrently, the neighborhood size would start at half the map length, and then decrease to 1. the 2nd phase just cleans it up a bit. the neighborhood size stays at 1, and the learning rate might start at .1 and go to .01. after those phases are done, the SOM is trained. you can then look at the clusters that formed to see the patterns it found, or you could give it a new input pattern and have it display which cluster it maps to.
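those 2 phases boil down to a pair of per-epoch schedules. a sketch (linear decay assumed here for simplicity; exponential decay is also common, and the function name is mine) :

```python
def phase_schedules(map_width, phase1_epochs, phase2_epochs):
    # per-epoch learning rate and neighborhood size across both phases
    def decay(start, end, steps):
        # simple linear interpolation from start to end
        return [start + (end - start) * t / float(steps - 1)
                for t in range(steps)]
    # phase 1 : learning rate .9 -> .1, neighborhood half the map -> 1
    # phase 2 : learning rate .1 -> .01, neighborhood stays at 1
    rates = decay(0.9, 0.1, phase1_epochs) + decay(0.1, 0.01, phase2_epochs)
    radii = decay(map_width / 2.0, 1.0, phase1_epochs) + [1.0] * phase2_epochs
    return rates, radii
```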
anyways, that is my current understanding; here are some other articles i recommend
for the SOM code, i ported over some Java and C code. i ended up taking a little code from each and making my own C# library. then i tested it with the simplest program possible (Self-Organizing Nets sample linked above). this involves creating the SOM with 2 Input Nodes and 16 x 16 Map Nodes. the Map Nodes are then randomly initialized to values between -.5 and .5. the 2 Input Nodes represent X and Y values between -1 and 1. so if you continuously train the SOM with random X, Y values between -1 and 1, then the SOM should unravel itself and flatten out into a grid pattern. the upper-left corner will represent -1,1. upper-right will be 1,1. lower-right will be 1,-1. lower-left will be -1,-1. finally, the center will be 0,0. the pic below shows the process in action
the next step was to add another Input Node to make 3. for this i did the 'hello world' program of SOMs, which is to make it cluster color values (Kohonen's Self Organizing Feature Maps sample linked above). the 3 Input Nodes represent RGB respectively. initially, i just fed the SOM 9 color values as input to a 40x40 Node Map. the Node Map was initialized to random values, as pictured below
the 1st PictureBox shows the randomly initialized Map of 40x40 nodes. the 2nd PictureBox shows the map after being trained on 9 colors. the 3rd PictureBox shows the Nodes that were the BMU the most. basically, all it does is keep a count of the # of times a Node wins, and then that count is rendered as an image. the white dots are the Nodes that won the most, and you can see how they correspond to the center points of the Node clusters from the previous image. the 4th PictureBox shows the error of the Map. all it does is go to each node and determine the total difference between its feature vector and those of its neighbors. the Nodes that have the largest difference from their neighbors show up as white. you can see how this corresponds to the edges of the Node clusters as well ... which makes sense
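that error image boils down to a per-node calculation like this sketch (i'm assuming 4-connected neighbors and manhattan distance; other neighborhoods and distance measures work too) :

```python
def node_error(grid, x, y):
    # total difference between this node's feature vector and those of
    # its 4-connected neighbors; large values mark cluster edges
    w = grid[y][x]
    total = 0.0
    for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
        if 0 <= ny < len(grid) and 0 <= nx < len(grid[0]):
            total += sum(abs(a - b) for a, b in zip(w, grid[ny][nx]))
    return total
```

rendering each node's error as a grey value gives the 4th PictureBox above.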
then i changed it a bit (pictured below). the 1st PictureBox shows that i changed it to initialize the Map using a Gradient. the 2nd PictureBox shows that i fed it entirely random colors as input, instead of just the 9 repeated colors from above. this makes it much harder to visualize the win and error images in the 3rd and 4th PictureBoxes respectively.
my next attempt was much more challenging. basically i tried to recreate a research project for Mapping Weblog Communities. this would involve training it to learn a network of blogs and how they link to each other. the 1st step was to find a bunch of weblogs. i did this by downloading the OPML files from a number of large blog aggregators. the 2nd step was to have my program retrieve the RSS for each of the blogs in the OPMLs. the 3rd step was to parse the XML of the RSS and then use Regular Expressions to find the outgoing links of each blog. the 4th step was to determine if the outgoing links were to other blogs that i had retrieved, or if they were to external sites. with the grunt work done, i could create the SOM
out of the blogs used, i had about 200 that linked to each other. so i wanted those 200 blogs to be spread out on a SOM map. decided to make my SOM map 8x8 (initialized with a gradient from -.5 to .5). with this size, if the SOM was perfectly distributed, then there would be about 3 blogs per Node after training. for Input nodes, i had 200. each Input Node represented a blog. if a blog was linked to, then the Input node would be 1, otherwise it would be -1. remember that each Map Node also has a feature vector of length 200 correspondingly.
for training, it just went to each RSS feed and extracted the links. those links were then mapped to the appropriate blog, and the input vector was a long list of 1s and -1s. it was mostly -1s because a blog typically only linked to a few blogs. generally it linked to the same blogs multiple times. my experiment did not take this frequency into account because i did not collect data over a long period of time. also, i did not take into account external links to non-blogs. the trickiest part was canonicalizing the links to figure out which blog each one belonged to. what i had to do was create a list for each blog showing the RSS and HTML links that might belong to it. e.g. for my site alone, i generated this list of possible URLs obtained from parsing the OPML and RSS
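the encoding of one feed into an input vector is the simple part (a Python sketch with a hypothetical helper name) :

```python
def encode_links(all_blogs, linked_blogs):
    # one Input Node per known blog : 1 if this feed linked to it,
    # otherwise -1. link frequency is ignored, as described above
    linked = set(linked_blogs)
    return [1 if blog in linked else -1 for blog in all_blogs]
```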
(OUCH!) so when i had a link within a blog, i would compare it to a list like the one above that i had generated for all blogs. if there was not an exact match, then i would start decomposing that link in the same manner and repeat the search until a match was found. if no match was found, then it was considered an external link and ignored.
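that decomposing search might look something like this sketch (url_index would map every known URL variant to its blog; the function name and example URLs are made up) :

```python
def match_blog(link, url_index):
    # progressively strip path segments off the link until it matches a
    # known URL variant; links that never match are treated as external
    candidate = link.rstrip("/")
    while "/" in candidate:
        if candidate in url_index:
            return url_index[candidate]
        candidate = candidate.rsplit("/", 1)[0]
    return None
```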
er, um ... so i got it to work ... but i never finished writing the code to actually display the results. i could click around on the map and it would display the blogs that it had mapped to that node. from clicking around you could see that most blogs went to one corner, meaning that they had very few links within this community. on the opposite end there were a couple link whores (read Scoble) that linked a lot. i was somewhere in the middle. in the node that my blog was in were 2 other blogs. 1 of them was mostly about gaming, but it was pretty cool and now i'm subscribed. the other was mainly about legal issues and the internet (no, not Lessig). the problem was that this made it too hard to see whether the results were good or not. what i needed to do was go to each Map Node, get the blogs that were mapped to it, and then draw the actual connecting lines between Nodes. instead, i just moved on to something that would be easier to visualize ... and when i say easier to visualize ... i mean hot women :) if you dont like hot women ... then you should stop reading, because we are no longer friends
now we arrive at the real reason i started playing around with this stuff ... the rest was just a build up to pull this off. from seeing the Longhorn demo, and some other apps, i started searching around on how this might be done. one acronym that kept popping up was CBIR : Content-Based Image Retrieval. this lets you search for images based on their content. e.g. you could search for pictures that have Tigers in them. and i dont mean images.google.com. that is just associating images to the text on the web pages. they are searching the text and then returning images that might go along with it. CBIR is actually searching the image itself. and CBIR seems to mean different things to different people. some people want to search an image database with textual queries; while others want to provide starting images, and as you incrementally select the most appropriate starting image, you can drill down to the image you actually want. the example i found of this is PicSOM. some of the basic concepts of PicSOM are what i use below ... thus the name of this article.
the 1st step for this project was to download alot of pictures of hot women.
the next step was to do some image processing on the images so that i could create a feature vector for each image. PicSOM can do 3 different types of image processing : Color, Texture, and FFT. i ended up only doing Color and Texture. for Color, i tried a number of different color schemes : Grey, RGB, HSL, YUV, CIE-L*a*b*. they all seemed to work pretty well, but HSL seemed to work slightly better. granted, that is my own subjective opinion; i would need to extend my test program to actually determine which produced better results. for Texture, all i did was use a simple algorithm to determine how much a pixel differed from its neighbors. if the pixel was mostly like its neighbors, then the image had low texture. if the pixel was a lot different from its neighbors, then texture was high, and the image was probably noisy. NOTE before working with the images, i scaled them to a standardized 50x50 pixel thumbnail
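that texture algorithm is just a per-pixel neighbor comparison. a sketch on a 2D grid of grey values (my own simplification; the real thing operated on the scaled thumbnails) :

```python
def pixel_texture(grid, x, y):
    # texture = average absolute difference between this pixel and its
    # 8-connected neighbors; flat regions score near 0, noise scores high
    h, w = len(grid), len(grid[0])
    diffs = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h:
                diffs.append(abs(grid[y][x] - grid[ny][nx]))
    return sum(diffs) / float(len(diffs))
```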
for both Color and Texture, i did 2 measurements : Histogram and Area. the Histogram was a global operation that distributed the values for the entire picture between a continuum of min and max values. e.g. a Histogram for grey scale had a min value of 0 (black) and a max value of 255 (white). between those values, it was divided up into 16 buckets. each pixel would then be dropped into one of those buckets, incrementing its count. if it was a light image, then the buckets closer to 255 would have the highest counts. the bucket counts of the histogram could then be used as a feature vector of length 16. the other measure used Area. this involved chopping the image up into 3x3 sections. each section would then have its average color and texture value calculated. that list could then be used as a feature vector of length 9. the pictures below show these concepts. the 1st picture is the sample image used. the 2nd pic shows its histogram. you can see how it peaks towards black and then the rest of it is light (remember that GDI+ renders downwards). the 3rd pic shows the area concept. NOTE it would be better to use overlapping sections
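both measurements are only a few lines each. a sketch for grey scale (Python, function names mine; the same idea applies per color channel or to texture values) :

```python
def grey_histogram(pixels, buckets=16):
    # drop each grey value (0-255) into one of 16 buckets and count
    counts = [0] * buckets
    for p in pixels:
        counts[p * buckets // 256] += 1
    return counts

def area_features(grid, sections=3):
    # average value of each of the 3x3 sections -> feature vector of length 9
    h, w = len(grid), len(grid[0])
    feats = []
    for sy in range(sections):
        for sx in range(sections):
            vals = [grid[y][x]
                    for y in range(sy * h // sections, (sy + 1) * h // sections)
                    for x in range(sx * w // sections, (sx + 1) * w // sections)]
            feats.append(sum(vals) / float(len(vals)))
    return feats
```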
with those different image processing techniques, the # of Input Nodes / feature vector size could vary. i wrote the program so that it could work with inputs of the following combinations
for the size of the SOM, i mainly worked with a 10x10 matrix or smaller. this is because i typically trained it with picture sets of 200 and 300 images. so if the SOM was perfectly distributed after training, then that would leave 2 or 3 matching images per Node. the picture sets i used to test with are from fobpro and purehotmodels. here is a thumbnail image from each set : fobpro / purehotmodels to give you an idea of the differences between the two. the fobpro set has 2 to 5 pictures of each model in a similar pose and setting. the purehotmodels set is a bikini competition, so there is a lot more noise and variance between images. with those picture sets i could then train the SOM. after training i could click on a Node of the SOM, and it will display the images that it associated with that Node. if the images displayed are similar ... then it works! a pic of the app is below
the pic shows that i trained it with the HSL color scheme histogram (16) and texture histogram (9). that required 25 Input Nodes and a feature vector of length 25 for each Map Node. the SOM map was of size 10x10 and initialized with a gradient fill. it was trained for 2 phases of 500 iterations each. the rule of thumb that i've seen (and did not follow) is that the 1st phase should be 10x to 100x the # of input samples, and that the 2nd phase should be 2x the 1st phase. the lower-left ASCII print shows the distribution of images across the Map. you can relate that to the 'win' image and see that the white spots have the most images per node. also, from the 'err' image, the light spots on the map have the biggest difference between neighboring nodes. when i clicked on a Node of the SOM map (the maroon grid), the thumbnail images associated to that Node were displayed in the ListView on the right. and as you can see, it shows 2 sets of matches. for the girl in pink, those were the only 2 images of her in the set. for the girl in red, there was 1 other image which happened to be in a neighboring node. so it works! as i click on the different nodes, it constantly shows matching sets of images. some of those other matches from that same run are shown below ... remember, these pics are purely for educational purposes
to display an overview of the entire Map, i generated a collage. all it does is get the best matches for each Node and will render up to the top 4 images. the 1st image is of the fobpro set trained by Color. the 2nd image is purehotmodels trained with Texture. the nodes with 0 matches will display as white
i then went on to test the different image feature vectors individually. the most obvious way to tell they are working is to look at the opposite corners of the resulting map and see how those sets of images differ. the following shows the HSL Color Histogram. one corner was definitely blue, and the other corner was definitely red. below that, the corners were green and black
the next pictures show HSL Color Area. mentally divide the pics up into the 9 sections, and it is obvious that the picture sections match
the following shows the Texture Histogram. the pictures with a lot of girls or noisy backgrounds have a high texture (on the left) because the color keeps changing from pixel to pixel. the close-up pictures, or those with quiet backgrounds, have a lower texture (on the right), so they are mapped together. you can also see how differences of color are ignored
the next set shows Texture Area. if you mentally divide these pictures up into the 9 sections, then you can see how their textures match accordingly. i think the best example is the 2 girls that both have a comic poster over their right shoulders, and a noisy environment to the left
so that was basically just looking at the results of the trained map ... which would be useful to somebody that needed to organize a lot of pictures. a more realistic application would be to find similar pictures when given a starting image. you provide an image to the application, and it returns the closest matches. all i did was randomly remove some of the pictures from the picture sets. then i retrained the SOM without those pictures in it. then when you provide one of those untrained pictures to the map, it searches the ones that it does have and provides the closest matches. the pic below shows the result. the PictureBox on the left is the image that i searched with, which was not included in the trained Map. the thumbnail images on the right are the similar images that it returned.
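the search itself is just a BMU lookup against the trained map. something like this sketch (each node carries the images that mapped to it during training; names are mine, not from my C# library) :

```python
def search(nodes, query_vec, top=2):
    # rank Map Nodes by distance between their feature vector and the
    # query image's feature vector; the images attached to the nearest
    # nodes are the returned matches
    def dist(n):
        return sum((w - q) ** 2 for w, q in zip(n["w"], query_vec))
    matches = []
    for node in sorted(nodes, key=dist)[:top]:
        matches.extend(node["images"])
    return matches
```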
to make this even more fun, and possibly provide the idea for an even cooler application ... i broke out my Tablet PC. suppose you want to find a picture you know exists in your photo library. you kind of remember what the scene looks like, but you dont have any similar pictures to search for. so what do you do? my concept is that you draw a mockup of the image putting objects in the correct space and using similar colors. if you make the Map forgiving enough, then it should be able to go off that badly drawn image to find the actual photo you are searching for. right now, i can only get this to work off of the RGB Color Area feature vector. texture will not work because i'm just drawing with the primary colors, so every pixel is the same and there is minimal texture. some of the color schemes will not work for a similar reason. if this was improved to do object recognition ... then this idea becomes quite feasible. and just plain cool if you ask me. has anybody else thought of (or implemented) this yet? this might be an original idea ...
out of the photo sets i'm testing with, i know that there are numerous pictures of people standing in front of a white background. so about all i do is create a new image and draw human stick figures using flesh tone colors on a white canvas. then i can search the picture sets to return those images. the series of pics below shows that it works much better than i expected! so out of 200 to 300 pictures, and my bad drawing ... it was able to return the correct results
the drawing that i submitted is on the left. the matches are on the right. the particular image i was searching for is the 2 girls with their arms behind each other's backs in the lower-right.
this one has a blue background to represent sky with just a stick figure (that resembles Mr. Hanky)
this drawing of wonder woman had a couple of exact matches. NOTE how i drew boobies on wonder woman to improve the search results :)
exact match for orange bikini is bottom-right
dressed in black. these were similar, but not the exact match searched for
cheerleader. exact match is top-right. NOTE how the drawing is not outlined. it just puts the colors in the general area they should be
this one was a pleasant surprise. i accidentally submitted an image to be found that was a blank white canvas. it ended up returning the pics of this girl hiding behind a white poster board. how amazing is that!
the final step of this project was to download more pictures of hot women
this used as simple a SOM setup as possible. there are at least a couple other structures that would provide better results. i believe that PicSOM uses multiple SOM maps. e.g. 1 SOM for color, 1 for texture, 1 for FFT. this makes sure that each SOM is trained best for what it is clustering. when doing a search it returns the best matches out of the multiple SOMs, possibly doing Set operations as well. i just used one SOM and concatenated the feature vectors instead. if i were to do multiple SOMs for this project, then there would be 1 SOM each for color histo, color area, texture histo, and texture area. there could be different SOMs for each color scheme as well
another approach is called the GHSOM (Growing Hierarchical SOM). a SOM can grow by adding more nodes to the Map. the approach i've read about is that you grow the SOM based on error. if an input maps to a BMU, but the distance is further away than some error threshold you specify, then a Map Node could be created in that area to represent that novel input, instead of forcing it into the existing Map Node. the hierarchical aspect is based on having multiple SOMs of varying abstraction. suppose that the map is being trained for Color grouping and the results are well distributed except for one Map Node, the Color red. the red Map Node keeps winning a high percentage of the time. another child SOM could be created and associated to the red Map Node which would cluster the different shades of red. this would keep the results of the parent SOM from being skewed red, and the child SOM would have more granular results
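the growth trigger can be expressed as a simple check. a sketch of the idea only (the actual GHSOM algorithm tracks accumulated quantization error rather than a single sample's distance) :

```python
def should_grow(bmu_weights, sample, error_threshold):
    # if the input lands further from its BMU than the threshold allows,
    # a growing SOM would insert a new node near the BMU instead of
    # forcing the novel input into the existing Map Node
    err = sum((w - s) ** 2 for w, s in zip(bmu_weights, sample)) ** 0.5
    return err > error_threshold
```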
the samples above worked with input sets of 200 to 300 pictures on a 10x10 matrix. to train this map for 1000 iterations on a P4 took about 2 minutes. NOTE that i did not write the code to be performant. i also worked with picture sets of 1000 pictures that took 6 minutes to load. that picture set was based on an images.google search keyword, so they were of the same genre but the pictures were much more varied (i.e. almost all of the pictures were unrelated). the results were that most of the Map Nodes contained pictures that were mostly similar as well. out of the random pictures it did group the few pairs of matches that were found. on the other hand, there were some Nodes where it was much harder to visualize how the clustered pictures were grouped together, or they would all look similar in a group except for 1 or 2 that looked out of place. my guess is that the size of the map was too small and the data was not related enough. to help understand, i extended the application to sort the thumbnails in the ListView based on their error distance to the current node. most of the images that looked out of place had the highest error ... so they would have been good candidates to add another Map Node to a growing SOM. i also tried an input set of 5000 pictures (same genre, but mostly unrelated pictures). it took about 30 minutes to load. this made the Map Nodes too crowded to really see a relationship, and increasing the Map size would take longer to train. i think this is where a GHSOM really comes into play, so that the training could be done incrementally.
below is a small collage of the 5000 pic SOM (15x15 lattice). TopLeft is blue. TopRight = gray, BottomLeft = white, BottomRight = ?. if you couldn't tell, the images.google search word was 'bikini'. plus if you stare at it real hard (and take LSD) then there is a magic image to be seen ...
that shows how to use a Self Organizing Map to find similar pictures. the incredible part is that it works well enough to find horribly drawn stick figure images. i worked on this for one week : 2 days of reading about SOMs, 2 days of implementing the basics, 1 day on the Blog SOM, and 2 days on the Picture SOM. a video of it working is below
video (2 megs)
not giving away the source. there is already a C# SOM implementation (linked above) that you can play with
more AI. probably a different media format. later