/aiCaptcha


Using AI to beat CAPTCHA and post comment spam

L337 subtitle = How 2 0wnz blogz :)

http://www.brains-N-brawn.com/aiCaptcha 1/30/2005 casey chesnut

NOTE do not contact me for the code. i will not release this code. it doesn't matter if you're a spammer, an academic, or somebody wanting to test a CAPTCHA you've written ... the code is not going to be released. don't even write to ask for a portion of the code, e.g. the line thinning code ... none of it will be released.

Introduction

(if you were one of the 94 people i comment spammed) sorry about that, and i hope that you are not pissed. if you are new to my site, then you must realize that i like to stir things up every once in a while. if you've been here before, then i'm hoping you've got a smile on your face, and sort of expect stuff like this from me :) anyways, you were targeted for 2 reasons. 1) because your blog uses CAPTCHA to provide a false sense of security. 2) because we are members of the same group. so i know a handful of you (and know of most of you). i could easily have done this against a bunch of strangers ... but did not think that was a good idea. this is just my way of saying that we've got more work to do. i will not be comment spamming you anymore. unless you comment spam me back in retaliation ... and then i'll have to blast you out of the water ... just kidding.

also, i have definitely offended at least one person. if you are offended, you can contact me directly (after you cool down, please). when you do contact me, please do not mass email everybody, because by doing that you are sending out exponentially more emails than the 94 comment spams that i posted :) also, if you are not offended, i would really appreciate it if you would leave a comment to help keep this in perspective. Thank you, casey

this article is about writing a comment spam bot. it ended up posting 94 comment messages to CAPTCHA protected blog pages in 10 minutes. all it does is visit a blog post and download the associated CAPTCHA image. then it uses some image processing techniques to parse the individual characters out of the image. each character is then run through some AI processing to figure out what letter the character image represents. finally, with the recognized string, it posts the comment spam to the blog engine. i wrote it for a couple of reasons ... mainly to show that rel='nofollow' and CAPTCHA are false protection from comment spam.

rel='nofollow'

the 1st reason is that i was pissed off by the whole rel='nofollow' thing that blog engines have been implementing. i think it's a total hack, and wanted to submit it to The Daily WTF. if you don't know, it's an attribute that the major search engines are now looking for in hyperlink tags (e.g. <a href="..." rel="nofollow">). if a link carries this value, then it will not be used for page ranking. so it is supposed to take away the motivation for comment spammers to post. but i don't think that is what will actually happen. for some of the reasons why, see this article : A step toward solving comment spam? instead, i think the search engines should work on making their algorithms better. i also think that blog engines should work more on protecting themselves. one way for a blog to protect itself is to write a custom comments engine. with everybody blogging on the same engines, it makes it real easy for a comment spammer to target a group of people with one shot. this is the exact same problem that Microsoft has with its products. they are so successful and used everywhere that they become a big target. people just ignore it when Linux actually has more security flaws found against it. another way that blogs can protect themselves is to use CAPTCHA.

CAPTCHA

which stands for : completely automated public Turing test to tell computers and humans apart. pretty self-explanatory that we are running out of 3 and 4 letter acronyms. you might be familiar with CAPTCHA from ticket buying sites, or from signing up for a free email account. most CAPTCHA implementations today present an image to the user with characters or words in it. the image has some noise applied to it to make the characters more difficult to read. this makes it hard for a computer to parse the characters out, but is still relatively easy for a human to read. granted, i've certainly seen a number of CAPTCHA images that i could not even make out. not to mention they make your site inaccessible to blind users that would otherwise be using a screen reader. although there is a proposal for an accessible CAPTCHA. anyway, my good friend happens to have his blog on a site that uses CAPTCHA. when i looked at it, i was pretty certain i could apply the AI techniques i've been teaching myself to beat it. so that is my 2nd reason for writing this article ... because i could.

an example CAPTCHA image

SomeBlogSite.com

a quick trip to their home page showed that they hosted about 100 blogs. a quick random sampling of those blogs showed that each of them used the same CAPTCHA setup. so i figured i could write one bot program to spam all of them. a little more investigation showed that some of the blogs did not have any blog posts on them ... thus nothing to spam. and there were some blogs that had comments turned off altogether ... e.g. the 1st one at the top of the list, belonging to the Security lead appropriately enough. the home page also lists other blogs. a lot of them also run on .TEXT and could be spammed by the same bot, but they did not have CAPTCHA set up ... so i considered those too easy to even bother with.

started manually posting some comments, and used a TCP sniffer to see what was going on behind the scenes. i then wrote a .NET client to make those same requests. 1st it requests the blog posting as HTML, then it parses out the url to request the CAPTCHA image and displays the image in the WinForm. playing around with this, i discovered some time limits. first, after retrieving the blog page, you have 1 minute to retrieve the CAPTCHA image. if you don't retrieve it within that minute, then that request becomes invalid. second, after retrieving the image, you only have 1 minute to write your comment and post it. so picture the scenario : you are reading a blog and writing your comment, and a couple minutes pass. even if you enter the CAPTCHA answer correctly, the request will fail because a minute has passed, and the page will reload displaying a new CAPTCHA image and the comment that you tried to submit. you can then enter the new CAPTCHA answer and resubmit your comment. this really sucks as far as user experience goes ... you definitely need more time to write your post and submit it. nor does it make anything more secure from a bot's perspective. the bot can parse out the characters in a matter of seconds before posting its comment spam.
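
i'm not releasing the real client, but a generic sketch of that kind of plumbing in .NET looks something like this. the post url and the image url regex below are made-up placeholders ... the real values came out of the TCP sniff:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class CaptchaFetcher
{
    // fetch a page of HTML, keeping the cookies the server hands out
    // so the CAPTCHA request is tied to the same session
    static string GetPage(string url, CookieContainer jar)
    {
        var req = (HttpWebRequest)WebRequest.Create(url);
        req.CookieContainer = jar;
        using (var resp = req.GetResponse())
        using (var reader = new StreamReader(resp.GetResponseStream()))
            return reader.ReadToEnd();
    }

    static void Main()
    {
        var jar = new CookieContainer();
        // placeholder url ... the real list came from the OPML scrape
        string html = GetPage("http://SomeBlogSite.com/someblog/archive/post.aspx", jar);

        // the image url pattern here is a guess; the real page needs
        // its own regex, pulled from a sniff of a manual post
        Match m = Regex.Match(html, @"src=""(?<img>[^""]*CaptchaImage[^""]*)""");
        if (!m.Success) return;

        var req = (HttpWebRequest)WebRequest.Create("http://SomeBlogSite.com" + m.Groups["img"].Value);
        req.CookieContainer = jar;   // must arrive within the 1 minute window
        using (var resp = req.GetResponse())
        using (var input = resp.GetResponseStream())
        using (var file = File.Create("captcha.gif"))
        {
            byte[] buf = new byte[4096];
            int n;
            while ((n = input.Read(buf, 0, buf.Length)) > 0)
                file.Write(buf, 0, n);
        }
    }
}
```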

Parsing CAPTCHA

now that the app can retrieve CAPTCHA images, the next step is to parse them. taking a look at a number of images showed some consistent patterns in how the CAPTCHA images were put together.

CAPTCHA image within WinForm

the first step was to determine a darkness threshold for the image. the app walks horizontally across the middle of the image and records the lightest and darkest pixels. it then backs off a little from the darkest pixel, and any pixel darker than that cutoff is considered to be part of a character. with the threshold set, it does a horizontal sweep across the middle of the image. when it encounters a dark pixel (representing part of a character), it uses a fill algorithm to find all the pixels for that character. then it continues the horizontal sweep from the other side of that character, until it encounters the next character. in this manner, it finds every pixel for each of the 6 characters.

discovering individual characters
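
here is a rough sketch of that sweep-and-fill logic. this is a generic reconstruction rather than the withheld code, and the threshold constant ("back off a little" from the darkest pixel) is a guess that would need tuning against real images:

```csharp
using System.Collections.Generic;
using System.Drawing;

class Segmenter
{
    public static List<List<Point>> FindCharacters(Bitmap bmp)
    {
        int y = bmp.Height / 2;

        // walk the middle row recording the lightest and darkest pixels
        int min = 255, max = 0;
        for (int x = 0; x < bmp.Width; x++)
        {
            int v = Brightness(bmp.GetPixel(x, y));
            if (v < min) min = v;
            if (v > max) max = v;
        }
        int threshold = min + (max - min) / 4;   // back off a little from darkest

        var chars = new List<List<Point>>();
        var seen = new bool[bmp.Width, bmp.Height];
        for (int x = 0; x < bmp.Width; x++)
        {
            if (seen[x, y] || Brightness(bmp.GetPixel(x, y)) > threshold) continue;
            var pixels = new List<Point>();
            Fill(bmp, x, y, threshold, seen, pixels);   // flood fill one character
            chars.Add(pixels);
            // the sweep continues naturally past the filled character,
            // because its pixels are now marked as seen
        }
        return chars;
    }

    static int Brightness(Color c) { return (c.R + c.G + c.B) / 3; }

    // iterative 4-way flood fill over dark pixels
    static void Fill(Bitmap bmp, int x, int y, int threshold, bool[,] seen, List<Point> pixels)
    {
        var stack = new Stack<Point>();
        stack.Push(new Point(x, y));
        while (stack.Count > 0)
        {
            Point p = stack.Pop();
            if (p.X < 0 || p.Y < 0 || p.X >= bmp.Width || p.Y >= bmp.Height) continue;
            if (seen[p.X, p.Y] || Brightness(bmp.GetPixel(p.X, p.Y)) > threshold) continue;
            seen[p.X, p.Y] = true;
            pixels.Add(p);
            stack.Push(new Point(p.X + 1, p.Y));
            stack.Push(new Point(p.X - 1, p.Y));
            stack.Push(new Point(p.X, p.Y + 1));
            stack.Push(new Point(p.X, p.Y - 1));
        }
    }
}
```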

it can also do multiple horizontal sweeps at varying heights, in case the 1st sweep does not intersect all of the characters. the only time this fails is when 2 of the characters are touching. when that does happen, it usually involves the 'f' character. regardless, this failure only happened around 5% of the time, which i considered acceptable.

showing 2 characters that are connected

Character Tricks

so now, instead of trying to recognize the string of 6 characters, we just have to recognize each character individually. the good news is that there are only 16 characters to recognize. the bad news is that they are skewed enough to make recognition somewhat difficult. what immediately popped into my head was to use a neural network. NNs were made for this sort of thing. but before i could do that, i would have to generate some training data. i extended the app so that it would render the collected pixels for each character. then i could manually enter the value for each character individually and save each one to an image file.

saving off training data

and here are some examples of what the skewed characters look like. you can see that they vary based on font, size, thickness, rotation, and some warp factor. this makes some of the characters hard for me to tell apart myself :). some of the b's look just like 6's and some of the 1's look like 7's (and vice versa for both).

variation between character images

with the data in hand, i went about converting it into some feature vector that could be used as input to a neural net. based on past experience, i just started writing a whole bunch of methods to transform the characters and pull feature info out of them. i did this because i'm not advanced enough with neural networks to explicitly know what will or won't be good training data. i do have some idea, but i'm not that confident yet. anyway, the 1st method i wrote generates the outline of the character. honestly, i wrote this by accident when i was really trying to write an algorithm to generate the skeleton of a line. the accidental outlining algorithm ended up being a keeper, and it took me more than a couple of tries to get a line skeleton algorithm to work decently. even then, my line skeletonization algorithm could use some work, because it sometimes has random line segments branching off that should not be there.

character pre-processing
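
for reference, one common way to get an outline is to keep any character pixel that touches at least one background pixel. this is a generic sketch of that standard approach, not the accidental algorithm mentioned above:

```csharp
using System.Collections.Generic;
using System.Drawing;

static class OutlineSketch
{
    // keep any character pixel with at least one non-character neighbor
    public static List<Point> Outline(HashSet<Point> character)
    {
        var outline = new List<Point>();
        foreach (Point p in character)
        {
            bool edge = false;
            for (int dx = -1; dx <= 1 && !edge; dx++)
                for (int dy = -1; dy <= 1 && !edge; dy++)
                    if ((dx != 0 || dy != 0) && !character.Contains(new Point(p.X + dx, p.Y + dy)))
                        edge = true;
            if (edge) outline.Add(p);
        }
        return outline;
    }
}
```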

the single line representations turned out to be useful, because i wrote an algorithm to walk the line and find all the endpoints. the pic below shows the endpoints marked with red pixels. attempted to write an algorithm to find the intersections too, but it was far from successful.

line endpoint detection
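
the endpoint check itself is simple once you have a one-pixel-wide skeleton : an endpoint is a line pixel with exactly one neighbor in its 8-neighborhood (an intersection would be a pixel with 3 or more neighbors, which is the part that never worked right). a minimal sketch:

```csharp
using System.Collections.Generic;
using System.Drawing;

static class EndpointSketch
{
    // endpoints of a one-pixel-wide line: exactly one neighbor set
    public static List<Point> Endpoints(HashSet<Point> skeleton)
    {
        var ends = new List<Point>();
        foreach (Point p in skeleton)
        {
            int neighbors = 0;
            for (int dx = -1; dx <= 1; dx++)
                for (int dy = -1; dy <= 1; dy++)
                    if ((dx != 0 || dy != 0) && skeleton.Contains(new Point(p.X + dx, p.Y + dy)))
                        neighbors++;
            if (neighbors == 1) ends.Add(p);
        }
        return ends;
    }
}
```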

Recognizing Characters

with some different character representations available, i went about generating feature vectors. the 1st one i did was to just break the image up into a grid of cells (e.g. 10 x 10). each cell was then assigned a value for how many pixels fell within it. this could be run against all of the character representations above (original, outline, skeleton). the 2nd feature vector involved finding a slope associated with each pixel. it visits each pixel, traverses the line away from that pixel for a certain distance (e.g. 5 pixels), and then records the slope between those 2 points on the originating pixel. this could only be applied to the outline and skeleton representations. finally, i combined the grid and slope methods. all that did was find the average slope for each cell.
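
a minimal sketch of the grid density feature vector ... a 10 x 10 grid gives 100 input values, which lines up with the input node sizes mentioned below. again a generic reconstruction, not the withheld code:

```csharp
using System.Collections.Generic;
using System.Drawing;

static class GridFeatureSketch
{
    // overlay a grid x grid layout on the character's bounding box and
    // record the fraction of character pixels that fall in each cell
    public static double[] GridFeatures(List<Point> pixels, int grid)
    {
        int minX = int.MaxValue, minY = int.MaxValue, maxX = int.MinValue, maxY = int.MinValue;
        foreach (Point p in pixels)
        {
            if (p.X < minX) minX = p.X;
            if (p.X > maxX) maxX = p.X;
            if (p.Y < minY) minY = p.Y;
            if (p.Y > maxY) maxY = p.Y;
        }
        double w = maxX - minX + 1, h = maxY - minY + 1;
        double[] features = new double[grid * grid];
        foreach (Point p in pixels)
        {
            int cx = (int)((p.X - minX) / w * grid);   // cell column, 0..grid-1
            int cy = (int)((p.Y - minY) / h * grid);   // cell row, 0..grid-1
            features[cy * grid + cx]++;
        }
        for (int i = 0; i < features.Length; i++)
            features[i] /= pixels.Count;               // normalize to fractions
        return features;
    }
}
```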

then i set up the neural network and started throwing those feature vectors at it. since i could do supervised training, i used a backpropagation NN. the input node size ranged from 50 to 100 nodes, depending on the feature vector used. the output node size was 16 (0-9, a-f). i varied the hidden layer node size from 1/2 the size of the input layer to twice its size. i attempted to train it with different combinations of the feature vectors above ... but i had some problems. if i used a small number of input patterns (e.g. less than 100 characters), then the neural net would train successfully, but it did not perform well at runtime for recognizing the skewed characters. if i used a large number of input patterns (e.g. 250 characters), then the neural net would not converge during training. i.e. it would not successfully match all of the input data, even though the total error kept dropping.
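
for anybody who has never seen one, a bare-bones single-hidden-layer backpropagation net is small enough to sketch. this is the generic textbook version (sigmoid units, online updates), not the code from this project:

```csharp
using System;

class BackPropNet
{
    readonly int nIn, nHid, nOut;
    readonly double[,] wIH, wHO;   // weights input->hidden, hidden->output
    readonly double[] hid, outp;   // layer activations
    readonly Random rng = new Random();

    public BackPropNet(int nIn, int nHid, int nOut)
    {
        this.nIn = nIn; this.nHid = nHid; this.nOut = nOut;
        wIH = new double[nIn + 1, nHid];   // extra row holds the bias weight
        wHO = new double[nHid + 1, nOut];
        hid = new double[nHid];
        outp = new double[nOut];
        for (int i = 0; i <= nIn; i++)
            for (int j = 0; j < nHid; j++) wIH[i, j] = rng.NextDouble() - 0.5;
        for (int j = 0; j <= nHid; j++)
            for (int k = 0; k < nOut; k++) wHO[j, k] = rng.NextDouble() - 0.5;
    }

    static double Sigmoid(double x) { return 1.0 / (1.0 + Math.Exp(-x)); }

    public double[] FeedForward(double[] input)
    {
        for (int j = 0; j < nHid; j++)
        {
            double sum = wIH[nIn, j];                      // bias
            for (int i = 0; i < nIn; i++) sum += input[i] * wIH[i, j];
            hid[j] = Sigmoid(sum);
        }
        for (int k = 0; k < nOut; k++)
        {
            double sum = wHO[nHid, k];                     // bias
            for (int j = 0; j < nHid; j++) sum += hid[j] * wHO[j, k];
            outp[k] = Sigmoid(sum);
        }
        return outp;
    }

    // one online backprop step; returns squared error for this pattern
    public double Train(double[] input, double[] target, double rate)
    {
        FeedForward(input);
        var dOut = new double[nOut];
        double err = 0;
        for (int k = 0; k < nOut; k++)
        {
            double e = target[k] - outp[k];
            dOut[k] = e * outp[k] * (1 - outp[k]);         // sigmoid derivative
            err += e * e;
        }
        var dHid = new double[nHid];
        for (int j = 0; j < nHid; j++)
        {
            double sum = 0;
            for (int k = 0; k < nOut; k++) sum += dOut[k] * wHO[j, k];
            dHid[j] = sum * hid[j] * (1 - hid[j]);
        }
        for (int k = 0; k < nOut; k++)
        {
            for (int j = 0; j < nHid; j++) wHO[j, k] += rate * dOut[k] * hid[j];
            wHO[nHid, k] += rate * dOut[k];
        }
        for (int j = 0; j < nHid; j++)
        {
            for (int i = 0; i < nIn; i++) wIH[i, j] += rate * dHid[j] * input[i];
            wIH[nIn, j] += rate * dHid[j];
        }
        return err;
    }
}
```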

after a little thought, i came to the conclusion that the 16 output nodes were the problem. i needed to increase the possible outputs to better reflect the skewed character input set. so instead of just having one output node for the number 1, i would have an output node for a 1 tilted to the right, one for it tilted to the left, and one for it properly oriented straight up and down. instead of manually choosing which characters would be in the expanded output set, i started to write a program to pick them out for me. basically, it was going to read all of the input characters into a collection with their feature vectors, and then pick out character representations across some range, to get them somewhat evenly spaced between each other ... but then i got lazy.

instead, i just used this collection of feature vectors directly (and did not use the neural network). i would input my unknown character's feature vector, and it would return the value of the stored feature vector that was the closest match. i also tied this to the sample gathering function, so that i could request a new sample CAPTCHA image and let my program make a guess for each character. then i only had to correct the ones that were wrong and save those samples to be used in the next round. a couple rounds of gathering samples like this and the results improved quite a bit.
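
the nearest neighbor lookup is about as simple as classifiers get : store every labeled sample's feature vector and return the label of the closest stored vector. a minimal sketch, using squared euclidean distance:

```csharp
using System.Collections.Generic;

class NearestNeighbor
{
    public List<KeyValuePair<char, double[]>> Samples = new List<KeyValuePair<char, double[]>>();

    public void Add(char label, double[] features)
    {
        Samples.Add(new KeyValuePair<char, double[]>(label, features));
    }

    public char Classify(double[] unknown)
    {
        char best = '?';
        double bestDist = double.MaxValue;
        foreach (KeyValuePair<char, double[]> s in Samples)
        {
            double d = 0;
            for (int i = 0; i < unknown.Length; i++)
            {
                double diff = unknown[i] - s.Value[i];
                d += diff * diff;              // squared euclidean distance
            }
            if (d < bestDist) { bestDist = d; best = s.Key; }
        }
        return best;
    }
}
```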

better yet, i also integrated the endpoint finding discussed above. instead of just having one single collection for all feature vectors, i made multiple collections based on how many endpoints the skeleton representation had. so the 0 and 8 feature vectors would be in one collection, because neither has any endpoints, and the number 3 would be in a collection for characters with 3 endpoints. i mention 3 and 8 specifically, because the program was having problems telling them apart without the endpoint check. using this combination of endpoint counting and nearest neighbor feature vector comparison, the program was able to correctly guess the text representation of a character image most of the time. it's still not perfect, but it is accurate enough that it will correctly guess the entire CAPTCHA string about 1 out of 2 times.
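
bucketing by endpoint count is just a dictionary in front of the nearest neighbor matcher from the previous sketch, so a 0 or an 8 (no endpoints) never gets compared against a 3 (three endpoints):

```csharp
using System.Collections.Generic;

class BucketedClassifier
{
    // one NearestNeighbor matcher per endpoint count
    Dictionary<int, NearestNeighbor> buckets = new Dictionary<int, NearestNeighbor>();

    public void Add(char label, double[] features, int endpointCount)
    {
        NearestNeighbor nn;
        if (!buckets.TryGetValue(endpointCount, out nn))
            buckets[endpointCount] = nn = new NearestNeighbor();
        nn.Add(label, features);
    }

    public char Classify(double[] features, int endpointCount)
    {
        NearestNeighbor nn;
        return buckets.TryGetValue(endpointCount, out nn) ? nn.Classify(features) : '?';
    }
}
```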

so defeating the CAPTCHA turns into recognizing an individual character 6 times in a row. with this technique, character recognition is just over 90%. so across the 6 character string, this multiplies out to just better than 50% accuracy (0.9^6 ≈ 0.53). if i wanted to improve accuracy, the 1st thing i would do is work on the line thinning algorithm so that it did not add spurious endpoints. the next thing i would do is get a working intersection algorithm. with those 2 sets of feature points for a single character, you could very easily determine the real value of an image character. my guess is that it would increase the overall CAPTCHA recognition to well over 90% accuracy. you could also take the brute force approach and just add more sample data. i ended up with over 1000 sample characters, i.e. about 60 samples for each of the 16 characters. that was overkill, and just a result of too much testing on my part.

recognized CAPTCHA image

Posting Comment SPAM

now the program can download the CAPTCHA image for a blog posting, parse that image, and determine what its answer is. then all the program has to do is POST an HTTP request with the appropriate name value pairs and the comment spam. the program also has to get the response back from the website to see if it correctly guessed the CAPTCHA, or if it needs to try again ... which is the true beauty of the bot. even if it has to try 10 times, that 10th time is all you need, and it can perform those 10 operations faster than i could on my own. oh ... how i love automation (pronounced auto (as in car) mat (as in door) ion (as in atom particle)).
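
a hedged sketch of the posting step. the form field names (name, body, captchaAnswer) and the success check are invented placeholders ... the real ones came from sniffing the manual posts and would differ per blog engine:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

class CommentPoster
{
    public static bool TryPost(string postUrl, string captchaGuess, int maxTries)
    {
        for (int attempt = 0; attempt < maxTries; attempt++)
        {
            // placeholder field names; the real name/value pairs came from the sniff
            string form = "name=spambot&body=hello&captchaAnswer=" + captchaGuess;
            byte[] data = Encoding.ASCII.GetBytes(form);

            var req = (HttpWebRequest)WebRequest.Create(postUrl);
            req.Method = "POST";
            req.ContentType = "application/x-www-form-urlencoded";
            req.ContentLength = data.Length;
            using (Stream s = req.GetRequestStream()) s.Write(data, 0, data.Length);

            string response;
            using (var resp = req.GetResponse())
            using (var reader = new StreamReader(resp.GetResponseStream()))
                response = reader.ReadToEnd();

            // placeholder success check; the real bot scraped the response
            // page to see if the comment actually showed up
            if (response.IndexOf("Comment posted") >= 0) return true;

            // on failure: re-fetch the page and CAPTCHA, re-guess, try again
            captchaGuess = GuessNextCaptcha(postUrl);
        }
        return false;
    }

    // stub standing in for the download + segment + recognize pipeline above
    static string GuessNextCaptcha(string postUrl) { return ""; }
}
```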

went ahead and did some preprocessing before the attack. wrote a quick little program to parse the OPML for all of the hosted blogs. then scrapped the people from that list that did not allow comments or that had not made any posts. that gave me a rough list of 99 blogs. then i had the program visit each of those blogs and find the URL of the last post they made. with that info, i'm now prepared to start the attack.
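
pulling the blog urls out of an OPML file is only a couple of lines with System.Xml. the OPML location below is a placeholder, and the htmlUrl attribute is an assumption based on how OPML outlines usually carry blog links:

```csharp
using System;
using System.Xml;

class OpmlScraper
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.Load("http://SomeBlogSite.com/opml.aspx");   // placeholder url
        // OPML outlines conventionally put the blog link in htmlUrl
        // and the feed in xmlUrl
        foreach (XmlNode node in doc.SelectNodes("//outline[@htmlUrl]"))
            Console.WriteLine(node.Attributes["htmlUrl"].Value);
    }
}
```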

Day of the Attack!

BEFORE so i think everything is set up now and ready to roll. in theory, all i should have to do is press the red spam button, and then the bot will go down the list of latest blog entries and insert my canned entry. but since the internet is involved, there are a number of unknowns. 1) will my program actually work? i've tested most of it, but there is some code that can't easily be tested until game day. 2) ip blocking. it's not likely, but possible that the web site could have code to handle DoS / comment spam attacks like this. maybe it sees my name show up N times and then blocks me from then on. my hope is that this is not the case, and they are relying entirely on CAPTCHA. 3) different configurations. i've only tested against 2 different blogs. it is possible that all the other blogs have some different setup that my program is unable to handle. maybe i've been testing against the exceptional cases from the beginning. 4) murphy ... something is bound to break. 5) if i were a real bad guy ... then i should be doing this from a publicly available WiFi point. just so happens that our local Panera offers free WiFi while you eat. instead, i'm going to be doing this from the comfort of my home.

innocent victims blog post before spam bot

i'm just a little nervous about this. i'm worried that the spamees could take it the wrong way, or that the person who wrote the CAPTCHA stuff will get pissed. i really hope that is not the result. my intention here is to demo the weakness of CAPTCHA and show that the end of comment spam is nowhere in sight. to fully make this point ... somebody has to get spammed. that is just the way it is.

AFTER the attack took 10 minutes. out of a list of 99 blog posts, it successfully posted comments to 94. the 5 that were missed either did not allow comments or the bot failed to guess the CAPTCHA image five times in a row. my guess is that most of them did not have comments turned on. i say this because the bot succeeded 94 times out of 212 attempts, and about a quarter of those failures were retries against the sites that did not allow comments. so the accuracy for guessing the CAPTCHA was above 50%, probably around 66%, meaning it would correctly guess the CAPTCHA at least 1 out of every 2 attempts.

showing spam bot results

the above pic shows the results after the attack. the pic below shows the spam comment that it posted.

innocent victims blog after spam bot

somebody beat me to the punch with this one

Harder CAPTCHA

regardless of whether the attack went smoothly or not, i think it's my responsibility to explain how the CAPTCHA could be improved. there are a number of changes that would make it much more difficult for my comment spam bot to work. not that i don't think i could extend my current spam bot to handle a lot of those cases too. e.g. here are some articles where they have beaten much harder CAPTCHAs than this one : Breaking a Visual CAPTCHA and PWNtcha - captcha decoder. a harder CAPTCHA would take more time to write a program to beat ... i only spent about 24 coding hours putting this one together. it would also be a harder problem to solve, so fewer people would be able to write that bot.

some of this might have already been done : Clearscreen SharpHIP HIP-CAPTCHA Control for ASP.NET. update! SomeBlogSite.com should also implement rel='nofollow'. it was not my intention to increase my googlejuice by doing this ... but it probably will ... and i'll take it. SomeBlogSite.com might also change the handling of the CAPTCHA image on the server side. i think it is valid to require that the client download the image within one minute of viewing a post, but the timeout to respond should be increased, so that people can actually read the post and write a reply before the timeout occurs. at least stretch that out to 5 or 10 minutes.

Social Engineering

i hate social engineering. it involves manipulating people into solving some problem for you. in the case of CAPTCHA, this can be done easily by providing free porn : Solving and creating captchas with free porn. so even if you come up with a CAPTCHA that a computer can't solve, i'll just make a nudie site and get other people to solve it for me. once they solve it, my site can continue to blast you with comment spam. the sickening part is that it would be easier to build a website like that than it is to write a bot. the only hard part is getting people to actually come to your site and be compelled to answer the CAPTCHA for you.

the same has been true for a couple of AI related things i've done. my web spider has a decent amount of AI in it now. it goes out and crawls around the web looking for content, and it does a real good job. the problem is that there is a public web site that does the same thing. instead of going out and pulling in content, its model is to just sit there and have people push content to it. and people push more content to it than my spider can get on its own. that really disturbs me. at least there are some AI problems that i don't see being solved with social engineering ... yet. such as barcode, handwriting, and speech recognition.

Conclusion

that shows how easy it is to write a spam bot to beat CAPTCHA. even with popular blog sites and search engines implementing both rel='nofollow' and CAPTCHA, i think it's clear that we are not even close to stopping comment spam. this was an automated tool ... i just hit a button and let it run. there are certainly people that will spam comments manually as well. i know my own site has had that happen ... and there is no way for CAPTCHA to stop that. the bigger problem is that SomeBlogSite.com is exponentially more secure than most blog sites that don't have any form of CAPTCHA whatsoever. it would take absolutely no time to write a program to mass spam the publicly available blog engines that everybody is using without CAPTCHA. so even though i chose SomeBlogSite.com ... there are many more sites out there that are lower hanging fruit. now for the bad part ... my site is entirely vulnerable. i could easily blast my own site out of the water in about 10 minutes. not good.

i also need to kick a shout out to Nick P., who is putting together www.HaloRanking.com. if you think you're the Master Chief ... then prove it! he helped a lot with brainstorming some of the image processing algorithms and with coming up with input vectors to train the neural net. he also had a sweet crypto related idea that would have made this self training, if the pieces had been put into place differently.

Source

definitely not releasing this code until SomeBlogSite.com improves the strength of their CAPTCHA. even after that time, i probably will not release this code ... just so that the bad guys will have to waste more of their own time.

Updates

none planned

Future

there is a lot of different stuff i want to write now. probably something WS related next.