For our fourth episode, we decided to try making a long, in-depth show about those squiggly word puzzles you find all over the internet, called CAPTCHAs. This is our first show that contains interviews, including one with the happy fellow you see above, Dr. Andrei Broder, the Chief Scientist at Yahoo!. You'll hear from him quite a bit in this episode.

This show is almost 50 minutes long. We hope you enjoy it. Right now we’re thinking about this as sort of a special occasion. Most of our shows will likely be shorter — mostly because they’re easier to make (Nat spent over 100 hours on this one). Unless you tell us long is the way to go!


And on that note, we’d love to get your feedback on this show in the comments below. Constructive criticism and gushing encouragement are all welcome!


If you want to learn more about the topics we discussed, here are some handy links.


The Interviewees

Dr. Andrei Broder, Chief Scientist at Yahoo’s Advertising Technology Group
Ben Maurer, co-founder of reCAPTCHA
Dr. Kumar Chellapilla, Scientist at Microsoft Research
Shaun Friedle, creator of the Megaupload autofill CAPTCHA greasemonkey script

CAPTCHA basics

The official CAPTCHA website
Alan Turing's 1950 paper, Computing Machinery and Intelligence, wherein he poses the Turing Test
A nice little summary of the history of CAPTCHA
A long Wired article about CAPTCHA and Luis von Ahn’s GWAP project
reCAPTCHA - stop spam, read books
CyberLover – the bot that steals personal information
The Photoshop Phriday competition to make funny pictures from reCAPTCHA word combinations
A funny xkcd about CAPTCHA and Turing tests
The CAPTCHA patent
Taylor Hayward’s work on 3D images as CAPTCHAs

Algorithmic attacks on CAPTCHA

Kumar Chellapilla’s paper on breaking CAPTCHAs at Microsoft Research
Shaun Friedle’s megaupload autofill CAPTCHA greasemonkey script as broken down in John Resig’s blog

Convolutional Neural Networks

Video of the Hubel/Wiesel cat brain experiments. Amazing example of reverse engineering.
Yann LeCun’s LeNet-5, a convolutional neural network. LeCun is one of the originators of the technique.
A great paper introducing convolutional neural networks
Convolutional Neural Networks best practices, a Microsoft Research paper from Patrice Simard

CAPTCHA bypass services (aka CAPTCHA farms)

Inside India’s CAPTCHA solving economy, a ZDNet article
Decaptcher
Spyder CAPTCHA assist for myspace

This episode contains two songs from Eternal Jazz Project, a Swedish jazz band that released some of their music under the Creative Commons BY-NC-SA license on magnatune. This episode is distributed under the same license.



Transcript

00:00:00
Broder: There was a procedure called add URL, where you would come to a search engine and you would say, you know, here is the pages I just made. But anyway we had this problem and, of course, there was spammers and there were people that were adding the same page millions of times and wrote little scripts to add their pages. So we had somehow to slow the spammers. And this is how we came up with the idea that we need a test to distinguish between spammers and humans.
00:00:46
Nat: That was Dr. Andrei Broder, the Chief Scientist at Yahoo!, discussing his time at AltaVista in 1997, when he led the team that invented a little thing called CAPTCHA.
Alex: And CAPTCHAs are the subject of our program today. We're going to be exploring the state of the art in CAPTCHA generation and circumvention.
Nat: I’m Nat Friedman, reporting from the Bavarian capital of Munich.
Alex: And I’m Alex Graveley, reporting from sunny, cloudy, cold San Francisco.
Nat: And this is Hacker Medley, the podcast for curious hackers.
00:01:28
Nat: Let’s see here. Word verification. Type the characters you see in the picture below. Okay. C-O, I think that’s a U.
Alex: Wait, is this supposed to be a word or is this just letters?
Nat: I think it says [coralia]. Is that a word?
Alex: I don’t know.
Nat: I think it’s just sort of just random letters that are pronounceable. Okay. I think it’s C-O-U, and I think there’s an R like tucked in there and that’s, wait that might not be an A actually. I think that, yeah, that’s an A. And then this is either a B or an LE.
00:02:02
Alex: And here Nat is trying to solve a CAPTCHA, one of those squiggly word puzzles that you see all over the internet, where you have to type in the words that you see in order to post a blog comment, create a new mail account, or even participate in an online poll.
Nat: The estimates are that we, human beings as a species, are solving over 200 million CAPTCHAs every single day, but the very first CAPTCHA was implemented at AltaVista back in 1997. I interviewed Dr. Broder at his office in Santa Clara and asked him to tell us how it happened.
Broder: I think from the very beginning we had kind of an idea that the problem has to be some kind of a pattern recognition problem because this is one area where humans are much better than machines. And at some point it sort of started from, I think, some lunch discussion, and someone was pointing out machines are not yet incredibly good at playing chess. How come humans, who cannot make so much computation, are good at chess? It's all about pattern recognition. So we knew that we need a pattern recognition problem. And then we came up with this one.
00:03:07
Nat: How did you come up with the algorithm for distorting text?
Broder: That one is a lot easier to tell you how we decided what things are because actually I had a scanner at home, and scanners were not so cheap as today, and I had a scanner, and I believe was made by Brothers but I’m not 100% sure, and the scanner came with a manual and they also had some OCR software, which came with the scanner. And pretty much I looked in the manual and everything in the manual that they said it’s bad for OCR.
Broder: We decided, why don't we make it? So one of the things that they were saying, well, it's bad if the letters are misaligned, so we said okay, they should be misaligned. And it's bad if you use multiple fonts, so we said okay, use multiple fonts. So it was all there.
00:04:07
Nat: That’s a pretty interesting story, huh?
Alex: Yeah, I love those old stories of like hacker epiphanies that solve really complex problems. The funny thing is that search engines today don’t even use this scheme anymore, they just use PageRank, which crawls the whole web. But instead, CAPTCHAs have turned out to be incredibly valuable for locking out spammers from pretty much all aspects of the internet.
Nat: You know, what’s kind of amazing to me is that these guys, this little team at AltaVista 12 years ago, they came up with this human detection technique and it’s pretty much exactly what we’re using today.
Alex: Yeah, I mean it looks pretty much the same to us but it is somewhat different, like the state-of-the-art has pushed these things towards being much harder for computers to solve.
00:04:54
Nat: Yeah, that’s true. I mean pattern recognition techniques and AI and computer vision have advanced a lot since then. And actually, that’s a good point, that kind of brings us to why Alex and I think CAPTCHAs are so interesting. That little image, that little rectangle of distorted text on your web browser, that is kind of like a window into the world of artificial intelligence and how it relates to human capabilities.
Alex: Yeah, and specifically it’s just like this really interesting set of problems which are sort of described in that they’re tests that computers can generate and grade the answer to but which they can’t themselves solve very easily but that humans can solve really quickly.
Nat: So here are the criteria. In order to be a viable CAPTCHA, a test has to be something that’s beyond the frontier of current artificial intelligence, but well within the capabilities of even really, really average people. So in a certain way, the set of all viable CAPTCHAs describes the ways in which people are still better and more capable than computers.
00:05:53
Alex: Yeah, and it shows you sort of the places where AI still has to grow and the limitations of what we can do, at least with regards to image recognition.
Nat: That's a good point. But, of course, the bad news is that AI is getting smarter and we're not. So, you know, for the time being, at least when it comes to recognizing distorted text, we're still well beyond computers, but there's no reason it's going to stay that way forever.
Nat: Actually, Alex, by the way, the idea of CAPTCHA goes back to an earlier concept called a Turing Test.
Alex: I've heard of Turing Tests but it's funny, I didn't know that CAPTCHA stood for a Completely Automated Public Turing Test to tell Computers and Humans Apart, which is a pretty long acronym, but the important thing in there is that it is a form of Turing Test. Nat, maybe you can explain what that is?
Nat: Sure. So back in 1950 Alan Turing, the father of computing, wrote this really amazing paper called Computing Machinery and Intelligence. And what you have to understand is, in 1950 the transistor was only 3 years old. So computers were like really big, they were room sized, they were really loud and they didn't do very much. So it was in this world of fairly limited computer capabilities that Turing asked an enormous question, and the question was: "Can machines think?" And this is like a philosophical question, and in order to answer it you'd have to define what thinking is.
00:07:13
Alex: But I mean it's interesting because people are just sort of sitting around with these big old computers waiting for punch cards to be processed and they had their heads in the clouds of these sort of abstract questions.
Nat: Right. Now instead of going in a total abstract route though, Turing devised, he invented a game, a very concrete game, which he called “The Imitation Game.” And the way people usually describe the game is, you have a person who’s a judge, and he’s communicating with someone else who’s in another room, who could be a computer or a human being and they’re talking through little text messages, like IM or something, and the question is: can the judge tell if he’s talking to a computer or a person?
00:07:50
Alex: That’s sort of what’s become the Turing test, which has been around so long at this point that it actually represents sort of this like unachievable holy grail of artificial intelligence. And it represents, if it ever gets solved it represents the point at which computers can really convincingly simulate the interactions between humans.
Nat: Yeah. Actually when I was a little kid my friends and I used to talk about the Turing Test, as you said like a kind of major milestone in artificial intelligence that we figured would have been solved by now. But I hadn’t actually read the paper until we started doing the research for the show and what I discovered is that what Turing actually wrote is different from what we just described. See in Turing’s original paper there’s three people, there’s a man, and a woman, and the judge, and they’re all in separate rooms, and the judge is trying to guess which is the man and which is the woman. And then what you do is you take either the man or the woman and you replace them with a computer, and the question is, does that change the judge’s accuracy from when he was talking with two humans?
00:08:51
Alex: Kind of a weird twist. And the computer actor in that specific scenario is like trying to trick the judge into thinking that the human is lying and it’s all very confusing. I still don’t fully understand why that question is posed in such an obscure and specific way but, you know, it’s Turing so chances are good that he was thinking about something that I’m not.
Nat: No question about that.
Alex: You know, it’s interesting because I think the Turing test is only hard to pass if you suspect you’re talking to a computer.
Nat: Yeah. Actually there's a whole bunch of examples of people spending hours talking to even really poorly implemented chat bots that don't even put a delay in before they respond to someone's IM or something like that. So they respond in a tenth of a second. And actually I found a really funny screenshot online, it turns out there's a Russian chat bot called CyberLover, and what it does is it goes into chat rooms and on IM and it poses as an attractive female and it enters IM conversations with men, and it kind of gradually convinces them through these faked human interactions to give up their personal information. And so the screenshot is of the dashboard for this chat bot and you can see all the men that it's talking to and it reports as it gets their full name and their address and their credit card numbers and that sort of thing. We'll have to put that up on the website.
00:10:21
Alex: I’m just going to go call Visa real quickly.
Nat: Getting back to the Turing test, though, there’s a bet on the site longbets.org, one of my favorite websites, between Mitch Kapor, who’s the founder of Lotus, and Ray Kurzweil, as to whether a computer will be able to pass a Turing test by 2029.
Alex: Yeah, and that’s the commonly understood concept of the Turing test, not the sort of gender guessing, gender faking one. Mitch Kapor is betting that computers won’t do it, which seems kind of negative to me, and Ray Kurzweil is betting that they will because his sort of whole singularity concept depends on it. And it’s a real bet. There’s $20,000 on the line.
00:11:04
Nat: So Alex, Turing posed this big question back in 1950, and then for 46 years, the AI community worked like crazy to try to build algorithms that could imitate human capabilities in even really simple uncontrolled situations. And they haven't really quite got there. Actually I have a little blast from the past for you, Alex. Let's listen to this.

DR SBAITSO CLIP


Alex: Oh man, It’s my very first shrink.
Nat: I don’t know if you remember that from the sound blaster?
Alex: I totally do. It’s like one of those programs that were on the, it was one of the demos that came on the sound blaster install disc.
Nat: Yeah. And then they had one with the talking parrot. You remember that, too? It had a different voice?
Alex: I kind of remember the talking parrot. Can you simulate the voice for me?
00:11:52
Nat: I don’t think I could. So because computers were having so much trouble at even really simple human tasks, let alone actually imitating people in a human context, the conventional wisdom about AI has been, for decades, that AI is in a rut. But then in 1996, a researcher at the Weizmann Institute in Israel named Moni Naor, he looked at the situation and he saw an opportunity. He figured that the things that people could do that AI was still failing to do, he figured could be used by COMPUTERS to automatically tell computers and humans apart.
Alex: Moni’s paper is called “Verification of a human in the loop or Identification via the Turing Test.” And he had a bunch of really cool ideas, some kind of novel concepts for the kinds of puzzles that you could pose to humans to determine if they were in fact human.
Nat: Most of those puzzles you’ll see are kind of in the areas of like sensory processing, image recognition, that kind of thing. Actually I think we should just read a couple.
Alex: Alright, yeah. There's one that was gender recognition, which is actually kind of difficult: if you show a picture of a face, determining whether or not it's a male or a female.
00:12:56
Nat: I have trouble with that just in real life.
Alex: Me, too. I got hit the other day because of it. And there’s facial expression understanding, whether the person in the picture is happy or sad. And then there’s identifying body parts, which actually seems like a really difficult problem to me for computers to solve, being able to tell which, in a random picture, whether or not you can highlight the arm or the leg.
Nat: Here's one I like, filling in words. Given a sentence where the subject has been deleted and a list of words, select one for the subject.
Alex: That’s kind of cool. Sort of text comprehension.
Nat: And he also here mentions handwriting understanding, which is actually pretty close to what CAPTCHAs ended up being.
Alex: And he mentions also speech recognition, which is used in audio CAPTCHAs today for blind people.
Nat: So I mean Moni’s paper gives us a pretty good inkling of what CAPTCHA could be, but he wrote the paper before CAPTCHA was actually invented. And a lot of these particular ideas, well they didn’t turn out to be that great.
00:13:54
Alex: Yeah, things like drawing a circle around a person in a scene or a person’s body part is actually kind of annoying to do in practice. And also things like guess the word that fits into the sentence you can do by, if you index a lot of web pages you can determine which sentences are common or which word structures are very common.
Nat: And actually whenever you have a test that doesn't have very many choices, like for example a binary choice, like male or female, if you just write a script that guesses randomly you're going to be right 50% of the time. So that's a pretty good pass rate for a pretty short script. So you have to give the user lots of binary choices, like five or ten or something like that, to make the random guessing pass rate low enough (there's a quick sketch of that math below). But anyway, totally independently of this paper that Moni Naor wrote, you had the work that was going on at AltaVista. So kind of industry and academia were converging on the same point.
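To make that guessing math concrete, here's a tiny illustrative sketch of our own (not anything from the show): the pass rate for a bot guessing randomly falls off exponentially as you chain independent choices together.

```python
# Chance that a bot guessing randomly passes a test made of n independent
# choices (2 options per choice for a male/female style question).
def random_guess_pass_rate(n_choices, options_per_choice=2):
    return (1.0 / options_per_choice) ** n_choices

for n in (1, 5, 10):
    print(n, random_guess_pass_rate(n))
# 1 -> 0.5, 5 -> ~0.03, 10 -> ~0.001
```

Five to ten chained choices is enough to push a random-guessing script down to a fraction of a percent, which is roughly the point Nat is making.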
00:14:48
Alex: Right. And a few years later at CMU, this totally awesome guy named Luis von Ahn and his professor Manuel Blum wrote a paper where they coined the term CAPTCHA and sort of formalized the whole concept. One thing that’s totally awesome in this paper and one of the reasons I like CAPTCHAs so much is that it points out that CAPTCHAs are pretty much a win-win situation, “either the CAPTCHA is not broken and there is a way to differentiate humans from computers, or the CAPTCHA is broken and a useful AI problem has been solved.”
Nat: Yeah, I love that, too. I think that's really cool. So since 1996, 1997, the time when AltaVista invented CAPTCHA and these papers came out, CAPTCHAs have become super widespread. Millions are solved every day. And, by the way, the average CAPTCHA takes about 14 seconds to solve. So if you multiply that out, that's a lot of time that's being spent by people solving CAPTCHAs every day. And with all this work being done, Luis von Ahn saw an opportunity, and with a couple of other people, founded a company called reCAPTCHA.
00:15:48
Ben Maurer: So my name is Ben Maurer. I’m one of the cofounders of reCAPTCHA and I’m responsible for the design of our API and for our infrastructure.
Ben: So people are solving 200 million CAPTCHAs a day, let’s say, and what they’re doing is they’re spending time doing something that by definition a computer can’t do. That’s automatically valuable because if we could give people a task that is useful then we’re getting something that we don’t otherwise have the ability to get. And so we said what can we do with all this, you know, with all this human computation power?
Alex: So just to cut in, in case you don’t know what reCAPTCHA is, you’ve probably seen these before: they’re the CAPTCHAs that have two words that you have to type, the words are usually in some kind of old or smudgy print face, and there’s maybe a line drawn through them.
Ben: And what we came up with is instead of having one word in the CAPTCHA we have two words and one of them is sort of a fake. It’s not part of the CAPTCHA it’s just a word that we don’t know what it is and we want you to tell us what it is and we do that to digitize books and newspapers and other content that computers can’t read.
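The control-word trick Ben describes is simple enough to sketch. This is a toy illustration under our own assumptions, not reCAPTCHA's actual code: the user is verified against the word whose answer is already known, and their reading of the unknown word is collected as a digitization vote.

```python
# Toy sketch of the two-word scheme: pass/fail on the known "control"
# word, and record the user's answer for the unknown word as a vote
# toward digitizing it. Names and data structures here are made up.
unknown_word_votes = {}   # unknown-word image id -> list of human transcriptions

def check_answer(control_word, typed_control, unknown_id, typed_unknown):
    if typed_control.strip().lower() != control_word.lower():
        return False                                   # failed the CAPTCHA
    unknown_word_votes.setdefault(unknown_id, []).append(typed_unknown)
    return True                                        # human, plus one transcription

print(check_answer("upon", "upon", "scan-41", "morning"))   # True
print(unknown_word_votes)                                   # {'scan-41': ['morning']}
```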
00:17:00
Nat: So then what they do is they run two different OCRs over the text. Ben told me that they use a couple of commercial OCRs, and an open source one called Tesseract, which comes from Google, which is now considered pretty state of the art. And they identify words that the OCR software couldn’t recognize or doesn’t have a lot of confidence about. Ben explained it pretty well.
Ben: So OCRs are never 100% sure whether they're right or not. But what we do is we take multiple OCR engines that use different algorithms and they tend to have failures that aren't 100% correlated with each other. If they both agree then we sort of say it's very likely that the word is correct. We use a few other signals such as, you know, does the word fit in this sentence? Like, you know, one sentence we had in an old newspaper was that the motor ears were running down the street. And motor ears is something that just doesn't occur in the English language, and what happened is a C looked like an E to the OCR, and we have the ability to say motor ears is a bigram that just doesn't typically appear and it's suspicious.
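Here's a minimal sketch of that double-check, assuming stand-in OCR engines and a tiny hand-made bigram list (reCAPTCHA's real pipeline and data are obviously much bigger): only accept a reading when independent engines agree and the resulting bigram looks plausible.

```python
# Toy version of the agreement + bigram check Ben describes. The "engines"
# and the plausible-bigram set are placeholders for illustration only.
PLAUSIBLE_BIGRAMS = {("down", "the"), ("the", "street"), ("were", "running")}

def agreed_reading(word_image, engines):
    guesses = {engine(word_image) for engine in engines}
    return guesses.pop() if len(guesses) == 1 else None   # None -> needs a human

def suspicious(prev_word, word):
    return (prev_word, word) not in PLAUSIBLE_BIGRAMS

# Two fake engines that both misread the same smudged word as "ears":
engines = [lambda img: "ears", lambda img: "ears"]
word = agreed_reading(b"smudged-pixels", engines)
print(word, suspicious("motor", word))   # ears True -> flagged despite agreement
```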
00:18:07
Nat: By the way, Alex, I thought it was nifty that they also use bigram probabilities to help identify which words the OCRs failed to recognize.
Alex: Yeah, I suspect that they’re using the one provided by Google where they have this huge bigram index, this big database you can download for a small fee, and it basically shows the occurrence of combination of words all over the web.
Nat: It makes sense actually also because Google ended up buying reCAPTCHA pretty recently.
Alex: Yeah. And reCAPTCHA has APIs in a whole bunch of languages. So it's sort of a general-purpose CAPTCHA platform that you can just embed into your site. And these things are used everywhere: on Facebook, TicketMaster, Craigslist, Wikipedia… everywhere.
Nat: Ben told me, Alex, that reCAPTCHA is actually getting a whole lot of old books and newspapers transcribed.
00:18:55
Ben: We’ve done about, I think about 50 years worth of the New York Times already and currently reCAPTCHA users are solving 50 million CAPTCHAs a day
Nat: And by the way, Alex, The Times is paying reCAPTCHA for all that digitization work that they’re doing.
Alex: That’s pretty awesomely shrewd right there!
Nat: Definitely.
Alex: And they're doing all this with pretty standard stuff: Python, nginx, and a lot of intelligent hackery.
Nat: Actually with all that scale, solving 50 million CAPTCHAs a day, I asked Ben a little bit about the architecture, and specifically how they store the CAPTCHAs on disk. Is it just one file per CAPTCHA image? And here's what he said…
Ben: Yeah, that was originally how things worked and that’s a pretty big disaster just because every time you serve a CAPTCHA then you end up doing a disk seek. And when you have a server that can serve a few thousand requests per second you can’t do a few thousand disk seeks per second. It’s just too slow.
Ben: And we found that one file per CAPTCHA, when we would get substantial load on the server the latency would become very high. So we actually use a custom file format to store the CAPTCHAs that allows us to load a bunch of CAPTCHAs into memory at once.
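The fix Ben describes is to pack lots of CAPTCHAs into one file so a whole batch comes in with a single sequential read instead of one seek per image. Here's a sketch of that general idea under our own assumptions; it is not reCAPTCHA's actual custom format.

```python
import struct

# Toy packed format: [count][len][bytes][len][bytes]... so a server can
# slurp a whole batch of CAPTCHA images into memory in one read.
def write_pack(path, images):                 # images: list of bytes (e.g. PNG/JPEG blobs)
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(images)))
        for img in images:
            f.write(struct.pack("<I", len(img)))
            f.write(img)

def read_pack(path):
    with open(path, "rb") as f:
        data = f.read()                       # one sequential read, no per-image seeks
    count, = struct.unpack_from("<I", data, 0)
    offset, images = 4, []
    for _ in range(count):
        size, = struct.unpack_from("<I", data, offset)
        offset += 4
        images.append(data[offset:offset + size])
        offset += size
    return images
```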
00:20:16
Alex: That's great. That's another one of those sort of problems that you only run into when you're operating at really large scale.
Nat: Yeah, and it's cool to peek under the covers of an operation like that.
Nat: Now, by the way, reCAPTCHA doesn't just take the scanned word off the page and present it to you unmodified; they actually distort the word a little bit before you see it in the CAPTCHA.
Alex: Right, like I said, they maybe draw a line through it or they make it wavy. And recently they started using these XOR blobs where they sort of switch the foreground of the word with the background for part of the word.
00:20:50
Nat: And the reason, Ben told me, that they do this is because, even though OCR software couldn't recognize the word, you know, OCR software is not really designed to solve CAPTCHAs, it's trying to get a balanced view of the document, so it might be possible to build an algorithm that could get enough CAPTCHAs right to be annoying. For example, Ben said that if you took standard OCR software and just tweaked its algorithm to use its second best guess for what the word could be instead of its best guess, that might solve enough reCAPTCHAs to be a problem. So that's why they add extra distortion, just for extra safety.
Alex: Yeah, and everyone we’ve talked to has basically said the same thing, which is that reCAPTCHA is one of the toughest CAPTCHAs out there, which is important because you only need, say, 10% of CAPTCHAs solved by your bot to create thousands of fake Gmail accounts or get a lot of SPAM comments through. So the team at reCAPTCHA works really hard to make their CAPTCHAs as difficult to break as possible, while still trying to keep them easy for humans to solve.
Nat: And they’ve had a pretty good balance but CAPTCHA was not always as secure as it is now. And, Alex, there’s a funny story about that.
00:22:04
Nat: So back in the fall of 2004, for Microsoft's Hotmail team, like most webmail services, one of the big concerns was SPAM, and specifically spammers using Hotmail to send SPAM.
Alex: Yeah, and Nat, like every other webmail service on the planet, their sort of first line of defense is to ask people that are creating a new account to solve a CAPTCHA.
Nat: So Hotmail was depending on CAPTCHAs to protect them from SPAM. And they wanted to know: how safe are these things anyway? You know, how hard would it really be to build an algorithm to break a CAPTCHA? So being Microsoft, of course, they have a really substantial research division right on campus. So they called up Microsoft Research and got in touch with a scientist there named Kumar Chellapilla, who is a machine learning expert.
00:22:52
Kumar: Yeah, so my actual relationship with CAPTCHA comes from machine learning. So my actual PhD research work was on computational intelligence, and this is trying to build intelligent adversaries or agents that could act and train against humans.
Kumar: So these are models that you could train by giving it like input and output signal. And for some, for my PhD work I did mostly game playing like checkers and chess and so on.
Nat: When Kumar joined Microsoft Research, he did some work on OCR technology and handwriting recognition specifically for their tablet PC project.
Kumar: One of the common areas is signature analysis. How do you get a computer to look at two signatures and tell it to accept the signature or not? These are very, very hard problems.
Nat: And so Kumar sat down and he looked at the most prominent CAPTCHAs on the web from the biggest companies on the web at the time, and here’s what he found.
Kumar: And I was surprised. I have somewhat of an undergrad understanding of image processing, a doctorate level understanding in machine learning, and as I started applying some of these techniques, it was very easy to undo the challenges that were being put forth by the CAPTCHA. And I was so surprised at how quickly this happened that we immediately, I think in November 2004, December 2004, there's this famous machine learning conference called Neural Information Processing Systems, and that was the first place where we presented a poster.
00:24:17
Kumar: And it was amazing. We had about half a dozen different CAPTCHAs that were provided by several different people in the industry and we could show that many of them you could break like one out of two, one out of four, two out of three.
Nat: Now, Alex, as it turns out, solving a CAPTCHA is something that actually breaks down into two separate problems: first is the problem of segmentation, and then comes the problem of recognition.
Alex: And I didn’t know this beforehand but segmentation is the process of breaking a picture of a word up into individual letters. And the recognition is then taking each one of those sort of subpictures and identifying which letter it represents.
Nat: And what Kumar quickly discovered was that recognizing the letters in most of the CAPTCHAs at the time was pretty easy.

00:25:01


Kumar: One of the problems we already solved by the time I started looking at CAPTCHAs was, if you give me a single character, moderately distorted but not devastatingly distorted, then you sort of use your mouse or you point to the center of the character, I have techniques that can learn from that signal and basically give you the character that is there at that point.
Nat: The tool that Kumar was using was a special kind of neural network called a convolutional neural network.

Actually, why don’t we start off and tell people what neural networks are.


Alex: Yeah, sure. Neural networks are this sort of pretty widely-used technique in AI that’s been around for a really long time. And the basic idea is that you have these neuron-like elements that have inputs and outputs and the inputs and outputs are sort of arranged with inputs going into other neurons and outputs going into other neurons. So for a given neuron each input has its own weight, which multiplies the input value. The neuron adds up those weighted inputs, and if it’s greater than a certain threshold then the neuron fires, meaning that it sends a signal to its output. And the output signals of all these neurons sort of propagate through the network until you get the “answer” on a specific set of output neurons.
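Here's a minimal sketch of the kind of threshold neuron Alex is describing, with made-up weights just for illustration: each neuron takes a weighted sum of its inputs and fires if the sum clears a threshold, and the outputs of one layer become the inputs of the next.

```python
import numpy as np

# A single threshold neuron: weighted sum of inputs, fire (1) if it
# clears the threshold, otherwise stay quiet (0).
def neuron(inputs, weights, threshold):
    return 1 if np.dot(inputs, weights) > threshold else 0

# A tiny two-layer "network" with made-up weights: two hidden neurons
# feed one output neuron.
x = np.array([0.9, 0.1, 0.4])
hidden = [neuron(x, np.array(w), 0.5) for w in ([1.0, 0.0, 0.2], [0.1, 0.8, 0.3])]
output = neuron(np.array(hidden), np.array([0.6, 0.6]), 0.5)
print(hidden, output)   # [1, 0] 1
```

Training is essentially the process of nudging those weights until the output neurons give the answers you want; real networks use smooth activation functions rather than this hard threshold so the nudging can be done with calculus.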
00:26:15
Nat: Exactly. So the basic idea for convolutional neural networks came from an experiment that was done back in 1959 by these two guys, David Hubel and Torsten Wiesel. What they did was they took a cat, and they put it under anesthesia. And then they inserted some electrodes directly into the cat's visual cortex. And they opened its eyes and flashed different patterns of light and dark lines in front of the cat. And what they found was really interesting: they found that some neurons in the cat's visual cortex fired rapidly in response to lines at one angle, and some neurons fired rapidly in response to lines at a different angle. So there was some angle sensitivity to different groups of neurons. And there were other neurons in the visual cortex that were totally angle-independent.
00:26:57
Nat: So what happened subsequent to that is, you know, this was obviously a pretty big result in neurology, but some computer scientists got a hold of it and what they realized is they could take neural nets and they could arrange them like a cat's visual cortex is arranged. So at the lowest level you'd have neurons which are recognizing simple features in the image, like corners, edges at a certain angle, or end points in certain regions of the image. And then there would be subsequent layers, which are usually called the hidden layers, in the neural network, and these subsequent layers would sort of combine those basic features to detect higher-order characteristics or features in the image. And if you have enough of those and the right kind you can start to recognize even really distorted letters or objects or things like that. So it turns out that this special type of network, this convolutional neural network, which is sort of roughly based on the way vision works in mammals, is really good at image recognition.
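The "convolutional" part of that is just a small filter slid across the image, measuring at every position how strongly the local patch matches the feature the filter is tuned to. A quick sketch, with a hand-made vertical-edge filter standing in for the learned ones in a real network:

```python
import numpy as np

# Slide a small kernel over an image and record how strongly each patch
# matches it -- one "feature map" of a convolutional layer. Real networks
# learn the kernels; this one is hand-made to respond to vertical edges,
# much like Hubel and Wiesel's angle-sensitive neurons.
def convolve2d(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)
image = np.zeros((8, 8))
image[:, 4:] = 1.0                       # dark left half, bright right half
print(convolve2d(image, vertical_edge))  # responses peak where the window straddles the edge
```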
00:27:53
Nat: Specifically, these networks were really good at recognizing handwriting. And so when Kumar got assigned the whole CAPTCHA project he’d already been through Tablet PC and he had all this handwriting recognition experience and therefore he had this really powerful image recognition tool at his disposal. And when he took this thing and he pointed it at the state of the art CAPTCHAs on the web it just blew them away.
Kumar: So four out of five characters I’d be able to recognize correctly or certainly nine out of ten. So that was sort of like, that was the platform for almost all of my techniques. I would try to reduce every CAPTCHA I saw out there with some ad hoc processing down to a place where I could just give it maybe like five or ten locations where I thought characters were and then this system would, it’s not free because you have to label like thousands and thousands of these laboriously but it’s a very automatable technique.
Nat: So, Alex, with the recognition problem solved, for Kumar breaking CAPTCHAs basically came down to just identifying the locations of the letters. And this is the segmentation problem. And in the CAPTCHAs that existed on the web in 2004, segmentation was actually not that hard. Kumar explained to me how he solved TicketMaster's CAPTCHA.
00:29:01
Kumar: They were exclusively using these grids of slanted lines. They were almost regular but not exactly. They would tilt a little bit between the different parallel lines in the grid, but their text was always horizontal so, and the text was always thicker than the grid. So if you did some blurring the background lines would blend into the background and then the words will stay up in the front.
Kumar: So if you had the word hello on a white piece of paper, hello being typed in black, then you could feed a scan of that page to this connected components algorithm, and what it will do is it will start at one of the black pixels, let's say in the H, it would grow that by looking at the neighboring black pixels and it will slowly grow it into the letter H. Then once it has reached the edge of the H it will no longer blend into the background, so it will remove that letter H as a character. And you can repeat this iteratively until you get all the characters. So that's another one where I think the TicketMaster one was reduced down to one of those and then we could build that out. Register.com also had a similar one. The very early MSN Hotmail one also did not have enough arcs, so some of the characters would not even be touching, so you could easily eliminate those.
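A minimal sketch of the connected-components idea Kumar describes, on a binary (0/1) image: flood outward from each unvisited "ink" pixel, and each blob you collect is, with luck, one character.

```python
# Flood-fill connected components on a 2D list of 0/1 pixel values.
# Each returned blob is a list of (row, col) coordinates -- ideally one
# blob per letter, as long as the letters don't touch.
def connected_components(img):
    h, w = len(img), len(img[0])
    seen, components = set(), []
    for y in range(h):
        for x in range(w):
            if img[y][x] and (y, x) not in seen:
                stack, blob = [(y, x)], []
                seen.add((y, x))
                while stack:
                    cy, cx = stack.pop()
                    blob.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            stack.append((ny, nx))
                components.append(blob)
    return components

# Two separate strokes -> two components.
print(len(connected_components([[1, 0, 1],
                                [1, 0, 1]])))   # 2
```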
00:30:23
Nat: So what do you do when the characters are touching?
Kumar: So if this island analogy works, you can think of everything in the background as a big ocean, because it's relatively flat, and there's these islands sticking out. You could grow the islands, let's say there's more sedimentation and the land mass kind of moves out into the water; then if two islands are very close to each other they may grow and they may connect to each other. So that allows you to sort of connect things. And that's usually like a growing operation in standard image processing; things like halos and so on you can add to objects that way.
00:30:56
Kumar: You can also do erosion, which is the opposite. You remove pixels that are very close to the edge of the character, so in the same island sense you're losing land, because part of the island is eroding into the ocean, and that way you can separate two characters that are connected. So let's say you have two O's that are connected by a thin line, and for simplicity the O's are more like filled-in circles. Then as you erode, the circles erode relatively evenly in all directions, so they may become smaller circles, still filled in, but the line that's connecting the two circles would slowly get to a point where, once it becomes really thin, one pixel wide, another step of erosion would just completely cause the connecting pixels to go away. And now you've broken two O's connected by a line into two O's. And once they're separated you can then do the opposite. You can now start to grow them back. And so if you do something like four steps of erosion followed by four steps of growing, you would lose every line or anything that was thinner than four pixels wide.
Kumar: And so that's a common trick: you erode just enough to make them disconnected, then you grow them back so that the pieces of a character that survived the erosion connect back up.
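That erode-then-regrow trick maps directly onto standard morphological operations. Here's a sketch using scipy's versions (any image library has equivalents); the step count is the knob Kumar describes for how thin a connecting stroke has to be before it disappears.

```python
import numpy as np
from scipy import ndimage

# Erode a few steps to dissolve the thin stroke joining two letters,
# then dilate the same number of steps to grow the surviving blobs back
# toward their original size.
def split_touching_characters(binary_img, steps=4):
    eroded = ndimage.binary_erosion(binary_img, iterations=steps)
    return ndimage.binary_dilation(eroded, iterations=steps)

# Two thick blobs joined by a 1-pixel-wide bridge:
img = np.zeros((20, 40), dtype=bool)
img[5:15, 2:12] = True           # left "O"
img[5:15, 28:38] = True          # right "O"
img[10, 12:28] = True            # thin connecting line
cleaned = split_touching_characters(img, steps=2)
labeled, count = ndimage.label(cleaned)
print(count)                     # 2 -- the blobs come back as separate characters
```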
00:32:13
Alex: That’s a totally awesome description, but I suspect in practice it’s a lot more nuanced and probably required a PhD to understand what the hell’s going on.
Nat: Well actually the paper's pretty well written. It's pretty accessible. We'll put a link on our website if you want to check it out. But what he basically said in the paper was: recognizing distorted characters is solved. If you want to make CAPTCHAs really hard, lean on the segmentation problem, because identifying the locations of the characters is surprisingly hard if you do things like make them touch, and don't just do totally trivial things to your CAPTCHA. So the best CAPTCHAs on the web today have adapted to pose really hard segmentation problems.
Alex: It seems so weird to me that image recognition can, you know, identify a letter, it just can’t figure out where it is.
00:33:00
Nat: I know, right? It’s not intuitive at all. Now actually, even though a lot of these issues were pointed out five years ago in Kumar’s paper, and Google and Microsoft and Yahoo and reCAPTCHA now have really good CAPTCHAs that are hard for computers to break, a lot of the CAPTCHAs that you find on the web and in the wild, and you and I have both run into these, they still mostly pose a recognition problem and not a segmentation problem. Actually, I asked Dr. Broder about this, and here’s what he said.
Broder: You know, I see some CAPTCHAs that clearly are very hard for humans to solve but in fact they don’t introduce any difficulty for computers whatsoever. They are simply creating some extra annoyance for humans without getting any quality. I mean people have to realize what are the hard problems and what are not the hard problems. And some of the CAPTCHAs are totally silly and I’m sure that you can use them as an exercise in any course in pattern recognition and people will solve them.
00:34:08
00:34:19
Alex: If you look around the web today you can find like little Python scripts or other little programs that you can run to break some of the weaker CAPTCHAs out there.
Nat: And, Alex, actually I have a little treat for us. I did some googling and I found a university student in Northern England who wrote a particularly cool CAPTCHA solver.
Shaun: Well, my name is Shaun Friedle and I’m the author of Megaupload auto-fill CAPTCHA, which is a GreaseMonkey script for Firefox which auto completes the CAPTCHA on megaupload.
Alex: Woah! That’s such a great hack, right? Like this guy decided to start solving CAPTCHAs in the browser using Javascript.
00:34:55
Nat: Yeah, I mean this is the definition of like a good hack basically. I mean what Shaun Friedle did is he wrote a GreaseMonkey script that solves just this one particular CAPTCHA, on a site called Megaupload, which I’ve never heard of before but apparently is like one of those websites like rapidshare where you can upload a file and people can download it. And they use CAPTCHA to protect the download link from bots. And I asked Shaun how he got into this, what motivated him to do this in the first place.
Shaun: And then I came across a forum thread on the userscripts.org site. Someone else was asking if it was possible to decode reCAPTCHA using a GreaseMonkey script and all of these people were saying no, that's stupid, there's no way you will be able to do it, that's impossible.
Nat: And so you took that as a challenge, huh?
Shaun: Yeah, I thought well, I don't know if it's really possible with reCAPTCHA, but I thought I can probably try and do that just in GreaseMonkey, purely in JavaScript, on the Megaupload CAPTCHA. And at that time I had done no image processing in JavaScript. In fact, I'd written about 100 lines of JavaScript before that point, so I'm not really a JavaScript programmer. So I started researching whether it was possible and I found out that using the canvas functionality in HTML 5 you could do some image processing, and eventually built it from there and managed to implement the entire thing in JavaScript.
00:36:19
Alex: I actually hadn’t heard of anybody doing sort of external image processing using Javascript and CANVAS like that.
Nat: Yeah, actually Shaun was really humble and he said he'd never done any image processing in Javascript before. But I think almost no one had ever done image processing in Javascript before he wrote this hack. And then it ended up on John Resig's blog, who is the author of jQuery, and a lot of people found it pretty interesting. That's actually how I found out about it. But it does seem like a technique that could be useful in lots of different places. Anyway, Shaun had also previously read this game programming book, and learned about neural networks from that, and so he implemented a neural network in Javascript, and then he manually trained it, by typing in a whole bunch of CAPTCHAs himself, to recognize the Megaupload CAPTCHA.
00:37:02
Alex: And his script still works?
Nat: Yeah. He told me it can solve the Megaupload CAPTCHA in about 200 milliseconds.
Alex: That's not bad. I think it's a neat hack because it's not necessarily anything novel research-wise, but doing it all in Javascript inside the browser, and being a novice, it just seems really educational.
Nat: We asked everyone who has broken a CAPTCHA what they think when they run into CAPTCHAs on the web, and Shaun said that about 60% of the CAPTCHAs he encounters he could probably hack with his GreaseMonkey script and a few hours of modifications. And of course he's talking about the smaller CAPTCHAs, not the big-company CAPTCHAs, but the ones that don't pose really hard segmentation problems.
Alex: Yeah, and that totally goes against my original thinking when we started this podcast, which is that the way you can be sure that a CAPTCHA works is by writing your own, because you'll be able to sort of hide anonymously on the internet, because people won't spend time solving your particular CAPTCHA. But it turns out that, you know, unless you're kind of tracking the leading edge in image recognition technology, like reCAPTCHA is, your CAPTCHAs are probably going to end up really, really trivial to solve.
00:38:12
Nat: And you're kind of right on one count though, Alex, which is that if your site is really tiny and nobody cares about it, they're not going to bother to try to break your CAPTCHA anyway. But the cool thing with reCAPTCHA of course is that they're going to always keep up with the latest attacks. It's like a platform that's always going to evolve with the attackers.

Now we’ve been talking about some pretty sophisticated ways of attacking CAPTCHA. But there’s one very easy way to break a CAPTCHA we haven’t mentioned yet.


Alex: Is this the sort of legendary porn attack that I’ve always heard about?
Nat: Well that’s one. Why don’t we talk about it first?
Alex: Yeah, so I always heard that, like there’s always been this rumor that porn sites would stick CAPTCHAs up in front of people who wanted to look at porn images and they would have to solve the CAPTCHA in order to move on and see the pornographic image. And that CAPTCHA they solved would then be forwarded along to some script that was creating an email account or posting a comment.
00:39:07
Nat: This is like exactly the kind of story that's designed to just be spread all over the internet because it involves like a cool hack and pornography. But it turns out it's not really an issue. The volume of CAPTCHAs that would be solved by this technique is just too low to actually make a dent. And it's not really a very competitive thing for a porn site to do: for every one site that puts CAPTCHAs in front of their images, there are a thousand that won't. So it doesn't add up economically. There is another way that humans can be used to break CAPTCHAs that actually is a bit more of an issue.
Alex: Oh, is it the sort of CAPTCHA farm thing in India, with lots of people solving CAPTCHAs?
Nat: They actually prefer the term “CAPTCHA bypass service.”
Alex: I’ve heard of these, too. These are like teams of very low-wage people usually in poor countries just typing in CAPTCHAs for very, very small amounts of money all day long. And I guess these guys break CAPTCHAs and then they get forwarded along to create SPAM and blog comments and things like that.

00:40:06


Nat: Yeah. And actually we heard a funny story about this from a friend at Google. Apparently Google has this property Blogger, and apparently they were having problems with people creating SPAM blogs on Blogger. So they added a CAPTCHA to the blog creation page and that helped for a while. But then eventually the spam blogs came back. And they tracked the CAPTCHA solutions to this one IP address in Costa Rica. Instead of just blocking the server they decided to monitor it, and they could see that the rate at which the CAPTCHAs were being solved actually changed over the course of the day. At 9am they'd be solving something like, say, 10 CAPTCHAs per minute, and then half an hour later, at 9:30, they'd be solving like 20 CAPTCHAs per minute, and at 9:45 they'd be solving 30. And then it would maybe continue like that until 12 o'clock and drop to zero for an hour. And then at 1:00 it would pick back up again. So they could deduce from this that there was a team of four people solving CAPTCHAs for a living, drifting into work in the morning and then all going to lunch together in the afternoon.

00:41:03


Alex: The funny thing is that when these guys solve CAPTCHAs and then those CAPTCHAs are used to post SPAM on web pages, you know, they're not actually expecting people to click on the links that are included in those SPAM comments; they're usually just there to trick Google's PageRank algorithm into rating the spammy links higher.


Nat: Yeah, actually that's a really good point, and PageRank is big money. So most of these CAPTCHA farms are actually a lot bigger than just four guys in Costa Rica somewhere. We tried really hard to interview a CAPTCHA farmer for this podcast. None of the ones we contacted would agree to have their voice recorded for some reason, but they did answer some questions over email, and we'll link some of their web pages online where they advertise their services.

00:41:51


Nat: And you can see, for example, that the prices are just, I mean they are astoundingly cheap. To solve 1000 CAPTCHAs, for example, one site called decaptcher.com charges just $2. So even if the workers take the average of 14 seconds to solve each CAPTCHA, and they don't have any time between CAPTCHA solutions, that comes out to like fifty cents per hour. And actually by email, we learned that many of these CAPTCHA farmers are further kind of hindered by the fact that they're not great typists and they don't speak any English at all.
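The back-of-the-envelope arithmetic behind that fifty-cents figure, using only the numbers quoted above:

```python
# $2 per 1000 CAPTCHAs at the quoted 14-second average solve time.
price_per_1000 = 2.00
seconds_per_captcha = 14
hours_for_1000 = 1000 * seconds_per_captcha / 3600
print(round(price_per_1000 / hours_for_1000, 2))   # 0.51 -> about fifty cents an hour
```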


Alex: Yeah, that’s a pretty sucky situation. You can imagine how hard it would be to solve CAPTCHAs in Hindi. These guys are probably not even solving at the optimal or like the average 14 seconds for each one.


Nat: Yeah. I mean if we had to solve CAPTCHAs in Hindi, good Lord. The service, decaptcher, provides APIs in a whole bunch of languages, you know, C, C++, Perl, Python, C#, etcetera, and they even have a FAQ question on their website. I'll read it for you. Here's the question: I want to bypass CAPTCHAs from my bot. The bots all have different IPs. Is it possible to use your service from many IPs? Then they answer: we have no restrictions about IP: with DeCaptcher you can bypass CAPTCHA from as many IPs as you need.

00:43:07


Alex: Wow. So they're just right out in the open about using botnets to solve CAPTCHAs, huh?


Nat: Yeah, seriously. What it comes down to, really, with these CAPTCHA farms is that you can't stop them. They're going to be out there, they're going to exist. Somebody can always just type in a CAPTCHA, so it's not totally secure. CAPTCHA is not about total security; it's really just about making spam uneconomical. If we go back to Broder at Yahoo! one more time, I think he put this really well:

00:43:34


Broder: Yeah, this is exactly right. I mean it's exactly the same problem you have in mail spam, and there are actually nowadays good statistics about how many people are actually answering those ads for changing your anatomy and so on. And it's an incredibly small number, 1 in a million or 1 in 10 million or something like that. And you can essentially compute a certain ROI, so if you increase the cost even slightly, suddenly the whole enterprise becomes non-profitable. And I think that's basically what we are trying to do: increase the cost slightly, because you have to multiply it by a large number of attempts, and that makes it non-profitable.

00:44:32


Nat: So we’ve kind of moved from this big philosophical question, Can Machines Think, to rooms full of poor people typing in squiggly letters to help sell Viagra on the internet.


Alex: Yeah, but along the way we talked about computers solving and posing these questions that really represent the bleeding edge of artificial intelligence. And it's funny to think, like, you know, could Turing have imagined that this would be the battleground for his sort of ultimate question of can computers think?

00:45:03


Nat: So I think one thing people want to know is: where's this all going? I mean, what is the future of CAPTCHA? Let's talk a little bit about that.


Alex: Well, it's still an active field, so there are new CAPTCHAs being invented all the time. There's one from Microsoft that just came out, it's called ASIRRA, and this works by showing you a picture of an animal and asking you if it's a dog or a cat. Now, like we said before, that's a binary choice, so they end up showing you a few of them, so that the odds of just guessing randomly aren't quite so good. And there's this other one from Google called rotCAPTCHA, which asks you to tell which pictures are facing right-way-up, which I guess is also difficult for computers.


Nat: Yeah, and actually computer vision, of course, is advancing, too. Actually, Kumar, from Microsoft, told me that the whole 1D segmentation problem – a bunch of letters in a slightly wavy line – is getting solved, too. So in the future, CAPTCHA letters might need to be scattered around in a 2D plane. But eventually, of course, the machines are going to be able to do that, too. So the question that I kind of wanted to know since we started looking into this whole topic is – when's that going to happen? When will CAPTCHA no longer be viable as a concept? Here's Kumar Chellapilla once again.

00:46:16


Kumar: It's an adversarial problem. So if you're blocking spammers they're going to work harder and they're going to automate existing solutions to make them cheaper, like more adversarial problems. There is one advantage, though, I should call it out. It's a lot easier to generate, I mean, lots and lots of more difficult CAPTCHAs than it is to break them. So even in vision or in machine learning, computer vision, machine learning, people talk about this synthesis versus analysis dichotomy, right? Is it more difficult to ask difficult questions or is it more difficult to answer difficult questions? And there are a lot of these nonsymmetrical problems where, for the email or any freebie research entities, they can easily generate lots and lots and lots of difficult CAPTCHAs.

00:47:08


Nat: So, Alex, before we started researching CAPTCHA for the show, I was pretty convinced that we were going to find out that AI was on just this collision path to make CAPTCHA irrelevant within, I don’t know, five to ten years. But it really doesn’t look that way to me anymore. I mean basically I think CAPTCHA is probably viable for a couple of decades or maybe longer.


Alex: Yeah. And the other thing is that CAPTCHA is not even really designed to be 100% secure. So as these things slowly become more and more solvable it doesn't necessarily mean that the whole system will fall apart. It just means that there'll be a little bit more SPAM and that will push the edge of research a little bit further.

00:47:49


Nat: Yeah, Kumar actually compared CAPTCHAs to a speed bump. So it's sort of like a little deterrent; you combine it with other techniques like content filtering and that's how you get a really good result. And, you know, even if the computers do catch up, we go back to that whole win-win concept. I mean, that's a win, too. I think Ben Maurer from reCAPTCHA said it really well:


Ben: I mean, if we get to the point where computers are able to do anything that a human can do I’ll be happy. I mean at that point computers will be able to do a really good job at filtering SPAM on their own and they won’t need CAPTCHAs.

00:48:30


Nat: Well, that was our show. We had a lot of fun studying CAPTCHAs and we hope you enjoyed it, too. We’ve posted a whole bunch of interesting links from our research on hackermedley.org so that you can learn more about the Turing Test and neural networks and cat brains, and that kind of thing. So check it out.


Alex: This episode was a bit of an experiment in doing a longer form show, with interviews no less. So we’d love to hear if you think it worked and especially if you think it didn’t work. So please visit our website at hackermedley.org and give us some feedback.


Nat: Thanks for listening.


Alex: Yeah, thanks.