aparrish
2007-05-25 02:01 pm

The CAPTCHA Puzzle

Ah, reCAPTCHA: you've captcha'd all of our hearts. It's a great idea, on the surface: replace the randomly distorted text in CAPTCHA boxes across the web with boxes containing text that has stumped OCR. It's a win-win: your blog goes spamless and the Internet Archive becomes searchable. Truly, we are now harnessing the power of the web.

The problem, of course, is that the text in question can't be deciphered by the computer—which is why it needs to be transcribed in the first place. So that portion of the reCAPTCHA widget can't actually function as CAPTCHA on its own: it can't verify whether or not you're a computer, because the computer doesn't know the right answer beforehand! Here's reCAPTCHA's solution for this "puzzle":

Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
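To make the quoted scheme concrete, here's a minimal sketch in Python. It is not reCAPTCHA's actual code: the class name, the case-insensitive comparison, and the "three agreeing answers" threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class PairedWordCaptcha:
    """Toy model of the two-word scheme (hypothetical, not reCAPTCHA's code)."""

    def __init__(self, min_agreeing_votes=3):
        # Assumed threshold: how many matching answers settle an unknown word.
        self.min_agreeing_votes = min_agreeing_votes
        # Maps an unknown word's id to a tally of the answers typed for it.
        self.votes = defaultdict(Counter)

    def submit(self, known_answer, typed_known, unknown_id, typed_unknown):
        """Return True if the user passes; record their guess only if they do."""
        if typed_known.strip().lower() != known_answer.strip().lower():
            # Failed the control word, so the other answer is discarded too.
            return False
        self.votes[unknown_id][typed_unknown.strip().lower()] += 1
        return True

    def accepted_transcription(self, unknown_id):
        """Return the leading answer once enough users agree, else None."""
        if not self.votes[unknown_id]:
            return None
        answer, count = self.votes[unknown_id].most_common(1)[0]
        return answer if count >= self.min_agreeing_votes else None
```

Note that only the known word decides whether you get in. The answer for the unknown word is simply taken on trust and added to the tally.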


But this solution isn't watertight. For example, nothing prevents any arbitrary string from becoming the most popular transcription for a particular word. Spammers intent on gaming the system might try to overwhelm "correct" transcriptions with false ones, thereby potentially allowing them illicit access—or, at the very least, compromising the usefulness of the transcriptions. The only way to counteract this is to fall back on ye olde-tyme spam prevention techniques (reCAPTCHA has a reasonable list of these) or to have someone transcribe the list of words beforehand—in which case you're back to where you started.
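A tiny illustration of the flooding worry, again in Python. This is a sketch under a big assumption: that the bogus answers get counted at all (that is, whoever submits them gets past the known word somehow). The function name and the sample answers are made up.

```python
from collections import Counter


def winning_transcription(answers):
    """Return the most common answer (a plain plurality vote), or None if empty."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Five honest users agree on the word; eight flooding submissions do not.
honest = ["upon"] * 5
flood = ["cheap pills"] * 8

print(winning_transcription(honest))          # the honest answer wins
print(winning_transcription(honest + flood))  # the flood outvotes it
```

A plurality vote has no notion of who is voting, only how many times each string shows up, which is why the fallback ends up being ordinary spam prevention.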

Given all of this, I can only suppose that the "distributed transcription" functionality of reCAPTCHA is orthogonal to its more straightforward CAPTCHA functionality, and only the latter is used to determine whether the input comes from a human or a computer. Fair enough. I think distributed transcription is a good idea. But CAPTCHAs suffer from two problems that make them a poor vehicle for this kind of work: (a) they're a primary target for (sometimes ingenious) spammers; and (b) people find them annoying. Or at least I do. Why should I have to do extra work just to prove I'm human?

The whole thing brings to mind this article in Smithsonian Magazine about not recognizing the difference between puzzles and mysteries, and jzig's post about how "smart CS majors are PARTICULARLY prone to screwing this up." I think the reCAPTCHA people might think they're dealing with a puzzle here, rather than a mystery.

(For the opposite extreme, i.e., making humans do something that a computer is better at, see Wikiclock. Via Leonard.)