Optical Character Recognition (OCR) is a crucial aspect of the new digitization economy. Google, Microsoft, and all the rest of the world use OCR to quickly create electronic texts based on digital images of books and journals. A plethora of cunning programming, spell-checkers, natural language processing and machine learning methods are incorporated into most modern OCR applications these days, and so the per-character error rates for clean pages are up around 98%.
Alas, you’re a ridiculous dreamer if you think the pages of 19th-century and earlier books are anything like the 12-point Times Roman on clean Bright White Xerox Stock that OCR software manufacturers use to reach those numbers. Old books suck. Their pages are foxed, the type is often broken or distressed, people scribble on them (the damned education students worse than any, in my experience), and as a result the OCR rates are closer to 95–96%, even where the letters are present.
You may think that’s niggling. But let’s say (conservatively) that there are ten million volumes being digitized at the moment. Maybe they’re 150 pages each, on average. That’s about a billion and half pages, each with 1000 or so characters. So over the trillion letters, there will be billions of character recognitions missed.
The essentially irreducible error rate in OCR is one of the reasons Distributed Proofreaders, the collaborative online community for proofreading OCRed texts, has come into being. There are a number of other reasons, of course, but a big part of the work we do there is fixing letters mis-recognized by OCR software.
Now when the OCR software misspells a word, things are relatively easy to fix. Indeed, many OCR packages make good use of spell-checkers to remove ridiculous non-English words from the document, and to disambiguate many simple errors. But when the OCR software mis-recognizes a word as another English word, in the DP community we call this a “stealth scanno” (as in “typo”).
This is a problem that’s often damnably difficult to catch with either software or human review. It’s spelled “right”; it’s just the wrong word.
Over the last six years lists of common stealth scannos have been developed: “he”➙”be”; “books”➙”hooks”; “care”➙”core”; “and”➙”arid”; “black”➙”blade” and so forth. Barbara tells me she recently saw “books”➙”hooka”. Recall that this isn’t just one-letter substitutions; smudged or photocopied text, diamond and other small typefaces, and poor stereotyping can all smudge the ink of letterforms so that they tend to touch and overlap.
Now there are numerous tools that have been developed by those who scan and repair digital texts, and you can find some of those through Google searches, if you know the right words. The obvious direction you might be considering for this task is a simple method that highlights all the occurrences of a stealth scanno in a text. But through the years folks have observed thousands of stealth scannos, some involving very common English words (”he”➙”be” comes to mind). What use would it be if you highlighted every occurrence of every word in a 20000-word scanned book?
Challenge: Create a system that will highlight possible scannos in a text, minimizing both false positive and false negative errors.
Consider the stealth scanno “he”➙”be”. Highlighting every occurrence of both words will surely catch every mis-recognition in a document, but also a lot of correctly recognized words. But you know, if you look at adjacent words, there might be some hope to limit the false positives: the phrase “be happy” might be mis-recognized as the scanno “he happy”, and that’s much less likely to occur in an English sentence than the correct phrase. On the other hand, both “be said” and “he said” are viable phrases; you might want to look farther afield in the text to get more information to tune those occurrences. So we can work under the assumption that local context will be helpful.
Acceptance Test: Given a list of stealth scanno substitutions, your method will be applied to a test suite of 200 English-language passages, each 500 characters in length (including punctuation). A subset of those texts will include stealth scanno substitutions from the list, while some other subset (possibly overlapping) will include correctly recognized terms from the stealth scanno list. To foster the design of contextual approaches, I’ll guarantee that errors will never occur within 50 characters of either end of the text: you get 100 free certified-correct letters in each passage!
Your method should accept a given string of 500 characters, and produce as output a “scanno confidence vector”: a vector or array of 500 floating-point numbers in the range [0.000--1.000]. Each element of this confidence vector corresponds to one letter of the input text. A value of 0.000 in position i of the vector indicates that there is definitely no scanno present at a character position i, while a value of 1.0 means that your method has determined there is absolutely for sure a scanno present in that position.
The success of your method will be evaluated on the basis of the total absolute difference between the vectors and the real locations of the stealth scannos, summed over all 500 characters (even the free ones!) and 200 passages, resulting in a final score ranging from 0.000 (100% correct predictions) to 10000.000 (entirely and exactly wrong predictions). Lower is better.
Its success will also be determined by the largest absolute error, taken over all 10000 character samples. Lower is better, in the range [0.000--1.000].
You will not be given original page scans, nor any portion of the text outside the 500-character passage. Just the output of OCR.
“Two performance measures?” you ask. Yes. Submissions which when evaluated appear on the Pareto-efficient frontier (in other words, those not dominated by any other submission on either performance measure) will receive a grade of “A”. After removing those from the set, the next-but-one Pareto-efficient frontier will receive a grade of “B”, and so on until all submissions have been compared.
Your submission should be written in a common scripting language (PHP, Python, Perl, Ruby, or possible Unix shell script), or in a high-level interpretive scientific programming language like Mathematica, Matlab, R or S-Plus. I should be able to run it on my Mac laptop, running MacOS X 10.4. Alternately, it can be run a Web Service, in which an HTML form is used to upload a 500-character text extract to a server, with a result being a 500-line page with one number on each line.

