Left as an Exercise for the Student: Stealth Scanno Detector

Opti­cal Char­ac­ter Recog­ni­tion (OCR) is a cru­cial aspect of the new dig­i­ti­za­tion econ­omy. Google, Microsoft, and all the rest of the world use OCR to quickly cre­ate elec­tronic texts based on dig­i­tal images of books and jour­nals. A plethora of cun­ning pro­gram­ming, spell-​​checkers, nat­ural lan­guage pro­cess­ing and machine learn­ing meth­ods are incor­po­rated into most mod­ern OCR appli­ca­tions these days, and so the per-​​character error rates for clean pages are up around 98%.

Alas, you’re a ridicu­lous dreamer if you think the pages of 19th-​​century and ear­lier books are any­thing like the 12-​​point Times Roman on clean Bright White Xerox Stock that OCR soft­ware man­u­fac­tur­ers use to reach those num­bers. Old books suck. Their pages are foxed, the type is often bro­ken or dis­tressed, peo­ple scrib­ble on them (the damned edu­ca­tion stu­dents worse than any, in my expe­ri­ence), and as a result the OCR rates are closer to 95–96%, even where the let­ters are present.

You may think that’s nig­gling. But let’s say (con­ser­v­a­tively) that there are ten mil­lion vol­umes being dig­i­tized at the moment. Maybe they’re 150 pages each, on aver­age. That’s about a bil­lion and half pages, each with 1000 or so char­ac­ters. So over the tril­lion let­ters, there will be bil­lions of char­ac­ter recog­ni­tions missed.

The essen­tially irre­ducible error rate in OCR is one of the rea­sons Dis­trib­uted Proof­read­ers, the col­lab­o­ra­tive online com­mu­nity for proof­read­ing OCRed texts, has come into being. There are a num­ber of other rea­sons, of course, but a big part of the work we do there is fix­ing let­ters mis-​​recognized by OCR software.

Now when the OCR soft­ware mis­spells a word, things are rel­a­tively easy to fix. Indeed, many OCR pack­ages make good use of spell-​​checkers to remove ridicu­lous non-​​English words from the doc­u­ment, and to dis­am­biguate many sim­ple errors. But when the OCR soft­ware mis-​​recognizes a word as another Eng­lish word, in the DP com­mu­nity we call this a “stealth scanno” (as in “typo”).

This is a prob­lem that’s often damnably dif­fi­cult to catch with either soft­ware or human review. It’s spelled “right”; it’s just the wrong word.

Over the last six years lists of com­mon stealth scan­nos have been devel­oped: “he”➙“be”; “books”➙“hooks”; “care”➙“core”; “and”➙“arid”; “black”➙“blade” and so forth. Bar­bara tells me she recently saw “books”➙“hooka”. Recall that this isn’t just one-​​letter sub­sti­tu­tions; smudged or pho­to­copied text, dia­mond and other small type­faces, and poor stereo­typ­ing can all smudge the ink of let­ter­forms so that they tend to touch and overlap.

Now there are numer­ous tools that have been devel­oped by those who scan and repair dig­i­tal texts, and you can find some of those through Google searches, if you know the right words. The obvi­ous direc­tion you might be con­sid­er­ing for this task is a sim­ple method that high­lights all the occur­rences of a stealth scanno in a text. But through the years folks have observed thou­sands of stealth scan­nos, some involv­ing very com­mon Eng­lish words (“he”➙“be” comes to mind). What use would it be if you high­lighted every occur­rence of every word in a 20000-​​word scanned book?

Chal­lenge: Cre­ate a sys­tem that will high­light pos­si­ble scan­nos in a text, min­i­miz­ing both false pos­i­tive and false neg­a­tive errors.

Con­sider the stealth scanno “he”➙“be”. High­light­ing every occur­rence of both words will surely catch every mis-​​recognition in a doc­u­ment, but also a lot of cor­rectly rec­og­nized words. But you know, if you look at adja­cent words, there might be some hope to limit the false pos­i­tives: the phrase “be happy” might be mis-​​recognized as the scanno “he happy”, and that’s much less likely to occur in an Eng­lish sen­tence than the cor­rect phrase. On the other hand, both “be said” and “he said” are viable phrases; you might want to look far­ther afield in the text to get more infor­ma­tion to tune those occur­rences. So we can work under the assump­tion that local con­text will be helpful.

Accep­tance Test: Given a list of stealth scanno sub­sti­tu­tions, your method will be applied to a test suite of 200 English-​​language pas­sages, each 500 char­ac­ters in length (includ­ing punc­tu­a­tion). A sub­set of those texts will include stealth scanno sub­sti­tu­tions from the list, while some other sub­set (pos­si­bly over­lap­ping) will include cor­rectly rec­og­nized terms from the stealth scanno list. To fos­ter the design of con­tex­tual approaches, I’ll guar­an­tee that errors will never occur within 50 char­ac­ters of either end of the text: you get 100 free certified-​​correct let­ters in each passage!

Your method should accept a given string of 500 char­ac­ters, and pro­duce as out­put a “scanno con­fi­dence vec­tor”: a vec­tor or array of 500 floating-​​point num­bers in the range [0.000–1.000]. Each ele­ment of this con­fi­dence vec­tor cor­re­sponds to one let­ter of the input text. A value of 0.000 in posi­tion i of the vec­tor indi­cates that there is def­i­nitely no scanno present at a char­ac­ter posi­tion i, while a value of 1.0 means that your method has deter­mined there is absolutely for sure a scanno present in that position.

The suc­cess of your method will be eval­u­ated on the basis of the total absolute dif­fer­ence between the vec­tors and the real loca­tions of the stealth scan­nos, summed over all 500 char­ac­ters (even the free ones!) and 200 pas­sages, result­ing in a final score rang­ing from 0.000 (100% cor­rect pre­dic­tions) to 10000.000 (entirely and exactly wrong pre­dic­tions). Lower is better.

Its suc­cess will also be deter­mined by the largest absolute error, taken over all 10000 char­ac­ter sam­ples. Lower is bet­ter, in the range [0.000–1.000].

You will not be given orig­i­nal page scans, nor any por­tion of the text out­side the 500-​​character pas­sage. Just the out­put of OCR.

Two per­for­mance mea­sures?” you ask. Yes. Sub­mis­sions which when eval­u­ated appear on the Pareto-​​efficient fron­tier (in other words, those not dom­i­nated by any other sub­mis­sion on either per­for­mance mea­sure) will receive a grade of “A”. After remov­ing those from the set, the next-​​but-​​one Pareto-​​efficient fron­tier will receive a grade of “B”, and so on until all sub­mis­sions have been compared.

Your sub­mis­sion should be writ­ten in a com­mon script­ing lan­guage (PHP, Python, Perl, Ruby, or pos­si­ble Unix shell script), or in a high-​​level inter­pre­tive sci­en­tific pro­gram­ming lan­guage like Math­e­mat­ica, Mat­lab, R or S-​​Plus. I should be able to run it on my Mac lap­top, run­ning MacOS X 10.4. Alter­nately, it can be run a Web Ser­vice, in which an HTML form is used to upload a 500-​​character text extract to a server, with a result being a 500-​​line page with one num­ber on each line.

Posted in Uncategorized | 1 Reply

On the dangers of spidering badly

Some­body who lives at 208.101.36.2 decided late yes­ter­day to run their l33t script they cob­bled together for their sixth-​​grade class. If that per­son actu­ally comes by to read this: U R such a n00b. Scat.

When writ­ing a spi­der­ing bot, do not attempt to seri­ally fol­low every link on a con­tent page with­out a sub­stan­tial time delay between requests. In mod­ern web­sites espe­cially, such links often rep­re­sent javascript-​​driven dis­clo­sure effects, not actual hyper­links. By spawn­ing 5760 stu­pid links in a few sec­onds, your dumb bot is not merely going to col­lect a lot of redun­dant data, but in addi­tion will seri­ously piss off the admins.

If you apply this stu­pid fast spi­der­ing tech­nique to a dynamically-​​generated site, you will find that some­thing bad will almost cer­tainly happen.

Even­tu­ally these bad things will hap­pen to you.

Thus begin­neth the lesson.