Left as an Exercise for the Student: Stealth Scanno Detector

Optical Character Recognition (OCR) is a crucial aspect of the new digitization economy. Google, Microsoft, and all the rest of the world use OCR to quickly create electronic texts based on digital images of books and journals. A plethora of cunning programming, spell-checkers, natural language processing and machine learning methods are incorporated into most modern OCR applications these days, and so the per-character accuracy rates for clean pages are up around 98%.

Alas, you’re a ridiculous dreamer if you think the pages of 19th-century and earlier books are anything like the 12-point Times Roman on clean Bright White Xerox Stock that OCR software manufacturers use to reach those numbers. Old books suck. Their pages are foxed, the type is often broken or distressed, people scribble on them (the damned education students worse than any, in my experience), and as a result the OCR accuracy rates are closer to 95–96%, even where the letters are present.

You may think that’s niggling. But let’s say (conservatively) that there are ten million volumes being digitized at the moment. Maybe they’re 150 pages each, on average. That’s about a billion and a half pages, each with 1000 or so characters. So over the trillion-plus letters, there will be billions of character recognitions missed.
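For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch. The volume, page and character counts are the rough guesses from the paragraph above; the error percentages are simply the complements of the accuracy figures quoted earlier, and the names are mine.

```python
# Rough estimate only: every input is one of the guesses quoted in the text.
VOLUMES = 10_000_000        # "ten million volumes"
PAGES_PER_VOLUME = 150      # "150 pages each, on average"
CHARS_PER_PAGE = 1000       # "1000 or so characters"

total_pages = VOLUMES * PAGES_PER_VOLUME
total_chars = total_pages * CHARS_PER_PAGE


def missed_characters(error_percent):
    """Expected number of misrecognized characters at a whole-number error percentage."""
    return total_chars * error_percent // 100


print(f"{total_pages:,} pages, {total_chars:,} characters")
for error_percent in (2, 4, 5):
    accuracy = 100 - error_percent
    print(f"{accuracy}% accurate: {missed_characters(error_percent):,} characters missed")
```

Even at the clean-page rate the count of missed characters runs to tens of billions, so "billions" is if anything an understatement.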

The essen­tially irre­ducible error rate in OCR is one of the rea­sons Dis­trib­uted Proof­read­ers, the col­lab­o­ra­tive online com­mu­nity for proof­read­ing OCRed texts, has come into being. There are a num­ber of other rea­sons, of course, but a big part of the work we do there is fix­ing let­ters mis-​​recognized by OCR software.

Now when the OCR soft­ware mis­spells a word, things are rel­a­tively easy to fix. Indeed, many OCR pack­ages make good use of spell-​​checkers to remove ridicu­lous non-​​English words from the doc­u­ment, and to dis­am­biguate many sim­ple errors. But when the OCR soft­ware mis-​​recognizes a word as another Eng­lish word, in the DP com­mu­nity we call this a “stealth scanno” (as in “typo”).

This is a prob­lem that’s often damnably dif­fi­cult to catch with either soft­ware or human review. It’s spelled “right”; it’s just the wrong word.

Over the last six years lists of com­mon stealth scan­nos have been devel­oped: “he”➙“be”; “books”➙“hooks”; “care”➙“core”; “and”➙“arid”; “black”➙“blade” and so forth. Bar­bara tells me she recently saw “books”➙“hooka”. Recall that this isn’t just one-​​letter sub­sti­tu­tions; smudged or pho­to­copied text, dia­mond and other small type­faces, and poor stereo­typ­ing can all smudge the ink of let­ter­forms so that they tend to touch and overlap.

Now there are numer­ous tools that have been devel­oped by those who scan and repair dig­i­tal texts, and you can find some of those through Google searches, if you know the right words. The obvi­ous direc­tion you might be con­sid­er­ing for this task is a sim­ple method that high­lights all the occur­rences of a stealth scanno in a text. But through the years folks have observed thou­sands of stealth scan­nos, some involv­ing very com­mon Eng­lish words (“he”➙“be” comes to mind). What use would it be if you high­lighted every occur­rence of every word in a 20000-​​word scanned book?

Chal­lenge: Cre­ate a sys­tem that will high­light pos­si­ble scan­nos in a text, min­i­miz­ing both false pos­i­tive and false neg­a­tive errors.

Con­sider the stealth scanno “he”➙“be”. High­light­ing every occur­rence of both words will surely catch every mis-​​recognition in a doc­u­ment, but also a lot of cor­rectly rec­og­nized words. But you know, if you look at adja­cent words, there might be some hope to limit the false pos­i­tives: the phrase “be happy” might be mis-​​recognized as the scanno “he happy”, and that’s much less likely to occur in an Eng­lish sen­tence than the cor­rect phrase. On the other hand, both “be said” and “he said” are viable phrases; you might want to look far­ther afield in the text to get more infor­ma­tion to tune those occur­rences. So we can work under the assump­tion that local con­text will be helpful.
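To make the local-context idea concrete, here is a minimal sketch, not a recommended solution. It assumes you have bigram counts from some trusted corpus; the counts in `BIGRAMS` below are invented purely for illustration, and the function name `suspicion` and the tiny `SCANNOS` table are hypothetical.

```python
from collections import Counter

# Invented bigram counts, for illustration only. A real system would gather
# these from a large body of correctly transcribed text.
BIGRAMS = Counter({
    ("be", "happy"): 120, ("he", "happy"): 1,
    ("he", "said"): 900, ("be", "said"): 200,
    ("to", "be"): 1500, ("to", "he"): 2,
})

# Each word as it appears in the OCR output, mapped to what it may really have been.
SCANNOS = {"he": "be", "be": "he"}


def suspicion(prev_word, word, next_word, bigrams=BIGRAMS, scannos=SCANNOS):
    """Return a value in [0, 1]: how much better the alternative word fits its neighbours."""
    alternative = scannos.get(word)
    if alternative is None:
        return 0.0  # not on the stealth scanno list at all

    def fit(candidate):
        # Add-one smoothing, so an unseen bigram lowers the fit without zeroing it.
        left = bigrams[(prev_word, candidate)] + 1
        right = bigrams[(candidate, next_word)] + 1
        return left * right

    seen, other = fit(word), fit(alternative)
    return other / (seen + other)


print(suspicion("will", "he", "happy"))  # high: "be happy" fits far better
print(suspicion("then", "he", "said"))   # low: "he said" is perfectly viable
```

With these made-up counts, "he happy" scores close to 1 while "he said" stays low, which is the behaviour the paragraph above hopes for. Cases like "be said" versus "he said" still need the wider context this sketch ignores.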

Accep­tance Test: Given a list of stealth scanno sub­sti­tu­tions, your method will be applied to a test suite of 200 English-​​language pas­sages, each 500 char­ac­ters in length (includ­ing punc­tu­a­tion). A sub­set of those texts will include stealth scanno sub­sti­tu­tions from the list, while some other sub­set (pos­si­bly over­lap­ping) will include cor­rectly rec­og­nized terms from the stealth scanno list. To fos­ter the design of con­tex­tual approaches, I’ll guar­an­tee that errors will never occur within 50 char­ac­ters of either end of the text: you get 100 free certified-​​correct let­ters in each passage!

Your method should accept a given string of 500 characters, and produce as output a “scanno confidence vector”: a vector or array of 500 floating-point numbers in the range [0.000–1.000]. Each element of this confidence vector corresponds to one character of the input text. A value of 0.000 in position i of the vector indicates that there is definitely no scanno present at character position i, while a value of 1.000 means that your method has determined there is absolutely for sure a scanno present in that position.
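As a sketch of the expected output shape, the function below spreads a per-word score across that word's character positions. Two assumptions here are mine, not part of the challenge: that a suspect word is flagged over all of its letters, and that scoring happens one word at a time through a caller-supplied `word_score` function (a hypothetical name).

```python
import re


def confidence_vector(text, word_score):
    """Return one float per character of text, each clamped to [0.0, 1.0]."""
    vector = [0.0] * len(text)  # spaces and punctuation stay at 0.0
    for match in re.finditer(r"[A-Za-z]+", text):
        score = word_score(match.group().lower())
        score = min(1.0, max(0.0, score))
        for i in range(match.start(), match.end()):
            vector[i] = score
    return vector


# Toy usage: pretend "arid" is the only suspicious word.
example = confidence_vector("he arid be", lambda w: 0.75 if w == "arid" else 0.0)
print(example)
```

For a 500-character passage the result has exactly 500 entries, one per character, whatever the scoring function does.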

The success of your method will be evaluated on the basis of the total absolute difference between the vectors and the real locations of the stealth scannos, summed over all 500 characters (even the free ones!) and 200 passages, resulting in a final score ranging from 0.000 (100% correct predictions) to 100000.000 (entirely and exactly wrong predictions). Lower is better.

Its success will also be determined by the largest absolute error, taken over all 100000 character samples. Lower is better, in the range [0.000–1.000].
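Both measures are easy to compute once you have the truth vectors. The sketch below assumes the truth for each passage is a list of 0s and 1s marking where the scannos really are; the function name `evaluate` is mine.

```python
def evaluate(predictions, truths):
    """Return (total absolute error, largest absolute error) over all passages.

    predictions and truths are parallel lists of equal-length vectors,
    one vector per passage.
    """
    total = 0.0
    worst = 0.0
    for predicted, truth in zip(predictions, truths):
        for p, t in zip(predicted, truth):
            error = abs(p - t)
            total += error
            worst = max(worst, error)
    return total, worst


# One three-character "passage": the only mistake is 0.5 where 1 was expected.
print(evaluate([[0.0, 1.0, 0.5]], [[0, 1, 1]]))
```

An entirely and exactly wrong submission picks up one full point at every character position of every passage, and its largest single error is 1.0.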

You will not be given orig­i­nal page scans, nor any por­tion of the text out­side the 500-​​character pas­sage. Just the out­put of OCR.

“Two performance measures?” you ask. Yes. Submissions which when evaluated appear on the Pareto-efficient frontier (in other words, those for which no other submission is at least as good on both performance measures and strictly better on one) will receive a grade of “A”. After removing those from the set, the next Pareto-efficient frontier will receive a grade of “B”, and so on until all submissions have been compared.
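The grading scheme amounts to peeling off successive Pareto frontiers. Here is a minimal sketch, assuming each submission is summarized as a pair (total error, largest error) with lower being better on both; the function names and the sample scores are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on both measures and strictly better on one."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


def grade(scores):
    """Map each submission name to 'A', 'B', ... by peeling off successive frontiers."""
    remaining = dict(scores)
    grades = {}
    letter = ord("A")
    while remaining:
        # The current frontier: everything no other remaining submission dominates.
        frontier = [
            name for name, s in remaining.items()
            if not any(dominates(other, s) for other in remaining.values())
        ]
        for name in frontier:
            grades[name] = chr(letter)
            del remaining[name]
        letter += 1  # a sketch: no handling for running past 'Z'
    return grades


print(grade({
    "x": (10.0, 0.90),   # best total error
    "y": (20.0, 0.50),   # best largest error
    "z": (25.0, 0.95),   # beaten by x on both measures
    "w": (30.0, 0.99),   # beaten by everyone
}))
```

Here x and y share the first frontier because each wins on one measure; z only reaches a frontier once x is removed, and w once z is.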

Your submission should be written in a common scripting language (PHP, Python, Perl, Ruby, or possibly a Unix shell script), or in a high-level interpretive scientific programming language like Mathematica, Matlab, R or S-Plus. I should be able to run it on my Mac laptop, running MacOS X 10.4. Alternately, it can be run as a Web Service, in which an HTML form is used to upload a 500-character text extract to a server, with a result being a 500-line page with one number on each line.


On the dangers of spidering badly

Some­body who lives at 208.101.36.2 decided late yes­ter­day to run their l33t script they cob­bled together for their sixth-​​grade class. If that per­son actu­ally comes by to read this: U R such a n00b. Scat.

When writ­ing a spi­der­ing bot, do not attempt to seri­ally fol­low every link on a con­tent page with­out a sub­stan­tial time delay between requests. In mod­ern web­sites espe­cially, such links often rep­re­sent javascript-​​driven dis­clo­sure effects, not actual hyper­links. By spawn­ing 5760 stu­pid links in a few sec­onds, your dumb bot is not merely going to col­lect a lot of redun­dant data, but in addi­tion will seri­ously piss off the admins.
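For the record, the polite version is not hard. Below is a minimal sketch of the two habits that matter here: never request the same address twice, and pause between requests. The five-second default is an arbitrary assumption of mine, and `fetch` is a placeholder for whatever download function you actually use.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each distinct URL once, in order, pausing between requests."""
    seen = set()
    pages = {}
    for url in urls:
        if url in seen:
            continue              # skip addresses already requested
        if seen:
            sleep(delay_seconds)  # wait before every request after the first
        seen.add(url)
        pages[url] = fetch(url)
    return pages


# Toy usage with a fake fetcher and a recorded "sleep", so nothing touches the network.
pauses = []
pages = crawl_politely(["a", "b", "a", "c"], fetch=str.upper,
                       delay_seconds=2.0, sleep=pauses.append)
print(pages, pauses)
```

A real bot should also honor the site's robots.txt before fetching anything, which this sketch leaves out.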

If you apply this stu­pid fast spi­der­ing tech­nique to a dynamically-​​generated site, you will find that some­thing bad will almost cer­tainly happen.

Even­tu­ally these bad things will hap­pen to you.

Thus begin­neth the lesson.

When there’s nothing to do at work

Michael Feath­ers presents an inter­est­ing man­age­r­ial anec­dote in “Send­ing Peo­ple Home”:

There are some ideas that you can com­mu­ni­cate effec­tively in orga­ni­za­tions and oth­ers that you can’t. In this case, there was clearly a man­ager who knew that twenty peo­ple couldn’t work on the build at once, and that there was noth­ing pro­duc­tive the other peo­ple could do. But, the fact that he knew it didn’t help.

Wolfram’s “Open Conference” with over-​​broad Nondisclosure Agreement

I find myself, hav­ing dri­ven eight hours to Cham­paign for two days of classes and a pre­sen­ta­tion at the annual tech­ni­cal con­fer­ence at Wol­fram Research, and spent the night in a smelly old motel, handed a slip of paper to sign. I sud­denly find that I need to learn to read bet­ter — that reg­is­ter­ing for a con­fer­ence entails sign­ing away my rights to dis­cuss the conference.

I sup­pose I must hang out with a dif­fer­ent crowd. A more col­le­gial, aca­d­e­mic, col­lab­o­ra­tive crowd. One that doesn’t have ridicu­lous lawyers doing things like this.

On the one hand, we have the thing that pushed me into agreeing to come here over the two other (at least equally interesting) conferences ongoing this week:

The Wol­fram Tech­nol­ogy Con­fer­ence 2006 will be an open forum for all to see how Math­e­mat­ica has matured and how it con­tin­ues to grow. Pro­fes­sion­als, stu­dents, edu­ca­tors, and oth­ers inter­ested in the future of tech­ni­cal com­put­ing are encour­aged to attend.

On the other hand, we have the fol­low­ing, which is reached through one small-​​font link on the reg­is­tra­tion page:

This Con­fi­den­tial­ity and Non-​​Disclosure Agree­ment (“Agree­ment”) is made by and between Wol­fram Research, Inc. (“WRI”), and the under­signed indi­vid­ual (“Recipient”).

Dur­ing Recipient’s atten­dance at the Wol­fram Tech­nol­ogy Con­fer­ence 2006, WRI may dis­close, and Recip­i­ent may have access to, cer­tain Con­fi­den­tial Infor­ma­tion includ­ing future Wol­fram Research tech­nol­ogy. In order to ensure that such infor­ma­tion remains con­fi­den­tial, Recip­i­ent agrees as follows:

1. For pur­poses of this Agree­ment, “Con­fi­den­tial Infor­ma­tion” shall mean infor­ma­tion or mate­r­ial des­ig­nated as Con­fi­den­tial or Inter­nal Infor­ma­tion by WRI, or not gen­er­ally known by non-​​WRI per­son­nel, that Recip­i­ent obtains knowl­edge of or access to dur­ing the Wol­fram Tech­nol­ogy Con­fer­ence 2006.

2. Recip­i­ent agrees to hold in con­fi­dence and not directly or indi­rectly reveal, report, pub­lish, dis­close, export, or trans­fer any Con­fi­den­tial Infor­ma­tion to any per­son or entity, or uti­lize any of the Con­fi­den­tial Infor­ma­tion for any pur­pose that is not expressly approved by WRI in writing.

3. Con­fi­den­tial Infor­ma­tion is and shall remain the sole prop­erty of WRI, and no licenses or pro­pri­etary rights are granted hereby to Recip­i­ent. Recip­i­ent agrees not to remove from premises or to repro­duce any Con­fi­den­tial Infor­ma­tion with­out the express writ­ten con­sent of WRI.

4. Recip­i­ent acknowl­edges that the Con­fi­den­tial Infor­ma­tion is a spe­cial, valu­able, and unique asset of WRI. Because of the unique nature of the infor­ma­tion, Recip­i­ent under­stands and agrees that WRI will suf­fer irrepara­ble harm in the event that Recip­i­ent fails to com­ply with any of Recipient’s oblig­a­tions herein and that mon­e­tary dam­ages will be inad­e­quate to com­pen­sate WRI for such breach. Accord­ingly, Recip­i­ent agrees that WRI will, in addi­tion to any other reme­dies avail­able to it at law or in equity, be enti­tled to injunc­tive relief to enforce the terms of the Agreement.

5. This Agree­ment shall be gov­erned by, and inter­preted in accord with, the laws of the State of Illi­nois, United States.

6. The enforce­abil­ity of any pro­vi­sion of this Agree­ment shall not impair or affect any other pro­vi­sion. The fail­ure of WRI to enforce at any time one or more pro­vi­sions hereof shall not be con­strued as a waiver of any or all provisions.

7. This Agree­ment con­tains the full and com­plete under­stand­ing of the par­ties with respect to the sub­ject mat­ter hereof and super­sedes all prior rep­re­sen­ta­tions and under­stand­ings, whether oral or writ­ten. This Agree­ment may not be mod­i­fied or amended unless done in writ­ing, specif­i­cally stat­ing the addi­tions or changes, and signed by both parties.

IN WITNESS WHEREOF, Recip­i­ent hereby vol­un­tar­ily exe­cutes this Agree­ment as of the date below:

No clause spec­i­fies what the Con­fi­den­tial Infor­ma­tion is. In other words, this is a blan­ket nondis­clo­sure cov­er­ing every utter­ance made by any per­son at the con­fer­ence. No blog­ging, no tele­phone descrip­tions, no rec­om­men­da­tions, no unveiled enthu­si­as­tic acco­lades over their won­der­ful software.

“Open”. I like “open forum” especially, in this context.

Now, I’m sure that the sen­ti­ment that inspires this — beyond the cor­po­rate cul­ture — is one of fos­ter­ing col­le­gial com­fort among the par­tic­i­pants, and spread­ing the word on tech­ni­cal aspects of Math­e­mat­ica 6.0 and other new­fan­gled things. In other words, I bet they think it makes it more “pro­fes­sional” to do it this way. I know a lot of entre­pre­neur­ial engi­neer­ing and sci­en­tific types who imag­ine that mak­ing poten­tial cus­tomers sign an NDA is a sign of seri­ous­ness and dili­gence, rather than ama­teur­ish self-​​importance and amus­ing hubris.

I’m OK with that. I like hubris, to a point. But I’ve been to a lot of con­fer­ences, through the years. The most expen­sive — the ones I paid the most to attend — all had explicit state­ments pro­hibit­ing sub­mis­sions with nondis­clo­sures. Dis­al­low­ing. Con­fer­ence implies talk­ing, pass­ing it along, enthu­si­as­tic col­lab­o­ra­tion: in a word, “dis­clo­sure”. “Nondis­clo­sure”, for me, implies sales event. Same sense I get from the peo­ple who want to give you a free lunch to “come watch a video” at their condos.

I just need to learn to read bet­ter, is all. There is that lit­tle link, right there, in plain sight. Who would even con­sider sign­ing up for a con­fer­ence, and pay to attend, and travel to get there, with­out read­ing every damned word on the reg­is­tra­tion form includ­ing the hyper­links? Only a fool, surely. Not the sort of fool one wants attend­ing such an event.

There are two sure-​​fire ways to keep a secret. One is to keep your mouth shut about it. The other is to so com­pletely alien­ate any­body who has any inter­est in it that they give up on try­ing to lis­ten to you.

I’ll sleep on it, and let you know how this plays out. Might turn out to be a good day for antiquing on the way home, tomorrow.…

Gradual Unveiling #5: Ten valid cultural risks of agile teamwork

All the fol­low­ing are worse, in the Acad­emy. Keep that in mind. All worse.

The Cur­mud­geon Coder:

Being a self-proclaimed Agile Advocate I seem to find myself in discussions regarding the bad points about agile. Books, articles, and talks on the subject of agile always paint the rosy happy story about using agile. I’m no fool, and I realize that things aren’t quite as happy as some people make it out to be. No one said that agile was a silver bullet. The reason that I’m an advocate for it is because I believe it is simply a better way to write software. Let’s get down to the meat of things. What is the “Dark” side of agile?

(Via O’Reilly ONlamp.)