Notional Slurry Logo

Archive for October, 2006

Away for a bit

Family medical emergency.

Update, Nov 7 2006: It’s difficult, especially when there are loved ones gravely ill, to find the time to pay the bills and feed yourself, let alone opine on ephemeral public affairs or join in with the self-created world of academic discourse.

I had the opportunity to attend and present what turned out to be a slapdash talk at a fascinating conference in Chicago these last three days, but in light of family stresses I’m afraid it played out less as an opportunity and more as a distraction from Real Life. I will try to recount and respond to the many fascinating conversations I had — it’s my professional responsibility, of course. It’s what I do for work, this conference-attending, this collegial engagement and gentle interdisciplinary academic prodding, this insistence on thoughtfulness among my colleagues and betters in the world.

But insofar as I need to be away from work to care for family, it will be a while before I’m able to thoughtfully respond and recount here. In the meantime, I’m hopeful that some of my new correspondents might stumble along to my blog (Google will send them). So let me add some links to previous work that still applies, and that may help frame things I’ve said for those I met in person these last days, who in most cases I’m sure are not regular blog readers:

There are others. I’m busy, I’m ragged, but I’ll be back.

One for the Language Log crowd

Barbara recounts a poem encountered in a magazine I scanned the other day:

The long-winded speaker, he spoke;
The poor office seeker, he soke;
The runner, he ran;
The dunner, he dan;
And the shrieker, he horribly shroke.

Gradual Unveiling #6: Extending Fowler’s “New Methodology”

For one of the best overviews of and introductions to the principles and state of the culture of Agile Software Development, with some ventures into Agile Product Design and Management (why is all this stuff Capitalized?), I recommend Martin Fowler’s The New Methodology.

Read it, thinking of the Academy. Of scientific and numerical work — not programming as such, and not product design surely. Research. Exploration. Discovery. The steady cycle of adaptive thinking and doing. Keyword: “adaptive”.

Carnival of the Agilists for October 06

At silk and spinach: “carnival of the agilists, 19-oct-06”.

Influential data visualizations

A collection of graphs, charts, and other visualizations of ideas and relationships, in draft and published form. Some impressive examples, which impress in many cases not just because of face value, but back story and implication as well:

All 1,943 Cornell Faculty were asked to respond to the following question:

Of the many charts (graph, map, diagram, table and “other”) you have seen in your life, which has been the most important, remarkable, meaningful or valuable?

On the archival paper provided, they were asked to create a copy of the chart and in the remaining space annotate notable attribute of the data and the image, describe what they remembered about first seeing this image and comment on why they chose this image.

(Via Information Aesthetics.)

Left as an Exercise for the Student: Stealth Scanno Detector

Optical Character Recognition (OCR) is a crucial aspect of the new digitization economy. Google, Microsoft, and all the rest of the world use OCR to quickly create electronic texts based on digital images of books and journals. A plethora of cunning programming, spell-checkers, natural language processing and machine learning methods are incorporated into most modern OCR applications, and so the per-character accuracy rates for clean pages are up around 98%.

Alas, you’re a ridiculous dreamer if you think the pages of 19th-century and earlier books are anything like the 12-point Times Roman on clean Bright White Xerox Stock that OCR software manufacturers use to reach those numbers. Old books suck. Their pages are foxed, the type is often broken or distressed, people scribble on them (the damned education students worse than any, in my experience), and as a result the OCR accuracy rates are closer to 95–96%, even where the letters are present.

You may think that’s niggling. But let’s say (conservatively) that there are ten million volumes being digitized at the moment. Maybe they’re 150 pages each, on average. That’s about a billion and a half pages, each with 1000 or so characters. So across roughly one and a half trillion letters, tens of billions of characters will be mis-recognized.
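For anybody who wants to check the arithmetic, here is a quick back-of-envelope sketch in Python. It uses only the round figures quoted above; the function name `expected_errors` is made up for this illustration, and the accuracy percentages are the rough ones already mentioned, not measurements.

```python
# Back-of-envelope estimate using the round figures from the text above.
volumes = 10_000_000        # "ten million volumes"
pages_per_volume = 150      # "150 pages each, on average"
chars_per_page = 1_000      # "1000 or so characters"

pages = volumes * pages_per_volume
characters = pages * chars_per_page


def expected_errors(total_chars, accuracy_percent):
    """Expected count of mis-recognized characters at a whole-number accuracy percentage."""
    return total_chars * (100 - accuracy_percent) // 100


for accuracy_percent in (98, 96, 95):
    errors = expected_errors(characters, accuracy_percent)
    print(f"{accuracy_percent}% accurate: about {errors:,} bad characters")
```

Even at the clean-page figure of 98%, that is tens of billions of wrong characters.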

The essentially irreducible error rate in OCR is one of the reasons Distributed Proofreaders, the collaborative online community for proofreading OCRed texts, has come into being. There are a number of other reasons, of course, but a big part of the work we do there is fixing letters mis-recognized by OCR software.

Now when the OCR software misspells a word, things are relatively easy to fix. Indeed, many OCR packages make good use of spell-checkers to remove ridiculous non-English words from the document, and to disambiguate many simple errors. But when the OCR software mis-recognizes a word as another English word, in the DP community we call this a “stealth scanno” (as in “typo”).

This is a problem that’s often damnably difficult to catch with either software or human review. It’s spelled “right”; it’s just the wrong word.

Over the last six years lists of common stealth scannos have been developed: “he”➙“be”; “books”➙“hooks”; “care”➙“core”; “and”➙“arid”; “black”➙“blade” and so forth. Barbara tells me she recently saw “books”➙“hooka”. Note that these aren’t just one-letter substitutions; smudged or photocopied text, diamond and other small typefaces, and poor stereotyping can all smudge the ink of letterforms so that they tend to touch and overlap.

Now there are numerous tools that have been developed by those who scan and repair digital texts, and you can find some of those through Google searches, if you know the right words. The obvious direction you might be considering for this task is a simple method that highlights all the occurrences of a stealth scanno in a text. But through the years folks have observed thousands of stealth scannos, some involving very common English words (“he”➙“be” comes to mind). What use would it be if you highlighted every occurrence of every word in a 20000-word scanned book?

Challenge: Create a system that will highlight possible scannos in a text, minimizing both false positive and false negative errors.

Consider the stealth scanno “he”➙“be”. Highlighting every occurrence of both words will surely catch every mis-recognition in a document, but also a lot of correctly recognized words. But you know, if you look at adjacent words, there might be some hope to limit the false positives: the phrase “be happy” might be mis-recognized as the scanno “he happy”, and that’s much less likely to occur in an English sentence than the correct phrase. On the other hand, both “be said” and “he said” are viable phrases; you might want to look farther afield in the text to get more information to tune those occurrences. So we can work under the assumption that local context will be helpful.
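To make the local-context idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the bigram counts are invented (a real attempt would estimate them from a large body of clean text), and the names `context_count` and `scanno_suspicion` are hypothetical. It is one possible starting point, not a solution to the challenge.

```python
# Invented bigram counts, for illustration only. Real counts would come
# from a large corpus of correctly transcribed English.
BIGRAM_COUNTS = {
    ("be", "happy"): 900,
    ("he", "happy"): 2,
    ("be", "said"): 40,
    ("he", "said"): 5000,
    ("to", "be"): 8000,
    ("to", "he"): 1,
}


def context_count(prev_word, word, next_word, counts=BIGRAM_COUNTS):
    """Add-one-smoothed evidence for `word` given its left and right neighbours."""
    left = counts.get((prev_word, word), 0) + 1
    right = counts.get((word, next_word), 0) + 1
    return left * right


def scanno_suspicion(prev_word, word, next_word, alternative, counts=BIGRAM_COUNTS):
    """Value between 0 and 1: how much better `alternative` fits this context than `word`."""
    seen = context_count(prev_word, word, next_word, counts)
    other = context_count(prev_word, alternative, next_word, counts)
    return other / (seen + other)


# "to he happy" looks very suspicious; "then he said" does not.
print(scanno_suspicion("to", "he", "happy", alternative="be"))
print(scanno_suspicion("then", "he", "said", alternative="be"))
```

With only two neighbouring words this cannot separate “be said” from “he said” reliably, which is exactly why looking farther afield is worth considering.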

Acceptance Test: Given a list of stealth scanno substitutions, your method will be applied to a test suite of 200 English-language passages, each 500 characters in length (including punctuation). A subset of those texts will include stealth scanno substitutions from the list, while some other subset (possibly overlapping) will include correctly recognized terms from the stealth scanno list. To foster the design of contextual approaches, I’ll guarantee that errors will never occur within 50 characters of either end of the text: you get 100 free certified-correct letters in each passage!

Your method should accept a given string of 500 characters, and produce as output a “scanno confidence vector”: a vector or array of 500 floating-point numbers in the range [0.000, 1.000]. Each element of this confidence vector corresponds to one character of the input text. A value of 0.000 in position i of the vector indicates that there is definitely no scanno present at character position i, while a value of 1.000 means that your method has determined there is absolutely for sure a scanno present in that position.
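The shape of the expected output can be sketched as follows. This is only a skeleton under stated assumptions: `suspects` is a hypothetical set of lowercase words drawn from the scanno list, `score` is a hypothetical callable you would supply, and the choice to skip any word that touches the 50-character margins is my reading of the guarantee above, not part of the specification.

```python
import re


def confidence_vector(passage, suspects, score=None, free_margin=50):
    """Return one float in [0, 1] per character of `passage`.

    `suspects`: set of lowercase words that appear in the scanno list.
    `score`: hypothetical callable (words, index) -> float; when omitted,
    every suspect word simply gets a flat 0.5.
    """
    vector = [0.0] * len(passage)
    matches = list(re.finditer(r"[A-Za-z]+", passage))
    words = [m.group().lower() for m in matches]
    for index, match in enumerate(matches):
        if words[index] not in suspects:
            continue
        # Assumption: words touching the certified-correct margins are left at 0.
        if match.start() < free_margin or match.end() > len(passage) - free_margin:
            continue
        value = 0.5 if score is None else score(words, index)
        value = min(1.0, max(0.0, value))  # clamp into the allowed range
        for position in range(match.start(), match.end()):
            vector[position] = value
    return vector
```

Called on a 500-character passage it returns a 500-element list, which can be printed one number per line.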

The success of your method will be evaluated on the basis of the total absolute difference between the vectors and the real locations of the stealth scannos, summed over all 500 characters (even the free ones!) and 200 passages, resulting in a final score ranging from 0.000 (100% correct predictions) to 100000.000 (entirely and exactly wrong predictions). Lower is better.

Its success will also be determined by the largest absolute error, taken over all 100000 character samples. Lower is better, in the range [0.000, 1.000].
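For the avoidance of doubt, the two measures can be sketched like this. The function names are hypothetical, and I am assuming the “real locations” are supplied as vectors of 0s and 1s with the same shape as the predictions.

```python
def total_absolute_error(predicted, truth):
    """Sum of |prediction - truth| over every character of every passage."""
    return sum(
        abs(p - t)
        for predicted_vector, truth_vector in zip(predicted, truth)
        for p, t in zip(predicted_vector, truth_vector)
    )


def largest_absolute_error(predicted, truth):
    """Largest single |prediction - truth| over every character of every passage."""
    return max(
        abs(p - t)
        for predicted_vector, truth_vector in zip(predicted, truth)
        for p, t in zip(predicted_vector, truth_vector)
    )


# Toy example: two "passages" of three characters each.
predicted = [[0.0, 0.5, 1.0], [0.25, 0.0, 0.0]]
truth = [[0, 1, 1], [0, 0, 1]]
print(total_absolute_error(predicted, truth))    # 1.75
print(largest_absolute_error(predicted, truth))  # 1.0
```

Note how a timid method that outputs 0.5 everywhere keeps its largest error small but pays on the total, which is why the two measures pull in different directions.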

You will not be given original page scans, nor any portion of the text outside the 500-character passage. Just the output of OCR.

“Two performance measures?” you ask. Yes. Submissions which when evaluated appear on the Pareto-efficient frontier (in other words, those for which no other submission does at least as well on both performance measures and strictly better on one) will receive a grade of “A”. After removing those from the set, the next Pareto-efficient frontier will receive a grade of “B”, and so on until all submissions have been graded.
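Here is a small sketch of how that frontier-peeling could be done, assuming each submission is reduced to a pair (total error, largest error) with lower being better on both. The names `dominates` and `grade_submissions` and the sample scores are invented for illustration, and capping grades at “Z” is my own arbitrary choice.

```python
def dominates(a, b):
    """True when `a` is at least as good as `b` on both measures and strictly better on one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def grade_submissions(scores):
    """scores: dict name -> (total_error, largest_error). Returns dict name -> letter grade."""
    remaining = dict(scores)
    grades = {}
    rank = 0
    while remaining:
        # The current frontier: submissions that nothing still in the pool dominates.
        frontier = [
            name
            for name, score in remaining.items()
            if not any(dominates(other, score) for other in remaining.values())
        ]
        letter = chr(ord("A") + min(rank, 25))
        for name in frontier:
            grades[name] = letter
            del remaining[name]
        rank += 1
    return grades


example = {"p": (10.0, 0.9), "q": (20.0, 0.5), "r": (25.0, 0.95), "s": (30.0, 1.0)}
print(grade_submissions(example))
```

In the example, “p” and “q” each win on one measure, so both sit on the first frontier; “r” is beaten on both measures by “p”, and “s” by “r”.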

Your submission should be written in a common scripting language (PHP, Python, Perl, Ruby, or possibly Unix shell script), or in a high-level interpreted scientific programming language like Mathematica, Matlab, R or S-Plus. I should be able to run it on my Mac laptop, running MacOS X 10.4. Alternatively, it can be run as a Web Service, in which an HTML form is used to upload a 500-character text extract to a server, with the result being a 500-line page with one number on each line.

On the dangers of spidering badly

Somebody who lives at 208.101.36.2 decided late yesterday to run their l33t script they cobbled together for their sixth-grade class. If that person actually comes by to read this: U R such a n00b. Scat.

When writing a spidering bot, do not attempt to serially follow every link on a content page without a substantial time delay between requests. In modern websites especially, such links often represent javascript-driven disclosure effects, not actual hyperlinks. By spawning 5760 stupid links in a few seconds, your dumb bot is not merely going to collect a lot of redundant data, but in addition will seriously piss off the admins.
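For the benefit of the next sixth-grader, a rough sketch of the less obnoxious approach, in Python. The assumptions are mine: `fetch` is a function you supply that takes a URL and returns HTML text, the ten-second default delay is an arbitrary number rather than any standard, and the names `real_links` and `polite_crawl` are invented here. A real bot should also honor the site's robots.txt, which this sketch does not do.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def real_links(page_url, html):
    """Absolute http(s) links only; javascript: pseudo-links and same-page anchors are dropped."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute != page_url and absolute not in links:
            links.append(absolute)
    return links


def polite_crawl(start_url, fetch, delay_seconds=10.0, max_pages=20, sleep=time.sleep):
    """Breadth-first crawl that pauses between requests and never refetches a URL."""
    queue, seen, pages = [start_url], {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if pages:
            sleep(delay_seconds)  # wait before every request after the first
        pages[url] = fetch(url)
        for link in real_links(url, pages[url]):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The `sleep` parameter exists only so the pause can be swapped out when trying the sketch without actually waiting.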

If you apply this stupid fast spidering technique to a dynamically-generated site, you will find that something bad will almost certainly happen.

Eventually these bad things will happen to you.

Thus beginneth the lesson.

When there’s nothing to do at work

Michael Feathers presents an interesting managerial anecdote in “Sending People Home”:

There are some ideas that you can communicate effectively in organizations and others that you can’t. In this case, there was clearly a manager who knew that twenty people couldn’t work on the build at once, and that there was nothing productive the other people could do. But, the fact that he knew it didn’t help.

Wolfram’s “Open Conference” with over-broad Nondisclosure Agreement

I find myself, having driven eight hours to Champaign for two days of classes and a presentation at the annual technical conference at Wolfram Research, and spent the night in a smelly old motel, handed a slip of paper to sign. I suddenly find that I need to learn to read better — that registering for a conference entails signing away my rights to discuss the conference.

I suppose I must hang out with a different crowd. A more collegial, academic, collaborative crowd. One that doesn’t have ridiculous lawyers doing things like this.

On the one hand, we have the thing that pushed me into agreeing to come here over the two other (at least equally interesting) conferences ongoing this week:

The Wolfram Technology Conference 2006 will be an open forum for all to see how Mathematica has matured and how it continues to grow. Professionals, students, educators, and others interested in the future of technical computing are encouraged to attend.

On the other hand, we have the following, which is reached through one small-font link on the registration page:

This Confidentiality and Non-Disclosure Agreement (“Agreement”) is made by and between Wolfram Research, Inc. (“WRI”), and the undersigned individual (“Recipient”).

During Recipient’s attendance at the Wolfram Technology Conference 2006, WRI may disclose, and Recipient may have access to, certain Confidential Information including future Wolfram Research technology. In order to ensure that such information remains confidential, Recipient agrees as follows:

1. For purposes of this Agreement, “Confidential Information” shall mean information or material designated as Confidential or Internal Information by WRI, or not generally known by non-WRI personnel, that Recipient obtains knowledge of or access to during the Wolfram Technology Conference 2006.

2. Recipient agrees to hold in confidence and not directly or indirectly reveal, report, publish, disclose, export, or transfer any Confidential Information to any person or entity, or utilize any of the Confidential Information for any purpose that is not expressly approved by WRI in writing.

3. Confidential Information is and shall remain the sole property of WRI, and no licenses or proprietary rights are granted hereby to Recipient. Recipient agrees not to remove from premises or to reproduce any Confidential Information without the express written consent of WRI.

4. Recipient acknowledges that the Confidential Information is a special, valuable, and unique asset of WRI. Because of the unique nature of the information, Recipient understands and agrees that WRI will suffer irreparable harm in the event that Recipient fails to comply with any of Recipient’s obligations herein and that monetary damages will be inadequate to compensate WRI for such breach. Accordingly, Recipient agrees that WRI will, in addition to any other remedies available to it at law or in equity, be entitled to injunctive relief to enforce the terms of the Agreement.

5. This Agreement shall be governed by, and interpreted in accord with, the laws of the State of Illinois, United States.

6. The enforceability of any provision of this Agreement shall not impair or affect any other provision. The failure of WRI to enforce at any time one or more provisions hereof shall not be construed as a waiver of any or all provisions.

7. This Agreement contains the full and complete understanding of the parties with respect to the subject matter hereof and supersedes all prior representations and understandings, whether oral or written. This Agreement may not be modified or amended unless done in writing, specifically stating the additions or changes, and signed by both parties.

IN WITNESS WHEREOF, Recipient hereby voluntarily executes this Agreement as of the date below:

No clause specifies what the Confidential Information is. In other words, this is a blanket nondisclosure covering every utterance made by any person at the conference. No blogging, no telephone descriptions, no recommendations, no unveiled enthusiastic accolades over their wonderful software.

“Open”. I like “open forum” especially, in this context.

Now, I’m sure that the sentiment that inspires this — beyond the corporate culture — is one of fostering collegial comfort among the participants, and spreading the word on technical aspects of Mathematica 6.0 and other newfangled things. In other words, I bet they think it makes it more “professional” to do it this way. I know a lot of entrepreneurial engineering and scientific types who imagine that making potential customers sign an NDA is a sign of seriousness and diligence, rather than amateurish self-importance and amusing hubris.

I’m OK with that. I like hubris, to a point. But I’ve been to a lot of conferences, through the years. The most expensive — the ones I paid the most to attend — all had explicit statements prohibiting submissions with nondisclosures. Disallowing. Conference implies talking, passing it along, enthusiastic collaboration: in a word, “disclosure”. “Nondisclosure”, for me, implies sales event. Same sense I get from the people who want to give you a free lunch to “come watch a video” at their condos.

I just need to learn to read better, is all. There is that little link, right there, in plain sight. Who would even consider signing up for a conference, paying to attend, and traveling to get there, without reading every damned word on the registration form including the hyperlinks? Only a fool, surely. Not the sort of fool one wants attending such an event.

There are two sure-fire ways to keep a secret. One is to keep your mouth shut about it. The other is to so completely alienate anybody who has any interest in it that they give up on trying to listen to you.

I’ll sleep on it, and let you know how this plays out. Might turn out to be a good day for antiquing on the way home, tomorrow….

Gradual Unveiling #5: Ten valid cultural risks of agile teamwork

All the following are worse, in the Academy. Keep that in mind. All worse.

The Curmudgeon Coder:

Being a self-proclaimed Agile Advocate I seem to find myself in discussions regard the bad points about agile. Books, articles, and talks on the subject of agile always paint the rosy happy story about using agile. I’m no fool, and I realize that things aren’t quite as happy as some people make it out to be. No one said that agile was a silver bullet. The reason that I’m an advocate for it is because I believe it is simply a better way to write software. Let’s get down to the meat of things. What is the “Dark” side of agile?

(Via O’Reilly ONlamp.)

Mourn the undone

Laudator Temporis Acti passes along some sound advice:

There is no remorse so deep, as that which is unavailing…
