
Proposal: On Closing the Set

Over at the Distributed Proofreaders project, our goal is to accurately capture, present and distribute the content of printed works in the public domain. Unlike the land grab efforts of Google Print and the like, our texts are [supposedly] better, more accurate, truer to the original published form. We rely on OCR to read the scanned text, but unlike those others we also acknowledge that OCR is fallible, and that certain typographic conventions that convey subtle meanings—line breaks, and em-dashes—need to be preserved.

One of our number, Jon Niehof [a.k.a. jnik], has a great and useful idea whose time has clearly come. Some time back, he began collecting the lists of books (or, more generally, titled works) mentioned in the works we’ve scanned and proofread. The point for us working in the DP community is of course to complete the set, to create a moving front from which the next books for scanning can be chosen.

And at the same time to create a self-consistent record of literature’s explicit relationships.

And at the same time to create a dataset to record a novel “social” network, which is at the moment a subject of some interest.

With Jon’s permission, I’m going to suggest it’s time to take it out of DP and get the community most interested involved. In DP we are overwhelmed with work, and the community’s conversation centers around how to get the workflow slimmed down, not extended without horizon. [That said, please consider going and giving it a try. You will be helping a unique volunteer effort that captures all the good of the land grabbers, and can have a say in how it moves. I would consider signing up, and proofreading five pages, to be your way of acknowledging the fact that you've read this piece.]

What is needed, I think, is a special-purpose wiki, seeded with some starting point. Users could add works cited, mentioned, advertised, or otherwise appearing in others. And by works I mean not merely novels and technical monographs, but catalogs, reviews in magazines, and perhaps ultimately newspaper columns.

Consider the benefits that could arise. First, it would form a sort of table of contents or directory, since of course any title could eventually be linked to scans or Gutenberg editions of the actual work. Second, there’s that network, that record of what appears where. Third, it will give me an excuse to do something I’ve been putting off for some time (and which Jon once fretted would swamp his little internal DP effort): Scan and upload our recently-purchased copy of Allibone’s A critical dictionary of English literature…, which with its supplements mentions well over 130,000 works.
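To make the shape of such a dataset concrete, here is a minimal sketch in Python. It assumes the simplest possible model: each work is identified by its title alone, and a mention is a bare directed link from one title to another. The class name MentionGraph, its methods, and the placeholder titles are all hypothetical; nothing here reflects how Jon's list or any DP tool actually works.

```python
from collections import defaultdict


class MentionGraph:
    """Hypothetical model: an edge A -> B means work A mentions work B."""

    def __init__(self):
        self.mentions = defaultdict(set)  # title -> titles it mentions
        self.scanned = set()              # titles already scanned and proofread

    def add_scanned(self, title, mentioned=()):
        """Record a finished work and the titles found mentioned in it."""
        self.scanned.add(title)
        self.mentions[title].update(mentioned)

    def frontier(self):
        """Titles mentioned somewhere but not yet scanned: the 'moving front'."""
        all_mentioned = set().union(*self.mentions.values())
        return sorted(all_mentioned - self.scanned)

    def mentioned_in(self, title):
        """Scanned works in which the given title appears."""
        return sorted(
            source for source, targets in self.mentions.items()
            if title in targets
        )


# Placeholder titles, purely for illustration.
graph = MentionGraph()
graph.add_scanned("Work A", ["Work B", "Work C"])
graph.add_scanned("Work B", ["Work C"])

print(graph.frontier())              # candidates for the next scan
print(graph.mentioned_in("Work C"))  # where "Work C" appears
```

The set is "closed" when frontier() comes back empty. A real version would need more than titles (authors, editions, and some way to merge variant spellings), so treat this as an outline of the idea only.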

So easy to do. One wiki, slowly accreting.

But me, a mere student, a first-year graduate student in of all things engineering? Hah. They would drum me out for being distracted by such non-mathematical trivialities, for not being “serious” about my studies, for having my nose well away from the grindstone and gazing off towards left field. Those meanies.

And think how many brownie points something like this would bring to a professional, a real live scholar of literature, or of networks, or of practically anything not involving linear programming?

It is yours. Please. Go right ahead. I will send along our Allibone as soon as it is ready.

Update [20 Jan 2006]: Jon Niehof should get credit for his great idea of Closing the Set.

Barbara said,

January 19, 2006 @ 2:44 pm

It seems to me that there are two types of set here, though. Allibone’s is more of a catalogue — it doesn’t cite the books, it only records their existence (and sometimes gives third-party reviews of them). The other type of set is like the one in Clouston’s Flowers from a Persian Garden, where he references dozens of other books to support and expand on his topic.

Just because you have a paper indexed in CiteSeer doesn’t mean CiteSeer cites you, does it? So why would Allibone’s be a good starting point?

Tozier said,

January 19, 2006 @ 3:20 pm

True. In fact, there seem to be multiple ways that other works are mentioned, too. Imperative (“Go read this”), citation, in passing, indirectly. And various levels of certainty, as in, “Mrs. Oliphant has written extensively on…”, as opposed to, “In Mrs. Oliphant’s ‘The Count’s Daughters’….” &c
