Notional Slurry

Why I feel so strongly about redundant digitization

I’d like to see all scarcities of public domain works eliminated. All of them. As soon as possible. But that’s not, strictly speaking, why I care so much that I’m scanning my own personal copies of books all the time.

Yes, public domain works are a public good. That’s the law. Ideally, there should be no obstacles whatsoever if you want to see the text of any work published in the US before 1923. In an efficient economy, nobody would be able to claim “re-copyright” on a book whether it was a facsimile or OCR text, no scanmonkey would be able to lock a work behind a firewall, no University would block your access to it just because you’re not a matriculated student. There’s no ceasing or desisting, when a work in the public domain finally becomes public property; it’s just there, and thus everywhere.

And at the same time, large-scale commercial digitization efforts do cost a lot of money. Google’s, JSTOR’s, all the rest. Which is why those vendors (and they are vendors, nothing more) are perfectly justified in charging whatever access fees they can get, and sticking anything they’ve digitized themselves behind a license-protected firewall if they want. You need to be a member of the paying community to be granted access, and I don’t mind.

Really. As far as I’m concerned, that’s just fine. Consider it an act of charity when you pay: acknowledgement of their energetic early adoption of new technology, when you give them your nickel.

They’re sure not going to have it for long.

Because, of course, any idiot with a book and a scanner can digitize whatever they want. You don’t have to be a big, corporate, grant-sponsored idiot; even a fumblefingers like me can do it. You don’t need a robot. You don’t need to be a library, you just need to be able to go into one.

That’s the point of Project Gutenberg, Distributed Proofreaders, and the multitude of crowdsourced digitization efforts out there in the world: People digitize things. They give them away. They release them. For fun. To have them. Because they should be digitized. And often as not simply because they can be digitized.

I should know. I’ve scanned some godawful crap books through the years.

So the reason I feel so strongly about redundant digitization is not some Internet Hippy trope of “information [in the form of old crappy books] wants to be free”. I want there to be multiple copies of books, and I want different people to scan and photograph and microfilm (if they must) those books. I want multiple copies, different editions, different printings, different marginalia, different noisome booger stains and pressed flowers in the pages. I want somebody to just scan the HG Wells stories from Pall Mall Gazette, and I want somebody else to scan a bound volume where the covers and advertisements are gone, and I want somebody else to scan the ads to make rubber stamps from the wood engravings, some people scanning at 150 dpi (which everybody knows is good enough to OCR) and some at 600 dpi in color because they’re anal retentive. I want booksellers to digitize them, and publishers, and aficionados, and libraries, and corporations.

All on their own dime. In any order whatsoever, though if you’re asking, random order would be my slight preference.

Because somewhere in all that roil, I expect people to notice that the version of a story published in the magazine was different from the one in a bound HG Wells collected works. And I want somebody to notice that the next story after Wells’s, the one written by that no-name hack with no Google pagerank at all, is just as interesting and good, even though it’s not mentioned in the canon. I want somebody to use HG Wells as spam, and I want somebody to publish new versions of Wells with the spelling Americanized, and I want somebody to start making fake page scans of books that were never actually written. [I'm confident that the first case of book-digitization fraud has already happened, and that nobody will ever catch the cunning devil who did it.] I want Wells to be blogging right there alongside Barbellion.

And also for every one of the million other authors, for every one of the billion public-domain books and newspapers and magazines and journals, broadsheets and newsletters, correspondence and transcripts. Public domain, “orphaned works”, maybe even all the newest stuff.

But… but… won’t print die?

No, not print. Print is crucial. It’s our record and our archive. We will always need print, and we will come, just watch, to rely on it more as time goes on.

I want our understanding of print to die. Our mythology. The authority of texts and citations, the abusive misapprehension of what constitutes scholarship and knowledge in our global culture. The notion of fact, of “it’s true because it’s in a book” and “I don’t have to talk to you and explain what I mean because I cited the paper in my bibliography.” Lazy people talk about books they’ve never read, cite articles in journals they’ve never heard of, as signals of their status and erudition. Fanatics cite ancient public-domain works that have seen many editions, but fail to understand the nature of mutable words and ideas. Scholars refute conversation, of all things, when collaborative conversation is what scholarship should be. Teachers emphasize memorization of texts, but never point out that the texts themselves must be questioned in turn.

That has to die.

And it will. In a bloody mess, if we’re not careful.

Seems like a trivial victory, but I actually feel good whenever I see a missing page in a Google scan of a book, or bad OCR, or a reprinter’s copyright statement slapped on a book from 1870—because it motivates me to make my own copy of the same book. To “waste my time” and make a complete, inarguably different version of the “same” electronic text. I feel a glimmer of hope whenever a typo slips into Project Gutenberg, or somebody complains that they can’t tell which is the right version of a book.

Because that’s the way print has been all along. We’ve just lost the way. Somehow the myths of The Book, of The Editor, the Archive, and even the Authoritative Word, they’ve eaten our ability to hold flexible and contingent opinions. So few of us wonder which book we have in our hands; which edition, which version, which printing, which copy? If somebody in another decade or another country picks up a different printed copy of “that book”, what words will they be reading?

Not the same ones, I’ll bet you.

Because printers, and editors, and authors themselves, they change the words on the page from manuscript to draft to prepublication review copy to hardcover to paperback, magazine extract… and digitized electronic version. A blog entry (like this one) edited six, ten, fifty times even after publication. Which is “right”? Should a preprint be suppressed by an academic publisher because it dilutes the authority of the official version? Premise: yes; conclusion: no.

The Final Word needs to die. I’ll go out on a limb and say it’s what’s killing American society. And, even though higher education doesn’t mean a damned thing when it comes to American society, it’s also what’s undermining our higher education system, where tenure depends on citation rank and Erdős numbers, and where pathological specialization sets conversation up as the opposite of scholarship.

It doesn’t serve anybody’s ends if digitization is done intentionally poorly, or shoddily. But it must surely help improve not just performance everywhere but also suspicion of texts if we all become aware of how rough the worst examples are. So I’d love to see embarrassing digitization quality measurements published, for Google or for anybody else who’s scanning Our Heritage of The Fucking Written Word (all kneel).

And when people hear how bad those electronic versions are, I want to see raving; reactionary outcries from academics and politicians worried that what people are reading is not what the publisher set down on the page. Backlash. Chaos. I want, above everything else, somebody to come right out and say it: How can you trust what you know, the authority you give to scholars and the learned, if we can’t even be sure that the words on the page are correctly captured, and will remain untouched by time?

And I’ll be pleased, because the implication is that scanned books are not “the real book”. And somebody, some cunning librarian somewhere I hope, they’ll point out that real books aren’t “the real book” either.

Someday soon the damned singularity will hit, and it won’t be Kurzweil’s brain-downloading nanobots we have to deal with. It’ll be a million Babels of broken authority.

And don’t think I’m going easy on hippy-dippy Internettist culture, either. Look at the archived Project Gutenberg editions of “great works”. Management there treats these piles of steaming crap as if they were “the electronic version of important books”, and draconian protections have been put in place to “preserve” them in the “archive” that is PG. Even when typographic errors are corrected in the texts, a new single canonical version replaces the old one—even though the original printed versions differ in points and even large-scale editorial structure. Project Gutenberg needs the same rug pulled out from under it: there can be no selective archive.

Someday people will start noticing that there are two, five, a dozen digitized versions of some worthwhile book. These will inevitably diverge. Camps will start to develop among scholars and the lay public, just as they are now with blogging vs. traditionally published research conversation, with Wikipedia vs. the editorial encyclopedia. And just as people are increasingly comfortable citing blogged texts, someday the norm of citing individual versions of texts may gain acceptance.

After that’s happened, then the conversation can start again. When people don’t assume they know what text you’re talking about merely on the basis of something as unreliably vague as the publisher or edition.

If we undertake a gradual introduction of parallel digital versions of works, starting now and ramping up all over the place, maybe the inevitable collapse of print’s authority can be safely spread out, and the damage ameliorated. Not the damage to the reputation of digitizers; the damage to our society as the way we talk to one another changes again.

So encourage skepticism of digital editions of works now, and be diligent in questioning the printed word, the authority of the journal, the newspaper, the blogger and the broadcast as well. Undermine all their authority now. Spread it out.

Just imagine what will happen if it all comes crashing down at once.

Because if there’s anything I’m sure of, it’s that it will come down. Just don’t know if it’ll be slow or fast. Transformation’s in the wind, if you can get past the pong of war.


And then he remembered: A link from Scienceblogs reminds me of an anecdote I meant to include. It’s suitable as a postscript.

Once when I was a Ph.D. student (the second time) I was working for a few months at labs in a large drug company’s research campus near Philadelphia. A tech I knew was walking down the hall with a sequencing gel film in his hand, scratching his head.

Me: “What’s up, D——? Problem with the preps?”

D: “Just can’t figure it out. I’ve sequenced this viral prep three times now, and I just can’t get it to match the published sequence.”

Me: “… Virus, D——. Virus. Alive. The kind of alive that has DNA. And you extracted the DNA from a sample from an actual, also alive animal, and now you can’t get it to match an old sequence. Dude.”

God, how I wanted to biff him.

And there we have it in a nutshell. Time for more of you to be walking down the hall scratching your heads as you look at books, at papers, at archived materials. Time for us all to be flummoxed because the written word is not exactly what we’re expecting it to be.

John Weise said,

April 6, 2008 @ 9:28 pm

Enjoyed the post. A tool for exploring differences would be a good time. Sometimes great complexity would be revealed, and other times nothing. And then there is the question of just how new is the new edition of Introduction to Chemistry after all?

World of Science News : Blog Archive : Print and Misprint [A Blog Around The Clock] said,

April 6, 2008 @ 11:11 pm

[...] Reading of the day: Why I feel so strongly about redundant digitization Read the comments on this [...]

Ken Muldrew said,

April 7, 2008 @ 12:20 pm

Word!!!

Digital said,

April 8, 2008 @ 10:58 am

I agree with you up to the point of blurring fact. Endless redundant copies of the same material bring more detail to our archives of knowledge.

It is true, and you raise a great point about the over-reliance on and unquestioning acceptance of books/documents/etc… however, the FACT that is contained in them must be preserved. Changing the meaning of a document for a laugh is damaging to the whole of knowledge.

Obviously there are many who don’t give a damn, but I’d never go so far as condoning their actions.

Edward Vielmetti said,

April 24, 2008 @ 1:15 am

Bill, I can just imagine you doing cataloging of all of these unique items for the Tozier Collection, and being *completely unable* to do copy-cataloging of MARC records. Yes, we have that – oh, but it’s slightly different – time to create a new unique record.

LibraryThing has a useful notion of a Work, which is some amalgam of all of the versions of a book, including the badly edited ones and the 4th edition that was recalled because of a horrific typo. And all of the translations into Volapük.

CStanford said,

May 12, 2008 @ 2:20 pm

This takes me back to the thought and discussions I had about this at library school. Have you read the South African archivist Verne Harris? If not, you should.

Thanks for re-kindling these ideas in my mind!

Tozier said,

May 13, 2008 @ 8:57 am

I haven’t read Harris, no. Suggest a particular work?
