Why I feel so strongly about redundant digitization

I’d like to see all scarci­ties of pub­lic domain works elim­i­nated. All of them. As soon as pos­si­ble. But that’s not, strictly speak­ing, why I care so much that I’m scan­ning my own per­sonal copies of books all the time.

Yes, pub­lic domain works are a pub­lic good. That’s the law. Ide­ally, there should be no obsta­cles what­so­ever if you want to see the text of any work pub­lished in the US before 1923. In an effi­cient econ­omy, nobody would be able to claim “re-​​copyright” on a book whether it was a fac­sim­ile or OCR text, no scan­mon­key would be able to lock a work behind a fire­wall, no Uni­ver­sity would block your access to it just because you’re not a matric­u­lated stu­dent. There’s no ceas­ing or desist­ing, when a work in the pub­lic domain finally becomes pub­lic prop­erty; it’s just there, and thus everywhere.

And at the same time, large-​​scale com­mer­cial dig­i­ti­za­tion efforts do cost a lot of money. Google’s, JSTOR’s, all the rest. Which is why those ven­dors (and they are ven­dors, noth­ing more) are per­fectly jus­ti­fied in charg­ing what­ever access fees they can get, and stick­ing any­thing they’ve dig­i­tized them­selves behind a license-​​protected fire­wall if they want. You need to be a mem­ber of the pay­ing com­mu­nity to be granted access, and I don’t mind.

Really. As far as I’m con­cerned, that’s just fine. Con­sider it an act of char­ity when you pay: acknowl­edge­ment of their ener­getic early adop­tion of new tech­nol­ogy, when you give them your nickel.

They’re sure not going to have it for long.

Because, of course, any idiot with a book and a scan­ner can dig­i­tize what­ever they want. You don’t have to be a big, cor­po­rate, grant-​​sponsored idiot; even a fum­blefin­gers like me can do it. You don’t need a robot. You don’t need to be a library, you just need to be able to go into one.

That’s the point of Project Guten­berg, Dis­trib­uted Proof­read­ers, and the mul­ti­tude of crowd­sourced dig­i­ti­za­tion efforts out there in the world: Peo­ple dig­i­tize things. They give them away. They release them. For fun. To have them. Because they should be dig­i­tized. And often as not sim­ply because they can be digitized.

I should know. I’ve scanned some godaw­ful crap books through the years.

So the rea­son I feel so strongly about redun­dant dig­i­ti­za­tion is not some Inter­net Hippy trope of “infor­ma­tion [in the form of old crappy books] wants to be free”. I want there to be mul­ti­ple copies of books, and I want dif­fer­ent peo­ple to scan and pho­to­graph and micro­film (if they must) those books. I want mul­ti­ple copies, dif­fer­ent edi­tions, dif­fer­ent print­ings, dif­fer­ent mar­gin­a­lia, dif­fer­ent noi­some booger stains and pressed flow­ers in the pages. I want some­body to just scan the HG Wells sto­ries from Pall Mall Gazette, and I want some­body else to scan a bound vol­ume where the cov­ers and adver­tise­ments are gone, and I want some­body else to scan the ads to make rub­ber stamps from the wood engrav­ings, some peo­ple scan­ning at 150 dpi (which every­body knows is good enough to OCR) and some at 600 dpi in color because they’re anal reten­tive. I want book­sellers to dig­i­tize them, and pub­lish­ers, and afi­ciona­dos, and libraries, and corporations.

All on their own dime. In any order whatsoever, though, if you’re asking, random order would be my slight preference.

Because some­where in all that roil, I expect peo­ple to notice that the ver­sion of a story pub­lished in the mag­a­zine was dif­fer­ent from the one in a bound HG Wells col­lected works. And I want some­body to notice that the next story after Wells’s, the one writ­ten by that no-​​name hack with no Google pager­ank at all, is just as inter­est­ing and good, even though it’s not men­tioned in the canon. I want some­body to use HG Wells as spam, and I want some­body to pub­lish new ver­sions of Wells with the spelling Amer­i­can­ized, and I want some­body to start mak­ing fake page scans of books that were never actu­ally writ­ten. [I’m con­fi­dent that the first case of book-​​digitization fraud has already hap­pened, and that nobody will ever catch the cun­ning devil who did it.] I want Wells to be blog­ging right there along­side Bar­bel­lion.

And also for every one of the mil­lion other authors, for every one of the bil­lion public-​​domain books and news­pa­pers and mag­a­zines and jour­nals, broad­sheets and newslet­ters, cor­re­spon­dence and tran­scripts. Pub­lic domain, “orphaned works”, maybe even all the newest stuff.

But… but… won’t print die?

No, not print. Print is crucial. It’s our record and our archive. We will always need print, and we will come (just watch) to rely on it more as time goes on.

I want our under­stand­ing of print to die. Our mythol­ogy. The author­ity of texts and cita­tions, the abu­sive mis­ap­pre­hen­sion of what con­sti­tutes schol­ar­ship and knowl­edge in our global cul­ture. The notion of fact, of “it’s true because it’s in a book” and “I don’t have to talk to you and explain what I mean because I cited the paper in my bib­li­og­ra­phy.” Lazy peo­ple talk about books they’ve never read, cite arti­cles in jour­nals they’ve never heard of, as sig­nals of their sta­tus and eru­di­tion. Fanat­ics cite ancient public-​​domain works that have seen many edi­tions, but fail to under­stand the nature of muta­ble words and ideas. Schol­ars refute con­ver­sa­tion, of all things, when col­lab­o­ra­tive con­ver­sa­tion is what schol­ar­ship should be. Teach­ers empha­size mem­o­riza­tion of texts, but never point out that the texts them­selves must be ques­tioned in turn.

That has to die.

And it will. In a bloody mess, if we’re not careful.

Seems like a triv­ial vic­tory, but I actu­ally feel good when­ever I see a miss­ing page in a Google scan of a book, or bad OCR, or a reprinter’s copy­right state­ment slapped on a book from 1870—because it moti­vates me to make my own copy of the same book. To “waste my time” and make a com­plete, inar­guably dif­fer­ent ver­sion of the “same” elec­tronic text. I feel a glim­mer of hope when­ever a typo slips into Project Guten­berg, or some­body com­plains that they can’t tell which is the right ver­sion of a book.

Because that’s the way print has been all along. We’ve just lost the way. Some­how the myths of The Book, of The Edi­tor, the Archive, and even the Author­i­ta­tive Word, they’ve eaten our abil­ity to hold flex­i­ble and con­tin­gent opin­ions. So few of us won­der which book we have in our hands; which edi­tion, which ver­sion, which print­ing, which copy? If some­body in another decade or another coun­try picks up a dif­fer­ent printed copy of “that book”, what words will they be reading?

Not the same ones, I’ll bet you.

Because print­ers, and edi­tors, and authors them­selves, they change the words on the page from man­u­script to draft to pre­pub­li­ca­tion review copy to hard­cover to paper­back, mag­a­zine extract… and dig­i­tized elec­tronic ver­sion. A blog entry (like this one) edited six, ten, fifty times even after pub­li­ca­tion. Which is “right”? Should a preprint be sup­pressed by an aca­d­e­mic pub­lisher because it dilutes the author­ity of the offi­cial ver­sion? Premise: yes; con­clu­sion: no.

The Final Word needs to die. I’ll go out on a limb and say it’s what’s killing American society. And, even though higher education doesn’t mean a damned thing when it comes to American society, it’s also what’s undermining our higher education system, where tenure depends on citation rank and Erdős numbers, and where pathological specialization sets conversation up as the opposite of scholarship.

It doesn’t serve anybody’s ends if dig­i­ti­za­tion is done inten­tion­ally poorly, or shod­dily. But it must surely help improve not just per­for­mance every­where but also sus­pi­cion of texts if we all become aware of how rough the worst exam­ples are. So I’d love to see embar­rass­ing dig­i­ti­za­tion qual­ity mea­sure­ments pub­lished, for Google or for any­body else who’s scan­ning Our Her­itage of The Fuck­ing Writ­ten Word (all kneel).

And when people hear how bad those electronic versions are, I want to see raving, reactionary outcries from academics and politicians worried that what people are reading is not what the publisher set down on the page. Backlash. Chaos. I want, above everything else, somebody to come right out and say it: How can you trust what you know, the authority you give to scholars and the learned, if we can’t even be sure that the words on the page are correctly captured, and will remain untouched by time?

And I’ll be pleased, because the impli­ca­tion is that scanned books are not “the real book”. And some­body, some cun­ning librar­ian some­where I hope, they’ll point out that real books aren’t “the real book” either.

Some­day soon the damned sin­gu­lar­ity will hit, and it won’t be Kurzweil’s brain-​​downloading nanobots we have to deal with. It’ll be a mil­lion Babels of bro­ken authority.

And don’t think I’m going easy on hippy-​​dippy Inter­net­tist cul­ture, either. Look at the archived Project Guten­berg edi­tions of “great works”. Man­age­ment there treats these piles of steam­ing crap as if they were “the elec­tronic ver­sion of impor­tant books”, and dra­con­ian pro­tec­tions have been put in place to “pre­serve” them in the “archive” that is PG. Even when typo­graphic errors are cor­rected in the texts, a new sin­gle canon­i­cal ver­sion replaces the old one—even though the orig­i­nal printed ver­sions dif­fer in points and even large-​​scale edi­to­r­ial struc­ture. Project Guten­berg needs the same rug pulled out from under it: there can be no selec­tive archive.

Some­day peo­ple will start notic­ing that there are two, five, a dozen dig­i­tized ver­sions of some worth­while book. These will inevitably diverge. Camps will start to develop among schol­ars and the lay pub­lic, just as they are now with blog­ging vs. tra­di­tion­ally pub­lished research con­ver­sa­tion, with Wikipedia vs. the edi­to­r­ial ency­clo­pe­dia. And just as peo­ple are increas­ingly com­fort­able cit­ing blogged texts, some­day the norm of cit­ing indi­vid­ual ver­sions of texts may gain acceptance.

After that’s hap­pened, then the con­ver­sa­tion can start again. When peo­ple don’t assume they know what text you’re talk­ing about merely on the basis of some­thing as unre­li­ably vague as the pub­lisher or edi­tion.

If we under­take a grad­ual intro­duc­tion of par­al­lel dig­i­tal ver­sions of works, start­ing now and ramp­ing up all over the place, maybe the inevitable col­lapse of print’s author­ity can be safely spread out, and the dam­age ame­lio­rated. Not the dam­age to the rep­u­ta­tion of dig­i­tiz­ers; the dam­age to our soci­ety as the way we talk to one another changes again.

So encour­age skep­ti­cism of dig­i­tal edi­tions of works now, and be dili­gent in ques­tion­ing the printed word, the author­ity of the jour­nal, the news­pa­per, the blog­ger and the broad­cast as well. Under­mine all their author­ity now. Spread it out.

Just imag­ine what will hap­pen if it all comes crash­ing down at once.

Because if there’s any­thing I’m sure of, it’s that it will come down. Just don’t know if it’ll be slow or fast. Transformation’s in the wind, if you can get past the pong of war.


And then he remem­bered: A link from Sci­ence­blogs reminds me of an anec­dote I meant to include. It’s suit­able as a postscript.

Once when I was a Ph.D. stu­dent (the sec­ond time) I was work­ing for a few months at labs in a large drug company’s research cam­pus near Philadel­phia. A tech I knew was walk­ing down the hall with a sequenc­ing gel film in his hand, scratch­ing his head.

Me: “What’s up, D——? Prob­lem with the preps?”

D: “Just can’t fig­ure it out. I’ve sequenced this viral prep three times now, and I just can’t get it to match the pub­lished sequence.”

Me: “… Virus, D——. Virus. Alive. The kind of alive that has DNA. And you extracted the DNA from a sam­ple from an actual, also alive ani­mal, and now you can’t get it to match an old sequence. Dude.”

God, how I wanted to biff him.

And there we have it in a nut­shell. Time for more of you to be walk­ing down the hall scratch­ing your heads as you look at books, at papers, at archived mate­ri­als. Time for us all to be flum­moxed because the writ­ten word is not exactly what we’re expect­ing it to be.

7 thoughts on “Why I feel so strongly about redundant digitization”

  1. Enjoyed the post. A tool for explor­ing dif­fer­ences would be a good time. Some­times great com­plex­ity would be revealed, and other times noth­ing. And then there is the ques­tion of just how new is the new edi­tion of Intro­duc­tion to Chem­istry after all?

  2. Pingback: World of Science News : Blog Archive : Print and Misprint [A Blog Around The Clock]

  3. I agree with you up to the point of blurring fact. Endless redundant copies of the same material bring more detail to our archives of knowledge.

    It is true, and you raise a great point about the over-reliance on and unquestioning acceptance of books/documents/etc.; however, the FACT that is contained in them must be preserved. Changing the meaning of a document for a laugh is damaging to the whole of knowledge.

    Obvi­ously there are many who don’t give a damn, but I’d never go so far as con­don­ing their actions.

  4. Bill, I can just imag­ine you doing cat­a­loging of all of these unique items for the Tozier Col­lec­tion, and being *com­pletely unable* to do copy-​​cataloging of MARC records. Yes, we have that — oh, but it’s slightly dif­fer­ent — time to cre­ate a new unique record.

    LibraryThing has a useful notion of a Work, which is some amalgam of all of the versions of a book, including the badly edited ones and the 4th edition that was recalled because of a horrific typo. And all of the translations into Volapük.

  5. This takes me back to the thoughts and discussions I had about this at library school. Have you read the South African archivist Verne Harris? If not, you should.

    Thanks for re-​​kindling these ideas in my mind!
