Distributed Proofreaders: Design and maintenance of online communities

The Dis­trib­uted Proof­read­ers web­site offers a unique ser­vice to the grow­ing com­mu­nity of online schol­ars, archives, pro­fes­sional librar­i­ans, pub­lish­ers and bib­lio­philes who are dig­i­tiz­ing the world’s books.

Here is what DP is for, in “brief”: Opti­cal Char­ac­ter Recog­ni­tion (OCR) is error-​​prone for almost all inter­est­ing books: very old books, badly dam­aged books, scarce books, odd books. The mas­sive dig­i­ti­za­tion projects under­way at Google, the Inter­net Archive, and the oth­ers? They will man­age just fine with mod­ern books. A robotic plan­e­tary scan­ner can turn the pages of a 1999 edi­tion of The Idiot’s Guide to Something-​​or-​​other and pho­to­graph them at a speed of hun­dreds per hour, and in doing so will achieve extra­or­di­nary accuracy.

But han­dling a 1902 bound vol­ume of news­pa­pers, or an 1888 dime novel on foxed, frayed, acid-​​browned pulp paper, or a hurricane-​​ravaged unique copy of a diary? That’s a dif­fer­ent thing entirely. The strange shape and rar­ity and fragility of these works make them unsuit­able for cur­rent large-​​scale scan­ning efforts; and even when they have been scanned, the typo­graphic diver­sity, or maybe the dirt­i­ness of the pages (or the orig­i­nal print­ing), ren­ders the result­ing OCR files ter­ri­ble. I myself have scanned old crummy books, using the great­est care, and the result­ing OCR error rates have been over 10% on a character-​​per-​​character basis.
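
To make "error rate on a character-per-character basis" concrete, here is a minimal sketch of one common way to measure it: the edit distance between the OCR output and a hand-corrected reference, divided by the length of the reference. This is an assumption about how such a rate might be computed, not the procedure behind the figure quoted above; the function names and sample strings are invented for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance normalized by the length of the corrected reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Hypothetical sample: typical OCR confusions (I/l, h/b, m/rn).
reference = "It was the best of times"
ocr_text = "lt was tbe best of tirnes"
print(f"{character_error_rate(ocr_text, reference):.1%}")
```

Even a handful of confusions like these in one short line pushes the rate well past 10%, which is why a page that looks "mostly right" can still need substantial hand correction.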

So, robotics will help, yes. We can scan things, but in most cases we cannot rely on the resulting transcriptions. And let’s not even get into the quality of images, wood engravings, the importance of typographic cues in interpreting and using old texts… all the things that simple scanning and transcription (of the current sort) fail utterly to capture.

The DP sys­tem, set up more than five years ago, was an attempt to improve the poor qual­ity of books sub­mit­ted to Project Guten­berg and other online text archives. The approach is inno­v­a­tive and effec­tive: Books des­tined for release into Project Guten­berg are scanned on a page-​​by-​​page basis, and each indi­vid­ual page is OCRed to pro­duce an asso­ci­ated “raw” text file. Then a DP project is cre­ated on the DP servers, and the pages and scans uploaded by a vol­un­teer. When the project is launched on the site, any vol­un­teer can “check out” an unread page of the project, and right there in their web browser they’re shown the orig­i­nal scan image, and the asso­ci­ated text file. Apply­ing a sur­pris­ingly sim­ple suite of edit­ing guide­lines, the vol­un­teers make the needed minor adjust­ments to the text file, to bring it in accord with the orig­i­nal page image. When they’re done with that one page, the changes they’ve made are saved in the DP site’s databases.

Here’s the key: By the distributed action of hundreds of volunteers proofreading whatever they want, one page at a time, drawn from any of a hundred projects, we see a huge improvement in the accuracy of the text transcriptions over the purely machine-OCRed pages. At the moment, at least five people look at every page of every DP project before it’s considered complete, checking and correcting its accuracy and format. The results are, arguably, equivalent to or better than those achieved by professional typesetting houses.
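
The page-at-a-time workflow described above can be sketched as a small data model. This is only an illustration under stated assumptions: the class and method names are hypothetical, the rule that a volunteer never sees the same page twice is my own simplification, and nothing here reflects how the real DP site code is organized.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ROUNDS_REQUIRED = 5  # "at least five people look at every page"


@dataclass
class Page:
    scan_image: str                       # file name of the page scan
    text: str                             # raw OCR text, corrected over time
    readers: List[str] = field(default_factory=list)
    checked_out_by: Optional[str] = None

    @property
    def complete(self) -> bool:
        return len(self.readers) >= ROUNDS_REQUIRED


class Project:
    def __init__(self, pages: List[Page]) -> None:
        self.pages = list(pages)

    def check_out(self, volunteer: str) -> Optional[Page]:
        """Hand the volunteer the first page that still needs a reader."""
        for page in self.pages:
            if page.complete or page.checked_out_by is not None:
                continue
            if volunteer in page.readers:  # assumed rule: fresh eyes each pass
                continue
            page.checked_out_by = volunteer
            return page
        return None

    def check_in(self, page: Page, volunteer: str, corrected_text: str) -> None:
        """Save the volunteer's corrections and release the page."""
        if page.checked_out_by != volunteer:
            raise ValueError("page is not checked out by this volunteer")
        page.text = corrected_text
        page.readers.append(volunteer)
        page.checked_out_by = None

    @property
    def complete(self) -> bool:
        return all(page.complete for page in self.pages)


# Hypothetical one-page project passed through five volunteers.
demo = Project([Page("p001.png", "Tbe cat sat on tbe mat.")])
for name in ["ann", "bob", "cy", "dee", "eve"]:
    page = demo.check_out(name)
    demo.check_in(page, name, "The cat sat on the mat.")
print(demo.complete)
```

The point of the sketch is that no volunteer ever has to commit to a whole book: the unit of work is one page, and completion is just a count of independent readers.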

It’s a won­der­ful idea. I love it, and that’s why I’ve bought many thou­sands of rare books to con­tribute and preserve.

But DP is not just a work­flow, and not just a tech­nique. It’s a com­mu­nity. And in prac­tice, that com­mu­nity has some seri­ous prob­lems. Prob­lems that I expect will soon lead to either (1) a deep and dis­rup­tive restruc­tur­ing, (2) an explicit fork and cre­ation of a com­pet­ing com­mu­nity, or (3) the with­er­ing and even­tual demise of the community.

See, it is not enough to say that the proofreading service provided by DP is important. For the system to thrive, the service it provides to its users must be considered at least as important. It is because so little attention is being paid to the latter that I suggest DP is entering a downward spiral.

Let me be explicit: The administrators of the community do a wonderful job. But they are overburdened, not just by the workload but also by a strong sense of sunk cost and “tradition” associated with the current site structure, and by a vocal community of influential users whose opinions outweigh those of the silent majority. I suspect these factors not only limit the growth of the site, but are also undermining its currently fragmented community structure to the point where it is poised to collapse when the next crisis comes along.

It is not that any­thing is explic­itly wrong now with DP. Rather it appears that the adap­tive poten­tial of the com­mu­nity has been lost, and the result­ing struc­ture can­not help but break when jostled.

I’ll jus­tify all the dire­ness at length in sub­se­quent posts. But first, just to make it clear that I’m not merely say­ing (as online com­mu­nity mem­bers can) that “some­thing must be done”, let me out­line my pro­pos­als for changes:

  1. Increased open­ness: At present, users must reg­is­ter and log in to view any con­tent on the site. This opac­ity is need­less, and indeed dam­ages the community’s abil­ity to explain and pro­mote itself. Reg­is­tra­tion and login should only be required to mod­ify con­tent: proof­read­ing, post­ing to the forums or wiki, or con­tact­ing users.
  2. Redesign to enhance internal cohesion: DP users have several ways to build and participate in internal communities: a phpBB forum, a back-channel of direct messaging, a suite of Jabber IM sessions, and project notes that appear when pages are proofread. Despite the existence of these parallel channels, I suspect forum readers see only a few threads or the most vocal participants; direct messages can’t be shared between more than two users; IM sessions are not archived (even though many site-affecting decisions are made there); and project notes include no ephemeral information. The resulting disconnected cliques among the volunteers tend not to interact except in times of site-wide change. The majority of users tend to fall (and stay) in one of four separate subgroups: vocal forum users, isolated “nonparticipant” proofreaders, regular chatters, and a dilute but elite administrative clique. By consistent and judicious cross-linking, and the addition of a smart knowledge management system (especially a user-editable wiki to tie all the separate components together), the quality of fruitful communication should increase.
  3. Broadened and deepened administrative structure: The current administration of DP is a small number of volunteers acting as “Powers That Be”. These folks consistently say they have too much work to manage, and as a result they either spend their time acting silently behind the scenes, or micro-managing individual discussions across the entire community. Such tasks should be broken down into smaller pieces and delegated; many simpler tasks (such as maintenance of FAQs and policing fora) can be shifted to community members, via a public wiki or user-editable databases of clearances. More important, an explicit method for (a) choosing and promoting administrators, (b) providing ubiquitous context-appropriate contact information for appropriate administrators, and (c) continuously sharing administrative knowledge in public would greatly improve both administrators’ and volunteers’ experience.
  4. Present a coherent public face: The public content of the DP website is inward-facing, and does not suffice to explain or promote the site, or to recruit new members for it. Public awareness of DP is negligible, even among book and librarian bloggers and other crucial potential professional users who might be expected to already know about it. Project Gutenberg’s vocal support overshadows public awareness of DP, even though the former depends on the latter for its highest-quality content.
  5. Either drop or provide proof for “factory physics” micro-management: The DP project workflow is described as if it were a very complicated assembly line, but this metaphor is often taken too literally by influential volunteers. It has led to attempts to fix workflow bottlenecks through an increasingly draconian suite of ad hoc structural changes, and not through social approaches targeting the community of users (marketing, promotion systems, reward or reputation systems). The results are disappointing, either because the “factory physics” metaphor is incorrect when applied to an emergent community like DP, or because the underlying nonlinear mathematics have never been explicitly noted and explored. Lacking any believable comprehensive model of the inordinately complex DP workflow, and given the nonlinear responses shown to past adjustments, any further plans to make on-line adjustments should aim squarely at the social side. (That said, a detailed mathematical model could and should be built.)
  6. Reward vol­un­teer per­for­mance accord­ing to a multi-​​axis rep­u­ta­tion sys­tem: The last set of major changes that were made to the DP work­flow broke vol­un­teers’ tasks into sep­a­rate com­po­nent phases, each with dif­fer­ent skill pre­req­ui­sites. Rather than block users from access to “advanced” tasks, DP should pro­mote them as a reward for the qual­ity of the ser­vice they’ve already pro­vided. I know of at least three peo­ple who have stopped volunteering—even at the open, novice level—because they failed to pass the qual­i­fi­ca­tion quizzes for higher-​​level access.

    There has been a pre­lim­i­nary study of automat­ing a pro­mo­tion sys­tem by count­ing the diffs in proof­read texts before and after a sec­ond reader checks them, but such an auto­mated sys­tem neglects the wide diver­sity of DP project con­tent. Instead, users should be awarded points for both qual­ity and quan­tity, for accu­racy and insight, for help­ing one another and for cor­rect­ing their own mis­takes. Access to higher “ranked” (more skilled) jobs should fol­low on mutual appraisal by the com­mu­nity, not just on some abstracted pseudo-​​objective mea­sure like a quiz or a diffs count.
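
The contrast in item 6 between a single diffs count and a multi-axis reputation can be sketched roughly as follows. Everything here is an assumption made for illustration: the axis names, the thresholds, and the endorsement rule are hypothetical, and the diff counter is a plain word-level comparison using Python's standard `difflib`, not the method used in the preliminary study mentioned above.

```python
import difflib
from dataclasses import dataclass, field

AXES = ("quality", "quantity", "helpfulness", "self_correction")  # assumed axes


def count_word_diffs(first_pass: str, second_pass: str) -> int:
    """Count words the second reader changed, added or removed."""
    a, b = first_pass.split(), second_pass.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in matcher.get_opcodes()
               if tag != "equal")


@dataclass
class Reputation:
    points: dict = field(default_factory=lambda: {axis: 0 for axis in AXES})
    endorsements: set = field(default_factory=set)

    def award(self, axis: str, amount: int = 1) -> None:
        if axis not in self.points:
            raise ValueError(f"unknown axis: {axis}")
        self.points[axis] += amount

    def endorse(self, peer: str) -> None:
        """Record appraisal by another community member (counted once each)."""
        self.endorsements.add(peer)

    def qualifies(self, minimum_per_axis: int, endorsements_needed: int) -> bool:
        """Promotion needs breadth across axes plus peer appraisal."""
        return (all(v >= minimum_per_axis for v in self.points.values())
                and len(self.endorsements) >= endorsements_needed)


first = "The quick brovvn fox jumps ovcr the lazy dog"
second = "The quick brown fox jumps over the lazy dog"
print(count_word_diffs(first, second))  # two words were corrected
```

A raw diffs count like this treats a page of dense footnotes and a page of plain dialogue identically, which is the weakness noted above. The `Reputation` record keeps the count as, at most, one input among several, and gates promotion on community endorsement as well.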

That’s my list. In sub­se­quent entries, I’ll lay out what I mean, and also talk about:

  • Hear­ing (and using) the voice of the volunteers
  • Fos­ter­ing engage­ment to pro­mote self-​​governance and qual­ity control
  • Pro­duc­tive and unpro­duc­tive isolation
  • Par­al­lel tools, cross-​​linked: cre­at­ing a suf­fi­ciently ephemeral memory
  • What DP is for: Accu­racy is not enough
  • Equity, rep­re­sen­ta­tion and modes of speech
  • Discrete-​​event sim­u­la­tion as the only valid ratio­nale for the “Fac­tory physics” mentality
  • Inagility
  • The three loom­ing threats to DP: dig­i­ti­za­tion, con­ges­tion, and dilution

I love the DP com­mu­nity. Like many, I wish it were bet­ter. But unlike many, I’m sure that unless it gets bet­ter soon, it will inevitably disappear.

I see what’s hap­pened in the admin­is­tra­tion of DP—remember, Dis­trib­uted Proofreaders—as a silent fall back to the premises and assump­tions of cen­tral­ized con­trol. The com­mu­nity, I feel, can­not fail to ben­e­fit from broader adop­tion of dis­trib­uted think­ing, dis­trib­uted con­trol, dis­trib­uted social structure.

Posted by Tozier.
