The Distributed Proofreaders website offers a unique service to the growing community of online scholars, archives, professional librarians, publishers and bibliophiles who are digitizing the world’s books.
Here is what DP is for, in “brief”: Optical Character Recognition (OCR) is error-prone for almost all interesting books: very old books, badly damaged books, scarce books, odd books. The massive digitization projects underway at Google, the Internet Archive, and the others? They will manage just fine with modern books. A robotic planetary scanner can turn the pages of a 1999 edition of The Idiot’s Guide to Something-or-other and photograph them at a speed of hundreds per hour, and in doing so will achieve extraordinary accuracy.
But handling a 1902 bound volume of newspapers, or an 1888 dime novel on foxed, frayed, acid-browned pulp paper, or a hurricane-ravaged unique copy of a diary? That’s a different thing entirely. The strange shape and rarity and fragility of these works make them unsuitable for current large-scale scanning efforts; and even when they have been scanned, the typographic diversity, or maybe the dirtiness of the pages (or the original printing), renders the resulting OCR files terrible. I myself have scanned old crummy books, using the greatest care, and the resulting OCR error rates have been over 10% on a character-per-character basis.
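To make "error rate on a character-per-character basis" concrete, here is a minimal sketch of one common way to measure it: the edit distance between the OCR output and a hand-corrected reference, divided by the length of the reference. The function names and the sample strings are illustrative assumptions, not the tool or the text behind the figure quoted above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance from the OCR text to the reference, per reference character."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(ocr_text, reference) / len(reference)


# Hypothetical sample: three typical OCR confusions in a 19-character line.
reference = "The quick brown fox"
ocr_text = "Tbe qu1ck hrown fox"
print(f"{character_error_rate(ocr_text, reference):.1%}")
```

At a rate above 10%, a typical line of sixty or so characters carries half a dozen errors, which is why such output cannot be used without human correction.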
So, robotics will help, yes. We can scan things, but in most cases we cannot rely on the resulting transcriptions. And let’s not even get into the quality of images, wood engravings, the importance of typographic cues in interpreting and using old texts… all the things that simple scanning and transcription (of the current sort) fail utterly to capture.
The DP system, set up more than five years ago, was an attempt to improve the poor quality of books submitted to Project Gutenberg and other online text archives. The approach is innovative and effective: Books destined for release into Project Gutenberg are scanned on a page-by-page basis, and each individual page is OCRed to produce an associated “raw” text file. Then a DP project is created on the DP servers, and a volunteer uploads the page scans and their text files. When the project is launched on the site, any volunteer can “check out” an unread page of the project, and right there in their web browser they’re shown the original scan image and the associated text file. Applying a surprisingly simple suite of editing guidelines, the volunteers make the needed minor adjustments to the text file, to bring it into accord with the original page image. When they’re done with that one page, the changes they’ve made are saved in the DP site’s databases.
Here’s the key: By the distributed action of hundreds of volunteers proofreading whatever they want, one page at a time drawn from any of a hundred projects, we see a huge improvement in the accuracy of the text transcriptions over the purely machine-OCRed pages. At the moment, at least five people look at every page of every DP project before it’s considered completed, correcting its text and formatting. The results are, arguably, equivalent to or better than those achieved by professional typesetting houses.
It’s a wonderful idea. I love it, and that’s why I’ve bought many thousands of rare books to contribute and preserve.
But DP is not just a workflow, and not just a technique. It’s a community. And in practice, that community has some serious problems. Problems that I expect will soon lead to either (1) a deep and disruptive restructuring, (2) an explicit fork and creation of a competing community, or (3) the withering and eventual demise of the community.
See, it is not enough to say that the proofreading service provided by DP is important. For the system to thrive, the service it provides to its own users must be considered at least as important. It is because so little attention is being paid to that second service that I suggest DP is entering a downward spiral.
Let me be explicit: The administrators of the community do a wonderful job. But they are overburdened, not just by the workload but also by a strong sense of sunk cost and “tradition” associated with the current site structure. And by a vocal community of influential users whose opinions outweigh those of the silent majority. I suspect these factors not only limit the growth of the site, but are undermining its currently fragmented community structure to the point where it is poised to collapse when the next crisis comes along.
It is not that anything is explicitly wrong now with DP. Rather it appears that the adaptive potential of the community has been lost, and the resulting structure cannot help but break when jostled.
I’ll justify all the direness at length in subsequent posts. But first, just to make it clear that I’m not merely saying (as online community members can) that “something must be done”, let me outline my proposals for changes:
- Increased openness: At present, users must register and log in to view any content on the site. This opacity is needless, and indeed damages the community’s ability to explain and promote itself. Registration and login should only be required to modify content: proofreading, posting to the forums or wiki, or contacting users.
- Redesign to enhance internal cohesion: DP users have several ways to build and participate in internal communities: a phpBB forum, a back-channel of direct messaging, a suite of Jabber IM sessions, and project notes that appear when pages are proofread. Despite the existence of these parallel channels, I suspect forum readers see only a few threads or the most vocal participants; direct messages can’t be shared between more than two users; IM sessions are not archived (even though many site-affecting decisions are made there); and project notes include no ephemeral information. The resulting disconnected cliques among the volunteers tend not to interact except in times of site-wide change. Most users tend to fall (and stay) in one of four separate subgroups: vocal forum users, isolated “nonparticipant” proofreaders, regular chatters, or a dilute but elite administrative clique. With consistent and judicious cross-linking and the addition of a smart knowledge management system (especially a user-editable wiki to tie all the separate components together), the quality of communication should increase.
- Broadened and deepened administrative structure: The current administration of DP is a small number of volunteers acting as “Powers That Be”. These folks consistently say they have too much work to manage, and as a result they either spend their time acting silently behind the scenes, or micro-managing individual discussions across the entire community. Such tasks should be broken down into smaller pieces and delegated; many simpler tasks (such as maintenance of FAQs and policing fora) can be shifted to community members, via a public wiki or user-editable databases of clearances. More important, an explicit method for (a) choosing and promoting administrators, (b) providing ubiquitous context-appropriate contact information for appropriate administrators, and (c) continuous public sharing of administrative knowledge would greatly improve both administrators’ and volunteers’ experience.
- Present a coherent public face: The public content of the DP website is inward-facing, and does not suffice to explain, promote or recruit new members for the site. Public awareness of DP is negligible, even among book and librarian bloggers and other crucial potential professional users who might be expected to already know about it. Project Gutenberg’s vocal support overshadows public awareness of DP, even though the former depends on the latter for its highest-quality content.
- Either drop or provide proof for “factory physics” micro-management: The DP project workflow is described as if it were a very complicated assembly line, but this metaphor is often taken too literally by influential volunteers. It has led to attempts to relieve workflow bottlenecks through an increasingly draconian suite of ad hoc structural changes, and not through social approaches targeting the community of users (marketing, promotion systems, reward or reputation systems). The results are disappointing, either because the “factory physics” metaphor is incorrect when applied to an emergent community like DP, or because the underlying nonlinear mathematics have never been explicitly noted and explored. Lacking any believable comprehensive model of the inordinately complex DP workflow, and given the nonlinear responses shown to past adjustments, any further plans to make on-line adjustments should aim squarely at the social side. (That said, a detailed mathematical model could and should be built.)
- Reward volunteer performance according to a multi-axis reputation system: The last set of major changes that were made to the DP workflow broke volunteers’ tasks into separate component phases, each with different skill prerequisites. Rather than block users from access to “advanced” tasks, DP should promote them as a reward for the quality of the service they’ve already provided. I know of at least three people who have stopped volunteering—even at the open, novice level—because they failed to pass the qualification quizzes for higher-level access.
There has been a preliminary study of automating a promotion system by counting the diffs in proofread texts before and after a second reader checks them, but such an automated system neglects the wide diversity of DP project content. Instead, users should be awarded points for both quality and quantity, for accuracy and insight, for helping one another and for correcting their own mistakes. Access to higher “ranked” (more skilled) jobs should follow on mutual appraisal by the community, not just on some abstracted pseudo-objective measure like a quiz or a diffs count.
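For readers who have not seen it, the diff-count idea mentioned above can be sketched in a few lines. This is a minimal illustration using Python's standard `difflib` module, not the code from the preliminary study; the function name `count_changes` and the sample sentences are hypothetical.

```python
import difflib


def count_changes(before: str, after: str) -> int:
    """Count characters inserted, deleted, or replaced between the text
    one proofreader saved and the text the next reader saved."""
    matcher = difflib.SequenceMatcher(None, before, after, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Charge each changed span by its longer side, so that
            # replacing "rn" with "m" counts as two characters.
            changed += max(i2 - i1, j2 - j1)
    return changed


# Hypothetical page text: what the first reader left, and what the second fixed.
first_pass = "It was the hest of times, it was the worst of tirnes."
second_pass = "It was the best of times, it was the worst of times."
print(count_changes(first_pass, second_pass))
```

The number says only how much the second reader changed. It cannot tell a careless first reader from a hard page, which is the weakness of using it alone as a promotion criterion.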
That’s my list. In subsequent entries, I’ll lay out what I mean, and also talk about:
- Hearing (and using) the voice of the volunteers
- Fostering engagement to promote self-governance and quality control
- Productive and unproductive isolation
- Parallel tools, cross-linked: creating a sufficiently ephemeral memory
- What DP is for: Accuracy is not enough
- Equity, representation and modes of speech
- Discrete-event simulation as the only valid rationale for the “Factory physics” mentality
- Inagility
- The three looming threats to DP: digitization, congestion, and dilution
I love the DP community. Like many, I wish it were better. But unlike many, I’m sure that unless it gets better soon, it will inevitably disappear.
I see what’s happened in the administration of DP—remember, Distributed Proofreaders—as a silent fall back to the premises and assumptions of centralized control. The community, I feel, cannot fail to benefit from broader adoption of distributed thinking, distributed control, distributed social structure.
With volunteers like this, DP will endure and flourish.