I've already blogged about the Leveson Inquiry's disturbing habit of releasing evidence as scanned-in PDFs.
@edent Put the Leveson docs up on Google Docs. I'd be curious how their OCR could handle them. Then click 'make public'
— Mr Anderson (@kevglobal) May 11, 2012
Google Docs has an annoying 2MB limit for uploaded PDFs. However, I've taken the first half of Rebekah Brooks' witness statement and run it through the OCR process.
This is how Google recognises the text in the document:
Leveson Inquiry into the culture, practices and ethics of the press
1 I dlT| necessarily inhibited to some extent about what I can say in reiation to some of the issues that the Inquiry has raised with me.
3. ijoined News International in 1989. I began my career on the News of the Worlcfs coiour supplement, Sunday magazine, whiie simultaneousiy attending ajournalism course at the London College of Printing.
4. Since then i have been either a joumeiist or an executive on both The News of the World and The Sun. For afrnc-st a decade Iwas a nationai newspaper editor. In May 2000 I became the editor of The News of the Worid and in January 2003 I became the editor of The Sun.
5. In September 2009, I was appointed Chief Executive of News lnternationaf. My responsibilities embraced ail the newspapers and digital products of the 1.... -. -
That's based on this text:
Why Is This Important
The journalist Heather Brooke has been ranting for some time about the closed nature of the British Courts. It's close to impossible to get verbatim or accurate information about court cases. This means that, as citizens, journalists, or archivists, we can't accurately search documents. We need access to the original digital documents.
Poor OCR is also a huge problem. As above, OCR gives us a misleading impression that documents are searchable.
Say we want to search KRM-18 to see whether the MP Tom Watson is mentioned. A search for "Watson" turns up zero results, yet he is mentioned.
The page shows:
But the scanned text reads:
Had ~ debrief with 5f[ ~nd his team tm~.igl~t ttt 77~ betbre he [o~ t.o his constituency:
l-]~e is veo’ h.,qlopY~ith d~ ~va~, today" wellt mid ~s~cci~iiJ~’ ~,it[i tae ~bsoiutely’idiotie. del)&t~s led by Wtttson.urtd
So, it's totally impossible to rapidly search through these documents. It would be necessary to laboriously read each document manually.
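To make the failure concrete, here is a minimal Python sketch run against the last fragment of the OCR output quoted above. It shows the exact search missing, and a fuzzy match from the standard library's difflib happening to catch "Wtttson". The function name fuzzy_find and the 0.7 cutoff are my own assumptions for illustration. This is a stopgap at best: it only helps when the OCR is nearly right, and it will throw up false matches on a large document.

```python
import difflib
import re

# Tail of the OCR output for the KRM-18 page, as quoted above.
ocr_line = "del)&t~s led by Wtttson.urtd"


def fuzzy_find(term, text, cutoff=0.7):
    """Return alphabetic tokens in `text` that roughly resemble `term`.

    `cutoff` is an assumed similarity threshold (0 to 1), not a tuned value.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return difflib.get_close_matches(term, tokens, n=5, cutoff=cutoff)


# An exact search finds nothing, which is the problem described above.
print("Watson" in ocr_line)

# A fuzzy search picks up the mangled spelling.
print(fuzzy_find("Watson", ocr_line))
```

Lowering the cutoff catches worse OCR but also drags in unrelated words, so this is no substitute for having the original digital documents.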
How To Accomplish This
In the case of the Leveson Inquiry, there are two ways to get this done.
- Petition the Inquiry to release the original documents.
- Crowdsource the OCR: take the Google OCR as a starting point and "Wikify" it to let anyone correct the text, a bit like Distributed Proofreaders.
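For the crowdsourcing route, it would help to see what each volunteer changed. The rough Python sketch below compares a line of OCR output with a corrected version, word by word. The OCR line comes from the Google output quoted earlier. The corrected line is my own guess at what a volunteer would type, not text taken from the original document, and the function name changed_words is made up for this example.

```python
import difflib


def changed_words(ocr, corrected):
    """List (ocr_words, corrected_words) pairs where the two texts differ."""
    ocr_words, fixed_words = ocr.split(), corrected.split()
    matcher = difflib.SequenceMatcher(None, ocr_words, fixed_words)
    return [
        (" ".join(ocr_words[i1:i2]), " ".join(fixed_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# From the Google OCR output quoted earlier.
ocr = "3. ijoined News International in 1989."
# Hypothetical volunteer correction (an assumption, not the source text).
corrected = "3. I joined News International in 1989."

for before, after in changed_words(ocr, corrected):
    print(repr(before), "->", repr(after))
```

A list like this could be shown to a second volunteer for review, much as a wiki shows the diff for each edit.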
I will, of course, send an email to the Leveson Inquiry - but would people be interested in being part of a crowdsourcing effort to open up these documents?