ocr – Terence Eden’s Blog

Context-Aware Text Recognition?

@edent — Tue, 23 Jan 2018 12:15:44 +0000

I've been playing with Google's Cloud Vision API. It is OCR (Optical Character Recognition) - but in THE CLOUD and uses MACHINE LEARNING!

When it works, it is indistinguishable from magic. When it fails, it reveals a very limited understanding of human text. Let's take a look at this quick example - a piece of evidence from the Leveson Inquiry.

Considering that the document is a digital scan of a fax of a print out - low resolution, blurry, and skewed - it is nothing short of incredible that it has recovered so much text. But look at the passage I've highlighted.

Secondly, the Inquiry is aware that on 15 July ! resigned my position

The letter I has been replaced with an exclamation point. Why is that?

Here's a close up of the text in question.

There are multiple ways to "ZOOM! ENHANCE!" the letter in question. Here's a basic resizing and a more complex resampling.

Does that I look like a ! to you? The bottom of it looks a little blobby, I suppose. It also comes at the end of a line which does remove some context clues.

But...

There is a space before it. Even in non-proportional fonts, this would be unusual.
The next word is not capitalised.
The letter I has been used liberally throughout the document, the exclamation mark isn't used at all.
The paragraph is full of words like "me", "my", and "I".

This is just one example. I've seen Google Vision recognise an opening parenthesis ( as the letter C - despite recognising the closing ) just a few characters later.

I've seen another homographic confusion - the word US becoming U5 - confusing the letter S with the number 5. For some reason, Google likes to replace the regular comma with the ideographic comma "、" despite the rest of the text being in English.

What I'm getting at - why aren't there any text recognition services which use the context of the surrounding text to clarify ambiguous characters?
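As a rough illustration of what "using the context" could mean, here is a minimal Python sketch that post-processes OCR output with two of the clues listed above. The function name and both rules are assumptions made up for this example; it is not how Google Vision or any other service works, and a real system would more likely lean on a language model than on hand-written rules.

```python
import re

# A lone "!" with whitespace (or the start of the text) before it and a
# lowercase word after it is far more likely to be the pronoun "I" than
# sentence-ending punctuation.
LONE_BANG = re.compile(r"(?<!\S)!(?=\s+[a-z])")

# An ideographic comma directly after an ASCII letter or digit is
# suspicious in otherwise English text.
IDEOGRAPHIC_COMMA = re.compile(r"(?<=[A-Za-z0-9])、")


def apply_context_rules(text: str) -> str:
    """Apply two hand-written context rules to raw OCR output."""
    text = LONE_BANG.sub("I", text)
    text = IDEOGRAPHIC_COMMA.sub(",", text)
    return text


if __name__ == "__main__":
    print(apply_context_rules(
        "Secondly, the Inquiry is aware that on 15 July ! resigned my position"
    ))
```

A genuine exclamation such as "Stop! Now." is left alone, because the mark follows a letter and is followed by a capitalised word.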

Selecting Text In Images - Pure SVG, No JavaScript

@edent — Fri, 29 Aug 2014 11:05:59 +0000

Recently, I wanted to embed a photograph of a book page. I thought it would be nifty if the text from the page could be selected.

If you hover your mouse over this image, you should be able to select part of the text.

Ideally, it will look something like this...

It even works on Android (tried on Chrome, Opera, Firefox) and iOS 7.

So, how did I do it?

Originally, I was pointed to Project Naptha - it seems to do everything I want but is very JavaScript heavy and requires modern browser support.

I then turned to SVG - Scalable Vector Graphics.

The way I've done this is almost certainly wrong and I'd appreciate any advice about the proper way to render text in an SVG.

The first part is easy - displaying a PNG as the background to the SVG. In this case, I've taken the image and Base64 encoded it.

The X & Y co-ordinates are from the top left. I've manually added in the height and width of the image.
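A minimal sketch of that first step, written in Python so the SVG can be generated rather than typed by hand. The function name, the use of `xlink:href`, and the layout of the markup are assumptions for illustration, not necessarily the exact markup used on this page.

```python
import base64


def svg_with_embedded_png(png_bytes: bytes, width: int, height: int) -> str:
    """Return an SVG document that shows a PNG as its background image.

    The PNG is inlined as a Base64 data URI, and the image is placed at
    the top left (x=0, y=0) with an explicit width and height.
    """
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" '
        'xmlns:xlink="http://www.w3.org/1999/xlink" '
        f'width="{width}" height="{height}">\n'
        f'  <image x="0" y="0" width="{width}" height="{height}" '
        f'xlink:href="data:image/png;base64,{encoded}"/>\n'
        '</svg>'
    )
```

Calling it with the bytes of a PNG file (for example `open("page.png", "rb").read()`, where `page.png` is a hypothetical filename) and the image's pixel dimensions gives a standalone SVG string.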

Next, we add the text.

   <g …>
      <text x="…" y="…" textLength="…">
         For nearly three years, between 1960 and 1963, MI5 and GCHQ
      </text>
      <text x="…" y="…" textLength="…">
         read the French high grade cipher coming in and out of the French
      </text>
      ...

As you can see, I've grouped the text together in a <g> element. I've set the opacity to zero - so while the lines are on top of the image, they cannot be seen unless selected. I've also manually split the lines and placed them on the image. I've set a "textLength" so that they'll fit across the page and automatically adjust themselves if they're too long.
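The text layer described above can be sketched in Python like this. The function name, the evenly spaced lines, and putting `opacity="0"` on the group are all assumptions for illustration; on a real page each line would probably need its own hand-tuned position.

```python
from xml.sax.saxutils import escape


def invisible_text_layer(lines, x, first_y, line_height, text_length):
    """Build a <g> of transparent <text> elements, one per line of the page.

    Every line gets the same textLength, so the renderer stretches or
    squeezes it to span the same width as the printed line underneath.
    """
    parts = ['<g opacity="0">']
    for index, line in enumerate(lines):
        y = first_y + index * line_height
        parts.append(
            f'  <text x="{x}" y="{y}" textLength="{text_length}">'
            f'{escape(line)}</text>'
        )
    parts.append('</g>')
    return "\n".join(parts)


if __name__ == "__main__":
    print(invisible_text_layer(
        [
            "For nearly three years, between 1960 and 1963, MI5 and GCHQ",
            "read the French high grade cipher coming in and out of the French",
        ],
        x=40, first_y=60, line_height=22, text_length=520,
    ))
```

The resulting fragment goes inside the SVG, after the background image, so that the text sits on top of it. The coordinates in the usage example are made-up numbers.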

This is very imprecise and quite time consuming. To get a better idea of how accurate (or not) it is, here's the same image, with the opacity set to 0.5.

Close enough, but not brilliant.

Finally, I've had to reference the images via an iframe. Without doing that, I wasn't able to select the text. I'm not sure if that's a browser fault, or expected functionality.

If you can suggest a quicker and more accurate way of doing this - I'd love for you to leave a comment below.

Crowdsourcing Leveson

@edent — Fri, 11 May 2012 11:40:48 +0000

I've already blogged about the Leveson Inquiry's disturbing habit of releasing evidence as scanned in PDFs.

I had a suggestion from digital journalist Kevin Anderson

Google Docs has an annoying 2MB limit for uploaded PDFs. However, I've taken the first half of Rebekah Brooks' witness statement and run it through the OCR process.

This is how Google recognises the text in the document

Leveson Inquiry into the culture, practices and ethics of the press

1 I dlT| necessarily inhibited to some extent about what I can say in reiation to some of the issues that the Inquiry has raised with me. My background

3. ijoined News International in 1989. I began my career on the News of the Worlcfs coiour supplement, Sunday magazine, whiie simultaneousiy attending ajournalism course at the London College of Printing.

4. Since then i have been either a joumeiist or an executive on both The News of the World and The Sun. For afrnc-st a decade Iwas a nationai newspaper editor. In May 2000 I became the editor of The News of the Worid and in January 2003 I became the editor of The Sun.

5. In September 2009, I was appointed Chief Executive of News lnternationaf. My responsibilities embraced ail the newspapers and digital products of the 1.... -. -

That's based on this text:

Why Is This Important

The journalist Heather Brooke has been ranting for some time about the closed nature of the British Courts. It's close to impossible to get verbatim or accurate information about court cases. This means as citizens, journalists, or archivists, we can't accurately search documents. We need access to the original digital documents.

Poor OCR is also a huge problem. As above, OCR gives us a misleading impression that documents are searchable.

Should we wish to search, say, KRM-18 to see whether the MP Tom Watson is mentioned, a search for "Watson" turns up zero results. Yet he is mentioned.

The page shows: But the scanned text reads:

Had ~ debrief with 5f[ ~nd his team tm~.igl~t ttt 77~ betbre he [o~ t.o his constituency:
l-]~e is veo’ h.,qlopY~ith d~ ~va~, today" wellt mid ~s~cci~iiJ~’ ~,it[i tae ~bsoiutely’idiotie. del)&t~s led by Wtttson.urtd
Prescott.

So, it's totally impossible to rapidly search through these documents. It would be necessary to laboriously read each document manually.
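To make the failure concrete, here is a small Python sketch run against the scanned text quoted above. The exact search finds nothing. The fuzzy search, using `difflib` from the standard library, is only an illustrative workaround: the function names and the 0.75 threshold are arbitrary assumptions, and across thousands of pages it would throw up plenty of false positives, so it is no substitute for properly searchable documents.

```python
import re
from difflib import SequenceMatcher

# The OCR text layer exactly as quoted above.
OCR_TEXT = '''Had ~ debrief with 5f[ ~nd his team tm~.igl~t ttt 77~ betbre he [o~ t.o his constituency:
l-]~e is veo’ h.,qlopY~ith d~ ~va~, today" wellt mid ~s~cci~iiJ~’ ~,it[i tae ~bsoiutely’idiotie. del)&t~s led by Wtttson.urtd
Prescott.'''


def exact_hits(text: str, term: str) -> int:
    """Count case-insensitive exact occurrences of term in text."""
    return text.lower().count(term.lower())


def fuzzy_hits(text: str, term: str, threshold: float = 0.75) -> list[str]:
    """Return runs of letters whose similarity to term meets the threshold."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [
        token for token in tokens
        if SequenceMatcher(None, term.lower(), token.lower()).ratio() >= threshold
    ]


if __name__ == "__main__":
    print(exact_hits(OCR_TEXT, "Watson"))
    print(fuzzy_hits(OCR_TEXT, "Watson"))
```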

How To Accomplish This

There are two ways to get this done - in the case of the Leveson Inquiry.

Petition the Inquiry to release the original documents.
Crowdsource the OCR, taking the Google OCR as a starting point and "Wikifying" it to let anyone correct the text. A bit like Distributed Proofreaders.

I will, of course, send an email to the Leveson Inquiry - but would people be interested in being part of a crowdsourcing effort to open up these documents?

Leveson - Death By A Thousand (Paper) Cuts

@edent — Wed, 25 Apr 2012 11:06:38 +0000

I've been listening to the Leveson Inquiry. A large part of the exchanges seem to go like this:

Jay: Turning to page 51.
Witness: Which bundle?
Jay: 1606.
Witness: 1660?
Leveson: No, the page after.
Jay: Paragraph 7.
Witness: I don't have a paragraph 7.
Jay: Ah, I have an earlier print out.
Leveson: You'll find it in tab 15.
Witness: Is this Volume 2?

And so on, ad nauseam.

Surely there's no reason to have so much paper wastefully printed and then discarded? Why not a single reference electronic document which can be supplied to each participant? Allowing them to increase the font size, annotate, cross reference, and search?

Search

Ah, search. Searching text is something computers are really good at. Within a fraction of a second, even a modest computer can extract every sentence which contains the word "Clegg" from hundreds of thousands of pages. Brilliant! Makes life really easy. Until humans come along and bugger about with it.
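For a sense of how little code that takes when the text is clean, here is a minimal Python sketch. The function name and the naive sentence splitting (on full stops, question marks, and exclamation marks) are assumptions for illustration only.

```python
import re


def sentences_containing(text: str, word: str) -> list[str]:
    """Return every sentence in text that contains word as a whole word."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [sentence for sentence in sentences if pattern.search(sentence)]


if __name__ == "__main__":
    sample = "Clegg spoke first. Nobody replied. Then Mr Clegg left!"
    for sentence in sentences_containing(sample, "Clegg"):
        print(sentence)
```

The sample text is invented. The point is that this only works if the characters in the file are the characters on the page.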

Let's take a look at the "smoking gun" emails which have been submitted from News International to Leveson. Specifically KRM18.

I have no idea how these emails were supplied to Leveson. I hope that they were submitted electronically - with all headers intact. What's supplied to the public, however, is this:

The emails have been...

Printed out.
Redacted with marker pen.
Scanned in as a PDF.
Then subject to an uncorrected OCR process.

Computers are really bad at recognising text. OCR (Optical Character Recognition) is a very error-prone process. Take a look at how the computer has translated the above document.

It's partly there. But enough of the characters are mangled, and enough words distorted, that searching through the text is near impossible.

I get that PDF is a reasonably popular file format for sharing documents. It preserves the document structure faithfully - but at the expense of readability, fluidity, and usefulness. But distributing images is the least useful way of distributing information to people who want to use it.

It's simply bad civic responsibility to do this. These emails, if they are important enough to be made public, should be made public in their original form. I understand that some redactions should be made - but that's about the limit.

How on Earth is anyone supposed to make sense of this extract?

We need to shake off the tyranny of printed paper. It is wasteful, non-useful, and - in this context - damaging to justice.

I leave you with an entirely random extract from the emails...