Context-Aware Text Recognition?
I've been playing with Google's Cloud Vision API. It is OCR (Optical Character Recognition) - but in THE CLOUD and uses MACHINE LEARNING!
When it works, it is indistinguishable from magic. When it fails, it reveals a very limited understanding of human text. Let's take a look at this quick example - a piece of evidence from Leveson Inquiry
data:image/s3,"s3://crabby-images/3c785/3c78524d7a45ae99008f4998df62cb31f510998c" alt="A scanned document, the text is askew. Next to it is a computer-generated version of the text. A passage is highlighted."
Considering that the document is a digital scan of a fax of a print out, it low resolution, blurry, and skewed - it is nothing short of incredible that it has recovered so much text. But look at the passage I've highlighted.
Secondly, the Inquiry is aware that on 15 July ! resigned my position
The letter I
has been replaced with an exclamation point. Why is that?
Here's a close up of the text in question.
data:image/s3,"s3://crabby-images/35c1d/35c1d78735c493ee2e9173ce1759215413be02c8" alt="A block of text"
There are multiple ways to "ZOOM! ENHANCE!" the letter in question. Here's a basic resizing and a more complex resampling.
data:image/s3,"s3://crabby-images/610e0/610e0712d1356d1391717ceec0dded92d7f6b1d6" alt="The letter i has been scaled up. It looks a little like an exclamation point"
Does that I
look like a !
to you? The bottom of it looks a little blobby, I suppose. It also comes at the end of a line which does remove some context clues.
But...
- There is a space before it. Even in non-proportional fonts, this would be unusual.
- The next word is not capitalised.
- The letter I has been used liberally throughout the document, the exclamation mark isn't used at all.
- The paragraph is full of words like "me", "my", and "I".
This is just one example. I've seen Google Vision recognise an opening parenthesis (
as the the letter C
- despite recognising the closing )
just a few characters later.
I've seen an other homographic confusion - the word US
becoming U5
- confusing the letter s
with the number 5
. For some reason, Google likes to replace the regular comma with the ideographic comma "、" despite the rest of the text being in English.
What I'm getting at - why aren't there any text recognition services which use the context of the surrounding text to clarify ambiguous characters?