I’ve been playing with Google’s Cloud Vision API. It is OCR (Optical Character Recognition) – but in THE CLOUD and uses MACHINE LEARNING!
When it works, it is indistinguishable from magic. When it fails, it reveals a very limited understanding of human text. Let’s take a look at this quick example – a piece of evidence from Leveson Inquiry
Considering that the document is a digital scan of a fax of a print out, it low resolution, blurry, and skewed – it is nothing short of incredible that it has recovered so much text. But look at the passage I’ve highlighted.
Secondly, the Inquiry is aware that on 15 July ! resigned my position
I has been replaced with an exclamation point. Why is that?
Here’s a close up of the text in question.
There are multiple ways to “ZOOM! ENHANCE!” the letter in question. Here’s a basic resizing and a more complex resampling.
I look like a
! to you? The bottom of it looks a little blobby, I suppose. It also comes at the end of a line which does remove some context clues.
- There is a space before it. Even in non-proportional fonts, this would be unusual.
- The next word is not capitalised.
- The letter I has been used liberally throughout the document, the exclamation mark isn’t used at all.
- The paragraph is full of words like “me”, “my”, and “I”.
This is just one example. I’ve seen Google Vision recognise an opening parenthesis
( as the the letter
C – despite recognising the closing
) just a few characters later.
I’ve seen an other homographic confusion – the word
U5 – confusing the letter
s with the number
5. For some reason, Google likes to replace the regular comma with the ideographic comma “、” despite the rest of the text being in English.
What I’m getting at – why aren’t there any text recognition services which use the context of the surrounding text to clarify ambiguous characters?