<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="https://shkspr.mobi/blog/wp-content/themes/edent-wordpress-theme/rss-style.xsl" type="text/xsl"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	     xmlns:dc="http://purl.org/dc/elements/1.1/"
	   xmlns:atom="http://www.w3.org/2005/Atom"
	     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	  xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>
<channel>
	<title>ocr &#8211; Terence Eden’s Blog</title>
	<atom:link href="https://shkspr.mobi/blog/tag/ocr/feed/" rel="self" type="application/rss+xml" />
	<link>https://shkspr.mobi/blog</link>
	<description>Regular nonsense about tech and its effects 🙃</description>
	<lastBuildDate>Tue, 29 Apr 2025 20:57:32 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://shkspr.mobi/blog/wp-content/uploads/2023/07/cropped-avatar-32x32.jpeg</url>
	<title>ocr &#8211; Terence Eden’s Blog</title>
	<link>https://shkspr.mobi/blog</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title><![CDATA[Context-Aware Text Recognition?]]></title>
		<link>https://shkspr.mobi/blog/2018/01/context-aware-text-recognition/</link>
					<comments>https://shkspr.mobi/blog/2018/01/context-aware-text-recognition/#respond</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Tue, 23 Jan 2018 12:15:44 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[leveson]]></category>
		<category><![CDATA[ocr]]></category>
		<guid isPermaLink="false">https://shkspr.mobi/blog/?p=29047</guid>

					<description><![CDATA[I&#039;ve been playing with Google&#039;s Cloud Vision API. It is OCR (Optical Character Recognition) - but in THE CLOUD and uses MACHINE LEARNING!  When it works, it is indistinguishable from magic.  When it fails, it reveals a very limited understanding of human text.  Let&#039;s take a look at this quick example - a piece of evidence from the Leveson Inquiry    Considering that the document is a digital scan of…]]></description>
										<content:encoded><![CDATA[<p>I've been playing with <a href="https://cloud.google.com/vision/docs/drag-and-drop">Google's Cloud Vision API</a>. It is OCR (Optical Character Recognition) - but in THE CLOUD and uses MACHINE LEARNING!</p>

<p>When it works, it is indistinguishable from magic.  When it fails, it reveals a very limited understanding of human text.  Let's take a look at this quick example - <a href="https://shkspr.mobi/blog/2012/05/crowdsourcing-leveson/">a piece of evidence from the Leveson Inquiry</a>.</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2018/01/Screen-Shot-2018-01-22-at-10.09.54.png" alt="A scanned document, the text is askew. Next to it is a computer-generated version of the text. A passage is highlighted." width="632" height="481" class="aligncenter size-full wp-image-29048">

<p>Considering that the document is a digital scan of a fax of a print out, and that it is low resolution, blurry, and skewed, it is nothing short of incredible that Google has recovered so much text.  But look at the passage I've highlighted.</p>

<blockquote><p>Secondly, the Inquiry is aware that on 15 July <strong>!</strong> resigned my position</p></blockquote>

<p>The letter <code>I</code> has been replaced with an exclamation point.  Why is that?</p>

<p>Here's a close up of the text in question.</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2018/01/text.png" alt="A block of text" width="375" height="94" class="aligncenter size-full wp-image-29050">

<p>There are multiple ways to "ZOOM! ENHANCE!" the letter in question.  Here's a basic resizing and a more complex resampling.</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2018/01/text-scaled-up.png" alt="The letter i has been scaled up. It looks a little like an exclamation point" width="760" height="203" class="aligncenter size-full wp-image-29049">

<p>Does that <code>I</code> look like a <code>!</code> to you?  The bottom of it looks a little blobby, I suppose.  It also comes at the end of a line which does remove some context clues.</p>

<p>But...</p>

<ul>
<li>There is a space before it. Even in non-proportional fonts, this would be unusual.</li>
<li>The next word is not capitalised.</li>
<li>The letter I has been used liberally throughout the document, the exclamation mark isn't used at all.</li>
<li>The paragraph is full of words like "me", "my", and "I".</li>
</ul>

<p>This is just one example. I've seen Google Vision recognise an opening parenthesis <code>(</code> as the letter <code>C</code> - despite recognising the closing <code>)</code> just a few characters later.</p>

<p>I've seen another homoglyph confusion - the word <code>US</code> becoming <code>U5</code> - confusing the letter <code>S</code> with the number <code>5</code>.  For some reason, Google also likes to replace the regular comma with the <a href="http://graphemica.com/%E3%80%81">ideographic comma "、"</a> despite the rest of the text being in English.</p>

<p>What I'm getting at - why aren't there any text recognition services which use the context of the surrounding text to clarify ambiguous characters?</p>
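<p>To make the idea concrete, here's a minimal sketch of the kind of post-processing I have in mind. It is an illustration only, not how Google Vision or any other service works. The function names, the <code>CONFUSABLES</code> table, and the vocabulary are all made up for this example.</p>

<pre><code class="language-python">import re

# Hypothetical table of look-alike characters; illustrative, not exhaustive.
CONFUSABLES = {"5": "S", "0": "O"}

def fix_lone_exclamation(text):
    # Cue from neighbours: a "!" with whitespace on both sides, followed by
    # a lowercase word, is more plausibly the pronoun "I" than punctuation.
    return re.sub(r"(\s)!(\s+[a-z])", r"\1I\2", text)

def fix_token(token, vocabulary):
    # Cue from vocabulary: if a token is unknown, swap look-alike characters
    # and keep the result only when it produces a known word.
    if token in vocabulary:
        return token
    candidate = "".join(CONFUSABLES.get(ch, ch) for ch in token)
    return candidate if candidate in vocabulary else token

def correct(text, vocabulary):
    # Apply the neighbour rule first, then check each space-separated token.
    text = fix_lone_exclamation(text)
    return " ".join(fix_token(t, vocabulary) for t in text.split(" "))
</code></pre>

<p>Numbers such as <code>15</code> are left alone because swapping their digits doesn't produce a known word. A real system would need a proper language model rather than a hand-written rule and a word list, and would have to cope with punctuation attached to tokens.</p>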
]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2018/01/context-aware-text-recognition/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[Selecting Text In Images - Pure SVG, No JavaScript]]></title>
		<link>https://shkspr.mobi/blog/2014/08/selecting-text-in-images-pure-svg-no-javascript/</link>
					<comments>https://shkspr.mobi/blog/2014/08/selecting-text-in-images-pure-svg-no-javascript/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Fri, 29 Aug 2014 11:05:59 +0000</pubDate>
				<category><![CDATA[/etc/]]></category>
		<category><![CDATA[images]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[svg]]></category>
		<guid isPermaLink="false">http://shkspr.mobi/blog/?p=10783</guid>

					<description><![CDATA[Recently, I wanted to embed a photograph of a book page.  I thought it would be nifty if the text from the page could be selected.  If you hover your mouse over this image, you should be able to select part of the text.     Ideally, it will look something like this...    It even works on Android (tried on Chrome, Opera, Firefox) and iOS 7.    So, how did I do it?  Originally, I was pointed to…]]></description>
										<content:encoded><![CDATA[<p>Recently, I wanted to embed a photograph of a book page.  I thought it would be nifty if the text from the page could be selected.</p>

<p>If you hover your mouse over this image, you should be able to select part of the text.</p>

<iframe src="https://shkspr.mobi/blog/wp-content/uploads/2014/08/SVG-Select-Text-Zero-Opacity.svg" width="566" height="170" scrolling="no"> </iframe>

<p>Ideally, it will look something like this...</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2014/08/Selected-Text.png" alt="Selected Text" width="469" height="125" class="aligncenter size-full wp-image-10784">

<p>It even works on Android (tried on Chrome, Opera, Firefox) and iOS 7.</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2014/08/Android-SVG-Selection-fs8.png" alt="Android SVG Selection-fs8" width="605" height="386" class="aligncenter size-full wp-image-10799">

<p>So, how did I do it?</p>

<p>Originally, I was pointed to <a href="http://projectnaptha.com/">Project Naptha</a> - it seems to do everything I want but is very JavaScript heavy and requires modern browser support.</p>

<p>I then turned to SVG - Scalable Vector Graphics.</p>

<p>The way I've done this is <em>almost certainly wrong</em> and I'd appreciate any advice about the proper way to render text in an SVG.</p>

<p>The first part is easy - displaying a PNG as the background to the SVG.  In this case, I've taken the image and Base64 encoded it.</p>

<pre><code class="language-svg">&lt;?xml version="1.0" encoding="UTF-8" standalone="no"?&gt;
&lt;svg
   xmlns="http://www.w3.org/2000/svg"
   version="1.1"
   xmlns:xlink="http://www.w3.org/1999/xlink"
   width="566"
   height="166"
&gt;
&lt;image xlink:href="data:image/png;base64,iVBORw0KGg...."
   x="0"
   y="0"
   width="566"
   height="166" /&gt;
</code></pre>

<p>The X &amp; Y co-ordinates are from the top left. I've manually added in the height and width of the image.</p>

<p>Next, we add the text.</p>

<pre><code class="language-svg">   &lt;g fill-opacity="0"&gt;
      &lt;text
         x="70"
         y="45"
         font-size="14"
         font-family="serif"
         textLength="415"
         lengthAdjust="spacingAndGlyphs"&gt;
         For nearly three years, between 1960 and 1963, MI5 and GCHQ
      &lt;/text&gt;
      &lt;text
         x="42"
         y="62"
         font-size="14"
         font-family="serif"
         textLength="440"
         lengthAdjust="spacingAndGlyphs"&gt;
         read the French high grade cipher coming in and out of the French
      &lt;/text&gt;
      ...
   &lt;/g&gt;
&lt;/svg&gt;
</code></pre>

<p>As you can see, I've grouped the text together in a &lt;g&gt; element.  I've set the group's fill-opacity to zero - so while the lines of text sit on top of the image, they cannot be seen unless selected.  I've also <strong>manually</strong> split the lines and placed them on the image.  I've set a "textLength" on each line so that it is stretched or squeezed to fit the width of the printed line, whatever font the browser uses.</p>

<p>This is <em>very</em> imprecise and quite time consuming.  To get a better idea of how accurate (or not) it is, here's the same image, with the opacity set to 0.5.</p>

<iframe src="https://shkspr.mobi/blog/wp-content/uploads/2014/08/SVG-Select-Text-Half-Opacity.svg" width="566" height="170" scrolling="no"> </iframe>

<p>Close enough, but not brilliant.</p>
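<p>One way to cut down on the manual work might be to generate the text elements from a list of positions rather than typing each one by hand. This is a rough sketch, not what I actually did. The function name <code>text_overlay</code> and the <code>(x, y, width, content)</code> tuples are assumptions for illustration - the positions could, for example, come from the bounding boxes reported by an OCR tool.</p>

<pre><code class="language-python">import xml.etree.ElementTree as ET

def text_overlay(lines, font_size=14):
    # lines: iterable of (x, y, width, content) tuples. This input format is
    # hypothetical; adapt it to wherever the positions come from.
    group = ET.Element("g", {"fill-opacity": "0"})
    for x, y, width, content in lines:
        node = ET.SubElement(group, "text", {
            "x": str(x),
            "y": str(y),
            "font-size": str(font_size),
            "font-family": "serif",
            "textLength": str(width),
            "lengthAdjust": "spacingAndGlyphs",
        })
        # ElementTree escapes any special characters in the text for us.
        node.text = content
    return ET.tostring(group, encoding="unicode")

overlay = text_overlay([
    (70, 45, 415, "For nearly three years, between 1960 and 1963, MI5 and GCHQ"),
    (42, 62, 440, "read the French high grade cipher coming in and out of the French"),
])
</code></pre>

<p>The resulting string is the group of text elements, ready to be pasted into the SVG after the image.  It doesn't solve the accuracy problem - the positions still have to come from somewhere.</p>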

<p>Finally, I've had to reference the images via an iframe.  Without doing that, I wasn't able to select the text. I'm not sure if that's a browser fault, or expected functionality.</p>

<p>If you can suggest a quicker and more accurate way of doing this - I'd <strong>love</strong> for you to leave a comment below.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2014/08/selecting-text-in-images-pure-svg-no-javascript/feed/</wfw:commentRss>
			<slash:comments>8</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[Crowdsourcing Leveson]]></title>
		<link>https://shkspr.mobi/blog/2012/05/crowdsourcing-leveson/</link>
					<comments>https://shkspr.mobi/blog/2012/05/crowdsourcing-leveson/#comments</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Fri, 11 May 2012 11:40:48 +0000</pubDate>
				<category><![CDATA[politics]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[justice]]></category>
		<category><![CDATA[leveson]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[text]]></category>
		<guid isPermaLink="false">http://shkspr.mobi/blog/?p=5702</guid>

					<description><![CDATA[I&#039;ve already blogged about the Leveson Inquiry&#039;s disturbing habit of releasing evidence as scanned in PDFs.  I had a suggestion from digital journalist Kevin Anderson  Terence Eden is on Mastodon@edentGah! The #leveson witness statements are photocopied &#38; scanned in levesoninquiry.org.uk/evidence/?witn…Disastrous for open justice - shkspr.mobi/blog/index.php…❤️ 0💬 0🔁 110:12 - Fri 11 May 2012Mr And…]]></description>
										<content:encoded><![CDATA[<p>I've already blogged about the <a href="https://shkspr.mobi/blog/2012/04/leveson-death-by-a-thousand-paper-cuts/">Leveson Inquiry's disturbing habit of releasing evidence as scanned in PDFs</a>.</p>

<p>I had a <a href="https://twitter.com/kevglobal/status/200898240965644289">suggestion from digital journalist Kevin Anderson</a></p>

<blockquote class="social-embed" id="social-embed-200898240965644289" lang="en" itemscope="" itemtype="https://schema.org/SocialMediaPosting"><blockquote class="social-embed" id="social-embed-200891119947620352" lang="en" itemscope="" itemtype="https://schema.org/SocialMediaPosting"><header class="social-embed-header" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><a href="https://twitter.com/edent" class="social-embed-user" itemprop="url"><img class="social-embed-avatar social-embed-avatar-circle" src="data:image/webp;base64,UklGRkgBAABXRUJQVlA4IDwBAACQCACdASowADAAPrVQn0ynJCKiJyto4BaJaQAIIsx4Au9dhDqVA1i1RoRTO7nbdyy03nM5FhvV62goUj37tuxqpfpPeTBZvrJ78w0qAAD+/hVyFHvYXIrMCjny0z7wqsB9/QE08xls/AQdXJFX0adG9lISsm6kV96J5FINBFXzHwfzMCr4N6r3z5/Aa/wfEoVGX3H976she3jyS8RqJv7Jw7bOxoTSPlu4gNbfXYZ9TnbdQ0MNnMObyaRQLIu556jIj03zfJrVgqRM8GPwRoWb1M9AfzFe6Mtg13uEIqrTHmiuBpH+bTVB5EEQ3uby0C//XOAPJOFv4QV8RZDPQd517Khyba8Jlr97j2kIBJD9K3mbOHSHiQDasj6Y3forATbIg4QZHxWnCeqqMkVYfUAivuL0L/68mMnagAAA" alt="" itemprop="image"><div class="social-embed-user-names"><p class="social-embed-user-names-name" itemprop="name">Terence Eden is on Mastodon</p>@edent</div></a><img class="social-embed-logo" alt="Twitter" src="data:image/svg+xml,%3Csvg%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%0Aaria-label%3D%22Twitter%22%20role%3D%22img%22%0AviewBox%3D%220%200%20512%20512%22%3E%3Cpath%0Ad%3D%22m0%200H512V512H0%22%0Afill%3D%22%23fff%22%2F%3E%3Cpath%20fill%3D%22%231d9bf0%22%20d%3D%22m458%20140q-23%2010-45%2012%2025-15%2034-43-24%2014-50%2019a79%2079%200%2000-135%2072q-101-7-163-83a80%2080%200%200024%20106q-17%200-36-10s-3%2062%2064%2079q-19%205-36%201s15%2053%2074%2055q-50%2040-117%2033a224%20224%200%2000346-200q23-16%2040-41%22%2F%3E%3C%2Fsvg%3E"></header><section class="social-embed-text" itemprop="articleBody">Gah! 
The <a href="https://twitter.com/hashtag/leveson">#leveson</a> witness statements are photocopied &amp; scanned in <a href="http://www.levesoninquiry.org.uk/evidence/?witness=rebekah-brooks">levesoninquiry.org.uk/evidence/?witn…</a><br>Disastrous for open justice - <a href="http://shkspr.mobi/blog/index.php/2012/04/leveson-death-by-a-thousand-paper-cuts/">shkspr.mobi/blog/index.php…</a></section><hr class="social-embed-hr"><footer class="social-embed-footer"><a href="https://twitter.com/edent/status/200891119947620352"><span aria-label="0 likes" class="social-embed-meta">❤️ 0</span><span aria-label="0 replies" class="social-embed-meta">💬 0</span><span aria-label="1 reposts" class="social-embed-meta">🔁 1</span><time datetime="2012-05-11T10:12:30.000Z" itemprop="datePublished">10:12 - Fri 11 May 2012</time></a></footer></blockquote><header class="social-embed-header" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><a href="https://twitter.com/kevglobal" class="social-embed-user" itemprop="url"><img class="social-embed-avatar social-embed-avatar-circle" src="data:image/webp;base64,UklGRugBAABXRUJQVlA4INwBAACwCQCdASowADAAPrVMoUynI6MiKrVaqOAWiWMAxxVPBAOTvWLa7wnDEHBsXvYybYIxOWgYdbBhqKBiKpLTCAwFzyTRJbW6ZYON7SSj1kIg/TjfH8NUGAD++5bMCvq0eeJ+m+ScbtVm1B4ju65wKNSkkytpJ9C0bISjCCd5zkIU7eOQv9mh97+F91jXhvB49YWzWoqQ/33RY13WuC2r9jTEY+UgqJSIRCGqsJibjV8tpqXkPvFvW9NLWn/+ObXnaQQNpiHvlgwKEs6qx6fhZPWbx0b9/S4pRk9CY/z+qvB/mjy/BrHyfLzqhkj14YjX18vAfIkv4N96d03RKB0kIqSoJ+S2RMvgFJr5IxyBYZEX3OUvfkakll8bau190J8MkLug2HEmhHVnAFV68Kg4mH4s3Trq75z+r51bJ9HG0CWGre/kHnEyELddmZzFB8xtD4WCR3tx0F000a5497aymDaAXSGKSiBwvhl/8QEB+9IBUf3OC8+Umh6ptAsDQMek3WfXPfnhJ0RwaTbJEOeoK2X4br776WozZ/RfH4yJlpKs3dGKrrHR79H+9wenHD1+47btxPL9vvL6zJn26d8xsASbIlk7vpk7gPedt2/xHKP6NGKKRkoOqhVw0+wAAA==" alt="" itemprop="image"><div class="social-embed-user-names"><p class="social-embed-user-names-name" itemprop="name">Mr Anderson</p>@kevglobal</div></a><img class="social-embed-logo" alt="Twitter" 
src="data:image/svg+xml,%3Csvg%20xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%22%0Aaria-label%3D%22Twitter%22%20role%3D%22img%22%0AviewBox%3D%220%200%20512%20512%22%3E%3Cpath%0Ad%3D%22m0%200H512V512H0%22%0Afill%3D%22%23fff%22%2F%3E%3Cpath%20fill%3D%22%231d9bf0%22%20d%3D%22m458%20140q-23%2010-45%2012%2025-15%2034-43-24%2014-50%2019a79%2079%200%2000-135%2072q-101-7-163-83a80%2080%200%200024%20106q-17%200-36-10s-3%2062%2064%2079q-19%205-36%201s15%2053%2074%2055q-50%2040-117%2033a224%20224%200%2000346-200q23-16%2040-41%22%2F%3E%3C%2Fsvg%3E"></header><section class="social-embed-text" itemprop="articleBody"><a href="https://twitter.com/edent">@edent</a> Put the Leveson docs up on Google Docs. I'd be curious how their OCR could handle them. Then click 'make public'</section><hr class="social-embed-hr"><footer class="social-embed-footer"><a href="https://twitter.com/kevglobal/status/200898240965644289"><span aria-label="0 likes" class="social-embed-meta">❤️ 0</span><span aria-label="1 replies" class="social-embed-meta">💬 1</span><span aria-label="0 reposts" class="social-embed-meta">🔁 0</span><time datetime="2012-05-11T10:40:47.000Z" itemprop="datePublished">10:40 - Fri 11 May 2012</time></a></footer></blockquote>

<p>Google Docs has an annoying 2MB limit for uploaded PDFs.  However, I've taken the first half of <a href="https://web.archive.org/web/20120511101152/http://www.levesoninquiry.org.uk/evidence/?witness=rebekah-brooks">Rebekah Brooks' witness statement</a> and run it through the OCR process.</p>

<p>This is how <a href="https://docs.google.com/document/d/1eTss2IfnCHAZQQVIfEGLpvHcriyzshpVodq4JrRopro/edit">Google recognises the text in the document</a></p>

<blockquote><p>Leveson Inquiry into the culture, practices and ethics of the press</p><br>

<p>1 I dlT| necessarily inhibited to some extent about what I can say in reiation to some of the issues that the Inquiry has raised with me.
My background</p><br>

<p>3. ijoined News International in 1989. I began my career on the News of the Worlcfs coiour supplement, Sunday magazine, whiie simultaneousiy attending ajournalism course at the London College of Printing.</p><br>
<p>4. Since then i have been either a joumeiist or an executive on both The News of the World and The Sun. For afrnc-st a decade Iwas a nationai newspaper editor. In May 2000 I became the editor of The News of the Worid and in January 2003 I became the editor of The Sun.</p><br>

<p>5. In September 2009, I was appointed Chief Executive of News lnternationaf. My responsibilities embraced ail the newspapers and digital products of the 1.... -. -</p></blockquote>

<p>That's based on this text:</p>

<p><a href="https://docs.google.com/document/d/1eTss2IfnCHAZQQVIfEGLpvHcriyzshpVodq4JrRopro/edit"><img src="https://shkspr.mobi/blog/wp-content/uploads/2012/05/Brooks-Witness-Statement.jpg" alt="Brooks Witness Statement" title="Brooks Witness Statement" width="422" height="384" class="aligncenter size-full wp-image-5703"></a></p>

<h2 id="why-is-this-important"><a href="https://shkspr.mobi/blog/2012/05/crowdsourcing-leveson/#why-is-this-important">Why Is This Important</a></h2>

<p>The journalist <a href="https://twitter.com/newsbrooke">Heather Brooke</a> has been ranting for some time about <a href="https://www.heatherbrooke.org/the-silent-state">the closed nature of the British Courts</a>. It's close to impossible to get verbatim or accurate information about court cases.  This means as citizens, journalists, or archivists, we can't accurately search documents.  We need access to the original digital documents.</p>

<p>Poor OCR is also a huge problem.  As above, OCR gives us a misleading impression that documents are searchable.</p>

<p>Should we wish to search, say <a href="https://web.archive.org/web/20120521174431/http://www.levesoninquiry.org.uk/evidence/?day=2012-04-24">KRM-18</a>, to see whether the MP Tom Watson is mentioned; a search for "Watson" turns up zero results.  Yet he is mentioned.</p>

<p>The page shows:
<img src="https://shkspr.mobi/blog/wp-content/uploads/2012/05/Evidence-mentioning-Watson.jpg" alt="Evidence mentioning Watson" title="Evidence mentioning Watson" width="599" height="58" class="aligncenter size-full wp-image-5705">
But the scanned text reads:</p>

<blockquote><p>Had ~ debrief with 5f[ ~nd his team tm~.igl~t ttt 77~ betbre he [o~ t.o his constituency:
</p><p>l-]~e is veo’ h.,qlopY~ith d~ ~va~, today" wellt mid ~s~cci~iiJ~’ ~,it[i tae ~bsoiutely’idiotie. del)&amp;t~s led by Wtttson.urtd
</p><p>Prescott.</p></blockquote>

<p>So, it's totally impossible to rapidly search through these documents. It would be necessary to laboriously read each document manually.</p>
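<p>Approximate matching can claw back the occasional hit, but only if you already know what you're looking for.  Here's a minimal sketch using Python's standard <code>difflib</code> module.  The function name <code>fuzzy_find</code> and the cutoff of 0.7 are my own choices for illustration, not a recommendation.</p>

<pre><code class="language-python">import difflib
import re

def fuzzy_find(query, ocr_text, cutoff=0.7):
    # Split the OCR output into alphabetic runs and compare each one to the
    # query, ignoring case. Returns the closest-looking words, best first.
    words = sorted(set(w.lower() for w in re.findall(r"[A-Za-z]+", ocr_text)))
    return difflib.get_close_matches(query.lower(), words, n=5, cutoff=cutoff)

matches = fuzzy_find("Watson", "led by Wtttson.urtd Prescott.")
</code></pre>

<p>That will pick up a near miss like "Wtttson", but a loose cutoff will also drag in false positives, and a badly mangled word won't be found at all.  It is no substitute for having the original text.</p>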

<h2 id="how-to-accomplish-this"><a href="https://shkspr.mobi/blog/2012/05/crowdsourcing-leveson/#how-to-accomplish-this">How To Accomplish This</a></h2>

<p>There are two ways to get this done - in the case of the Leveson Inquiry.</p>

<ol>
    <li>Petition the Inquiry to release the original documents.</li>
    <li>Crowdsource the OCR.  Take the Google OCR as a starting point and "Wikify" it to let anyone correct the text, a bit like <a href="http://www.pgdp.net/c/">Distributed Proofreaders</a>.</li>
</ol>

<p>I will, of course, send an email to the Leveson Inquiry - but would people be interested in being part of a crowdsourcing effort to open up these documents?</p>
]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2012/05/crowdsourcing-leveson/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title><![CDATA[Leveson - Death By A Thousand (Paper) Cuts]]></title>
		<link>https://shkspr.mobi/blog/2012/04/leveson-death-by-a-thousand-paper-cuts/</link>
					<comments>https://shkspr.mobi/blog/2012/04/leveson-death-by-a-thousand-paper-cuts/#respond</comments>
				<dc:creator><![CDATA[@edent]]></dc:creator>
		<pubDate>Wed, 25 Apr 2012 11:06:38 +0000</pubDate>
				<category><![CDATA[politics]]></category>
		<category><![CDATA[usability]]></category>
		<category><![CDATA[leveson]]></category>
		<category><![CDATA[murdoch]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[paper]]></category>
		<guid isPermaLink="false">http://shkspr.mobi/blog/?p=5619</guid>

					<description><![CDATA[I&#039;ve been listening to the Leveson inquiry. A large part of the exchanges seem to go like this:  Jay: Turning to page 51. Witness: Which bundle? Jay: 1606. Witness: 1660? Leveson: No, the page after. Jay: Paragraph 7. Witness: I don&#039;t have a paragraph 7. Jay: Ah, I have an earlier print out. Leveson: You&#039;ll find it in tab 15. Witness: Is this Volume 2?   And so on, ad nauseam.  Surely there&#039;s no…]]></description>
										<content:encoded><![CDATA[<p>I've been listening to the Leveson inquiry. A large part of the exchanges seem to go like this:</p>

<blockquote><p>Jay: Turning to page 51.
</p><p>Witness: Which bundle?
</p><p>Jay: 1606.
</p><p>Witness: 1660?
</p><p>Leveson: No, the page after.
</p><p>Jay: Paragraph 7.
</p><p>Witness: I don't have a paragraph 7.
</p><p>Jay: Ah, I have an earlier print out.
</p><p>Leveson: You'll find it in tab 15.
</p><p>Witness: Is this Volume 2?
</p></blockquote>

<p>And so on, <i lang="la">ad nauseam</i>.</p>

<p>Surely there's no reason to have so much paper wastefully printed and then discarded?  Why not a single reference electronic document which can be supplied to each participant? Allowing them to increase the font size, annotate, cross reference, and search?</p>

<h2 id="search"><a href="https://shkspr.mobi/blog/2012/04/leveson-death-by-a-thousand-paper-cuts/#search">Search</a></h2>

<p>Ah, search.  Searching text is something computers are really good at.  Within a fraction of a second, even a modest computer can extract every sentence which contains the word "Clegg" from hundreds of thousands of pages.  Brilliant! Makes life really easy. Until humans come along and bugger about with it.</p>
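<p>To show how little it takes when the text is clean, here's a minimal sketch.  The function name <code>sentences_mentioning</code> is made up for this example, and the sentence splitting is deliberately naive.</p>

<pre><code class="language-python">import re

def sentences_mentioning(word, text):
    # Naive sentence split: runs of text ending in ., ! or ?
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]?", text)]
    # Match the word only where it stands alone, not inside a longer word.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [s for s in sentences if pattern.search(s)]
</code></pre>

<p>Feed it clean text and it finds every mention.  Feed it a mangled spelling - say a hypothetical "C1egg" - and it finds nothing.</p>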

<p>Let's take a look at the "smoking gun" <a href="https://web.archive.org/web/20120428223514/http://www.levesoninquiry.org.uk/evidence/?day=2012-04-24">emails which have been submitted from News International to Leveson</a>. Specifically <a href="https://web.archive.org/web/20120428084720/http://www.levesoninquiry.org.uk/wp-content/uploads/2012/04/Exhibit-KRM-18.pdf">KRM18</a>.</p>

<p>I have no idea how these emails were supplied to Leveson. I <strong>hope</strong> that they were submitted electronically - with all headers intact. What's supplied to the public, however, is this:</p>

<p><img src="https://shkspr.mobi/blog/wp-content/uploads/2012/04/Leveson-Email-Printed.jpg" alt="Leveson Email Printed" title="Leveson Email Printed" width="623" height="378" class="aligncenter size-full wp-image-5621">
The emails have been...</p>

<ul>
    <li>Printed out.</li>
    <li>Redacted with marker pen.</li>
    <li>Scanned in as a PDF.</li>
    <li>Then subject to an uncorrected OCR process.</li>
</ul>

<p>Computers are <em>really</em> bad at recognising text. OCR (Optical Character Recognition) is a very error-prone process.  Take a look at how the computer has translated the above document.</p>

<img src="https://shkspr.mobi/blog/wp-content/uploads/2012/04/Leveson-Email-Printed-OCR.jpg" alt="Leveson Email Printed OCR" title="Leveson Email Printed OCR" width="626" height="382" class="aligncenter size-full wp-image-5620">

<p>It's <em>partly</em> there. But enough of the characters are mangled, and enough of the words distorted, that searching through the text is near impossible.</p>

<p>I get that PDF is a reasonably popular file format for sharing documents. It preserves the document structure faithfully - but at the expense of readability, fluidity, and usefulness.  But distributing <em>images</em> is the least useful way of distributing information to people who want to use it.</p>

<p>It's simply bad civic responsibility to do this.  These emails, if they are important enough to be made public, should be made public in their original form. I understand that some redactions should be made - but that's about the limit.</p>

<p>How on Earth is anyone supposed to make sense of this extract?
<img src="https://shkspr.mobi/blog/wp-content/uploads/2012/04/OCR.jpg" alt="OCR" title="OCR" width="603" height="229" class="aligncenter size-full wp-image-5625"></p>

<p>We need to shake off the tyranny of printed paper. It is wasteful, non-useful, and - in this context - damaging to justice.</p>

<p>I leave you with an entirely random extract from the emails...
<img src="https://shkspr.mobi/blog/wp-content/uploads/2012/04/Please-Consider-The-Environment-Before-Printing-This-Email.jpg" alt="Please Consider The Environment Before Printing This Email" title="Please Consider The Environment Before Printing This Email" width="602" height="574" class="aligncenter size-full wp-image-5623"></p>
]]></content:encoded>
					
					<wfw:commentRss>https://shkspr.mobi/blog/2012/04/leveson-death-by-a-thousand-paper-cuts/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
