Why "Raw Data Now" Could Fail...
Tim Berners-Lee has made a call for governments to open up their data. Indeed, Tim's been appointed by the UK government to do just that.
His central thesis is that we, the taxpayers, have paid for government research and data - we should be able to access it. Easy, free and unfettered access to raw, unadulterated data will allow us to do wonderful things.
Take a look at his recent TED Talk - it's inspiring stuff.
I think there's a fatal flaw in his plan.
Data, in its raw form, is hard to come by. Data in databases is, in my experience, a rarity.
Data is usually held in Excel workbooks or Word documents or, more likely, random emails.
Let's take, as a perfect example, the Post Office.
Tom Taylor wanted to know the location of every postbox in the UK.
This is the sort of information which could be very useful to all sorts of projects. A widget could tell you where to find the nearest postbox that hasn't yet had its last pickup. If you were looking for a new place to live, knowing where the postboxes were would be helpful. Perhaps there is a public health implication that none of us are aware of yet.
It's the sort of small, bespoke manipulation of information which free access to data makes possible.
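To make the widget idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the sample records, the field layout and the function names are made up, since the whole point is that no such dataset is readily available. It picks the closest postbox whose last collection is still to come.

```python
import math
from datetime import time

# Hypothetical sample records: (reference, latitude, longitude, last collection time)
POSTBOXES = [
    ("A", 51.5010, -0.1420, time(17, 30)),
    ("B", 51.5033, -0.1276, time(9, 0)),
    ("C", 51.5079, -0.0877, time(18, 30)),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_open_postbox(lat, lon, now, postboxes=POSTBOXES):
    """Return the closest postbox whose last collection is still to come, or None."""
    candidates = [box for box in postboxes if box[3] > now]
    if not candidates:
        return None
    return min(candidates, key=lambda box: haversine_km(lat, lon, box[1], box[2]))
```

Calling `nearest_open_postbox(51.5034, -0.1270, time(10, 0))` skips postbox B, even though it is the closest, because its hypothetical 09:00 collection has already gone.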
So Tom makes a Freedom of Information request to the Post Office.
You can read the whole story yourself but, in synopsis, the Post Office doesn't hold these data!
Local post offices may do - but it's probably on scraps of paper, old printouts, or a list in an obscure data format on an old PC that's never backed up.
So, some bright spark creates a database in Microsoft Access. Not only is Access a barely credible alternative to a database but (and here's the punchline) the database they've created doesn't record post codes properly!
Now, this isn't a crappily designed product built by EDS or Capita to an ever-changing specification - this is an in-house design. Probably specced and built by someone with a day's training in Access. People who only think they know what they're doing are dangerous.
Because it's only designed for internal use - and light use at that - the data and its structure are of extremely low quality.
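Recording postcodes properly is mostly a matter of checking them on the way in. A minimal sketch in Python follows, with loud caveats: the pattern is my own simplification covering the common UK postcode shapes, not Royal Mail's full rules, and `normalise_postcode` is a hypothetical helper, not anything the Post Office uses.

```python
import re

# Simplified shape: 1-2 letters, a digit, an optional digit or letter,
# then a digit and two letters (e.g. M1 1AE, SW1A 1AA).
# Assumption: this is an approximation, not the official specification.
_POSTCODE = re.compile(r"[A-Z]{1,2}[0-9][0-9A-Z]?[0-9][A-Z]{2}")


def normalise_postcode(raw):
    """Return the postcode in 'OUTWARD INWARD' form, or None if it looks wrong."""
    compact = re.sub(r"\s+", "", raw).upper()
    if not _POSTCODE.fullmatch(compact):
        return None
    # The inward part is always the final three characters.
    return compact[:-3] + " " + compact[-3:]
```

A database that stored only the value returned here, and rejected anything that came back as `None`, would at least hold its postcodes in one consistent form.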
At best, huge tranches of data are held in barely functional, imperfect databases. The rest sits in flat files on individual computers, in multiple inconsistent revisions.
Now, it's been several years since I've worked for a local government, but I can't believe too much has changed since then. Especially given what I see in day-to-day business. For some of the companies that I do business with, the very idea of having a database is akin to science fiction. Everyone knows that best practice is to keep data in a centralised location, in a well-maintained database. But everyone knows that it's easier, in the short term, to keep the data in a spreadsheet on your desktop.
So, the challenge is 3-fold.
1. Convince people that placing information is a good thing to do.
2. Design databases which are both correct & useful.
3. Free the data from their hellish-Excel bondage.
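The third of those challenges is often less work than it sounds. As a sketch only, here is one way to move a spreadsheet that has been exported as CSV into a SQLite database using Python's standard library. The column names (`reference`, `postcode`), the table name and the sample rows are all assumptions made up for the example.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(csv_file, conn):
    """Copy rows from a CSV export into a 'postboxes' table; return rows read."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postboxes ("
        " reference TEXT PRIMARY KEY,"
        " postcode TEXT NOT NULL)"
    )
    reader = csv.DictReader(csv_file)
    rows = [(r["reference"].strip(), r["postcode"].strip().upper()) for r in reader]
    with conn:  # commits on success, rolls back on error
        # INSERT OR REPLACE keeps one row per reference if the export has duplicates
        conn.executemany("INSERT OR REPLACE INTO postboxes VALUES (?, ?)", rows)
    return len(rows)


# Hypothetical export containing a duplicated, inconsistently typed row
sample = io.StringIO(
    "reference,postcode\nPB001,sw1a 1aa\nPB002,M1 1AE\nPB001,SW1A 1AA\n"
)
conn = sqlite3.connect(":memory:")
load_csv_into_sqlite(sample, conn)
print(conn.execute("SELECT COUNT(*) FROM postboxes").fetchone()[0])
```

The primary key is the design choice that matters: the spreadsheet happily held the same postbox twice, while the table cannot.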
It won't be easy. However, the end result will be worthwhile.
But it's up to all of us - whether we're in public or private service - to make sure the data we're creating is rational, well-formed and accessible.
Edit: 29/06/2009 Tim has posted about Putting Government Data Online.
The chances are quite high that the data your department/agency runs off will be largely in relational databases, often with a large amount in spreadsheets.
Doug Aitken says:
I totally agree with you, it's simple things and practices that would make life a whole lot easier. I work in a part of a large company and a lot of our data is held in spreadsheets but I'm sure a lot is also in databases. I might actually try & find out, might be useful!
Matthew says:
Having the data is enough if you get people to improve it for you: http://www.dracos.co.uk/play/locating-postboxes/ 🙂