Deficiencies in the Twitter Archive
After Twitter's repeated broken promises, I was unsure if I'd ever get access to my Twitter archive. But, finally, I'm able to extract my data from their systems.
There are a number of deficiencies. Of course, it's impossible to please all the people all the time but, in typical Twitter fashion, they don't appear to have made the effort to satisfy anyone.
Let's take a quick run-through of where the archive breaks down.
Usernames Change
When I first signed up to Twitter, I was known as "Vodaclone". What a witty and original name! I took advantage of Twitter's ability to change screen names a few months after joining.
Yet all the tweets are written as though they come from @edent.
Profile Picture Changes
Like many people, I update my avatar. According to the Twitter archive - I've used the same one since the dawn of time. It's sad that there doesn't appear to be any record of the faces I pulled or the banners I added.
Missing Media
Indeed, it's odd that the avatar images aren't linked to the originals. What's more annoying is that the images I've uploaded to Twitter's image service aren't included. My archive of ~28,000 Tweets weighs in at 6MB. Adding in the images would balloon that, but it couldn't be a huge strain on Twitter's resources.
I used grep to extract all the media_url parameters, then used wget to download them all. My 370 images took up a mere 33MB.
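For anyone who would rather not chain grep and wget together, here is a rough Python sketch of the same idea. It assumes the archive stores tweets as JSON-like text in .js files and that each image appears as a "media_url" key. The function names and the directory arguments are my own inventions for illustration, not anything Twitter provides.

```python
import re
from pathlib import Path
from urllib.request import urlretrieve

# Matches "media_url" : "..." but not "media_url_https",
# because the closing quote must follow media_url directly.
MEDIA_URL = re.compile(r'"media_url"\s*:\s*"([^"]+)"')


def extract_media_urls(text):
    """Return the unique media_url values in text, in order of first appearance."""
    seen = {}
    for url in MEDIA_URL.findall(text):
        # Some JSON exports escape forward slashes as \/ so undo that.
        seen.setdefault(url.replace("\\/", "/"), None)
    return list(seen)


def download_all(archive_dir, out_dir):
    """Download every image referenced in the archive (hypothetical layout)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for js_file in sorted(Path(archive_dir).rglob("*.js")):
        text = js_file.read_text(encoding="utf-8")
        for url in extract_media_urls(text):
            target = out / url.rsplit("/", 1)[-1]
            if not target.exists():  # skip files fetched on an earlier run
                urlretrieve(url, str(target))
```

Calling download_all("path/to/archive", "images") would then fetch everything once. Skipping files which already exist means the script can be re-run safely after a failed download.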
Favourites
Remember all those cool Tweets you favourited? They're not here.
No Direct Messages
Twitter has gone to great lengths to try and kill off its DM service. Once it realised that people were using the service in a way which wouldn't force them to interact with advertisers, they've made it steadily harder to access private messages.
They are, of course, completely absent from the archive. Hopefully all the meaningful DMs which were sent to you are backed up in email somewhere - but the ones you sent are nowhere to be seen. Good luck future scholars of the world!
Lack of Thread Context
A typical tweet from 2010 reads
What was Amanda saying? The only way to find out is to go on to the Twitter website and see the thread in question. The archive doesn't include any of the replies people sent you.
Ideally, Twitter could have included the complete conversation threads in the archive. Twitter's threading tools are notably abysmal. The lack of being able to search for tweets via the "in_reply_to" metadata makes understanding conversation threads particularly troublesome.
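Since there is no way to search on that metadata, the obvious workaround is to build the index yourself. This is a minimal sketch which assumes each tweet has been loaded into a Python dict with an "id" key and, for replies, an "in_reply_to_status_id" key. How the tweets get loaded, and the function names, are assumptions of mine.

```python
from collections import defaultdict


def index_by_parent(tweets):
    """Map each replied-to tweet ID to the list of tweets replying to it."""
    replies = defaultdict(list)
    for tweet in tweets:
        parent = tweet.get("in_reply_to_status_id")
        if parent is not None:
            replies[parent].append(tweet)
    return replies


def missing_parents(tweets):
    """Return the IDs which are replied to but are not in the archive."""
    own_ids = {tweet["id"] for tweet in tweets}
    return sorted(set(index_by_parent(tweets)) - own_ids)
```

The list from missing_parents is exactly the set of tweets which would have to be fetched from the website or the API to make sense of a conversation.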
No Updated Metadata
I remember sitting down with Twitter's developer relations guy - Raffi Krikorian - at WarbleCamp (this was back when Twitter cared about developers). We thrashed out some of the ideas around Entities and how they could be useful to the developer community.
One of the things which never happened was "backporting" entities. Tweets which were made before entities were switched on are stuck with no metadata. So if you're trying to examine your archive for hashtags, links, etc - you'll have a hard time.
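Backporting can be approximated locally. The sketch below rebuilds a basic "entities" dict from the tweet text using regular expressions. It assumes the tweet is a dict with a "text" key, and it copies the general shape of the entities on newer tweets (hashtags, user_mentions, urls, each with indices). The patterns are deliberately naive and are not Twitter's own parsing rules - for instance, a URL followed directly by a full stop will swallow the full stop.

```python
import re

URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"(?<!\w)#(\w+)")
# Screen names are at most 15 characters; longer runs are ignored.
MENTION = re.compile(r"(?<!\w)@(\w{1,15})(?!\w)")


def build_entities(text):
    """Derive a rough entities dict from the raw tweet text."""
    # Blank out URLs (keeping the length) so that a #fragment or @ inside
    # a link is not mistaken for a hashtag or mention.
    masked = URL.sub(lambda m: " " * len(m.group(0)), text)
    return {
        "urls": [
            {"url": m.group(0), "indices": [m.start(), m.end()]}
            for m in URL.finditer(text)
        ],
        "hashtags": [
            {"text": m.group(1), "indices": [m.start(), m.end()]}
            for m in HASHTAG.finditer(masked)
        ],
        "user_mentions": [
            {"screen_name": m.group(1), "indices": [m.start(), m.end()]}
            for m in MENTION.finditer(masked)
        ],
    }


def backport(tweet):
    """Return the tweet unchanged if it has entities, else a copy with them added."""
    if tweet.get("entities"):
        return tweet
    return {**tweet, "entities": build_entities(tweet["text"])}
```

Leaving tweets which already have entities untouched means Twitter's own metadata always wins over the guessed version.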
A perfect example is this early tweet about Twitter annotations. Even on the Twitter website, the URL isn't automatically linked.
Comparisons to Other Services
Facebook, for all its failings, is pretty good at giving you an archive of all your content. Given the complexities of its databases, it's not surprising that it's a bit mangled - but it's there. You get all the photos and videos you uploaded as well.
Yes, it's great that Twitter has finally kept its word on data extraction - but this really feels like an underwhelming effort.
Tools
So, it looks like I'll have to write a tool to download all the missing tweets, conversations, photos, favourites, and DMs. I'll also need to write a script which reformats the metadata on old posts to ensure they are compatible with new ones.
Then dump everything into a database, or series of flat files.
Of my 28k Tweets - around 12k are replies to other tweets. How very sociable of me! Twitter rate limit their API to 150 queries an hour. At one query per tweet, that's 80 hours - so it will take the best part of 4 days to get all the tweets to which I've replied. Of course, then it becomes a recursive issue (I have to see which of those are replies, and which of those replies are replies etc). So, probably a week.
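The back-of-the-envelope sum, for anyone who wants to plug in their own numbers. It assumes one API query per tweet fetched and a fetcher which runs flat out without errors or retries. The function name is mine.

```python
import math


def hours_needed(requests, per_hour=150):
    """Whole hours needed to make the given number of rate-limited requests."""
    return math.ceil(requests / per_hour)


replies = 12_000                # tweets of mine which are replies
hours = hours_needed(replies)   # 12,000 / 150 = 80 hours
days = hours / 24               # a little over 3 days of non-stop fetching
print(f"{hours} hours, or about {days:.1f} days")
```

That only covers the first level of replies. Each further level of the thread adds its own batch of queries, which is where the rest of the week goes.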
Fun times ahead.
Joris Leermakers says:
I've created a simple tool to analyze the archive a bit more (clouds of most used words, hashtags etc...) http://twitter.leermakers.net/ Just upload your ZIP from Twitter. Within a few minutes you have an analysis like http://twitter.leermakers.net/6b045f5197b524133c298f7c8cb5e7e4/ (my personal account analyzed)