Bugs in Twitter Text Libraries
The Twitter Engineering Team have a set of text processing classes which are meant to simplify and standardise the recognition of URLs, screen names, and hashtags. Dabr uses them to stay consistent with Twitter's own rendering.
One of the advantages of the text processing is that it will recognise that www.example.com is a URL and automatically create a hyperlink. Considering that dropping the 7 characters of "http://" represents a 5% saving on Twitter's 140 character limit for messages, this is great.
So, I was mightily surprised to get this bug report from user "schmmuck"
How very odd... This is how it looks on m.twitter.com.
Twitter also use mobile.twitter.com for smartphones. Here's how that site renders the text.
Finally, let's take a look at the "canonical" rendering at Twitter.com.
The Problem(s)
The first issue is inconsistency. Twitter ought to be using the same regex for each of its sites. It doesn't. This means that different developers will get divergent experiences. This leads to confusion, which leads to fear, which, as we all know, leads to anger.... and so forth.
Secondly, and more importantly, parsing is hard. There are so many edge cases that errors inevitably creep in. My post about hashtags explains the problems in defining what should be recognised.
So, based on what we've seen, should Twitter recognise any of the following as URLs?
news.bbc.co.uk - no www there.
invalid.name - a silly URL, but a valid one.
खोज.com - internationalised domain names contain more than just ASCII.
All the above are valid - yet they're not recognised by Twitter.
A (Simple) Solution?
IANA maintains a canonical list of TLDs, which is also available as a plain text file.
Any string containing a "." followed by a valid TLD, and then by a space or "/", should be treated as a URL.
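As a rough sketch of that rule (not Twitter's actual code), here is how it might look in Python. The VALID_TLDS set is a tiny hand-picked stand-in for the full list, find_urls is a hypothetical name, and treating the end of the text like a trailing space is my own assumption. It only looks at scheme-less hosts; anything starting with "http://" would be handled elsewhere.

```python
import re

# Illustrative subset only. A real implementation would load the full
# published TLD list (assumption: compared case-insensitively).
VALID_TLDS = {"com", "uk", "name", "org", "net"}

# One or more non-empty labels each ending in ".", then a final label
# (the would-be TLD), then an optional "/path". The candidate must start
# after whitespace (or at the start) and end at whitespace (or at the end).
CANDIDATE = re.compile(r"(?<!\S)((?:[^\s./]+\.)+([^\s./]+))(/\S*)?(?=\s|$)")


def find_urls(text):
    """Return scheme-less candidates whose final label is a known TLD."""
    urls = []
    for match in CANDIDATE.finditer(text):
        host = match.group(1)
        tld = match.group(2)
        path = match.group(3) or ""
        if tld.lower() in VALID_TLDS:
            urls.append(host + path)
    return urls
```

With this, find_urls("see news.bbc.co.uk today") picks out news.bbc.co.uk, and खोज.com is matched too because labels are not restricted to ASCII. "awwww...foo" yields nothing, since an empty label between two dots never matches. A rule this simple will still link any ordinary text that happens to end in a valid TLD, so treat it as a starting point rather than a fix.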
Your thoughts?
David Carrington says:
Consistency, I agree, needs to be fixed. The www problem too.
I'm not convinced that adding handling for all TLDs is worth the few extra characters saved.
Owen Blacker says:
I'd disagree with David about adding support for the full canonical list of TLDs — it's software, it's not as if it's hard and a 5% character saving is gonna be useful to quite a few people.
But you're both definitely right about consistency, that's just sloppy of them!
David Carrington says:
My point is that the list of TLDs changes, and I can't be arsed to subscribe to something which notifies me when I need to refresh the list in my apps. It will also make the regex even bigger than it is now, which I'd personally prefer not to do.
There is also the problem with clashes such as "Hi john.in tomorrow?" - is that http://www.john.in? I think not. It's those kinds of issues which I believe are not worth the effort of solving for the sake of 7 characters.
Steven Pears says:
I think the regex is a good start, but that you need to make it the first in a two-pass process. Put a word boundary on the regex so that it doesn't allow for "awwww...", then use regex groups to give you a quick and easy way to check a valid TLD.
If your TLD list is out of date then you accept you may not catch 100%, but you're going to get a much more accurate set of results compared to regex alone, and in the long run people will appreciate it.
Matt Sanford says:
It's great to have people checking over the auto-linking issues at last. The http://www...foo issue was reported at http://github.com/mzsanford/twitter-text-rb/issues/closed#issue/4 and has since been fixed in the Open Source library and is awaiting deploy on twitter.com.
It turns out to be a harder problem than expected. The http://www...foo bug was introduced while adding support for some IDNs. I'll investigate the TLD stuff but the main worry is the gTLD process and what that will mean for the list of valid TLDs.