When is a URL not a URL?
Summary
Twitter's way of linking URLs is broken. It's annoying to users, and a pain in the arse to developers. This quick post talks about the problem and offers a solution.
I've raised a bug with Twitter and I hope you'll star it as important to you.
Preamble
A common trope in programming classes is "how do you detect a valid email address?"
It should be obvious, right? A string of text, an @, a domain - probably ending in .com. As it turns out, it's not that simple. "who+o'toole@invalid.museum" is a potentially valid address, for example. There are literally thousands of ways to detect the potentially infinite variety of email addresses.
The same is true for URLs - and slavish adherence to guidelines is killing Twitter's usefulness.
The URL Matching Problem
Which of these strings should be turned into hyperlinks?
www.bbc.co.uk example.com http://test https://test.test ftp://news.com
As it happens, Twitter links only "https://test.test" and ignores the rest.
Twitter's matching rule is, as far as I can tell, this:
If it starts with http:// or https:// and has a dot in it - it's a URL
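To make that rule concrete, here is a rough Python sketch of it. This is only a guess based on the behaviour described above, not Twitter's actual code, and the names OBSERVED_RULE and looks_like_twitter_url are made up for illustration.

```python
import re

# Hypothetical reconstruction of the observed behaviour, NOT Twitter's real regex:
# the token must start with http:// or https:// and contain a dot after the scheme.
OBSERVED_RULE = re.compile(r"^https?://\S*\.\S+$", re.IGNORECASE)

def looks_like_twitter_url(token: str) -> bool:
    """Return True if the token would be linked under the guessed rule."""
    return OBSERVED_RULE.match(token) is not None

samples = ["www.bbc.co.uk", "example.com", "http://test",
           "https://test.test", "ftp://news.com"]
linked = [s for s in samples if looks_like_twitter_url(s)]
print(linked)  # ['https://test.test']
```

Under the same guess, "http://bork.bork.bork" passes and "www.example.com" does not, because the rule checks only the shape of the string and never whether the host exists.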
I think this is a serious weakness. Twitter users are sharing URLs which their followers can't click on, and at the same time Twitter is linking to URLs which don't exist.
I've picked these examples more or less at random.
Solution?
Much like the email regexes, I would take a much more lax approach. Essentially, if it looks vaguely like a URL - link to it.
I would suggest the following rules:
- If it starts with a protocol - http:// ftp:// tel: etc - create a hyperlink.
- If it starts with www. - create a hyperlink.
- If it ends with a dot followed by a valid TLD - create a hyperlink.
- If it contains a valid TLD followed by a slash then some other characters - create a hyperlink.
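The four rules above could be sketched in Python roughly as follows. This is one possible reading, not a definitive implementation: it assumes the text has already been split on whitespace, the scheme list and the tiny TLD set are illustrative stand-ins (a real version would load the full IANA list), and the names should_link and ends_with_valid_tld are hypothetical.

```python
# Illustrative subset only; a real implementation would load the full TLD list.
VALID_TLDS = {"com", "org", "net", "uk", "museum", "ly", "es", "im"}

# A short, assumed list of protocols; extend as needed.
SCHEMES = ("http://", "https://", "ftp://", "tel:", "mailto:")

def ends_with_valid_tld(host: str) -> bool:
    """True if host contains a dot and its final label is a known TLD."""
    name, dot, tld = host.rpartition(".")
    return bool(name) and bool(dot) and tld.lower() in VALID_TLDS

def should_link(token: str) -> bool:
    """Apply the four lax rules to a single whitespace-delimited token."""
    lowered = token.lower()
    # Rule 1: starts with a protocol, with something after it.
    if any(lowered.startswith(s) and len(lowered) > len(s) for s in SCHEMES):
        return True
    # Rule 2: starts with www. and has something after it.
    if lowered.startswith("www.") and len(lowered) > len("www."):
        return True
    # Rule 3: ends with a dot followed by a valid TLD.
    if ends_with_valid_tld(lowered):
        return True
    # Rule 4: a valid TLD, then a slash, then some other characters.
    host, slash, path = lowered.partition("/")
    return bool(slash) and bool(path) and ends_with_valid_tld(host)

for token in ["www.bbc.co.uk", "example.com", "http://test",
              "https://test.test", "ftp://news.com", "bit.ly/abc", "hello"]:
    print(token, should_link(token))
```

With these rules all five strings from the earlier list qualify, as does a short link like "bit.ly/abc", while a plain word like "hello" does not. Trailing punctuation (for example "example.com." at the end of a sentence) is not handled here and would need stripping first.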
The "correct" method would then be for Twitter to perform an HTTP HEAD request to see if the URL is potentially valid. There are three drawbacks to this.
- It may place excessive load on Twitter's servers to process and cache these requests.
- The URL may be that of an intranet site - and thus inaccessible to Twitter.
- The URL may be valid but temporarily inaccessible.
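For illustration, a HEAD check along these lines might look like the Python sketch below, using only the standard library. The function name head_status and the three-second timeout are assumptions. Because of the drawbacks above, a failed request is reported as "unknown" (None) rather than as proof that the URL is invalid.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional

def head_status(url: str, timeout: float = 3.0) -> Optional[int]:
    """Send an HTTP HEAD request.

    Returns the HTTP status code if a server answered, or None if the
    URL could not be checked (not http/https, unreachable, timed out).
    """
    # Only probe web URLs; anything else is left unchecked.
    if not url.lower().startswith(("http://", "https://")):
        return None
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, just not with a success code (e.g. 404).
        return error.code
    except (OSError, ValueError, http.client.HTTPException):
        # DNS failure, refused connection, timeout, malformed URL:
        # possibly an intranet site or a temporary outage.
        return None

print(head_status("not a url"))  # None
```

A caller could then choose to link anything that returns a status code or None, and only skip the link when the server positively says the page is missing.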
Regardless of the method, surely it's inexcusable that "www.example.com" isn't detected as a URL whereas "http://bork.bork.bork" is?
ACTION!
If you think Twitter's approach to hyperlinks is wrong - please make your voice heard at the bug report.
Tom Morris says:
My favourite solution to this is to reuse or reimplement an existing good library for it: the best one I've found is AutoHyperlinks, which is used by the Macintosh IM client Adium - it is clever enough that if you put something like "bit.ly" or "fox.es" or "sn.im" in, it'll detect it as a hyperlink by checking against a list of valid TLDs.
http://code.google.com/p/maccode/wiki/AutoHyperlinks
Take a gander: it has lots of interesting edge cases, AND it has unit tests. This is definitely something worth nicking: I'm almost tempted to set up a web service that lets people chuck in text with URLs and get back HTML using AutoHyperlinks.