When is a URL not a URL?

By @edent

Summary

Twitter's way of linking URLs is broken. It's annoying to users, and a pain in the arse to developers. This quick post talks about the problem and offers a solution.

I've raised a bug with Twitter and I hope you'll star it as important to you.

Preamble

A common trope in programming classes is "how do you detect a valid email address?"

It should be obvious, right? A string of text, an @, a domain - probably ending in .com. As it turns out, it's not that simple. "who+o'toole@invalid.museum" is a potentially valid address, for example. There are literally thousands of ways to detect the potentially infinite variety of email addresses.

The same is true for URLs - and slavish adherence to guidelines is killing Twitter's usefulness.

The URL Matching Problem

Which of these strings should be turned into hyperlinks?

www.bbc.co.uk

example.com

http://test

https://test.test

ftp://news.com

As it happens, Twitter only matches "https://test.test" and none of the others.

Twitter's matching regex is, as far as I can tell, equivalent to this:

If it starts with http:// or https:// and has a dot in it - it's a URL

I think this is a serious weakness. Twitter users are sharing URLs which their followers can't click on, while Twitter is linking to URLs which don't exist.
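To make that guess concrete, here is a minimal Python sketch of the rule as I read it. It is not Twitter's real pattern; the names GUESSED_RULE and is_linked are made up for illustration, and the regex is only my approximation of "starts with http:// or https:// and has a dot in it".

```python
import re

# An approximation of the observed behaviour, NOT Twitter's actual regex:
# the text must start with http:// or https:// and contain a dot
# with at least one character after it.
GUESSED_RULE = re.compile(r"^https?://\S*\.\S+$", re.IGNORECASE)


def is_linked(text):
    """Return True if the guessed rule would turn `text` into a hyperlink."""
    return GUESSED_RULE.match(text) is not None


samples = [
    "www.bbc.co.uk",
    "example.com",
    "http://test",
    "https://test.test",
    "ftp://news.com",
]

# Only the strings the guessed rule would link.
linked = [s for s in samples if is_linked(s)]
print(linked)  # ['https://test.test']
```

Under this reading, the only sample that gets linked is "https://test.test", while something like "http://bork.bork.bork" would be linked without complaint.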

I've picked these examples more or less at random.

Solution?

Much like the email regexes, I would take a much more lax approach. Essentially, if it looks vaguely like a URL - link to it.

I would suggest the following rules:

If it starts with a protocol - http:// ftp:// tel: etc - create a hyperlink.
If it starts with www. - create a hyperlink.
If it ends with a dot followed by a valid TLD - create a hyperlink.
If it contains a valid TLD followed by a slash then some other characters - create a hyperlink.
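
The four rules above can be sketched in a few lines of Python. Treat this as a rough illustration rather than a finished matcher: SCHEMES and VALID_TLDS are tiny hand-picked samples I've assumed for the example (a real TLD list would come from IANA), the function names are hypothetical, and it looks at one whitespace-free token at a time without handling trailing punctuation.

```python
# Illustrative samples only; "etc" in rule 1 is left to the implementer.
SCHEMES = ("http://", "https://", "ftp://", "tel:", "mailto:")
VALID_TLDS = {"com", "org", "net", "uk", "museum", "ly", "es", "im", "test"}


def ends_with_tld(host):
    """True if `host` is something, then a dot, then a known TLD."""
    name, dot, tld = host.rpartition(".")
    return bool(name) and dot == "." and tld in VALID_TLDS


def should_link(token):
    """Apply the four lax rules to a single token of text."""
    t = token.strip().lower()
    if not t or any(c.isspace() for c in t):
        return False
    # Rule 1: starts with a protocol.
    if t.startswith(SCHEMES):
        return True
    # Rule 2: starts with www.
    if t.startswith("www."):
        return True
    # Rule 3: ends with a dot followed by a valid TLD.
    if ends_with_tld(t):
        return True
    # Rule 4: a valid TLD, then a slash, then some other characters.
    host, slash, rest = t.partition("/")
    return bool(slash) and bool(rest) and ends_with_tld(host)
```

With these sample lists, all five strings from the matching problem come back as linkable, as do "bit.ly" and "example.com/page", while plain words like "hello" or "file.txt" are left alone.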

The "correct" method would then be for Twitter to perform an HTTP HEAD request to see if the URL is potentially valid. There are three drawbacks to this:

It may place excessive load on Twitter's servers to process and cache these requests.
The URL may be that of an Intranet site - and thus inaccessible to Twitter.
The URL may be valid but temporarily inaccessible.
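
For what it's worth, a HEAD check along these lines might look like the Python sketch below, using only the standard library. The function name head_status and the three-second timeout are my own assumptions. The point it tries to show is that a failed request can only ever mean "unknown", never "invalid", because of the intranet and temporary-outage cases above.

```python
import http.client
import urllib.error
import urllib.request


def head_status(url, timeout=3.0):
    """Return 'responded' if a server answered the HEAD request, else 'unknown'."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout):
            return "responded"
    except urllib.error.HTTPError:
        # The server answered, even if with a 404 or 405: something is there.
        return "responded"
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, unreachable intranet host,
        # or a string with no scheme at all. None of these prove the URL is bad.
        return "unknown"
```

A caller would still hyperlink an "unknown" result; the check can add confidence but shouldn't take a link away.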

Regardless of the method, surely it's inexcusable that "www.example.com" isn't detected as a URL whereas "http://bork.bork.bork" is?

ACTION!

If you think Twitter's approach to hyperlinks is wrong - please make your voice heard at the bug report.

One thought on “When is a URL not a URL?”

2011-07-28 17:55

Tom Morris says:

My favourite solution to this is to reuse or reimplement an existing good library for it: the best one I've found is AutoHyperlinks, which is used by the Macintosh IM client Adium - it is clever enough that if you put something like "bit.ly" or "fox.es" or "sn.im" in, it'll detect it as a hyperlink by checking against a list of valid TLDs.

http://code.google.com/p/maccode/wiki/AutoHyperlinks

Take a gander: it has lots of interesting edge cases, AND it has unit tests. This is definitely something worth nicking: I'm almost tempted to set up a web service that lets people chuck in text with URLs and get back HTML using AutoHyperlinks.