regex – Terence Eden’s Blog

Regular Expressions make me feel like a powerful wizard - and that's not a good thing

@edent — Mon, 06 Feb 2023 12:34:19 +0000

(This is a rant because I'm exhausted after debugging something. If you've made RegEx your whole personality, I'm sorry.)

The other day I had to fix a multi-line Regular Expression (RegEx). After a few hours of peering at it with a variety of tools, I finally understood the problem. Getting that deep into the esoteric mysteries made me feel like a powerful wizard with complete mastery of my domain. And I think that's dangerous.

I'm sure we've all read a story about a witch or wizard who distractedly substitutes eye-of-newt with iron-ute with disastrous consequences. Humans are easily confused. And confusion leads to unexpected mistakes.

Look, most humans are very bad at reading compiled code. Without any external tools - can you tell me what the following code does?

0000000 c031 d88e c08e 15be b47c ac0e 003c 0474
0000010 10cd f7eb 48f4 6c65 6f6c 202c 6f57 6c72
0000020 2164 0a0d 0000 0000 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
*
00001f0 0000 0000 0000 0000 0000 0000 0000 aa55
0000200

No. Of course not⁰. That's why we write code in a more human readable language and then compile it to computer readable instructions.
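With a tool, of course, it is a different story. As a rough sketch, here is how the dump above could be turned back into bytes and searched for readable text. It assumes the dump uses the two-byte little-endian word layout that `od` prints by default on x86 machines; the helper names `words_to_bytes` and `printable_runs` are invented for this example.

```python
import struct

# The non-zero words from the start of the dump above, offsets removed.
DUMP_WORDS = """
c031 d88e c08e 15be b47c ac0e 003c 0474
10cd f7eb 48f4 6c65 6f6c 202c 6f57 6c72
2164 0a0d
"""


def words_to_bytes(dump: str) -> bytes:
    """Turn 16-bit hex words back into the bytes they were printed from.

    Assumption: each word is a little-endian byte pair, so "6c65" on the
    page stands for the bytes 0x65 0x6c in the file.
    """
    return b"".join(struct.pack("<H", int(word, 16)) for word in dump.split())


def printable_runs(raw: bytes, min_length: int = 4) -> list[str]:
    """Collect runs of printable ASCII, similar in spirit to `strings`."""
    runs, current = [], ""
    for byte in raw:
        if 0x20 <= byte <= 0x7E:
            current += chr(byte)
            continue
        if len(current) >= min_length:
            runs.append(current)
        current = ""
    if len(current) >= min_length:
        runs.append(current)
    return runs


print(printable_runs(words_to_bytes(DUMP_WORDS)))  # ['Hello, World!']
```

The point stands: it takes a tool, and some knowledge of byte order, to get even that far.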

Regular Expressions are a sort-of halfway house. They're slightly readable by humans - but written in such a terse vocabulary as to be mostly unintelligible without concentration. There's no space for comments. Different engines have variable support for all their functions. They are a symbolic language with unhelpfully indecipherable and inconsistent symbols.

As a result, once a RegEx becomes more than trivially complex, it is hard for most humans to understand. That makes it difficult to debug. It also makes it difficult to add or remove functionality.

I genuinely - and possibly misguidedly - believe that even something like ^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$ might just as well be written in BrainFuck.

My contention is that almost all RegExs would be better served by more human readable code and that the very existence of RegEx101.com ought to bring shame on our industry.
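As a rough illustration of that contention, here is one way the email pattern above might look as ordinary code. This is a sketch, not a recommendation for validating email addresses. The names `is_word`, `split_on`, and `looks_like_email` are made up for the example, and it leans on Python's definition of `\w` as "alphanumeric or underscore".

```python
import re

# The pattern quoted above, kept only so the two versions can be compared.
EMAIL_RE = re.compile(r"^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$")


def is_word(chunk: str) -> bool:
    """True if chunk is one or more letters, digits, or underscores."""
    return bool(chunk) and all(ch.isalnum() or ch == "_" for ch in chunk)


def split_on(text: str, separators: str) -> list[str]:
    """Split text at every single separator character, keeping empty chunks."""
    chunks, current = [], ""
    for ch in text:
        if ch in separators:
            chunks.append(current)
            current = ""
        else:
            current += ch
    chunks.append(current)
    return chunks


def looks_like_email(address: str) -> bool:
    # Exactly one @ separating the local part from the domain.
    if address.count("@") != 1:
        return False
    local, domain = address.split("@")

    # The domain needs at least one dot somewhere in it.
    if "." not in domain:
        return False

    # Both halves are "words" joined by single separator characters.
    local_ok = all(is_word(chunk) for chunk in split_on(local, "-+.'"))
    domain_ok = all(is_word(chunk) for chunk in split_on(domain, "-."))
    return local_ok and domain_ok


for candidate in ["who+o'toole@invalid.museum", "a..b@example.com", "nobody@localhost"]:
    print(candidate, looks_like_email(candidate), bool(EMAIL_RE.fullmatch(candidate)))
```

It is several times longer than the one-liner, but each rule has a name and a comment, and changing one of them does not mean re-deriving the whole expression.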

Here are some positive use-cases for RegEx:

You want to show off how smart you are.
You need maximum efficiency when combing through a billion lines of text.
You have a desire to build something hard to debug.
You don't have lots of printer paper and need to make your code as terse as possible.
You think if/else and switch/case statements are the mark of a diseased mind.
You don't trust compilers.

I know what you're thinking: "This guy's too stupid to get regular expressions!" Yes. Yes I am. So are most people.

What I'm getting at is that source code is designed to be read and edited by busy and distracted humans. We should be writing intelligible code for each other and letting computers do the boring work of making it more efficient.

You don't have to agree with me. That's fine. But, perhaps you'll take note of the famous maxim from the "Wizard" book:

a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute.

Structure and Interpretation of Computer Programs

We are not wizards. Nor should we strive to be. The alchemists fell.

⁰ You can read the original code, which is MIT Licenced.

When is a URL not a URL?

@edent — Wed, 27 Jul 2011 11:37:57 +0000

Summary

Twitter's way of linking URLs is broken. It's annoying to users, and a pain in the arse to developers. This quick post talks about the problem and offers a solution.

I've raised a bug with Twitter and I hope you'll star it as important to you.

Preamble

A common trope in programming classes is "how do you detect a valid email address?"

It should be obvious, right? A string of text, an @, a domain - probably ending in .com. As it turns out, it's not that simple. "who+o'toole@invalid.museum" is a potentially valid address, for example. There are literally thousands of ways to detect the potentially infinite variety of email addresses.

The same is true for URLs - and slavish adherence to guidelines is killing Twitter's usefulness.

The URL Matching Problem

Which of these strings should be turned into hyperlinks?

www.bbc.co.uk

example.com

http://test

https://test.test

ftp://news.com

As it happens, Twitter only matches "https://test.test" and none of the others.

Twitter's matching regex is, as far as I can tell, this:

If it starts with http:// or https:// and has a dot in it - it's a URL
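To make that concrete, here is a minimal sketch of the rule as guessed above, run against the five test strings. It is only a reading of the observed behaviour, not Twitter's actual code, and `guessed_twitter_rule` is a name invented for this example.

```python
CANDIDATES = [
    "www.bbc.co.uk",
    "example.com",
    "http://test",
    "https://test.test",
    "ftp://news.com",
]


def guessed_twitter_rule(text: str) -> bool:
    """Starts with http:// or https:// and has a dot somewhere after it."""
    for prefix in ("http://", "https://"):
        if text.startswith(prefix):
            return "." in text[len(prefix):]
    return False


for candidate in CANDIDATES:
    print(f"{candidate:20} linked: {guessed_twitter_rule(candidate)}")
```

Under this reading only "https://test.test" gets linked, which matches what Twitter did with the examples above.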

I think this is a serious weakness. Twitter users are sharing URLs which their followers can't click on - Twitter is also linking to URLs which don't exist.

I've picked these examples more or less at random.

Solution?

Much like the email regexes, I would take a much more lax approach. Essentially, if it looks vaguely like a URL - link to it.

I would suggest the following rules:

If it starts with a protocol - http:// ftp:// tel: etc - create a hyperlink.
If it starts with www. - create a hyperlink.
If it ends with a . followed by a valid TLD - create a hyperlink.
If it contains a valid TLD followed by a slash then some other characters - create a hyperlink.
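The four rules above could be sketched roughly like this. The protocol list and the tiny `SAMPLE_TLDS` set are placeholders chosen for illustration (a real version would need the full TLD list), and `should_link` is a hypothetical name.

```python
# Assumption: a small hand-picked sample, standing in for the full TLD list.
SAMPLE_TLDS = {"com", "org", "net", "uk", "museum", "name"}

# Assumption: a few protocols for illustration; the "etc" is left open.
PROTOCOL_PREFIXES = ("http://", "https://", "ftp://", "tel:")


def should_link(text: str) -> bool:
    lowered = text.lower()

    # Rule 1: starts with a protocol.
    if lowered.startswith(PROTOCOL_PREFIXES):
        return True

    # Rule 2: starts with www.
    if lowered.startswith("www."):
        return True

    # Rules 3 and 4 both look at whatever comes before the first slash.
    host, slash, rest = lowered.partition("/")
    name, dot, tld = host.rpartition(".")
    if not (dot and name and tld in SAMPLE_TLDS):
        return False

    # Rule 3: nothing after the TLD at all.
    if not slash:
        return True

    # Rule 4: a slash after the TLD, then some other characters.
    return rest != ""


for text in ["www.bbc.co.uk", "example.com", "http://test", "ftp://news.com", "hello"]:
    print(f"{text:16} linked: {should_link(text)}")
```

All five of the earlier test strings get linked under these rules, including the ones that go nowhere, which is exactly the trade-off of a lax approach.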

The "correct" method would then be for Twitter to perform an HTTP HEAD request to see if the URL is potentially valid. There are three drawbacks to this.

It may place excessive load on Twitter's servers to process and cache these requests.
The URL may be that of an Intranet site - and thus inaccessible to Twitter.
The URL may be valid but temporarily inaccessible.
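For illustration, a HEAD check along those lines might look like the sketch below, using Python's standard `urllib`. The function names and the three-second timeout are assumptions made for the example. Note that it answers False for intranet hosts and for sites that are merely down at that moment, which is the second and third drawback in practice.

```python
import urllib.request


def build_head_request(url: str) -> urllib.request.Request:
    """Build a request that asks for headers only, not the page body."""
    return urllib.request.Request(url, method="HEAD")


def seems_reachable(url: str, timeout: float = 3.0) -> bool:
    """True if the server answered without an error, False for anything else.

    Network failures, timeouts, HTTP error statuses, and malformed URLs
    all end up as False here; the caller cannot tell them apart.
    """
    try:
        with urllib.request.urlopen(build_head_request(url), timeout=timeout):
            return True
    except (OSError, ValueError):
        return False
```

Calling `seems_reachable("http://example.com/")` makes one network request per URL, so anything doing this at Twitter's scale would want to cache the answers.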

Regardless of the method, surely it's inexcusable that "www.example.com" isn't detected as a URL whereas "http://bork.bork.bork" is?

ACTION!

If you think Twitter's approach to hyperlinks is wrong - please make your voice heard at the bug report.

Bugs in Twitter Text Libraries

@edent — Wed, 31 Mar 2010 10:27:50 +0000

The Twitter Engineering Team have a set of text processing classes which are meant to simplify and standardise the recognition of URLs, screen names, and hashtags. Dabr makes use of them to keep in conformance with Twitter's style.

One of the advantages of the text processing is that it will recognise that www.example.com is a URL and automatically create a hyperlink. Considering that dropping the "http://" represents a 5% saving on Twitter's 140 character limit for messages, this is great.

So, I was mightily surprised to get this bug report from user "schmmuck":

[Image: Dabr rendering error]

How very odd... This is how it looks on m.twitter.com.

[Image: m.twitter rendering error]

Twitter also use mobile.twitter.com for smartphones. Here's how that site renders the text.

[Image: mobile.twitter rendering error]

Finally, let's take a look at the "canonical" rendering at Twitter.com

[Image: Twitter rendering error]

The Problem(s)

The first issue is inconsistency. Twitter ought to be using the same regex for each of its sites. It doesn't. This means that different developers will get divergent experiences. This leads to confusion, which leads to fear, which, as we all know, leads to anger... and so forth.

Secondly, and more importantly, parsing is hard. There are so many edge cases that errors inevitably creep in. My post about hashtags explains the problems in defining what should be recognised.

So, based on what we've seen, should Twitter recognise any of the following as URLs?

news.bbc.co.uk - no www there.

invalid.name - a silly URL, but a valid one.

खोज.com - International domains contain more than just ASCII

All the above are valid - yet they're not recognised by Twitter.

A (Simple) Solution?

There is a canonical list of TLDs which is also available as a plain text list.

Any string containing a "." followed by a valid TLD, then followed by a space or "/" should be treated as a URL.
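As a sketch of how that rule might work, the code below loads a TLD list and scans a message for matches. The file format (one TLD per line, `#` for comments) and the three sample entries are assumptions standing in for the real list, and `load_tlds` and `find_urls` are names invented for the example. The end of the message is treated the same as a space.

```python
# Assumption: a tiny stand-in for the plain text TLD list, one entry per line.
SAMPLE_TLD_FILE = """\
# sample entries only
COM
NAME
UK
"""


def load_tlds(text: str) -> set[str]:
    """Read one TLD per line, skipping blank lines and # comments."""
    tlds = set()
    for line in text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            tlds.add(entry.lower())
    return tlds


def find_urls(message: str, tlds: set[str]) -> list[str]:
    """Return the words that contain "." plus a known TLD before a space or "/"."""
    found = []
    for word in message.split():
        host = word.split("/", 1)[0]
        name, dot, tld = host.rpartition(".")
        if dot and name and tld.lower() in tlds:
            found.append(word)
    return found


tlds = load_tlds(SAMPLE_TLD_FILE)
print(find_urls("Try news.bbc.co.uk or invalid.name or खोज.com/path today", tlds))
```

Because it never looks at which characters make up the name, this handles the international domain without any special casing. It would still miss a URL with a full stop straight after it, which is the kind of edge case that makes parsing hard.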

Your thoughts?