Regular Expressions make me feel like a powerful wizard - and that's not a good thing


(This is a rant because I'm exhausted after debugging something. If you've made RegEx your whole personality, I'm sorry.)

The other day I had to fix a multi-line Regular Expression (RegEx). After a few hours of peering at it with a variety of tools, I finally understood the problem. Getting that deep into the esoteric mysteries made me feel like a powerful wizard with complete mastery of my domain. And I think that's dangerous.

I'm sure we've all read a story about a witch or wizard who distractedly substitutes eye-of-newt with iron-ute with disastrous consequences. Humans are easily confused. And confusion leads to unexpected mistakes.

Look, most humans are very bad at reading compiled code. Without any external tools - can you tell me what the following code does?

0000000 c031 d88e c08e 15be b47c ac0e 003c 0474
0000010 10cd f7eb 48f4 6c65 6f6c 202c 6f57 6c72
0000020 2164 0a0d 0000 0000 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
*
00001f0 0000 0000 0000 0000 0000 0000 0000 aa55
0000200

No. Of course not. That's why we write code in a more human readable language and then compile it to computer readable instructions.

Regular Expressions are a sort-of halfway house. They're slightly readable by humans - but written in such a terse vocabulary as to be mostly unintelligible without concentration. There's no space for comments. Different engines have variable support for all their functions. They are a symbolic language with unhelpfully indecipherable and inconsistent symbols.

As a result, once a RegEx becomes more than trivially complex, it is hard for most humans to understand. That makes it difficult to debug. It also makes it difficult to add or remove functionality.

I genuinely - and possibly misguidedly - believe that even something like ^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$ might just as well be written in BrainFuck.

My contention is that almost all RegExs would be better served by more human readable code and that the very existence of RegEx101.com ought to bring shame on our industry.
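
To make that contention concrete, here is a minimal sketch in Python of what "more human readable code" could look like for an email-shaped check. It is an illustration only, not a drop-in replacement for the pattern above: the function name looks_like_email and the rules it applies are assumptions, and it is deliberately looser than the RegEx (it does not restrict which characters may appear).

```python
def looks_like_email(text: str) -> bool:
    """Loose, readable check (hypothetical helper, not equivalent to the RegEx).

    Requires exactly one "@", a non-empty local part, and a domain made of
    at least two non-empty labels separated by dots.
    """
    local_part, at_sign, domain = text.partition("@")

    # No "@" at all, or nothing in front of it.
    if not at_sign or not local_part:
        return False

    # A second "@" is not allowed.
    if "@" in domain:
        return False

    # "example.com" -> ["example", "com"]; empty labels mean stray dots.
    labels = domain.split(".")
    return len(labels) >= 2 and all(labels)
```

Each rule sits on its own line with its own comment, so a busy reader can change one rule without re-deriving the whole expression.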

Here are some positive use-cases for RegEx:

  • You want to show off how smart you are.
  • You need maximum efficiency when combing through a billion lines of text.
  • You have a desire to build something hard to debug.
  • You don't have lots of printer paper and need to make your code as terse as possible.
  • You think if/else and switch/case statements are the mark of a diseased mind.
  • You don't trust compilers.

I know what you're thinking: "This guy's too stupid to get regular expressions!" Yes. Yes I am. So are most people.

What I'm getting at is that source code is designed to be read and edited by busy and distracted humans. We should be writing intelligible code for each other and letting computers do the boring work of making it more efficient.

You don't have to agree with me. That's fine. But, perhaps you'll take note of the famous maxim from the "Wizard" book:

"a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute."

- Structure and Interpretation of Computer Programs

We are not wizards. Nor should we strive to be. The alchemists fell.


28 thoughts on “Regular Expressions make me feel like a powerful wizard - and that's not a good thing”

  1. said on twitter.com:

    I wrote Perl for a year at uni before I found ASP

    Pro: I'm the guy who is asked to write a regex in 2 minutes that saves someone an hour of writing code just to search GBs of logs.

    Con: I'm the guy who is asked to translate regexes when they're found without a comment attached.

    1. Guy Leech says:

      Just put a comment above it which is a sample of the text that you are matching and what bits of it you are capturing

  2. says:

    Perl introduced the option (through the /x and /xx modifiers) of ignoring spaces, tabs, linefeeds and even comments in the regular expression body. That lets you "unconvolute" what would have been an unreadable mess into something understandable and maintainable. Every programming language should allow this in its regular expressions.
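
Python offers a comparable facility through the re.VERBOSE flag. The following is a minimal sketch that applies it to the email pattern from the post; the variable names, the comments inside the pattern and the sample strings are assumptions added for illustration.

```python
import re

# The terse pattern, exactly as it appears in the post.
TERSE = re.compile(r"^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$")

# The same pattern in verbose mode: whitespace outside character classes
# is ignored, and "#" starts a comment that runs to the end of the line.
VERBOSE = re.compile(
    r"""
    ^
    \w+ ([-+.'] \w+)*    # local part: words joined by - + . or '
    @
    \w+ ([-.] \w+)*      # first domain chunk: words joined by - or .
    \.                   # a literal dot
    \w+ ([-.] \w+)*      # second domain chunk, same shape as the first
    $
    """,
    re.VERBOSE,
)

samples = [
    "user.name+tag@example.co.uk",
    "o'brien@example.com",
    "not an email",
    "a@b",
    "@example.com",
]

# Both spellings should agree on every sample.
agreement = [bool(TERSE.match(s)) == bool(VERBOSE.match(s)) for s in samples]
```

The symbols are no friendlier, but each piece at least gets a line and a comment of its own.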

  3. Richard Morton says:

    Thinking off the top of my head, is there a way to have a plain language style code that then gets compiled to a regex, and then include the plain language in the code comments or documentation?

  4. Alan says:

    Wholeheartedly agree! I shudder every time I come across a RegEx that might be the problem, even worse if it has no unit tests.

    1. Yubi says:

      Goto+label are closer to regular expressions than if/else statements.

  5. said on hachyderm.io:

    @Edent I agree that a dedicated, less-clunky solution should be used where possible, but the sheer ubiquity of RegEx is what makes it invaluable as far as I'm concerned. Means I can toss a string into vscode to check for specifics, or use the same in some 'find' type box in a browser even (in some cases). It's awful to write and read, but it's there 😛

  6. Paul Chapman says:

    Terence,

    Break your regex into logical pieces. Assign each piece to a variable (or constant) whose name describes it. Catenate the variables together to create the finished string. (This might run at compile time.) Your regex is now human-readable (looks more like BNF), and each piece can be inspected visually by itself to see if it matches the variable-name description.

    Eg, your

    ^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$

    becomes the pseudocode (doubling up the \s):

    word = "\\w+"
    addressee_separator = "[-+.']"
    addressee = word + optional_repeat(addressee_separator + word)
    domain_separator = "[-.]"
    domain_part = word + optional_repeat(domain_separator + word)
    domain = domain_part + "\\." + domain_part
    address = "^" + addressee + "@" + domain + "$"

    where function (or macro) optional_repeat(x) returns "(" + x + ")*" (or you can spell it out if you don't want the reader to have to consult the definition of optional_repeat()). NB. This is code, so comments can be included! Adjust verbosity according to taste, or wizardry comfort level.

    Critique: I don't like that your regex confuses the optional .s and the compulsory . in the domain name, making the grammar ambiguous. The ambiguity is revealed by the definition of domain, the like of which no one should be using in a well-constructed grammar. 🙂

    IMHO, better would be:

    ...
    word_with_hyphen = word + optional_repeat("-" + word)
    domain = word_with_hyphen + compulsory_repeat("\\." + word_with_hyphen)
    ...

    Cheers, Paul
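
Paul's first block of pseudocode translates almost line for line into Python. This is a minimal sketch, assuming Python's re module and raw strings (so the backslashes need no doubling); the variable names and optional_repeat come from the comment, while EMAIL and the type hints are additions for illustration.

```python
import re


def optional_repeat(piece: str) -> str:
    """Wrap a sub-pattern so that it may occur zero or more times."""
    return "(" + piece + ")*"


# Named building blocks, each small enough to check by eye.
word = r"\w+"
addressee_separator = r"[-+.']"
addressee = word + optional_repeat(addressee_separator + word)
domain_separator = r"[-.]"
domain_part = word + optional_repeat(domain_separator + word)
domain = domain_part + r"\." + domain_part

# The finished pattern, assembled from the named pieces.
address = "^" + addressee + "@" + domain + "$"

EMAIL = re.compile(address)
```

The assembled string in address is character for character the pattern from the post, so the readability costs nothing once the pattern is compiled.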

    1. Patrice Bremond-Gregoire says:

      I like that a lot. Of course, I wouldn't put this code in-line where the regex is used; I would create a function (e.g. ValidEmailRegEx) that returns the regex, and use your method inside that function. When one reads the code using the regex, one only needs to know that this part validates an email. Then, if the email validation turns out to fail, one can look inside the function to figure out why.

  7. Philip Oakley says:

    Is this not the description of APL, but for a different category of folks? 😉

  8. Richard Meadowsq says:

    So what is your alternative? I have an old framework that I developed back in the 90s: a Lexer and Parser class. I would not expect any developer to be able to quickly read anything that I wrote using it.

  9. Paul Drury says:

    Regex is a great tool, fast and very powerful. I see loads of people moaning about regex not being easy to read, but I see no-one offering to produce a readable alternative. Until someone writes a more readable alternative, regex is still the best there is.

    If you are presented with a massive string of unreadable regex then it was a human that produced that. A programmer failed to break it up into substrings each named to represent their function. It was a human that failed to comment what is going on. You can write a dense block of unreadable code in any language. You don't need regex to produce that.

    The core fault here isn't with regex, it is with lazy programmers.
