Regular Expressions make me feel like a powerful wizard - and that's not a good thing


(This is a rant because I'm exhausted after debugging something. If you've made RegEx your whole personality, I'm sorry.)

The other day I had to fix a multi-line Regular Expression (RegEx). After a few hours of peering at it with a variety of tools, I finally understood the problem. Getting that deep into the esoteric mysteries made me feel like a powerful wizard with complete mastery of my domain. And I think that's dangerous.

I'm sure we've all read a story about a witch or wizard who distractedly substitutes eye-of-newt with iron-ute with disastrous consequences. Humans are easily confused. And confusion leads to unexpected mistakes.

Look, most humans are very bad at reading compiled code. Without any external tools - can you tell me what the following code does?

0000000 c031 d88e c08e 15be b47c ac0e 003c 0474
0000010 10cd f7eb 48f4 6c65 6f6c 202c 6f57 6c72
0000020 2164 0a0d 0000 0000 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
*
00001f0 0000 0000 0000 0000 0000 0000 0000 aa55
0000200

No. Of course not⁰. That's why we write code in a more human readable language and then compile it to computer readable instructions.

Regular Expressions are a sort-of halfway house. They're slightly readable by humans - but written in such a terse vocabulary as to be mostly unintelligible without concentration. There's no space for comments. Different engines have variable support for all their functions. They are a symbolic language with unhelpfully indecipherable and inconsistent symbols.

As a result, once a RegEx becomes more than trivially complex, it is hard for most humans to understand. That makes it difficult to debug. It also makes it difficult to add or remove functionality.

I genuinely - and possibly misguidedly - believe that even something like ^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$ might just as well be written in BrainFuck.

My contention is that almost all RegExs would be better served by more human readable code and that the very existence of RegEx101.com ought to bring shame on our industry.
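As a rough illustration of what "more human readable code" could look like, here is a minimal Python sketch that applies roughly the same rule as the pattern above without any RegEx. The function names (is_word, split_on_any, looks_like_email) are invented for this example and do not come from the original post. It assumes that \w means "letter, digit, or underscore", and it is no more a real email validator than the RegEx is.

```python
def is_word(text: str) -> bool:
    """True if text is one or more letters, digits, or underscores."""
    return len(text) > 0 and all(ch.isalnum() or ch == "_" for ch in text)


def split_on_any(text: str, separators: str) -> list[str]:
    """Split text at every character that appears in separators."""
    pieces, current = [], []
    for ch in text:
        if ch in separators:
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current))
    return pieces


def looks_like_email(address: str) -> bool:
    """Hypothetical helper: same shape of rule as the pattern, spelled out."""
    # There must be exactly one "@" between the local part and the domain.
    if address.count("@") != 1:
        return False
    local, domain = address.split("@")

    # Local part: words joined by single -, +, . or ' characters.
    if not all(is_word(piece) for piece in split_on_any(local, "-+.'")):
        return False

    # Domain: words joined by single - or . characters, with at least one dot.
    if "." not in domain:
        return False
    return all(is_word(piece) for piece in split_on_any(domain, "-."))
```

It is several times longer than the one-liner, which is the trade-off being argued for: each rule sits on its own line, next to a comment, where it can be changed or tested separately.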

Here are some positive use-cases for RegEx:

You want to show off how smart you are.
You need maximum efficiency when combing through a billion lines of text.
You have a desire to build something hard to debug.
You don't have lots of printer paper and need to make your code as terse as possible.
You think if/else and switch/case statements are the mark of a diseased mind.
You don't trust compilers.

I know what you're thinking: "This guy's too stupid to get regular expressions!" Yes. Yes I am. So are most people.

What I'm getting at is that source code is designed to be read and edited by busy and distracted humans. We should be writing intelligible code for each other and letting computers do the boring work of making it more efficient.

You don't have to agree with me. That's fine. But, perhaps you'll take note of the famous maxim from the "Wizard" book:

"a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute."

Structure and Interpretation of Computer Programs

We are not wizards. Nor should we strive to be. The alchemists fell.

⁰ You can read the original code, which is MIT Licenced.

28 thoughts on “Regular Expressions make me feel like a powerful wizard - and that's not a good thing”

2023-02-06 12:39

taste of taboo 🦝📸 said on kinkytaboo.online:

@Edent regex are a total pain for adhd brains

2023-02-06 12:53

Steven Pears said on twitter.com:

I wrote Perl for a year at uni before I found ASP

Pro: I'm the guy who is asked to write a regex in 2 minutes that saves someone an hour of writing code just to search GBs of logs.

Con: I'm the guy who is asked to translate regexes when they're found without a comment attached.

2023-02-06 12:55

barubary@infosec.exchange said on infosec.exchange:

@Edent "There's no space for comments" ... unless you're using Perl, in which case you can use spaces, indentation, comments, etc. in your regexes as you like. #perl

1. 2023-02-13 09:44
  
  Guy Leech says:
  
  Just put a comment above it which is a sample of the text that you are matching and what bits of it you are capturing
  
2. 2023-02-14 12:41
  
  Tom Parker-Shemilt says:
  
  Or Python with verbose mode https://docs.python.org/3/library/re.html#re.VERBOSE
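
To make the verbose-mode suggestion concrete, here is a small sketch of the email pattern from the post rewritten with Python's re.VERBOSE flag. The name EMAIL_PATTERN and the wording of the comments are assumptions made for this example. The pattern itself is meant to be the same one, only spread over several lines.

```python
import re

# In verbose mode, whitespace outside character classes is ignored
# and everything from "#" to the end of the line is a comment.
EMAIL_PATTERN = re.compile(
    r"""
    ^
    \w+ ( [-+.'] \w+ )*    # local part: words joined by - + . or '
    @
    \w+ ( [-.] \w+ )*      # first part of the domain
    \.                     # a compulsory dot
    \w+ ( [-.] \w+ )*      # rest of the domain
    $
    """,
    re.VERBOSE,
)

print(bool(EMAIL_PATTERN.match("first.last@example.com")))
print(bool(EMAIL_PATTERN.match("user@localhost")))
```

The symbols are still the same terse vocabulary, but there is now somewhere to say what each piece is for.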
  
2023-02-06 13:05

Emily S said on tech.lgbt:

@Edent I'm pretty sure this is one of those lessons that one must learn on the road to becoming a senior engineer.

2023-02-06 13:28

Andy Mabbett says:

Not that I disagree with you, but I find that:

https://regex101.com/

helps - not least because it gives you a human-readable version of your regex in an adjacent pane.

2023-02-06 13:54

Evgeny Kuznetsov said on evgenykuznetsov.org:

Wow! Had I known about that RegEx101 thing, my life would have been much simpler. Then again, in that case I would probably not think Regular Expressions were write-only code, which they of course are.

2023-02-06 14:06

James O'Malley said on twitter.com:

Couldn’t agree more with this. And it’s for this reason, rather than my being an idiot whose brain can’t learn them, that my code always contains dozens of str_replace()s and explode()s.

2023-02-06 14:28

Neil Young said on twitter.com:

Agree with this. There's a use case for "immediate" regular expressions as sort of keyboard shortcuts, but if they're going to be read by anyone else, they're obtuse.

2023-02-06 14:42

AP says:

Perl introduced the possibility (through the /x and /xx modifiers) to ignore spaces, tabs and linefeeds in the regular expression body, and even to include comments. It lets you "unconvolute" what would have been an unreadable mess into something understandable and maintainable. Every programming language should allow this in its regular expressions.

2023-02-06 15:07

Sam J Sharpe said on mastodon.social:

@Edent is it bad that I can read that regex and I think it's attempting to validate an email address?

2023-02-06 16:06

Richard Morton says:

Thinking off the top of my head, is there a way to have a plain language style code that then gets compiled to a regex, and then include the plain language in the code comments or documentation?

1. 2023-02-13 14:56
  
  Chuck says:
  
  Swift language has added a RegEx builder library that is readable, but then compiles to a RegEx for execution. Similar DSL style approaches should be developed for other languages. https://developer.apple.com/documentation/regexbuilder
  
2023-02-06 17:10

Alan says:

Wholeheartedly agree! I shudder every time I come across a RegEx that might be the problem, even worse if it has no unit tests.

2023-02-06 19:02

Pat Mächler ❎ said on twitter.com:

"if/else and switch/case statements are the mark of a diseased mind."

Indeed!

While loops and gotos+labels are completely sufficient.

1. 2023-02-13 22:59
  
  Yubi says:
  
  Goto+label are closer to regular expressions than if/else statements.
  
2023-02-07 14:55

chessmango said on hachyderm.io:

@Edent I agree that a dedicated, less-clunky solution should be used where possible, but the sheer ubiquity of RegEx is what makes it invaluable as far as I'm concerned. Means I can toss a string into vscode to check for specifics, or use the same in some 'find' type box in a browser even (in some cases). It's awful to write and read, but it's there 😛

2023-02-10 06:32

Philippe Duval said on twitter.com:

An interesting take about why programs must be written for people to read (which is what I keep telling my students): shkspr.mobi/blog/2023/02/r… via @edent

2023-02-10 06:50

Daniel May said on twitter.com:

everything here is true but im still weirdly proud to be a powerful wizard shkspr.mobi/blog/2023/02/r…

2023-02-10 18:24

Paul Chapman says:

Terence,

Break your regex into logical pieces. Assign each piece to a variable (or constant) whose name describes it. Catenate the variables together to create the finished string. (This might run at compile time.) Your regex is now human-readable (looks more like BNF), and each piece can be inspected visually by itself to see if it matches the variable-name description.

Eg, your

^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$

becomes the pseudocode (doubling up the \s):

word = "\w+"
addressee_separator = "[-+.']"
addressee = word + optional_repeat(addressee_separator + word)
domain_separator = "[-.]"
domain_part = word + optional_repeat(domain_separator + word)
domain = domain_part + "\." + domain_part
address = "^" + addressee + "@" + domain + "$"

where function (or macro) optional_repeat(x) returns "(" + x + ")*" (or you can spell it out if you don't want the reader to have to consult the definition of optional_repeat()). NB. This is code, so comments can be included! Adjust verbosity according to taste, or wizardry comfort level.
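
For readers who want to try this, the pseudocode above can be sketched in Python more or less line for line. This is one possible rendering and not the commenter's own code: raw strings are used in place of doubled backslashes, and the compiled name ADDRESS_RE is an assumption added for the example.

```python
import re


def optional_repeat(piece: str) -> str:
    """Wrap a piece so that it may occur zero or more times."""
    return "(" + piece + ")*"


# Each named piece can be read and checked on its own.
word = r"\w+"
addressee_separator = r"[-+.']"
addressee = word + optional_repeat(addressee_separator + word)
domain_separator = r"[-.]"
domain_part = word + optional_repeat(domain_separator + word)
domain = domain_part + r"\." + domain_part
address = "^" + addressee + "@" + domain + "$"

# Hypothetical name for the compiled result.
ADDRESS_RE = re.compile(address)

print(address)
```

Printing address shows the assembled pattern, which can be compared by eye with the one-line version it replaces.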

Critique: I don't like that your regex confuses the optional .s and the compulsory . in the domain name, making the grammar ambiguous. The ambiguity is revealed by the definition of domain, the like of which no one should be using in a well-constructed grammar. 🙂

IMHO, better would be:

...
word_with_hyphen = word + optional_repeat("-" + word)
domain = word_with_hyphen + compulsory_repeat("\." + word_with_hyphen)
...

Cheers, Paul

1. 2023-02-13 07:50
  
  Patrice Bremond-Gregoire says:
  
  I like that a lot. Of course, I wouldn't put this code in-line where the regex is used; I would create a function (e.g. ValidEmailRegEx) that returns the regex, and use your method inside that function. When one reads the code using the regex, one only needs to know that this part validates an email. Then, if it is found that the email validation fails, one can look inside the function to figure out why.
  
2023-02-13 12:01

Philip Oakley says:

Is this not the description of APL, but for a different category of folks 😉

2023-02-13 14:31

Richard Meadowsq says:

So what is your alternative? I have an old framework that I developed back in the 90s: a Lexer and Parser class. I would not expect any developer to be able to quickly read anything that I wrote using it.

2023-02-13 16:42

Paul Drury says:

Regex is a great tool, fast and very powerful. I see loads of people moaning about regex being not easy to read, but I see no-one offering to produce a readable alternative. Until someone writes a more readable alternative then regex is still the best there is.

If you are presented with a massive string of unreadable regex then it was a human that produced that.
A programmer failed to break it up into substrings each named to represent their function. It was a human that failed to comment what is going on. You can write a dense block of unreadable code in any language. You don't need regex to produce that.

The core fault here isn't with regex, it is with lazy programmers.

2023-02-14 08:15

Michael Bammann says:

Perhaps this can also help in the future:

https://github.com/VerbalExpressions/JSVerbalExpressions
(also available for many other languages)

1. 2023-02-14 18:21
  
  Chuck says:
  
  I had not seen that before. Very nice.
  