Regular Expressions make me feel like a powerful wizard - and that's not a good thing
(This is a rant because I'm exhausted after debugging something. If you've made RegEx your whole personality, I'm sorry.)
The other day I had to fix a multi-line Regular Expression (RegEx). After a few hours of peering at it with a variety of tools, I finally understood the problem. Getting that deep into the esoteric mysteries made me feel like a powerful wizard with complete mastery of my domain. And I think that's dangerous.
I'm sure we've all read a story about a witch or wizard who distractedly substitutes eye-of-newt with iron-ute with disastrous consequences. Humans are easily confused. And confusion leads to unexpected mistakes.
Look, most humans are very bad at reading compiled code. Without any external tools - can you tell me what the following code does?
0000000 c031 d88e c08e 15be b47c ac0e 003c 0474
0000010 10cd f7eb 48f4 6c65 6f6c 202c 6f57 6c72
0000020 2164 0a0d 0000 0000 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
*
00001f0 0000 0000 0000 0000 0000 0000 0000 aa55
0000200
No. Of course not. That's why we write code in a more human readable language and then compile it to computer readable instructions.
Regular Expressions are a sort-of halfway house. They're slightly readable by humans - but written in such a terse vocabulary as to be mostly unintelligible without concentration. There's no space for comments. Different engines have variable support for all their functions. They are a symbolic language with unhelpfully indecipherable and inconsistent symbols.
As a result, once a RegEx becomes more than trivially complex, it is hard for most humans to understand. That makes it difficult to debug. It also makes it difficult to add or remove functionality.
I genuinely - and possibly misguidedly - believe that even something like ^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
might just as well be written in BrainFuck.
My contention is that almost all RegExs would be better served by more human readable code and that the very existence of RegEx101.com ought to bring shame on our industry.
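To make that contention concrete, here is a rough sketch of the address pattern above written as plain Python instead. The function names (`is_word`, `split_on_any`, `looks_like_email`) and the sample addresses are made up for illustration, and the sketch assumes the input has no trailing newline. It is one possible spelling, not a definitive one.

```python
import re


def is_word(text):
    """True if text is one or more letters, digits or underscores (like \\w+)."""
    return len(text) > 0 and all(ch.isalnum() or ch == "_" for ch in text)


def split_on_any(text, separators):
    """Split text at every character that appears in separators."""
    pieces, current = [], ""
    for ch in text:
        if ch in separators:
            pieces.append(current)
            current = ""
        else:
            current += ch
    pieces.append(current)
    return pieces


def looks_like_email(address):
    # Exactly one "@" dividing the addressee from the domain.
    if address.count("@") != 1:
        return False
    addressee, domain = address.split("@")
    # The domain needs at least one dot.
    if "." not in domain:
        return False
    # Every run between separators must be a non-empty word.
    words = split_on_any(addressee, "-+.'") + split_on_any(domain, "-.")
    return all(is_word(word) for word in words)


# Hypothetical samples, compared against the terse pattern from the post.
TERSE = re.compile(r"^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$")
SAMPLES = [
    "first.last+tag@example.co.uk",
    "first..last@example.com",
    "user@localhost",
    "two@at@example.com",
    "o'brien@my-site.example",
]
for sample in SAMPLES:
    print(sample, looks_like_email(sample), bool(TERSE.match(sample)))
```

It is several times longer than the one-liner, but each rule sits on its own line with a comment next to it.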
Here are some positive use-cases for RegEx:
- You want to show off how smart you are.
- You need maximum efficiency when combing through a billion lines of text.
- You have a desire to build something hard to debug.
- You don't have lots of printer paper and need to make your code as terse as possible.
- You think if/else and switch/case statements are the mark of a diseased mind.
- You don't trust compilers.
I know what you're thinking: "This guy's too stupid to get regular expressions!" Yes. Yes I am. So are most people.
What I'm getting at is that source code is designed to be read and edited by busy and distracted humans. We should be writing intelligible code for each other and letting computers do the boring work of making it more efficient.
You don't have to agree with me. That's fine. But, perhaps you'll take note of the famous maxim from the "Wizard" book:
a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute.
Structure and Interpretation of Computer Programs
We are not wizards. Nor should we strive to be. The alchemists fell.
Pro: I'm the guy who is asked to write a regex in 2 minutes that saves someone an hour of writing code just to search GBs of logs.
Con: I'm the guy who is asked to translate regexes when they're found without a comment attached.
perl
Guy Leech says:
https://regex101.com/ helps - not least because it gives you a human-readable version of your regex in an adjacent pane.
Wow! Had I known about that RegEx101 thing, my life would have been much simpler. Then again, in that case I would probably not think Regular Expressions were write-only code, which they of course are.
Richard Morton says:
Chuck says:
Alan says:
Indeed!
While loops and gotos+labels are completely sufficient.
Yubi says:
Paul Chapman says:
Break your regex into logical pieces. Assign each piece to a variable (or constant) whose name describes it. Catenate the variables together to create the finished string. (This might run at compile time.) Your regex is now human-readable (looks more like BNF), and each piece can be inspected visually by itself to see if it matches the variable-name description.
Eg, your
^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
becomes the pseudocode (doubling up the \s):
word = "\w+"
addressee_separator = "[-+.']"
addressee = word + optional_repeat(addressee_separator + word)
domain_separator = "[-.]"
domain_part = word + optional_repeat(domain_separator + word)
domain = domain_part + "\." + domain_part
address = "^" + addressee + "@" + domain + "$"
where function (or macro) optional_repeat(x) returns "(" + x + ")*" (or you can spell it out if you don't want the reader to have to consult the definition of optional_repeat()). NB. This is code, so comments can be included! Adjust verbosity according to taste, or wizardry comfort level.
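For anyone who wants to try this, here is a minimal Python sketch of the pseudocode above. `optional_repeat` is the helper described in the comment. The raw strings (used instead of doubling the backslashes), the `ORIGINAL` constant and `address_re` are assumptions added for illustration.

```python
import re


def optional_repeat(piece):
    """Wrap a piece so that it may occur zero or more times."""
    return "(" + piece + ")*"


# Raw strings mean the backslashes do not need doubling up.
word = r"\w+"
addressee_separator = r"[-+.']"
addressee = word + optional_repeat(addressee_separator + word)
domain_separator = r"[-.]"
domain_part = word + optional_repeat(domain_separator + word)
domain = domain_part + r"\." + domain_part
address = "^" + addressee + "@" + domain + "$"

# The assembled string should be identical to the pattern in the post.
ORIGINAL = r"^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$"
print(address == ORIGINAL)

address_re = re.compile(address)
print(bool(address_re.match("first.last+tag@example.co.uk")))
print(bool(address_re.match("no-at-sign.example.com")))
```

The concatenation happens once, when the module is loaded, so the named pieces add no cost per match.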
Critique: I don't like that your regex confuses the optional .s and the compulsory . in the domain name, making the grammar ambiguous. The ambiguity is revealed by the definition of domain, the like of which no one should be using in a well-constructed grammar. 🙂
IMHO, better would be:
...
word_with_hyphen = word + optional_repeat("-" + word)
domain = word_with_hyphen + compulsory_repeat("\." + word_with_hyphen)
...
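A sketch of that revised domain rule in Python might look like the following. `compulsory_repeat` is not defined in the comment, so the version here (one or more repetitions) is an assumption, as are `domain_re` and the sample domain names.

```python
import re


def optional_repeat(piece):
    """Zero or more occurrences of piece."""
    return "(" + piece + ")*"


def compulsory_repeat(piece):
    """One or more occurrences of piece (assumed meaning)."""
    return "(" + piece + ")+"


word = r"\w+"
word_with_hyphen = word + optional_repeat("-" + word)
# A domain is a hyphenated word followed by at least one ".word" group.
domain = word_with_hyphen + compulsory_repeat(r"\." + word_with_hyphen)
domain_re = re.compile(domain)

for candidate in ["example.co.uk", "my-site.example", "localhost", "a..b", "a-.b"]:
    print(candidate, bool(domain_re.fullmatch(candidate)))
```

Each dot now has exactly one place in the grammar where it can match, so a name such as example.co.uk has only one parse.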
Cheers, Paul
Patrice Bremond-Gregoire says:
Philip Oakley says:
Richard Meadowsq says:
Paul Drury says:
If you are presented with a massive string of unreadable regex then it was a human that produced that.
A programmer failed to break it up into substrings each named to represent their function. It was a human that failed to comment what is going on. You can write a dense block of unreadable code in any language. You don't need regex to produce that.
The core fault here isn't with regex, it is with lazy programmers.
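As one illustration of commenting a regex in place, Python's `re.VERBOSE` flag ignores whitespace in the pattern and allows `#` comments. The sketch below lays out the address pattern from the post that way. The comment wording and the sample addresses are assumptions, and other engines may spell this feature differently or not offer it at all.

```python
import re

ADDRESS = re.compile(
    r"""
    ^
    \w+ ( [-+.'] \w+ )*    # addressee: words joined by one of - + . '
    @
    \w+ ( [-.] \w+ )*      # domain, up to the dot that must be there
    \.                     # the compulsory dot
    \w+ ( [-.] \w+ )*      # the rest of the domain
    $
    """,
    re.VERBOSE,
)

# The same pattern in its terse one-line form, for comparison.
TERSE = re.compile(r"^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$")

SAMPLES = ["first.last+tag@example.co.uk", "user@localhost", "a@b-c.d"]
for sample in SAMPLES:
    print(sample, bool(ADDRESS.match(sample)), bool(TERSE.match(sample)))
```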
Michael Bammann says:
https://github.com/VerbalExpressions/JSVerbalExpressions
(also available for many other languages)
Chuck says: