Regular Expressions make me feel like a powerful wizard - and that's not a good thing
(This is a rant because I'm exhausted after debugging something. If you've made RegEx your whole personality, I'm sorry.)
The other day I had to fix a multi-line Regular Expression (RegEx). After a few hours of peering at it with a variety of tools, I finally understood the problem. Getting that deep into the esoteric mysteries made me feel like a powerful wizard with complete mastery of my domain. And I think that's dangerous.
I'm sure we've all read a story about a witch or wizard who distractedly substitutes eye-of-newt with iron-ute with disastrous consequences. Humans are easily confused. And confusion leads to unexpected mistakes.
Look, most humans are very bad at reading compiled code. Without any external tools - can you tell me what the following code does?
0000000 c031 d88e c08e 15be b47c ac0e 003c 0474
0000010 10cd f7eb 48f4 6c65 6f6c 202c 6f57 6c72
0000020 2164 0a0d 0000 0000 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
*
00001f0 0000 0000 0000 0000 0000 0000 0000 aa55
0000200
No. Of course not. That's why we write code in a more human-readable language and then compile it to computer-readable instructions.
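With a tool, the picture changes. The following is a minimal sketch in Python, assuming the dump uses hexdump's default layout of 16-bit little-endian words (so `c031` stands for the bytes `31 c0`). The names `DUMP_WORDS`, `raw` and `printable` are hypothetical, chosen only for this illustration.

```python
import struct

# First three lines of the dump above, without the offset column.
DUMP_WORDS = """
c031 d88e c08e 15be b47c ac0e 003c 0474
10cd f7eb 48f4 6c65 6f6c 202c 6f57 6c72
2164 0a0d 0000 0000 0000 0000 0000 0000
""".split()

# Assumption: each group is a 16-bit little-endian word,
# so the low byte comes first when written back out.
raw = b"".join(struct.pack("<H", int(word, 16)) for word in DUMP_WORDS)

# Replace non-printable bytes with "." so any embedded text stands out.
printable = "".join(chr(b) if 32 <= b < 127 else "." for b in raw)
print(printable)
```

Under that assumption, the bytes starting at offset 0x15 spell out "Hello, World!". The point stands: it takes a tool to see it.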
Regular Expressions are a sort-of halfway house. They're slightly readable by humans - but written in such a terse vocabulary as to be mostly unintelligible without concentration. There's no space for comments. Different engines have variable support for all their functions. They are a symbolic language with unhelpfully indecipherable and inconsistent symbols.
As a result, once a RegEx becomes more than trivially complex, it is hard for most humans to understand. That makes it difficult to debug. It also makes it difficult to add or remove functionality.
I genuinely - and possibly misguidedly - believe that even something like ^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
might just as well be written in BrainFuck.
My contention is that almost all RegExs would be better served by more human readable code and that the very existence of RegEx101.com ought to bring shame on our industry.
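As a hedged illustration of that contention, here is a sketch of the pattern above written as ordinary Python. The function names (`is_word`, `split_on_any`, `looks_like_email`) are made up for this example, and it assumes `\w` means "alphanumeric or underscore", which is how Python's `re` module documents it for text strings.

```python
def is_word(text):
    """True if text is non-empty and made only of letters, digits or underscores."""
    return text != "" and all(ch.isalnum() or ch == "_" for ch in text)


def split_on_any(text, separators):
    """Split text at every character that appears in separators."""
    parts = []
    current = ""
    for ch in text:
        if ch in separators:
            parts.append(current)
            current = ""
        else:
            current += ch
    parts.append(current)
    return parts


def looks_like_email(address):
    """Rough plain-code counterpart of the pattern above."""
    # Exactly one "@" separating the local part from the domain.
    if address.count("@") != 1:
        return False
    local, domain = address.split("@")

    # The domain needs at least one dot.
    if "." not in domain:
        return False

    # Every piece between separators must be a non-empty word.
    local_words = split_on_any(local, "-+.'")
    domain_words = split_on_any(domain, "-.")
    return all(is_word(word) for word in local_words + domain_words)
```

It is several times longer than the one-liner, but each rule has a name and a comment, and a new rule can be added without re-deriving the rest.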
Here are some positive use-cases for RegEx:
- You want to show off how smart you are.
- You need maximum efficiency when combing through a billion lines of text.
- You have a desire to build something hard to debug.
- You don't have lots of printer paper and need to make your code as terse as possible.
- You think if/else and switch/case statements are the mark of a diseased mind.
- You don't trust compilers.
I know what you're thinking: "This guy's too stupid to get regular expressions!" Yes. Yes I am. So are most people.
What I'm getting at is that source code is designed to be read and edited by busy and distracted humans. We should be writing intelligible code for each other and letting computers do the boring work of making it more efficient.
You don't have to agree with me. That's fine. But, perhaps you'll take note of the famous maxim from the "Wizard" book:
"a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute." (Structure and Interpretation of Computer Programs)
We are not wizards. Nor should we strive to be. The alchemists fell.
taste of taboo 🦝📸 said on kinkytaboo.online:
@Edent regex are a total pain for adhd brains
Steven Pears said on twitter.com:
I wrote Perl for a year at uni before I found ASP
Pro: I'm the guy who is asked to write a regex in 2 minutes that saves someone an hour of writing code just to search GBs of logs.
Con: I'm the guy who is asked to translate regexes when they're found without a comment attached.
barubary@infosec.exchange said on infosec.exchange:
@Edent "There's no space for comments." ... unless you're using Perl, in which case you can use spaces, indentation, comments, etc. in your regexes as you like. #perl
Guy Leech says:
Just put a comment above it which is a sample of the text that you are matching and what bits of it you are capturing
Tom Parker-Shemilt says:
Or Python with verbose mode https://docs.python.org/3/library/re.html#re.VERBOSE
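A minimal sketch of what that looks like, using the email pattern from the post. `re.VERBOSE` is a real flag in Python's `re` module; the name `EMAIL_PATTERN` and the wording of the comments are assumptions made for this example.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored (outside character
# classes) and everything from "#" to the end of the line is a comment.
EMAIL_PATTERN = re.compile(
    r"""
    ^
    \w+                # first word of the local part
    ( [-+.'] \w+ )*    # more words, each preceded by one of - + . '
    @
    \w+                # first word of the domain
    ( [-.] \w+ )*      # more words, each preceded by - or .
    \.                 # a compulsory literal dot
    \w+                # the word after that dot
    ( [-.] \w+ )*      # more words, each preceded by - or .
    $
    """,
    re.VERBOSE,
)

print(bool(EMAIL_PATTERN.match("first.last+tag@example.co.uk")))
print(bool(EMAIL_PATTERN.match("no-at-sign.example.com")))
```

The symbols are the same as in the terse version; only the layout and comments are new.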
Emily S said on tech.lgbt:
@Edent I'm pretty sure this is one of those lessons that one must learn on the road to becoming a senior engineer.
Andy Mabbett says:
Not that I disagree with you, but I find that:
https://regex101.com/
helps - not least because it gives you a human-readable version of your regex in an adjacent pane.
Evgeny Kuznetsov said on evgenykuznetsov.org:
Wow! Had I known about that RegEx101 thing, my life would have been much simpler. Then again, in that case I would probably not think Regular Expressions were write-only code, which they of course are.
James O'Malley said on twitter.com:
Couldn’t agree more with this. And it’s for this reason, rather than my being an idiot whose brain can’t learn them, that my code always contains dozens of str_replace()s and explode()s.
Neil Young said on twitter.com:
Agree with this. There's a use case for "immediate" regular expressions as sort of keyboard shortcuts, but if they're going to be read by anyone else, they're obtuse.
AP says:
Perl introduced the possibility (through the /x and /xx modifiers) to ignore spaces, tabs, linefeeds and even comments in the regular expression body. It lets you "unconvolute" what would have been an unreadable mess into something understandable and maintainable. Every programming language should allow this in its regular expressions.
Sam J Sharpe said on mastodon.social:
@Edent is it bad that I can read that regex and I think it's attempting to validate an email address?
Richard Morton says:
Thinking off the top of my head, is there a way to have a plain language style code that then gets compiled to a regex, and then include the plain language in the code comments or documentation?
Chuck says:
Swift language has added a RegEx builder library that is readable, but then compiles to a RegEx for execution. Similar DSL style approaches should be developed for other languages. https://developer.apple.com/documentation/regexbuilder
Alan says:
Wholeheartedly agree! I shudder every time I come across a RegEx that might be the problem, even worse if it has no unit tests.
Pat Mächler ❎ said on twitter.com:
"if/else and switch/case statements are the mark of a diseased mind."
Indeed!
While loops and gotos+labels are completely sufficient.
Yubi says:
Goto+label are closer to regular expressions than if/else statements.
chessmango said on hachyderm.io:
@Edent I agree that a dedicated, less-clunky solution should be used where possible, but the sheer ubiquity of RegEx is what makes it invaluable as far as I'm concerned. Means I can toss a string into vscode to check for specifics, or use the same in some 'find' type box in a browser even (in some cases). It's awful to write and read, but it's there 😛
Philippe Duval said on twitter.com:
An interesting take about why programs must be written for people to read (which is what I keep telling my students): shkspr.mobi/blog/2023/02/r… via @edent
Daniel May said on twitter.com:
everything here is true but im still weirdly proud to be a powerful wizard shkspr.mobi/blog/2023/02/r…
Paul Chapman says:
Terence,
Break your regex into logical pieces. Assign each piece to a variable (or constant) whose name describes it. Catenate the variables together to create the finished string. (This might run at compile time.) Your regex is now human-readable (looks more like BNF), and each piece can be inspected visually by itself to see if it matches the variable-name description.
Eg, your
^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
becomes the pseudocode (doubling up the \s):
word = "\w+"
addressee_separator = "[-+.']"
addressee = word + optional_repeat(addressee_separator + word)
domain_separator = "[-.]"
domain_part = word + optional_repeat(domain_separator + word)
domain = domain_part + "\." + domain_part
address = "^" + addressee + "@" + domain + "$"
where function (or macro) optional_repeat(x) returns "(" + x + ")*" (or you can spell it out if you don't want the reader to have to consult the definition of optional_repeat()). NB. This is code, so comments can be included! Adjust verbosity according to taste, or wizardry comfort level.
Critique: I don't like that your regex confuses the optional .s and the compulsory . in the domain name, making the grammar ambiguous. The ambiguity is revealed by the definition of domain, the like of which no one should be using in a well-constructed grammar. 🙂
IMHO, better would be:
...
word_with_hyphen = word + optional_repeat("-" + word)
domain = word_with_hyphen + compulsory_repeat("\." + word_with_hyphen)
...
Cheers, Paul
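The pieces described in this comment can be sketched as runnable Python. `optional_repeat` follows the definition given in the comment; `compulsory_repeat` is not defined there, so its body (`"(" + x + ")+"`) is an assumption, as are the upper-case constant names.

```python
import re


def optional_repeat(piece):
    """Zero or more repetitions of piece, as defined in the comment above."""
    return "(" + piece + ")*"


def compulsory_repeat(piece):
    """One or more repetitions of piece (assumed definition)."""
    return "(" + piece + ")+"


WORD = r"\w+"

# Local part: words joined by one of - + . '
ADDRESSEE_SEPARATOR = r"[-+.']"
ADDRESSEE = WORD + optional_repeat(ADDRESSEE_SEPARATOR + WORD)

# Domain as in the original pattern: two dotted-or-hyphenated parts
# with a compulsory dot between them.
DOMAIN_SEPARATOR = r"[-.]"
DOMAIN_PART = WORD + optional_repeat(DOMAIN_SEPARATOR + WORD)
DOMAIN = DOMAIN_PART + r"\." + DOMAIN_PART
ADDRESS = "^" + ADDRESSEE + "@" + DOMAIN + "$"

# The suggested less ambiguous grammar: hyphens only inside a label,
# dots only between labels.
WORD_WITH_HYPHEN = WORD + optional_repeat("-" + WORD)
STRICT_DOMAIN = WORD_WITH_HYPHEN + compulsory_repeat(r"\." + WORD_WITH_HYPHEN)
STRICT_ADDRESS = "^" + ADDRESSEE + "@" + STRICT_DOMAIN + "$"

print(ADDRESS)
print(bool(re.match(STRICT_ADDRESS, "me@my-site.example.org")))
```

Built this way, `ADDRESS` comes out character for character the same as the pattern in the post, so the named pieces double as documentation for it.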
Patrice Bremond-Gregoire says:
I like that a lot. Of course, I wouldn't put this code in-line where regex is used, I would create a function (e.g. ValidEmailRegEx) that returns the regex, and use your method in the function. When one reads the code using the regex one only need to know that this part validates an email. Then, if it is found that the email validation fails, one can look inside the function to figure out why.
Philip Oakley says:
Is this not the description of APL, but for a different category of folks 😉
Richard Meadowsq says:
So what is your alternative? I have an old framework that I developed back in the 90s. Lexer and Parser class. I would not expect any developer to be able to quickly read anything that I wrote using it.
Paul Drury says:
Regex is a great tool, fast and very powerful. I see loads of people moaning about regex being not easy to read, but I see no-one offering to produce a readable alternative. Until someone writes a more readable alternative then regex is still the best there is.
If you are presented with a massive string of unreadable regex then it was a human that produced that. A programmer failed to break it up into substrings each named to represent their function. It was a human that failed to comment what is going on. You can write a dense block of unreadable code in any language. You don't need regex to produce that.
The core fault here isn't with regex, it is with lazy programmers.
Michael Bammann says:
Perhaps this can also help in the future:
https://github.com/VerbalExpressions/JSVerbalExpressions (also available for many other languages)
Chuck says:
I had not seen that before. Very nice.