Regular Expressions make me feel like a powerful wizard - and that's not a good thing
(This is a rant because I'm exhausted after debugging something. If you've made RegEx your whole personality, I'm sorry.)
The other day I had to fix a multi-line Regular Expression (RegEx). After a few hours of peering at it with a variety of tools, I finally understood the problem. Getting that deep into the esoteric mysteries made me feel like a powerful wizard with complete mastery of my domain. And I think that's dangerous.
I'm sure we've all read a story about a witch or wizard who distractedly substitutes eye-of-newt with iron-ute with disastrous consequences. Humans are easily confused. And confusion leads to unexpected mistakes.
Look, most humans are very bad at reading compiled code. Without any external tools - can you tell me what the following code does?
0000000 c031 d88e c08e 15be b47c ac0e 003c 0474
0000010 10cd f7eb 48f4 6c65 6f6c 202c 6f57 6c72
0000020 2164 0a0d 0000 0000 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0000
*
00001f0 0000 0000 0000 0000 0000 0000 0000 aa55
0000200
No. Of course not. That's why we write code in a more human-readable language and then compile it to computer-readable instructions.
Regular Expressions are a sort-of halfway house. They're slightly readable by humans - but written in such a terse vocabulary as to be mostly unintelligible without concentration. There's no space for comments. Different engines have variable support for all their functions. They are a symbolic language with unhelpfully indecipherable and inconsistent symbols.
As a result, once a RegEx becomes more than trivially complex, it's hard for most humans to understand. That makes it difficult to debug. It also makes it difficult to add or remove functionality.
I genuinely - and possibly misguidedly - believe that even something like ^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$ might just as well be written in BrainFuck.
My contention is that almost all RegExs would be better served by more human readable code and that the very existence of RegEx101.com ought to bring shame on our industry.
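To make that less hand-wavy, here is a rough sketch of how the email-ish pattern above might look as plain Python. Every name in it (looks_like_email, is_word, split_on_any) is made up for illustration, and it assumes "word character" means a letter, a digit, or an underscore. Treat it as one possible reading of the pattern rather than a drop-in replacement.

```python
LOCAL_SEPARATORS = "-+.'"   # allowed between words before the "@"
DOMAIN_SEPARATORS = "-."    # allowed between words after the "@"


def is_word(text: str) -> bool:
    """True if text is non-empty and holds only letters, digits, or underscores."""
    return bool(text) and all(ch.isalnum() or ch == "_" for ch in text)


def split_on_any(text: str, separators: str) -> list[str]:
    """Split text at every character that appears in separators."""
    parts = [""]
    for ch in text:
        if ch in separators:
            parts.append("")   # start a new (possibly empty) part
        else:
            parts[-1] += ch
    return parts


def looks_like_email(address: str) -> bool:
    # Exactly one "@" dividing the local part from the domain.
    if address.count("@") != 1:
        return False
    local, domain = address.split("@")

    # The domain needs at least one dot, as in "example.com".
    if "." not in domain:
        return False

    # Each half must be words joined by single separators, so no leading,
    # trailing, or doubled separators (those would leave an empty part).
    local_ok = all(is_word(p) for p in split_on_any(local, LOCAL_SEPARATORS))
    domain_ok = all(is_word(p) for p in split_on_any(domain, DOMAIN_SEPARATORS))
    return local_ok and domain_ok
```

It is several times longer than the one-liner. But each rule sits on its own line with a comment next to it, and a busy, distracted human can change one rule without re-deriving the rest.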
Here are some positive use-cases for RegEx:
- You want to show off how smart you are.
- You need maximum efficiency when combing through a billion lines of text.
- You have a desire to build something hard to debug.
- You don't have lots of printer paper and need to make your code as terse as possible.
- You think if/else and switch/case statements are the mark of a diseased mind.
- You don't trust compilers.
I know what you're thinking: "This guy's too stupid to get regular expressions!" Yes. Yes I am. So are most people.
What I'm getting at is that source code is designed to be read and edited by busy and distracted humans. We should be writing intelligible code for each other and letting computers do the boring work of making it more efficient.
You don't have to agree with me. That's fine. But, perhaps you'll take note of the famous maxim from the "Wizard" book:
"a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute."
(Structure and Interpretation of Computer Programs)
We are not wizards. Nor should we strive to be. The alchemists fell.
Guy Leech says:
Wow! Had I known about that RegEx101 thing, my life would have been much simpler. Then again, in that case I would probably not think Regular Expressions were write-only code, which they of course are.