I think there are several reasons for regex hate and they’re slightly contradictory:
First, a lot of ‘regex’ languages are more expressive than regular expressions. If I remember correctly, Perl regexes are actually Turing complete (or, at least, they’re push-down automata - I can’t remember which). This means that they’re a very dense and expressive language for writing parsers. This language, as a side effect of being so concise, is very hard to read. This is fine for single-use one liners but you don’t want to be stuck maintaining a complex regular expression.
Second, there are a lot of regex languages. C++11 defined six different regex languages, with several additional modifiers, and this didn’t include Perl-compatible regular expressions. Several regex languages use different syntax for the same thing or similar syntax for completely different things. This is hinted at in the article, which explicitly says that it’s using Vim’s regex syntax. Just because you’re familiar with regexes in one context doesn’t mean that you can correctly read them in another.
Third, and in contradiction to the first point, regexes can parse only regular languages. A lot of things that you parse are not regular languages. The canonical example here is matching brackets, which requires counting and so can be parsed with a Turing machine or a push-down automaton, but not with a finite state machine (regular expressions are an encoding of FSMs). This leads to horrible fragile hacks. For example, the Lisp syntax highlighting in Vim defines 10 (I think - it’s been over a decade since I looked) different syntax rules for brackets inside other brackets. Each one is coloured differently and so if you nest brackets up to 10 deep then you get different colours. After that, it breaks. This rule is trivial to express in any programming language (define a set of colours, count the bracket depth, increment when you reach an open bracket, decrement after you reach a closed bracket, pick colour depth mod number of colours to colour any bracket) but cannot be expressed with regular expressions.
Like many other things (object orientation, functional programming, machine learning, distributed systems, and so on), regular expressions are a fantastic tool for some problems and a complete disaster when applied to problems in the wrong domain.
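For what it's worth, the depth-counting rule described in the third point might look roughly like this in Python. This is only a sketch: the palette and the function name `colour_brackets` are made up for illustration, it handles only round brackets, and it says nothing about how Vim's highlighter is actually implemented.

```python
from typing import List, Sequence, Tuple

# Hypothetical palette; any non-empty sequence of colour names would do.
COLOURS = ["red", "orange", "yellow", "green", "blue"]


def colour_brackets(text: str, colours: Sequence[str] = COLOURS) -> List[Tuple[int, str]]:
    """Return (index, colour) for every '(' and ')' in text.

    Depth is incremented on reaching an open bracket and decremented
    after a closed one, so a matching pair always gets the same colour.
    """
    depth = 0
    result = []
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
            result.append((i, colours[depth % len(colours)]))
        elif ch == ")":
            result.append((i, colours[depth % len(colours)]))
            # Clamp at zero so a stray closing bracket cannot go negative.
            depth = max(depth - 1, 0)
    return result
```

Because the colour is picked with depth mod the palette size, nesting deeper than the number of colours just wraps around instead of breaking.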
Couldn’t you just make it so that the last rule can contain the first?
If vim is using proper regexes, you cannot. As mentioned above, regular expressions are equivalent to finite state machines, which mathematically cannot parse arbitrarily deep recursive structures in strings. Here’s a good article which covers this in its first part.
But Vim’s syntax highlighting isn’t just regexes. It’s based on regexes, but they’re connected by a more sophisticated system.
It might be possible outside of the regex system but that’s not how it was defined. Each of the rules was a single regex. If you are a pair of brackets with no brackets inside them, you are the first colour. If you are a pair of brackets with one pair of brackets inside them, you are the second colour. If you are… and so on. That’s all that you can do with regular expressions.
Well, then they’re doing it wrong. Regexes are not meant to be used alone. Here is a fully functioning Vim script that highlights nested parentheses in alternating colors:
I wholeheartedly agree with this! I have managed to learn enough regex syntax to be able to quickly transform data in a pinch. This is great because it minimizes the loss of momentum that comes from running into a tricky sub-problem (e.g. crap, I thought these were plain CSV files but really they’re…). That being said, I almost never use regular expressions in code, for largely the reasons listed in the article.
Another example of this, at least for me, is Vim macros. There are zero macros defined in my .vimrc, but I sometimes define one-off macros for specific tasks that I need to do a bunch of times. Often, these tasks could probably be completed with regexes, but macros are even easier. They’re also harder to read (because I’m defining them on the fly, in part) and even more brittle, but who cares for a one-off task!
“Loss of momentum” is a great way of putting it that I never thought of before.
my takeaway: machine learning is the “regex” of our time
Hm interesting, I use vim, but basically never use regexes directly in the editor. But I use them in Python, shell scripts, and the command line, in roughly that order. For that use case I think there are indeed many pitfalls to watch out for.
Here I advocate thinking about “regular languages” separately, which this post doesn’t address: http://www.oilshell.org/blog/2020/07/eggex-theory.html
When you’re putting them in code, I think this helps a lot.
Also my #1 tip is to treat a regular language/expression as a FUNCTION that returns True or False (a predicate).
Then it’s natural to unit test it. If you get in the habit of writing such unit tests with YES / NO examples, then I claim you will learn regexes gradually and never forget them … or at least it won’t be very painful.
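A minimal sketch of that predicate-plus-tests habit in Python, assuming a made-up pattern for version strings like "1.2.3" (the pattern, `is_version`, and the example lists are all hypothetical, chosen only to show the shape):

```python
import re
import unittest

# Hypothetical example language: three dot-separated runs of digits.
VERSION_RE = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+")


def is_version(s: str) -> bool:
    """Predicate: True if the whole string belongs to the language."""
    return VERSION_RE.fullmatch(s) is not None


class IsVersionTest(unittest.TestCase):
    # Strings that should be accepted and rejected, respectively.
    YES = ["1.2.3", "0.0.0", "10.20.30"]
    NO = ["", "1.2", "1.2.3.4", "v1.2.3", "1.2.x", "1..3"]

    def test_yes(self):
        for s in self.YES:
            with self.subTest(s=s):
                self.assertTrue(is_version(s))

    def test_no(self):
        for s in self.NO:
            with self.subTest(s=s):
                self.assertFalse(is_version(s))
```

Saved in a file, this should run with `python -m unittest`. Every surprise you hit later becomes one more string in YES or NO.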
Regexes in code suffer from two problems:
It’s almost always a bad idea to parse one programming language out of a string literal in another programming language. This is why we have HTML templating libraries and the like. To this end, regexes are best done in VerbalExpressions or something like that.
Barring that, regex and regular strings shouldn’t use the same escape character. Something as simple as changing the regex escape character to anything other than \ could make string-literal regexes about 1000% easier to read.
To get good use out of regexes, and avoid the escaping problem, you should either be using a language that has dedicated syntax for them (Perl, awk, Javascript) or a language that has raw strings (Go, Python, C#, Java 13+).
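To make the escaping problem concrete, here is a small Python sketch comparing the same pattern written as an ordinary string and as a raw string. The pattern (a literal backslash followed by word characters) and the sample path are just illustrative choices.

```python
import re

# Ordinary string literal: each regex backslash has to be doubled,
# so matching one literal backslash takes four of them.
ESCAPED = "\\\\\\w+"

# Raw string literal: what you type is what the regex engine sees.
RAW = r"\\\w+"

# Both spell the same five-character pattern: \\\w+
assert ESCAPED == RAW

# Finds each backslash-prefixed path component.
parts = re.findall(RAW, r"C:\Users\alice")
```

Here `parts` holds the two components `\Users` and `\alice`, each with its leading backslash.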
Ideally your language / regex library also supports /x mode (whitespace insensitivity, so that you can actually format your patterns nicely) and named captures. And ideally it obeys the rule that all metacharacters are either backslashed alphanumerics, or un-backslashed punctuation, and that backslashing punctuation always makes it non-meta (even when unnecessary, i.e. that punctuation didn’t have a meta meaning to begin with). This aids human interpretation and makes programmatic escaping simple.
I’m curious what kind of regexes people use so they run into problems all the time.
Stuff I used the last two weeks:
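Is this a 7 digit number with leading zeros allowed? ^[0-9]{7}$
Does this string start with 2 ASCII letters, then 2 digits? ^[a-z]{2}[0-9]{2}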
Yes, you can do that in code, but it’s usually 5 lines instead of one and not easily readable at a glance.
This is about the complexity that the majority of my regexes have, a trim() before and maybe an ignore-case flag or calling lower() and that’s it. I also don’t remember when I last discovered a bug in one of those regexes. I think the people crying “don’t use regex” are burnt by trying to validate a URL or something once and then forming an opinion.
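As a rough illustration of how small these checks are in practice, here is a Python sketch of the two patterns above with the trim() and lower() steps. The function names are made up, and Python's `strip()` stands in for trim().

```python
import re

SEVEN_DIGITS = re.compile(r"^[0-9]{7}$")
TWO_LETTERS_TWO_DIGITS = re.compile(r"^[a-z]{2}[0-9]{2}")


def is_seven_digit_number(s: str) -> bool:
    """Exactly 7 ASCII digits, leading zeros allowed."""
    # strip() also removes a trailing newline, which Python's `$`
    # would otherwise tolerate.
    return SEVEN_DIGITS.match(s.strip()) is not None


def starts_with_two_letters_two_digits(s: str) -> bool:
    """Starts with 2 ASCII letters (any case), then 2 digits."""
    # lower() stands in for an ignore-case flag.
    return TWO_LETTERS_TWO_DIGITS.match(s.strip().lower()) is not None
```

For example, `is_seven_digit_number(" 0001234 ")` is true and `starts_with_two_letters_two_digits("a1b2")` is false.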
This looks interesting, but I’m a bit curious, once you’re going to write something so verbose and that isn’t already familiar to people why not switch to a PEG or parser combinator?
You don’t use this when you’re writing something verbose and unfamiliar. You use it when you’re writing a quick regex. Then, when you inevitably modify it into something verbose and unfamiliar, you’re using a tool that makes it legible.
Wow, verbal expressions look like exactly what I have thought should exist for the longest time. It doesn’t help too much in an editor context (tho it would be cool if it could!), but for code this seems like such a good idea that I think it should be put in the Python standard library
Funny you should say that, because Emacs has had something very similar to “verbal expressions” since at least 2001: Rx notation. However, the comments at the top of Emacs’ Rx implementation cite Olin Shivers’ SRE, from 1998, as an influence. And both Rx and SRE in turn were inspired by the older sregex.el from Emacs, which dates from 1997. In short, the basic idea of this has been present in Emacs for a very long time!
I wish that Rosie Pattern Language would catch on. It’s like regex but better because it can actually do “impossible” things like parsing HTML, since it’s a PEG parser and not a regular parser.
Worth the read alone for the origin story of the horrible out-of-context quote.