The argument against succinctness seems odd to me. Yes, regular expressions (the example given) are notoriously succinct, but is
a(b{3,10}|c*)d
really harder to read than
("a" (or (repeat "b" 3 10) (any "c") "d")
or some other notation? The verbose one might be easier to read for someone unfamiliar with regex notation, but once you’ve learned regexes, it becomes a lot easier to see “the whole picture” with a succinct notation (IMHO).
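For concreteness, here is the succinct pattern in action using Python's `re` module (the sample strings are my own):

```python
import re

# The succinct notation from above: "a", then either 3-10 "b"s
# or any number of "c"s, then "d".
pattern = re.compile(r"a(b{3,10}|c*)d")

assert pattern.fullmatch("abbbd")     # three b's: matches
assert pattern.fullmatch("accccd")    # c's instead: matches
assert pattern.fullmatch("ad")        # c* also matches the empty string
assert not pattern.fullmatch("abbd")  # only two b's: no match
```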
It’s like saying we should write arithmetic like (to use an example that totally isn’t a real programming language ahem):
ADD 1 TO X GIVING Y
instead of
Y = 1 + X
Mathematical notation is notoriously succinct, and it has succeeded because it makes communicating mathematics much easier (yes, I’m intentionally echoing “Notation as a Tool of Thought” here). Standard mathematical notation is the world’s most common DSL, so common as to be ubiquitous.
Many of the arguments in TFA against regex notation seem to be at least partially answered by extended regex notation, which allows whitespace. To wit:
a
(
b{3,10}
|
c*
)
d
is just as readable as the s-expr above, if not more so (again IMHO), and still lets me see both the trees and the forest.
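Python exposes exactly this extended notation through the `re.VERBOSE` flag, which ignores insignificant whitespace and allows inline comments (example mine):

```python
import re

# re.VERBOSE lets the pattern be laid out across lines,
# with room for annotations next to each piece.
pattern = re.compile(r"""
    a
    (
        b{3,10}   # three to ten b's...
        |
        c*        # ...or any number of c's
    )
    d
""", re.VERBOSE)

assert pattern.fullmatch("abbbbd")
assert not pattern.fullmatch("abd")
```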
I suppose the argument comes down to ease-of-use for beginners versus ease-of-use for experienced users. Experienced users want brevity and concision; beginners want code that is self-explanatory.
I think that there should be tools for converting from a DSL to the unsugared, more powerful syntax. As much as I love regexes, not everyone knows them, and there are lots of subtleties, complexities, and variations (is that Perl, Vim, or shell syntax?).
The problem with regular expressions isn’t so much that they are overly succinct but that the sub-expressions typically go unnamed. E.g. we might have a regular expression for IPv4 addresses (from https://stackoverflow.com/a/5284410):
re = /\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}\b/
but this would be much easier to read if we wrote:
octet = /25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?/
re = /\b(#{octet})\.(#{octet})\.(#{octet})\.(#{octet})\b/
and as a side benefit it does stricter validation and correctly captures all 4 octets.
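A rough Python translation of that idea (the names and the use of a format string are mine); `fullmatch` is used so the pattern must cover the whole string:

```python
import re

# Name the sub-expression once...
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# ...then compose it, escaping the literal dots between octets.
IPV4 = re.compile(r"\b{o}\.{o}\.{o}\.{o}\b".format(o=OCTET))

m = IPV4.fullmatch("192.168.0.1")
assert m.groups() == ("192", "168", "0", "1")  # all 4 octets captured
assert IPV4.fullmatch("256.1.1.1") is None     # out-of-range octet rejected
```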
The most important facility any language can provide IMO is the ability to give names to constructs we create.
Once you can name subexpressions you have the question of recursion. If you support recursion this is essentially PEG. That’s why I call PEG “regex++”.
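Recursion is exactly what true regular expressions can't do: they can't match nested structure like balanced parentheses, which is where PEG-style named, recursive rules come in. A minimal hand-rolled sketch (the grammar and function names are my own):

```python
# A tiny PEG-style matcher for the grammar:
#   expr <- "(" expr ")" expr | ""
# i.e. balanced parentheses -- not expressible as a true regular expression.
def match_balanced(s, i=0):
    """Return the index just past the longest balanced prefix starting at i."""
    while i < len(s) and s[i] == "(":
        j = match_balanced(s, i + 1)   # recurse into the nested expr
        if j >= len(s) or s[j] != ")":
            return i                   # unmatched "(": stop before it
        i = j + 1                      # consume ")" and continue
    return i

def is_balanced(s):
    return match_balanced(s) == len(s)

assert is_balanced("(()())")
assert not is_balanced("(()")
```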
I think that this might be the best advice in here. How do you go about finding the right balance between bringing a tool closer to the problem, and bringing it so close to your problem that it declines in usefulness?
The tricky parts of regex don’t go away with verbalising the operators.
For example: is a given quantifier evaluated in a greedy or non-greedy fashion?
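Greediness is a behavioural question that verbose notation alone doesn't answer. A quick Python illustration (example mine):

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: "*" grabs as much as possible, so the match spans both tags.
assert re.search(r"<.*>", html).group() == "<b>bold</b> and <i>italic</i>"

# Non-greedy: "*?" stops at the first closing ">".
assert re.search(r"<.*?>", html).group() == "<b>"
```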
Excellent point. Giving names and recursive ability to regexes gets you Parsing Expression Grammars, though there’s no standardized notation for them.
This is a good takeaway lesson. A lot of DSLs don’t even have the escape hatch!