russ cox original article: https://swtch.com/~rsc/regexp/regexp1.html
Yeah, that should be the actual link (or better yet https://swtch.com/~rsc/regexp/), not this crappy article.
Hi. Just wanted to share my slides about the topic :)
Sorry that you didn’t like the title (and/or) the slides! As it’s just a 15 mins talk, I was not able to cover more details (also cause I wanted to make this talk as understandable as possible to absolute beginners).
So, it can seem shallow to some folks here.
i have no problem with the slides, but you really should include references to used sources if you more or less copy work of someone else!
I did mention Russ Cox’s article in my talk. But yeah, thanks for the pointer. I’ll make sure I include the references in my slides too :)
It’s super interesting to have better guarantees on regexps, and I’ve actually had real world desktop applications crash because they saw a URL they didn’t like.
That being said I think presentations like the one above lie by omission. They should also mention the performance of some common regexps like grepping an apache log file. In previous versions of Go it was literally faster to do a CGO call to libpcre for every line of input than it was to use the builtin regex package.
My automata theory is incredibly rusty, but isn’t it the case that you typically convert an NFA to a DFA, so that, say, you don’t have to simultaneously take two paths explicitly? I guess I could dig out my copy of the dragon book…
You could, but NFA to DFA conversion is susceptible to exponential blowup in the number of DFA states. This is why production regex engines use techniques to avoid this, by e.g., compiling the DFA lazily or compiling many small DFA.s
exponential blowup in the number of DFA states.
Yeah. That makes sense!
Yes, you could. However, if you compute the DFA strictly, you’ll usually end up with a HUGE (exponential in the size of the NFA) DFA, which then needs to be minimized to be workable. On the other hand, Thompson’s algorithm lazily generates the DFA as it’s needed, allowing you to still run in constant time.
you’ll usually end up with a HUGE (exponential in the size of the NFA) DFA
That’s exaggerating a bit. There are pathological, worst case, regular expressions that are exponential in the size of the NFA, but most of them aren’t, and I wouldn’t say it’s “usual.”
Thanks for that! I’ll have to look at Thompson’s algorithm. I guess I’m even more rusty than a thought…
I was just digging into this myself: https://github.com/jeremyheiler/regex
So the regexps are wearing skinny ties, sipping Old Fashioned’s, and having affairs?
I like the lua string matching functions - they seem pretty expressive and yet not as full of surprise costs as standard regex. But then I’ve never had to use them for serious work.
Great article. At a hack night last night we took a crack at implementing some of it in PureScript. Turns out pretty nice. So far at least. :-)
And nobody’s ever complained that Perl’s algorithm has exponential run time?
Most common regex matches are not pathological regexes, which means they don’t run for a painstakingly long time.
So, pathological regexes are a special case, and Perl do run exponentially on those. Also, please go through this discussion in the Perl forums, regarding the algorithm: http://www.perlmonks.org/?node_id=597364
Perl and the others have good reasons for its choice. This talk does not mention the trade offs involved and only celebrates speedups for pathological cases. In short, the advantage of the PCRE approach is that it allows to add more features to the regex language.
Good point. I wanted this talk to be for absolute beginners, so didn’t go into the details of the disadvantages of Thompson’s NFA and PCRE’s advantages :)
Having said that, Russ states in the article that the current features of the regex language are just abstractions, and can be done with the normal syntax (Backreferences being a useful exception).
No-one has ever complained about Perl, ever. It’s the perfect language!
You have to work hard to get it to behave badly; Perl understandably wants matching to be fast, and all of the ways to make it blow up with reasonable, useful code have been fixed sometime between 1987 and today.
I think it’s more common than you think. StackOverflow had down time last year because of a backtracking regex.
Yes, but they weren’t using Perl. If they were, that pattern would not have backtracked at all, and wouldn’t have caused any problem. The logic to avoid backtracking in that case was added in Perl 3.019 in 1990 — but .NET in 2016 is a different story :)
How do I reason about when backtracking will occur and when it won’t?
I don’t have a satisfying answer for that one.
It’s sort of like the halting problem vs. static code analysis thing. Everyone knows that you can’t statically analyze code behavior because the halting problem is unsolvable, and everyone knows that static analysis does work pretty good anyway because 99% of code in the real world does stuff that’s simple enough to analyze. You just need a big enough library of patterns and heuristics.
Well, regex matching with backtracking suffers from combinatorial explosion, except that 99% of regexes in the real world are simple enough that you can extract some feature from them that lets you rule out almost all possibilities and match them (or fail the match, since most problems usually occur on non-matching strings) immediately. A good regex engine will recognize and exploit those opportunities. But actually knowing exactly what bag of tricks your engine supports, recognizing when it doesn’t have your back anymore, and knowing how to fix it when that happens, is pretty much what separates the wizards from the average folks. There are some easy-to-learn cases (don’t unnecessarily nest quantifiers!), but a million more obscure ones.
Interestingly, us FSM regex proponents say something similar, “99% of all regexes you might want to write can be covered by a non-backtracking implementation, and when you hit a case where you want a more elaborate feature (look around or backreferences), you might be able to solve it with two regexes or maybe don’t use regex at all.” :-)
I suspect the 99% numbers are phooey in both cases though.
Most of the time regexes don’t backtrack much. The only time I’ve run into this was when I was writing a unit test in Java, and mistyped (complex)* as (complex)**. At the time, Java must have been naive about repeated *s, and tried all variants. The test ran for about a minute before timing out.