They had a regular expression that went accidentally quadratic - these are pretty common, and can be quite catastrophic.
I never use regex unless I can’t find a way to do it with plain string-manipulation functions. That isn’t because of stories like this (it’s mostly just because I’m bad at regex), but they sure do make me feel vindicated in doing it :)
“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”
For Prometheus we’ve got several places where we’re dealing with effectively unstructured data (or at least it has a structure that’s unknown to us) but it’s important that users be able to extract that data. So we fall back to regexes.
The regexes themselves tend not to be the problem, though lots of people don’t get anchoring right.
If you’re running user-supplied regexes on arbitrary data (using popular “regex” implementations - it’s not an issue with actual regular expressions) you’ll hit this issue sooner or later. Make sure you do the right thing when your users' regexes take exponential time.
If what you want is to allow users to run Turing-complete code on the data (which is what you’re doing, via a rather inefficient encoding, with popular regex implementations), maybe consider embedding a popular/standardized scripting language? That way users at least have the option of using more maintainable approaches (e.g. parser combinators).
We’re using RE2 which doesn’t have this issue.
If what you want is to allow users to run Turing-complete code on the data
We explicitly don’t want that, we’re only looking for data extraction. If a user’s metrics/services/machine taxonomy is so complex that they need a Type-3 grammar to handle it, they have bigger problems and we’ll point them towards our various plugin interfaces to let them code it up themselves in a language of their choice.
This kind of issue is why we have RE2. It lacks a few features that are sometimes handy in complex scenarios, but is designed to be not subject to pathological cases causing exponential CPU time usage. https://github.com/google/re2
On an unrelated note, and wouldn’t have helped here, I have been using Lua patterns lately, which are quite pleasant and much easier to implement than regex. I discovered them after OpenBSD’s http server implemented them. I don’t actually need to use anything like a regex often, but when I do, I use Lua patterns.
This seems like a problem a regex engine could optimize for. If the character after the anchor doesn’t match and can’t match, restart matching after it - something like Boyer–Moore string matching. There can still be some very bad patterns and inputs, but a great many practical cases can be optimized this way.
.NET regexen have an optional timeout, which should be used 100% of the time when dealing with untrusted data.
Some regex engines provide an API call that puts an implicit `.*?` at the beginning of the regex so that the semantics of the match are “match anywhere” as opposed to “match only from the start of the string.”
– burntsushi @ Hacker News
So, I guess the solution is to put ^.*? in front of the regular expression?
I doubt that solves the problem. The problem results from the fact that the regex engine has already done that. You only need to do that to reproduce the problem with some engines.
Actual (i.e. non-Perl) regular expressions don’t have this problem. I don’t think there’s a lot of value in optimizing out the most common ways to hit the problem - as long as you have the problem, you’re going to hit it sooner or later. If anything, it’s better to make the problem easier to hit, so that you find out about it earlier and take appropriate action.