This article includes one of my pet peeves: the claim that the problem with using regexes to parse or validate HTML is a mathematical one, an issue of Chomsky-hierarchy power:
Validating arbitrarily nested HTML with regexes is a classic mistake, made by both novice Web developers and the designers of anti-XSS protections in IE 8. The mathematical reason for this world of XSS fail is simple: such languages are context-free or context-sensitive, and require at least a pushdown automaton to match them.
But HTML is not “arbitrarily” nested. Every browser imposes a finite maximum nesting depth, e.g. it’s 512 in WebKit. It’s perfectly reasonable for a validation tool to also impose a finite depth and reject documents beyond it. And if you do so, it’s a regular language, because every finite language is regular.
It still remains error-prone and a bad idea to validate HTML with regexes. But not for mathematical reasons, rather for software-engineering reasons: it’s simply hard to construct regexes that correctly parse nested-structure documents, and even if correctly constructed they are unmaintainable. The pushdown automaton is a red herring; if you used a better front-end language, there’s no real reason you couldn’t mechanically compile to an FSM, though it might be a large one and therefore you might choose something else for performance/space reasons.
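To make the point concrete, here is a minimal sketch (using hypothetical `<b>` tags as the nested structure) of mechanically generating a regex that matches balanced nesting up to a fixed depth bound. It shows both halves of the argument: the bounded-depth language really is regular, and the expression you get is the kind of thing no one should maintain by hand.

```python
import re

def nested_regex(depth):
    """Build a regex matching balanced <b>...</b> nesting up to `depth` levels.

    Illustration only: a depth-bounded "context-free" pattern is regular,
    but the expression grows another layer for every unit of depth.
    """
    pattern = ""  # depth 0: only the empty string
    for _ in range(depth):
        # wrap the previous level in one more optional, repeatable tag pair
        pattern = r"(?:<b>" + pattern + r"</b>)*"
    return re.compile(pattern)

r = nested_regex(3)
print(r.pattern)  # already unreadable at depth 3
print(bool(r.fullmatch("<b><b></b><b></b></b>")))          # True: depth 2 <= 3
print(bool(r.fullmatch("<b><b><b><b></b></b></b></b>")))   # False: depth 4 > 3
```

The generated pattern is a perfectly valid finite-state recognizer; it is the front-end notation, not the automaton class, that makes this approach miserable.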
It’s perfectly reasonable for a validation tool to also impose a finite depth and reject documents beyond it. And if you do so, it’s a regular language, because every finite language is regular.
By that argument, all computer programs are regular. This is not a tractable approach, for obvious reasons. I wish people would quit this bit of sophistry.
I’m okay with using “all memory on a computer” as a rough approximation of “infinite memory”, but I get more skeptical of doing so with small constants, especially since on a modern computer, you can scale finite methods quite a bit. And recognizing that a CFG-like grammar with a finite nesting depth is amenable to finite-analysis methods is useful in practice in some cases.
But more to the point, I don’t think talking about pushdown automata actually gets at why parsing HTML with a regex is bad. Regexes are not a good software engineering tool for parsing nested data structures—not even smallish finite ones. Regexes aren’t even good for parsing nested data structures with a nesting depth of five, even though in that case it’s computable and tractable.
Of course, that replaces a mathematically simple definition with one that has some uncomfortable rough edges…
HTML may be nested up to some arbitrary implementation-defined number, and what happens beyond that is also implementation-defined.
From the attacker’s point of view, that feels like a vista of opportunities to explore…
Does anyone have a thought on how much moving from human-readable protocols, such as HTTP, to binary ones, like HTTP/2, will change things? In many respects, I think binary protocols are easier to parse safely. Integers, especially, are fixed-size in the protocol rather than variable and unbounded as in human-readable formats. In some sense I lament HTTP/2, because HTTP’s simplicity made it so easy for someone to get in and get excited, but if it leads to more safety that is a worthwhile tradeoff.
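The fixed-size-integer point can be sketched with a hypothetical length-prefixed binary frame (the frame format here is invented for illustration, not taken from HTTP/2). The length field is always exactly four bytes, so there is no “how many digits, and what terminates them?” question of the kind a textual `Content-Length` header raises:

```python
import struct

def parse_frame(buf):
    """Parse a hypothetical frame: 4-byte big-endian length + payload."""
    if len(buf) < 4:
        raise ValueError("short header")
    (length,) = struct.unpack("!I", buf[:4])  # always exactly 4 bytes, bounded
    if len(buf) < 4 + length:
        raise ValueError("truncated payload")
    return buf[4:4 + length]

frame = struct.pack("!I", 5) + b"hello"
print(parse_frame(frame))  # b'hello'
```

The parser never scans for a delimiter and the integer can never overflow its textual representation; the remaining hazards are bounds checks, which are at least easy to state.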
Human-readable to binary wouldn’t change the security profile much. A lot of the problems we are seeing today stem from manual memory management, where a flaw in parsing leads to an exploitable condition instead of a crash (not that logic bugs can’t also lead to exploitable conditions; they can).
Now there is also something to be said for creating protocols that are easier to implement in software but HTTP isn’t all that hard to parse compared to something with an insane format like Adobe Flash.
Part of the thrust of the paper is that people are implementing things over and over again for performance reasons. This was a big fetish in Node.js and various Ruby libraries for a while: “look at how fast our C HTTP parser is”. Perhaps binary protocols lessen this gap, leaving less reason to implement something in such an unsafe language as C.
I’ve been thinking a lot about Bram’s Law. Maybe “obvious and easy to get wrong” is nice for starting and growing. On the other hand, Postel’s Maxim means you’ll probably be stuck interoperating with bad implementations.
Postel’s Maxim seems to be getting some flak these days; perhaps we’re seeing the end of an era in terms of complexity in input processing.
Bram’s Law makes sense. If you assume Bram’s Law you need to hold to Postel’s Maxim.
In practice, though, you need to assume that you’re going to see utterly terrible implementations, and even hostile ones. So even if you are liberal in what you accept, you have to assume that someone will send you hostile input, and handing them a potential Turing machine is usually less than ideal.
The impact of imperfect implementations is now a problem: an exploitable implementation affects many more people than the person who deployed the code, and no one is held accountable for it today.
I’m looking at some third-party code at the moment…. It seems to firmly disprove the converse of Bram’s Law…
If Bram’s Law is…
The easier a piece of software is to write, the worse it’s implemented in practice.
ie. The fact that a piece of software was hard to write in no way means it is implemented well.
You can always test a piece of software into the shape of a working application…..
…but inspection of the actual code may induce nausea and vomiting.
Oh indeed, I have seen too many horrors to ever entertain the converse of Bram’s Law.
The way I view it, a binary protocol is an ordinary ASCII protocol with the lexical / tokenization step already done (in a perhaps painful and uncomfortable manner). (ie. Yes, your numbers have been tokenized and converted to binary…. but you still have to handle byte order / alignment / struct packing / bit fields….)
ie. A binary protocol has all the same problems….. just the tokenization step has a slightly different set of them.
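A small sketch of those leftover tokenization problems, using Python’s `struct` module (the two-byte value here is an arbitrary example): the same bytes decode to different numbers depending on byte order, and C-style struct packing adds its own platform-dependent padding.

```python
import struct

# Byte order: the same two bytes, read both ways.
raw = b"\x01\x00"
(big,)    = struct.unpack(">H", raw)   # big-endian:    0x0100 = 256
(little,) = struct.unpack("<H", raw)   # little-endian: 0x0001 = 1
print(big, little)  # 256 1

# Packing/alignment: a struct {uint8_t a; uint32_t b;} in native layout ("@")
# is usually padded to align the uint32_t; "standard" layout ("=") is not.
print(struct.calcsize("@BI"))  # typically 8 on common platforms (3 pad bytes)
print(struct.calcsize("=BI"))  # 5, no padding
```

So the tokens arrive pre-cut, but reading them correctly still depends on agreeing about order, size, and layout, which is exactly the “slightly different set of problems” above.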