1. 10

(Sorry, I couldn’t find a transcript. Here’s the opening paragraph, though.)

“This talk is about the science of insecurity, by which I mean the entire spectrum of exploits and vulnerabilities considered as a systematic, repeatable, and most importantly, predictive mathematical model. We’re going to examine, from first principles, what it is about exploits that makes them exploits in the first place, and how we can use this systematic understanding to design and implement software in which - to borrow a turn of phrase from Dan Kaminsky - entire classes of bugs simply don’t exist.”

  1. 10

    I watched this video like 5 years ago, and I read some of the related papers at http://langsec.org/, and looked at their HAMMER parser generator, etc.

    My takeaway: I totally agree with more formalism with respect to specifying languages/data formats sent over the network.

    However context free grammars, which they advocate, are the wrong formalism. It’s been 6 years, and I don’t think they will make a dent in security without a different formalism.

    As I recall, she advocates getting rid of length prefixes, because length prefixes can’t be expressed with CFGs. They make the language at least context-sensitive. I chuckled yesterday because in the “Kill the web” thread on HN some people were arguing about turning HTML into a length-prefixed format:


    I pretty much disagree with BOTH sides, I think they are missing the point, but that is a separate post…

    • Another issue is that languages specify infinite length strings. And practically speaking you really care about finite strings when dealing with network security. The CFG is just too coarse and doesn’t capture the properties you care about.
    • If you take a CFG, and limit to a certain length, you now have a regular language. I think someone brought that up at the end of the video – you just enumerate all possible inputs! That is regular. That is further evidence that the grammar isn’t really modelling the properties you care about.
    • Context free grammars only give you a O(n^3) guarantee. In practice you want your protocols and languages to be parsed in linear time, to avoid resource exhaustion attacks. I think pretty much every language (Python, Ruby, Perl, Lua, etc.) changed their hash table implementation in the last 5-10 years to avoid O(n^2) resource exhaustion attacks.
    • Context free grammars can’t express real protocols already in use. That includes some that are length-prefixed, but probably tons of others. Probably most.

    So there needs to be a different formalism, probably yet to be invented.

    That said, one of their slogans is “Full recognition before processing” and I totally agree with that.

    That is sort of the point behind “static parsing” in Oil [1]. It’s analogous to Interleaving parsing and execution, which fundamentally confuses code and data, which is dangerous from a security perpsective.

    I think there needs to be a new parsing tool that gets you some of these properties, which is not based on CFGs. Of course there are other formalisms like PEGs, but PEGs also aren’t able to express length prefixes, which are used in most deployed network formats. They’re more for programming languages.

    [1] http://www.oilshell.org/blog/tags.html?tag=parsing-shell#parsing-shell

    [2] http://www.oilshell.org/blog/tags.html?tag=parsing#parsing

    1. 4

      Well said. I don’t think we even have to look that far - unification grammars handle this sort of thing just fine.

      Edit to add: Well, they handle stuff like length restrictions. If the desired property is that no valid language in the grammar has superlinear behavior, I don’t know of a way to do that yet, and I agree with you that research is needed.

      1. 5

        I’m not familiar with unification grammars, but LL(k) and LR(k) languages can be parsed in linear time/constant space. And regex matching is linear time/constant space as long as you use the proper algorithm [1].

        For example, the OSH parser is uses a fixed 2-token lookahead, and the lexer uses regexes, so it should run in linear time and constant space. That is not true of say JavaScript (example below).

        There are two issues I see… one is that they are sort of confusing network protocol parsing and programming language parsing. Parsing TCP is not really like parsing JavaScript.

        Or maybe a better way to describe it would be the to contrast the control plane vs. data plane. So HTTP headers are the control channel, and the HTML or JavaScript is the data channel. Those design of those two data formats have different requirements, which lead to different parsers and parsing tools for them.

        I think that theory will help, but there is also a problem is also on the tools side. If you’ve ever looked at Node’s hand-written HTTP parser or Nginx’s, they are horrible in the sense that security bugs can lurk there for YEARS without being detected (and have). And those are two pieces of code which are exposed to an extreme amount of untrusted input.

        LL(k) parsers have some advantages for streaming over a network and partial parsing. And in practice people use predictive recursive descent parsers, which is very often a hand-written LL(k) parser.

        But it seems like there are no good LL(k) parsing tools anymore? I found out when writing Oil that ANTLR v3 did LL(k) analysis to tell you if your grammar was parseable with LL(k). But ANTLR v4 has switched to a more powerful algorithm LL(*), and it dropped this analysis.

        Plus nobody really uses ANTLR for network protocols, and it’s not even that common in programming languages. Almost all parsers are hand-written, and the ones that use yacc – like bash, Ruby, and R – are actually partially hand-written. There is a lot of ad hoc C code in those yacc grammars.

        So I think the tools are in bad shape, in addition to possibly needing some more theoretical concepts (e.g. for length prefixes).

        Here is a good talk about parsing JavaScript in v8/Chrome [2]. She says it’s 14,000 lines of hand-written C++. And there are two parsers – lazy and eager, i.e. to try to skip over uncalled code for speed.

        An interesting case she brought up was ES6 arrow functions. In ES6 you can now do:

        (a, b) => a + b

        When doing LL parsing, as opposed to LR parsing, you don’t know if you’re looking at something like this:

        (a, b); print();

        This makes the grammar non-LL(k) because the argument list can be arbitrarily long. It can be easily parsed with LR(k) techniques, and this is how JavaScript was historically parsed. But my point is that v8 does not use LR(k) parsing, for good reasons. So there is a disconnect between the specification and implementations. Specifications should be designed so that the implementations aren’t nightmares!

        I’m rambling a bit, but my point is that languages should be designed to be compactly described, and parsed quickly and easily by something other than 14K lines of hand-written C++ code. That is not the same as saying they should be parsed by CFGs. You could say that CFGs are simultaneously not powerful enough and too inefficient.

        And the other point is that parsing TCP or HTTP headers is a somewhat different problem than parsing HTML or JavaScript. They are borrowing tools from the programming language world where they might not be a great fit.

        [1] https://swtch.com/~rsc/regexp/regexp1.html

        [2] https://www.youtube.com/watch?v=Fg7niTmNNLg

        1. 2

          Yeah, I agree with all of that.

          I think part of the reason this tooling is so bad is that maintaining a parser generator is a lot of work just to tread water - keeping up with changes in the host and target languages.

          And, as you say, there’s really two different kinds of languages here.

          1. 3

            For completeness, here’s some tools geared at network formats that I recall:

            I haven’t evaluated them, since I’m more interested in language tools at the moment… but I would love to see an third party evaluation of them on practical problems – at least for Nail and Hammer, I don’t know if the code is available for PADS.

            (Also is it a coincidence that they’re called Nail and Hammer?? I don’t think the authors have any relation.)