1. 37

  2. 16

    This started out as a total rant about the current state of the web, insecurity and the lack of proper rigidity in specifications. I decided not to post it while I was all riled up. The next day I rewrote it in its current form. It’s still a bit one-sided as I’m still having trouble understanding their reasoning. I vainly hope they’ll either give me a more coherent explanation why they dropped the formal grammar, or actually fix it.

    1. 16

      The formal grammar doesn’t reflect reality. The browsers started diverging from it years ago, as did the server authors. Sad, but true of many many similar specifications. The WHATWG spec a descriptive spec, not a prescriptive one: it was very carefully reverse engineered from real behaviours.

      1. 8

        You can model that, too. Specs trying to model C with undefined behavior or protocol operation with failure modes just add the extra stuff in there somewhere. Preferably outside of the clean, core, expected functioning. You still get the benefits of a formal spec. You just have to cover more ground in it. Also, good to do spec-based test generation run against all those clients, servers, or whatever to test the spec itself for accuracy.

        1. 2

          … that’s exactly what these modern bizarro algorithmic descriptions of parsers are—rigorous descriptions of real behaviors that have been standardized. “Just add the extra stuff” and this is what you get.

          It sounds like by a “formal spec” you mean a “more declarative and less algorithmic” spec, which definitely seems worthwhile. But be clear about what you want and how it’s different from what people have been forced to do by necessity in order to keep the web running.

          1. 1

            By formal spec, I mean formal specification: a precise, mathematical/logical statement of the standard. A combo of English and formal spec (esp executable) with both remove ambiguities, highlight complexities, and aid correct implementation.

            Certain formal languages also support automatic, test generation from specs. That becomes a validation suite for implementations. A formal spec also allows for verified implementations, whether partly or fully.

            1. 2

              I am exceedingly familiar with what a formal specification is. I am pretty sure you are confused about the difference between rigor and a declarative style—the two are entirely orthogonal. It is possible to specify something in an algorithmic style and to be entirely unambiguous, highlight complexities, aid correct implementation, and support automatic test generation, moreover, this has been done and is done extremely often—industry doesn’t use (or get to use) parser generators all the time.

              1. 1

                Ok, good you know it. It’s totally possible Im confused on rigor. Ive seen it used in a few different ways. How do you define it?

                1. 2

                  Sorry for the delay, renting a car :(

                  I would define rigor as using mathematics where possible and extremely precise prose when necessary to removing ambiguity, like you pointed out. Concretely, rigor is easier to achieve when the language you are writing in is well defined.

                  If you written using mathematical notation you get the advantage of centuries of development in precision—you don’t have to redefine what a cross product or set minus or continuity are, for example, which would be very painful to do in prose.

                  Specs try to achieve the same thing by using formalized and often stilted language and relying on explicit references to other specs. Because mathematicians have had a much longer time to make their formalisms more elegant (and to discover where definitions were ambiguous—IIRC Cauchy messed up his definition of convergence and no one spotted the error for a decade!) specs are often a lot clunkier.

                  For an example of clunkiness, look at the Page Visibility API. It’s an incredibly simple API, but even then the spec is kind of painful to read. Sorry I can’t link to the specific section, my phone won’t let me. https://www.w3.org/TR/page-visibility/#visibility-states-and-the-visibilitystate-enum

                  Separately, for an example of formal methods that looks more algorithmic than you might normally expect, see NetKAT, which is a pretty recent language for programming switches. https://www.cs.cornell.edu/~jnfoster/papers/frenetic-netkat.pdf

                  Certainly web spec authors have a long way to go until they can commonly use formalisms that are as nice as NetKATs. But they still have rigor, just within the clunky restrictions imposed by having to write in prose.

      2. 5

        I have to parse sip: and tel: URLs (RFC-3261 and RFC-3966) for work. I started with the formal grammar specified in the RFCs (and use LPeg for the parsing) and even then, it took several iterations with the code to get it working against real-world data (coming from the freaking Monopolistic Phone Company of all places!). I swear the RFCs were written by people who never saw a phone number in their life. Or were wildly optimistic. Or both. I don’t know.

        1. 8

          I may hazard a guess… I watched the inception of WHATWG and used to follow their progress over several years, so I have a general feeling of what they’re trying to do in the world.

          WHATWG was born as an anti-thesis to W3C’s effort to enforce a strict XHTML on the Web. XHTML appealed to developers, both of Web content and of user agents, because, honestly, who doesn’t want a more formal, simpler specification? The problem was that the world “in large” is not rigid and poorly lends itself to formal specifications. WHATWG realized that and attempted to simply describe the Web in all its ugliness, complete with reverse engineered error handling of non-cooperative browsers. They succeeded.

          So I could imagine the reasoning for dropping the formal specification is due to admitting the fact that it can’t be done in a fashion compatible with the world. Sure, developers would prefer to have ABNF for URLs, but users prefer browsers where all URLs work. Sorry :-(

          1. 3

            This is my understanding too, but you still need to nail down some sort of “minimally acceptable” syntax for URLs to prevent further divergence and to guide new implementations.

        2. 10

          I appreciate both the desire to have a formal spec which implementations can be compared against, and the desire to capture things as they really are. A spec no-one follows is worse than useless; someone will write an implementation against it, then wonder why everything else seems to do things incompatibly.

          Unfortunately, it’s very difficult to formally capture what’s arisen from a process of ad hoc implementations. I experienced this working on CommonMark; its spec is incredibly long, and isn’t even a grammar; it’s more like what WHATWG put together for the URI, for the same reasons.

          1. 6

            Bear in mind that the URL syntax did have RFCs, so it didn’t purely arise from a process of ad hoc implementations. I’m not 100% sure the RFC did actually capture what browsers were already doing, but it certainly helped in focusing the existing implementations towards a common agreement. WHATWG’s spec is a step back from that, IMO.

            1. 8

              Thing is, simply having an RFC is not enough for it to be followed. Here’s the obligatory classic link: https://web.archive.org/web/20110904005422/http://diveintomark.org/archives/2004/08/16/specs

              1. 1

                haha, that’s a great one! I did’t know it yet. Thanks for pointing it out

              2. 5

                WHATWG’s spec is a step back from that, IMO.

                Agreed. It’s a pity, and now that this is done, it’d be very difficult to try to converge again on a spec in the future.

            2. 6

              Nice article. Problems are the LANGSEC link will show all kinds of stuff where they might be overwhelmed. They might not want to watch a video. I suggest modifying the LANGSEC part somewhere to quickly convey the benefit to the audience with some variation of: “tools like Hammer exist that can automatically generate a secure, efficient parser directly from a formal grammar.” That’s the benefit. That’s the pull. Then, they might want to look at Hammer, the LANGSEC page, or the video maybe in that order.

              The benefit gives a clear reason to put in effort. The “automatically generate” part tells them it can work with cheap managers, lazy developers, and/or just busy people. It’s no hassle. High reward, maybe low risk, and that’s interesting. :)

              1. 4

                Thanks for reminding me of Hammer. I should write a wrapper for it in CHICKEN and add a parser that is based on it to the uri-generic alternatives :)

                Having all these alternatives is super-helpful when benchmarking parsers generated by specific parser generator libraries, too! It’s a real-world useful regular syntax that is just beyond toy language.

                1. 1

                  That’s an interesting idea. Definitely good if it gets integration with more systems. Especially in a way that makes its use idiomatic for easier uptake.

              2. 8

                The browser vendors weren’t willing to change their products be compliant with the RFC, because websites already rely on the non-compliant behavior. They wrote their own spec that matches the behavior that the web relies on. That behavior apparently cannot be modeled using EBNF.

                If you can write a grammar that matches the behavior of Internet Explorer, Chrome, and Firefox well-enough to not break web pages that currently work with all of them, I’m sure they’d love to see it.

                1. 2

                  Yea, I find the sentiment being expressed here kind of silly. “Browser spec authors should give up on actually caring about writing specs that (a) actually standardize behavior, as opposed to describing some ideal that no one implements, and (b) maintain backwards compatibility with the actual web because it makes it hard to implement URI parsing for my hobby project!”

                  The implication is also that the spec writers are either dumb or purposely writing bad specs.

                  Maybe your hobby project doesn’t need to work with real world data, but if a spec writer decides things should just stop working, no browser will implement the spec. And if a browser decides the same, people will just stop using that browser.

                  It’s unfortunate that the state of things is like this. But people still write browser engines and spec compliant URI parsers from scratch in spite of this, because their goals are aligned with that of the spec—to work with real world data.

                  If you don’t care about parsing all the URIs browsers need to parse, there’s absolutely no shame in only parsing a restricted syntax. In fact, depending on your problem domain, this might be better engineering.

                  1. 2

                    I would say that this is a good use-case for warnings. Browsers (especially Chrome) have been discouraging poor practices for a while now. For example, HTTP pages don’t get the “padlock” that a significant proportion of users have been trained to look for; then HTTP resources on HTTPS pages were targetted; and now HTTP itself gets flagged as “not secure”.

                    If it makes sense to flag the entire HTTP Web, then I’d find it perfectly acceptable to (gradually) roll this out for malformed data too; e.g. if a page’s URL doesn’t follow a formal spec, if its markup or CSS is malformed, etc. This is justifiable in a few ways:

                    • As the article’s langsec argument states: if data isn’t validated against a (decidable) formal language before getting processed, there is potential for exploits.
                    • In the case of backwards compatibility, we can consider “legacy” content (created before the relevant specs) as if it complies with some hypothetical previous spec, like “Works for me in Netscape Navigator”. In which case that hypothetical spec is out of date, in the same way that e.g. Windows98 is out of date: it’s no longer maintained, and won’t receive any security patches. Hence it’s legitimate to consider such legacy formats as not secure.
                    • In the case of forwards compatibility, it can still be considered as not secure, since we’re skipping parts of the data we don’t understand, which were presumably added for a reason and may be needed for security.
                    1. 2

                      I agree with you in principle, but I don’t think this is a good idea in practice.

                      As the article’s langsec argument states: if data isn’t validated against a (decidable) formal language before getting processed, there is potential for exploits.

                      Using yacc, say, instead of writing your own recursive descent parser, doesn’t magically make your parser safe—in fact, far from it. I don’t buy the safety argument at all, having had the joy of encountering bizarre parser generator bugs.

                      Moreover the point is that the URI grammar here can’t be described as a CFG. So forget about even using a nice standard LR parser generator that outputs to a nice automaton, so the idea of having security by relying on parser generators used by lots of other people is moot.

                      In the case of backwards compatibility, we can consider “legacy” content (created before the relevant specs) as if it complies with some hypothetical previous spec, like “Works for me in Netscape Navigator”. In which case that hypothetical spec is out of date, in the same way that e.g. Windows98 is out of date: it’s no longer maintained, and won’t receive any security patches. Hence it’s legitimate to consider such legacy formats as not secure.

                      And now you have two parsers, so not only are any potential security exploits are still there, but it is now even more likely that you’ll forget about some edge case.

                      1. 1

                        I agree, but note that I’m not arguing that something or other is secure. Rather, I’m saying that adding a “Not Secure” warning might be justifiable in some cases, and this lets us distinguish between the extreme positions of “reject anything that disagrees with the formal spec” and “never remove any legacy edge-cases”.

                        The usual response to warnings is that everybody ignores them, but things like “Not Secure” seem to be taken seriously by Web devs. Google in particular has used that to great effect by incentivising HTTPS adoption, disincentivising HTTP resources on HTTPS pages, etc.

                        And now you have two parsers, so not only are any potential security exploits are still there, but it is now even more likely that you’ll forget about some edge case.

                        That is indeed a problem. Splitting parsing into a “strict parser” and “lax parser” can obviously never remove exploits, since we’re still accepting the old inputs. It can be useful as an upgrade path though, to “ratchet up” security. It gives us the option to disable lax parsing if desired (e.g. a ‘hardened mode’), it allows future specs to tighten up their requirements (e.g. in HTTP 3, or whatever), it allows user agents which don’t care about legacy inputs to avoid these problems entirely, etc.

                        I definitely agree that having backwards compatible users agents using multiple parsers is bad news for security in general, since as you say it increases attack surface and maintenance issues. That burden could be reduced with things like parser generators (not necessarily yacc; if it’s sufficiently self-contained, it could even be formally verified, like the “fiat cryptography” code that’s already in Chrome), but it’s certainly real and is probably the biggest argument against doing this in an existing general-purpose user agent

                        1. 2

                          I see. Yea, it seems like some way to deprecate weird URLs would be best. Just not sure how to do it. “Not Secure” seems like a reasonable idea, but I wonder if it’d become overloaded.