1. 12

  2. 1

    Browsers are basically the only programs that try to do useful things with garbage input. I do not understand why this was decided, but it’s too hard to go back on it now.

    Every other program, from word processors to text editors, either fails to open malformed junk or shows you garbage.

    About the only other program that “always works” is a hexdump tool.

    1. 4

      csv gets close to this too. Python’s csv parser will, by default, generally not emit an error regardless of what you throw at it. It generally just tries to find a parse. (There are some exceptions, such as field length limits.) Of course, you can also enable its “strict” dialect option, which is more eager about raising errors, but it’s disabled by default.

      (Yes, yes, despite the fact that csv has a spec, a lot of data still doesn’t conform strictly to it. And it’s really annoying to have your csv parser fall on its face when you get bad csv data. So “good” csv parsers like Python’s go out of their way to avoid any errors. And no, sometimes you don’t get to choose to use csv data. Sometimes that’s just what you’re given.)
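
      To make the lenient-versus-strict difference concrete, here is a minimal sketch using the standard library’s `csv.reader` and its `strict` dialect parameter. The sample input and the helper names `parse` and `try_parse` are made up for illustration; the input has stray text after a closing quote, which is one of the cases the strict setting rejects.

      ```python
      import csv
      import io

      # Hypothetical malformed input: text follows the closing quote of a field.
      MALFORMED = '"abc"def,ghi\n'


      def parse(text, strict=False):
          """Parse all rows, passing `strict` through to csv.reader."""
          return list(csv.reader(io.StringIO(text), strict=strict))


      def try_parse(text, strict):
          """Return the parsed rows, or a description of the csv.Error raised."""
          try:
              return parse(text, strict=strict)
          except csv.Error as exc:
              return f"csv.Error: {exc}"


      if __name__ == "__main__":
          # Default (lenient) mode finds some parse and keeps going.
          print("lenient:", try_parse(MALFORMED, strict=False))
          # Strict mode refuses the same input.
          print("strict: ", try_parse(MALFORMED, strict=True))
      ```

      In the default mode the stray characters are simply glued onto the field; with `strict=True` the same line raises `csv.Error`.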

      1. 2

        This is very fair. CSV is maybe not quite as bad, if only because the format isn’t as complex, and totally malformed content (for example, /dev/urandom) will certainly produce absolute garbage with a CSV handler. But CSV parsers do sometimes try dangerously hard to handle total nonsense. Probably the closest thing out there.

      2. 2

        “I do not understand why this was decided, but it’s too hard to go back on it now.”

        I don’t think anyone said “let’s just accept anything”. It was more of a race to the bottom between browsers competing to support as many web pages as possible, with varying qualities of code, which created informal standards. And since “formal” standards don’t really mean anything, they could either be totally forgotten, or adapt and accept the circumstances before it was too late. That’s at least my, slightly mythologised, understanding.

        1. 2

          But they did say exactly that. See Postel’s Law.

        2. 2

          If you turn garbage into output, people will publish garbage because they test with your browser. Then visitors will see garbage until they too switch to your browser.

          1. 1

            I think the idea is accessibility and compatibility?

          2. 1

            ugh, just ugh. See also my earlier rant about the URL grammar, which is equally loosely specified.

            1. 1

              Also a whatwg spec…

            2. 1

              Hard to enforce or even create a formal, strict grammar when you’ve got decades of content that are non-compliant 😔.

              Also note that the whatwg spec explicitly describes what browsers do, not what they ought to do.

              1. 3

                Yes, it’s all too late now. I wish I could go back in time and shake the early browser vendors into realising what they were doing by accepting malformed crap.

                1. 2

                  But isn’t the point of the spec that you have something to go back to in case of doubt about what should be accepted? You know, to determine whether “what this browser does” is a bug or correct behaviour?

                  Also, a spec is essential when one would want to build a new browser. Without a useful grammar to conform to, you’re just lost in the weeds. This only widens the gulf between the established browsers and any potential newcomers (which is massive already simply by virtue of the enormousness of all the specs taken together).

                  1. 1

                    We tried xhtml. It didn’t catch on.

                2. 1

                  Is there any good offline “strict” validator for HTML now? I used to use “tidy”, but I remember that the last time I used it, it didn’t understand some HTML5 stuff? Maybe my package was out of date.

                  I guess the whole concept of “tidy” is out of date because of the looseness of HTML5 and browser implementations?

                  Another problem is that almost nobody writes pure HTML by hand. It’s usually mixed in with Markdown, JavaScript, or PHP.

                  Still, it would be nice to emit something stricter than HTML5 in the hopes of being able to render it with something other than 10 million lines of code in WebKit and whatnot …

                  Another thought: This is interesting because it exposes the limits of grammars. Grammars are basically a boolean function – they tell you yes or no, if the string is part of the language.

                  But that is not all you want to know. You want to know what to do with the input string, and for that the bare minimum / lowest common denominator is to construct a tree from it.

                  So this blog post would be more interesting if it addressed “what tree” the HTML5 spec says to make. The grammar approach is interesting but pretty incomplete. The model doesn’t fit reality.

                  I said the same thing at the end of this post – i.e. that grammars don’t fit reality. PEGs and Pratt parsing are two “outliers” from grammar-based parsing.
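
                  A toy sketch of the “boolean function vs. tree” distinction, using balanced parentheses rather than HTML. Both function names are invented for this example, and this has nothing to do with the actual HTML5 tree-construction algorithm. The recognizer only answers yes or no, while the parser returns a structure you can actually do something with.

                  ```python
                  def recognize(text):
                      """Grammar as a boolean function: is `text` a balanced-paren string?"""
                      depth = 0
                      for ch in text:
                          if ch == "(":
                              depth += 1
                          elif ch == ")":
                              depth -= 1
                              if depth < 0:
                                  return False  # closed more than was opened
                          else:
                              return False  # character outside the alphabet
                      return depth == 0


                  def build_tree(text):
                      """Parser: return nested lists mirroring the parens, or raise ValueError."""
                      root = []
                      stack = [root]  # stack[-1] is the node currently being filled
                      for ch in text:
                          if ch == "(":
                              node = []
                              stack[-1].append(node)
                              stack.append(node)
                          elif ch == ")":
                              if len(stack) == 1:
                                  raise ValueError("unbalanced ')'")
                              stack.pop()
                          else:
                              raise ValueError(f"unexpected character {ch!r}")
                      if len(stack) != 1:
                          raise ValueError("unclosed '('")
                      return root


                  if __name__ == "__main__":
                      print(recognize("(()())"))   # yes/no answer only
                      print(build_tree("(()())"))  # nested lists, one per paren pair
                  ```

                  Note that even this parser is strict: it raises on bad input. A browser-style parser would additionally have to say which tree to build for every string the recognizer rejects.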


                  1. 1