1. 43
  1.  

  2. 20

    I love plain text protocols, but … HTTP is neither simple to implement nor fast to parse.

    1. 7

      Yeah the problem of parsing text-based protocols in an async style has been floating around my head for a number of years. (I prefer not to parse in the async or push style, but people need to do both, depending on the situation.)

      This was motivated by looking at the nginx and node.js HTTP parsers, which are both very low level C. Hand-coded state machines.


      I just went and looked, and this is the smelly and somewhat irresponsible code I remember:

      https://github.com/nodejs/http-parser/blob/master/http_parser.c#L507

      /* Proxied requests are followed by scheme of an absolute URI (alpha).
       * All methods except CONNECT are followed by '/' or '*'.
       */

      I say irresponsible because it’s network-facing code with tons of state and rare code paths, done in plain C. nginx has had vulnerabilities in the analogous code, and I’d be surprised if this code didn’t.


      Looks like they have a new library and admit as much:

      https://github.com/nodejs/llhttp

      Let’s face it, http_parser is practically unmaintainable. Even introduction of a single new method results in a significant code churn.

      Looks interesting and I will be watching the talk and seeing how it works!

      But really I do think there should be text-based protocols that are easy to parse in an async style (without necessarily using Go, where goroutines give you your stack back)

      A while back I did an experiment with netstrings, because length-prefixed protocols are easier to parse async than delimiter-based protocols (like HTTP and newlines). I may revisit that experiment, since Oil will likely grow netstrings: https://www.oilshell.org/release/0.8.7/doc/framing.html
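To illustrate why length prefixes help in the push style, here is a minimal sketch of an incremental netstring decoder. The class name `NetstringParser`, the `feed` method and the size cap are assumptions made for this example, not part of Oil or of the experiment mentioned above. The only state carried between calls is the unconsumed buffer; there is no hand-coded state machine.

```python
class NetstringParser:
    """Incremental (push-style) netstring decoder; feed() accepts arbitrary chunks."""

    MAX_LEN = 1 << 20      # arbitrary payload cap chosen for this sketch
    MAX_PREFIX_DIGITS = 7  # enough digits to express MAX_LEN

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data):
        """Append data and return a list of every payload that is now complete."""
        self._buf += data
        out = []
        while True:
            colon = self._buf.find(b":")
            if colon == -1:
                # Length prefix not finished yet; what we have must be digits.
                if self._buf and not self._buf.isdigit():
                    raise ValueError("invalid length prefix")
                if len(self._buf) > self.MAX_PREFIX_DIGITS:
                    raise ValueError("length prefix too long")
                return out
            prefix = bytes(self._buf[:colon])
            if not prefix.isdigit():
                raise ValueError("invalid length prefix")
            n = int(prefix)
            if n > self.MAX_LEN:
                raise ValueError("payload too large")
            end = colon + 1 + n          # index of the trailing comma
            if len(self._buf) <= end:
                return out               # payload or comma has not arrived yet
            if self._buf[end] != ord(","):
                raise ValueError("missing trailing comma")
            out.append(bytes(self._buf[colon + 1:end]))
            del self._buf[:end + 1]      # drop the consumed netstring
```

Because the length is known before the payload arrives, the decoder never has to scan payload bytes for a delimiter. The sketch does not reject leading zeros in the prefix, which a strict implementation would.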


      OK wow that new library uses a parser generator I hadn’t seen:

      https://llparse.org/

      https://github.com/nodejs/llparse

      which does seem like the right way to do it: do the inversion automatically, not manually.
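As a rough illustration of "automatic inversion" (this is not how llparse works, which generates a state machine from a description), a Python generator lets the parser be written in straight-line pull style while a driver pushes chunks into it. The names `request_line_parser` and `push_all` are invented for this sketch.

```python
def request_line_parser():
    """Pull-style parser written as a generator; each yield asks for more bytes."""
    buf = b""
    while b"\r\n" not in buf:
        chunk = yield            # suspend until the driver pushes more data
        buf += chunk
    line, _, rest = buf.partition(b"\r\n")
    # Raises ValueError if the line does not have exactly three fields.
    method, target, version = line.split(b" ")
    return method, target, version, rest


def push_all(chunks):
    """Drive the generator in push style, the way an event loop would."""
    parser = request_line_parser()
    next(parser)                 # run the parser up to its first yield
    for chunk in chunks:
        try:
            parser.send(chunk)
        except StopIteration as done:
            return done.value    # the parser finished and returned its result
    return None                  # input ended before a full line arrived
```

The generator keeps its local variables across suspensions, so the "where was I?" bookkeeping that a hand-coded state machine tracks explicitly is handled by the language runtime.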

      1. 4

        Was going to say this. The fact that implementations misbehave around things like Content-Length and Transfer-Encoding: chunked, which is what enables request smuggling, seems to imply it’s too complex. Plus, I still don’t know which response code is appropriate for every occasion.
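A minimal sketch of the kind of check involved, assuming a deliberately strict policy that rejects any request whose body framing is ambiguous. The function name `body_framing` and its return shape are made up for this example; real servers differ in how lenient they are, which is exactly where smuggling comes from.

```python
def body_framing(headers):
    """Decide how a request body is framed, rejecting ambiguous combinations.

    `headers` is a list of (name, value) string pairs as received.
    Returns ("chunked", None), ("length", n) or ("none", 0).
    """
    lengths = [v.strip() for k, v in headers if k.lower() == "content-length"]
    encodings = [v.strip().lower() for k, v in headers
                 if k.lower() == "transfer-encoding"]

    if lengths and encodings:
        # Two parsers that disagree on which header wins can be desynchronized.
        raise ValueError("both Content-Length and Transfer-Encoding present")
    if encodings:
        if encodings != ["chunked"]:
            raise ValueError("unsupported Transfer-Encoding")
        return ("chunked", None)
    if lengths:
        value = lengths[0]
        if len(set(lengths)) != 1 or not (value.isascii() and value.isdigit()):
            raise ValueError("invalid or conflicting Content-Length")
        return ("length", int(value))
    return ("none", 0)
```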

        1. 2

          Curious what part of HTTP you think is not simple? And on which side (client, server)?

          1. 5

            There’s quite a bit. You can ignore most of it, but once you get to HTTP/1.1 where chunked-encoding is a thing, it starts getting way more complicated.

            • Status code 100 (continue + expect)
            • Status code 101 - essentially allowing hijacking of the underlying connection to use it as another protocol
            • Chunked transfer encoding
            • The request “method” can technically be an arbitrary string - protocols like webdav have added many more verbs than originally intended
            • Properly handling caching/CORS (these are more browser/client issues, but they’re still a part of the protocol)
            • Digest authentication
            • Redirect handling by clients
            • The Range header
            • The application/x-www-form-urlencoded format
            • HTTP/2, which is now a binary protocol
            • Some servers allow you to specify keep-alive to leave a connection open to make more requests in the future
            • Some servers still serve different content based on the User-Agent header
            • The Accept header

            There’s more, but that’s what I’ve come up with just looking quickly.
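To give a sense of what just one item on that list costs, here is a minimal sketch of a chunked transfer decoder. It assumes the whole body is already in memory, discards chunk extensions and ignores trailer fields; the function name `decode_chunked` is invented for this example. A streaming version that tolerates arbitrary read boundaries is noticeably more involved.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def decode_chunked(data):
    """Decode a complete chunked body held in memory (trailers are ignored)."""
    body = bytearray()
    pos = 0
    while True:
        eol = data.find(b"\r\n", pos)
        if eol == -1:
            raise ValueError("missing chunk-size line")
        # Anything after ';' is a chunk extension, discarded in this sketch.
        size_field = data[pos:eol].split(b";", 1)[0].strip()
        if not size_field or any(c not in HEX_DIGITS for c in size_field):
            raise ValueError("invalid chunk size")
        size = int(size_field, 16)
        pos = eol + 2
        if size == 0:
            return bytes(body)   # last chunk reached
        chunk = data[pos:pos + size]
        if len(chunk) != size or data[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("truncated or malformed chunk")
        body += chunk
        pos += size + 2
```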

            1. 3

              Would add to this that it’s not just complicated because all these features exist, it’s very complicated because buggy halfway implementations of them are common-to-ubiquitous in the wild and you’ll usually need to interoperate with them.

              1. 1

                And, as far as I know, there is no conformance test suite.

                1. 1

                  Ugh, yes. WPT should’ve existed 20 years ago.

              2. 2

                Heh, don’t forget HTTP/1.1 Pipelining. Then there’s caching, and ETags.

            2. 2

              You make a valid point. I find it easy to read as a human being, though, which is also important when dealing with protocols.

              I’ve found a lot of web devs I’ve interviewed have no idea that HTTP is just plain text over TCP. When the lightbulb finally goes on for them a whole new world opens up.
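For anyone who has not seen it, the "plain text over TCP" point can be shown in a few lines. This is a sketch for HTTP/1.1 without TLS; the helper names `build_request` and `fetch` are invented here, and `fetch` needs network access to a server that still answers on port 80.

```python
import socket


def build_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fetch(host, port=80, path="/"):
    """Send the request over a plain TCP socket and return the raw response."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:         # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Printing the start of `fetch("example.com")` shows a status line and headers that are readable as-is, which is the lightbulb moment described above.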

              1. 4

                It’s interesting to note that while “original HTTP” was plain text over TCP, we’re heading toward a situation where HTTP is a binary protocol run over an encrypted connection and transmitted via UDP—and yet the semantics are still similar enough that you can “decode” back to something resembling HTTP/1.1.

                1. 1

                  UDP? I thought HTTP/2 was binary over TCP. But yes, TLS is a lot easier thanks to ACME cert issuance and Let’s Encrypt for sure.

                  1. 2

                    HTTP/3 is binary over QUIC, which runs over UDP.

              2. 1

                SIP is another plain text protocol that is not simple to implement. I like it and it is very robust though. And it was originally modeled after HTTP.

              3. 9

                Somewhat related, I’ve been trying to collect plaintext-inspired file formats in a project I’ve been calling Human Intermediate Formats. HIFs would allow you to use common Unix tools on plaintext lossless representations of other file types.

                For example, while I don’t consider QDF an HIF, I’ve used it to provide a diff to a PDF to fix mistakes and add functionality to an existing PDF. I’ve used midicomp to better understand the structure of SMF MIDI files, to do minor transformations with sed, and as a gitattributes textconv filter to review changes to my MIDI projects.

                If you’re aware of other formats I should consider as a Human Intermediate Format, do let me know by filing an issue.

                1. 8

                  In my opinion, an obligatory reference on the subject is chapter 5, “Textuality”, of Eric Raymond’s “The Art of UNIX Programming”: it goes into the details of several formats and comes up with a series of recommendations for plain-text formats.

                  1. 6

                    This article goes into a lot of detail about how hard it is to safely parse a plain-text protocol like HTTP: https://fasterthanli.me/articles/aiming-for-correctness-with-types

                    1. 5

                      It’s easy to forget how varied and hairy these protocols get in the real world, which makes writing a safe and easy-to-understand parser hard for anything beyond a trivial example. Everyone has mentioned HTTP already, and I have recent experience with NNTP and mail to say it’s not all roses.

                      From my experience, I think binary is easier when your tools can do it better. Things like Wireshark can visualize binary protocols, and as mentioned, if your PL has binary pattern matching, then parsing a binary protocol by hand is entirely trivial. Or you could just machine generate it, and avoid the ambiguities. The problem is programmers have been happily using tools that made things harder than they had to be.

                      1. 2

                        Oh gods imagine how much easier network protocol parsing would’ve been if every PL had had Erlang’s binary pattern matching.

                      2. 3

                        Looks like this made it over to the orange site and is doing just as well. Kinda cool since this is a first for me.

                        1. 3

                          It really does not matter whether a protocol is text or binary if you have well-thought-out and complete rules for parsing its messages. But if you don’t, plain text protocols are a sure way to shoot oneself in the foot with a majestic railgun, much worse than binary protocols (which tend to break down really fast when poorly designed and start causing problems long before they hit any production systems).

                          See also: LANGSEC - http://langsec.org/ (I feel obliged to link it here, as you mention neither “parser” nor “grammar” in your post).

                          1. 2

                            One issue not addressed here is that plain text protocols are typically hand implemented (a recursive descent parser), where binary protocols are often machine generated (protobuf, grpc, etc.). HTTP is in theory parse-able by a number of parser generators, but in practice it and other text based protocols are hand implemented, leading to bugs, security vulns, and other problems.

                            The ideal seems to be something like HTTP, where a client/server can fall back to 1.1 if 2.0 isn’t jointly supported, and is general enough to support most anything you’d want to do with a protocol. Similarly, most machine generated formats like protobuf have a text based format as fallback.

                            1. 7

                              And this, in turn, leads to security problems. Parsing untrusted data is one of the biggest sources of security vulnerabilities. The easier a protocol is to parse, the easier it is to secure. Most binary protocols require trivial binary pattern matching to parse. About the only check that you ever need to do is whether an offset is contained within a packet and even in an unsafe language it’s pretty easy to abstract that away into a single place. Binary protocols can often just use the header information as-is and parse lazily. Flat Buffers, for example, just uses the wire-protocol message in a buffer and provides accessors that do any offset calculation or endian conversion necessary. The cost of parsing is so low that you can re-parse a field every time you access it.
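A minimal sketch of that idea using Python’s `struct` module: accessors read fields lazily from the received buffer, and a single helper does the only bounds check. The wire layout and the class name `PacketView` are invented for this example; this is not the Flat Buffers API or format.

```python
import struct


class PacketView:
    """Lazy accessors over a received buffer; nothing is decoded up front.

    Hypothetical wire layout (invented for this sketch):
      offset 0: u16 message type, big-endian
      offset 2: u32 payload length, big-endian
      offset 6: payload bytes
    """

    def __init__(self, buf):
        self._buf = memoryview(buf)

    def _check(self, offset, size):
        # The single place where an offset is validated against the packet.
        if offset < 0 or size < 0 or offset + size > len(self._buf):
            raise ValueError("field lies outside the packet")

    @property
    def msg_type(self):
        self._check(0, 2)
        return struct.unpack_from(">H", self._buf, 0)[0]

    @property
    def payload_len(self):
        self._check(2, 4)
        return struct.unpack_from(">I", self._buf, 2)[0]

    @property
    def payload(self):
        n = self.payload_len     # re-read on every access; it is cheap
        self._check(6, n)
        return bytes(self._buf[6:6 + n])
```

A length field that claims more bytes than the packet holds is caught by `_check` rather than by each accessor separately, which is the "abstract it into a single place" point made above.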

                              1. 1

                                Grpc is huge. It depends on a large amount of hand written parsing code. Using it is unlikely to reduce the amount of hand written parsing in your system.

                                I don’t mind binary protocols, especially if I can handle them with something like Python’s struct module, but grpc is just a bad example. It’s amazing how little functionality they packed into such a huge amount of code.

                                1. 1

                                  I feel like if the issue is that text protocols use handcrafted and insecure parsers, the ideal would then be to have clients and servers that only use 2.0. Allowing fallback is the status quo and doesn’t fix any security hole.

                                  The way I see it, the ideal would be to have some non-default dev mode where text protocols are enabled, and/or to have those text protocols support only the simplest subset of robust features.