1. 75
    1. 35

      Networking is the place where I notice how tall modern stacks are getting the most.

      Debugging networking issues inside of Kubernetes feels like searching for a needle in a haystack. There are so, so many layers of proxies, sidecars, ingresses, hostnames, internal DNS resolvers, TLS re/encryption points, and protocols that tracking down issues can feel almost impossible.

      Even figuring out issues with local WiFi can be incredibly difficult. There are so many failure modes and many of them are opaque or very difficult to diagnose. The author here resorted to Wireshark to figure out that 50% of their packets were retransmissions.

      I wonder how many of these things are just inherent complexity that comes with different computers talking to each other and how many are just side effects of the way that networking/the internet developed over time.

      1. 4

        Considering most “container” orchestrators (at least the ones I’ve used) operate on virtual overlay networks and require a whole bunch of fuckery to get them to talk to each other, on top of whatever your observability-platform-of-the-week is, the complexity is both necessary and not. Container orchestration is a really bad way of handling dynamic service scaling IMO. For every small piece of the stack you need yet-another-container™️ which is both super inefficient (entire OS sans-kernel) and overcomplicated.

        1. 5

          I’m not wed to containers, but they often seem like the least bad thing (depending on the workload and requirements). The obvious alternative is managing your own hosts, but that has its own pretty significant tradeoffs.

          1. 4

            Containers themselves are fine for a lot of cases. The networking layer (and also the storage/IO layer) are a large source of complexity that, IMO, is not great. It’s really unfortunate we’re to the point where we’re cramming everything on top of Linux userspace.

          2. 2

            There’s a bunch of different options that have varying degrees of pain to their respective usage, and different systemic properties between each of them.

          3. 2

            For me the killer feature is really supervision and restarting containers that appear to be dead, with the supervision done by a distributed system that can migrate the containers.

    2. 35

      Apparently this code was written by Russ Cox himself and he explains some of the history on the orange site.

      1. 14

        The response to this article really shows the bind that the Go team is in. If they don’t provide bug-for-bug compatibility with C/Linux everyone jumps down their throats. In reality, it’s good that there is a modicum of diversity because someone has reimplemented the networking stack in a new language (largely as a mechanical port from Plan 9 as Russ says), but no one asked HN’s permission, so they’re on the naughty list for it.

        1. 1

          One of the biggest advantages of diversity is ensuring that the protocol remains truly interoperable though, and doesn’t rely on implementation-specific choices. Go’s implementation is fully interoperable, it just has bad performance by default (or alternatively, bad performance given the conventions people are used to). I wouldn’t call not setting TCP_NODELAY bug-for-bug compatibility, just compatibility with people’s expectations.

          Of course there’s an argument to be made that that also puts the Go team in a bind (something something unix2000 something) but it’s a different argument than you’re making IMO.

          Edit: I just read down-thread comments on TCP_QUICKACK and I’m less convinced by my own comment now ;) mostly I just don’t know enough about TCP optimization.

    3. 20

      By coincidence I just discovered the same thing — Go’s TCPConn sneakily turning on NODELAY by default — last week. It was because my boss got obsessive about the bandwidth our sync protocol uses, and started measuring it and asking me why it sends as many bytes as it does. After I accounted for the known overhead there was still extra overhead … which turned out to come from the TCP headers of a whole bunch of tiny packets.

      The protocol is based on WebSockets, and there are some messages that need to be acknowledged by the recipient, so it sends back a tiny message with just the ID number of the original message. No problem, those get glommed together into a single packet, right?

      Nope. In the Go WebSocket library (at least the very common one we use) every WebSocket message is written directly to the Conn, which means it occupies at least one IP packet. Ouch. That’s probably causing a lot of waste in many WebSocket-based protocols implemented in Go.
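
      If anyone else wants to experiment, here’s a minimal sketch of turning Nagle back on for a single connection (the address is a placeholder, and whether this is actually the right fix for a WebSocket-based protocol is a separate question):

      ```go
      package main

      import (
          "log"
          "net"
      )

      func main() {
          // net.Dial("tcp", ...) returns a *net.TCPConn, which Go creates
          // with TCP_NODELAY already set.
          conn, err := net.Dial("tcp", "example.com:443") // placeholder address
          if err != nil {
              log.Fatal(err)
          }
          defer conn.Close()

          if tcp, ok := conn.(*net.TCPConn); ok {
              // SetNoDelay(false) clears TCP_NODELAY, i.e. re-enables
              // Nagle's algorithm for this connection.
              if err := tcp.SetNoDelay(false); err != nil {
                  log.Fatal(err)
              }
          }
      }
      ```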

      1. 1

        Wouldn’t it be up to the WebSocket-library user to batch messages at their desired size or interval instead of the library itself?

        1. 1

          I don’t think so. That means you’re warping your app-level protocol to work around limitations of one platform’s network library.

          For instance, the messages I referred to above are replies to earlier requests. One reply corresponds to one request. Batching up multiple replies into one WebSocket message means building a redundant framing format within WebSockets to delimit the replies, which would be silly.

    4. 14

      Annoyingly, and honestly a bit shockingly, there’s to this day no reasonable way to disable Delayed ACKs.

      Technically speaking TCP_QUICKACK “exists,” but

      1. it’s not portable. (Last I checked the BSDs don’t have it at all.)
      2. it’s not a permanent socket option. Instead it gets reset based on unspecified conditions. And
      3. you can’t do anything about the receiver not having it set.

      Because of all of this it’s become quite common for projects to default to setting TCP_NODELAY: bad interactions between Nagle’s Algorithm and Delayed ACKs come up in a lot of cases, so it’s an easy default when you don’t want people to complain about “random” delays.
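
      For completeness, here’s roughly what setting it looks like from Go on Linux, assuming the golang.org/x/sys/unix package (and keeping point 2 in mind: the kernel may quietly clear the option again, so a long-lived connection would need to re-apply it):

      ```go
      // Hypothetical helper, Linux-only: TCP_QUICKACK doesn't exist on the BSDs.
      package quickack

      import (
          "net"

          "golang.org/x/sys/unix"
      )

      func EnableQuickAck(c *net.TCPConn) error {
          raw, err := c.SyscallConn()
          if err != nil {
              return err
          }
          var sockErr error
          ctlErr := raw.Control(func(fd uintptr) {
              // Note: the option is not sticky; the kernel resets it based on
              // its own (unspecified) heuristics.
              sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_QUICKACK, 1)
          })
          if ctlErr != nil {
              return ctlErr
          }
          return sockErr
      }
      ```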

      1. 10

        Uhu, I feel like the post severely underplays the reason why NODELAY is enabled in the first place. Delayed ACKs really do decimate throughput.

        I have fond memories of learning about this thing. I was writing some networky thing in my student days, and I was optimizing it for throughput. After “it works”, my first optimization was to batch stuff up, so that I do a few big writes rather than a bunch of small ones. And of course that absolutely tanked performance, because delayed ACK waits for the next packet before sending an ACK, and Nagle waits for an ACK before sending the next packet.

        That really is a performance bug in the TCP protocol, and it’s important to highlight that NODELAY is an ugly work-around for that.

        In other words, if you set NODELAY, you optimize for carefully written applications which minimize syscalls. If you don’t set it, you optimize for poorly written applications which send a volley of tiny packets.

        So, really, TCP (which has two optimizations working at cross-purposes) and git-lfs (which does tiny writes while it clearly has half a gig of data to work with) are in the wrong here; Go seems fine.
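
        As a sketch of what the “carefully written” side can look like in Go: gather the small pieces and hand them to the kernel together via net.Buffers, which uses a batched write (writev) where the platform supports it, instead of one Write call per piece. The address is a placeholder:

        ```go
        package main

        import (
            "log"
            "net"
        )

        func main() {
            conn, err := net.Dial("tcp", "127.0.0.1:9000") // placeholder address
            if err != nil {
                log.Fatal(err)
            }
            defer conn.Close()

            // Three tiny frames handed over together (typically one writev call)
            // instead of three separate Writes, so NODELAY doesn't turn them
            // into three tiny packets.
            frames := net.Buffers{
                []byte("ack 1\n"),
                []byte("ack 2\n"),
                []byte("ack 3\n"),
            }
            if _, err := frames.WriteTo(conn); err != nil {
                log.Fatal(err)
            }
        }
        ```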

        1. 1

          It seems difficult for an application or even a user to know which options should be used in a given situation. Especially since computers are a lot more mobile and connect to a lot more networks than when these options were first devised. Perhaps there’s an opportunity here for a heuristic tuner.

          1. 3

            It seems difficult for an application

            I don’t think this is applicable to this situation. The following covers 90% of use-cases, from the application point of view

            • set NODELAY
            • batch writes to socket (e.g., wrap the thing into a BufferedWriter or what not, or design the application to naturally batch)

            Note also that both Nagle’s algorithm and delayed ACK are essentially heuristic tuners…
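
            Roughly, the two bullets above in Go (the address is a placeholder):

            ```go
            package main

            import (
                "bufio"
                "log"
                "net"
            )

            func main() {
                // TCP_NODELAY is already on by default for Go TCP connections.
                conn, err := net.Dial("tcp", "127.0.0.1:9000") // placeholder address
                if err != nil {
                    log.Fatal(err)
                }
                defer conn.Close()

                // Batch in userspace: many small application writes become a
                // few large writes to the socket.
                w := bufio.NewWriterSize(conn, 32<<10) // 32 KiB buffer
                for i := 0; i < 1000; i++ {
                    if _, err := w.Write([]byte("tiny record\n")); err != nil {
                        log.Fatal(err)
                    }
                }
                // Flush at a natural protocol boundary.
                if err := w.Flush(); err != nil {
                    log.Fatal(err)
                }
            }
            ```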

    5. 8

      The problem does not seem to be that TCP_NODELAY is on, but that the packets being sent carry only 50 bytes of payload. If you send a large file, then I would expect that you invoke send() with page-sized buffers. This should give the TCP stack enough opportunity to fill the packets with a reasonable amount of payload. Or am I missing something?

      1. 6

        I disagree. It shouldn’t matter how big or small the application’s write() calls are. Any decent I/O stack should buffer these together to efficiently fill the pipe. I don’t know the details of git-lfs, but it sounds like it’s more complex than just shoveling the bytes of a file into the socket, so there’s probably a valid reason for it to issue small writes.

        1. 14

          I disagree. It shouldn’t matter how big or small the application’s write() calls are. Any decent I/O stack should buffer these together to efficiently fill the pipe

          It’s not clear to me that this is the right solution. I’d expect some buffering, but the more buffering that you do, the more latency and (often more importantly) jitter you introduce. If I send 4 MTU-sized things and then a one-byte thing, do I want the kernel to buffer that until it has a full MTU, or do I want it to send it immediately? Often, I want the latter because I’m sending the one-byte thing because I don’t have anything more to send.

          TCP_NODELAY is intended specifically for the latter case. The use case in the man page is X11 mouse events: you definitely want these sent as soon as possible because the latency will kill usability far more than a reduction in throughput. It definitely shouldn’t be the default.

          I think the root of the problem is that the Berkeley socket interface doesn’t have a good way of specifying intent on each write. I want to be able to say, per packet, whether I care most about latency, throughput, or jitter and then let the network stack do the right thing to optimise for this. If I send a latency-sensitive block through a stream behind a throughput-sensitive one then it should append it and flush the buffer. If I send only throughput-sensitive ones, it should do as much buffering as it likes. If I send jitter-sensitive ones, then it should ensure that the data in the buffer is flushed before it reaches a certain age. Ideally, it should also dynamically tune some of these heuristics based on RTT, for example ensuring that, for latency-sensitive packets, the latency imposed by buffering is not more than 0.5 RTT or similar.

          The problem with sending small writes goes beyond the buffering though. Each write requires the kernel to allocate a new mbuf (or mbuf fragment) to hold the data (Linux calls these sk_buffs). For small fragments, if you’re buffering, these need to be copied into a kernel heap allocation. In contrast, if you do writes that are a multiple of the page size then the kernel can just pin the page for the duration of the syscall (or AIO operation) and DMA directly from that. With KTLS, it can then copy-and-encrypt directly from the direct map into a fixed (per NIC send ring) buffer, unless the hardware has TLS offload, in which case it is a single DMA directly from userspace memory to the device, with no overhead (this, in combination with aio_sendfile and some NUMA-awareness, is how Netflix manages 80+ GiB/s of TLS traffic from a single host).
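
          As a small Go-flavored illustration of the large-writes point: io.Copy from an *os.File into a *net.TCPConn takes the io.ReaderFrom path, which on Linux can turn into sendfile(2), so the data never gets shuffled through userspace in small pieces at all. The file name and address are placeholders:

          ```go
          package main

          import (
              "io"
              "log"
              "net"
              "os"
          )

          func main() {
              f, err := os.Open("big.blob") // placeholder file
              if err != nil {
                  log.Fatal(err)
              }
              defer f.Close()

              conn, err := net.Dial("tcp", "127.0.0.1:9000") // placeholder address
              if err != nil {
                  log.Fatal(err)
              }
              defer conn.Close()

              // *net.TCPConn implements io.ReaderFrom; on Linux io.Copy can hand
              // the whole transfer to sendfile(2) instead of looping over small
              // userspace buffers.
              if _, err := io.Copy(conn, f); err != nil {
                  log.Fatal(err)
              }
          }
          ```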

          1. 4

            If I send 4 MTU-sized things and then a one-byte thing, do I want the kernel to buffer that until it has a full MTU, or do I want it to send it immediately?

            Isn’t that what flushing the stream conveys? Why should the kernel need to guess?

            the Berkeley socket interface doesn’t have a good way of specifying intent on each write.

            Neither does the Go standard library — the ubiquitous Writer interface just has a vanilla Write() method, not Flush. IIRC (don’t have docs handy at the moment) there isn’t a standard interface adding Flush(), though there are types like bufio.Writer that have it as part of their implementation.

            That means Go can’t really do buffering at the lower levels like the TCPConn struct; instead, everything higher level that writes to a stream has to decide whether to do its own buffering or not. The WebSocket library we use has a buffer for assembling a message, but when the message is complete it just calls Write to send it to the Conn …

            TL;DR I think Go’s stream interfaces may have erred on the side of simplicity, making them more difficult to fine-tune for optimal network performance. By contrast, look at Swift’s Combine library, which has really rich support for flow control and backpressure, at the expense of having lots more moving parts.
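
            The usual workaround is an ad-hoc flush check at message boundaries; a sketch, where the flusher interface is purely local (it is not a standard-library interface, though *bufio.Writer happens to satisfy it):

            ```go
            // Hypothetical helper package.
            package msgio

            import "io"

            // flusher is a local interface; *bufio.Writer satisfies it, a bare
            // net.Conn does not.
            type flusher interface {
                Flush() error
            }

            // WriteMessage writes one complete message and flushes only if the
            // writer is buffered, so buffering stays the caller's choice.
            func WriteMessage(w io.Writer, msg []byte) error {
                if _, err := w.Write(msg); err != nil {
                    return err
                }
                if f, ok := w.(flusher); ok {
                    return f.Flush()
                }
                return nil
            }
            ```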

            1. 3

              Isn’t that what flushing the stream conveys?

              It can, but that’s an oddly asymmetric interface: you can flush the buffer, but you can’t tell it to keep buffering.

              Neither does the Go standard library

              This isn’t surprising, the Go library is based on Plan 9 which, in my opinion, is a system that takes the worst ideas in UNIX to their logical conclusion.

              1. 2

                a system that takes the worst ideas in UNIX to their logical conclusion

                I’m curious, which ideas from UNIX are the worst, and why are they bad? Do you have an example?

                1. 2

                  Everything is an unstructured stream of bytes. If you want to make everything an X, pick an X with useful properties, such as introspection and typed interfaces. Even something like COM is better. For example, UNIX ended up with an isatty function to check whether something is a terminal, but no isasocket or isapipe, and in fact neither of those are what I actually want, I want things like is this persistent storage? In a COM system, for example (and COM is awful, I’m using it as an example because if COM is better than what you have then you’re in a really bad place), most uses of isatty would be replaced by a cast to an IColoredOutputStream or similar. If this succeeded, then the destination would provide functions for writing formatting commands interleaved with text-writing commands.

                  At the lowest level, most UNIX devices provide an interface based on ioctl, which is a completely opaque interface. You could make this better by at least providing some standard ioctls that let you query what ioctls a device supports and, in addition, what the argument types are. Unfortunately, because ioctl takes a 32-bit integer (64 on ILP64 platforms, but that basically means Alpha), you end up with different devices using the same ioctl commands for entirely different things.

                  Plan 9 made some of these things a bit better by replacing some devices with directories containing individual files for each command but that massively increases the amount of kernel state that you need to communicate with the device.

                  Much as I dislike DBUS, it would be a much cleaner way of interfacing with a lot of things. The ZFS interfaces started moving in a sensible direction, with a library for creating name-value lists with typed values and having each ioctl take a serialised one of these. These, at least, let you have errors for type confusion rather than the kernel just corrupting userspace memory in arbitrary ways. They also meant that 32-bit compat is easy because the nvlist serialised format is consistent across architectures.

              2. 1

                Sick burn, bro! (I agree with your elucidation below.)

        2. 3

          There is a use case for NODELAY. Just like there is a use case for DELAY. So any discussion about the default behavior appears to be pointless.

          And I don’t see how an application performing a bulk transfer of data using “small” (a few bytes) writes is anything but bad design. Not writing large (e.g., page-sized) chunks of data into the file descriptor of the socket, especially when you know that many more such chunks are to come, just kills performance on multiple levels.

          If I understand the situation the blog post describes correctly, then git-lfs is sending a large (50 MiB?) file in 50-byte chunks. I suspect this is because git-lfs issues writes to the socket with 50 bytes of data from the file. And I am genuinely curious about potential valid reasons to issue small writes in such cases.

          1. 2

            The point of discussing the default is that Go’s implementers chose the opposite default from what a Unix developer is used to. Both are valid, but one tends to assume the socket uses Nagle buffering unless told otherwise. Not so in Go, and this isn’t really documented … I’ve been using Go since 2012 and didn’t learn this until last month.

            I’m curious about the 50-byte writes too. I’m guessing they make some sense at the high level (maybe the output of a block cipher?) and the programmer just assumed the stream they wrote them to had some user-space buffering, only it didn’t. So yeah, application error, but the NODELAY made the effects worse.

            1. 1

              It seems to be forgotten in this discussion that Nagle’s algorithm was not created to fix those kinds of programming mistakes. So why are the platform defaults relevant to this discussion?

              Otherwise, I also believe it is likely that the 50-byte writes are due to an unbuffered stream being used when it should have been a buffered one. Which makes the relevant question whether Go makes it easier to make such errors, e.g., because the default is unbuffered.

        3. 1

          Any decent I/O stack should buffer these together to efficiently fill the pipe

          It seems like this should happen in the userspace part of the stack though? Going to the kernel to just memcpy bytes seems wasteful?

          1. 1

            The normal TCP stack in Linux resides (mostly) in the kernel space. Hence sending data will involve a copy from user space to kernel space. (There are various ways to optimize that, including moving the TCP/IP stack into user space and exposing parts of the network interface card to user space directly, but that is not relevant to this discussion).

            1. 1

              Yes. As that’s all I have to say here, I feel like either I don’t understand what you are trying to say, or vice versa? :)

              To expand on this, the problem with in-kernel buffering is not that you need memcpy (you need to regardless, unless you do something very special about this), but that you need to repeatedly memcpy small bits. As in, it’s much cheaper to move 4k from userspace to kernel space in one go, rather than byte-at-a-time.

              I guess the situation is similar to file IO? You generally don’t want to write(2) to a file directly, you want to wrap that into BufferedStreamWriter or whatnot.

              1. 3

                Yes. As that’s all I have to say here, I feel like either I don’t understand what you are trying to say, or vice versa? :)

                I am sorry, I think you replied to my comment when you actually quoted snej’s comment. At least that is the visual impression I get, and it confused me.

                Yes, it looks like we are on the same page. Always try to perform large writes for optimal performance, e.g., fewer syscalls, and it gives the TCP/IP stack more room for optimization.

                1. 3

                  Nice, always great to notice a trivial miscommunication instead of someoneswrongontheinterneting! :)

    6. 5

      It’s kinda fun to see the inverse of the “mysterious network delay” genre of evergreen network debugging post (usual solution: set TCP_NODELAY!).

      However, it would be nice if the author had done some more investigation before concluding that TCP_NODELAY is at fault. After all, setting TCP_NODELAY is pretty common for HTTP clients — e.g., curl, Python’s http.client and urllib3.

      It seems more likely that git-lfs doesn’t buffer properly. After some code inspection I noticed:

      1. git-lfs opens the file using the unbuffered os.OpenFile function — I guess we’d get one syscall per Read call, then? Ouch.
      2. git-lfs doesn’t set the HTTP client’s Transport.WriteBufferSize, so it uses the default of 4 KiB — a bit small for bulk transfers?

      Introducing the magic OS-level buffering of Nagle’s algorithm won’t fix tiny filesystem reads, nor undersized buffers. The argument against TCP_NODELAY by default seems specious when making bulk transfers fast always requires looking through the full stack.
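
      For reference, both knobs from the list above are easy to turn in Go. A rough sketch (this is not git-lfs’s actual code; the URL and file name are placeholders):

      ```go
      package main

      import (
          "bufio"
          "log"
          "net/http"
          "os"
      )

      func main() {
          // WriteBufferSize defaults to 4 KiB; 64 KiB is friendlier to bulk uploads.
          client := &http.Client{
              Transport: &http.Transport{WriteBufferSize: 64 << 10},
          }

          f, err := os.Open("big.blob") // placeholder file
          if err != nil {
              log.Fatal(err)
          }
          defer f.Close()

          // Wrap the *os.File so small reads of the request body are coalesced
          // into 64 KiB reads against the filesystem.
          body := bufio.NewReaderSize(f, 64<<10)

          req, err := http.NewRequest(http.MethodPut, "https://example.com/upload", req_body(body))
          if err != nil {
              log.Fatal(err)
          }
          resp, err := client.Do(req)
          if err != nil {
              log.Fatal(err)
          }
          resp.Body.Close()
      }

      // req_body exists only to keep the example explicit about the body type.
      func req_body(r *bufio.Reader) *bufio.Reader { return r }
      ```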

      Given the leap to a conclusion and the inflammatory title, I think this is more a rant about Golang than technical networking content.

      1. 0

        +1

    7. 19

      I wasn’t able to dig out why Go chose to disable Nagle’s algorithm, though I assume a decision was made at some point and discussed.

      I don’t know, isn’t Go known for the core team simply making decisions for high-impact changes based on whatever suits them immediately best?

      1. 9

        Thanks for the troll votes, but @ewintr’s comment links to this being a completely reasonable assumption. It reminds me of Russ Cox’s comments on syntax highlighting or Pike’s suggestion acme didn’t need alternate colour scheming or keyboard shortcuts because he “did some tests”.

        1. [Comment removed by author]

      2. [Comment removed by author]