1. 39

  2. 9

    I'll admit that when I saw the list at the top of this article, I almost closed the tab, thinking it was a low-effort “falsehoods”-style listicle. Glad I kept scrolling, because there’s a huge amount of hard-won insight in here.

    1. 2

      ^this … excellent article, well worth the read, but I also nearly skipped it because the title and the first paragraph felt listicle-ish.

      Glad I made it past though.

    2. 5

      You are lucky if 99.999% of the time network is not a problem.

      As a network admin, this sounds pretty lame. I would be curious to know what their 99.999% network problems are.

      Many of the application-level issues get blamed on the network, for the simple reason that it is hard to troubleshoot and somehow “black magic”. After investigation, however, the issue can usually be traced back to a badly tuned server or an application protocol bug. This is really frustrating, because as the “network guy” you have to justify yourself and prove that the issue is in the upper layers, while nobody takes the time to verify that the network requirements are fulfilled.

      We had a case recently where we tried to replicate a 300Gb table over a 10Mbps link. It was literally taking days to export the binary blobs, when it should technically have taken ~8 hours. Everything seemed fine, except that the upload rate was capped at a few Kbps. When we transferred the same amount of data over FTP, the full bandwidth was used. Same for other protocols like HTTP, SSH, … You name it.
      It turned out that the proprietary protocol has a check that sends the received data back to its peer, chunk by chunk: the latency came from the database re-reading each chunk after receiving it, sending it back, and then waiting for a confirmation before accepting more data. Our databases have been doing this for ages, but as we have a 10Gbps network internally, we never noticed.
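
      For anyone who wants to check the numbers, here is a rough back-of-the-envelope sketch. It assumes "300Gb" means gigabits (the reading under which ~8 hours works out) and models the echo-and-confirm behaviour as a simple stop-and-wait loop. The chunk size, round-trip time, and re-read delay are made-up illustrative values, not measurements from our setup, and the function names are hypothetical.

      ```python
      def ideal_transfer_hours(size_bits: float, link_bps: float) -> float:
          """Hours needed to push size_bits through a fully utilised link."""
          return size_bits / link_bps / 3600


      def stop_and_wait_bps(chunk_bytes: int, rtt_s: float,
                            processing_s: float, link_bps: float) -> float:
          """Effective throughput when every chunk is sent, echoed back,
          and confirmed before the next chunk may be sent.

          Simplified model (an assumption): one cycle costs the time to put
          the chunk on the wire twice (out and echoed back), plus one round
          trip, plus the time the receiver spends re-reading the chunk.
          """
          chunk_bits = chunk_bytes * 8
          cycle_s = 2 * chunk_bits / link_bps + rtt_s + processing_s
          return chunk_bits / cycle_s


      LINK_BPS = 10e6    # 10 Mbps link, from the story above
      TABLE_BITS = 300e9  # 300 Gb, read as gigabits (assumption)

      # Best case: the link is kept full for the whole transfer.
      best_case_hours = ideal_transfer_hours(TABLE_BITS, LINK_BPS)

      # Illustrative guesses only: 512-byte chunks, 200 ms round trip,
      # 300 ms for the database to re-read each chunk.
      effective_bps = stop_and_wait_bps(512, 0.2, 0.3, LINK_BPS)

      print(f"ideal transfer time: {best_case_hours:.2f} h")
      print(f"stop-and-wait throughput: {effective_bps / 1000:.1f} Kbps")
      ```

      With these guessed values the ideal transfer takes about 8.3 hours, while the stop-and-wait loop manages roughly 8 Kbps no matter how fast the link is. The per-chunk wait dominates, which is also why the same behaviour goes unnoticed on a low-latency 10Gbps LAN.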

      So please don’t blame it too much on the network, we have feelings too :)

      1. 1

        Many of the application-level issues get blamed on the network, for the simple reason that it is hard to troubleshoot and somehow “black magic”.

        And sometimes it all gets tangled up together, like when “the network” has proxies or TLS termination, perhaps even implemented in hardware. I remember having a terrible time diagnosing an issue with a Citrix Netscaler where clients would intermittently fail to connect with one particular service that was behind it, as early as the TLS handshake stage. We never managed to figure it out, despite setting up a whole array of monitoring on both “sides” of the blasted thing.

        But it’s possible that the TCP or HTTP behavior of the service was somehow triggering a bug in the Netscaler that was not triggered by the other services; there was a good degree of heterogeneity in implementation.

        It stayed a mystery, and eventually we moved to a different deployment model, and I guess I’ll never know. But that uncertainty stuck with me.

      2. 3

        Please pick a globally unique natural primary key (e.g. a username) where possible.

        Ooof, not a good example to use. :-)

        (Usernames often need to change.)
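
        A minimal sketch of the usual alternative, assuming a surrogate integer key with the username kept as a unique but mutable column. The table and column names are hypothetical, and SQLite is used only because it ships with Python.

        ```python
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("PRAGMA foreign_keys = ON")

        # users.id is a surrogate key; username stays unique but is free to change.
        conn.execute(
            "CREATE TABLE users ("
            " id INTEGER PRIMARY KEY,"
            " username TEXT NOT NULL UNIQUE)"
        )
        # posts reference the surrogate key, never the username itself.
        conn.execute(
            "CREATE TABLE posts ("
            " id INTEGER PRIMARY KEY,"
            " user_id INTEGER NOT NULL REFERENCES users(id),"
            " body TEXT NOT NULL)"
        )


        def rename_user(db: sqlite3.Connection, old: str, new: str) -> int:
            """Change a username in one place; returns the number of rows updated."""
            cur = db.execute(
                "UPDATE users SET username = ? WHERE username = ?", (new, old)
            )
            return cur.rowcount


        user_id = conn.execute(
            "INSERT INTO users (username) VALUES (?)", ("alice",)
        ).lastrowid
        conn.execute(
            "INSERT INTO posts (user_id, body) VALUES (?, ?)", (user_id, "hello")
        )

        renamed = rename_user(conn, "alice", "alice_smith")

        # The post still resolves to the same user after the rename.
        row = conn.execute(
            "SELECT u.username, p.body FROM posts p JOIN users u ON u.id = p.user_id"
        ).fetchone()
        print(renamed, row)
        ```

        The rename touches a single row in `users`; nothing in `posts` has to be rewritten, which is the cost you pay when the username itself is the primary key.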