1. 54

  2. 10

    This article very accurately describes how I approach understanding systems. I would like to write a long comment, but I feel like almost everything I wanted to say has already been said better by the author.

    One specific point I had on my to-write-about list for a long time now is this:

    Learning more about software systems is a compounding skill. The more systems you’ve seen, the more patterns you have available to match future systems against, and the more skills and tricks and techniques you develop to apply to future problems.

    I mentally describe this as a logarithmic learning technique. When you have a system with n layers on the path you are looking at, a reasonable guess about the size of the state space you’re dealing with is to say that it is “on the order of exp(n)”. Clearly, thinking about such a system as a whole does not work for any non-trivial value of n. However, learning and thinking about a core set of rules and behaviors of the involved layers only takes work “on the order of n”, which is the same as log(exp(n)), hence the name.
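    The exp-vs-log intuition can be put in concrete numbers. A minimal sketch, with made-up figures (n_layers and k_states are assumptions, not from the comment):

    ```python
    # Toy model: a stack of n layers, each with k relevant states/behaviors.
    n_layers = 6
    k_states = 10

    # Reasoning about the combined system means facing every combination
    # of layer states: k**n, i.e. "on the order of exp(n)".
    combined = k_states ** n_layers

    # Learning each layer's core rules separately costs only k per layer:
    # k*n, i.e. "on the order of n", which is log(exp(n)).
    layered = k_states * n_layers

    print(combined, layered)  # 1000000 vs 60
    ```

    Even for this small toy stack, the layered approach is four orders of magnitude less work.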

    As an example, I once debugged an issue where a file transfer through a proprietary RPC failed for certain files at the very end of the transfer. Using Wireshark, a reasonable guess was that TCP segments - and hence Ethernet frames - of a certain length would not get ACKed. The file transfer involved a number of layers I knew little about.

    However, by looking at the layering of the RPC protocol it was not a monumental task to deduce how files get split into packets, framed with some meta information, and then written to a TCP socket before waiting for an answer from the remote side. This allowed us to predict which files, meaning which sizes, would not get transferred properly. We were able to pad files of offending sizes and thus provide a hotfix. In a second step this allowed a very simple repro case to be written: ping with a -s size chosen to produce Ethernet frames of 1498 or 1499 bytes. From there it was easy to convince our vendor that yes, the Ethernet driver was broken and they’d fix it.
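    The arithmetic behind picking the -s value can be sketched as follows. This assumes standard header sizes (IPv4 without options, ICMP echo, frame size counted without the 4-byte FCS); the exact repro of course depended on the vendor's stack:

    ```python
    # Which `ping -s` payload produces a target Ethernet frame size?
    ETH_HEADER = 14    # dst MAC + src MAC + EtherType
    IPV4_HEADER = 20   # IPv4 header without options
    ICMP_HEADER = 8    # ICMP echo request header

    def ping_payload_for_frame(frame_size: int) -> int:
        """Return the -s value that yields a frame of `frame_size` bytes."""
        return frame_size - ETH_HEADER - IPV4_HEADER - ICMP_HEADER

    for frame in (1498, 1499):
        print(frame, ping_payload_for_frame(frame))  # 1456 and 1457
    ```

    So under these assumptions, `ping -s 1456` and `ping -s 1457` would hit the offending frame sizes.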

    1. 7

      Awesome post! Loved this line near the closing:

      Computers are complex, but need not be mysteries. Any question we care to ask can, in principle and usually even in practice, be answered. Unknown systems can be approached with curiosity and determination, not fear.

      This post reminded me of one of my favorite debugging war stories. I covered it briefly and at a high level in “Shipping the Second System”:

      We upgraded [our infrastructure], but then, discovered bugs in [the] network layer that stopped things from working. Our DevOps team reasonably pushed for a new “stable” Ubuntu version. We adopted it, thinking it’d be safe and stable. Turned out, we hit kernel/driver incompatibility problems with Xen, which were only triggered due to the scale of our bandwidth requirements.

      But, at the time of the bug, I wrote up a detailed postmortem about it, and, after reading the OP here, I went back to my postmortem. Here were the layers involved:

      • 1/ Python backend code. We had Python code which was responsible for processing terabytes of data per day, in parallel, across a cluster of machines.
      • 2/ Parallel execution & cluster manager (Storm). We had upgraded a system which does real-time parallel processing for our live production backend. That system (which we treated as ‘black box infrastructure’) was written in Java + Clojure, and was open source in Apache.
      • 3/ Python <=> JVM communication library (streamparse). We had actually written our own layer for doing Python <=> Java communication, as an open source module named streamparse. So we knew exactly how that layer worked and could instrument it.
      • 4/ Network layer and library (ZeroMQ or Netty). Storm had recently rewritten its networking layer, porting that functionality from ZeroMQ to Netty. Of course, ZeroMQ and Netty are each network layer abstractions; the former is written in C (and bound natively to Java), the latter is written in Java.
      • 5/ Linux kernel. We were upgrading the Linux kernel of our machines to the new “Ubuntu stable” version in 2014.
      • 6/ Xen hypervisor. Since the Linux boxes in question were running on Amazon EC2, they were running under the Xen hypervisor and thus using Xen kernel modules.
      • 7/ Ethernet driver for Linux + Xen. Since our Python code (via Storm, via Netty/ZeroMQ, via the Linux kernel, via Xen) was pushing many terabytes of bits through the cluster, it was putting a lot of load on the physical Ethernet hardware that was running, virtualized, through our VM.

      In this situation, our observability indicated that once we hit a certain level of cluster-wide processing, the network bandwidth between the nodes would drop from gigabytes per second to megabytes per second. Strangely, all the nodes would drop pretty much in unison, and it would only happen after the cluster was running for 20 or 30 minutes under load. As part of diagnosing this problem, I went as deep as instrumenting debug information into layers 3 and 4 above. Everyone was convinced, due to the timing of the ZeroMQ => Netty switch in the Storm codebase, that this layer was at fault. But, once I proved that layers 3 and 4 were not at fault (by running ZeroMQ and Netty implementations side-by-side, and discovering that the same fault occurred in both cases), I had to go down into layers 5, 6, and 7.

      Through extensive spelunking through graphs and logs on the nodes in question, I discovered a strange system-wide log message that correlated with the ethernet slowdowns:

      xennet: skb rides the rocket

      This bizarre log message also correlated with “eth0 errors” reported by Munin, which were never seen before in the cluster.

      It turns out that message was due to a Xen Ethernet driver bug in our Ubuntu release, which could be found here. Ultimately, I wrote up my findings in a GitHub gist here, so it could be shared with other members of the Storm community – people were frequently hitting the bug, since it was common to push the network hardware hard on EC2 instances when running Storm. I don’t think I’ll ever forget this war story because, ultimately, the “resolution” was a single arcane ethtool command that simply flipped a bit on the Ethernet driver and caused everything in our system to suddenly work again.
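      For context, the workaround widely reported for the “skb rides the rocket” message on Xen guests was disabling scatter-gather offload on the virtual NIC. The exact command isn’t in the comment, so this is a sketch, assuming the interface is eth0 and that scatter-gather was the offending feature:

      ```shell
      # Disable scatter-gather offload on the virtual NIC (run as root).
      # Interface name and feature flag are assumptions for illustration.
      ethtool -K eth0 sg off
      ```

      Flipping a single offload bit like this changes how the kernel hands buffers to the driver, which is why it can mask a driver-level fragmentation bug.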

      To this day, whenever my team comes across a gnarly bug in production that we have a hard time tracking down, someone will say, “Kinda like ‘skb rides the rocket’ …”, and everyone will just nod and shudder.

      1. 5

        I have a mental distinction between the type of understanding that is practical when building things and when debugging things (where I count any reliability issue, like performance, as debugging). You cannot possibly know all the layers in a useful way when creating new things – you have to work off black-box models to allow the complex domain logic to fit in your brain.

        But when debugging, things are completely different. At that point you are asking more or less specific questions of the entire system, and the answer both can and will be at any layer in it. You have to understand the layers well enough to tease out those specific answers, which is different from fully understanding all parts of them in general.

        1. 3

          Yeah I agree with this and would add performance and security as cross-cutting concerns that lead to understanding (as the article mentions).

          An irony is that the people who wrote the software sometimes understand it the least! It’s possible to kinda follow patterns and pile on functionality without understanding, and in many cases that’s the best way to get the product done.

          But when you need to fix bugs you often end up pulling on threads that require a lot of understanding. Performance and security problems are particular kinds of bugs that often straddle abstraction layers. Logic bugs are sometimes confined to a particular layer, but performance and security bugs more often span layers.

        2. 4

          I’d like to add another pitfall: obsessing over optimizing or eliminating intermediate layers when one should be focusing on meeting business requirements in higher layers. I’m prone to this myself. It’s especially tough in the current move-fast-and-disrupt-things environment of startups, or even teams within larger companies trying to be nimble like a startup.