1. 47
    1. 7

      They should absolutely follow up with a post about Antithesis finding this bug in Antithesis.

      1. 3

        Time-based bugs are particularly interesting and scary. Most of the techniques that we use to make distributed systems more resilient are based on the assumption that failures are independent, and a big part of good distributed systems design is ensuring that this independence isn’t violated. Time is a great correlator. If a cluster of machines is booted all at the same time (e.g. with a single EC2 RunInstances call), it’s possible for any timer to completely break the assumption of independent failures.
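
        A rough way to see the correlation, as a toy model rather than anything from the article: give every machine the same periodic timer and count how many fire at the same instant. The function names, the one-hour period, and the boot offsets below are all made up for illustration.

        ```python
        def firing_times(boot_time, period, count):
            """Times (in seconds) at which a periodic timer fires after boot."""
            return [boot_time + period * k for k in range(1, count + 1)]


        def max_simultaneous(boot_times, period, count):
            """Largest number of machines whose timer fires at the same instant."""
            tally = {}
            for boot in boot_times:
                for t in firing_times(boot, period, count):
                    tally[t] = tally.get(t, 0) + 1
            return max(tally.values())


        # Hypothetical fleets: one booted by a single call, one booted at scattered times.
        same_boot = [0, 0, 0, 0]
        staggered = [0, 7, 19, 42]

        worst_same = max_simultaneous(same_boot, period=3600, count=3)
        worst_staggered = max_simultaneous(staggered, period=3600, count=3)
        ```

        In this model the fleet booted together hits every timer on all four machines at once, while the staggered fleet never has two machines firing in the same second. Real clocks drift and real timers are messier, so treat it only as an illustration of why a shared boot time undermines the independence assumption.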

        1. 2

          This is kind of a nitpick, but I wonder what the actual fix was. The post says:

          "It turned out this was a bug Colin had first fixed in the Xen netfront driver in 2017"

          Was it upgrading to a new version of Xen, or patching existing code to include that fix, etc.?

          I definitely agree about flakiness and non-deterministic bugs – it is a big tax on the whole software dev process

          1. 1

            They call out that this was something fixed in the Xen driver, so I’d imagine they just ported the same fix over to the ENA driver, but I agree they could’ve been clearer:

            Colin’s fix for this was simply to check if the requested MTU matched the current interface MTU and return from the ioctl without doing anything.
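
            The guard described in that sentence can be sketched in a few lines. This is a toy model, not the driver code: `FakeInterface`, `set_mtu`, and `_reinitialize` are hypothetical names, and treating reinitialization as the risky step is an assumption.

            ```python
            class FakeInterface:
                """Hypothetical stand-in for a network interface; not a real driver API."""

                def __init__(self, mtu=1500):
                    self.mtu = mtu
                    self.reinit_count = 0  # times the (assumed) expensive path ran

                def _reinitialize(self, new_mtu):
                    # Assumption: in a real driver this is where the disruptive work happens.
                    self.reinit_count += 1
                    self.mtu = new_mtu

                def set_mtu(self, requested_mtu):
                    # The guard: a request for the MTU we already have does nothing.
                    if requested_mtu == self.mtu:
                        return False
                    self._reinitialize(requested_mtu)
                    return True


            iface = FakeInterface()
            changed_same = iface.set_mtu(1500)   # matches the current MTU, so it is skipped
            changed_other = iface.set_mtu(9001)  # different value, so it takes the slow path
            ```

            Note that in this sketch a request for a genuinely different MTU still goes through the reinitialization path; whether the real driver behaves the same way is not stated in the post.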

            1. 2

              Is it impossible for the MTU to actually change, or does this just make the bug not happen during normal operation?

              1. 3

                I’d imagine the latter, but it’s not very clear. I’ve never had the MTU change out from under me on EC2, and I’ve generally never needed to change it from its default (1500). I’ve also never used Bare Metal instances; I’m sure there are other considerations for those.

          2. 1

            Tough bug. Great report. Awesome seeing Colin Percival jump in there :)

            1. 1

              I’m curious as to why Antithesis uses FreeBSD for their hypervisor instead of Linux. Anyone know?

              1. 5

                My guess is that making a deterministic hypervisor involves a lot of kernel hacking, and the BSD source code is much smaller, more coherent, and maintainable by fewer people than Linux’s.

                Linux is like a big sprawling thing that uses every trick in the book, with code from thousands of people, and hundreds of different companies

                Also, I recall hearing specifically that the hypervisor support in Linux is a nightmare. This might have been from some talks on Solaris/Illumos (e.g. from Bryan Cantrill), which shares some lineage with the BSDs.

                (I am not a kernel hacker, nor do I have any first-hand knowledge, so take this with a grain of salt. However, I do know people who have worked on high-perf networking in the kernel, and they say there is a stark difference between the BSDs and Linux.)

                1. 2

                  I suspect this is the main reason. Another may be the license: bhyve and FreeBSD are licensed under the BSD license, which is more permissive than Linux’s GPLv2.

                2. 2

                  I sent an email to Antithesis and received a reply from their CEO:

                  We decided to start with bhyve because it’s a very simple hypervisor with very few features, so much easier to modify with the deep changes we needed to make in order to make it deterministic and have copy-on-write snapshotting (we wrote about some of those efforts here: https://antithesis.com/blog/deterministic_hypervisor/).

                  The alternative we considered was to write our own hypervisor completely from scratch. We looked at kvm, but decided its architecture was too complex to allow deep modification.