Time-based bugs are particularly interesting and scary. Most of the techniques that we use to make distributed systems more resilient are based on the assumption that failures are independent, and a big part of good distributed systems design is ensuring that this independence isn’t violated. Time is a great correlator. If a cluster of machines are all booted at the same time (e.g. with a single EC2 RunInstances call), it’s possible for any timer to completely break the assumption of independent failures.
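To make the correlation concrete, here is a toy sketch in Python. The timer length and boot times are made-up numbers, and `failure_times` and `max_simultaneous` are hypothetical helpers, not anything from the report. It only shows that an uptime-based timer fires everywhere at once when every machine shares a boot time.

```python
# Hypothetical value: a timer that fires (and triggers the bug) after a fixed uptime.
TIMER_SECONDS = 1000


def failure_times(boot_times, timer_seconds=TIMER_SECONDS):
    """Wall-clock time at which each machine's uptime-based timer fires."""
    return [boot + timer_seconds for boot in boot_times]


def max_simultaneous(times):
    """Largest number of machines whose timer fires at exactly the same moment."""
    return max(times.count(t) for t in set(times))


same_boot = [0, 0, 0, 0, 0]          # whole cluster launched in one call
staggered = [0, 60, 120, 180, 240]   # machines brought up a minute apart

print(max_simultaneous(failure_times(same_boot)))   # 5
print(max_simultaneous(failure_times(staggered)))   # 1
```

With a shared boot time all five machines hit the timer together, so a redundancy scheme that assumes at most one failure at a time gets no help from having five replicas.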
They call out that this was something fixed in the Xen driver, so I’d imagine they just ported the same fix over to the ENA driver, but I agree they could’ve been clearer.
Colin’s fix for this was simply to check if the requested MTU matched the current interface MTU and return from the ioctl without doing anything.
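A rough model of that kind of guard, in Python rather than driver C. This is not the actual ENA driver code: the `Interface` class, the `resets` counter, and the `_reinitialise` step are all assumptions standing in for whatever work the driver does when the MTU really changes.

```python
class Interface:
    """Toy stand-in for a network interface; not the real driver state."""

    def __init__(self, mtu=1500):
        self.mtu = mtu
        self.resets = 0  # counts how often the (assumed) reconfiguration path ran

    def _reinitialise(self, mtu):
        # Placeholder for the work done on a genuine MTU change.
        self.mtu = mtu
        self.resets += 1

    def set_mtu(self, requested):
        # The guard: a request for the MTU we already have is a no-op.
        if requested == self.mtu:
            return
        self._reinitialise(requested)


iface = Interface(mtu=1500)
iface.set_mtu(1500)   # same value: returns early, nothing is touched
iface.set_mtu(9001)   # different value: takes the reconfiguration path
print(iface.mtu, iface.resets)   # 9001 1
```

Note that the guard only skips redundant requests. A request for a genuinely different MTU still goes through the full path in this model.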
I’d imagine the latter, but it’s not very clear. I’ve never had the MTU change out from under me on EC2, and I’ve generally never needed to change it from its default (1500). I’ve also never used Bare Metal instances; I’m sure there are other considerations for those.
My guess is that making a deterministic hypervisor involves a lot of kernel hacking, and BSD source code is much smaller, more coherent, and maintainable by fewer people than Linux
Linux is like a big sprawling thing that uses every trick in the book, with code from thousands of people, and hundreds of different companies
Also I recall hearing specifically that the hypervisor support in Linux is a nightmare. This might have been from some talks on Solaris/Illumos (e.g. from Bryan Cantrill), which shares some lineage with BSDs
(I am not a kernel hacker, nor do I have any first-hand knowledge, so take this with a grain of salt. However, I do know people who have worked on high-performance networking in the kernel, and they say there is a stark difference between the BSDs and Linux.)
I suspect this is the main reason. Another one may be the license, since bhyve and FreeBSD are licensed under the BSD license, which is more permissive than Linux’s GPLv2.
I sent an email to Antithesis and received a reply from their CEO:
We decided to start with bhyve because it’s a very simple hypervisor with very few features, so much easier to modify with the deep changes we needed to make in order to make it deterministic and have copy-on-write snapshotting (we wrote about some of those efforts here: https://antithesis.com/blog/deterministic_hypervisor/).
The alternative we considered was to write our own hypervisor completely from scratch. We looked at kvm, but decided its architecture was too complex to allow deep modification.
They should absolutely follow up with a post about Antithesis finding this bug in Antithesis.
This is kind of a nitpick, but I wonder what the actual fix was
e.g. was it upgrading to a new version of Xen, or patching existing code to include that fix, etc.
I definitely agree about flakiness and non-deterministic bugs – it is a big tax on the whole software dev process
Is it impossible for the MTU to actually change, or does this just make the bug not happen during normal operation?
Tough bug. Great report. Awesome seeing Colin Percival jump in there :)
I’m curious as to why Antithesis uses FreeBSD for their hypervisor instead of Linux. Anyone know?