The two most ‘fun’ bugs that I’ve encountered (I wasn’t responsible for fixing either of them):
For a while, there was an issue on FreeBSD (I can’t remember if this was merged code or a branch) where, on multicore systems, data would be randomly corrupted. Investigation showed that the objects being corrupted were always accessed while protected by a lock. The root cause of the bug was that someone had accidentally removed the annotation that caused some locks to be strongly aligned and, occasionally, they would span cache-line boundaries. Until fairly recently, LOCK-prefixed instructions on x86 chips that spanned such a boundary would not trap; they’d just silently not be atomic. This let two threads simultaneously acquire the same lock, but only in very rare cases.
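As a rough illustration of the alignment problem, here is a minimal C11 sketch of the kind of check that catches a lock word straddling a cache line. The names `demo_lock` and `spans_cache_line` are made up for this example, and the 64-byte line size is an assumption (typical for x86, not universal); this is not the FreeBSD code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumption: typical for x86, not universal */

/* True if the `size` bytes starting at `addr` touch more than one cache line.
 * `size` must be non-zero. */
static bool spans_cache_line(uintptr_t addr, size_t size)
{
    return addr / CACHE_LINE_SIZE != (addr + size - 1) / CACHE_LINE_SIZE;
}

/* Hypothetical lock type; the alignment specifier plays the role of the
 * annotation that went missing in the story above. */
struct demo_lock {
    _Alignas(CACHE_LINE_SIZE) uint64_t word;
};

/* Fail the build, rather than corrupt data at run time, if the alignment
 * is ever dropped. */
static_assert(_Alignof(struct demo_lock) == CACHE_LINE_SIZE,
              "demo_lock must be cache-line aligned");
```

Aligning to the full line is stronger than strictly necessary: an 8-byte word that is naturally aligned can never span a 64-byte line. For example, `spans_cache_line(60, 8)` is true, while `spans_cache_line(56, 8)` and `spans_cache_line(64, 8)` are false.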
The more fun one was actually a bug in the prototype CHERI CPU. The observable behaviour was that the network was really, really slow. Further investigation showed that we were seeing a lot of packets fail checksums (which caused TCP to retransmit, so no failure was reported to userspace). This was eventually tracked down to the CPU performing speculative loads even of uncached memory. The network interface was a memory-mapped FIFO, so if you read from the MMIO address you popped a word from the network device’s buffer. If you did this in speculation, you silently discarded a word from the packet, so the network stack would see a packet with a mismatched checksum. Amusingly, the Xbox 360 had a similar bug: there is an instruction to perform uncached reads and this can be executed in speculation, which can have observable side effects.
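The FIFO failure is easy to model in software. The sketch below is a toy under stated assumptions: `toy_fifo`, `toy_receive` and the additive `toy_checksum` are invented for illustration (the real device was memory-mapped hardware and the real check was the TCP checksum), and the "speculative load" is modelled as a read whose result is thrown away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a memory-mapped receive FIFO: every read pops a word. */
struct toy_fifo {
    const uint32_t *words;
    size_t len;
    size_t head;
};

/* Destructive read, like a load from the FIFO's MMIO address.
 * Returns 0 once the FIFO is empty. */
static uint32_t toy_fifo_read(struct toy_fifo *f)
{
    assert(f != NULL);
    return f->head < f->len ? f->words[f->head++] : 0;
}

/* Simple additive checksum; a stand-in for the real TCP checksum. */
static uint32_t toy_checksum(const uint32_t *w, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += w[i];
    return sum;
}

/*
 * Copy n words out of the FIFO. If squashed_at < n, model one speculative
 * load just before word squashed_at: the FIFO pops a word, but the CPU
 * discards the result.
 */
static void toy_receive(struct toy_fifo *f, uint32_t *out, size_t n,
                        size_t squashed_at)
{
    for (size_t i = 0; i < n; i++) {
        if (i == squashed_at)
            (void)toy_fifo_read(f); /* side effect survives, value does not */
        out[i] = toy_fifo_read(f);
    }
}
```

Receiving a four-word packet with `squashed_at` set to 4 (never triggered) gives a matching checksum. Setting it to 2 silently drops the third word, so the receiver sees a shifted packet whose checksum no longer matches, which is the symptom the network stack saw.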
We also had a bug where the SD-card controller on the FPGA platform couldn’t keep up with the CPU’s write rate and so would write alternate blocks. We had 1 GiB of RAM and 512 MiB of SD card, so while the system was working we never actually read back anything that we’d written to disk: it stayed in the buffer cache. It was only after a reboot that we’d find that the root FS was completely broken. This was easier to find because we could look at the SD card, see interleaved zeroes, try a simple boot-and-write, then try a bare-metal write from the CPU without the OS and see that the hardware was at fault.
The most fun bug that I’ve had to find was in the Linux kernel (and, at some point, I really should upstream the fix). In one of the modes for paravirtualised console support (used only by the project I was working on and PV MIPS), it added the virtio console as console 0. It then iterated over consoles and failed to find any. It turns out that the collection that it uses to hold the list of available consoles uses a 0 value of a key as a not-present marker. Registering the console as console 1 fixed the bug. I particularly like this bug, because it took a couple of days to find, but the fix was flipping a single bit (the ASCII code for ‘1’ is one bit different from ‘0’, so it was one bit in both the source and the compiled binary). This is probably the highest ratio of debugging effort to fix size of any bug that I’ve fixed.
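For anyone who hasn't been bitten by a zero-means-absent collection before, here is a minimal sketch of the pattern. The `console_table` type and its functions are hypothetical and are not the kernel's actual data structure; they only show why a key of 0 vanishes, plus the one-bit difference between the two ASCII digits.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONSOLES 8

/* Hypothetical table: each slot holds a console number, and 0 means "empty". */
struct console_table {
    unsigned keys[MAX_CONSOLES];
};

/* Store key in the first empty slot; returns 0 on success, -1 if full. */
static int console_table_add(struct console_table *t, unsigned key)
{
    for (size_t i = 0; i < MAX_CONSOLES; i++) {
        if (t->keys[i] == 0) {
            t->keys[i] = key; /* storing 0 leaves the slot looking empty */
            return 0;
        }
    }
    return -1;
}

/* Count the slots that an iteration would treat as present. */
static size_t console_table_count(const struct console_table *t)
{
    size_t n = 0;
    for (size_t i = 0; i < MAX_CONSOLES; i++)
        if (t->keys[i] != 0)
            n++;
    return n;
}

/* Number of bits that differ between two bytes. */
static int differing_bits(unsigned char a, unsigned char b)
{
    int n = 0;
    for (unsigned x = a ^ b; x != 0; x >>= 1)
        n += x & 1;
    return n;
}

/* '0' is 0x30 and '1' is 0x31, so the digits differ only in the lowest bit. */
static_assert(('0' ^ '1') == 1, "ASCII '0' and '1' differ by one bit");
```

In this model, adding key 0 reports success but `console_table_count` still returns 0, whereas adding key 1 makes the entry visible.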