So there’s been a string of Nagle’s Algorithm posts on Lobsters recently, and I was getting ready to make a comment accordingly. But no, this is just another bug that happens to have the same timestamp. What a world.
My exact same thought. I was surprised that CTRL+F Nagle did not get any results in that article!
This is a really interesting post. It’s about a real problem in a thing I’ve heard of, has a cohesive narrative, is accessibly explained, technically grounded, and has well-presented data, yet it feels a bit nihilistic. There’s a “lessons learned” section, but it’s mostly about why it was hard. The fix was presumably to upgrade a dependency or backport a patch, because it turns out that the problem had already been solved.
What do I do to avoid this kind of problem?
This kind of complexity, presumably introduced to solve a problem in one domain (i.e. slowing down thread scheduling for backgrounded tasks in Android), ended up causing problems in another domain. At some point, someone hits your edge case and, boom: buffer starvation.
Software is complex; it always has been. But the ability of individual developers to reason about the systems we develop on is failing, and we’re stuck with just trying to manage the complexity well. More code coverage, more automated testing. Is that really going to scale when everything is based on frameworks? At some level, you’re assuming that the entire stack under your app, all the way up from libc, works so your app can run. And that Zygote doesn’t spawn a process with a corrupt heap. And so on.
Of course, this isn’t specific to Android, you’ve got X/Wayland, glibc, and the particulars of DEs on desktop Linux. It’s still dependency hell all the way down. OK, fine, let’s start using Nix to tie entire collections of libraries and binaries together so they’re at least compiled with and dynamically loading the right versions. That’s better. Now, why is my desktop still stuttering with particular combinations of hardware and software running?
Maybe a less nihilistic view involves embracing more formal methods and system proofs for the parts of the system everyone relies on. seL4 for a whole system, for instance. Or various functional languages and program synthesis for lower-level components. But formal methods are hard, and you’re paid to ship product instead of write proofs. :-(
Anyway, what I’m saying is… make sure to support your local systems engineers so we can all build applications with better foundations.
Don’t design a system that requires such precise timing on a platform that can’t supply it.
The playback system was already fragile.
I totally agree. The lesson is very simple, and I’d argue that it doesn’t even need to be learned, since it’s obvious: if you ask the OS to wake you up in X milliseconds, your code HAS TO handle the X+500 ms case as gracefully as possible. The OS promises nothing at all, and even if it did, handling the late case would still be better practice.
As everyone agrees, proving code correct via formal methods is often too costly to be considered in a commercial setting, but sketching out a quick informal proof in your head as a programmer is always a good practice, even for a one-off script.
Excellent, excellent write-up.
Really enjoyed reading this. Excellent write-up. Thank you very much!
This is the kind of thing that makes me want to avoid all layers above the libc API and use raw pthreads.
I believe it makes some sense to do that for applications with soft real time performance requirements like this, just to minimise the amount of code which could potentially exhibit a perf bug that causes it to miss a deadline.
That’s generally a great idea; however, Android used to have a first-class C API for its A/V stack, but not anymore.
The latest iterations are Java only. In our video recording app, we try to keep the more sensitive real-time code native, but we’re forced to call into the Java side to get the latest APIs :/
Is java.lang.Thread not a raw pthread on Android? :(
This seems really unfortunate, since you probably can’t guarantee the GC will never blow a deadline unless you do something like keep the heap so free of pointers that a full collection takes too little time to matter. To do that in a complex app, you’d need to split it into multiple processes?
pthreads are a library implemented on top of (and sometimes within) libc. You probably meant “raw kernel threads”. The turtles, they keep going.
I did mean pthreads. The semantic distance isn’t big enough that I would worry about it. Similar to how I wouldn’t really worry about the distinction between, say, C/Rust and assembly.
(Back when some pthreads implementations were m:n instead of 1:1, the difference would have been a problem. Those systems essentially don’t exist now. Just some super obsolete versions of FreeBSD and Solaris: both later backed out that decision because it was apparently terrible.)