Ooh, fascinating, gonna read this in detail soon.
I’ve been thinking about this problem from a different angle: beyond a certain point you don’t just get swapping/stalling, you get OOM kills. And for some programs that’s impossible to prevent; in particular, data-intensive batch jobs might just load and process way too much data.
So a different approach is to detect impending OOM events and dump a memory profiling report in advance, so someone can go and optimize the program. My Python memory profiler, intended for offline use, does that. The heuristics are here: https://github.com/pythonspeed/filprofiler/blob/master/memapi/src/oom.rs
(This turns out to be extra tricky because macOS has different memory policies than Linux: it seems to try to keep more RAM available, and correspondingly swaps more aggressively. And then there’s cgroups, which is actually two different things, cgroups v1 and cgroups v2. And setrlimit, but I decided those limits are so wacky that probably no one should use them.)
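To give a rough idea of the cgroups v1/v2 wrinkle: here’s a much-simplified Python sketch of this kind of check (the real heuristics in oom.rs are more involved and also handle macOS; the threshold and function names below are made up for illustration).

```python
def parse_meminfo_available(meminfo_text):
    """Extract MemAvailable from /proc/meminfo contents, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) * 1024  # the field is in kB
    raise ValueError("no MemAvailable field")

def parse_cgroup_limit(limit_text):
    """Parse a cgroup memory limit file; None means "no limit".

    Handles both cgroups v2 (memory.max, which literally says "max" for
    unlimited) and cgroups v1 (memory.limit_in_bytes, which reports
    "unlimited" as a huge sentinel value).
    """
    limit_text = limit_text.strip()
    if limit_text == "max":
        return None
    limit = int(limit_text)
    return None if limit >= 2**62 else limit

def near_oom(threshold=100 * 1024 * 1024):
    """Heuristic: is system-wide available memory below `threshold`?

    A real implementation would also compare cgroup usage against the
    cgroup limit (parsed via parse_cgroup_limit above), since in a
    container that limit is what triggers the OOM kill.
    """
    with open("/proc/meminfo") as f:
        return parse_meminfo_available(f.read()) < threshold
```

The two parse functions are the point: you can’t just read one file and compare, because the “unlimited” encoding differs between cgroup versions, and the relevant number differs between bare-metal and containerized environments.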
For offline profiling a slightly over-aggressive heuristic is fine: the goal is to reduce memory usage, so whether you actually hit OOM or were merely likely to doesn’t matter. What matters is getting a useful “here’s where your memory usage is coming from” report.
Now I’m starting to think about making a (commercial) production memory profiler, with different tradeoffs: less intrusive, less accurate, fast enough that you can actually use it in production. For preemptive OOM detection it needs to be really sure OOM is about to happen… Or, alternatively, if the data for profiling reports is on disk, you can extract a reason-for-crash report after the crash happens, presuming the OOM kill didn’t e.g. take down your whole container.
It’s a fun problem space!
I also recommend checking out the talk “Linux Memory Management at Scale” by Chris Down.