A related thing I’ve heard of: AIUI, at one point the combination of MySQL + Linux had a bug where the system would act as though it had been pinned to just one NUMA node, and requests for memory would fail spuriously. I can’t remember the details for sure (this was fixed over a decade ago), but I don’t think MySQL was actively messing with the NUMA control bits; it was just a bug in Linux code that tried to make things go faster by automatically keeping allocations on the same NUMA node the process was started on. This didn’t work very well with a single MySQL process (with lots of threads, mind) that was configured to use slightly less than all the memory on the machine across all the NUMA nodes.
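For what it’s worth, that failure mode is observable on Linux today: each NUMA node exposes its own meminfo under /sys, and one node can be effectively out of free memory while the machine as a whole has plenty. A minimal, Linux-only Python sketch (assuming the per-node meminfo format of current kernels):

```python
import glob
import re

def node_mem_free_kb(meminfo_text):
    """Extract the MemFree value (in kB) from a node's meminfo listing,
    e.g. a line like "Node 0 MemFree:    123456 kB"."""
    m = re.search(r"MemFree:\s+(\d+)\s+kB", meminfo_text)
    return int(m.group(1))

def per_node_free():
    """Report free memory per NUMA node. On a box hitting the bug
    described above, one node would sit near zero MemFree while the
    others still had plenty."""
    free = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
        node = path.split("/")[-2]  # e.g. "node0"
        with open(path) as f:
            free[node] = node_mem_free_kb(f.read())
    return free
```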
I wonder if this is the same Linux / MySQL interaction as referenced in The Night Watch.
Thank you for reminding me of that. I presume not, because a priority inversion sounds like a scheduler bug, but maybe.
Yeah, it doesn’t sound like it. And it’s not like only 1 Linux performance bug ever affected MySQL. But it was an okay excuse to link The Night Watch and I do that any chance I get!
I guess this is tangential, but reading about this makes me wonder if NUMA is really worth it. That is, is it worthwhile for the hardware and OS to go through so much effort to present the illusion of a single large multi-core machine with a single pool of memory, rather than splitting up the workload onto smaller, simpler machines? Of course, this assumes that the workload is amenable to scaling out rather than up, as e.g. typical stateless web applications are.
Based on my personal, practical experiences, I now tend to err on the side of NUMA not being worth it, a lot of the time.
In theory all the advantages listed in other comments here are true: NUMA interconnects between threads and memory on a single machine under a single kernel can be far more efficient than e.g. a fast/short “external” network connection between two separate machines with separate kernels. It also allows for some odd things to work that are otherwise difficult (e.g. scaling the total memory and/or I/O bandwidth of a single kernel/system beyond what 1x hardware CPU can easily support on its own, at some perf/scaling tradeoff).
The caveat is that properly taking advantage of NUMA and avoiding its pitfalls is quite difficult, and most people in most situations simply won’t have the right software architecture to take optimal advantage of it, and/or it won’t be worth the additional configuration/complexity overhead and various other tradeoffs. The Linux kernel does try to automagically make a best effort on a multi-purpose machine with many independent threads, processes, and memory allocations to shuffle around, but even that really isn’t ideal most of the time and never really can be. Maybe for the base layer of a many-containers/pods sort of deployment that has some awareness of NUMA, but not in many other cases.
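For reference, the kernel’s automatic best-effort behavior mentioned above is exposed as the kernel.numa_balancing sysctl on kernels built with CONFIG_NUMA_BALANCING. A tiny sketch for checking it (the 0/1 meaning is per current kernel docs; newer kernels also accept 2 for a memory-tiering mode):

```python
def parse_numa_balancing(raw):
    """Interpret the sysctl value: 0 = off, 1 = on
    (newer kernels also use 2 for memory-tiering mode)."""
    return int(raw.strip())

def numa_balancing_enabled(path="/proc/sys/kernel/numa_balancing"):
    """Return whether automatic NUMA balancing is active. The file is
    simply absent on kernels built without CONFIG_NUMA_BALANCING."""
    try:
        with open(path) as f:
            return parse_numa_balancing(f.read()) != 0
    except FileNotFoundError:
        return False
```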
For most things that are in the category of stateless-ish processing that easily scales across vast clusters of machines, I think it’s probably a good rule of thumb to avoid NUMA complexities and tradeoffs and simply scale out using non-NUMA machines across modern faster ethernet (and other) interconnects. If you were leaning towards NUMA in this kind of scenario just to fit more raw CPU/mem into a single rack, you’d be better off with non-NUMA blade servers (which also can provide pretty low latency and high-bandwidth “real” network interconnects across their blade chassis units). Even for many somewhat less-scalable and/or less-stateless parts of an architecture, I think NUMA’s often not worth its various costs in engineering, complexity, and dollars.
There are special cases that don’t scale well at the machine level, and NUMA can help some of these achieve higher single-machine scale in ways nothing else can. There are also scalable solutions that happen to map very well to NUMA (think e.g. a two-layer software architecture that’s deploying a coupled, high-traffic pair of independent heavy processing layers on every machine, one bound to each of two NUMA nodes), but actually squeezing out that benefit will be challenging at various levels! I get these cases, and I especially think some of the larger-scale NUMA boxes make a ton of sense (the 4-8+ socket kinds with TBs of memory) for specialized cases.
What I’m railing against is the proliferation of low-end 1U two-socket NUMA boxes that vendors and users deploy as a generic way to naively scale up CPU-core/memory counts per rack unit, without any special consideration of the NUMA tradeoffs, for non-containerized use as bare-metal machines serving singular, network-scalable purposes.
The processor interconnect is a lot faster than a NIC.
If IP over avian carriers is a thing, HTTP/2 over UPI could be a thing, too, just saying!
It’s usually implemented using pigeons. Not ducks.
Ah, right, so with a single large machine using NUMA, multiple processors can use a shared cache or other in-memory data structure more efficiently than multiple machines could access something like Redis.
I guess the tricky part is to make sure that the processor interconnect isn’t accidentally over-used, e.g. for data that’s specific to a single web request or database connection. How much of this is automatically handled by the hardware and OS? Do higher layers like the JVM (and similar runtime environments) have to optimize for NUMA? Links to introductory material would be great; I don’t see NUMA being discussed often on sites like this one.
Linux will attempt to schedule a single thread on the same NUMA node. Shared memory like a database block cache or the Redis primary index will cross NUMA boundaries, because it’s not managed by any individual thread. A single web request on a single thread will almost always use entirely local memory. Note that async runtimes don’t process a request on a single thread, and expose an ordinarily simple case to NUMA issues. As web requests are usually otherwise independent, it’s sensible to run 1 app server per NUMA node and pin it to that node.
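A sketch of the “one app server per NUMA node” idea, pinning CPU affinity from the node’s cpulist (Linux-only; note this covers only the scheduling side, and in practice you’d usually also bind memory policy, e.g. with numactl --cpunodebind=N --membind=N):

```python
import os

def parse_cpulist(s):
    """Parse a Linux cpulist string such as "0-7,16-23" into a set of CPU ids."""
    cpus = set()
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_to_numa_node(node):
    """Restrict the current process to the CPUs of one NUMA node.
    Memory policy is a separate knob (libnuma / numactl); this only
    handles where threads get scheduled."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 = the calling process
```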
A database connection will use local memory for connection state and intermediate query results, but again, accesses to the block cache or the Linux page cache will often go across the interconnect. That’s still faster than going over the network, as modern CPU interconnects have around a terabit of throughput.
Memory allocators also try to optimize locality, with per-thread and per-core caches. I’m not sure if any mainstream allocators have processor-pinned arenas, but I wouldn’t be surprised. I’m not sure what the JVM does. I vaguely recall reading that ZGC is NUMA aware, but I could be wrong. Azul Systems’ JVM is probably NUMA aware.
Links to introductory material would be great
Wish I could help. I mostly picked up what I know from implementation-details-type docs for Linux and malloc implementations. A recent post here about Netflix’s CDN boxes discusses the CPU interconnect becoming a bottleneck, with NUMA diagrams and hard numbers.