  2. 8

    I haven’t been a Mac/macOS user for many years, but I’m still excited to see Linux, FreeBSD, and Windows (via Boot Camp?) running on the M1, because I think all their mainstream system schedulers are too reliant on NUMA and SMP assumptions and could use some overhauling to better take advantage of non-NUMA/non-SMP compute power.

    There are a lot of subtle assumptions that have a big impact. For example, even though AMD goes to great lengths to make Threadripper appear to be a uniform (non-NUMA) architecture, there were a number of Windows CPU scheduler bugs that caused a high standard deviation in the latencies of many operations, due to the non-uniform memory access patterns of the various chiplets/cores in the TR1/TR2 processors. Microsoft patched those in a Windows 10 point release, but I think there’s a lot more low-hanging fruit that no one has bothered to pick.
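
    As a rough illustration of what “high standard deviation in latencies” looks like as a measurement, here is a minimal probe, a sketch assuming only POSIX clock_gettime (the Windows-era Threadripper reports would have used platform-specific timers; the buffer size and iteration counts are arbitrary). It times the same small memory-touching loop many times and reports mean and standard deviation; a scheduler that bounces the thread between chiplets with different memory latencies shows up as a large stddev:

        /* Toy latency-jitter probe: time the same unit of work repeatedly
         * and report mean and standard deviation in nanoseconds.
         * Illustrative sketch only; sizes and counts are arbitrary. */
        #include <math.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <time.h>

        #define ITERS 10000
        #define WORK  4096                /* cache lines touched per sample */

        static volatile uint8_t buf[WORK * 64];
        static double samples[ITERS];

        static uint64_t now_ns(void) {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000u + ts.tv_nsec;
        }

        int main(void) {
            double sum = 0.0, sumsq = 0.0;
            for (int i = 0; i < ITERS; i++) {
                uint64_t t0 = now_ns();
                for (int j = 0; j < WORK; j++)
                    buf[j * 64]++;        /* touch one cache line per step */
                samples[i] = (double)(now_ns() - t0);
                sum += samples[i];
            }
            double mean = sum / ITERS;
            for (int i = 0; i < ITERS; i++)
                sumsq += (samples[i] - mean) * (samples[i] - mean);
            printf("mean %.1f ns, stddev %.1f ns\n", mean, sqrt(sumsq / ITERS));
            return 0;
        }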

    1. 39

      Arm worked a lot on their Intelligent Power Architecture in Linux, which provides big.LITTLE power-aware scheduling. The Linux kernel is very Intel-focused and so this was the biggest (non-disruptive) change that they could get upstreamed. FreeBSD has nothing equivalent (Robin, who led the IPA project, was interested in helping but no one volunteered to do or fund the work). In general, it’s a very difficult problem for a few reasons:

      • It’s a bin-packing problem, except that you don’t know how big the things are until after you’ve tried a packing, and when you retry they may be different sizes (the toy cost model after this list makes the tradeoff concrete). If you have a slow core and one that’s twice the speed, you can think of it as having 3 core-units of performance available. If you estimate that everything can fit on the small cores and get it wrong, everything is slow. If you estimate that you need the big core and don’t, you waste power. If your thermal envelope can handle it, you may also want to have three possible configurations (big, little, big+little).
      • The transition costs when you get it too wrong are quite high. If you shut down the big core but then need to power it up again, that’s an operation with both a large power cost and a high latency. By the time it’s started, you might not need it anymore.
      • It’s a constraint problem on many axes. If you run the big cores for a long time, you may end up hitting thermal throttling, so for a single-threaded CPU-bound workload the optimum may be to have one big core running that thread and a little core running everything else. If you can burst on the big core and then drop into a deeper sleep state sooner, your whole-system power consumption may be lower, but if you then need to wake up again almost immediately, it will be a lot worse than if you had just run more slowly on the small core.
      • The cores aren’t simply fast vs. slow. There was a nice paper at one of the ISCA workshops in, I think, 2014, that showed a Cortex-A7 outperforming an A15 on some workloads: the (fast, superscalar) A15 had a 4-cycle L1 latency, whereas the (slow, just-about-dual-issue-downhill-with-a-trailing-wind) A7 had a 1-cycle L1 latency, so memory-bound workloads with a working set that fitted in L1 were faster on the A7 than on the A15 in wall-clock time, even though the A7 was clocked at a lower speed.
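
      To make the bin-packing and transition points concrete, here’s a deliberately toy cost model, a sketch in which every number (capacities, power figures, the transition penalty) is invented for illustration; a real energy-aware scheduler works from measured per-CPU energy models and tracked load, not constants like these. It scores the three configurations above (little-only, big-only, big+little) against an estimated load and charges for powering the big core up or down, which is where a wrong estimate hurts twice:

          /* Toy big.LITTLE placement cost model. Every constant is
           * invented for illustration; a real energy-aware scheduler
           * uses measured per-core energy models and tracked load. */
          #include <stdio.h>

          enum config { LITTLE_ONLY, BIG_ONLY, BIG_PLUS_LITTLE, NCONFIG };

          static const char *names[NCONFIG] = { "little", "big", "big+little" };
          static const double capacity[NCONFIG] = { 1.0, 2.0, 3.0 }; /* core-units */
          static const double power[NCONFIG]    = { 1.0, 4.0, 5.0 }; /* arbitrary */
          static const double transition_cost   = 3.0; /* big core up/down */

          /* Pick the cheapest configuration that we *estimate* fits.
           * Underestimate and everything runs slow; overestimate and we
           * pay both extra power and a transition: the double penalty. */
          static enum config choose(double est_load, enum config current) {
              enum config best = BIG_PLUS_LITTLE; /* fallback: everything on */
              double best_cost = 1e9;
              for (enum config c = 0; c < NCONFIG; c++) {
                  if (capacity[c] < est_load)
                      continue;                    /* doesn't fit (we think) */
                  double cost = power[c];
                  if ((c == LITTLE_ONLY) != (current == LITTLE_ONLY))
                      cost += transition_cost;     /* big core powers up/down */
                  if (cost < best_cost) { best_cost = cost; best = c; }
              }
              return best;
          }

          int main(void) {
              enum config cur = LITTLE_ONLY;
              const double estimates[] = { 0.8, 0.9, 2.5, 0.7, 2.8 };
              for (int i = 0; i < 5; i++) {
                  enum config next = choose(estimates[i], cur);
                  printf("load ~%.1f: %s -> %s\n", estimates[i], names[cur], names[next]);
                  cur = next;
              }
              return 0;
          }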

      Even nominally SMP systems have some interesting corner cases. We found, for example, that disabling CPU affinity in the FreeBSD scheduler made some parallel workloads faster: CPU affinity would cause whatever execution units a particular workload used most to get hot, and thermal throttling would then kick in. Without affinity there was a small penalty for migration (though within a socket, remote cache snooping is now so fast that it doesn’t make much difference), but the CPU heated more evenly and maintained a higher clock speed.
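
      For reference, the knob in question looks like this from userspace: a minimal sketch of pinning the calling thread to one CPU with FreeBSD’s cpuset(2) API (the CPU number here is arbitrary; the experiment above amounted to relaxing this kind of constraint inside the scheduler itself, not in user code):

          /* Minimal FreeBSD thread-affinity sketch: pin the calling
           * thread to CPU 2 (an arbitrary choice for illustration).
           * The scheduler experiment described above is the inverse:
           * relaxing affinity so work migrates and the die heats evenly. */
          #include <sys/param.h>
          #include <sys/cpuset.h>
          #include <stdio.h>

          int main(void) {
              cpuset_t mask;
              CPU_ZERO(&mask);
              CPU_SET(2, &mask);

              /* id -1 with CPU_WHICH_TID means "the calling thread". */
              if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                                     sizeof(mask), &mask) != 0) {
                  perror("cpuset_setaffinity");
                  return 1;
              }
              printf("pinned to CPU 2\n");
              return 0;
          }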

      It’s very easy to make things perform differently with small tweaks to the scheduler. Making that a consistent performance improvement is incredibly hard.

      1. 9

        I just want to say thanks for taking the time out to write that. It’s a greatly informative response, and I have a lot of topics to research/chase down.