1. 25
  1. 9

    Transparent superpage support was added in FreeBSD 7.0 (2008) and has been enabled by default since then, without the performance issues that the Linux version seems to have had. I am not sure why the Linux version has had so many problems, but it looks as if they’re not demoting pages back after promoting them. For example, as I recall, if you fork in FreeBSD and then take a CoW fault in a page, the vm layer will instruct the pmap to fragment the page in the child and then copy a single 4 KiB page, so you don’t end up copying the whole 2 MiB. There’s also support for defragmentation via the pager, which can help recombine pages later, though I don’t think there’s anything for memory that is never swapped out.
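
    To make the CoW scenario concrete, here is a rough userspace sketch of it (not the kernel code; the madvise(MADV_HUGEPAGE) hint is a Linux knob, whereas FreeBSD promotes reservations automatically). The interesting question is what the kernel does at the marked write: demote and copy a single 4 KiB page, or copy the whole 2 MiB.

    ```c
    /* Sketch: share a (hopefully) 2 MiB-backed region across fork() and
     * take a copy-on-write fault on a single byte inside it. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define SZ (2UL * 1024 * 1024)

    int main(void)
    {
        char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

    #ifdef MADV_HUGEPAGE
        madvise(p, SZ, MADV_HUGEPAGE);   /* Linux THP hint; harmless to omit elsewhere */
    #endif
        memset(p, 0xab, SZ);             /* populate (and hopefully promote) in the parent */

        pid_t pid = fork();
        if (pid == 0) {
            p[SZ / 2] = 0x42;            /* CoW fault on one byte inside the superpage */
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        munmap(p, SZ);
        return 0;
    }
    ```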

    1. 3

      Huge page page faults are absolute hogs. The fault handler emits TLB shootdowns for every single 4k page in the huge page region, which takes about 200us per fault for a 2MB HP.

      Because of this, THP are usually bad news for latency-critical applications, as this behavior will cause absurd tail latencies. It’s even worse when using them with allocators that do not natively support them. Even those that supposedly do (e.g. jemalloc) show iffy tail behavior.
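
      For what it’s worth, the usual escape hatch on Linux is to opt the latency-critical process (or specific ranges) out of THP entirely. A minimal sketch, assuming Linux’s prctl(PR_SET_THP_DISABLE) and madvise(MADV_NOHUGEPAGE) interfaces; the arena size here is arbitrary:

      ```c
      /* Sketch: keep transparent huge pages away from a latency-critical
       * process, either for the whole process or for one mapping. */
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/prctl.h>

      int main(void)
      {
          /* Whole process: no THP for any future mapping in this process. */
          if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0)
              perror("prctl(PR_SET_THP_DISABLE)");

          /* Or per range: keep THP off just for one latency-sensitive arena. */
          size_t len = 64UL * 1024 * 1024;
          void *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (arena == MAP_FAILED) { perror("mmap"); return 1; }
      #ifdef MADV_NOHUGEPAGE
          madvise(arena, len, MADV_NOHUGEPAGE);
      #endif
          return 0;
      }
      ```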

      From my experience, the best use case for HP is ring buffers (either SW/SW or SW/HW) where the capacity is known in advance and the pages can be pre-faulted. But that’s a very tailored situation that doesn’t broadly apply.
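
      A minimal sketch of that pre-faulted ring-buffer pattern, assuming Linux’s explicit huge pages (MAP_HUGETLB) with 2 MiB pages reserved in advance via /proc/sys/vm/nr_hugepages; the capacity here is made up:

      ```c
      /* Sketch: allocate the ring buffer backing store from explicit huge
       * pages and fault everything in at startup, so no huge-page faults
       * (or their TLB shootdowns) happen on the hot path. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>

      #define RING_CAPACITY (4UL * 2 * 1024 * 1024)   /* four 2 MiB huge pages */

      int main(void)
      {
          void *ring = mmap(NULL, RING_CAPACITY,
                            PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                            -1, 0);
          if (ring == MAP_FAILED) {
              perror("mmap(MAP_HUGETLB)");   /* fails if no huge pages are reserved */
              return EXIT_FAILURE;
          }

          /* Touch every page now so the fault cost is paid before the
           * latency-critical phase; mlock() keeps the buffer resident. */
          memset(ring, 0, RING_CAPACITY);
          if (mlock(ring, RING_CAPACITY) != 0)
              perror("mlock");

          /* ... producer/consumer indices over `ring` would go here ... */

          munmap(ring, RING_CAPACITY);
          return EXIT_SUCCESS;
      }
      ```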

      1. 3

        Huge page page faults are absolute hogs. The fault handler emits TLB shootdowns for every single 4k page in the huge page region, which takes about 200us per fault for a 2MB HP.

        I’m not sure I understand why you need to do all of the shootdowns, but at least in FreeBSD, situations that need to shoot down more than one page are batched. The set of addresses is written to a shared buffer and then the pmap calls smp_rendezvous on the set of cores that are using this pmap and does an INVLPG on each one. Hyper-V also has a hypercall that does the same sort of batched shootdown. I’m not sure how this changes on AMD Milan with the broadcast TLB shootdown.

        FreeBSD’s superpage support on x86 takes advantage of the fact that all AMD, Intel, and Centaur CPUs handle conflicts in the TLB, even if the architecture says that they don’t. This means that promotion does not have to do shootdowns; it just installs the new page table entries. As I recall (and I’m probably mixing up Intel and AMD here), on Intel the two entries coexist in different TLBs, on AMD the newer one evicts the older, and on Centaur the cores detect the conflict, invalidate both, and walk the page table again.

        1. 2

          I did not dig any deeper and my root-cause analysis could be wrong. Here is a relevant ftrace if you are curious: https://gist.github.com/xguerin/c9d97ef50701bd247a219191cb37ec8a. Total latency is 271us. The largest cost centers are: 1/ get_page_from_freelist takes a whopping 120us; 2/ clear_huge_page takes another 135us (admittedly 2/ is not strictly required as part of the overall operation).

    2. 4

      Oh, delightful. It is so hard to find good performance numbers on page sizes; it’s a real pain.

      1. 3

        The Google 7% fleet-wide number is somewhat credible to me. I found a 1.5x-2.2x speed-up from using the Linux HugeTLB FS for this application: https://github.com/c-blake/suggest#system-layer-stress-test
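
        For anyone who has not used it, the HugeTLB FS is the explicit path (as opposed to THP): you map files that live on a hugetlbfs mount. A rough sketch of the pattern; the /mnt/huge mount point and file name are placeholders, and this is not the linked project’s code:

        ```c
        /* Sketch: back a mapping with explicit 2 MiB pages from a hugetlbfs
         * mount, e.g. one created with:  mount -t hugetlbfs none /mnt/huge  */
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            const size_t len = 2UL * 1024 * 1024;        /* one 2 MiB huge page */
            int fd = open("/mnt/huge/scratch", O_CREAT | O_RDWR, 0600);
            if (fd < 0) { perror("open"); return 1; }

            /* hugetlbfs files don't support read()/write(); all access goes
             * through the mapping, and faulting a page allocates a huge page. */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            p[0] = 1;                                    /* fault in the huge page */
            munmap(p, len);
            close(fd);
            return 0;
        }
        ```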

        1. 1

          PowerPC on Linux uses 64 kiB pages, although Redhat/Fedora are considering switching to 4 kiB due to the same compatibility issues.

          They’re not considering it all that hard, speaking as a Fedora user on ppc64le. In practice most stuff works, though I’d love to be able to run Hangover.

          1. 1

            It’s so weird that there’s so much pressure to maintain 4k pages. It was worth the cost to Apple to take the compatibility hit of moving from 4k to 16k pages in hardware, and I’m not sure why people are so hell-bent on thinking that 4k is still the best option.

            1. 3

              It’s not, but for x86 hardware the only options are 4 KiB or 2 MiB, and even today 2 MiB is just a bit too big to be convenient for small programs. It looks like AArch64 has more options (16 and 64 KiB pages), which I actually didn’t know.

              1. 1

                I’m not sure why people are so hell-bent on thinking that 4k is still the best option.

                Which people are those?

                1. 1

                  Has Apple managed this for macOS? I spoke to some of Apple’s CoreOS team about RISC-V having an 8 KiB default page size at ASPLOS a few years ago and their belief was that it would break large amounts of code. A lot of *NIX software assumes that you can mprotect a 4 KiB granule. For iOS, they’ve generally been willing to break a load of things and require people to put in porting effort but less so for macOS.

                  1. 2

                    The 4k assumption is actually even worse than just an “assumption”: even code that tried to do the right thing by using getpagesize() didn’t work, because it was a macro that expanded to 4k on Intel machines (at least on Darwin), which made Rosetta challenging. The M-series SoCs support some kind of semi-4k page allocation in order to deal with this problem under Rosetta.

                    ARM builds for iOS had been 16k for many years before that, so a moderate amount of Mac software (which shared code with iOS) had already made the source changes to do the right thing, and getpagesize() stopped being a macro, so new builds of software got the correct value.
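
                    The “right thing” is roughly this pattern: query the page size at runtime and round protection ranges to it instead of hard-coding 4096. A small sketch (the guard-page use is just an example):

                    ```c
                    /* Sketch: portable page-size handling; works unchanged on
                     * 4 KiB (x86) and 16 KiB (Apple Silicon) systems. */
                    #include <stdint.h>
                    #include <stdio.h>
                    #include <sys/mman.h>
                    #include <unistd.h>

                    int main(void)
                    {
                        long page = sysconf(_SC_PAGESIZE);   /* 4096 or 16384, decided at runtime */
                        size_t len = 3 * (size_t)page;

                        uint8_t *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

                        /* The address and length passed to mprotect() must be multiples
                         * of the real page size, not an assumed 4 KiB. */
                        if (mprotect(buf + page, (size_t)page, PROT_NONE) != 0) {
                            perror("mprotect");
                            return 1;
                        }

                        printf("page size: %ld bytes\n", page);
                        munmap(buf, len);
                        return 0;
                    }
                    ```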

                    1. 1

                      They seem to have, as part of the Intel -> Apple Silicon transition (which I guess requires some porting effort anyway). On the M1/M2, macOS has 16k page sizes, and a quick GitHub search turns up that this did initially break various things that assumed 4k.