1. 13
  1.  

    1. 2

      To disable it you pass a flag “--disable-experimental-malloc”, which seems to suggest the feature is not stable? So why is it enabled by default?

      1. 3

        At a wild guess, features like this are introduced with a pair of flags, e.g. --enable-experimental-malloc and --disable-experimental-malloc, with disabled being the default.

        Those who care to can set the disabled flag in their builds so that when defaults change their builds will keep working the same way.

        Once the feature is stable, the default is changed but the flags are not renamed (to preserve compatibility).

      2. 1

        I’m no expert in memory management, so it’s not 100% clear to me what “cache” means in this context. AFAICT when the application “deallocates” memory, it isn’t actually freed at the OS level, so when the application allocates again there’s already spare memory available that can be used (thus avoiding a context switch). Is that correct? And if so, isn’t that what OpenSSL did that made safety tools like Valgrind useless with that codebase? (Presumably this won’t be a problem since it’s libc, not the application - I’m just curious.)

        1. 7

          The important word here is “per-thread”, not “cache”. malloc implementations need to maintain (explicit or implicit) data structures to track which memory is allocated. That means taking locks or using atomics in multi-threaded contexts. Keeping a per-thread pool of “ready to be allocated” memory avoids that synchronization overhead.
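
          As a rough sketch of that idea (hypothetical code, not glibc’s or jemalloc’s actual implementation; all names are made up): each thread consults its own free list first and only falls back to a lock-protected shared pool on a miss.

          ```c
          /* Sketch of a per-thread allocation cache.  One fixed size class
           * for brevity; real allocators keep one bin per size class. */
          #include <pthread.h>
          #include <stdlib.h>

          #define BLOCK_SIZE  64
          #define CACHE_SLOTS 8

          static __thread void *tcache[CACHE_SLOTS]; /* per-thread, so no locking */
          static __thread int   tcache_count;

          static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

          void *cached_alloc(void)
          {
              if (tcache_count > 0)                 /* fast path: no lock, no atomics */
                  return tcache[--tcache_count];

              pthread_mutex_lock(&global_lock);     /* slow path: shared state */
              void *p = malloc(BLOCK_SIZE);         /* stand-in for the real shared arena */
              pthread_mutex_unlock(&global_lock);
              return p;
          }

          void cached_free(void *p)
          {
              if (tcache_count < CACHE_SLOTS) {     /* keep the block for this thread */
                  tcache[tcache_count++] = p;
                  return;
              }
              pthread_mutex_lock(&global_lock);     /* cache full: hand back to shared pool */
              free(p);
              pthread_mutex_unlock(&global_lock);
          }
          ```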

          1. 4

            Yes, Valgrind knows about malloc and free. The layer it cares about is things coming out of malloc and going into free. If you add layers on top of that, Valgrind doesn’t know about them and doesn’t see what’s happening, because there’s nothing happening at the layer it’s looking at.
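
            For what it’s worth, Valgrind does give custom allocators a way to opt back in: the VALGRIND_MALLOCLIKE_BLOCK / VALGRIND_FREELIKE_BLOCK client requests from <valgrind/valgrind.h> describe pool-managed blocks to Memcheck. A sketch, where the pool_* functions are made up but the macros are Valgrind’s own:

            ```c
            #include <valgrind/valgrind.h>
            #include <stddef.h>

            extern void *pool_grab(size_t n);    /* hypothetical custom allocator */
            extern void  pool_release(void *p);  /* hypothetical */

            void *my_alloc(size_t n)
            {
                void *p = pool_grab(n);
                /* args: address, size, redzone bytes, is_zeroed */
                VALGRIND_MALLOCLIKE_BLOCK(p, n, 0, 0);
                return p;
            }

            void my_free(void *p)
            {
                VALGRIND_FREELIKE_BLOCK(p, 0);   /* args: address, redzone bytes */
                pool_release(p);
            }
            ```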

            1. 3

              > AFAICT when the application “deallocates” memory, it isn’t actually freed at the OS level, so when the application allocates again there’s already spare memory available that can be used (thus avoiding a context switch). Is that correct?

              Mostly, yes. Specifically, what you avoid by using cached pages is a mode switch (into the kernel and back), which is much cheaper than a full process context switch but still expensive enough to be worth avoiding.

              > And if so, isn’t that what OpenSSL did that made safety tools like Valgrind useless with that codebase?

              Right. With a custom memory manager layered on top of what the system provides, memory debuggers can’t distinguish between what’s actually in use and what’s supposedly freed.

              The libc allocator may also provide you with mitigations that make exploits harder. Custom allocators built for speed rarely include these features. On OpenBSD, you get the option to disable the free page cache, among a bunch of other things. This feature is enabled by default in ssh, and it’s not nice for a crypto library to subvert that with a custom allocator.
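
              As a tiny illustration of why that mitigation matters (a deliberately buggy, hypothetical program, not tied to any particular allocator):

              ```c
              #include <stdio.h>
              #include <stdlib.h>
              #include <string.h>

              int main(void)
              {
                  char *p = malloc(1 << 16);      /* large enough to occupy whole pages */
                  strcpy(p, "secret");
                  free(p);
                  /* Undefined behavior.  With a free page cache, the stale
                   * contents are often still readable here, hiding the bug;
                   * if the allocator unmaps freed pages instead, this read
                   * crashes immediately and the bug surfaces. */
                  printf("%s\n", p);
                  return 0;
              }
              ```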

              1. 1

                Does FreeBSD’s jemalloc do this?

                1. 3

                  Short answer: yes.

                  Long answer: there are two strategies here. Both libraries have multiple arenas, mapped regions of memory that they dump allocated objects into. Each arena has a lock that you need in order to allocate or deallocate. There are on the order of 16-64 arenas at a time in non-trivial threaded programs, not one per thread. Jemalloc also has a truly one-per-thread cache used for small allocations, to avoid touching arenas. Glibc has just added the one-per-thread “tcache” functionality, which is exactly what it’s called in jemalloc.
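
                  A sketch of that two-level scheme (all names and numbers here are illustrative, not jemalloc’s or glibc’s actual internals):

                  ```c
                  #include <pthread.h>
                  #include <stddef.h>

                  #define NARENAS   16   /* a handful of shared arenas, not one per thread */
                  #define SMALL_MAX 512  /* only small allocations go through the tcache */

                  struct arena { pthread_mutex_t lock; /* ... region metadata ... */ };
                  static struct arena arenas[NARENAS];  /* locks initialized at startup (omitted) */
                  static __thread unsigned my_arena;    /* assigned once per thread */

                  extern void *tcache_pop(size_t n);                 /* hypothetical per-thread bins */
                  extern void *arena_alloc(struct arena *, size_t);  /* hypothetical */

                  void *alloc_sketch(size_t n)
                  {
                      if (n <= SMALL_MAX) {
                          void *p = tcache_pop(n);   /* hit: no lock touched at all */
                          if (p)
                              return p;
                      }
                      /* Miss or large allocation: lock one of the shared arenas.
                       * Contention is limited to the threads mapped to this arena. */
                      struct arena *a = &arenas[my_arena % NARENAS];
                      pthread_mutex_lock(&a->lock);
                      void *p = arena_alloc(a, n);
                      pthread_mutex_unlock(&a->lock);
                      return p;
                  }
                  ```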

                  See also: jemalloc implementation notes.

                  Disclaimer: I don’t actually specifically know if this is enabled in FreeBSD. I don’t see why it wouldn’t be.