1. 43

  2. 12

    Shame it is light on details, and the author didn’t investigate why the allocator is so slow.

    1. 13

      I don’t know about your experience investigating issues like this on Windows, but I find it very frustrating, mainly because you pretty quickly hit the wall of undocumented internal APIs (NTAPI) and the lack of source code. In this particular case, while the MSVCRT source code is available, AFAICS, malloc is not part of that but rather comes from the Platform SDK which, AFAIK, is closed-source.

      Also, while at it, I think the author’s own experience with LLVM provides an answer to the “Why doesn’t Microsoft replace its slow malloc with its own much faster mimalloc” question: because it will likely break a non-trivial number of applications even if only by exposing some latent bugs in those applications themselves. And a few percent of all the crap ever written for Windows ceasing to work is not an acceptable outcome to Microsoft.

      1. 3

        Also, while at it, I think the author’s own experience with LLVM provides an answer to the “Why doesn’t Microsoft replace its slow malloc with its own much faster mimalloc” question: because it will likely break a non-trivial number of applications even if only by exposing some latent bugs in those applications themselves.

        But the author themselves notes this and points out that it could be an optional opt-in.

    2. 10

      Everything is broken and I am too tired to do anything but contribute another obscene hack on top of this pile of cards we have built modern civilization on.

      I’ve started to relate to this more and more as I progress in my career. At a certain level of expertise, you start noticing all the broken stuff that no one else has noticed yet. Then you’re on your own at that point: no bug reports, no workarounds, no documentation, nothing unless you personally create it. And that can be draining if it starts blocking all other progress.

      1. 9

        malloc() on Linux is insanely fast.

        I tried to replace our naive Arc-based tree data structure in Rust at work with a pool-based allocation approach that was a lot more code. Arc allocates from the heap for each piece of data that is reference counted, and I was hoping to eliminate the allocation time, along with improving cache utilization.

        This was foolish. The simple Arc-based solution was as fast as the far more complex approach.
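
        As an illustration (a minimal Rust sketch; the actual work code isn’t shown, so the struct here is hypothetical), each Arc::new is one heap allocation holding the data together with its reference counts, while cloning an Arc only bumps a counter:

        ```rust
        use std::sync::Arc;

        // Hypothetical tree node: one heap allocation per node, with the
        // strong/weak refcounts stored inline next to the data.
        #[derive(Debug)]
        struct Node {
            value: i32,
            children: Vec<Arc<Node>>,
        }

        fn main() {
            let leaf = Arc::new(Node { value: 2, children: vec![] }); // allocates
            // Cloning only increments the reference count; no new allocation.
            let root = Arc::new(Node { value: 1, children: vec![leaf.clone(), leaf] });
            assert_eq!(Arc::strong_count(&root.children[0]), 2);
            println!("{}", root.children[0].value); // 2
        }
        ```

        A pool-based design would replace those per-node allocations with index-based handles into one big buffer, which is where the extra complexity comes from.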

        1. 5

          malloc() on Linux is insanely fast.

          Linux doesn’t include a userspace malloc. Android used to use jemalloc (as FreeBSD does), which is pretty fast, but replaced it a while ago with Scudo, which is slower but tuned a bit more for security than performance. glibc uses ptmalloc2, a variant of dlmalloc with thread caching enabled. This has pretty good RSS usage but terrible performance. We saw over a 30% speedup on some of the SPEC benchmarks replacing it with snmalloc.

          The Windows malloc is actually two allocators. There’s a dlmalloc-style range-based heap and a sizeclass allocator called the low-fragmentation heap. Threads allocate with the dlmalloc-style heap until they’ve hit some threshold and then pivot to allocating things of that size from a fixed-size pool. This gives very good physical memory usage, because a thread that allocates a single object of a weird size is not tying up a whole slab. Unfortunately, it also adds some additional branches.
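
          A rough sketch of that two-tier pivot (the threshold, size-class mapping, and names here are illustrative, not Windows’ actual values): allocations of a size class go down the general dlmalloc-style path until a counter crosses a threshold, after which that size class switches to a fixed-size pool:

          ```rust
          // Illustrative sketch of the two-allocator pivot described above.
          // LFH_THRESHOLD and the size-class mapping are made up, not Windows'.
          const LFH_THRESHOLD: u32 = 16;

          struct Heap {
              // one activation counter per size class (granularity is a toy choice)
              counts: [u32; 128],
          }

          #[derive(Debug, PartialEq)]
          enum Path { General, FixedSizePool }

          impl Heap {
              fn new() -> Self { Heap { counts: [0; 128] } }

              fn choose_path(&mut self, size: usize) -> Path {
                  let class = (size / 16).min(127); // toy size-class mapping
                  let c = &mut self.counts[class];
                  *c += 1;
                  // Until the threshold is hit, stay on the range-based heap;
                  // afterwards, pivot this size class to the fixed-size pool.
                  if *c > LFH_THRESHOLD { Path::FixedSizePool } else { Path::General }
              }
          }

          fn main() {
              let mut h = Heap::new();
              let early = h.choose_path(64) == Path::General;
              for _ in 0..LFH_THRESHOLD { h.choose_path(64); }
              let late = h.choose_path(64) == Path::FixedSizePool;
              println!("{} {}", early, late); // true true
          }
          ```

          The extra branch per allocation in choose_path is the kind of overhead the comment above is referring to.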

          The Windows heap also has a bunch of security mitigations that a lot of other malloc implementations lack. In particular, I seem to recall that the double-free protection is expensive. Thread-caching allocators struggle with double-free protection because a simple implementation might put the same object in two threads’ thread caches. Message-passing allocators struggle because they may have the same free in multiple message queues. In snmalloc, we don’t detect double free at the point of free; we detect it when the double-freed object is being popped off the free list. The Windows allocator does some lightweight locking to check this, and that introduces contention and slowness. In exchange, double-free errors in your app’s telemetry give you a useful report that lets you fix the bug.
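
          A hedged sketch of that deferred-detection idea (the slot/state representation is illustrative, not snmalloc’s actual free-list encoding): free() stays cheap and does no checking, and the duplicate is only noticed when the same object comes off the free list a second time:

          ```rust
          // Toy pool allocator: double free is detected at pop time, not at free time.
          #[derive(Clone, Copy, PartialEq)]
          enum State { Allocated, Free }

          struct Pool {
              states: Vec<State>,
              free_list: Vec<usize>, // stack of slot indices
          }

          impl Pool {
              fn new(n: usize) -> Self {
                  Pool { states: vec![State::Free; n], free_list: (0..n).collect() }
              }

              fn alloc(&mut self) -> Option<usize> {
                  while let Some(slot) = self.free_list.pop() {
                      if self.states[slot] == State::Allocated {
                          // Same slot was on the free list twice: the double free
                          // is only detected here, when the stale entry is popped.
                          eprintln!("double free detected on slot {}", slot);
                          continue; // report (e.g. via telemetry) and skip the entry
                      }
                      self.states[slot] = State::Allocated;
                      return Some(slot);
                  }
                  None
              }

              fn free(&mut self, slot: usize) {
                  // Deliberately no check here: the free path stays cheap.
                  self.states[slot] = State::Free;
                  self.free_list.push(slot);
              }
          }

          fn main() {
              let mut p = Pool::new(1);
              let s = p.alloc().unwrap();
              p.free(s);
              p.free(s); // erroneous double free: nothing fires yet
              let a = p.alloc().unwrap(); // first pop succeeds normally
              assert_eq!(a, s);
              assert!(p.alloc().is_none()); // second pop flags the stale entry
              println!("ok");
          }
          ```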

        2. 8

          Not explored here, but malloc() on Windows is going to confirm that the memory is available for use, even going so far as to expand the pagefile if there’s not enough free physical memory.

          Linux allocates extremely optimistically. Which is part of why Linux is terrible in low memory conditions.

          1. 4

            The huge difference isn’t inherent to Windows paging/commit behaviour. Mimalloc doesn’t have this performance problem and also works just fine on Windows. (As mentioned elsewhere, Microsoft use it too.)

            It’s probably something like lock contention in the malloc implementation, or less likely an O(n^2) algorithm somewhere.

            I assume lock contention is more likely because it’s a very common problem with old mallocs when you try to use them on modern computers.