1. 17
  1. 2

    I was about to try snmalloc the other day but this put me off:

    Building with GCC is currently not recommended because GCC emits calls to libatomic for 128-bit atomic operations.

    A bit of googling suggests that not even all x86_64 CPUs have the necessary instruction. Did you find your implementation reasonably portable to other 64-bit architectures (I assume 32-bit ones are out of the question)? Also, mimalloc doesn’t seem to have this restriction. If you have time to elaborate why you need 128-bit atomics, I for one would be interested to hear.

    1. 3

      A bit of googling suggests that not even all x86_64 CPUs have the necessary instruction.

      As far as I’m aware, all 64-bit AMD CPUs apart from the very earliest Athlon 64s support it, as do all Intel CPUs from the Core 2 onwards. That’s basically anything <10 years old and most things <15. Every 32-bit x86 CPU from the Pentium onwards provides the analogous instruction for 32-bit mode (cmpxchg8b). Linux also uses this for RCU, so it hits a bunch of slow paths if the instruction isn’t available.
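
      (If you want to check at run time: the feature bit is CPUID leaf 1, ECX bit 13. A minimal sketch using GCC’s <cpuid.h>, not anything snmalloc ships:)

      #include <cpuid.h>
      #include <stdbool.h>

      /* Returns true if the CPU advertises cmpxchg16b
       * (CPUID.01H:ECX bit 13, bit_CMPXCHG16B in <cpuid.h>). */
      static bool has_cmpxchg16b(void)
      {
          unsigned eax, ebx, ecx, edx;
          return __get_cpuid(1, &eax, &ebx, &ecx, &edx)
              && (ecx & bit_CMPXCHG16B);
      }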

      Did you find your implementation reasonably portable to other 64-bit architectures (I assume 32-bit ones are out of the question)?

      You can see the set of supported architectures in the architecture abstraction layer directory. We support:

      • x86 (64- and 32-bit), including x86-64 in an SGX enclave (which gives you a slightly restricted instruction set)
      • Arm (A32/T32 and A64)
      • PowerPC (32- and 64-bit)
      • RISC-V (32- and 64-bit)
      • SPARC (32- and 64-bit)

      CHERI support didn’t quite make it for 0.6; it should land in the main branch in the next week or two. MTE support should come at around the same time.

      I believe we test all of these in CI with QEMU user mode (unless we have real hardware available).

      Also, mimalloc doesn’t seem to have this restriction. If you have time to elaborate why you need 128-bit atomics, I for one would be interested to hear.

      We use it on x86 for ABA protection in our multi- and single-producer, multi-consumer (MPMC and SPMC) stack implementations. The MPMC stack is used for allocating allocators: when you free an allocator you push it onto a stack, and the next thread that needs one pops it off. We use both kinds for managing chunks: the SPMC stack for per-allocator chunk collections (multi-consumer because a thread can wake up from an OS low-memory notification and return all not-in-use chunks to the OS), and the MPMC stack for a global pool so that we can grab chunks from anywhere.

      Most architectures provide load-linked / store-conditional (LL/SC) either in addition to or instead of compare-and-swap (CAS) as a primitive. LL/SC is intrinsically ABA-safe. The problem with CAS as a primitive is that you can read a value, another thread can then change the value and change it back, and your CAS will still succeed. With LL/SC, the SC fails if another thread has written to the same memory location since the LL. In theory, LL/SC has forward-progress guarantee problems, but most implementations (and, I believe, the PowerPC spec) lock the cache line in the exclusive state for a few cycles so that common LL/SC sequences always make progress.
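
      To make the CAS hazard concrete, here is a sketch of the classic stack-pop race (illustrative names, not snmalloc’s actual code):

      #include <stdatomic.h>

      typedef struct Node { struct Node *next; } Node;

      /* Naive Treiber-stack pop. Between the load and the CAS, another
       * thread may pop A, pop B, and push A back; the CAS then still
       * succeeds with the stale next pointer: the ABA problem. */
      static Node *pop_naive(_Atomic(Node *) *head)
      {
          Node *old = atomic_load(head);
          while (old && !atomic_compare_exchange_weak(head, &old, old->next))
              ;  /* on failure, old is refreshed; retry */
          return old;
      }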

      If you have double-word compare-and-swap (DCAS) as a primitive then you can avoid ABA by pairing the real value with a counter and incrementing the counter on every store. This turns ABA into {A,0} {B,1} {A,2}, so the stale DCAS fails: {A,0} no longer compares equal to {A,2}.
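
      As a sketch of the counter trick applied to the pop above (names hypothetical; with clang and -mcx16 the 16-byte CAS compiles to cmpxchg16b, while GCC may route it through libatomic, which is exactly the complaint):

      #include <stdatomic.h>
      #include <stdint.h>

      typedef struct Node { struct Node *next; } Node;

      /* {pointer, generation} pair, compared and swapped as one
       * 128-bit unit. */
      typedef struct { Node *ptr; uint64_t gen; } Head;

      static Node *pop_counted(_Atomic Head *head)
      {
          Head old = atomic_load(head);
          while (old.ptr) {
              /* Bump the counter on every swap: a concurrent pop-A /
               * push-A leaves a different generation behind, so our
               * compare fails instead of succeeding spuriously. */
              Head fresh = { old.ptr->next, old.gen + 1 };
              if (atomic_compare_exchange_weak(head, &old, fresh))
                  break;
          }
          return old.ptr;
      }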

      I’m not sure how mimalloc implements the things that we use the stacks for.

      1. 2

        FWIW, Windows 8.1 and onward require cmpxchg16b, so it is hardly niche.

        I haven’t looked at the code, but I wouldn’t guess this prevents 32-bit archs from running; they still have a double-word CAS.
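
        (On 32-bit x86 the double-word CAS is cmpxchg8b; a sketch, assuming gcc with -m32 and -march=i586 or later:)

        #include <stdbool.h>
        #include <stdint.h>

        /* In 32-bit mode this lowers to lock cmpxchg8b, the double-word
         * CAS for that architecture (Pentium and later). */
        static bool cas64(uint64_t *p, uint64_t expected, uint64_t desired)
        {
            return __sync_bool_compare_and_swap(p, expected, desired);
        }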

        1. 1

          I use the following for double-word CAS on gcc; @david_chisnall, this may be of interest. Swapping out the asm block for __atomic_compare_exchange on clang Just Works, as should InterlockedCompareExchange128 on MSVC.

          #include <stdbool.h>
          #include <stdint.h>

          /* 16-byte-aligned pair; build with -mcx16 so cmpxchg16b is legal. */
          typedef struct { int64_t x,y; } __attribute__((aligned(16))) DW;
          /* Atomically: if (*dst == *old) { *dst = new; return true; }
           * else { *old = *dst; return false; }
           * `new...` is variadic so that a compound-literal argument
           * containing commas still parses as a single argument. */
          #define DCAS(dst,old,new...) ({ bool _dcas_r; \
           __asm__("lock cmpxchg16b %3" : "=@cce" (_dcas_r), \
                                          "+a" ((old)->x), \
                                          "+d" ((old)->y), \
                                          "+m" (*(dst)) \
                                        : "b" ((new).x), \
                                          "c" ((new).y)); \
           _dcas_r; })
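
          A usage sketch (hypothetical names, reusing DW and DCAS from above): on failure the macro refreshes *old with the value it found, so the retry loop needs no separate re-read:

          /* Atomically increment the x half of a shared DW. */
          static void bump_x(DW *shared)
          {
              DW expected = *shared;  /* racy snapshot; DCAS re-checks it */
              while (!DCAS(shared, &expected,
                           ((DW){ expected.x + 1, expected.y })))
                  ;  /* expected now holds the current value; retry */
          }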
          
          1. 1

            With GCC, you can also use the __sync_ builtins instead of the __atomic_ builtins. I’m not especially interested in working around it for GCC because you can just compile with either clang or MSVC to get the right instruction. GCC has had a bug open about this issue for several years. I think it was four years old when we added this warning, so I assumed they’d fix it soon. Three years later, I’ve given up on GCC. If I can, I’ll make stuff work with it, but I don’t expect the result to have competitive performance.
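
            For example (a sketch; gcc needs -mcx16), the __sync_ form on a 128-bit integer does get the inline instruction where the __atomic_ form goes through libatomic:

            #include <stdbool.h>

            /* unsigned __int128 is a GCC/clang extension. With -mcx16,
             * gcc emits an inline lock cmpxchg16b here instead of a
             * libatomic call. */
            static bool cas128(unsigned __int128 *p,
                               unsigned __int128 expected,
                               unsigned __int128 desired)
            {
                return __sync_bool_compare_and_swap(p, expected, desired);
            }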

            1. 1

              __sync* is annoying because it doesn’t give you back both the flag and the old value; you have to choose which you want (or take the old value and replicate the compare by hand). Regarding performance, here is one data point: ips4o (which requires double-word CAS) ran slightly faster for me when compiled with gcc than with clang, despite the fact that the former required a library call for its atomics. Not surprising: a function call is much cheaper than an atomic operation, even accounting for second-order effects (and I would expect fewer of those, given the compiler’s special knowledge of libgcc/libatomic/..).
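
              Concretely (reusing the cas128 shape from the sketch above), rebuilding the flag from the value-returning form looks like this:

              #include <stdbool.h>

              /* __sync_val_... returns only the old value, so the
               * "did it swap?" flag has to be reconstructed by
               * repeating the compare by hand. */
              static bool cas128_flag(unsigned __int128 *p,
                                      unsigned __int128 expected,
                                      unsigned __int128 desired)
              {
                  return __sync_val_compare_and_swap(p, expected, desired)
                      == expected;
              }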

              1. 1

                Possibly the 9.x release shipped by Ubuntu was particularly bad, but I’ve spent far more time fighting compiler bugs in GCC in the last two years than in all other compilers combined over the last 10 years. My favourite was that #if __has_include(<sys/futex.h>) failed, yet #include <sys/futex.h> worked fine, apparently because of a bug in the fixincludes logic; __has_include worked correctly with every other header that I tested.

                More annoyingly, I had a miscompilation that I never managed to get to the root of, which caused my code to crash. I’d normally blame some UB in my code for this kind of thing, but when the same code compiles correctly with Visual Studio 2019, clang 9, 10, 11, 12, and 13, and gcc 10, 11, and 12, in release and debug configurations, for multiple architectures, I’m inclined to blame the compiler.