1. 14
  1.  

    1. 11

      @ltratt, have you been talking to Hesham (who is currently writing the ‘how to adapt memory allocators to CHERI’ training material)?

      In order for CHERI to do something useful, capabilities need to have “non global” permissions

      I was very confused by this until I realised that you meant ‘bounded’ when you said ‘non-global’. As far as I know, all CHERI allocators do provide Global permission in their returned capabilities.

      printf("c1 address 0x%lx

      You can use the # flag to the p format specifier to print capabilities in verbose form (base, bounds, otype, permissions, and so on), which is generally a better idea than rolling your own subset of this.

      My corner cutting has come back to bite us! The problem is that when we pass to realloc a capability whose bounds allow access to 4 bytes and then we try and copy 8 bytes from it, CHERI – quite rightly! – considers this a violation of its rules and stops the program.

      In particular, even ignoring CHERI, an allocator that does this almost certainly introduces exploitable information leaks.

      The fact that this is a bump allocator (and therefore doesn’t do free) means that you miss the bit that’s actually difficult: amplification. In free, you want to go from the pointer that’s returned to some metadata about it so that you know which free list to put it on and so on. This is normally done in one of two ways:

      • Store the metadata immediately before the start of the object and access it by subtracting from the object pointer. This is generally a bad idea because it means that freeing something that is not a pointer to the start of an object will corrupt heap metadata structures. It doesn’t work with CHERI because you can’t access memory before the start of the object.
      • Do some bitmasking on the address to find the metadata elsewhere (e.g. at the start of the slab). This also doesn’t work on CHERI, for the same reason.
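
      In plain C, the first bullet’s pattern looks something like this minimal sketch (`alloc_with_header` and `size_of` are hypothetical names, not from any real allocator). On CHERI, the capability handed back by the allocator has its base at the payload, so the free-side subtraction moves the address below the base and the metadata load traps:

      ```c
      #include <stdlib.h>

      /* Hypothetical per-object header, stored immediately before the payload. */
      struct header {
          size_t size;
      };

      /* Classic non-CHERI pattern: over-allocate and hide a header just
       * before the pointer handed to the caller. */
      static void *alloc_with_header(size_t size) {
          struct header *h = malloc(sizeof *h + size);
          if (!h)
              return NULL;
          h->size = size;
          return h + 1;                /* caller sees only the payload */
      }

      /* free-side: subtract to recover the header.  On CHERI this takes the
       * address below the capability's base, so the load would trap rather
       * than read allocator metadata. */
      static size_t size_of(void *p) {
          struct header *h = (struct header *)p - 1;
          return h->size;
      }
      ```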

      You need some other mechanism to translate an address to metadata, based on some capabilities stored in the allocator. This means that the allocator needs to stash some powerful capabilities that it can use to re-derive the source address.

      The CHERIoT RTOS allocator runs in a separate compartment, so its globals are automatically private. It uses a dlmalloc-like allocator and stores metadata at the front of the allocation, but uses the revocation bit to find the metadata (the revocation bits are set, so you can’t load a pointer that has the metadata as the base, which gives some more defence in depth).

      With snmalloc, we always use the pagemap to find metadata. This is a flat region of the address space (relying on the fact that the page tables already give us a nice sparse array structure) containing pointers to metadata about slabs. This works nicely with CHERI because we only need to find the base pointer for the pagemap. With Dapeng’s work, the global containing this pointer is automatically protected and exposed only on the cross-library call (which also picks up a separate stack); without that protection, an attacker who can poke at the global pointer or at uninitialised stack memory can find it.
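
      A minimal, non-CHERI sketch of the pagemap idea (the names and sizes here are invented for illustration, and the real pagemap is a sparse address-space reservation rather than a small array): the allocator privately holds the pagemap base, and free-side amplification is just an indexed load keyed on the address:

      ```c
      #include <stdint.h>
      #include <stddef.h>

      #define SLAB_BITS 14u                  /* hypothetical 16 KiB slabs */

      struct slab_meta { size_t sizeclass; };

      /* The allocator privately holds the pagemap; on CHERI, the pointer to
       * it is the one powerful capability that needs protecting (e.g. behind
       * a compartment or library-call boundary). */
      static struct slab_meta *pagemap[1u << 10];

      static void pagemap_set(uintptr_t addr, struct slab_meta *m) {
          pagemap[(addr >> SLAB_BITS) & 1023u] = m;
      }

      /* free()-side amplification: from an object's address alone, recover
       * the slab metadata without ever dereferencing before the object's
       * base. */
      static struct slab_meta *pagemap_get(uintptr_t addr) {
          return pagemap[(addr >> SLAB_BITS) & 1023u];
      }
      ```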

      1. 4

        Oh, there’s much more that’s difficult than is contained in the post – our paper goes into some (but definitely not all!) of the other problems, but even I am only willing to put so much detail into one blog post. [I don’t think I know Hesham. Happy to talk to him if it’s useful.]

        1. 4

          Well, broadly, it’s a good idea to talk to people who are shipping CHERI allocators before writing about CHERI allocators. Hesham and Brooks are working on integrating snmalloc into CheriBSD and Hesham has been writing a guide on porting allocators to CHERI (he’s a former PhD student of Robert, now working for Capabilities Ltd). Hongyan, Wes, Robert N, and I have also all worked on CHERI allocators. That’s the list of people I’d start from when looking for people to review work on CHERI allocators.

          1. 3

            I’m a bit surprised by this as I shared the draft paper with you and your colleagues by email in March. [One outcome of that email was https://github.com/microsoft/snmalloc/pull/609 which is a variant of one our attacks: snmalloc wasn’t vulnerable to our original attack but Matt realised a variant would work.]

            1. 4

              The best time to get feedback is early. I didn’t see the draft until it was already accepted. March was after it was submitted. There were a lot of interesting things that you could have looked at if you’d dropped an email to people working on allocators when you started working on this to say ‘hey, we want to do an evaluation’. If nothing else, we could have pointed out that @saaramar had already done a much more rigorous adversarial evaluation of snmalloc, and could have suggested adding him as a coauthor and incorporating his work. We’d also have pointed out early that you were looking at the wrong branch of snmalloc (thank you for fixing that in the camera ready), how to enable temporal safety, and that several of the other allocators are there solely to provide a baseline for benchmarks: they’re unmaintained and were never hardened, so security evaluation of them is largely meaningless. We’d also have pointed you at the CHERIoT allocator as an example of one that optimises for a very different set of constraints (heaps measured in tens of KiBs) but provides deterministic UAF protection. If you’d talked to any of the folks who have ported allocators to CHERI, then you could have incorporated the design rationale in the paper.

              It’s great to see more people being involved with CHERI but a lot more can be achieved by collaborating rather than working in isolation and then sharing updates. DSbD could have facilitated that a lot better, but it’s not like you don’t have my email address.

              1. 10

                I dislike arguments, and I especially dislike public arguments, but I will make one final correction to the public record, trusting that we are then able to move on to more edifying technical matters. You may remember commenting on an earlier draft in 2021 (hence your name in the acks) that just had the attacks; we worked at all times in the same public git repo you’d built the paper from in 2021; we didn’t fix the snmalloc version in camera ready but in a much earlier draft in Jan ’23; and the draft we shared with you in Mar ’23 was weeks before acceptance, so there was plenty of time for you to comment and us to adjust.

                All of this, though, seems like irrelevant fluff to me. Fundamentally, we admire the work you and your colleagues have done, even though we can see occasional imperfections in it, just like you can see imperfections in our work! We are all human, and nothing we do is ever quite perfect. I see no shame in that.

                I do think that you raise an important issue about the coordination of CHERI work, which has undoubtedly been challenging. An interesting question is: how many of those issues were preventable and how many were semi-inevitable given the rapid scale-up of distributed sets of people? I’m not sure there are easy answers there, and it might even be too soon to really get a good sense of this. I do hope someone looks at this one day, as there probably will be some interesting lessons to learn that others will find relevant in different contexts.

                1. 4

                  You may remember commenting on an earlier draft in 2021

                  I recall commenting on some of the semantics of malloc in CHERI in your idioms repo. There was a lot of additional discussion there, but as far as I knew you were working on some blog posts and example code and not a paper. The paper came as a surprise. If I’d known that you were aiming for a rigorous academic publication I’d have given a different level of feedback. The feedback I give on ‘I’m playing with this and there’s some weird stuff going on’ and ‘I’m writing a paper about this’ are very different.

                  we worked at all times in the same public git repo you’d built the paper from in 2021

                  The initial CHERI work was done in a fork, with some of it ending up in branches, and eventually merged. It wasn’t fully merged until late last year. For a long time, cheribuild was building a random snapshot. This was there for a few people to test but it was never intended as a version for evaluation.

                  Fundamentally, we admire the work you and your colleagues have done, even though we can see occasional imperfections in it, just like you can see imperfections in our work! We are all human, and nothing we do is ever quite perfect. I see no shame in that.

                  Similarly, we welcome criticism and feedback, but finding known issues in prototypes is not especially useful: if you ask, we can tell you that they’re there and save you some effort. We’ve been working with various folks to do red-teaming exercises and they lead to really useful feedback that we can incorporate into newer versions or use to refine the threat model if we think they’re out of scope. For anything security related, we absolutely want people to try to break it, but we want them to try to break things that we think are secure. Trying to break things that we know are insecure proofs of concept, and finding a subset of the ways that we already know they are broken, is far less helpful.

                  There are quite a few known vulnerabilities in the current snmalloc, depending on your deployment and threat model. There are also hooks for hardening it for particular use cases, with some out-of-tree users. This is why I was surprised that it scored so highly in your eval: in the current default build, anyone with a large uninitialised-stack-use bug can completely break heap memory safety. This goes away with Dapeng’s work, but anyone who can cause an asynchronous signal to be delivered and can inject their own signal handler can do a similar attack. These are the kinds of things that we point people at when they tell us they want to do a security eval of snmalloc, along with the ways that we know of exploiting them. We’ve had some great engagements with folks trying to break these things.

                  I do think that you raise an important issue about the coordination of CHERI work, which has undoubtedly been challenging. An interesting question is: how many of those issues were preventable and how many were semi-inevitable given the rapid scale-up of distributed sets of people?

                  UKRI could definitely have done more, especially in terms of helping to communicate the things that had been done already. A few of the DSbD funded projects are doing things that I tried years ago that didn’t work. Trying them in a different way might lead to something that works, but they haven’t tried to find out why we gave up on those approaches so they’re just reproducing the things that don’t work, rather than trying new things.

                  I would have liked to have a regular call with folks working on (building or breaking) memory allocators, I just didn’t know that you were in that category until after you’d had the ISMM paper accepted and so the discussions that we had accidentally excluded you.

                  1. 4

                    From memory, all but one of the attacks in our paper were done in a single session of a few hours, so they’re not sophisticated in the slightest :) I assumed – and, I soon realised, most other people assumed – that CheriBSD’s allocator would have been hardened against basic attacks such as those. [Please note: I’m not criticising that allocator for being vulnerable to them, because I understand that hardening everything with a small team is infeasible!]

                    I had all sorts of ideas for sophisticated attacks (I prototyped a signal handler attack in hybrid mode, where signal handlers bypassed the DDC, after raising a github issue) but dropped them, because it felt like there was something useful to say even with the simpler stuff. For example, simply naming the attacks feels like a useful contribution for all us newer CHERI researchers, since these aren’t covered in existing CHERI papers. That means that I don’t view our paper as a traditional security evaluation, because it’s incredibly simple-minded. However, I definitely view it as a good thing that snmalloc in its default configuration isn’t vulnerable to that particular set of simple attacks!

      2. 1

        Random non-expert question on “amplification”, which I hadn’t previously thought about. From a distance this sounds related to “programmable tagged architectures” (I know for example of the programming-language-community works on “micropolicies”) – programmable metadata for all pointers. CHERI has relatively large capabilities (256 bits), and also 1-bit tags for each pointer that are presumably stored somewhere. Could allocators somehow steal some of the 256 bits, and/or reuse datastructures that are implemented for 1-bit tag caching, to store their metadata?

        1. 2

          Random non-expert question on “amplification”, which I hadn’t previously thought about

          In that, you’re hardly alone. Most software does not need to contemplate amplification, and yet C doesn’t sufficiently surround the places that do with flashing warning lights, so it remains one of the finer points of CHERIFying software.

          CHERI has relatively large capabilities (256 bits)

          CHERI capabilities have not been 256 bits for a long time now. Morello and CHERI-RISC-V both use (variants of) CHERI concentrate and have 128-bit capabilities for 64-bit addresses; CHERIoT uses 64-bit capabilities for 32-bit addresses.

          presumably stored somewhere

          To date, the “somewhere”s have been:

          Could allocators somehow steal some of the […] bits […] to store their metadata?

          Even ignoring that the 128-bit capability format is pretty full already, allocators do not, in general, need to store metadata within the references to the heap; rather, the metadata is about the heap data structure itself. So while it might be nice, from the allocator’s perspective, to, for example, have the sizeclass of an allocation stamped into its references, this is only part of the puzzle: the allocator needs, for example, to be able to thread that object back onto free lists, which must not be accessible to the allocator’s client. (Moreover, the act of freeing is, most of the time, quite rare relative to the use of an object, so devoting bits in the capability to accelerating the former would probably be the wrong decision.)

          In terms of amplification more generally, the thing you should probably read is the HYDRA programmers’ manual. CHERI does not support the full glory of HYDRA-like capability object types with type owners, instead supporting the two most common use cases:

          • Fully “public” objects, where all fields (and so, all bytes of the underlying memory) are accessible to any reference bearer; the basic CHERI capability is one of these
          • Fully “private” objects, where all fields are inaccessible. These are called “sealed” in CHERI’s lexicon, and the acts of (resp. un)sealing capabilities from (resp. to) the “public” flavor are themselves mediated by other capabilities.

          There have been proposals at various points to extend this taxonomy in different ways…

          • Section D.10 of the CHERI ISA v8 discusses “anti-tamper seals”, specifically to assist free in knowing that the pointer returned is one given by malloc
          • Perhaps only internally, Lawrence Esswood suggested “baby seals”, which allowed for some small number of “private” words followed by “public” words within a capability. Such a design could allow software to build general amplification out of unsealing (and specifically, allocators to safely store in-band metadata for live objects, including pointers back out to internal metadata), but the space cost seems high relative to out-of-band amplification, which has worked for software’s needs so far, and the proposal has not been implemented (AFAIK).
        2. 2

          CHERI has relatively large capabilities (256 bits)

          You might have looked at some very old papers. On 64-bit systems, CHERI capabilities are 128 bits, on 32-bit systems they are 64 bits.

          and also 1-bit tags for each pointer that are presumably stored somewhere

          Yes, exactly how it’s stored is implementation dependent. Morello actually has two different ways of storing it.

          Could allocators somehow steal some of the 256 bits, and/or reuse datastructures that are implemented for 1-bit tag caching, to store their metadata?

          I’m not quite sure what this means. The bits in a capability are all used, but you can use the tag bit to implement tagged unions of a pointer or 128 / 64 bits of other data. That doesn’t help with amplification though.

          Amplification is necessary because the consumer of the malloc APIs should not hold rights to the entire heap, only to objects that the allocator has handed it. The allocator needs to hold a capability to any address ranges used for the heap.

    2. 3

      Tangential but it’s usually more efficient to bump down than bump up in a bump allocator.

      1. 3

        That’s really neat (and obvious in retrospect, which is the best kind of clever). I think it would still work with CHERI. Given a size, we compute the alignment requirements for accurate representation. With bump-up, we then round the start up to this and allocate the size. With bump-down, we’d just round down instead, so the total extra cost should be the same.

        1. 1

          I noticed that CHERI has builtins for efficiently aligning pointers so wasn’t sure if it would help much there but it’s a nice detail to know in general so I thought I’d pass it along.

          1. 2

            The builtins are now upstream, I believe. The problem with aligning on CHERI is that people often cast pointers to integers and back for alignment. With CHERI, you must use intptr_t for that to work. Adding a built-in made this an easier fix.

            Specifically for allocators, there’s an additional problem that the top and bottom of a capability must be more strongly aligned the larger it is. This isn’t a problem for snmalloc because CHERI can represent all of snmalloc’s size classes precisely, but for other allocators you may need to increase the alignment of both the top and bottom to handle this correctly. There’s an instruction to compute the alignment mask for a given size, which you can then use for rounding up or down to the required alignment and to compute the length of the representable allocation.

            The other interesting thing with bump allocators on CHERI is that you only need one pointer. A CHERI capability contains a top, a base, and an address. The set bounds instruction creates a new capability from the current address to a specified length. You just move the address, store a copy of that, set bounds, and return that. If you request bounds that are out of range, the result will be untagged so you can just test for that and not bother checking that the request is in bounds.
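
            A sketch of that one-pointer scheme, simulating the capability fields in plain C, since the real thing is a single register plus a set-bounds instruction (the `cap` struct and all function names here are made up for illustration):

            ```c
            #include <stdbool.h>
            #include <stddef.h>
            #include <stdint.h>

            /* Plain-C model of a capability: on CHERI this is one hardware
             * register carrying base, top, address, and a validity tag. */
            struct cap {
                uintptr_t base, top, addr;
                bool tag;
            };

            /* Model of set-bounds: out-of-range requests clear the tag, so
             * the caller can just test the result instead of range-checking
             * the request separately. */
            static struct cap cap_set_bounds(struct cap c, size_t len) {
                struct cap r = { c.addr, c.addr + len, c.addr, c.tag };
                if (!c.tag || c.addr < c.base || c.addr + len > c.top)
                    r.tag = false;
                return r;
            }

            /* One-pointer bump allocator: set bounds on a copy, then move
             * the address past the new object if the copy is valid. */
            static struct cap bump_alloc(struct cap *heap, size_t len) {
                struct cap obj = cap_set_bounds(*heap, len);
                if (obj.tag)
                    heap->addr += len;
                return obj;
            }
            ```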

      2. 2

        I’ve seen that link passed around quite a bit, but it is nonsense imo. The principal issue they bring up is alignment, but this is something that should be handled with inline logic at the call site (if it is even worth doing—which is not obvious), and it thereby reduces to a constant in nearly all cases (indeed, the entire allocation procedure should be inlined and trivialised further, and the allocation state kept in registers). Meantime, bumping up means that the result pointer is the same as the current pointer prior to the allocation, so bumping up has lower latency than bumping down.

        1. 1

          To implement malloc() with a bump allocator you must align to the largest scalar size on the target architecture before returning to the user. That is the specification of malloc() and all callers of malloc() expect maximally aligned memory.

          You can do your suggestion in a custom memory allocation routine, where the caller should know to expect unaligned memory.

          1. 2

            No, malloc() can return allocations that are too small to hold a maximally-aligned type, and small allocations can have smaller alignments. A “fundamental” alignment just means any non-extended alignment, i.e. any alignment that has not been made bigger than usual using _Alignas.

            1. 1

              Are you saying you can implement malloc() without any code that does alignment that is compatible with all available scalar types before returning to the caller?

              1. 5

                No, obviously it must have code to ensure its return value is aligned enough. But for small allocations, the alignment does not need to be larger than the largest power of 2 less than or equal to the allocation size.

                For example, if you call malloc(4) the alignment does not need to be larger than 4, on any architecture.
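
                A sketch of that rule (the helper name is made up): the required alignment is the fundamental alignment, capped for small allocations at the largest power of two not exceeding the size:

                ```c
                #include <stdalign.h>
                #include <stddef.h>

                /* Smallest alignment a conforming malloc must guarantee for
                 * an allocation of `size` bytes: the fundamental alignment,
                 * reduced for small sizes to the largest power of two that
                 * does not exceed the size. */
                static size_t required_alignment(size_t size) {
                    size_t a = alignof(max_align_t);
                    while (a > 1 && a > size)
                        a >>= 1;
                    return a;
                }
                ```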

                1. 1

                  Thanks for that additional detail.

                2. 1

                  Aaaand this is why the new allocation API designs always pass an explicit alignment along with the size.

          2. 1
            1. You would not use a bump allocator to implement malloc—that would be incredibly stupid—and that’s not what your link is about.

            2. You could implement malloc as an inline function in a header file.

            1. 2

              You would not use a bump allocator to implement malloc—that would be incredibly stupid—and that’s not what your link is about.

              For batch processes that are short-lived, using a bump allocator to implement malloc() can significantly speed up the execution of the process. This was particularly common in compilers, the most famous example being early versions of GCC, which didn’t ever call free().

              You could implement malloc as an inline function in a header file.

              Yet that header-file implementation of malloc() still must align the pointer before returning it to the user.

              1. 2

                I think perhaps my original comment was not clear. Let me restate the salient point. Most allocations have a constant size that can be determined at the call site (those which do not are generally arrays, which tend to grow large, and which it is therefore generally acceptable to take more time to allocate). Hence, if some or all of the allocation logic is inlined, rounding the allocation size to a multiple of the desired alignment can be done at compile time, erasing the supposed performance disadvantages of bumping up rather than down.

                1. 1

                  There will still inevitably be situations where the compiler can’t prove anything about the state of the bump allocator and will need to synthesize the alignment code. In those cases, if bumping down, the alignment code will be more efficient. The degree to which this matters depends on how smart your compiler is. That degree is always non-zero, however, so it’s a good rule of thumb to err on the side of bumping down.
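
                  Concretely, the rounding the compiler has to synthesize differs by one data-dependent add (assuming a power-of-two alignment): aligning down is a single mask, while aligning up needs the add first:

                  ```c
                  #include <stdint.h>

                  /* Bump-down: one AND clears the low bits. */
                  static uintptr_t align_down(uintptr_t p, uintptr_t align) {
                      return p & ~(align - 1);
                  }

                  /* Bump-up: add-then-mask, one extra data-dependent add
                   * before the mask can be applied. */
                  static uintptr_t align_up(uintptr_t p, uintptr_t align) {
                      return (p + align - 1) & ~(align - 1);
                  }
                  ```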

                  1. 1

                    Except for the other factor I mentioned, which is that bumping up has lower latency.

                    1. 1

                      It’s a fair point that on an out-of-order superscalar CPU you can avoid a data dependency by bumping up with a pre-aligned heap, thus bringing the cost of the allocation to a relative zero. In that case, you can make it a precondition of the data structure that the heap is aligned, to reap the benefit of zero latency even when a runtime alignment is necessary. e.g.

                      #include <stddef.h>   /* size_t, max_align_t */

                      /* Precondition: heap->base is already aligned to _Alignof(max_align_t). */
                      struct Heap {
                          unsigned char *base, *end;
                      };

                      void *alloc(struct Heap *heap, size_t amt) {
                          unsigned char *result = heap->base;   /* no data dependency on the rounding */
                          /* Round the size, not the pointer, up to a multiple of max_align_t. */
                          size_t aligned_size = (amt + sizeof(max_align_t) - 1) / sizeof(max_align_t) * sizeof(max_align_t);
                          unsigned char *new_base = result + aligned_size;
                          if (new_base > heap->end) {
                              return NULL;
                          }
                          heap->base = new_base;
                          return result;
                      }
                      

                      The nice benefit of this code is that it’s even easier for the compiler to prove that the alignment is unnecessary, since the amt argument is likely constant, as you mentioned earlier: the rounding applies to the size rather than to the data structure’s base pointer, which may be a global in unknown state. You waste some memory in this implementation, but in terms of CPU efficiency on modern CPU architectures it’s probably optimal.

              2. 1

                This was particular common in compilers, the most famous example being early versions of GCC, which didn’t ever call free().

                Huh. I’d heard at least one of the Borland Turbo … compilers did it, but I’d never heard about early GCC.