1. 32
    1. 10

      Okay sorting through this trying to explain it to myself, who still has limited experience with embedded systems:

      • RISC-V is an open-standard set of CPU instructions
      • People/companies can design CPUs which implement the RISC-V instruction set in the SystemVerilog language
      • People/companies then send these designs to CPU fabrication companies which create the CPU described
      • A company called lowRISC created an open-source RISC-V CPU design called Ibex which “has seen multiple tape-outs” (been sent to fabrication)
      • Microsoft Research (MSR) created cheriot-ibex which extends the Ibex design with hardware memory safety capabilities; unclear whether this variant has seen any tape-outs
      • The cheriot-ibex CPU design adds instructions beyond the standard RISC-V set; this instruction set architecture (ISA) is specified in the cheriot-sail spec, which cheriot-ibex can be thought of as implementing
      • SystemVerilog CPU designs can be emulated, which enables people to run software on them even though the physical CPU does not yet exist
      • MSR created cheriot-rtos, a real-time operating system that runs on the (presumably emulated) cheriot-ibex CPU
      • MSR created an LLVM backend to compile existing C++ programs to target the cheriot ISA and run under cheriot-rtos
      • The compiled C++ programs get a whole bunch of memory-bounds checks & security checks for free (assuming they can already be built with LLVM), since those capabilities are built into the hardware
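
      For example (if I’ve understood that last point right), I imagine unmodified C++ like this sketch of mine would fault at runtime on CHERI hardware instead of silently corrupting memory:

      ```cpp
      #include <cstdio>

      // Deliberately out-of-bounds store. On a conventional CPU this silently
      // corrupts whatever sits next to `buffer`; on a CHERI target the pointer
      // carries hardware-enforced bounds, so (as I understand it) the store traps.
      void store(int *p, int index, int value)
      {
          p[index] = value;
      }

      int main()
      {
          int buffer[8];
          store(buffer, 8, 42); // one element past the end
          std::puts("a conventional CPU happily gets here");
          return 0;
      }
      ```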

      Is this all basically correct?

      1. 9

        More or less. Ibex was originally created by ETH Zurich, but is maintained by lowRISC now. lowRISC is a CIC (somewhere between a limited company and a non-profit) set up largely by the same people as Raspberry Pi, with the goal of making completely open hardware. The CHERIoT Ibex wasn’t done by MSR; it was Azure Silicon. The emulator that we’re shipping is built from the Sail, not Ibex. This is faster than a cycle-accurate simulator. You can also run the Ibex core with a Verilog simulator such as iverilog or verilator, or in an FPGA. I have it running in an FPGA on my desk at 20 MHz.

        1. 3

          Cool, thanks! Is cheriot-ibex hoped to be used as the compute element of an Azure Sphere board or something? Are any other entities looking at building a board around it? I saw on the FAQ that a rust LLVM backend is being built for it, rust apps would then be able to take advantage of the improved security properties of the system through compartments?

          1. 2

            I can’t comment on product plans. It’s now open source, so any embedded SoC vendors can incorporate it in their products and we have an off-the-shelf software stack that they can use. The folks running the Digital Security by Design programme have expressed interest in having an IoT SoC fabbed with it, since a lot of the folks using Morello actually want something tiny.

            A couple of groups are working on CHERI Rust. We hope to intercept their work. Adding UPtr to Rust and the strict provenance model made this fairly easy. Rust code should be fairly easy to port. One of the open questions there that I’m particularly interested in is how far we’re able to push the Rust type system in mutual distrust. For example, we can enforce deep immutability in the hardware, so a Rust cross-compartment call that had an immutable borrow can just work. On the other hand, passing any mutable pointer to another compartment means that the Rust compiler can no longer trust anything about type safety for that object, which may be harder.
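
            A rough caller-side sketch in C++ of the hardware mechanism (the permission macro below is the one predefined by standard CHERI Clang targets and is only a placeholder for CHERIoT’s own permission encoding):

            ```cpp
            #include <cstddef>

            struct SensorReading
            {
                int samples[16];
            };

            // Implemented in another, mutually distrusting compartment.
            int untrusted_consume(const SensorReading *r);

            int share_read_only(SensorReading *r)
            {
                // Derive a capability with the store permission cleared before passing it
                // across the boundary. __builtin_cheri_perms_and is the CHERI Clang builtin
                // for clearing permission bits; the macro is a placeholder for CHERIoT's
                // compressed permission format.
                void *stripped = __builtin_cheri_perms_and(
                    static_cast<void *>(r),
                    ~static_cast<std::size_t>(__CHERI_CAP_PERMISSION_PERMIT_STORE__));
                // Even if the callee casts away const, the hardware refuses stores through
                // this pointer (real deep immutability also clears the load-mutable
                // permission so that pointers loaded via it are read-only too).
                return untrusted_consume(static_cast<const SensorReading *>(stripped));
            }
            ```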

    2. 5

      Hah, I think I know what I’m doing this weekend.

      I’m super enthusiastic about CHERI because I secretly and privately suspect that hardware-level security is essential for systems security. The CHERI approach is sometimes contrasted with e.g. Rust’s, in that Rust prevents unsafe program flow at compile time, whereas CHERI disallows it at runtime even if that means a crash, but I think the two are complementary and that CHERI’s is actually essential for some types of software. I don’t think CHERI’s value lies primarily in protecting unsafe legacy code; I think it’s a very good foundation for hosting future code.

      If you look at the kind of vulnerabilities that were found in unsafe Rust code over the years, many (if not most) of them aren’t really textbook buffer overflows, things like “I have this eight-item buffer, and this index, and I didn’t check the index, so now I’m writing the 65536th item and oooooh there goes your stack”.

      Lots of them are actually due to things like premature returns out of an unsafe context, which introduce inconsistent state into otherwise safe constructs, so you wind up doing the wrong thing in the safe code that’s executed afterwards. That is, many of them work by injecting incorrect state at runtime into contexts that have been checked at compile time. As long as hardware models are memory unsafe, they will need memory-unsafe constructs to manipulate them, and will forever be subject to the tyranny of Shea’s law: interfaces are the primary locations for screwing up a system.
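
      A contrived C++ illustration of the pattern I mean (my own sketch, not taken from any of those CVEs): a low-level routine breaks an invariant “temporarily”, bails out early, and the perfectly well-checked code downstream then trusts the broken invariant.

      ```cpp
      #include <cstddef>
      #include <stdexcept>

      struct PacketBuffer
      {
          // Invariant: the first `length` bytes of `bytes` are initialised.
          std::size_t length = 0;
          unsigned char bytes[256];

          // Assume callers guarantee length + n <= 256.
          void append_validated(const unsigned char *src, std::size_t n,
                                bool (*validate)(unsigned char))
          {
              std::size_t old = length;
              length += n; // invariant broken: length grows before the bytes are copied
              for (std::size_t i = 0; i < n; i++)
              {
                  if (!validate(src[i]))
                      throw std::runtime_error("bad byte"); // early exit, length still inflated
                  bytes[old + i] = src[i];
              }
          }
      };

      // Any later, innocent-looking loop over [0, length) now reads uninitialised
      // memory: the bug is injected state, not a missing bounds check.
      ```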

      1. 9

        I’m a huge fan of safe languages (my main objection to Rust is that it has so many caveats next to its safety) for any domain where they’re usable. One of my goals with this project was to demonstrate that being able to rely on memory safety for your lowest-level systems code (including assembly) has a huge impact on the things that you can build.

        One of the examples in the repo has a JavaScript VM. The FPGA on my desk is using a lightly compartmentalised MQTT/TLS/TCP/IP stack to fetch JavaScript bytecode from the Azure IoT hub, which it then runs in a compartment with access only to the memory allocator, the UART, and the MMIO register for the LEDs on the board. The JavaScript VM is not trusted to isolate the JavaScript: a bug in it gives no more rights than the JS has anyway. It’s currently running some code that shows some patterns on the LEDs. A complete compromise of this compartment gives no way of accessing the network stack.

      2. 6

        Disclaimer - I’m a co-author of this blog.

        I strongly agree with you. Regarding Rust and CHERI, I think it’s a mistake to frame the debate as “Rust vs CHERI”. We need both - Rust is the right, wise choice for most codebases (runtimes, parsers, and most parts of a modern OS). However, as we wrote in our first blog on our project - “The core parts of any OS, including an RTOS, involve doing unsafe things. The memory allocator, for example, has to construct a notion of objects out of a flat address range. Safe Rust can’t express these things and unsafe Rust doesn’t give us significant benefits over modern C++ code that uses the type system to convey security properties.” (ref: https://msrc.microsoft.com/blog/2022/09/whats-the-smallest-variety-of-cheri/ ).
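
        To make that quote concrete, here’s a minimal sketch (illustrative only, not our actual allocator) of the step that no safe type system can express: conjuring a typed object out of an untyped, flat range of memory.

        ```cpp
        #include <cstddef>

        // Minimal free-list initialisation: the allocator has to invent the very first
        // object by casting raw memory. No safe type system can check this step; CHERI
        // can at least bound, and later revoke, the pointers the allocator hands out.
        struct FreeBlock
        {
            std::size_t size;
            FreeBlock *next;
        };

        alignas(FreeBlock) char heap_start[4096]; // stand-in for the flat range the loader provides
        FreeBlock *free_list;

        void heap_init(std::size_t heap_bytes)
        {
            free_list = reinterpret_cast<FreeBlock *>(heap_start); // no object exists here yet
            free_list->size = heap_bytes;
            free_list->next = nullptr;
        }
        ```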

        That’s precisely where CHERI comes into play. Unlike other options, it introduces deterministic mitigations for most memory-safety bug classes. And in addition, it doesn’t rely on secrets (unlike other mitigation strategies that rely on secrets and are therefore vulnerable to architectural side channels).

        I’m very excited about our approach because scaling CHERI to small cores can revolutionize the IoT and embedded industries. Think about how many different C/C++ codebases these environments have (mostly C, by the way). And on top of that, IoT/embedded products usually don’t have modern memory safety mitigations in place. CHERIoT is a fantastic solution to that - get the new hardware, use the toolchain to rebuild your code - and that’s it. You have deterministic mitigations for spatial safety, heap temporal safety, pointer integrity, and so on.

        1. 1

          Once you compile your C/C++ program to target CHERIoT, is it very loosely analogous to compiling it with (much stronger versions of) sanitizers? So code that in the past had seemingly “worked” while doing some weird undefined-behavior memory stuff will cause a crash?

          1. 4

            In addition to Saar’s comment, it won’t necessarily crash. If the compartment has an error handler then this will be called before any corruption occurs. The compartment can then try to recover (typically by resetting to a known-good state and returning to the calling compartment).
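
            From memory (so treat the exact names and header as approximate rather than a reference), the handler in cheriot-rtos looks roughly like this:

            ```cpp
            #include <compartment.h> // cheriot-rtos SDK header; name from memory
            #include <cstddef>

            // Hypothetical compartment-specific cleanup, e.g. reinitialise this
            // compartment's globals to a known-good state.
            void reset_to_known_good_state();

            // Invoked when a capability fault occurs inside this compartment, before
            // any corrupted state can escape it.
            extern "C" ErrorRecoveryBehaviour
            compartment_error_handler(ErrorState *frame, size_t mcause, size_t mtval)
            {
                reset_to_known_good_state();
                // Unwind to the calling compartment; the alternative is to fix up the
                // register state in `frame` and resume with InstallContext.
                return ErrorRecoveryBehaviour::ForceUnwind;
            }
            ```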

          2. 3

            The short answer is “yes”, but emphasizing “much stronger versions of” :)

            Sanitizers can help you with spatial safety guarantees - they add bounds checks before any dereference, so all loads/stores should be protected by bounds checks (by the way, Firebloom does that in iBoot, see this for details: https://saaramar.github.io/iBoot_firebloom/ ).

            Besides spatial safety, there are a few chosen additional points I would like to emphasize:

            1. Lifetime ownership: CHERIoT has a deterministic mitigation for heap temporal safety - UAFs, double frees and dangling pointers are not exploitable because of the Load Barrier and revocation. However, with sanitizers, if you free an instance of type A, and reclaim it with another object while holding a dangling pointer of type A*, your sanitizer can’t catch that (see the sketch at the end of this comment).

            2. Pointer integrity: sanitizers can’t enforce pointer integrity. Attackers can corrupt/fake pointers freely or confuse them with integers. With CHERIoT, it’s impossible to forge pointers, and no confusion is possible between references to memory and integers. This is a powerful property enforced in the architecture.

            3. In our case, the compiler is not in the TCB for all the security properties (as it is with sanitizers). Look at the “Intracompartment” subsection in the blog for details.

            And of course, on top of that, the cost of sanitizers is high: they significantly increase code size and hurt performance.
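
            To make point 1 concrete, here’s a deliberately simplified sketch of mine (not from any real codebase): the freed slot is reclaimed by an allocation of a different type, so a quarantine/poisoning-based sanitizer can easily miss the dangling use, while on CHERIoT the stale pointer is revoked before the memory is reused, so the access traps deterministically.

            ```cpp
            #include <cstdio>
            #include <cstdlib>

            struct Session { bool is_admin; };
            struct Message { char body[sizeof(Session)]; };

            int main()
            {
                Session *s = static_cast<Session *>(std::malloc(sizeof(Session)));
                s->is_admin = false;
                std::free(s); // `s` is now dangling

                // An attacker-influenced allocation may reclaim the same slot.
                Message *m = static_cast<Message *>(std::malloc(sizeof(Message)));
                m->body[0] = 1; // can overlap s->is_admin on a conventional heap

                if (s->is_admin) // use-after-free read through the stale, type-confused pointer
                    std::puts("privilege check bypassed");

                std::free(m);
                return 0;
            }
            ```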

        2. 1

          I have a weird question for you, along those lines: is it possible to combine software-based mechanisms (like the borrow checker) and hardware-based ones like CHERI, to get the best of both worlds? IIRC, CHERI has some (small amount of) overhead, and it would be interesting to see what a language could do for memory safety if it knew it was targeting CHERI.

          From my reading on CHERI, it likely isn’t possible because it basically means bypassing very pervasive and universal restrictions, and I’m not sure it even makes sense to ask if it can be bypassed. Perhaps if the compiler threaded a “universal permission” value of sorts through every function, and the compiler could use that permission when trying to dereference something it knows is live?

          Context: I’m the lead of the Vale programming language (https://vale.dev). We’re hoping to finish our “region borrow checker” proof of concept this week (fingers crossed!); it currently falls back to generational references (think memory tagging), and I’ve been wondering how it might integrate well with CHERI.

          1. 4

            > IIRC, CHERI has some (small amount of) overhead, and it would be interesting to see what a language could do for memory safety if it knew it was targeting CHERI.

            Combining in this way is hard because (as you say) you need to have a mechanism that allows verified code to turn off checks, but which can’t be used by untrusted code (if it can, you have no security). The overhead of CHERI is pretty low. It’s basically the overhead of doubling pointer sizes. I think the worst we see on CHERIoT Ibex is around 15%, and it’s that high because the core is aggressively optimised for area at the expense of performance and so doesn’t increase the size of the memory bus. In exchange for that, you get spatial and temporal memory safety, no pointer-data type confusion, and no ability to fake pointers.

            It’s more interesting to me to use static type system properties to get a larger set of guarantees than the ones that the hardware can provide.

            That said, you should look at the JavaScript interpreter in the repo. It uses 16-bit integers for JS pointers. They are treated as indexes into a small set of capabilities that come from the heap allocator. Within type-safe JavaScript code, you can use tiny pointers. When you interop with C or assembly, you can expand them to full pointers. For extra fun, the JS GC integrates with our temporal safety model so any pointer to the GC heap is implicitly invalidated when the GC runs.
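
            Very roughly, the shape of that idea (an illustrative sketch, not the interpreter’s actual code, and with no GC hooks shown):

            ```cpp
            #include <array>
            #include <cstdint>

            struct HeapObject; // opaque JS heap cell

            // JS values carry a 16-bit index; only this table holds real pointers
            // (capabilities that came from the heap allocator).
            class TinyPtrTable
            {
                std::array<HeapObject *, 4096> slots{};
                std::uint16_t next = 0;

            public:
                using TinyPtr = std::uint16_t; // what type-safe JS code passes around

                TinyPtr intern(HeapObject *full) // store a capability, hand out an index
                {
                    slots.at(next) = full; // throws if the table is full (simplification)
                    return next++;
                }

                HeapObject *expand(TinyPtr t) // C/assembly interop boundary: index -> capability
                {
                    return slots.at(t); // bounds-checked lookup
                }
            };
            ```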

      3. 1

        This is something I don’t hear often, I’d love to learn more! Do you happen to recall any real-world examples of unsafe code causing unsafety in the surrounding safe code? I can work up a trivial example of it, but it would be interesting to read about a real case.

        Also, are there certain domains that should care more about this kind of thing than others? Safety critical systems, or ones particularly likely to be attacked perhaps?

        (I’ve never heard of Shea’s law, that’s a good one.)

        1. 1

          > This is something I don’t hear often, I’d love to learn more! Do you happen to recall any real-world examples of unsafe code causing unsafety in the surrounding safe code? I can work up a trivial example of it, but it would be interesting to read about a real case.

          Sure, there’s actually a cool paper that inventories several classes of such bugs, and develops a formal framework for their analysis: https://raw.githubusercontent.com/sslab-gatech/Rudra/master/rudra-sosp21.pdf . The one I was alluding to above was CVE-2020-36317.

          > Also, are there certain domains that should care more about this kind of thing than others? Safety critical systems, or ones particularly likely to be attacked perhaps?

          That’s kind of why I found that bug interesting – actually, that’s how I ran into the RUDRA paper. I wasn’t specifically looking into memory safety violation patterns; what I was looking into was error handling in unsafe contexts, because that’s pretty relevant for safety-critical systems. I was recently reminded of it in another thread here (I think @david_chisnall posted a link a few weeks ago), and I kept coming back to it afterwards because I’m now specifically interested in its subject.

          But I think it’s a matter that has significantly wider relevance than just safety-critical code. Interfacing with underlying OS features routinely involves going through unsafe code. So, frankly, I think pretty much all fields should care about this kind of thing, because pretty much all of them rely on it to some degree.

          > (I’ve never heard of Shea’s law, that’s a good one.)

          It’s #15 on an equally notorious list :-).