Our first boards are expected to arrive tomorrow. I’m really excited to start playing with them. I started working on CHERI almost 10 years ago. Back then, we had a 100 MHz MIPS softcore in FPGA that was very similar to the MIPS R4K - state of the art circa 1991 (a useful age, since any patents required to implement it had expired). Software development on a 100 MHz CPU with PIO access to a slow SD card is not a fun experience. We later got a QEMU implementation that was a lot faster (around 200-300 MIPS, where the 100 MHz core managed about 0.7 IPC, and with very fast I/O via VirtIO), but still a long way from a modern environment. The experimental platform that we’re mostly using today is Toooba, an out-of-order RISC-V core similar to a Cortex-A8 in terms of pipeline structure, which runs in FPGA at 50 MHz (though it is faster than the in-order MIPS core) in a dual-core configuration.
Morello is a 2.5GHz modified Arm Neoverse N1, which was Arm’s flagship server core until quite recently. It’s the same core as in AWS Graviton2. Clock for clock, I expect it to be significantly faster than Toooba and the clock speed is 50 times greater (and it has twice as many cores). That’s going to be a massive improvement for software development.
The CHERI architecture has come a long way since then as well. When I started, capabilities were 256 bits and the software stack was a tiny microkernel with all of the CHERI-specific bits in hand-written assembly. Capabilities didn’t have an offset / address field, so if you wanted to use them as pointers you had to carry an integer offset around with the capability (which, with alignment requirements, made a 512-bit structure). You could increase the base, but you couldn’t move it back again, so you couldn’t pass a pointer to the middle of an array / structure that still allowed access to the whole allocation. Tag bits weren’t stored in registers, so you couldn’t implement memcpy: if you didn’t use capability instructions, you didn’t copy capabilities, and if you did, you’d trap on the first non-capability data that you saw.
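To make that concrete, here’s a rough sketch (in hybrid-mode CHERI C, with made-up names) of what a usable pointer looked like in that era - the capability plus the integer offset you had to carry around with it:

```c
#include <stddef.h>

/* Hypothetical sketch of the early-CHERI pointer idiom.  The 256-bit
 * capability had no offset / address field, so code carried a separate
 * integer offset, and the capability's alignment requirement padded the
 * pair out to 512 bits. */
struct old_cheri_ptr {
    void * __capability cap; /* 256-bit capability: base, length, permissions */
    size_t offset;           /* how far into the object you actually are */
    /* padding up to the capability's 32-byte alignment -> 64 bytes total */
};
```

Every dereference meant combining the two by hand; with the offset / address field in modern CHERI capabilities, none of this is needed.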
We made enough improvements that it’s now possible to compile large C/C++ codebases such as the FreeBSD base system and most of KDE as pure-capability CHERI code, with a fairly small amount of porting effort. I think our first port of tcpdump had more lines of code changed than the recent KDE port (including xlib and Qt) had in total.
Morello has a few things that aren’t in the existing prototypes that I’m also excited to play with. There’s a new way of doing cross-domain calls that avoids using up space in the type field of capabilities by adding an indirection. All entry points can be sealed with the same type and they point to a pair of code and data capabilities. The jump instruction unseals the capability and loads the code and data capabilities at the target (one into the program counter, the other into a normal register).
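In very rough C pseudocode, the mechanism amounts to something like the following - none of these names are real Morello mnemonics or intrinsics; unseal(), install_pcc() and set_data_reg() are stand-ins for what the hardware does inside the single branch instruction:

```c
/* Sketch only: a cross-domain entry point is a sealed capability to a
 * {code, data} pair, with every entry point sealed using the same type. */
struct entry_pair {
    void * __capability code; /* becomes the new program-counter capability */
    void * __capability data; /* handed to the callee in a normal register */
};

/* Hypothetical stand-ins for hardware behaviour, not real intrinsics. */
extern struct entry_pair * __capability unseal(struct entry_pair * __capability sealed);
extern void install_pcc(void * __capability code_cap);
extern void set_data_reg(void * __capability data_cap);

void branch_to_sealed_entry(struct entry_pair * __capability sealed)
{
    /* The caller only ever holds the sealed (non-dereferenceable) capability;
     * the branch unseals it and loads both capabilities from the target. */
    struct entry_pair * __capability pair = unseal(sealed);
    install_pcc(pair->code);  /* jump into the callee */
    set_data_reg(pair->data); /* give the callee its data capability */
}
```

The point of the indirection is that every entry point consumes the same object type rather than one per target, so the limited type space in the capability encoding isn’t used up.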
Do you know if it’s possible to get a board or two?
I am working on a C/C++ build system/package manager (build2) and we have ~300 C/C++ packages (https://cppget.org) that are continuously built and tested on various platforms/compilers (https://cppget.org/?builds). Since all the packages are built with the same build system, it is pretty easy for us to try to build them for a new platform/compiler (normally all we have to do is add support in the build system). I think it would be interesting to try to test them on CHERI and see what it uncovers.
The Digital Security by Design challenge fund has a Technology Access Programme, which gives successful applicants a £15K grant and a Morello board for 6 months. The University of Cambridge team is planning on doing ports build runs, which will see how many of the 30K things in the FreeBSD ports collection build.
If you’ve got spare CPU cycles, there’s also an Arm Fixed Virtual Platform emulator and a QEMU port that you can use to build and test things, though both are a lot slower than Morello. We’re planning on putting 30 Morello systems in a rack connected to GitHub Actions for CI for various things that we care about (probably around May), so if you ping me in a few months I can try running your test builds. Do you run tests for these packages as well? Roughly how long would you expect it to take to do a full build on a quad-core 2.5GHz machine? If it’s not too long, then we might be able to add a regular run.
Thanks for the information, I will look into it.
Do you run tests for these packages as well?
Yes, we do. That would be the interesting part in this case.
Roughly how long would you expect it to take to do a full build on a quad-core 2.5GHz machine?
My back-of-the-envelope estimate is around 8 hours.
If it’s not too long, then we might be able to add a regular run.
Thanks, though we have our CI infra that runs on bare metal. Not sure it will be easy to integrate it with your setup.
EDIT:
The Digital Security by Design challenge fund has a Technology Access Programme, which gives successful applicants a £15K grant and a Morello board for 6 months.
From their FAQ this is only available for UK-based businesses.
Yes, we do. That would be the interesting part in this case.
Yup. We’ve been working hard to make things that will break at run time at least emit warnings, but execution tests are much better.
My back-of-the-envelope estimate is around 8 hours.
That seems like something we could easily put in a weekly, possibly daily, CI job, and if someone is actively working on fixing a particular package then we could probably give them an account on a machine for a little while.
From their FAQ this is only available for UK-based businesses.
That’s true. You might try reaching out to a company like Embecosm and see if they’d be interested in being the designated holder of the grant and giving you access to the systems?
That seems like something we could easily put in a weekly, possibly daily, CI job, and if someone is actively working on fixing a particular package then we could probably give them an account on a machine for a little while.
While this doesn’t fit our CI model well, I will ping you in a few months to see what’s available (we have more of an “online” CI service where anyone can submit a CI job at any time and expect to see the results quickly rather than the more commonly found “batch” CI).
Also, are FreeBSD jails fully functional on CheriBSD (I assume that’s what you will be running)? Currently we run all our CI tasks in QEMU/KVM virtual machines and running them directly on the host doesn’t feel robust.
You might try reaching out to a company like Embecosm and see if they’d be interested in being the designated holder of the grant and giving you access to the systems?
Thanks for the suggestion, but browsing the Technology Access Programme pages I got a distinct whiff of a dysfunctional bureaucracy that I would rather not get involved with.
Also, are FreeBSD jails fully functional on CheriBSD (I assume that’s what you will be running)? Currently we run all our CI tasks in QEMU/KVM virtual machines and running them directly on the host doesn’t feel robust.
Yes. We’re planning on network booting the pool from a read-only NFS share so that they can have a local scratch space on their disk but be completely reset between CI runs. We might also use jails to simplify some of the management parts.
This is super cool. I work at NVIDIA as a CPU validation engineer. I have been meaning to understand CHERI after stumbling across it in the past. Now I probably have a work-related reason to do so.
Thanks for this excellent post! I’m interested in CHERI and I looked at the announcement hoping to see some mouth-watering performance numbers, but didn’t find them. Your perspective here is quite interesting and also has nice numbers. It should be a blog post :-)
I think Arm is quite nervous about performance numbers because there hasn’t really been any Morello-specific optimisation on the software stack yet. The C++ ABI, for example, is almost a direct transliteration from the Itanium ABI with s/address/capability/. There’s probably quite a bit of headroom for optimisation in the default calling conventions.
When I did the original LLVM CHERI work, there were a few optimisations that didn’t work and were difficult to fix and so I just disabled them for CHERI targets. Several of these were related to vectorisation and so didn’t matter with the MIPS / RISC-V prototypes, where we didn’t have a vector unit at all, but will make a big difference with Morello where there is one. The Arm and Linaro folks have been doing superb work on these but I don’t know what the status is.
It’s important to think of Morello as an upper bound on the overhead of CHERI. The software stack hasn’t been heavily optimised. The ISA was a really great bit of engineering work by Richard, Graeme, and friends at Arm, but it wasn’t designed with the benefit of data on instruction mixes for large codebases built with a moderately optimised compiler (getting this data is one of the goals of the Morello program). And the microarchitecture is a high-performance core optimised for non-CHERI workloads, adapted for CHERI on a very rapid turn-around (I am incredibly impressed that the Arm microarchitects managed to retrofit CHERI support to the Neoverse N1 in the very tight timelines that UKRI gave them).