1. 15

This article describes some compiler/hardware features with a focus on llvm-project implementations. I am wondering whether I should use a different title (though the URL needs to stay unchanged). Not simply “… in LLVM”, as I often update articles and don’t want to exclude schemes implemented in other compilers…

  1. 6

    This is a great overview. It’s worth noting that there was a follow-up to the CPI paper that fairly robustly demonstrated that you can’t have a secure CFI scheme in the absence of memory safety (one of the CPI authors got into a shouting match with that paper’s author at Oakland, not their finest hour). These schemes worked well when they came out because they broke existing attacks, but most of them are not robust in the presence of an information-disclosure bug. Speculative side channels have meant that it’s now fairly easy for an attacker to find a disclosure bug and leak the contents of arbitrary bits of memory (and detect whether pages are mapped). For things like RETGUARD, attackers first leak the xor’d value, then overwrite both it and the return address.
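
    As a sketch of why the xor trick only holds until a disclosure bug appears, here is a toy model of a RETGUARD-style check (the cookie handling and names below are illustrative assumptions, not OpenBSD’s actual codegen):

      #include <cstdint>
      #include <cstdio>

      // Per-function random value in the real scheme; fixed here for the demo.
      static uint64_t retguard_cookie = 0xdeadbeefcafef00d;

      struct Frame {
          uint64_t guarded_ret;  // return address ^ cookie, saved by the prologue
          uint64_t ret_addr;     // the actual return-address slot
      };

      // Epilogue check: fail if the xor'd copy no longer matches the slot.
      static bool check_return(const Frame &f) {
          return (f.guarded_ret ^ retguard_cookie) == f.ret_addr;
      }

      int main() {
          Frame f{0x401000 ^ retguard_cookie, 0x401000};
          std::printf("intact frame passes: %d\n", check_return(f));

          // With a disclosure bug, leak both slots and recover the cookie...
          uint64_t leaked_cookie = f.guarded_ret ^ f.ret_addr;
          // ...then overwrite both the return address and its guarded copy.
          f.ret_addr = 0x666000;
          f.guarded_ret = f.ret_addr ^ leaked_cookie;
          std::printf("forged frame also passes: %d\n", check_return(f));
      }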

    That doesn’t mean that these things have no value. For example, without CFI there have recently been some fun exploits involving overwriting a single byte of the return address. These are not possible with RETGUARD, but a single arbitrary out-of-bounds stack write can still completely bypass RETGUARD and similar schemes. Increasingly, attacker toolkits come with bypasses for them.

    My favourite CFI bypass (which doesn’t work against XFG or the PaX scheme) is from the Counterfeit Object-Oriented Programming paper. It depends on being able to overwrite objects in an array so that they use the wrong vtables: a loop that iterates over everything in the array and calls the same method on each ends up calling different methods. Those methods might expect to be invoked on a larger class, and so can update state on other objects in the array, including overwriting their vtables. For most non-trivial C++ codebases, this is sufficient to construct a Turing-complete weird machine. Every (interprocedural) control-flow arc is to a valid function entry point and every return is to a valid return address; it’s built purely from memory- and type-safety violations. CHERI makes these very hard to exploit, though not impossible, and CHERI combined with something like XFG would make it much harder.
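
    To make that primitive concrete, here is a small, well-defined C++ model of it, with explicit function-pointer tables standing in for the compiler’s real vtables and the out-of-bounds write simulated in-bounds (all names here are illustrative):

      #include <cstdint>
      #include <cstdio>

      struct Object;
      using Method = void (*)(Object *);

      struct VTable { Method handle; };

      struct Object {
          const VTable *vptr;   // every object starts with a vtable pointer
          uint64_t      field;  // "Small" has one field; "Big" assumes more
      };

      static void small_handle(Object *self) {
          std::printf("small object %llu handled\n", (unsigned long long)self->field);
      }

      // Big's method assumes a larger object and writes past `field`; here the
      // overflow is modelled as writing into the Object that follows in memory.
      static void big_handle(Object *self) {
          self[1].vptr = nullptr;  // tramples the neighbouring object's vptr
          std::puts("big object handled, neighbour's vptr clobbered");
      }

      const VTable small_vt{small_handle};
      const VTable big_vt{big_handle};

      int main() {
          Object arr[2] = {{&small_vt, 1}, {&small_vt, 2}};

          // The benign loop COOP abuses: one call site, dispatch via the vtable.
          for (Object &o : arr) o.vptr->handle(&o);

          // The memory-safety violation: swap in the wrong vtable. Every call
          // still lands at a valid function entry point, but arr[0] now behaves
          // as the larger class and its method corrupts arr[1].
          arr[0].vptr = &big_vt;
          arr[0].vptr->handle(&arr[0]);
      }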

    1. 2

      Thanks, this is an excellent survey of (some of) the techniques used!

      1. 2

        I have never understood why no one simply uses separate call and data stacks. The performance should be just as good (perhaps slightly better—IIRC, you get to shave an instruction off of the prologue), and it can be made ABI backwards compatible (and close to forwards compatible, though there are some wrinkles). When the call stack is put at a random address, it should be pretty much impossible to leak it (since there is no reason to create a pointer to it), and so ROP becomes basically impossible.

        1. 1

          You’d need an extra register and you’d need to push and pop things from two stacks in the common case. This is what SafeStack does and you can see the published perf overheads.
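
          For reference, what SafeStack conceptually emits for a function with an address-taken local looks roughly like the hand-written model below (the names are illustrative, not LLVM’s actual runtime symbols); the extra TLS traffic and second stack adjustment in every such frame are the cost being discussed:

            #include <cstdio>

            // The real scheme also keeps the unsafe-stack pointer in TLS.
            thread_local char *g_unsafe_stack_ptr;

            int greet(const char *name) {
                // Prologue: carve a frame for address-taken locals off the
                // second ("unsafe") stack, a second adjustment on top of rsp.
                char *saved = g_unsafe_stack_ptr;
                g_unsafe_stack_ptr -= 64;
                char *buf = g_unsafe_stack_ptr;  // buf lives on the unsafe stack

                std::snprintf(buf, 64, "hello, %s", name);
                std::puts(buf);

                // Epilogue: restore the unsafe-stack pointer.
                g_unsafe_stack_ptr = saved;
                return 0;
            }

            int main() {
                static char unsafe_stack[4096];
                g_unsafe_stack_ptr = unsafe_stack + sizeof(unsafe_stack);  // grows down
                return greet("world");
            }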

          SPARC has a somewhat better model. The abstract model for SPARC is an infinite number of register windows. You have some registers shared with the caller and some with the callee, so you usually don’t need to explicitly push or pop (unless you have a lot of live values in a function). In the hardware, there is a finite number of windows, arranged in a ring, and you take a trap and spill them when you run out (you can also spill them all on context switch and then reload them on demand). The very last generation of SPARC processors did what the first generation should have done and spilled them asynchronously in the background. This meant that the CPU could spill whenever the store units were idle and not worry about memory ordering, because it is undefined to access that memory region for anything other than spills and reloads without some very aggressive barriers.
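
          If it helps, here is a toy model of the register-window ring and its trap-and-spill behaviour (window count and layout are made up and much smaller than real SPARC):

            #include <algorithm>
            #include <array>
            #include <cstdint>
            #include <cstdio>
            #include <vector>

            constexpr int kWindows = 4;  // real parts had more; small to force spills
            constexpr int kRegs = 16;    // in/local/out registers, simplified

            struct CPU {
                uint64_t windows[kWindows][kRegs] = {};
                int cwp = 0;   // current window pointer
                int live = 1;  // frames currently held in windows
                std::vector<std::array<uint64_t, kRegs>> spill_area;  // in memory

                void save() {  // on call: rotate to the next window
                    if (live == kWindows) {  // ring full: trap, spill the oldest
                        int victim = (cwp + 1) % kWindows;
                        std::array<uint64_t, kRegs> frame;
                        std::copy(windows[victim], windows[victim] + kRegs, frame.begin());
                        spill_area.push_back(frame);
                        live--;
                        std::puts("overflow trap: spilled oldest window");
                    }
                    cwp = (cwp + 1) % kWindows;
                    live++;
                }

                void restore() {  // on return: rotate back, refill if spilled
                    cwp = (cwp + kWindows - 1) % kWindows;
                    if (--live == 0) {  // caller's window was spilled: reload it
                        std::copy(spill_area.back().begin(), spill_area.back().end(), windows[cwp]);
                        spill_area.pop_back();
                        live++;
                        std::puts("underflow trap: reloaded window");
                    }
                }
            };

            int main() {
                CPU cpu;
                for (int i = 0; i < 6; i++) cpu.save();     // six nested calls
                for (int i = 0; i < 6; i++) cpu.restore();  // unwind them all
            }

          The asynchronous spilling in those last SPARC generations amounts to doing the spill eagerly in the background before the ring fills up, instead of waiting for the overflow trap.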

          1. 1

            You’d need an extra register and you’d need to push and pop things from two stacks in the common case

            But you get to skip frame pointer maintenance, because the call stack frames are all the same size. If you have hardware call and ret instructions (as x86 does), it actually ends up as a minor win for the split stacks.

            The reason SafeStack sucks so much is that it keeps the other stack pointer in TLS, not that it uses multiple stacks.

            In the time since I wrote that comment, I decided that probably the reason no one’s implemented the scheme I’m thinking of is that it’s extremely fiddly (when you take into account compatibility). I am going to see about implementing it.

            1. 2

              But you get to skip frame pointer maintenance, because the call stack frames are all the same size.

              Only for one of the stacks, you’d still need it for the other one.

              The reason SafeStack sucks so much is that it keeps the other stack pointer in TLS, not that it uses multiple stacks.

              At least one of the implementations used a separate register, but this was a bigger ABI break.

              1. 1

                Only for one of the stacks, you’d still need it for the other one.

                You can make the call stack frame a pair of a return address and data stack pointer. On x86, the prologue becomes: ‘push rbp; sub rbp, stack-bytes-used’, and the epilogue is simply ‘pop rbp’ (assuming rbp is the data stack pointer, and I think it’s probably the right choice).

                (The alternative is you skip frame pointers and do fancy DWARF stuff. This can be done both with split stacks and with a unified stack; split stacks are still a wash wrt instruction count given hardware call/ret, but have the important advantage that you can very easily and cheaply sample the call stack, e.g. for logging or profiling.)
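
                To spell out the shape of the scheme, here is a toy model of the two stacks (helper methods stand in for the real prologue/epilogue sequences; the layout is an assumption based on the description above, not an existing ABI):

                  #include <cstdint>
                  #include <cstdio>

                  struct CallFrame {
                      uintptr_t saved_data_sp;  // the `push rbp` slot
                      uintptr_t ret_addr;       // pushed by the hardware `call`
                  };

                  struct Machine {
                      CallFrame call_stack[64];    // fixed-size frames: no frame pointer
                      uint8_t   data_stack[4096];  // locals, arrays, spills
                      int       call_top = 0;
                      uintptr_t data_sp;           // plays the role of rbp above

                      Machine()
                          : data_sp(reinterpret_cast<uintptr_t>(data_stack) + sizeof(data_stack)) {}

                      // Prologue: `push rbp; sub rbp, frame_bytes`.
                      void *enter(uintptr_t ret_addr, uintptr_t frame_bytes) {
                          call_stack[call_top++] = {data_sp, ret_addr};
                          data_sp -= frame_bytes;
                          return reinterpret_cast<void *>(data_sp);  // this frame's locals
                      }

                      // Epilogue: `pop rbp`, then `ret` consumes ret_addr.
                      uintptr_t leave() {
                          CallFrame f = call_stack[--call_top];
                          data_sp = f.saved_data_sp;
                          return f.ret_addr;
                      }
                  };

                  int main() {
                      Machine m;
                      void *locals = m.enter(0x401000, 32);
                      std::printf("frame locals at %p\n", locals);
                      std::printf("returning to %#llx\n", (unsigned long long)m.leave());
                      // Sampling the call stack is a simple array walk, since
                      // every CallFrame has the same size.
                  }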

                At least one of the implementations used a separate register, but this was a bigger ABI break.

                Have a link? I looked into it a bit earlier today, and found this comment:

                // FIXME: use a dedicated register for it ?

                From 2016. Regardless, I aim to maintain ABI compatibility.

                1. 1

                  You can make the call stack frame a pair of a return address and data stack pointer

                  That’s just moving the frame pointer onto the call stack. Probably not a bad tradeoff, since you rarely need the frame pointer (especially on x86, where you can have large immediate offsets for stack addressing).

                  On x86, the prologue becomes: ‘push rbp; sub rbp, stack-bytes-used’, and the epilogue is simply ‘pop rbp’ (assuming rbp is the data stack pointer, and I think it’s probably the right choice).

                  But now you can’t use push and pop for spills on the stack. That’s similarly a problem on AArch64, where there are some special behaviours for sp-relative loads and stores. Intel put quite a bit of work into making LLVM use push and pop instructions, because they make the prologues and epilogues smaller, which has a quite measurable impact on overall code size. They also do some very exciting things with push and pop in the decoder to make them able to execute independently (keep a shadow rsp in the front end, dispatch push and pop as offsets to this, and only allocate a new rename register on instructions that explicitly reference rsp), so not using them for spills and restores would likely be slower as well as generating larger code.

                  Have a link? I looked into it a bit earlier today, and found this comment:

                  I thought it was in the paper, if not it was from a discussion with the paper’s authors. The version that they upstreamed to LLVM was a subset because everyone was super nervous about needing to support ABI-breaking changes, especially when both Intel and Arm were expected to introduce shadow stack hardware within a few years (this turned out to be much harder to do without completely breaking binary compatibility than either company expected).

          2. 1

            I can’t speak for x86, but on Arm there are instructions that simplify and reduce the code for dealing with the stack. If there were two stacks then there would be more instructions in the prologue and epilogue, and generally more bookkeeping needed (and probably more registers used).

            On Cortex-M the stack is handled in hardware for interrupts. Having two stacks there would be a huge penalty.

            1. 1

              on Arm there are instructions that simplify and reduce the code for dealing with the stack. If there were two stacks then there would be more instructions in the prologue and epilogue, and generally more bookkeeping needed (and probably more registers used)

              I don’t know Arm. On x86, there are also instructions for dealing with the stack, but the net effect is neutral or positive. (In particular, you no longer need a frame pointer, because the call stack frames have a fixed size and point to the data stack frames.)

              x86 also uses the stack for interrupts, and would work just fine with a scheme using two stacks. Essentially, the CPU briefly pushes some data onto the call stack.

            2. 1

              Some systems make this pretty approachable, so it’s actually been tried, see e.g. https://s3.eurecom.fr/docs/secucode09_francillon.pdf . I remember reading an even more recent paper about it later on, which actually implemented the whole scheme with LLVM, generating code for AMD64, but I can’t find it right now.

            3. 1

              This is indeed a great overview of CFI forms. There is an interesting note to make about software-based shadow stacks and why we should avoid them. Back in the day, Microsoft tried implementing a mitigation called RFG (Return Flow Guard), which is effectively a shadow stack in software. You choose a virtual address for the shadow stack (which is kept “secret”) and use it to store return addresses. You can then check the integrity of the return address on the normal stack upon return, by comparing it to the corresponding one on the software shadow stack.

              This approach has a by-design race condition: what if an attacker changes the return address at entry to the callee (before the prologue fetches the return address from *rsp), or after the epilogue compares both addresses but before it executes ret?

              An exploit of this race was demonstrated by attacking a leaf function. It spawns two threads:

              • thread A runs a sleep() -> GetLength() -> sleep() loop
              • thread B constantly writes to the VA of the “RET pointer” of the strlen function

              The observation here is that, when attacking a leaf function, 99.99% of the “writes” are harmless. Whenever the leaf function is actually entered, you have a very high probability of winning the race.
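
              In code, the time-of-check-to-time-of-use window looks roughly like this single-threaded model, with the attacker’s racing write simulated inline (names are illustrative, not Microsoft’s actual instrumentation):

                #include <cstdint>
                #include <cstdio>

                // Stand-ins for the slot at *rsp and its shadow-stack copy.
                static uintptr_t stack_ret_slot = 0x401000;
                static uintptr_t shadow_ret_slot = 0x401000;

                // Model of the RFG epilogue: compare, then separately consume.
                static uintptr_t do_return() {
                    if (stack_ret_slot != shadow_ret_slot) {  // the integrity check
                        std::puts("RFG violation: terminating");
                        return 0;
                    }
                    // ...the race window: thread B's write to *rsp lands here...
                    stack_ret_slot = 0x666000;  // simulate the attacker winning
                    return stack_ret_slot;      // `ret` re-reads the corrupted slot
                }

                int main() {
                    std::printf("ret targets %#llx despite the check passing\n",
                                (unsigned long long)do_return());
                }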

              An interesting note: this by-design race doesn’t exist on ARM the way it does on the x86 ISA, because you have the link register to fetch the return address from (you don’t need to fetch it from *rsp).

              All of that information is publicly documented in Joe Bialek’s talk from Offensivecon 2018, here: https://github.com/microsoft/MSRC-Security-Research/blob/master/presentations/2018_02_OffensiveCon/The%20Evolution%20of%20CFI%20Attacks%20and%20Defenses.pdf

              Another important issue is the “secret” part of this approach. In the unfortunate reality of an infinite supply of CPU side channels, secret-based mitigations face a serious problem. For example, in the slides you can see that the AnC attack (https://www.vusec.net/projects/anc/) could successfully leak the shadow stack’s VAs. As Joe wrote in his slides: “POC took several minutes in Edge, but attacks never get slower”.