1. 33
  1. 12

    I made graphs! https://alopex.li/data/images/sizes-32bit.png and https://alopex.li/data/images/sizes-64bit.png

    I am a little entertained that for 32-bit at least there are fairly clear splits between “Not good” (most things), “Not bad” (RV32GC, x86, m68k), and “Thumb + ARC” in a category of their own (I’d never heard of ARC before now). For 64-bit there’s really only “tried to be small” (AArch64, x86_64, RV64GC) and “didn’t try to be small” (everything else).

    Also nice to see that the RISC-V C extension really does matter quite a bit, even if it doesn’t necessarily live up to its hype: the spec claims a 25-30% code size reduction, and this shows numbers at the low end of that range. I also expected x86_64 to be far more bloated than it apparently is.
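
A quick sketch of how numbers like these can be measured: sum the sizes of the executable sections of each ELF binary. This is a minimal pure-Python reader for 64-bit little-endian ELF files (offsets per the ELF64 layout); it's illustrative, not how the article's author necessarily did it.

```python
# Sum the sizes of executable sections (e.g. .text) in an ELF64 file.
# Sketch only: little-endian ELF64, offsets from the ELF64 header layout.
import struct

SHF_EXECINSTR = 0x4  # section contains machine instructions

def code_size(path):
    """Total bytes in sections flagged executable."""
    with open(path, "rb") as f:
        data = f.read()
    assert data[:4] == b"\x7fELF", "not an ELF file"
    assert data[4] == 2, "this sketch only handles ELF64"
    shoff, = struct.unpack_from("<Q", data, 0x28)         # e_shoff
    shentsize, shnum = struct.unpack_from("<HH", data, 0x3A)
    total = 0
    for i in range(shnum):
        # sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size
        _, _, flags, _, _, size = struct.unpack_from(
            "<IIQQQQ", data, shoff + i * shentsize)
        if flags & SHF_EXECINSTR:
            total += size
    return total
```

Comparing only executable sections filters out differences in data, symbol tables, and padding between toolchains.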

    1. 8

      ARC is used in a lot of deep-embedded control stuff - the market that’s increasingly being taken over by RV. Intel’s Management Engine on non-Atom chips (Atoms used SPARC) is historically ARC, though it moved to a 486-derived x86 core starting in Skylake.

      Thanks for the graphs!

      1. 7

        The stock fun fact about ARC is that it’s short for Argonaut RISC Core, and if that name sounds familiar it’s no coincidence - the architecture’s historical roots are in the Super FX coprocessor that Argonaut used in Star Fox for the SNES.

        1. 1

          Also nice to see that the RISC-V C extension really does matter quite a bit, even if it doesn’t necessarily live up to its hype. The spec claims 25-30% code size reduction, this shows numbers on the low end of that range.

          In 64-bit, as shown by OP, RISC-V already holds the code density crown.

          The B extension, which was finalised last year but was not included in this test, provides a significant further reduction.

          There’s also an ongoing effort in riscv-code-size-reduction, which might or might not be done late this year.

          At that point, 32-bit RISC-V will be similar to, and possibly beat, Thumb-2 code density.

        2. 22

          Just imagine how nice this could look if the author had used a hypertext transfer protocol that supported tabular data.

          1. 1

            Eh, a simple ASCII table would have done the job.

          2. 6

            ARM64 code density, for a fixed-length ISA with 64b words, is actually pretty good.

            Worth pointing out that while the word length is 64 bits, the instruction length is still 32 bits, so anything not loading full-size immediates will have similar or identical code density to AArch32 code.

            1. 23

              There are a lot of differences between AArch64 and AArch32. AArch64 removed most predication and also removed store/load multiple. Store multiple had some interesting effects on code size in combination with the right ABI. Every function prolog for a non-leaf function needs at least one store instruction to spill the link register and stack pointer; with stm you could spill as many registers as you wanted with a single instruction. This meant that embedded ABIs could make almost all registers callee-save without increasing code size from extra spills. The predication meant that you could eliminate a lot of branches.

              AArch64 is an ISA that’s optimised for large cores. One of the main criticisms that I hear from folks trying to build high-end RISC-V cores is that they optimise for the wrong thing: RISC-V optimises for making instructions small, but a big core wants you to optimise for fewer instructions. Every instruction that you decode on a superscalar processor consumes space in a scheduler. Every instruction that writes to a register consumes a new rename register. Rename registers are one of the most expensive things on a large core.

              This is why complex addressing modes are such a big win: they are effectively one or more arithmetic instructions followed by a load, but the ALUs for the address calculation can sit in the load-store pipeline and the intermediate results are just wires between pipeline stages, they don’t need any control logic in register rename. If you do this as two instructions then you need to allocate a rename register and burn scheduler power to dispatch the second instruction once the first is scheduled. A rename register remains live until an instruction that writes to the same architectural register is either executed in the same basic block or retires out of speculative execution (so that the value is no longer architecturally visible on any path). If the compiler is not able to clobber the register in the second instruction (e.g. if it’s a floating-point load and so ends up in a different register bank) then the two-instruction sequence will increase rename register pressure and hurt performance (it will hurt power either way).
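
As a toy illustration of the accounting above: count how many architecturally visible register results each sequence produces, since each one occupies a rename register. The micro-op shapes and costs here are made up for illustration, not taken from any real core.

```python
# Toy rename-pressure accounting: a load with a complex addressing mode
# versus the same work split into an explicit add plus a plain load.
# Only architecturally visible results need rename registers; the fused
# form's address math is just wires between pipeline stages.
def rename_writes(uops):
    return sum(1 for u in uops if u["writes_reg"])

# ldr x0, [x1, x2, lsl #3] -- one visible result (the loaded value)
fused = [{"op": "load+agen", "writes_reg": True}]

# add x3, x1, x2, lsl #3 ; ldr x0, [x3] -- the temporary address in x3
# is architecturally visible, so it also consumes a rename register
split = [{"op": "add", "writes_reg": True},
         {"op": "load", "writes_reg": True}]

print(rename_writes(fused), rename_writes(split))  # -> 1 2
```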

              This is part of the reason why server-class x86 chips do better than you might expect on power-performance numbers.

              Note that this doesn’t mean that code size is unimportant. Itanium optimised for work-done-per-instruction(-bundle) aggressively but suffered from huge code size, which blew out i-cache size. This isn’t always a straightforward trade because it’s possible to compress instructions in an i-cache (or decompress them and store them as decoded micro-ops, depending on what you’re optimising for). This burns some power, but if you’re able to consume less power in your compression / decompression logic than you would in a larger i-cache then it’s worth it.

              This is further complicated by the fact that raw power consumption is often less important for performance than hot-spot power consumption. The thermal throttling that you need to apply is typically driven by the part of the core that’s hottest. Register rename is one of the places that is very easy to become a hot spot because it is always powered.

              1. 2

                Note that this doesn’t mean that code size is unimportant. Itanium optimised for work-done-per-instruction(-bundle) aggressively but suffered from huge code size, which blew out i-cache size.

                And, worse, all three IPF microarchitectures had an undersized 16K L1I (though starting with Montecito, they enjoyed large and fast L2I to make up for it.)

                1. 1

                  This post is really insightful, thanks! I’m not that familiar with the tradeoffs in ISA design so this is a lot of interesting new information.

                  Interesting that stm got dropped - most of my (limited) ARM exposure is hobbyist RE of GBA games, where stm is liberally used. I guess spilling is naturally 2x stack use per ref for AArch64, and I assume there are some implications for performance there?

                  One of the main criticisms that I hear from folks trying to build high-end RISC-V cores is that they optimise for the wrong thing: RISC-V optimises for making instructions small, but a big core wants you to optimise for fewer instructions.

                  I guess that’s the issue with trying to make a one-size-fits-all ISA, though it sounds like they valued making RV64 orthogonal with RV32 over making it more suitable for high perf core designs? I know almost nothing about RISC-V, sadly.

                  1. 9

                    Interesting that stm got dropped - most of my (limited) ARM exposure is hobbyist RE of GBA games, where stm is liberally used

                    stm is one of those instructions that software people love and hardware people hate. It has a bunch of properties that make it painful for hardware. It can’t be implemented as a single pipelined operation; it has to be a state machine with a shift register driving it that issues a micro-op down a pipeline for every register that it’s operating on. This has some very painful interactions with exceptions. The target address of an stm may be near a page boundary, so you can take a page fault in the middle of an stm, and that must be predictable, so you can’t reorder the stores (either the first or second half of the instruction can trap and that’s observable).

                    AArch64’s stp has very strong alignment requirements, which mean that both words must end up in the same page and so either both trap or neither trap. This is much nicer for the microarchitecture because it just needs to push two registers into the store queue (big AArch64 chips have at least 16-byte-wide store queues, so this takes only a single store queue entry, assembled by reading two rename registers).
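
The same-page guarantee is just alignment arithmetic: when the page size is a multiple of the access size, an aligned access can never straddle a boundary. A quick exhaustive check (4 KiB pages assumed purely for illustration):

```python
# Check that an aligned 16-byte access (the two 8-byte halves of an
# stp) always lands in a single page. 4 KiB pages assumed.
PAGE = 4096

def crosses_page(addr, size):
    """True if the access [addr, addr + size) spans a page boundary."""
    return addr // PAGE != (addr + size - 1) // PAGE

# A 16-byte store at a 16-byte-aligned address never crosses a page...
assert not any(crosses_page(a, 16) for a in range(0, 4 * PAGE, 16))
# ...but an unaligned one can, e.g. starting 8 bytes before a boundary.
assert crosses_page(PAGE - 8, 16)
```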

                    AArch32 is a really nice architecture to implement in simple pipelines but has a lot of things that get painful in large superscalar chips.

                    I guess that’s the issue with trying to make a one-size-fits-all ISA, though it sounds like they valued making RV64 orthogonal with RV32 over making it more suitable for high perf core designs? I know almost nothing about RISC-V, sadly.

                    I think it’s largely a problem with premature standardisation. By the time the RV64 core ISA and the C extension were baked, the only implementations were fairly simple research prototypes. I think BOOM was the most complex and it is very simple in comparison to a modern server core (or even a high-end laptop or tablet core). Most of the problems are apparent only when you start building very large cores and several of the design decisions are very nice for tiny cores.

                    I think, if I were designing an ISA from scratch again, I’d seriously think about how to decouple the abstract ISA from the encoding. This is quite tricky (things like sizes of immediates leak up into assembly code) but if you can do it then you have a building block for making completely source-compatible implementations of an ISA with very different tradeoffs in how to encode them. You might need to reserve a register for smaller cores so that larger instructions can be cracked at assembly time into sequences of smaller ones but, if you did that, then you’d have a generic mechanism for assembling large macro-ops for systems where they make sense.

                    1. 2

                      AArch32 is a really nice architecture to implement in simple pipelines but has a lot of things that get painful in large superscalar chips.

                      That’s actually very interesting, I’d always wondered why AArch64 was so different compared to ARM32.

                      Out of curiosity, do you have any opinions on the RISC-V push/pop instruction proposal?

                      1. 4

                        Out of curiosity, do you have any opinions on the RISC-V push/pop instruction proposal?

                        I’d not seen it before, but it looks very like stm / ldm, limited to the stack pointer. This kind of thing is one of the reasons that it’s good to make the stack pointer architectural, which RISC-V didn’t do (the operations that you want to do on the stack pointer overlap only slightly with the operations that you want to do on other registers). It is likely to have most of the same implementation headaches on big pipelines, but be great for microcontrollers.

                        Skimming it:

                        Correct execution requires that sp refers to idempotent memory (also see Non-idempotent memory handling), because the core must be able to handle faults detected during the sequence. The entire PUSH/POP sequence is re-executed after returning from the fault handler, and multiple faults are possible during the sequence.

                        This has no forward-progress guarantees. It’s possible if cm.push spans a page boundary that you will trap on the first half, then retry, trap on the second, retry, trap on the first, and so on, with each fault handler evicting one page and pulling in the other. Avoiding this kind of corner case was the motivation for stp’s strong alignment.
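
The ping-pong can be sketched as a toy model: a restartable push touches two pages, but (pathologically) the OS keeps only one page resident and each fault evicts the other. Page numbers and the retry limit are made up for illustration.

```python
# Toy model of the livelock the quoted text allows: re-execute the
# whole PUSH sequence after each fault, as the spec text requires,
# while each fault handler evicts the other page.
def run_push(pages_touched, resident, max_retries=10):
    """Returns the attempt number on success, or None on livelock."""
    for attempt in range(1, max_retries + 1):
        for page in pages_touched:
            if page not in resident:
                # fault: evict everything, pull in the faulting page,
                # then restart the entire sequence from the beginning
                resident.clear()
                resident.add(page)
                break
        else:
            return attempt  # all stores landed
    return None  # no forward progress

# push spans pages 7 and 8; only one page can be resident at a time
print(run_push([7, 8], resident={7}))  # -> None: classic ping-pong
```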

                        Aside from that, it looks interesting for embedded systems. Historically, RISC-V has avoided encoding ABI details in the ISA (which I consider a huge mistake - co-designing an ISA and ABI can give some massive wins).

                        If embedded systems use RV32E then it would be nice if this could spill more registers so that some can be moved from the temporary space to the save space in an improved RV32E ABI.

                    2. 2

                      I guess that’s the issue with trying to make a one-size-fits-all ISA, though it sounds like they valued making RV64 orthogonal with RV32 over making it more suitable for high perf core designs?

                      Per the rationale in the spec, RV64 is deliberately not necessarily a superset of RV32, i.e. there’s no assumption that you can take RV32 code and run it on an RV64 processor. So I think it’s more a matter of “if it ain’t broke don’t fix it”. Whether or not it actually broke and needs fixing (or at least, could use improving), well, that I don’t know.

                  2. 1

                    Yeah, I wasn’t trying to imply that ARM64 uses 64b instruction words - I only know of one ISA that uses a fixed length 64b encoding; it’s not common. I was just noting that other 64b ISAs in the table (x86_64, SPARC64) tend to skew a little larger for final binary size than their 32b counterparts, I assume due to things outside of the .text section.

                    1. 2

                      No worries - it seemed ambiguous what was meant, so I felt it was worth some clarity.

                  3. 5

                    It would also be interesting to compare code density changes for the same target using -Ofast vs -Os. I know that, at least on x86_64, dense code is occasionally at odds with fast code (though not as much as it once was), and I wonder how much that applies across other instruction sets.
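
That comparison is easy to script: compile the same file at each optimisation level and compare object code size. The compiler name (`cc`), the `size` utility, and its output format are assumptions about a typical Unix toolchain, not something from the thread.

```python
# Compile one source file with -Ofast and -Os and report .text size.
import os, subprocess, tempfile

SRC = "int square(int x) { return x * x; }\n"

def cc_cmd(opt, src, obj):
    return ["cc", opt, "-c", src, "-o", obj]

def text_bytes(obj):
    # `size` prints a header row ("text data bss ..."), then one data row
    out = subprocess.check_output(["size", obj], text=True)
    return int(out.splitlines()[1].split()[0])

def measure(opt):
    with tempfile.TemporaryDirectory() as d:
        src, obj = os.path.join(d, "t.c"), os.path.join(d, "t.o")
        with open(src, "w") as f:
            f.write(SRC)
        subprocess.check_call(cc_cmd(opt, src, obj))
        return text_bytes(obj)

if __name__ == "__main__":
    for opt in ("-Ofast", "-Os"):
        print(opt, measure(opt), "bytes of .text")
```

Running this over a real codebase (rather than a toy function) would show how much the gap varies per target.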

                    1. 1

                      Although I’d hope that if you care about code density, you’d be on -Os….

                      1. 1

                        I’m actually more curious about how much density affects performance.

                    2. 3

                      Were these all ELF binaries? Compiler differences?

                      1. 4

                        All ELF, all buildroot’s GCC 10.

                      2. 2

                        This is really interesting.

                        I don’t find the code size of the traditional/old RISC architectures all that surprising - it tracks my experience.

                        I would be super interested in knowing the size of the binaries if they are compressed, because in principle they’re all doing more or less the same thing so maybe they compress to similar sizes? :D
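
That hunch is quick to test: compress each binary and compare. A minimal sketch; the paths are placeholders to be pointed at the buildroot outputs from the article.

```python
# Compare zlib-compressed sizes of binaries as a rough proxy for
# information content. Paths below are placeholders, not real files.
import zlib

def compressed_size(path):
    with open(path, "rb") as f:
        return len(zlib.compress(f.read(), level=9))

binaries = {
    # "rv64gc": "/path/to/busybox-rv64gc",
    # "x86_64": "/path/to/busybox-x86_64",
}

for arch, path in sorted(binaries.items()):
    print(f"{arch:>8}: {compressed_size(path):,} bytes compressed")
```

gzip or xz over the whole file would give similar rankings; the interesting part is whether the spread between ISAs shrinks after compression.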

                        1. 2

                          RV64GC - 741,856 bytes - ilp64d ABI

                          Why use the ABI where sizeof(int) == 8? Wouldn’t lp64d, with sizeof(int) == 4 && sizeof(long) == sizeof(void *) == 8, be more standard and more of an apples-to-apples comparison with RV64G (without compressed instructions)?

                          1. 1

                            Any comparison of stack frame sizes…? SPARC V8 has notoriously large ones.