1. 36
  1.  

    1. 29

      This was a fascinating time. Apparently most Arm partners were completely uninterested in 64-bit cores. The feedback that Arm got from most partners was roughly, ‘64-bit isn’t something any of our customers are asking for’. It probably wouldn’t have happened for quite a few more years without Apple pushing for it (except possibly in @lproven’s alternate timeline).

      As the thread says, Google didn’t think it was important. ART, at the time, was full of pointer-to-uint32_t casts (we tried porting it to MIPS64 for a student project and gave up). They had to rewrite all of these to go through a helper that did pointer compression (ART pointers became 32-bit offsets from the Java heap base address). Apparently it was a deeply unpleasant time to be in the Android team because customers were walking into phone shops and saying ‘I want a 64-bit phone, this one is only 32 bits and so is only half as good’ (with no understanding of what it meant, but triggering a lot of lost sales).
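
      For anyone who hasn’t seen the trick, the idea looks roughly like this in C (the names are made up for illustration, not ART’s actual helpers): references are stored as 32-bit offsets from a single heap base rather than as raw pointers.

          #include <stdint.h>

          /* Hypothetical sketch of the pointer-compression idea: store 32-bit
             offsets from the heap base instead of raw 64-bit pointers. */
          static uint8_t *heap_base;        /* set once when the managed heap is mapped */

          typedef uint32_t compressed_ref;  /* what actually lives in heap objects */

          static inline compressed_ref compress_ref(void *p) {
              return (compressed_ref)((uint8_t *)p - heap_base);
          }

          static inline void *decompress_ref(compressed_ref r) {
              return heap_base + r;
          }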

      The 64-bit design was incredibly data driven and is carefully tuned for superscalar out-of-order architectures. The original Arm architecture was somewhat overfitted to the original microarchitecture. There were some obvious examples of this (like the fact that reading pc gave you the address of the current instruction plus 8, because that’s where the ARM1’s pipeline had advanced the PC to by the time registers were read), but there were a lot of other corner cases too.

      Predicated execution is great for in-order cores because it avoids stalling for short branches and avoids the need for big branch predictors. If you can if-convert a small conditional to two predicated instructions instead of a branch then you can do both paths in two cycles, with no additional hardware complexity. If you do it as a conditional branch, you need to have a branch predictor, state for the branch predictor, and even the best case will be two cycles (branch-not-taken + the one instruction). In contrast, if you do the same on a big superscalar architecture then you have a lot of complexity that duplicates the same sort of thing that you need to do for speculative execution.

      Some folks wanted to remove predication entirely but that has big downsides. If you want the same set of conditions that AArch32 (or x86) supports then you need a load of different compare-and-branch instructions and that burns encoding space (RISC-V does this). Conditional move was fairly controversial, but experimentally you need 25-50% more branch-predictor state to achieve the same performance without it, so it’s a very clear win. This meant that they also needed condition code registers. Microarchitects hate condition codes because they’re an extra rename register that needs handling. The compromise (after a lot of experimentation) was to have a small subset of condition-code-setting instructions. A few of the conditional instructions in AArch64 significantly reduce the complexity of some very hot code paths in some widespread algorithms.
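
      A concrete example of the if-conversion I mean (hedged: this is typical compiler output, exact instructions vary): the ternary below can compile on AArch64 to a compare plus a conditional select with no branch at all, and on AArch32 to a compare plus a predicated move, whereas without conditional instructions it needs a compare-and-branch and therefore predictor state.

          /* Branchless on AArch64: roughly 'cmp w0, w1; csel w0, w0, w1, gt; ret'.
             On AArch32 it can be 'cmp r0, r1; movle r0, r1; bx lr' (predicated move). */
          int max_int(int a, int b) {
              return a > b ? a : b;
          }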

      AArch32 had the PC as a general-purpose register. As an assembly-language programmer, this is fantastic. You can turn any instruction into a branch. If you want to tail call a function pointer, you can just load it directly into the PC. This is fine on simple in-order pipelines: if the destination is pc, stall the pipeline after decode. It’s incredibly painful for big cores because every instruction might be a branch, and branches can come from any pipeline, so you need special handling at the end of every one of them.
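
      In C terms, the kind of code in question is an indirect tail call like the sketch below (the table is hypothetical, purely for illustration): on AArch32 a compiler can lower the whole thing to a single ‘ldr pc, […]’, while on AArch64 it has to load into an ordinary register and then ‘br’.

          typedef void (*handler_fn)(void);
          extern handler_fn handlers[];      /* hypothetical dispatch table */

          void dispatch(unsigned op) {
              /* Tail call through a function pointer: on AArch32 this can become a
                 single 'ldr pc, [rX, rY, lsl #2]', because pc is just r15. */
              handlers[op]();
          }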

      AArch32 had load and store multiple instructions. These were big. They took a bitmap of the registers to save and load, a base register, and had {pre,post} {in,de}crement modes. This let you fold an entire function prologue and epilogue into a single instruction each. For extra fun, because the link register was next to the program counter in the bitmap, you could spill the link register and reload it into the PC for return. Microarchitects really hated this instruction. It could span a page boundary (not a problem on ARM1: no MMU) and so could fault in the middle, which caused painful restart logic. Spilling and reloading multiple values is a win though, so AArch64 compromised with load and store pair. You need atomic versions of these for RCU and similar things, so microarchitecturally they’re basically free.
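
      For a feel of the difference, here is what the prologue and epilogue typically look like for a trivial non-leaf function (hedged: exact register lists depend on the compiler and options). On AArch32 the whole spill/reload is one load/store-multiple each way, with the return folded into the reload of pc; on AArch64 it becomes load/store pair.

          extern long helper(long);

          /* Typical AArch32 output:  push {r11, lr}  ...  pop {r11, pc}
             Typical AArch64 output:  stp x29, x30, [sp, #-16]!  ...
                                      ldp x29, x30, [sp], #16 ; ret          */
          long wrapper(long x) {
              return helper(x) + 1;
          }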

      On big cores, rich addressing modes are really important because your load/store pipelines are deep, but rename registers are expensive. If you need to do an arithmetic operation to generate the address and then another to do the load/store then you allocate an extra rename register. Even with a load and store to the same address, it consumes less power to do the arithmetic twice in the load/store pipelines than to do it once, store it in a rename register, and then forward that register into the load/store pipeline twice. This is particularly noticeable for performance in loops if the compiler doesn’t manage to clobber the temporary register: the rename register backing it must remain live until the register is next overwritten, so it ties up rename resources across the whole loop.
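
      The loop below is the sort of thing I mean: with a register-offset addressing mode the address of a[i] can be formed inside the load itself (on AArch64 something like ‘ldr w3, [x0, x2, lsl #2]’), rather than by a separate add whose result then has to sit in a rename register. This is only a sketch of typical codegen, not a claim about any particular compiler.

          #include <stddef.h>

          long sum(const int *a, size_t n) {
              long total = 0;
              for (size_t i = 0; i < n; i++)
                  total += a[i];   /* address formed in the load's addressing mode */
              return total;
          }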

      The most surprising choice to me was that they went with 32 registers. I think this was largely driven by the desire for x86-64 emulation. You can store the x86-64 register file in registers and the emulator state in the rest. Aside from emulation, the only code that showed any perf difference was very mathematically intensive code, primarily crypto algorithms (which are mostly offloaded to dedicated hardware). Without x86-64 emulation as a requirement, 16 registers would probably have been a better choice: most of the code that’s faster with 32 registers is orders of magnitude faster on a GPU or DSP core. The performance of Rosetta 2 suggests that this was the right choice.

      The biggest thing that you notice with AArch64 is that it favoured data-driven pragmatism over any notion of a clean and orthogonal architecture (in stark contrast to RISC-V). The link register is architectural, which dramatically reduces the encoding space needed for branch-and-link instructions (they need a 1-bit designator for whether to link, not a 5-bit register target). The stack pointer is also architectural, which means that you can do lazy update tricks on it with load/store pair. Condition codes and a small set of conditional instructions are still there because they’re too useful to leave out, not because they’re clean or elegant.

      It’s not the best architecture possible (I’m biased, but I think we designed a better one on a cancelled project, though only because we could learn from AArch64 and others) but it was pretty close to the best possible with the knowledge available at the time.

      1. 5

        That’s really cool. One interesting reply in that thread: “The ISA was designed by Arm. It was commissioned by Apple to align with its microarchitecture goals. Other customers were involved in the design process. I have little insight into their contribution, but Apple was clearly doing the driving in the early years.”

        1. 17

          Yup. That’s how Arm normally works. A partner requests some feature, Arm gathers requirements from other partners and proposes something. Then they iterate with partners until there’s consensus. It’s misleading to think of Arm as a CPU or IP vendor. The value that they provide to the ecosystem is that they are a really good standards body. They drive consensus on architectural extensions and ensure that they don’t put things into the architecture that people aren’t willing to build or that are a really bad idea (unless a really big partner asks for them, then they put them in as optional, so that Qualcomm can build slow Windows-only Arm cores with stupid features that no other OS needs).

          I’ve worked with them (both as an academic and as a representative of an Arm partner) on MTE, CHERI, and confidential computing features and it’s mostly been a great experience (MTE less so because certain partners are asking for things that destroy the long-term value of the feature and not listening when Arm says ‘no, that’s a stupid idea’, but there’s nothing Arm can do about that). It’s been far more enjoyable than RISC-V standardisation efforts because, until you get close to a standard, Arm acts as an intermediary and so they’ll talk to everyone and build some rough consensus without large numbers of meetings (well, loads for them, not loads for everyone else) and then iterate over contentious design decisions. And the people that you talk to at Arm are working with other engineers to build and evaluate prototypes, so if you give them a stupid idea they can (usually) come back with data that shows that it’s stupid.

          I don’t know how you’d replicate that. A lot of it is down to Richard Grisenthwaite, who is one of the few people who is able to say polite and constructive things when he means ‘no, that’s a stupid idea, we will never do that, and you are a drooling moron for even suggesting it’, even after the tenth time you suggest it.

          1. 3

            What’s the Windows-only feature you alluded to? DCAS?

            1. 1

              DCAS is part of LSE, which is mandatory on Armv8.1 and later. I don’t know what the Windows-specific feature is. (I certainly don’t think DCAS is a dumb feature, though I do happen to recall hearing that the Windows kernel needs it.)

          2. 2

            who is one of the few people who is able to say polite and constructive things when he means ‘no, that’s a stupid idea, we will never do that, and you are a drooling moron for even suggesting it’, even after the tenth time you suggest it.

            Isn’t that a stereotypical British trait?

            “Up to a point, Lord Copper.”

        2. 4

          [ Sorry for the double reply ]

          I think that quote is both true and also slightly misleading. Arm doesn’t add things to the architecture unless a partner is willing to build it but I think a couple of senior folks had been looking for the partner that would say yes to a 64-bit extension for a couple of years before Apple did. CHERI is currently in a similar situation: they need one partner to commit to building it before it can go in. As with 64-bit, everyone wants someone else to go first and pay the porting costs and, I suspect, it will be just as deeply uncomfortable for the people who find that they’re 1-2 years behind the competition.

          1. 1

            …looking for the partner that would say yes to a 64-bit extension for a couple of years before Apple did.

            Yeah I am familiar with that approach; that makes a lot of sense. My current employer has ideas to build all sorts of neat stuff, if only we could be sure someone would actually want to buy it when we did…

            1. 5

              It’s particularly difficult for Arm because it typically takes 2-3 years to go from ‘this is a nice idea’ to final architecture. They normally don’t start designing first-party cores with a new architecture until it’s final, so then it’s another year to get a licensable core. Then another year for partners to tape something out and another year for it to be integrated into consumer devices. This means that it’s 4-5 years minimum between starting work on something and consumers having access to it. Most of the device vendors are driven by software teams that don’t think more than 6 months ahead, so Arm needs to anticipate what software people will ask for in three or four years’ time.

              Apple made this shorter for ARMv8 by starting work against the alpha and beta specs, so they were very close to tape-out by the time the architecture was final, but Apple is vertically integrated in a way that forces them to think about future roadmap things with a sensible amount of lead time.

      2. 1

        Fantastic comment! Thank you very much for it.

        Could I ask: what do you mean by saying certain features are “architectural”? I can’t parse that.

        1. 1

          The link register and stack pointer are architectural because they are fixed by the instruction set: the programmer does not get to choose which register to use. So branch-with-link is just branch with a link bit set, the link register is implicit.
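
          To make that concrete (a sketch of typical codegen, not tied to any particular compiler): any ordinary call in C becomes a ‘bl’ whose only choice is whether to link, because where the link goes is always x30.

              extern int callee(int);

              int caller(int x) {
                  /* The call is a single 'bl callee': the return address is written to
                     the fixed link register x30 implicitly, and callee's 'ret' branches
                     back through x30. There is no field in the instruction to choose a
                     different register. */
                  return callee(x) + 1;
              }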

          1. 1

            I am not an assembly programmer. I only barely follow that; sorry, my bad, but we all have our limitations.

            Is this different from Arm32? (Or even OG Arm24?)

            1. 2

              This is one of the things that annoys me the most about RISC-V. The JAL instruction takes a full major opcode (there are 128 of these), the JALR takes a minor opcode (there are 8 of these per major opcode), so the two of them are using around 1% of the total available encoding space for 32-bit instructions. In both, there is a full 5-bit operand to specify the link register. In almost all code, this is either the zero register (discard the link value, just do a jump) or the ABI-specified link register. You could encode that choice in one bit, reducing the encoding space that these instructions consume by a factor of 16 and freeing up a significant amount for other instructions.

              It would have made a lot more sense to encode the any-register-as-link-register version as a 48-bit instruction, or provide a 16-bit instruction that lets you express ‘add a small displacement to PC’ so that you could store PC+6 in an arbitrary register and then do a branch (no link), which would give the same effect.

              RISC-V is also weirdly inconsistent. The compressed extension version of JAL doesn’t allow you to specify a link register but the compressed version of JALR does. The compressed JALR, as a result, requires 10 bits of operand encoding space (in a 16-bit instruction), and could easily be encoded with 6 bits of operand space. The combination of this and an instruction that added a left-shifted-by-one 4-bit immediate to PC would fit in the same encoding space, freeing up a huge amount of the 32-bit encoding space.
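
              If it helps to see where those 5 bits go, here is a small C sketch that assembles a JAL from its fields (field layout as in the RISC-V spec): rd gets a full 5-bit slot in every JAL, even though real code almost always passes x0 or x1.

                  #include <stdint.h>

                  /* Assemble a RISC-V JAL. 'rd' occupies bits 11:7 of every JAL even
                     though it is nearly always x0 (plain jump) or x1 (ABI link register). */
                  static uint32_t encode_jal(uint32_t rd, int32_t offset) {
                      uint32_t imm = (uint32_t)offset;          /* signed, even, +/-1 MiB */
                      return (((imm >> 20) & 0x1)   << 31)      /* imm[20]    */
                           | (((imm >> 1)  & 0x3ff) << 21)      /* imm[10:1]  */
                           | (((imm >> 11) & 0x1)   << 20)      /* imm[11]    */
                           | (((imm >> 12) & 0xff)  << 12)      /* imm[19:12] */
                           | ((rd & 0x1f)           << 7)       /* rd: the 5-bit link field */
                           | 0x6f;                              /* JAL major opcode */
                  }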

            2. 1

              The link register works in a similar way in arm32 and arm64: the branch-with-link instruction stores the return address in a fixed register. RISC-V is different: you can choose which register gets the return address, or rather the ABI chooses it for you.

              The stack pointer is a bit different. On arm32 there aren’t any special instructions for the stack, but when the processor handles an interrupt or system call, a few registers switch from their user-mode versions to their privileged-mode versions. These banked registers are the stack pointer and the link register. So for most purposes the stack pointer isn’t architectural, except that this banking determines which register is conventionally used for the stack pointer in the ABI. On arm64 the stack pointer is more special than on arm32, because on arm64 it can only be used with a restricted set of instructions. As @david_chisnall said, this restriction allows it to be implemented more efficiently.

              The program counter is even more different between arm32 and arm64, because on arm32 it can for most purposes be used like a general-purpose register (causing headaches for the processor designers) but on arm64 it isn’t visible at all.

        2. 1

          @fanf’s answer is spot on. Features that affect programmers are fixed at one of the following layers:

          • The ABI. This defines a set of conventions that the compiler and linker use. They may be private to that toolchain or shared across more than one for interoperability. For example, MIPS did position-independent code by enforcing the rule that all function calls used jump-and-link-register with the target address in register 25. This let the callee find its own PC on entry by looking at that register (see the sketch after this list). On MIPS and RISC-V this is purely a software convention: the hardware lets you put the call target in any register.
          • The architecture. This defines everything that the hardware provides that is not an implementation detail. Some instructions operate only on specific registers. On x86, that’s most of them. AArch64 and x86 have special instructions for manipulating the stack and special instructions for call and return (on x86, pushing the return address onto the stack, on AArch64 putting it in the link register). This constrains the ABI somewhat.
          • The microarchitecture. This defines everything that is specific to an implementation. Ideally, software shouldn’t need to care about anything at this level but often does for performance. Occasionally things at this level leak, for example by exposing side channels.
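
          Here is the MIPS convention from the first bullet as a sketch. The C function is ordinary; the interesting part is the conventional o32 PIC prologue the compiler emits, which only works because the ABI promises that callers put the callee’s address in $t9 (register 25).

              /* Nothing special in the C; under the o32 PIC ABI the compiler emits a
                 prologue roughly like:
                     lui   $gp, %hi(_gp_disp)
                     addiu $gp, $gp, %lo(_gp_disp)
                     addu  $gp, $gp, $t9     # works only because $t9 holds our own address
                 i.e. the callee reconstructs its GOT pointer from its own entry address. */
              static int counter;

              int bump(void) {
                  return ++counter;
              }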

          Some ISAs persist in the myth that they are general purpose. This is not possible: any ISA constrains the abstract machines that can be efficiently implemented on it. Once you realise that, you give up and optimise the ISA for the kinds of things that compilers want.

    2. 6

      I wrote the arm64 Go compiler before there was any silicon. I started with QEMU, but quickly realized that QEMU emulates code that is correct according to the ISA, but it doesn’t tell you what happens with code that is incorrect according to the ISA. That is a problem when you are writing a compiler and a language runtime which are full of bugs.

      So I wrote my own emulator, on Plan 9. But I was very annoyed I had to spend effort (and introduce bugs) on instruction decoding, which is a trivial thing, instead of being able to generate it from a machine-readable version of the spec. ARM was helpful in trying to get me this form of the spec, but it took a couple of years, so I had to rely on scraping the PDF version.
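
      The decoder itself is the boring, mechanical part: it is essentially a table of (mask, value) pairs that you would much rather generate from a machine-readable spec than type in by hand. Something like the C sketch below (the entries are illustrative, not a faithful slice of the A64 encoding tables).

          #include <stddef.h>
          #include <stdint.h>

          struct decode_entry {
              uint32_t mask;     /* bits that identify the encoding   */
              uint32_t value;    /* required value of those bits      */
              const char *name;  /* mnemonic / handler to dispatch to */
          };

          /* Illustrative entries only. */
          static const struct decode_entry table[] = {
              { 0x7c000000u, 0x14000000u, "B/BL imm26" },
              { 0xff200000u, 0x8b000000u, "ADD (shifted register, 64-bit)" },
          };

          static const char *decode(uint32_t insn) {
              for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
                  if ((insn & table[i].mask) == table[i].value)
                      return table[i].name;
              return "unallocated";
          }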

      It was very successful, my emulator worked great, and I managed to do the first half of the arm64 port on my emulator (which was also a debugger). Eventually ARM provided test silicon and my port worked just fine on actual hardware.

      Except that we encountered some hard to induce memory corruption issues that were related to atomic instructions. I was so confident in my emulator, and in my implementation, that I really pushed ARM to try to get to the bottom of it. It turned out it was a CPU bug.

      One thing that is baffling to me about the arm64 spec is that the semantics of operations is defined operationally in terms of some sort of pseudocode… but there is no formal semantics of this language. And there is no public implementation of this language. I had to waste a lot of effort just trying to guess what this language is supposed to mean.

      Also, I have spent way more effort on organizational and political matters while dealing with ARM and Canonical in pursuit of the Go port than I spent on actual technical effort… For a long time ARM didn’t want this done at all… and Canonical wanted it done for peanuts.

      1. 4

        One thing that is baffling to me about the arm64 spec is that the semantics of operations is defined operationally in terms of some sort of pseudocode… but there is no formal semantics of this language. And there is no public implementation of this language. I had to waste a lot of effort just trying to guess what this language is supposed to mean.

        The spec is formally defined in a language called ASL. The pseudocode in the instructions is generated from the ASL. If you are an Arm partner, they give you access to the ASL. This, along with the conformance test suite, is among the most valuable things that you get as a partner: together they are sufficient to build an implementation of the architecture.

        At the time ARMv8 shipped, the conversion to ASL was not complete and the canonical spec was the prose around the pseudocode, with the pseudocode being derived from that.

        Somewhat surprisingly, they gave Peter Sewell’s group a license that allowed them to write an ASL to Sail translator and to publish the output of running it on the ARMv8 spec. If you want a formal model of AArch64, this is the right place to look.