1. 19

I’m no expert in hardware, but I would love to know the differences in terms of performance, as we saw Apple ditching Intel x86 in favor of ARM and creating their own chip.

  1.  

  2. 14

    the differences in terms of performance

    It’s extremely complicated in practice, and unfortunately I think a question like this often gets hot takes based on which ISA is simpler on the surface and people’s basic understandings of traditional RISC/CISC design tradeoffs.

    There are ways in which the ARM ISA is simpler due to it being newer and some choices made in its initial design that were closer to RISC. But it has also gone the path of anything else over time and accumulated its own baggage. The x86 has a legendary amount of baggage in its instruction set.

    The reason there isn’t a straightforward general answer is because the realization of those ISAs in hardware matters. The effect of the x86 instruction set can be mitigated to varying degrees in microcode, with complex OOO designs, etc. It’s also made more complex by the fact that up until recently many ARM chips were targeting very different spaces than x86 so there wasn’t a good comparison to be made.

    If you want to answer this question, I think it makes more sense to focus on two specific cores and try to understand the differences between them, including some of the hardware. A good starting point would be AnandTech’s recent article on the A14/M1 that highlights some of the differences in decode width between it and x86 cores.

    1. 6

      While I agree that different microarchitectures have a huge difference, there are some microarchitectural decisions that the ISA forces. The big one for x86 is that you absolutely need to have an efficient microcode engine: it is completely impractical to implement the entire ISA in fixed-function silicon. This has a huge impact on the rest of the microarchitecture.

      There are a bunch of different ways of implementing microcode. The simplest is to just crack it into a bunch of normal instructions using a little state machine that pumps out decoded instructions. You typically don’t want to use architectural registers for these, so you add a few names that the rename engine can use that aren’t exposed to non-microcoded instructions. This is very easy to do but has a couple of downsides. The first is that those registers, because they are not part of architectural state, cannot be saved on context switch. This means that you need to either ensure that the microcoded instruction sequences are idempotent, or you need to disable interrupts across the instructions. When you go down this path, you often have to pause the decoder and issue a single microcoded instruction at a time.
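
      As a toy illustration of that cracking approach, here’s a minimal C sketch (all of the encodings, opcode names, and register numbers are made up) of a sequencer expanding one complex “push”-style operation into simple micro-ops, using a temporary register name that sits outside the architectural register file and ordering the micro-ops so that re-running the whole sequence after an interrupt is harmless:

      ```c
      #include <stdio.h>

      /* Hypothetical micro-op: an opcode name plus register operands.
       * Registers 0-15 are architectural; register 16 (T0) is a
       * microcode-only temporary that the rename engine knows about but
       * software can never name, so it is never part of saved context. */
      typedef struct { const char *op; int dst, src1, imm; } uop;

      enum { SP = 13, T0 = 16 };

      /* Crack a hypothetical "push reg" operation into micro-ops that a
       * simple load/store pipeline already understands. The architectural
       * SP is only written by the final micro-op, so re-running the whole
       * sequence after an interrupt is harmless (idempotent): the store
       * may happen twice, but to the same address. */
      static int crack_push(int reg, uop *out) {
          int n = 0;
          out[n++] = (uop){ "subi",  T0, SP, 8 };   /* T0 = SP - 8       */
          out[n++] = (uop){ "store", T0, reg, 0 };  /* mem[T0 + 0] = reg */
          out[n++] = (uop){ "mov",   SP, T0, 0 };   /* SP = T0 (commit)  */
          return n;
      }

      int main(void) {
          uop seq[8];
          int n = crack_push(5 /* r5 */, seq);
          for (int i = 0; i < n; i++)
              printf("%-5s r%d, r%d, %d\n", seq[i].op, seq[i].dst,
                     seq[i].src1, seq[i].imm);
          return 0;
      }
      ```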

      This approach has a very low impact on the silicon, but it only works well if your common workloads contain very few microcoded instructions. If you want to be able to execute multiple microcoded instructions in parallel then you need a lot more extra logic (for example, keeping enough state that you can completely roll back all side effects of an interrupted multi-cycle instruction).

      In AArch32, about the only instructions that were typically microcoded were the load and store multiple. These were a big win in code size, because they were often a single 32-bit instruction for frame setup and teardown, but they were microarchitecturally painful. They could span pages and so might fault in the middle, which led to some horrible complexity in implementations. AArch64 replaces these with load/store pair instructions that are valid only within a stack frame. These don’t give quite as dense code but are vastly simpler to implement. x86, on the other hand, has a lot of common instructions that need to be implemented in microcode, and so you need an efficient microcode engine.
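
      To make the code-size point concrete, here’s a trivial C function together with the kind of prologue/epilogue sequences compilers commonly emit for it; the exact registers and offsets vary by compiler and options, so treat them as illustrative:

      ```c
      /* A small function that needs a few callee-saved registers plus the
       * return address, i.e. a typical little stack frame. */
      int sum3(int (*f)(int), int a, int b, int c) {
          return f(a) + f(b) + f(c);
      }

      /* AArch32 (ARM/Thumb-2): frame setup and teardown are one
       * instruction each, covering an arbitrary subset of registers:
       *
       *     push {r4-r6, lr}      ...body...      pop {r4-r6, pc}
       *
       * AArch64: pairs only, so a few more instructions, but each one is
       * a simple, bounded, aligned access that either completes or faults
       * as a single unit:
       *
       *     stp x29, x30, [sp, #-48]!
       *     stp x19, x20, [sp, #16]
       *     str x21, [sp, #32]
       *     ...body...
       *     ldp x19, x20, [sp, #16]
       *     ldr x21, [sp, #32]
       *     ldp x29, x30, [sp], #48
       *     ret
       */
      ```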

      There’s also complexity in terms of the decoder. This can be quite painful from a power perspective because the decoder has to be powered almost all of the time. The only time that you don’t need it on x86 is in high-end systems when you’re in a hot loop that lives entirely in the trace cache. Arm has three instruction sets: AArch32, Thumb-2, and AArch64. The first and last of these are fixed-width encodings and so are fairly trivial to decode; Thumb-2 instructions are 16 or 32 bits, but can all be fairly easily expanded to single AArch32 instructions. AArch64 has some complexity around SVE (it’s effectively a two-width instruction set, they just pretend it isn’t).
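
      To illustrate how cheap the Thumb-2 case is: deciding whether an instruction is 16 or 32 bits only needs the top five bits of the first halfword. A sketch based on my reading of the encoding rules (a real decoder obviously does far more than find the length):

      ```c
      #include <stdint.h>
      #include <stdio.h>

      /* Thumb-2: if bits [15:11] of the first halfword are 0b11101,
       * 0b11110 or 0b11111, the instruction is 32 bits, otherwise 16. */
      static int thumb2_length(uint16_t first_halfword) {
          unsigned top5 = first_halfword >> 11;
          return (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F) ? 4 : 2;
      }

      int main(void) {
          printf("%d\n", thumb2_length(0x4668)); /* mov r0, sp -> 2 */
          printf("%d\n", thumb2_length(0xF000)); /* bl ...     -> 4 */
          return 0;
      }
      ```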

      As a colleague once said, x86 chips don’t have an instruction decoder, they have an instruction parser. Any instruction is 1-15 bytes. You need a fairly complex state machine to decode it and, because of the way that prefixes can be chained, you need to parse the whole thing before you can figure out where the next one starts. Doing that on a superscalar chip that wants to issue multiple instructions per cycle is really hard, and so Intel chips don’t actually do that; they decode into trace caches that contain fixed-width micro-ops and then try to execute from there. Arm cores don’t need any of that logic and can typically cache either raw instructions or the result of some very simple expansion.
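
      To give a feel for the x86 side of that, here’s a heavily simplified C sketch: even just skipping the legacy and REX prefixes takes a loop, and after that you still need opcode tables, ModRM/SIB decoding, and displacement/immediate sizes before you know where the next instruction starts. This is nowhere near a complete decoder:

      ```c
      #include <stdint.h>
      #include <stddef.h>
      #include <stdio.h>

      /* Legacy prefixes: lock/rep, segment overrides, operand/address
       * size. Any of them may appear, in (almost) any order. */
      static int is_legacy_prefix(uint8_t b) {
          switch (b) {
          case 0xF0: case 0xF2: case 0xF3:              /* LOCK, REPNE, REP  */
          case 0x2E: case 0x36: case 0x3E: case 0x26:
          case 0x64: case 0x65:                         /* segment overrides */
          case 0x66: case 0x67:                         /* operand/addr size */
              return 1;
          default:
              return 0;
          }
      }

      /* Count how many bytes we must consume before we even reach the
       * opcode. In 64-bit mode a REX prefix (0x40-0x4F) may follow the
       * legacy prefixes; VEX/EVEX make this messier still, and escape
       * bytes like 0x0F mean the opcode itself is variable length too. */
      static size_t prefix_bytes(const uint8_t *insn, size_t len) {
          size_t i = 0;
          while (i < len && is_legacy_prefix(insn[i]))
              i++;
          if (i < len && (insn[i] & 0xF0) == 0x40)      /* REX.* */
              i++;
          return i;
      }

      int main(void) {
          /* Prefixes and escape bytes before the real opcode.
           * (Bytes chosen for illustration only.) */
          const uint8_t bytes[] = { 0x66, 0xF2, 0x48, 0x0F, 0x38, 0xF1 };
          printf("prefix bytes: %zu\n", prefix_bytes(bytes, sizeof bytes));
          return 0;
      }
      ```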

      The original RISC project was designed by looking at the instructions that C compilers actually generated on CISC systems and building an ISA optimised for those sequences. AArch32 was designed the same way. AArch64 used a variety of workloads including some managed languages to go through the same process. Somewhat depressingly, RISC-V did not. x86 gradually accreted over time. AArch64 and AArch32 in ARMv7, for example, have very powerful bitfield insert and extract instructions. These are really useful for a bunch of things (such as NaN boxing in JavaScript JITs), but are present only in very recent x86 chips.
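
      As a concrete example of the NaN-boxing case, here’s a hedged sketch in C: values are carried in the low 48 bits of an otherwise-impossible double bit pattern, and the unbox step is exactly the kind of bitfield extract that AArch64 can do in a single instruction. The tag layout is made up for illustration:

      ```c
      #include <stdint.h>
      #include <stdio.h>

      /* A made-up NaN-boxing layout: anything with these top 16 bits set
       * is "not a real double", and the low 48 bits carry a pointer-sized
       * payload. Real JS engines use similar but more careful schemes. */
      #define BOX_TAG      UINT64_C(0xFFFF000000000000)
      #define PAYLOAD_MASK UINT64_C(0x0000FFFFFFFFFFFF)

      static uint64_t box_pointer(uint64_t ptr_bits) {
          /* Insert the low 48 bits under the tag: BFI-shaped on AArch64. */
          return BOX_TAG | (ptr_bits & PAYLOAD_MASK);
      }

      static uint64_t unbox_pointer(uint64_t boxed) {
          /* Extract a 48-bit field: UBFX-shaped (or a single AND). */
          return boxed & PAYLOAD_MASK;
      }

      int main(void) {
          uint64_t p = UINT64_C(0x00007F1234567890);
          uint64_t boxed = box_pointer(p);
          printf("boxed   = 0x%016llx\n", (unsigned long long)boxed);
          printf("unboxed = 0x%016llx\n",
                 (unsigned long long)unbox_pointer(boxed));
          return 0;
      }
      ```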

      Arm does not have subregisters. On x86, AL, AH, AX, EAX, and RAX all update the same register. For anything shorter than RAX, this means you need a read-modify-write operation on the rename register. This adds complexity in the rename engine. Arm briefly had a floating-point mode that let you treat the FPU registers as either 32 32-bit or 16 64-bit floating-point registers. This caused similar pain and was abandoned (32-bit FPU ops needed to track the enclosing register pair so that they correctly half-overwrote or were overwritten by 64-bit operations). Recent Intel chips make updating subregisters quite fast but at the expense of microarchitectural complexity (i.e. power).
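
      A rough C model of what the rename engine has to cope with: a write to AX (or AL/AH) must merge with the old 64-bit value, i.e. a read-modify-write, while a write to EAX zero-extends and can simply produce a fresh value:

      ```c
      #include <stdint.h>
      #include <stdio.h>

      /* x86-64 rule of thumb:
       *  - writing AL/AH/AX preserves the rest of RAX (a merge), so a
       *    renamed copy needs the *old* value as an extra input;
       *  - writing EAX zero-extends into RAX, so no old value is needed. */

      static uint64_t write_ax(uint64_t old_rax, uint16_t ax) {
          return (old_rax & ~UINT64_C(0xFFFF)) | ax;              /* merge */
      }

      static uint64_t write_ah(uint64_t old_rax, uint8_t ah) {
          return (old_rax & ~UINT64_C(0xFF00)) | ((uint64_t)ah << 8);
      }

      static uint64_t write_eax(uint32_t eax) {
          return (uint64_t)eax;                             /* zero-extend */
      }

      int main(void) {
          uint64_t rax = UINT64_C(0x1122334455667788);
          printf("after AX  write: 0x%016llx\n",
                 (unsigned long long)write_ax(rax, 0xBEEF));
          printf("after AH  write: 0x%016llx\n",
                 (unsigned long long)write_ah(rax, 0xAB));
          printf("after EAX write: 0x%016llx\n",
                 (unsigned long long)write_eax(0xDEADBEEF));
          return 0;
      }
      ```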

      The Arm instruction set is designed to avoid data-dependent exceptions. Only loads and stores will trap, and they trap based only on the address, not the data (unless you have ECC memory or are Morello). x86, in contrast, has a load of instructions that can trap based on the value (e.g. integer division by zero). This means that you need to start a new speculation window every time you hit a divide instruction in x86, because you may have to reset the pipeline to the state at that instruction if you discover that it traps.
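
      A concrete illustration (C itself leaves integer division by zero undefined, so this is about what the underlying instructions do): x86’s DIV/IDIV raise a #DE exception on a zero divisor, which typically surfaces as SIGFPE, whereas AArch64’s UDIV/SDIV are defined to return zero and never trap:

      ```c
      #include <stdint.h>
      #include <stdio.h>

      /* Dividing by zero is undefined behaviour in C, so a compiler may
       * do anything here; the point is what the hardware does if a plain
       * divide instruction is actually executed:
       *   - x86 IDIV/DIV: raises #DE, delivered to the process as
       *     SIGFPE, so every divide is a potential exception and a
       *     speculation boundary;
       *   - AArch64 SDIV/UDIV: defined to return 0 for a zero divisor,
       *     no trap, no data-dependent exception. */
      static int64_t checked_div(int64_t a, int64_t b) {
          if (b == 0)
              return 0;   /* mirror the AArch64 behaviour in portable C */
          return a / b;
      }

      int main(void) {
          printf("%lld\n", (long long)checked_div(42, 0));  /* prints 0 */
          printf("%lld\n", (long long)checked_div(42, 6));  /* prints 7 */
          return 0;
      }
      ```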

      In summary, there are some fundamental differences between the two that mean that, within a fixed power and area budget, you should be able to make AArch64 faster. This matters somewhat less at the very high end, because a lot of these overheads are more-or-less fixed and so there isn’t a huge amount of difference when you scale everything else up. That said, instruction scheduling / register rename are some of the biggest fixed costs on a high-end core and anything that adds to that can end up being a bottleneck. x86 is made entirely out of things that add rename and scheduler complexity. If you’re building a laptop / tablet / phone core, these fixed costs are a very noticeable part of your power / area budget. Even without being especially clever, you can spend all of the saved transistor budget on extra cache and get a big speedup. It looks as if Apple does this as well as some more clever things.

      1. 1

        It’s a shame this fantastic answer came so long after the original thread; thanks for taking the time to write it. What sort of other clever things are you talking about at the end? I’d be interested to hear your thoughts on The Mill given what you’ve said here, too.

        1. 2

          I don’t know exactly what Apple is doing, but I don’t think their perf results are just from having large caches. I believe they’ve spent more of their power budget on the branch predictor than some competitors (this makes a lot of sense if your workloads are dynamic-language-heavy). I don’t know what else they’re doing, but given how much effort Apple put into AltiVec and SSE optimisations for a bunch of fast paths I wouldn’t be surprised if they’ve doubled up some of the pipelines for Neon operations and carefully tweaked hand-written standard library assembly and the compiler cost model to get the most out of this. They ask developers to upload LLVM IR to the app store and so can aggressively tune their generated code for the specific device that the app is being installed on (I wanted to do this with FreeBSD packages too, but we never had enough contributors to build out all of the infrastructure), so they can get the last 2-10% out of any given microarchitecture.

          The Mill feels a lot like VLIW in origin: leave difficult things out of the architecture and punt them to the compiler. This generally gives you great benchmark numbers and terrible real-world performance. The compiler can spend a lot more time optimising things than the CPU can (it’s fine for the compiler to spend a few tens of milliseconds on a function, that latency in the CPU would make performance completely unacceptable) so in theory it can do more, but in practice it can’t take advantage of any dynamic behaviour. The Mill doesn’t do register rename at all, so can be almost entirely execution logic, but it punts a lot to the compiler and it’s not clear to me that it will do any better in the real world than EDGE architectures.

          ISAs are now really dataflow encodings. Registers aren’t really registers, they’re just locally scoped names for data flow. This is something that really bugs me, because a typical compiler IR is an SSA form, which the compiler then maps into some fixed-number-of-registers encoding, and the register rename engine then tries to re-infer an SSA representation. It feels as if there really ought to be a better intermediate format between the two SSA forms, but so far every attempt that I’m aware of has failed.

          The rule of thumb for C/C++ code is that you get, on average, one branch per 7 instructions. This was one of the observations that motivated some of the RISC I and RISC II design and I’ve had students verify experimentally that it’s more or less still true today. This means that the most parallelism a compiler can easily extract on common workloads is among those 7 instructions. A lot of those have data dependencies, which makes it even harder. In contrast, the CPU sees predicted and real branch targets and so can try to extract parallelism from more instructions. Nvidia’s Project Denver cores are descendants of Transmeta’s designs and try to get the best of both worlds. They have a simple single-issue Arm decoder that emits one VLIW instruction per Arm instruction (it might do some trivial folding) and some software that watches the instruction stream and generates optimised traces from hot code paths. Their VLIW design is also interesting because each instruction is offset by one cycle and so you can put a bunch of instructions with data dependencies between them into the same bundle. This makes it something more like a VLIW / EDGE hybrid but the software layer means that you can avoid all of the complicated static hyperblock formation problems that come from EDGE architectures and speculatively generate good structures for common paths that you then roll back if you hit a slow path.
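
          If you want to sanity-check the one-branch-per-~7-instructions rule of thumb yourself, a crude way is to pipe objdump -d output for an AArch64 binary through something like the sketch below; the mnemonic matching is a rough heuristic of mine, not an exhaustive list of branch instructions:

          ```c
          #include <stdio.h>
          #include <string.h>

          /* Crude heuristic: treat these AArch64 mnemonics (and b.<cond>)
           * as branches. Feed `objdump -d` output on stdin, e.g.
           *   objdump -d ./some_aarch64_binary | ./branch_density       */
          static int is_branch(const char *m) {
              static const char *branches[] = { "b", "bl", "br", "blr",
                                                "ret", "cbz", "cbnz",
                                                "tbz", "tbnz" };
              for (size_t i = 0; i < sizeof branches / sizeof branches[0]; i++) {
                  size_t n = strlen(branches[i]);
                  if (strncmp(m, branches[i], n) == 0 &&
                      (m[n] == '\0' || m[n] == '.' || m[n] == '\t' ||
                       m[n] == ' ' || m[n] == '\n'))
                      return 1;
              }
              return 0;
          }

          int main(void) {
              char line[512];
              long insns = 0, branches = 0;
              while (fgets(line, sizeof line, stdin)) {
                  /* Instruction lines look like "  4005d4:\td503201f \tnop". */
                  char *tab1 = strchr(line, '\t');
                  if (!tab1 || !strchr(line, ':'))
                      continue;
                  char *tab2 = strchr(tab1 + 1, '\t');
                  if (!tab2)
                      continue;
                  insns++;
                  if (is_branch(tab2 + 1))
                      branches++;
              }
              if (branches)
                  printf("%ld instructions, %ld branches, %.1f instructions/branch\n",
                         insns, branches, (double)insns / branches);
              return 0;
          }
          ```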

    2. 8

      I wrote a blog post with something of a gloss on the subject after a particular use-case I was chasing down: https://barakmich.dev/posts/popcnt-arm64-go-asm/ – I tried to convey a gut-feeling on it, but I’m not sure if I was successful.

      But the fact is, when you’re talking performance, there’s lies, damn lies, and benchmarks. It’s not like these are apples-to-oranges comparisons (if you pardon the pun).

      The reasons to move toward an arm64 ISA are many, varied, and actually probably less performance-focused than, say, power-focused or licensing/lockdown-focused, to name two.

      1. 2

        This is really interesting! Thanks for posting!

        1. 1

          Really interesting, I’ll save your post for later.

        2. 4

          One point: ARM instructions tend to be fixed-width (like UTF-32), whereas x86 instructions tend to vary in size (like UTF-8). I always loved that.

          I’m intrigued by the Apple Silicon chip, but I can’t give you any one reason it should perform as well as it does, except maybe smaller process size / higher transistor count. I am also curious how well the Rosetta 2 can JIT x86 to native instructions.

          1. 10

            “Thumb-2 is a variable-length ISA. x86 is a bonkers-length ISA.” :)

            1. 1

              The x86 is relatively mild compared to the VAX architecture. The x86 is capped at 15 bytes per instruction, while the VAX has several instructions that exceed that (and there’s one that, in theory, could use all of memory).

              1. 2

                If you really want to split your brain, look up the EPIC architecture on the 64-bit Itaniums. These were an implementation of VLIW (Very Long Instruction Word). In VLIW, you can just pass a huge instruction that tells each individual functional unit what to do (essentially moving scheduling to the compiler). I think EPIC batched these in groups of three... been a while since I read up on it.
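
                Three per bundle sounds right: as far as I remember, an IA-64 bundle is 128 bits holding three 41-bit instruction slots plus a 5-bit template that tells the hardware which functional-unit types the slots go to. Here’s a small C sketch of pulling those fields apart (the field layout is from memory, so treat it as an assumption and check the manuals):

                ```c
                #include <stdint.h>
                #include <stdio.h>

                /* IA-64 bundle, as I remember it: 128 bits; bits [4:0] are
                 * the template, bits [45:5] slot 0, [86:46] slot 1,
                 * [127:87] slot 2. */
                typedef struct { uint64_t lo, hi; } ia64_bundle;

                static unsigned template_of(ia64_bundle b) {
                    return (unsigned)(b.lo & 0x1F);
                }

                static uint64_t slot_of(ia64_bundle b, int n) {
                    /* Extract the 41-bit field starting at bit 5 + 41*n of
                     * the 128-bit value (n is 0, 1 or 2, so the shift is
                     * 5, 46 or 87 and never 0). */
                    unsigned shift = 5 + 41 * (unsigned)n;
                    uint64_t field;
                    if (shift < 64)
                        field = (b.lo >> shift) | (b.hi << (64 - shift));
                    else
                        field = b.hi >> (shift - 64);
                    return field & ((UINT64_C(1) << 41) - 1);
                }

                int main(void) {
                    ia64_bundle b = { UINT64_C(0x0123456789ABCDEF),
                                      UINT64_C(0xFEDCBA9876543210) };
                    printf("template: %u\n", template_of(b));
                    for (int i = 0; i < 3; i++)
                        printf("slot %d: 0x%011llx\n", i,
                               (unsigned long long)slot_of(b, i));
                    return 0;
                }
                ```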

                1. 6

                  Interestingly, by one definition of RISC, this kind of thing makes Itanium a RISC machine: the compiler is expected to work out dependencies, which functional units to use, etc., which was one of the foundational concepts of RISC in the beginning. At some point RISC came to mean just “fewer instructions”, “fixed-length instructions”, and “no operations directly with memory”.

                  Honestly, I believe it is the latter that most universally distinguishes CISC from RISC at this point.

                  1. 3

                    Raymond Chen also wrote a series about the Itanium.

                    https://devblogs.microsoft.com/oldnewthing/20150727-00/?p=90821

                    It explains a bit of the architecture behind it.

                  2. 1

                    My (limited) understanding is that it’s not the instruction size as much as the fact that x86(-64) has piles of prefixes, weird special cases and outright ambiguous encodings. A more hardware-inclined friend of mine once described the instruction decoding process to me as “you can never tell where an instruction boundary actually is, so just read a byte, try to figure out if you have a valid instruction, and if you don’t then read another byte and repeat”. Dunno if VAX is that pathological or not, but I’d expect most things that are actually designed rather than accreted to be better.

                    1. 1

                      The VAX is “read byte, decode, read more if you have to”, but then, most architectures which don’t have fixed sized instructions are like that. The VAX is actually quite nice—each opcode is 1 byte, each operand is 1 to 6 bytes in size, up to 6 operands (most instructions take two operands). Every instruction supports all addressing modes (with the exception of destinations not accepting immediate mode for obvious reasons). The one instruction that can potentially take “all of memory” is the CASE instruction, which, yes, implements a jump table.

                2. 6

                  fixed-width instructions (like UTF-32)

                  Off-topic tangent from a former i18n engineer, which in no way disagrees with your comment: UTF-32 is indeed a fixed-width encoding of Unicode code points, but sadly that leads some people to believe that it is a fixed-width encoding of characters, which it isn’t: a single character can be represented by a variable-length sequence of code points.
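
                  A small C illustration of the distinction, using the classic example of “é” written as a base letter plus a combining accent (how it renders depends on your terminal):

                  ```c
                  #include <stdint.h>
                  #include <stdio.h>

                  int main(void) {
                      /* One user-perceived character, "é", spelled as two
                       * Unicode code points: U+0065 LATIN SMALL LETTER E
                       * followed by U+0301 COMBINING ACUTE ACCENT. */
                      uint32_t utf32[] = { 0x0065, 0x0301 };
                      size_t n = sizeof utf32 / sizeof utf32[0];

                      /* Fixed width in code points: every array element is
                       * exactly one code point... */
                      printf("%zu UTF-32 code units / code points\n", n);

                      /* ...but still variable width in characters: those two
                       * code points form a single grapheme cluster. (The
                       * string below is the same sequence as UTF-8 bytes.) */
                      printf("1 user-perceived character: e\xCC\x81\n");
                      return 0;
                  }
                  ```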

                  1. 10

                    V̸̝̕ȅ̵̮r̷̨͆y̴̕ t̸̑ru̶̗͑ẹ̵̊.

                  2. 6

                    I can’t give you any one reason it should perform as well as it does, except maybe smaller process size / higher transistor count.

                    One big thing: Apple packs an incredible amount of L1D/L1I and L2 cache into their ARM CPUs. Modern x86 CPUs also have beefy caches, but Apple takes it to the next level. For comparison: the current Ryzen family has 32KB L1I and L1D caches for each core; Apple’s M1 has 192KB of L1I and 128KB of L1D. Each Ryzen core also gets 512KB of L2; Apple’s M1 has 12MB of L2 shared across the 4 “performance” cores and another 4MB shared across the 4 “efficiency” cores.

                    1. 7

                      How can Apple afford these massive caches while other vendors can’t?

                      1. 3

                        I’m not an expert but here are some thoughts on what might be going on. In short, the 4 KB minimum page size on x86 puts an upper limit on the number of cache rows you can have.

                        The calculation at the end is not right and I’d like to know exactly why. I’m pretty sure the A12 chip has 4-way associativity. Maybe the cache lookups are always aligned to 32 bits which is something I didn’t take into account.
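
                        For what it’s worth, the usual back-of-the-envelope version of that limit is that a virtually-indexed, physically-tagged (VIPT) L1 can be at most page size times associativity before aliasing becomes a problem. A sketch of that arithmetic (the page sizes and way counts here are the commonly reported ones, so treat them as assumptions):

                        ```c
                        #include <stdio.h>

                        /* VIPT rule of thumb: the index + offset bits of the
                         * cache must fit inside the page offset, so
                         * max L1 size = page_size * ways without having to
                         * handle virtual-address aliases. */
                        static long max_vipt_l1(long page_size, long ways) {
                            return page_size * ways;
                        }

                        int main(void) {
                            /* Typical x86: 4 KiB pages, 8-way L1 -> 32 KiB,
                             * which is the usual x86 L1D size. */
                            printf("x86-style:   %ld KiB\n",
                                   max_vipt_l1(4096, 8) / 1024);

                            /* Apple runs with 16 KiB pages; with 8 ways that
                             * allows a 128 KiB L1 without aliasing games. */
                            printf("Apple-style: %ld KiB\n",
                                   max_vipt_l1(16384, 8) / 1024);
                            return 0;
                        }
                        ```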

                      2. 3

                        For comparison: the current Ryzen family has 32KB L1I and L1D caches for each core; Apple’s M1 has 192KB of L1I and 128KB of L1D. Each Ryzen core also gets 512KB of L2; Apple’s M1 has 12MB of L2 shared across the 4 “performance” cores

                        This is somewhat incomplete. The 512KiB L2 on Ryzen is per core. Ryzen CPUs also have L3 cache that is shared by cores. E.g. the Ryzen 3700X has 16MiB L3 cache per core complex (32 MiB in total) and the 3900X has 64MiB in total (also 16MiB per core complex).

                        1. 2

                          How does the speed of L1 on the M1 compare to the speed of L1 on the Ryzen? Are they on par?

                      3. 4

                        Apple needs to support about 20 devices and 2 software stacks, whereas your Core i9 can still boot and run all the BIOS and software on your grandma’s 286 in the shed if you could somehow Frankenstein it physically into the bus.

                        1. 3

                          x86 is CISC and kept backwards compatibility throughout, so it carries a lot of baggage.

                          ARM is RISC, and has had a lot of compatibility-breaking ISA revisions, thus quite clean, relatively speaking.

                          1. 3

                            Remember that we can’t get a direct comparison between architectures based on this change to the MacBook lineup, as there was also a huge change in process node. I believe the Intel chips were on 14nm, M1 is on 5nm. Shrinking the size of the chips is likely responsible for most of the performance and efficiency gains Apple is reporting, versus any differences in architecture.

                            1. 2

                              The extremely short version: x86 is a fortysome-year-old architecture, extended all the time since then and with two significant attempts at cleaning up odd parts, while ARM is twenty-some years old with one major attempt at cleaning it up, and ARM was better to begin with, enjoying both the benefits of twenty years of general architecture experience and a stellar team.

                              1. 1

                                ARM is 35

                                1. 10

                                  AArch64 has almost nothing to do with the 32-bit ARM, it’s a complete redesign of the ISA.

                                  1. 2

                                    But the 32-to-64-bit transition was totally different from the botchery that ended up with x86-64, no? Like, 64-bit ARM was a complete redesign, wasn’t it?

                                    1. 1

                                      Well, as I see these things, and it’s a matter of taste IMO, both of them did something remarkably well, yet produced something that was recognisably a child of its parent. I consider the changes similar, even if they were similar changes starting from different bases. The AMD 64-bit team produced a remarkably unhorrible result considering what it started with; AArch64 also delivered improvements, which makes it similar, except that its starting point was ARM rather than the dreadful x86.

                                      This is also a good occasion to quote Van Jacobson: “Life is too short to spend debugging Intel parts”. Congratulations, Apple.

                                      1. 2

                                        I mean, x86-64 is making the best out of a bad hand, but that it was AMD that had to invent it while Intel was off chasing Itanic was probably a pretty good indicator of how Intel would lose its lead.

                                    2. 1

                                      How the years fly by.

                                      15 years, then. Or even 14, if you consider 1971 rather than 1970 the start.

                                  2. 2

                                    I’m not an expert, but Apple seems uniquely positioned in that they can co-develop the processor and its most popular compiler at the same time. If an instruction or series of instructions is causing performance problems, they can just… not use it. Intel develops its own compilers too, and they get really good performance on Intel’s latest chips, but they’re not that widely used due to the high cost and other issues.

                                    There are other well-funded ARM vendors like Qualcomm and AWS, and none of them are reporting the same spectacularly good benchmark numbers.

                                    1. 3

                                      Part of the reason there is that Qualcomm and Amazon are relatively new to the game compared to Apple.

                                      1. 3

                                        Apple’s early involvement with Arm in the ancient times is unlikely to have anything to do with this. The first three generations of iPhone used Samsung SoCs, the first Apple SoC used Arm Cortex cores just like the Samsung ones. It is believed that the greatness of the custom cores they started using later has everything to do with the acquisition of chip design company P.A. Semi.

                                        1. 2

                                          I’m sure that the fact that Apple has about the same yearly revenue as Sweden is probably also a factor. Amazon is about the same size but has to actually do infrastructure stuff for a living, so their net revenue is about 1/5th of Apple’s. (Source: Wikipedia.)

                                        2. 2

                                          Apple does not have benchmark numbers in the same class of system as AWS. Sure, they have impressive single-core performance, but they don’t have a 64-core chip. With servers, scalability matters, not maximizing single-core perf in a tiny form factor. Maybe their design decisions are not actually that good for a big server. There are lots of other things to spend die space on rather than gigantic L1 caches.

                                          AWS uses stock Arm-designed cores, by the way. Qualcomm uses semi-custom ones (for all we know this could be little more than a rebrand). Their benchmark numbers depend on Arm in a big way.