You mean Go wasn’t using registers for function parameters until now? 😧
I’m surprised, because mainstream ABIs have been doing this for a very long time (except x86-32, which has too few registers.) That it took Go this long seems like another casualty of rolling its own ABI.
I remember PowerPC having another neat optimization that IIRC isn’t in x86 or ARM — the subroutine-call instruction put the return address in a special Link register, instead of pushing it on the stack. Then returning was a Branch To Link Register instruction. That way calling a leaf function required no memory access at all. (Non-leaf functions did still have to push a stack frame, including the saved LR value.)
> I remember PowerPC having another neat optimization that IIRC isn’t in x86 or ARM — the subroutine-call instruction put the return address in a special Link register, instead of pushing it on the stack.
This is the same on Arm (bl, branch and link, puts the return address in lr, the link register). This design, I believe, comes from early RISC designs that wanted orthogonal instructions. As a design principle, only load/store instructions should load from or store to memory; everything else should access only internal CPU state.
Some architectures have a designated link register but most RISC ISAs make the link register a GPR and allow the branch-and-link to write to any register. The more general functionality means that branch-and-link instructions are bigger (they need 5 bits more space on architectures with 32 registers), which consumes encoding space that could probably be better used for other things. In particular, if you’ve got PC-relative addressing then you can emulate the general-purpose bl with an add of PC followed by a branch.
There was some debate in early RISC designs about whether branch-and-link was actually necessary. The compiler could always generate something like:
    add lr, pc, 8
    b   {destination}
This assumes the value of PC is the value at the start of the instruction (for interesting historical reasons, on Arm it’s displaced forward) and that add and branch are each 4-byte instructions. The general consensus was that it was worth folding these into a single instruction because:
- Subroutine call is a very common operation (probably the only thing I disagree with Dijkstra on: he wrote an essay deploring special-purpose instructions for subroutine calls).
- A branch instruction does not write to any other register.
- Any instruction reads the PC, so there’s no extra register read.
I believe most ISAs went the same way as Arm, providing a branch-and-link that uses an architecturally defined link register and assuming that the tiny number of cases that wanted a different destination could use the two-instruction sequence above.
RISC-V, as is traditional, did the wrong thing and ended up using 5 bits in their JAL / JALR instructions to encode one bit of state (should the return address go to the link register or be discarded?). This means that JAL and JALR (J and JR are aliases for these that write the result to x0 [i.e. discard it]) consume 16 times as much encoding space as they need to. This is a problem because these instructions have large immediates and so consume a lot of encoding space even without the extra bloat. JAL consumes an entire major opcode, in spite of needing only 1/16th of one for exactly the same code density to within a rounding error. This means that they didn’t have the encoding space left for rich addressing modes.
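A minimal sketch of what that looks like in RISC-V assembly (func and exit are made-up labels): the full 5-bit rd field selects the link register, but only two values ever appear in practice.

    jal  ra, func       # call: rd = ra (x1) receives the return address
    jal  x0, exit       # plain jump: rd = x0 discards the link (alias: j exit)
func:
    jalr x0, 0(ra)      # return: jump to ra, discard the link (alias: ret)
exit:
    nop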
David, in the spirit of (US) Thanksgiving, I’m thankful for your detailed & informative replies on lobste.rs!
One quibble: it seems you contradicted yourself, saying first “most RISC ISAs make the link register a GPR and allow the branch-and-link to write to any register”, but then later “most ISAs went the same way as Arm, providing a branch-and-link that uses an architecturally defined link register”.
I think I agree that it’s not worth the expense to allow any register to act as a link register. It’s hard to think of how you’d take advantage of multiple registers for this, given that caller and callee have to agree on which register to use.
> David, in the spirit of (US) Thanksgiving, I’m thankful for your detailed & informative replies on lobste.rs!
Thanks!
> One quibble: it seems you contradicted yourself, saying first “most RISC ISAs make the link register a GPR and allow the branch-and-link to write to any register”, but then later “most ISAs went the same way as Arm, providing a branch-and-link that uses an architecturally defined link register”.
Yes, sorry. ‘Most RISC ISAs’ is a bit of a fluffy concept. I think (from memory), Arm (A32 / T32 / A64) use a fixed link register, PowerPC allows any target, SPARC doesn’t (because it’s coupled to the register window concept), and I don’t remember what PA-RISC or Alpha did. There are lots of RISC ISAs that have died or never got large market share, so both are probably correct depending on how you count ‘most’. In terms of ‘most units shipped’, Arm alone probably counts. In terms of most distinct ISAs, I have no idea. In terms of most commercially successful ISAs, I think it’s probably fairly evenly split. I have no idea which of the meanings I meant each time or which one is actually correct for several interpretations.
I think this also shifted a bit over time. RISC-I and MIPS had a strong design principle that the ABI should not leak into the ISA. RISC-II and SPARC went the other way. Arm tries to avoid it, but is willing to compromise. In the extreme case, the stack pointer is not special in any way, but loads and stores on the stack are incredibly common (and a very large proportion of the static instruction count: loads and stores to heap or address-taken stack variables are often in loops and so contribute less to the static instruction count relative to the number that appear in the dynamic instruction count), and so they’re worth treating as special cases.
> I think I agree that it’s not worth the expense to allow any register to act as a link register. It’s hard to think of how you’d take advantage of multiple registers for this, given that caller and callee have to agree on which register to use.
It is useful in theory for leaf functions that are not exposed across a compilation-unit boundary. These can use a different link register (because they don’t have to respect the ABI) and use a calling convention that makes the normal link register callee-save. This allows the caller to avoid having to save the real link register. It’s also useful on architectures that don’t have PC-relative addressing.
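A sketch of that trick in classic MIPS assembly, where jalr can name any link register (helper is a hypothetical private leaf function):

    la   $t9, helper     # address of the private leaf helper
    jalr $t8, $t9        # call it, linking through $t8; $ra stays live
    nop                  # branch delay slot

helper:
    jr   $t8             # the leaf returns through the alternate link register
    nop                  # branch delay slot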
If I remember correctly, PowerPC uses the ‘branch-and-link-to-next-instruction’ instruction as a ‘move PC to general-purpose register’ instruction (with special-case logic in the branch predictor to understand that it’s not really a jump if the immediate offset is +4), which allows them to avoid making the PC a general-purpose register and avoid needing special-purpose instructions for PC-relative addressing. AArch64 and RISC-V, for example, both have explicit instructions for calculating an address with PC as the base, which consume quite a bit of encoding space, which PowerPC can avoid by having a single cheap instruction to get the PC.
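Roughly, as a sketch in PowerPC assembly (real code would save and restore LR around this, since bl clobbers it):

    bl   1f        # branch-and-link to the very next instruction: LR = address of 1
1:  mflr r3        # copy the link register into a GPR; r3 now holds the PC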
On 32-bit x86, you do something similar: call one instruction forward and pop from the stack. If you’re lucky, the implementation will be able to do some micro-op fusion. If not, then the fact that the call and the pop both modify the stack pointer means that you end up with some horrible interaction with register renaming. This composes incredibly badly with the fact that you have only six GPRs on x86, so if you need one for PC-relative addressing you’re down to five. If you need 64-bit arithmetic, you are now down to two pairs, which since all of the instructions are destructive means that you need more stack spills for any temporaries you want to preserve. This is why 32-bit Windows doesn’t use position-independent code for DLLs and instead does a fun constraint-solving thing to find static base addresses for each DLL that can work with every installed program and does static relocations in the DLLs.
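The 32-bit x86 idiom, as a sketch in AT&T syntax:

    call 1f        # pushes the return address (the address of 1:) onto the stack
1:  popl %ebx      # %ebx now holds the PC, at the cost of a stack round-trip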
> If I remember correctly, PowerPC uses the ‘branch-and-link-to-next-instruction’ instruction as a ‘move PC to general-purpose register’ instruction (with special-case logic in the branch predictor to understand that it’s not really a jump if the immediate offset is +4), which allows them to avoid making the PC a general-purpose register and avoid needing special-purpose instructions for PC-relative addressing. AArch64 and RISC-V, for example, both have explicit instructions for calculating an address with PC as the base, which consume quite a bit of encoding space, which PowerPC can avoid by having a single cheap instruction to get the PC.
FWIW, POWER9 did add an instruction to add from PC. They also “added” a new system call instruction, so I expect a new ABI.
A couple of things can be made simpler. For instance, when you pass parameters on the stack and assume that all registers are callee-clobbered, every function call can be used as a preemption point for goroutines. During the call, only the PC needs to be saved; all other state is already saved on the stack.
I believe I read that parameter passing on the stack also helped with stack growth and shrinking, when pointers to memory on the stack need to be rewritten: then you only have to find values in one buffer, the stack, not also in a secondary layout of saved registers on the stack or somewhere close by.
Thanks for pointing that out, BTW. This made me look at the proposal, which includes some justification for this design (yes, plan9). :)
I feel like your post is one-sided. I don’t understand them in detail, but my impression, last I heard about it, was that there were very good reasons for Go to have its own ABI.
Well, the segmented stacks originally used for goroutines may have pushed them to a nonstandard ABI. And there may be other ways that it got them some marginal optimizations. But they paid for it with extra overhead calling to/from any other language (including system calls) and incompatibility with other tooling like debuggers and crash reporters. Other non-interpreted languages have managed to provide GC and green threads without breaking the platform ABI.
(I used Go a lot at one point, but the architects’ resolute NIH (or more accurately, NIBP9) attitude was one of the things that soured me.)
MIPS and (I think) SPARC do this as well.
Arm, AArch64, and RISC-V all do this as well. It seems to be just one of those small innovations that make life a little better for everyone.
For RISC-V, it’s part of the ISA’s own ABI documents, even. RISC-V has 32 GPRs (31 if not counting x0, which is hardwired to zero).
Register-based parameter passing is sanity. Stack-based exists because of register-starved architectures.
Same deal for storing the return address of a call on the stack; RISC-V and other sane architectures save it into a register. Of course, it’ll end up saved on the stack anyway when making deeper calls.
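For example, a sketch of the usual RV64 pattern (helper is a placeholder name): a leaf returns straight through ra, while a non-leaf has to spill ra before calling out.

non_leaf:
    addi sp, sp, -16
    sd   ra, 8(sp)      # spill the return address before making a call
    call helper         # the call overwrites ra
    ld   ra, 8(sp)      # restore it
    addi sp, sp, 16
    ret

helper:
    ret                 # leaf: the return address never touches memory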
fastcall dates to, what, the early 90s?
Fastcall isn’t always faster. Pushing to and popping from the stack is fairly quick on x86, but you only have 6 GPRs (and they’re not all really general-purpose; some instructions require specific registers), so if you pass arguments in registers then the caller needs to spill some things to make room for them and the callee needs to spill them to make room for anything else it does.
Most RISC platforms have used register-based calling conventions since their creation and intentionally had enough registers to make this efficient. RISC-II and SPARC both optimised this further with register windows, where each function sees only a subset of the full register file, with some reserved for incoming arguments, some for local use, and some for outgoing arguments. With RISC-II this wasn’t great because it had a fixed number of windows (8, I think) and hitting this limit in call depth required some special-case behaviour.
SPARC, in contrast, arranges them in a ring so that when you hit the end you get a trap, spill some of them, and then can continue. The very latest UltraSPARC chips (which shipped after everyone who didn’t work at Oracle had given up on SPARC) apparently finally implemented the obvious extension to this and spill things from old register windows asynchronously. This means that any time the load-store pipeline is idle, the CPU can spill old registers in the background; unless you’re basically doing recursive function calls with no work in them, or function calls with nothing other than memory ops at every recursion depth, you end up with a CPU that performs as if it had an unlimited number of registers.
On a modern CPU, register windows are not really subsets of the register file, they’re namespaces in the register rename engine, so the asynchronous spill gives the CPU a nice way of reclaiming rename registers when they’re under pressure.
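For flavour, the SPARC entry/exit sequence as a sketch: save rotates to a fresh window (and adjusts the stack pointer in the same instruction), and restore rotates back on the way out.

func:
    save    %sp, -96, %sp   # new window: caller's %o registers become our %i registers
    nop                     # function body; %l and %o registers are ours to use
    ret                     # return to caller (jumps to %i7 + 8)
    restore                 # executes in the delay slot; the window rotates back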
Yeah, I’m sure Borland C++ was using AX-DX for the first 4 params back in the day, and you needed to specifically override it if you wanted to use the stack only. Maybe Turbo Pascal, too.
Apologies, I’ve only used x86 on Macs, so I’m not familiar with the 32-bit ISA.