1. 19
    1. 1

      Isn’t a variant of this how coroutines work in Rust and C++? Just with an opaque struct instead of global variables.

      1. 1

        I haven’t used them myself but yes I believe so

        1. 1

          Not really. There are two special things about coroutines:

          • the function is split up into sections with multiple entry and exit points where it yields or awaits;

          • the split functions are closures over the variables that remain live across yield points.
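          Concretely, here is a hand-rolled C++ sketch of roughly what that transform produces (the names and the switch-based resumption trick are mine; real compiler output differs in detail): a state machine whose members are exactly the variables that remain live across yield points.

              #include <cstdio>
              #include <optional>

              // Hand-rolled equivalent of a coroutine that yields a, a+1, a+2.
              // `state` records which yield point to resume from; `a` and `i`
              // live in the frame because they survive across yields.
              struct Counter {
                  int state = 0;
                  int a, i;

                  explicit Counter(int start) : a(start) {}

                  std::optional<int> resume() {
                      switch (state) {
                      case 0:                       // initial entry point
                          for (i = 0; i < 3; ++i) {
                              state = 1;
                              return a + i;         // exit point: "yield a + i"
                      case 1:;                      // entry point: resume inside the loop
                          }
                          state = 2;
                      default:
                          return std::nullopt;      // finished
                      }
                  }
              };

              int main() {
                  Counter c(10);
                  while (auto v = c.resume()) std::printf("%d\n", *v);  // 10 11 12
              }

          Each call to resume() re-enters the function at the recorded yield point, which is the “multiple entry and exit points” part.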

          The differences from old fortran-style function calls are:

          • in fortran, arguments were stored at fixed addresses, in the function’s statically-allocated variables; in c++ and rust they are passed in registers or on the stack

          • in fortran there was always one activation record for each function; in c++ and rust there are zero when a function is not running, and there can be many if it is recursive or if there are many concurrent coroutines
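          To make the activation-record point concrete, a small C++ sketch (the names are mine): the fortran-style version keeps its argument and result at fixed static addresses, which is exactly why classic FORTRAN could not support recursion or reentrancy, while the stack version gets a fresh record per call.

              #include <cstdio>

              // Fortran-style: one statically allocated activation record per
              // function. A recursive or concurrent second activation would
              // clobber these slots.
              namespace fortran_style {
                  int n;       // argument slot at a fixed address
                  int result;  // result slot at a fixed address

                  void fact() {
                      result = 1;
                      for (int k = 2; k <= n; ++k) result *= k;  // must stay iterative
                  }
              }

              // Stack-style: each call gets a fresh activation record, so plain
              // recursion works.
              int fact(int n) { return n <= 1 ? 1 : n * fact(n - 1); }

              int main() {
                  fortran_style::n = 5;
                  fortran_style::fact();
                  std::printf("%d %d\n", fortran_style::result, fact(5));  // 120 120
              }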

        2. 1

          The author doesn’t appear to realise that CPU registers ARE special global variables.

          Adding special function call and return instructions turns out to be a mistake, at least if you want fast code. They can help with code size. VAX calls should be avoided. Even x86 CALL is inefficient: it copies the return address to RAM (the stack) and back, even though a large percentage of function calls are to leaf functions that don’t call anything else (so there’s no need to save the return address to RAM), and many calls are so short that the overhead is significant.

          1. 2

            The problem with the VAX call and return instructions is they did too much. You can’t draw a general conclusion that lightweight call/return instructions are bad because over-elaborate call/return instructions are bad.

            Modern instruction sets prefer not to expose the program counter as if it were a general-purpose register, because that causes too many difficult special cases in the implementation. (arm32 suffers from this; arm64 fixed the mistake.) Hiding the pc register makes it impossible to build a function call from basic register ops: instead the instruction set has branch-with-link (ie, call) and either a return instruction or general branch-to-register.

            As I understand it, x86 implementations optimize access to the top of stack so that function calls are as fast as bl/ret would be on more modern instruction sets.

            1. 1

              Storing the return address into RAM is itself too much work. The fact that the most recent x86 implementations map stack locations to rename registers is irrelevant – that’s energy use and silicon that should not be required in the first place.

              I note that IBM patented this in 2000 (https://patents.google.com/patent/US7085914B1/) so it’s just recently run out.

              RISC-V also “hides” the PC. As well as the BL/JAL instructions, the PC can be copied into a GPR on both RISC-V (AUIPC) and arm64 (ADRP), while also adding a +/-2 GB (RISC-V) or +/-4 GB (arm64) offset to it, in units of 4K. RISC-V adds the offset to the raw address of the AUIPC instruction, while arm64 truncates the address of the ADRP instruction to a multiple of 4K first.
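              A toy C++ model of the two computations just described (the helper names are mine; the hi immediates are shown already sign-extended):

                  #include <cstdint>
                  #include <cstdio>

                  // RISC-V AUIPC: rd = pc + (hi20 << 12), from the raw instruction address.
                  uint64_t auipc(uint64_t pc, int64_t hi20) { return pc + (hi20 << 12); }

                  // arm64 ADRP: Xd = (pc & ~0xFFF) + (hi21 << 12); the instruction address
                  // is truncated to a 4K boundary first, so the result is page-aligned.
                  uint64_t adrp(uint64_t pc, int64_t hi21) { return (pc & ~0xFFFull) + (hi21 << 12); }

                  int main() {
                      uint64_t pc = 0x10000ABC;  // an instruction address in mid-page
                      std::printf("%#llx\n", (unsigned long long)auipc(pc, 1));  // 0x10001abc
                      std::printf("%#llx\n", (unsigned long long)adrp(pc, 1));   // 0x10001000
                  }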

              > Hiding the pc register makes it impossible to build a function call from basic register ops

              So actually that’s not true. Both RISC-V and Arm support calls with a much larger offset than BL/JAL provides by using a two-instruction sequence: the first instruction gets PC+hi*4K into a register, and the second does a jump-and-link-indirect to a 12-bit offset from that register.

              The BL/JAL instructions could be removed from the ISAs if you didn’t mind every call taking two instructions instead of most of them taking one.

              1. 1

                What I meant by “basic register ops” is the regular mov/add/etc. that use the general purpose registers. You can’t use those for call and return if pc is hidden, instead you need special instructions that manipulate the program counter as you described.

                There’s another advantage to having a special case for bl/ret, which is to avoid spamming the general branch prediction machinery with the return addresses of functions (which often get called from many places). So, most processors have an internal return address stack, shadowing the regular stack in memory.
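                As a rough illustration, a toy C++ model of such a return address stack (the 16-entry depth and the names are illustrative guesses, not any particular core’s design):

                    #include <array>
                    #include <cstdint>
                    #include <cstdio>

                    // Toy return address stack: the front end pushes a prediction on
                    // every call and pops one on every return, bypassing the general
                    // branch prediction machinery.
                    class ReturnAddressStack {
                        std::array<uint64_t, 16> entries{};  // depth is an illustrative guess
                        unsigned top = 0;                    // grows without bound; index is mod 16
                    public:
                        void on_call(uint64_t return_addr) { entries[top++ % 16] = return_addr; }
                        uint64_t on_return() { return entries[--top % 16]; }
                    };

                    int main() {
                        ReturnAddressStack ras;
                        ras.on_call(0x1004);  // outer call pushes its return address
                        ras.on_call(0x2008);  // nested call pushes another
                        std::printf("%#llx\n", (unsigned long long)ras.on_return());  // 0x2008
                        std::printf("%#llx\n", (unsigned long long)ras.on_return());  // 0x1004
                    }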

                RISC-V tries to pretend it doesn’t have specific call and return instructions: it doesn’t have an architectural link register like ARM. However, it has a conventional link register (er, two, in fact: ra/x1 and the alternate x5) which implementations are expected to hook up to their return address stack if they have one. So it ends up having instructions that are effectively bl and ret, but more complicated.

            2. 2

              I suspect the author is aware, as they have written extensively on calling conventions in the past: https://devblogs.microsoft.com/oldnewthing/20040102-00/?p=41213

              I did feel that the article could have used more depth on why it may be advantageous not to support a call stack when memory is limited. Fortran is in many ways a language designed for a specific domain: it has great affordances for fast numerical computation and less support for e.g. string processing. In the early days of its use, I imagine using stacks for function calls was not an unknown idea, but I’m sure they would have been considered an unaffordable luxury by the language designers.

              For some early history of the stack in programming, let me refer you to this paper: https://www.sigcis.org/files/A%20brief%20history.pdf

              You’re right that many of the conveniences added to processors have not been useful in light of optimising compilers, but they are still relevant to talk about in relation to language evolution.

              1. 1

                I’d like to know what type of programming you do where leaf functions are common, because in my experience at The Enterprise, leaf functions are rare, and what you mostly have are layers upon layers of calls trying to abstract everything out.

                I also think you’re applying today’s concerns to yesterday. The VAX was designed in the mid-70s, when CPU and memory speeds were nearly matched, so saving to memory was less of an issue. The CALLS and CALLG instructions were intended for a common language calling convention so one could link Pascal, Fortran and Bliss object files into a single executable. [1] The VAX also has JSB (jump subroutine) and RSB (return from subroutine), which do less and more closely match the x86 CALL and RET instructions.

                [1] I think the CALLS and CALLG instructions are rather brilliant in that they allow one to either pass parameters in via the stack, or a pointer (in memory outside the stack) and the callee doesn’t have to care how the parameters came in. I know, that’s unfashionable these days when everything wants to be passed in registers, only for the callee to push them out to the stack anyway to make further calls, because, again, in my experience, very few subroutines tend to be leaf subroutines.

                1. 1

                  As long as those mid-level functions are, on average, making two or more calls to lower-level functions (either different functions, or the same one in a loop), then most function CALLs occur to leaves. Not most functions, but most calls. Or at least a significant percentage, even in Enterprise software. I think you might be surprised if you measured it.
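                  The arithmetic: with fan-out b, roughly (b-1)/b of all calls land on leaves. A toy C++ count over a full call tree (the depth and fan-out parameters are arbitrary illustrations):

                      #include <cstdio>

                      struct Counts { long leaf = 0, total = 0; };

                      // Walk a full call tree in which every non-leaf function makes
                      // `fanout` calls, counting how many calls target leaf functions.
                      void walk(int depth, int fanout, Counts& c) {
                          for (int i = 0; i < fanout; ++i) {
                              ++c.total;
                              if (depth == 1) ++c.leaf;          // callee makes no further calls
                              else walk(depth - 1, fanout, c);
                          }
                      }

                      int main() {
                          for (int fanout = 2; fanout <= 4; ++fanout) {
                              Counts c;
                              walk(6, fanout, c);  // 6 levels of calls below main
                              std::printf("fanout %d: %ld/%ld calls (%.0f%%) hit leaves\n",
                                          fanout, c.leaf, c.total, 100.0 * c.leaf / c.total);
                          }
                      }

                  With fan-out 2 it is already just over half (64 of 126 calls), and it climbs toward 100% as fan-out grows (67% at 3, 75% at 4).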

                  I was around in the late 70s when the VAX was introduced, though I think the one that arrived at my university in 1982 was the first in the country. The main design concern with the VAX was not speed (it was known and expected to be slower than the 11/70) or code size, but assembly-language programmer productivity, as programmers were expensive and rare while computers were rapidly becoming cheaper.