1. 8
  1.  

  2. 1

    I like this. Didn’t Apple do something very similar with the Code Fragment Manager back when they transitioned from 680x0 to PowerPC? I’d been wondering whether there was some technical reason that a company that controlled both ABIs, as well as the dominant toolchain, couldn’t do something along these lines.

    It seems like a very good idea. It’ll be interesting to see how it holds up in the wild.

    1. 3

      Apple has made a bunch of tweaks to both the SysV x86 and Arm ABIs to reduce the differences between PowerPC, x86, and Arm. In particular, they define types to have the same size and alignment, and structs to have the same padding rules, across architectures. This makes it easy to jump between emulated and native code because all in-memory data structures have the same layout.

      This goes a step further and defines the same number of argument registers and the same stack layout across the two architectures, which means that you can trivially do on-stack replacement to jump between the two worlds (move emulated register values into the corresponding real argument registers and call the real function). This was one of the overhead hot spots in Rosetta: in a few places, functions with a lot of arguments were called from the emulator and implemented in native code, so the trampoline needed to read all of the arguments from the stack and then write them out again in a different order. The Win32 APIs have a lot more functions than POSIX / OpenStep that take a lot of arguments and so probably see this more.
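
      To make the trampoline cost concrete, here is a rough C model of the two cases. It is only a sketch: `GuestState`, `native_pack4`, and both thunk functions are made-up names for illustration, and a real emulator does this in generated machine code rather than C. The first thunk shows the matching-convention case, where emulated argument registers map one-to-one onto native arguments; the second shows the case where every argument has to be loaded from the guest stack first.

      ```c
      #include <stdint.h>

      /* Hypothetical emulated CPU state: four argument registers plus a
         guest stack pointer. Not taken from any real emulator. */
      typedef struct {
          uint64_t arg_regs[4];
          const uint64_t *sp;   /* stack-passed arguments, first argument lowest */
      } GuestState;

      /* Stand-in for a natively compiled function with four integer arguments. */
      static uint64_t native_pack4(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
      {
          return (a << 24) | (b << 16) | (c << 8) | d;
      }

      /* Matching conventions: guest argument register i is native argument i,
         so the thunk only forwards register values. */
      uint64_t thunk_matching(const GuestState *g)
      {
          return native_pack4(g->arg_regs[0], g->arg_regs[1],
                              g->arg_regs[2], g->arg_regs[3]);
      }

      /* Mismatched conventions: the guest passed everything on its stack, so
         each argument is loaded from guest memory before the native call. */
      uint64_t thunk_from_stack(const GuestState *g)
      {
          uint64_t a = g->sp[0], b = g->sp[1], c = g->sp[2], d = g->sp[3];
          return native_pack4(a, b, c, d);
      }
      ```

      Both thunks compute the same result; the difference is the extra memory traffic in the second one, which grows with the number of arguments.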

      The really interesting thing here is the implicit assumption that targeting Arm and emulating on x86 is going to be more common than the converse.

      1. 1

        > The Win32 APIs have a lot more functions than POSIX / OpenStep that take a lot of arguments and so probably see this more.

        IME, Win32 tends to prefer CreateHoberEx(LPHOBERARGS) whenever a function gets too many arguments: partly for extensibility (the struct usually carries its own size as the first field), and partly because the ugly x86 calling convention makes it about as efficient anyway.
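
        For anyone who hasn't run into the size-first struct pattern, here is a minimal sketch of how it buys extensibility. Everything here is hypothetical: `HOBERARGS_V1`, `HOBERARGS_V2`, the `priority` field, and the default value are invented for illustration, and this `CreateHoberEx` is not a real Win32 function. The callee reads the size field to work out which version of the struct the caller was compiled against and fills in defaults for fields the caller doesn't know about.

        ```c
        #include <stdint.h>
        #include <string.h>

        /* Hypothetical v1 argument struct; the first field records its size. */
        typedef struct {
            uint32_t cbSize;    /* caller sets this to the size of its struct */
            uint32_t flags;
        } HOBERARGS_V1;

        /* Hypothetical v2 appends a field; existing fields keep their offsets. */
        typedef struct {
            uint32_t cbSize;
            uint32_t flags;
            uint32_t priority;  /* new in v2 */
        } HOBERARGS_V2;

        /* Returns the effective priority, or -1 if the struct is too small. */
        int CreateHoberEx(const void *args)
        {
            uint32_t size;
            memcpy(&size, args, sizeof size);       /* read the size field first */
            if (size < sizeof(HOBERARGS_V1))
                return -1;

            HOBERARGS_V2 full = { .priority = 5 };  /* assumed default for old callers */
            if (size > sizeof full)
                size = sizeof full;                 /* ignore fields newer than we know */
            memcpy(&full, args, size);              /* copy only what the caller provided */
            return (int)full.priority;
        }
        ```

        A caller built against v1 passes an 8-byte struct and gets the default priority; a caller built against v2 passes 12 bytes and gets its own value, with no change to the function signature.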