If I understand correctly (supported by the excellent "Why is Rosetta 2 fast?" article linked from here), this isn't really emulating x86 on x86-64 on AArch64. It's branching out of long mode on x86-64 and relying on Rosetta to support it. The same techniques are how you run 32-bit code on a 64-bit operating system (on *NIX the kernel typically does this for you; on Windows you launch a 64-bit process that then does all of this in userspace). It makes sense that Rosetta would support this, since existing userspace things require it. In particular, WINE on recent 64-bit Intel macOS (which drops support for a 32-bit userspace environment) ships a Windows-on-Windows layer that works like 64-bit Windows and manages a 32-bit address space for running 32-bit Windows binaries. WINE matters because it's used by a lot of companies that ship games on macOS, and Apple really doesn't want to resurrect the 'games don't work on Macs' meme.
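To make the mode switch concrete, here's a minimal sketch of the same trick on Linux/x86-64 rather than macOS (assumptions: Linux's GDT layout, where selector 0x23 is the 32-bit user code segment and 0x33 the 64-bit one; built with gcc -no-pie so the return address fits in 32 bits; error handling omitted). A far call through a 16:32 pointer drops the core out of long mode into compatibility mode, and the payload's far return brings it back:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

// Hand-assembled 32-bit payload:
//   b8 2a 00 00 00   mov $42, %eax
//   cb               lret            ; far return to the 64-bit caller
static const unsigned char payload[] = {0xb8, 0x2a, 0, 0, 0, 0xcb};

// 16:32 far pointer for the far call; static so it isn't addressed
// relative to the stack we are about to switch away from.
static struct __attribute__((packed)) { uint32_t off; uint16_t sel; } fp;

int main(void) {
    // Code and stack for the 32-bit excursion must sit below 4 GiB,
    // because EIP and ESP are only 32 bits wide in compatibility mode.
    unsigned char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    memcpy(buf, payload, sizeof payload);
    fp.off = (uint32_t)(uintptr_t)buf;
    fp.sel = 0x23;                       // 32-bit user code segment

    int ret;
    asm volatile(
        "mov  %%rsp, %%r11\n\t"          // keep the 64-bit stack pointer
        "lea  4096(%[low]), %%rsp\n\t"   // use the low page as a stack
        "lcalll *%[fp]\n\t"              // pushes CS:EIP, enters 32-bit mode
        "mov  %%r11, %%rsp\n\t"          // back on the real stack
        : "=a"(ret)
        : [fp] "m"(fp), [low] "r"(buf)
        : "r11", "memory", "cc");
    printf("32-bit code returned %d\n", ret);   // prints 42
    return 0;
}
```

The constraints are the ones you'd expect: the 32-bit code, its stack, and the return address all have to live below 4 GiB, which is exactly the bookkeeping a WoW-style layer does for you.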
The linked article was also very interesting about how Rosetta works. In particular, working library-at-a-time is great for both analysis and code locality, but Rosetta doesn't exploit this for analysis: it works (almost entirely) one instruction at a time. Apparently Apple's M-series CPUs have some logic in the front end that lets instructions issued together forward results directly to each other without going via the register rename engine. This is a great bit of hardware-software co-design, because a single x86 instruction translated to 2-3 Arm instructions will still all be dispatched in a single cycle, and the wide issue on these cores means that they can still issue multiple emulated x86 instructions per cycle.
I think you understand correctly. I couldn’t fit the subtleties of emulation vs compatibility mode in a short title. Thinking harder now, maybe something like “emulating x86 via x64 on aarch64” would have been better.
I think that works better. The key thing, for me, is that you aren't translating x86 to x86-64; you're running x86, and Rosetta is translating that to AArch64, and Rosetta supports both decoder modes. I'm really curious about Rosetta here. Does it translate the page assuming that it's x86-64 when you load it and then retranslate it when you jump to it after leaving long mode? Given how well 32-bit x86 games seem to work in Rosetta, I assume it's still doing library-at-a-time compilation at some point for the 32-bit code.
In this project I definitely learned that Rosetta is much fancier than I expected!
Things are even hooked up such that you can run "lldb the/64-bit/binary" and it gives you an ordinary x86-64 debugging experience. The debugger does get confused when it disassembles in compatibility mode (it disassembles as x86-64, not 32-bit), but there's a flag ("dis -A i386") to work around even that.
I mention the debugger because I noticed that the first time I ran a Rosetta program it would take a while to start (presumably pre-translation), and whenever I set a breakpoint in the debugger, including in the 32-bit code, it would also sometimes pause for quite a while (which suggests to me that it sets breakpoints by inserting an int3 into the code and then triggering Rosetta translation again).
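That matches the classic software-breakpoint mechanism. Here's a minimal sketch of the idea, using Linux ptrace for brevity (macOS debuggers do the moral equivalent through the Mach task port, as described below): the debugger overwrites the start of the target instruction with int3 (0xCC) and restores the original byte when the breakpoint is cleared.

```c
#include <sys/ptrace.h>
#include <sys/types.h>

// Plant an int3 at `addr` in an already-attached, stopped tracee and
// return the original word so the caller can restore it later.
long set_breakpoint(pid_t pid, long addr) {
    long orig = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, NULL);
    long patched = (orig & ~0xffL) | 0xcc;          // 0xcc = int3
    ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)patched);
    return orig;
}

// Undo the patch (e.g. before single-stepping over the original
// instruction and re-arming the breakpoint).
void clear_breakpoint(pid_t pid, long addr, long orig) {
    ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)orig);
}
```

Under Rosetta, any such write into already-translated code would presumably invalidate the translation and force a retranslation, which fits the pauses you saw.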
I had previously thought that any x86 emulator had to always do just-in-time translation (see “translating x86” section here https://neugierig.org/software/blog/2023/02/retrowin32-progress.html ) but Rosetta proves it can do a lot of work ahead of time, though it still also must support runtime translation…
It's interesting how cyclical these things are. Windows for Alpha shipped with FX!32, which did ahead-of-time binary translation of x86 code to Alpha. It was often faster than native Intel, but only because the Alpha chips were more than twice as fast as anything Intel sold; it was much slower than native Alpha software. They didn't do a JIT, I believe, because the RAM overheads and latency were too high (the translated binaries were stored on disk and run just like native ones).
Rosetta on x86 Macs (based on the Transitive Technologies emulators[1]) was purely a JIT, driven by the need for fast start-up times and a desire to do some runtime optimisation. This was a few years after HP's Dynamo demonstrated that you could run PA-RISC binaries through a PA-RISC JIT and get a 10-20% speed-up, so the benefits of doing run-time optimisation were fairly clear. VirtualPC for Mac (x86 on PowerPC)[2] got a lot of benefit from doing dynamic checks for flag usage. As I recall, they emitted the versions without flag setting by default and then did a deoptimisation pass to redo them if the flags register was used (this required keeping some values live longer, but emulating the eight-register x86 ISA on a 32-register PowerPC chip made that relatively easy).
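To make the lazy-flags idea concrete, here's a toy sketch of the general technique (my own illustration, not Connectix's actual scheme): arithmetic ops just record their operands and result, and a flag is only materialised when a later instruction actually reads it.

```c
#include <stdbool.h>
#include <stdint.h>

// Operands and result of the most recent flag-setting operation.
typedef struct {
    uint32_t a, b, result;
} lazy_flags;

// Emulated 32-bit add: no flag computation happens here.
static inline uint32_t emu_add32(uint32_t a, uint32_t b, lazy_flags *lf) {
    lf->a = a;
    lf->b = b;
    lf->result = a + b;
    return lf->result;
}

// Materialise the carry flag only when a consumer (jc, adc, ...) needs
// it; most results are never inspected, so this work is usually skipped.
static inline bool read_cf(const lazy_flags *lf) {
    return lf->result < lf->a;   // unsigned wrap means carry out
}
```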
And now Rosetta 2 is here doing ahead-of-time binary translation with very little complex analysis (though at least enough to reconstruct a CFG and understand the ABI in places). The simple translation makes start-up fast (imagine how slow FX!32 was, doing more analysis and running on a chip maybe 10% the speed of a modern M2) and has good cache locality. It also has good branch-predictor behaviour because they don't need to add indirections on jumps for retranslation.
I think a big part of this is that modern AArch64 is designed to support translation from x86. Both Microsoft and Apple have emulator teams that have worked closely with Arm to ensure that x86 instructions can be mapped to small sequences of AArch64 ones (and, in a few cases, that common short sequences of x86 instructions can be mapped to shorter AArch64 sequences). PowerPC was, in theory, designed to be able to run x86 binaries, but I think that was more marketing than engineering: it was mostly that IBM and Motorola engineers thought they’d be able to make chips 2-3 times faster than Intel equivalents and so could just eat the overhead of emulation, rather than anything in the ISA that looked like it would help.
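One concrete example of why the co-design matters: AArch64's flag-setting arithmetic covers most of EFLAGS, and the FEAT_FlagM/FlagM2 extensions add instructions for massaging flags into x86-compatible forms, but x86's parity flag has, as far as I know, no Arm counterpart at all. A translator has to synthesise it in software whenever translated code actually reads it, roughly like this (a hedged C sketch, not Rosetta's actual code):

```c
#include <stdbool.h>
#include <stdint.h>

// x86's PF is set when the low byte of a result has an even number of
// one bits. There is no AArch64 flag for this, so a translator emits a
// short popcount/xor sequence for the (rare) code that reads PF.
static bool x86_parity_flag(uint32_t result) {
    return (__builtin_popcount(result & 0xff) & 1) == 0;
}
```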
The debugger parts are actually not that hard. On macOS, the debugger uses the Mach task port for the debugged process and sends messages to do things like query the register file and read memory. The kernel just needs to redirect these to Rosetta 2, which has all of the state that it needs to provide a view of the program as if it were x86-64.
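A minimal sketch of that path (assuming an x86-64 build target and a caller with debugging entitlements or root, since task_for_pid is heavily restricted):

```c
#include <stdio.h>
#include <sys/types.h>
#include <mach/mach.h>
#include <mach/mach_traps.h>

// Read another process's register file the way a debugger does: get the
// task port, enumerate its threads, query a thread's state. For a
// process running under Rosetta 2, the state that comes back is the
// emulated x86-64 register file, which is why lldb "just works".
int dump_registers(pid_t pid) {
    mach_port_t task;
    if (task_for_pid(mach_task_self(), pid, &task) != KERN_SUCCESS)
        return -1;                       // needs entitlements or root

    thread_act_array_t threads;
    mach_msg_type_number_t nthreads;
    if (task_threads(task, &threads, &nthreads) != KERN_SUCCESS)
        return -1;

    x86_thread_state64_t state;
    mach_msg_type_number_t count = x86_THREAD_STATE64_COUNT;
    if (thread_get_state(threads[0], x86_THREAD_STATE64,
                         (thread_state_t)&state, &count) != KERN_SUCCESS)
        return -1;

    printf("rip=0x%llx rsp=0x%llx\n",
           (unsigned long long)state.__rip,
           (unsigned long long)state.__rsp);
    return 0;
}
```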
[1] Transitive was bought by IBM so that they could offer a SPARC to POWER migration path. The version of OS X after the acquisition dropped Rosetta very abruptly. I strongly suspect that IBM bought Transitive, in part, out of spite over how IBM was portrayed by Steve Jobs during the transition and refused to grant Apple a license to the next version.
[2] Connectix, the makers of VirtualPC, was bought by Microsoft. Their VM product evolved to become Hyper-V (I think the hypervisor bit was a complete rewrite, but a lot of the surrounding infrastructure for emulated devices and so on was preserved). Their emulator ended up in the Xbox 360, running the original Xbox's x86 games on PowerPC, and became the core of the x86-on-Arm emulator in Windows.