These things are a lot more painful on Linux than on most *NIX systems because Linux doesn’t have a cohesive notion of a userspace ABI. On FreeBSD, for example (I believe these abstractions predate 386BSD, so should be common on the other *BSDs), there is an ABI associated with each process, which includes:
Which system call table it uses.
What its signal frame looks like.
What its initial process execution environment looks like.
How to parse and map its file format.
What path substitution happens for finding compat libraries.
32-bit compat, Linux compat, a.out binary support, Capsicum mode, and so on are all just instances of this. Linux doesn’t have a similar abstraction. Worse, Linux makes the system call table a per-architecture thing, which makes things like Capsicum (which disables a load of system calls as soon as you enter capability mode) very hard to implement, along with anything that needs a foreign system call interface (a FreeBSD syscall compat layer on Linux would be very hard to add, for example).
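To make the per-process ABI idea concrete, here is a toy model in Python rather than kernel C. Every name in it (`Abi`, `Process`, `syscall`, `resolve_library`, the syscall numbers, the `/compat/foreign` prefix) is made up for illustration; FreeBSD's real structure (`struct sysentvec`, as far as I know) carries far more than this, and the real path lookup also falls back to the normal path, which is omitted here.

```python
import errno
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Abi:
    """Hypothetical per-process ABI record, loosely following the list above."""
    name: str
    syscalls: Dict[int, Callable[["Process"], int]]  # which syscall table it uses
    signal_frame_size: int   # stand-in for "what its signal frame looks like"
    image_magic: bytes       # stand-in for "how to parse its file format"
    compat_root: str         # prefix substituted when looking for compat libraries


@dataclass
class Process:
    pid: int
    abi: Abi


def getpid(proc: Process) -> int:
    return proc.pid


# Two ABIs exposing the same service under different (invented) numbers.
NATIVE = Abi("native", {20: getpid}, 512, b"\x7fELF", "")
FOREIGN = Abi("foreign", {39: getpid}, 256, b"\x7fELF", "/compat/foreign")


def syscall(proc: Process, number: int) -> int:
    """Dispatch through whichever table the process's ABI provides."""
    handler = proc.abi.syscalls.get(number)
    if handler is None:
        return -errno.ENOSYS
    return handler(proc)


def resolve_library(proc: Process, path: str) -> str:
    """Apply the ABI's path substitution (no fallback in this sketch)."""
    return proc.abi.compat_root + path
```

The point of the sketch is that nothing in `syscall` or `resolve_library` knows which ABI it is dealing with; adding another ABI means adding another `Abi` value, not touching the callers.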
This is a problem for CHERI support on Linux, because Linux decided to make a load of system calls take uint64_t instead of pointers and so needs to have a different version for CHERI mode, where they’d take a 128-bit tagged pointer (64-bit address + metadata).
Matthew Wilcox, instead, suggested that the personality() mechanism could be extended to support a Windows personality. This, essentially, would create a new system-call entry point that would emulate the Windows calls. Gofman replied that this approach had been considered, but that the cost of executing the personality() call on each transition between Linux and Windows code would be too high. A possible solution here is to implement a special personality that looks at a flag, stored in user-space memory, to determine how system calls should be handled. Gofman offered to create a Wine patch using such a mechanism if an implementation existed; Krisman said that he would give it a try.
The personality mechanism in Linux is just a set of flags that change some aspect of the behaviour, and it requires everything that depends on a flag to explicitly check the relevant bit. In contrast, FreeBSD has multiple different syscall arrays and dispatches to them, multiple ways of generating a signal frame, and so on, all defined from a single structure that provides the implementations.
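The difference between the two styles can be shown with a small Python contrast. This is only an illustration of the shape of the code: the flag names, version strings, and the `AbiOps` structure are all invented and do not correspond to real kernel symbols.

```python
from dataclasses import dataclass
from typing import Callable

# Style 1: a bitmask. Every code path that cares must test its own bit.
FLAG_LEGACY_VERSION = 0x1  # hypothetical flag
FLAG_KEEP_TIMEOUT = 0x2    # hypothetical flag


def report_version(flags: int) -> str:
    if flags & FLAG_LEGACY_VERSION:
        return "2.6"
    return "6.1"


def timeout_after_wait(flags: int, timeout: int, waited: int) -> int:
    if flags & FLAG_KEEP_TIMEOUT:
        return timeout
    return max(0, timeout - waited)


# Style 2: one structure per ABI supplies the implementations,
# and callers dispatch through it without checking anything.
@dataclass(frozen=True)
class AbiOps:
    report_version: Callable[[], str]
    timeout_after_wait: Callable[[int, int], int]


MODERN = AbiOps(lambda: "6.1", lambda timeout, waited: max(0, timeout - waited))
LEGACY = AbiOps(lambda: "2.6", lambda timeout, waited: timeout)
```

In the first style, a new behaviour means finding every function that should check the new bit. In the second, it means writing one more `AbiOps` value.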
Every syscall in FreeBSD is a sys_foo function, which then typically forwards to a kern_foo function that does the real work. If there is a 32-bit compat version then it will have a different sys_ function that will then do the required ABI adaptation and forward to the same kern_ version. The Linux compat interfaces in FreeBSD are implemented in the same way: The system call dispatches to the relevant function in the Linux syscall table, which jumps to the Linux implementation of the sys_ function (which might be the same as the FreeBSD one for some common POSIX things), which then forwards to a kern_ function.
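A toy version of that layering, again in Python instead of kernel C. The `foo` and `bar` syscalls, their argument layouts, and the table numbers are hypothetical; the only thing the sketch tries to show is that each ABI has its own thin `sys_` wrapper doing argument adaptation, and they all end up in one `kern_` function.

```python
import struct


def kern_foo(seconds: int, microseconds: int) -> int:
    """Shared implementation: plain values in, no knowledge of any ABI layout."""
    return seconds * 1_000_000 + microseconds


def sys_foo(arg: bytes) -> int:
    """Native entry point: two 64-bit little-endian fields."""
    seconds, microseconds = struct.unpack("<qq", arg)
    return kern_foo(seconds, microseconds)


def compat32_sys_foo(arg: bytes) -> int:
    """32-bit compat entry point: same call, two 32-bit fields."""
    seconds, microseconds = struct.unpack("<ii", arg)
    return kern_foo(seconds, microseconds)


def linux_sys_foo(arg: bytes) -> int:
    """Foreign-ABI entry point: assume this flavour passes nanoseconds."""
    seconds, nanoseconds = struct.unpack("<qq", arg)
    return kern_foo(seconds, nanoseconds // 1000)


def sys_bar(arg: bytes) -> int:
    """A call whose semantics are identical everywhere, so it is shared as-is."""
    return len(arg)


# One table per ABI; the numbers differ, the kern_ layer does not.
NATIVE_TABLE = {1: sys_foo, 2: sys_bar}
COMPAT32_TABLE = {1: compat32_sys_foo, 2: sys_bar}
LINUX_TABLE = {96: linux_sys_foo, 20: sys_bar}


def dispatch(table, number: int, arg: bytes) -> int:
    return table[number](arg)
```

Note that `LINUX_TABLE[20]` is literally the same function object as `NATIVE_TABLE[2]`, which mirrors the remark that some common POSIX calls need no separate Linux implementation.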
The WINE case is a bit special here because the WINE process has some native bits and some Windows bits. If they want to allow the Windows bits to do direct system calls then they need to have a different system call table depending on the page containing the syscall instruction. As I recall, Linux added a mechanism to disallow syscalls except from marked pages and deliver a signal instead, which allows these to be trapped and emulated (and the syscall instruction to be patched and turned into a jump to a function that does the emulated syscall).
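The trap-then-patch flow described above can be modelled in a few lines of Python. This does not use the real kernel interface (which I believe ended up as Syscall User Dispatch, configured through `prctl`); the class, the page numbers, and the return strings are all invented, and "patching" is just a set of addresses here.

```python
PAGE_SIZE = 4096


def native_syscall(number: int) -> str:
    return f"host kernel handled syscall {number}"


def emulated_syscall(number: int) -> str:
    return f"user-space emulation handled NT syscall {number}"


class MixedProcess:
    """Model of a process that contains both native and Windows code pages."""

    def __init__(self, native_pages):
        self.native_pages = set(native_pages)  # pages allowed to enter the kernel
        self.patched_sites = set()             # call sites rewritten into jumps
        self.traps = 0                         # signals delivered so far

    def execute_syscall_at(self, pc: int, number: int) -> str:
        if pc in self.patched_sites:
            # Already rewritten: goes straight to the emulation, no trap.
            return emulated_syscall(number)
        if pc // PAGE_SIZE in self.native_pages:
            # Syscall instruction lives on a marked page: let it through.
            return native_syscall(number)
        # Unmarked page: the kernel refuses and delivers a signal. The handler
        # emulates the call and patches the site so it will not trap again.
        self.traps += 1
        self.patched_sites.add(pc)
        return emulated_syscall(number)
```

For example, with `MixedProcess(native_pages=[1])`, a syscall at address `0x1010` goes to the host kernel, while the first syscall at `0x5000` costs one trap and later ones from the same address cost none.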
The “not an emulator” thing is largely marketing. People conflate “emulator” with “CPU emulator” and assume that emulators are slow, but an emulator is just a program that makes one environment appear to be another. Making POSIX appear to be Win32 is emulation; the translation is generally fast because most bits of Win32 are higher-level abstractions, so implementing them on POSIX instead of NT does not add a lot of overhead (in some cases, it’s faster).
Great article aside from that.
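As a small illustration of what “making POSIX appear to be Win32” means in practice, here is a Python sketch of a Win32-flavoured directory-creation call built on the POSIX-style `os.mkdir`. The function names `create_directory` and `get_last_error` are hypothetical, this is not how Wine implements anything, and the fallback to `ERROR_ACCESS_DENIED` for unmapped errors is an arbitrary choice made for the example. The three error code values are the documented Win32 ones.

```python
import errno
import os

# Win32 system error codes used in this sketch.
ERROR_SUCCESS = 0
ERROR_PATH_NOT_FOUND = 3
ERROR_ACCESS_DENIED = 5
ERROR_ALREADY_EXISTS = 183

# Translation from host errno values to the foreign error vocabulary.
_ERRNO_TO_WIN32 = {
    errno.ENOENT: ERROR_PATH_NOT_FOUND,
    errno.EEXIST: ERROR_ALREADY_EXISTS,
    errno.EACCES: ERROR_ACCESS_DENIED,
}

_last_error = ERROR_SUCCESS


def get_last_error() -> int:
    """Return the error recorded by the most recent failing call."""
    return _last_error


def create_directory(path: str) -> bool:
    """Win32-style contract: True on success, False plus a last-error code."""
    global _last_error
    try:
        os.mkdir(path)
    except OSError as exc:
        # Arbitrary fallback for errno values the table does not cover.
        _last_error = _ERRNO_TO_WIN32.get(exc.errno, ERROR_ACCESS_DENIED)
        return False
    return True
```

The whole “emulation” here is one host call and a dictionary lookup on the error path, which is the sense in which this kind of translation adds little overhead.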
This is absolutely wonderful! Thank you for writing this! Before I knew how to program, but knew a little bit about dlls I wondered if it was possible to intercept the calls to them. This article confirms that you can and it’s how software I know works. Absolutely magical!
Ah, this is great. That answers a lot of questions I had about these foreign binary runners. Thank you for writing this.
I would love to have a similar article for User-Mode Linux, gVisor & Co.
Linux has a “personality” system that allows the kernel to interpret system calls differently.
https://man7.org/linux/man-pages/man2/personality.2.html
This was perhaps too slow for WINE to use:
https://lwn.net/Articles/824380/
(I don’t know how they ended up supporting Windows syscalls in the end.)
I wonder how fast the FreeBSD version is and if there are significant differences.