The original article, from 2015, makes a better read.
TL;DR: The argument boils down to this: in the pointer-heavy code typical of compilers, the larger pointer size cancels out the advantage of having more registers, because fewer pointers fit in a cache (and the corresponding instruction stream is also larger and less dense, with similar cache effects).
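To make the pointer-density point concrete, here is a rough sketch of the arithmetic. The node layout (three pointers plus a 4-byte tag) and the 64-byte cache line are assumptions picked for illustration, not anything taken from the article or from Visual Studio.

```python
# Assumed cache line size; 64 bytes is typical on x86 but not guaranteed.
CACHE_LINE_BYTES = 64


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def node_size(pointer_bytes):
    """Size of a hypothetical tree node: left, right and parent pointers
    followed by a 4-byte tag.

    Fields are placed in order at their natural alignment and the total is
    padded to the largest field alignment, which is what a C compiler
    typically does.
    """
    fields = [pointer_bytes, pointer_bytes, pointer_bytes, 4]
    offset = 0
    for size in fields:
        offset = align_up(offset, size) + size
    return align_up(offset, max(fields))


def nodes_per_line(pointer_bytes):
    """How many whole nodes fit in one cache line."""
    return CACHE_LINE_BYTES // node_size(pointer_bytes)


if __name__ == "__main__":
    for bits, pointer_bytes in ((32, 4), (64, 8)):
        print(bits, "bit:", node_size(pointer_bytes), "bytes per node,",
              nodes_per_line(pointer_bytes), "nodes per cache line")
```

Under these assumptions the same node doubles in size with 8-byte pointers, so each cache line (and each cache level) holds half as many of them.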
Sounds like Windows needs an x32 ABI.
I speak for all Windows devs when I tell you that the absolute last thing we need right now is another ABI. We already had three on x86 (now four) and are up to two on AMD64, and that assumes you don’t count COM and WinRT as distinct ABIs—but I personally would, bringing the totals to six and four, respectively.
Could you expand on the different APIs and on how they differ? No criticism, just genuine interest.
ABIs, not APIs.
Win32 traditionally had three heavily used ABIs: cdecl, which passes arguments on the stack and assumes caller cleanup; stdcall, which is identical but says the callee cleans up; and fastcall, which is similar to stdcall, but passes the first two arguments (that fit) in, IIRC, ECX and EDX. As a sweeping and unfair generalization, everything uses stdcall, except when it doesn’t, which you’ll find out when your app compiles fine but behaves bizarrely because you missed including a header and the function got declared implicitly, because C.
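A toy model of those three conventions may help show why a mismatch "behaves bizarrely". It assumes every argument is a 4-byte integer, and the function names are made up for this sketch; it only tracks who pops how many bytes, nothing else about the real ABIs.

```python
def x86_call(convention, n_args):
    """Return (register_args, stack_bytes, callee_pops) for n_args 4-byte ints."""
    if convention not in ("cdecl", "stdcall", "fastcall"):
        raise ValueError("unknown convention: " + convention)
    # Only fastcall passes anything in registers: the first two arguments.
    reg_args = ["ECX", "EDX"][:n_args] if convention == "fastcall" else []
    stack_bytes = 4 * (n_args - len(reg_args))
    # cdecl leaves cleanup to the caller; the other two pop in the callee.
    callee_pops = 0 if convention == "cdecl" else stack_bytes
    return reg_args, stack_bytes, callee_pops


def stack_drift(caller_assumes, callee_uses, n_args):
    """Bytes by which the stack pointer ends up wrong after one call.

    Zero means caller and callee agree. Positive means too much was popped,
    negative means arguments were left behind on the stack.
    """
    _, pushed, assumed_callee_pops = x86_call(caller_assumes, n_args)
    caller_pops = pushed - assumed_callee_pops
    _, _, callee_pops = x86_call(callee_uses, n_args)
    return caller_pops + callee_pops - pushed


if __name__ == "__main__":
    print(x86_call("fastcall", 3))
    print(stack_drift("cdecl", "stdcall", 3))
```

In this model, a caller that assumes cdecl while the callee is really stdcall pops the arguments twice, which is the kind of quiet stack corruption the implicit-declaration case produces.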
On AMD64, Microsoft merged all three of these into a single calling convention, which I believe they called just “the calling convention,” and which is similar to fastcall, but uses RCX, RDX, R8, and R9 for the first four arguments that fit, and also reserves XMM0 through XMM3 for any floating-point arguments. They also require 32 bytes of stack space to be reserved by the caller. I forget why Microsoft thought this was a good idea (mandatory reserved spill area for the four integer registers?), but I frequently appreciate it when I’m trying to figure out what the hell happened to the stack in a memory viewer, because it gives me at least some idea where the stack frames are.
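As a sketch of how that assignment works: registers are picked by argument position, so the third argument lands in R8 or XMM2 depending on its type, regardless of what came before it. The function below is a hypothetical model, it only handles plain integer and floating-point arguments, and the stack offsets are relative to RSP at the call instruction, before the return address is pushed.

```python
INT_REGS = ["RCX", "RDX", "R8", "R9"]
FLOAT_REGS = ["XMM0", "XMM1", "XMM2", "XMM3"]
SHADOW_SPACE_BYTES = 32  # four 8-byte slots the caller always reserves


def assign_args(kinds):
    """Map a list of "int"/"float" argument kinds to their locations."""
    locations = []
    for position, kind in enumerate(kinds):
        if kind not in ("int", "float"):
            raise ValueError("unsupported argument kind: " + kind)
        if position < 4:
            # The slot is chosen by position; the kind only picks the bank.
            bank = FLOAT_REGS if kind == "float" else INT_REGS
            locations.append(bank[position])
        else:
            # Later arguments sit just above the shadow space, 8 bytes each.
            offset = SHADOW_SPACE_BYTES + 8 * (position - 4)
            locations.append("stack+" + str(offset))
    return locations


if __name__ == "__main__":
    print(assign_args(["int", "float", "int", "float", "int", "int"]))
```

Note that the shadow space is reserved even when the function takes fewer than four arguments, which is why it is a fixed constant in the model.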
Meanwhile, Microsoft knew that XMM was cool, which is why it’s part of the x64 convention, but then Intel introduced some new hotness, which made the vector units 256 bits wide. Oops. So they added vectorcall on both x86 and AMD64, which is identical to either fastcall or “the x64 convention”, respectively, but it uses the XMM and YMM registers for argument passing. (I am waiting for the inevitable AVX-512 version of this ABI that uses the ZMM registers. Maybe it was already released. If so, I hope it’s called evenmorefasterercall. Or maybe they cross the streams and call it rastercall. I’m flexible.)
So that gets to the original “four” and “two” I mentioned for x86 and AMD64, respectively. But wait, there’s more!
Everything I’ve just listed are C calling conventions, but Windows has had COM for forever. COM took advantage of the fact that all the C++ compilers on Windows used exactly the same vtable layout, so it provides a way to call C++ objects as long as they’re sitting behind a vtable (plus provides some standard ways to acquire the proper vtable). Skipping past some details, this is usually called a thiscall, and looks like stdcall (or “that thing we use on x64”) with the exception that the this pointer is passed in ECX/RCX. Whether this counts as a separate ABI is left to the reader; I’d be inclined to say yes.
And finally that leaves us with WinRT. WinRT is like COM, but now with full-blown objects, complete with subclasses and properties and exceptions and all kinds of stuff. If you squint, WinRT’s ABI looks identical to COM, but the thing is that it also standardizes arguments and returns to support things like throwing C++-esque exceptions across ABI boundaries—something COM definitely cannot do. Thus, while you could (correctly) say that the arguments are mechanically thiscall, and therefore stdcall/“the x64 thing”, the fact that these functions require special setup and teardown makes me mentally classify them as a different ABI.
So there you’ve got it. Six calling conventions on x86, four on AMD64.
Wow, thanks for your explanation! Very interesting!
I’m not aware of anyone actually using the x32 ABI in practice, or of any real support for it. Last time I went looking, it seemed that documentation and distro support/prebuilt binaries were really thin on the ground.
Gentoo supports the x32 ABI. Just select the x32 profile and it works out of the box :)
There were some rumblings that Arch might adopt it, but I don’t know if anything ever came of it.
It was mostly just a jocular reference, though, in that the problems described in TFA are exactly the problems x32 was designed to solve.
As for TFA itself, I’m kinda surprised that it makes a difference. If your project is big enough that cache pressure on a 64-bit arch becomes a problem, I’d think you’d also be running up to the 4GB limit, meaning a 32-bit application isn’t going to work at all.
“It was mostly just a jocular reference”
I was actually hoping you might contradict me with information about support for it that I had just missed. Alas.
“cache pressure on a 64-bit arch becomes a problem”
Always a problem. On current desktop CPUs, L1 is about 64kB, L2 about 256kB and L3 about 8MB, and there are big differences in latency between them. You can write a program which randomly reads and writes to a 16MB array, thrashes every cache level and spends almost all its time in ~100ns memory stalls on roughly every other read. The smaller your data, the fewer and less severe memory stalls you’ll have operating on it.
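A back-of-the-envelope version of that claim, as a sketch: assume accesses are uniformly random and that each cache level serves a share of them proportional to its capacity. The cache sizes and the 100ns memory latency come from the numbers above; the L1/L2/L3 latencies are placeholder assumptions, not measurements.

```python
KB = 1024
MB = 1024 * KB

# (name, capacity in bytes, assumed hit latency in ns)
LEVELS = [("L1", 64 * KB, 1.0), ("L2", 256 * KB, 4.0), ("L3", 8 * MB, 15.0)]
DRAM_NS = 100.0


def served_fractions(array_bytes):
    """Fraction of uniformly random accesses served by each level."""
    fractions = {}
    covered = 0.0
    for name, capacity, _latency in LEVELS:
        # A level can hold at most capacity/array_bytes of the array.
        reach = min(1.0, capacity / array_bytes)
        fractions[name] = max(0.0, reach - covered)
        covered = max(covered, reach)
    fractions["DRAM"] = 1.0 - covered
    return fractions


def average_ns(array_bytes):
    """Expected latency of one random access under this simple model."""
    fractions = served_fractions(array_bytes)
    total = fractions["DRAM"] * DRAM_NS
    for name, _capacity, latency in LEVELS:
        total += fractions[name] * latency
    return total


if __name__ == "__main__":
    for size in (32 * KB, 4 * MB, 16 * MB):
        print(size // KB, "kB array:", round(average_ns(size), 1), "ns per access")
```

With a 16MB array and 8MB of L3, half the accesses go all the way to memory in this model, which matches the "roughly every other read" estimate.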
“If your project is big enough that cache pressure on a 64-bit arch becomes a problem, I’d think you’d also be running up to the 4GB limit”
Eh? There’s a few orders of magnitude between cache size (maybe low-double-digit megabytes of L3) and 4GB…
I was thinking of being able to have all of your source/project files/whatever in memory at once, but giving it a moment’s more thought, it really doesn’t matter, since the OS’s buffer cache can be larger than 4GB and files can be mapped in and out as needed. You’d still need to syscall out to (re)open the files, but that’s relatively quick, and if they’re already in the buffer cache it’s only a tiny bit of overhead.
Maybe the linker for huge binaries with lots of objects or something? Probably a rare enough case to not matter.
So yeah, never mind.
Naively, I’d think that access to the vectorization instructions would outweigh the increased cache miss rate, but I might be misunderstanding Visual Studio’s workload.
Vectorization instructions are mainly useful in array operations, and Visual Studio doesn’t have a lot of that. The larger register set would make more of a difference, but even then it’s not much. In general for pointer-heavy cache-un-local code (VS, and most code people use regularly), optimizations like vectorization and better register utilization can’t help much; performance wins generally come from application-level improvements such as data layout optimization. By sticking to x86 rather than x64 you get a big data layout optimization “for free,” modulo unhappy extension writers. Source: I work on the MSVC optimizer, so “common code we can’t optimize” is a fairly constant subject.
“And you didn’t need the extra memory anyway.” is of course a huge assumption, because I get low memory issues with VS frequently. I have 16 gigs of RAM.
I think they were saying that VS has a bunch of processes. So each process might take less than 4 GB of RAM, but the entire suite of programs could give you problems if you only have 16 GB of RAM.
I know little about Windows, but I’m pretty sure the Roslyn compiler platform is in a separate process from the IDE. Moreover, the IDE process could be 32-bit, while the others could be 64-bit if they have to handle a lot of data.
Multiprocess design is generally better for stability in huge programs like that… think of all the fun and interesting things that could happen with one stray read or write in a big 16 GB heap :)