The author spends a lot of time acting as though it is weird that changing the types of a function’s parameters (or similar) results in code built against the old interface no longer working, which is very confusing to me. They also act like this is C-specific, which it isn’t. Any language that wishes to share across a build boundary needs to ensure that both sides of the boundary agree on what the interface is.
The fact that many modern languages eschew ABI compatibility in favor of every application having copies of every library it depends on, and by default don’t support the basics of being a system library, remains bizarre to me.
It’s also unnecessary: Objective-C has ABI-stable ivars (though that makes ivar access more expensive than a pointer offset, of course), so the implementation and data stored in parent objects don’t break subclasses compiled against a different SDK. Swift supports ABI compatibility even for generic types through witness tables.
Any language that wants to claim to be a “systems” language needs to provide a reasonable ABI stability story if it actually wants to be used for system libraries, etc.
It’s weird, because it’s about the C (and C++) ecosystem, which is very ossified. You can’t change the ABI of anything without breaking someone, and users of a 40-year-old language really dislike anything changing and breaking.
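I hate to do this to you, but C turns 50 this year. I still think that the 70s were 30 years ago so this is kind of a shock for me too.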
There’s also this post by the same author which shows other examples that don’t involve changing the type signature:
In C++, changing a copy constructor from being defaulted to being user-provided can break ABI
Attempting to standardize GCC’s C nested functions would break ABI because it would need to modify how function pointers are represented (one bit would be different)
In C++, adding new virtual functions will modify the layout of the vtable, breaking the ABI for any consumer that derived from the class, added their own functions, and thus assumed the location of their functions inside the vtable.
For the first one, what’s the case where it breaks ABI (obviously it breaks API :D)?
For the second one: womp womp :D More seriously, how would GCC’s nested functions need to change for standardization?
The last one is standard knowledge and well understood by any C++ dev who makes libraries. IIRC it’s why Qt objects are all exposed via wrapper structs that make static calls which internally forward to the polymorphic implementation (a sketch follows below). I’ve often wished that there was an attribute in clang/gcc where you could make a polymorphic object actually be implemented that way automatically. It wouldn’t hide the existence of a vtable pointer, or the general field+inheritance layout problem, but it would certainly reduce some footguns.
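As a rough illustration of that wrapper pattern, here is a minimal pimpl-style sketch with invented names (Widget/WidgetImpl), not Qt’s actual machinery:

```cpp
// The exported type has no virtual functions, so its layout and symbol set
// stay stable; every call forwards to a private polymorphic implementation
// hidden inside the library.
#include <memory>

class WidgetImpl;  // polymorphic implementation, defined only in the .so

class Widget {     // stable, non-polymorphic public wrapper
public:
    Widget();
    ~Widget();
    void draw();   // ordinary exported function; forwards to the impl
private:
    std::unique_ptr<WidgetImpl> impl_;  // only a pointer leaks into the ABI
};

// Library internals: adding virtual functions here later does not change
// Widget's layout or the symbols clients linked against.
class WidgetImpl {
public:
    virtual ~WidgetImpl() = default;
    virtual void draw() { /* actual drawing */ }
};

Widget::Widget() : impl_(new WidgetImpl) {}
Widget::~Widget() = default;
void Widget::draw() { impl_->draw(); }
```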
ObjC gets ABI stability by having a literal hash table from the string name of a method to the method implementation, and it accesses ivars via an indirect load so that the object layout is not a fragile ABI.
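A loose sketch of that ivar indirection, written as plain C++ rather than the actual ObjC runtime (all names and the offset-global mechanism here are simplified illustrations):

```cpp
// Non-fragile ivar access: instead of baking "field lives at offset 8" into
// every client at compile time, the offset lives in a global that the runtime
// fixes up from the class actually loaded, so a superclass can grow without
// recompiling subclasses.
#include <cstddef>
#include <cstdio>

struct FakeObject {
    void *isa;   // stand-in for the class pointer every ObjC object carries
    int   count; // the ivar we want
};

// In the real runtime this is exported per-ivar by the library that owns the
// class and updated at load time; here we just initialize it by hand.
std::ptrdiff_t ivar_offset_count = offsetof(FakeObject, count);

int load_count(void *object) {
    // Indirect load: read the offset variable, then the field. This extra
    // indirection is the cost mentioned above versus a fixed displacement.
    return *reinterpret_cast<int *>(
        static_cast<char *>(object) + ivar_offset_count);
}

int main() {
    FakeObject o{nullptr, 42};
    std::printf("%d\n", load_count(&o));  // prints 42
}
```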
Swift is also ABI stable (because it’s a systems language, and you need your platform ABI to be stable) through similar mechanisms to ObjC when crossing library boundaries, and it even has ABI stability for generic objects through witness tables for protocols - logically, Swift protocols are equivalent to Haskell’s type classes rather than, say, ObjC’s compile-time-only enforcement.
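I forgot my favorite piece of terrible ABI horror: MSVC++ changes the size of member function pointers depending on declaration ordering :D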
For the first one, what’s the case where it breaks ABI
I’m far from an expert, but it seems that some compilers will pass a type with a defaulted copy constructor using only two registers, while for the user-provided-constructor case they will invoke a bit-copy operation (on a different register), so what you get is two libraries that expect their arguments to live in different registers.
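A small example of the two cases (whether the calling convention actually changes depends on the platform ABI; under the Itanium C++ ABI, a type with a non-trivial copy constructor is passed via memory rather than in registers — the structs and `boundary` below are invented for illustration):

```cpp
// Two structs with identical bytes; only the copy constructor differs.
#include <type_traits>

struct Defaulted {
    int a, b;
    // implicit (trivial) copy constructor: typically passed in registers
};

struct UserProvided {
    int a, b;
    UserProvided() = default;
    UserProvided(const UserProvided &o) : a(o.a), b(o.b) {}
    // same behavior, but now non-trivial "for the purposes of calls":
    // typically passed via a temporary in memory instead
};

static_assert(std::is_trivially_copyable<Defaulted>::value,
              "passed in registers on typical ABIs");
static_assert(!std::is_trivially_copyable<UserProvided>::value,
              "passed in memory on typical ABIs");

// A library built while the constructor was still defaulted and one built
// after it became user-provided will disagree about where this argument
// lives -- an ABI break with no change to the struct's size or fields.
void boundary(UserProvided u);  // hypothetical exported function
```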
For the second one: womp womp :D More seriously, how would GCC’s nested functions need to change for standardization?
GCC nested functions are implemented by a trampoline jump with an executable stack. Not all operating systems support executable stacks (as they are commonly an entry point for other exploits): for example, OpenBSD (so GCC had to patch its approach there).
A standard implementation of nested functions would need to use a different approach. The problem is that this would modify the ABI of callers: depending on how nested functions are implemented, they would need to be called in different ways. If you re-use GCC’s syntax, this means that depending on your compiler version, your nested functions would use different implementations and hence cause an ABI break.
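According to OP, it seems that this was forgotten when implementing std::pmr::memory_resource, as it exposes a design with virtual functions: https://en.cppreference.com/w/cpp/memory/memory_resource

Oh, and changing from no constructor to a constructor makes the type non-POD; clang at least has an attribute to deal with that: https://clang.llvm.org/docs/AttributeReference.html#trivial-abi

For what it’s worth, a minimal sketch of that clang attribute (clang-specific and not portable; `Handle` is an invented name):

```cpp
// The attribute asks clang to keep passing the type "as if trivial" (in
// registers) even though it now has user-provided special members.
struct [[clang::trivial_abi]] Handle {
    int fd;
    Handle(const Handle &other) : fd(other.fd) {}
    ~Handle() {}
};
```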
The last one is standard knowledge and well understood by any C++ dev who makes libraries. IIRC it’s why QT objects are all exposed via wrapper structs that make static calls that internally forward to the polymorphic implementation.
I had to revert a change in snmalloc that made realloc of size 0 return NULL because mandoc and sort in the FreeBSD base system both call realloc via a wrapper that kills the program if any realloc call returns NULL. I assumed that if two programs from a fairly small sample set depended on this, changing it would break a lot of the world. I was quite surprised to learn that C23 will make this explicitly UB. Though given how low C11 adoption has been, I’m not convinced that anything WG14 says matters anymore.
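The wrapper in question presumably looks something like this (an invented xrealloc-style sketch, not the actual mandoc/sort code):

```cpp
// If realloc(p, 0) starts returning NULL on success, this wrapper can no
// longer distinguish "freed, nothing to return" from "out of memory" and
// kills the process on a perfectly successful call.
#include <cstdio>
#include <cstdlib>

void *xrealloc(void *ptr, std::size_t size) {
    void *p = std::realloc(ptr, size);
    if (p == nullptr) {
        std::fprintf(stderr, "out of memory\n");
        std::exit(1);  // treats every NULL return as fatal, size 0 included
    }
    return p;
}
```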
The proposed solution seems to be missing the point somewhat. If everything is compiled for either the new ABI or the old one then you don’t need any of these tricks. You can have different copies of the C standard library and link the correct one. It’s 1-2MiB, so having two copies on a system is basically free these days. The problem is when you have two third-party libraries that use different versions of the ABI. These may store intmax_t in a structure, or expose it at their ABI boundaries. Now you need to handle the case that one of them can give you an integer that the other one can’t handle (for example, one library uses intmax_t to store a unique identifier and the other uses it as the key for a hash table: what happens when the first one uses __int128_t and the second uses int64_t for intmax_t?). This becomes a combinatorial problem in the number of libraries that you use.
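To make that failure mode concrete, here is an invented illustration (the library names are hypothetical; `__int128_t` is a GCC/clang extension standing in for a grown intmax_t):

```cpp
// Glue code sitting between two libraries that were built against different
// definitions of intmax_t.
#include <cstdint>

using id_intmax   = __int128_t;    // what "libid" was compiled with (new ABI)
using hash_intmax = std::int64_t;  // what "libhash" was compiled with (old ABI)

// Hypothetical boundary functions of the two libraries.
id_intmax get_unique_id() { return id_intmax(1) << 80; }  // doesn't fit in 64 bits
void hash_insert(hash_intmax key) { (void)key; /* ... */ }

int main() {
    id_intmax id = get_unique_id();
    // Silent narrowing: any id above INT64_MAX is mangled before libhash ever
    // sees it, and this pairwise check is needed for every pair of libraries.
    hash_insert(static_cast<hash_intmax>(id));
}
```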
This is much worse for C++. For example, libc++ made some poor decisions in its implementation of the futex-like behaviour for std::atomic that hard-code, as part of the libc++.so ABI, the assumption that the host OS has a single size for its futex-like system call, and what that size is. I’d like to fix this but if I do then I’ll get different behaviours between different C++ files compiled with different copies of the libc++ headers, such that one waiting will not be awoken by another signalling the same atomic variable. That’s unacceptable breakage and there’s no way of avoiding it without requiring that everything be recompiled.
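The pairing at stake is the C++20 wait/notify API; a trivial sketch of the pattern (both sides must resolve to the same underlying futex-or-fallback mechanism, or the wake-up never arrives):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> flag{0};

int main() {
    std::thread waiter([] {
        flag.wait(0);   // blocks while flag still holds 0
    });
    flag.store(1);
    flag.notify_one();  // must hit the *same* futex/fallback path as wait(),
                        // or the waiter above sleeps forever
    waiter.join();
}
```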
In general, I think assuming that everything can be recompiled is an increasingly safe assumption. For supply-chain security, an increasing number of things are now doing their release builds in a secure pristine VM that builds all dependencies (typically using containers to cache things as appropriate). App Stores often have freshness requirements and can kick things out if they haven’t been compiled sufficiently recently. Open source operating systems package repos typically rebuild everything together to produce atomic package sets (a single machine can compile all 30,000 packages in the FreeBSD repo in around 24 hours). This makes ABI breakage a lot cheaper than it was 15 years ago, when compiling a single package might have been an overnight job for a decent desktop.
In general, I think assuming that everything can be recompiled is an increasingly safe assumption.
For open-source software, I agree.
For closed-source software – the hospital charting program that only runs on Windows XP, the mine mapping and modeling software that costs six figures and produces 10% of the world’s platinum, etc, it seems like it would be a compatibility break on par with switching to a new CPU architecture. Fortunately, Apple has switched to a new CPU architecture like 3 times already and demonstrably kept old software alive via emulation, so… Seems possible maybe?
In general, I think assuming that everything can be recompiled is an increasingly safe assumption.
For open-source software, I agree.
Gosh, I don’t. Today we have distributions issuing security updates all the time. Those updates don’t replace all of usermode because they maintain ABI compatibility. The issue isn’t the cost of compilation, it’s that installing any update anywhere would require a machine to download gigabytes of updates and take significant downtime to install.
Somehow we ended up in a strange place where a lot of developers believe that ABI compatibility is an obsolete concept, proceed to make changes, and distributions have to go along afterwards and backport changes, putting that ABI compatibility right back.
It’s also odd to me that people in open source have a strange mental block that assumes the user-to-kernel boundary must be ABI compatible, while usermode libraries need not be. From a technical point of view, the two are identical; the difference today is the personalities involved. Saying that app stores allow an entire app to be rebuilt, including dependency libraries, ignores that they don’t require the OS to be rebuilt, which means there’s some ABI layer underneath that app which must be compatible. From there, there’s some language and linkage capable of supporting compatibility.
The real question is whether the group of people who maintain ABI compatibility are a small group of elites or whether it’s a broader value among the community.
…it’s that installing any update anywhere would require a machine to download gigabytes of updates and take significant downtime to install.
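All good points!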
The existence of Nix and Alpine Linux are possible counter-examples, but I don’t know enough about them to know how well they work in practice. Anyone about with more experience than me?
The real question is whether the group of people who maintain ABI compatibility are a small group of elites or whether it’s a broader value among the community.
…My cynical experience here is that it will always be a small group of elites who Actually Make Things Work, compared to the broader community that Just Want Things To Work. That’s a human problem though, not a technical one. Totally not influenced by discovering today that a key dependency I have to use for work made an obviously breaking change in their last patchlevel release.
The existence of Nix and Alpine Linux are possible counter-examples, but I don’t know enough about them to know how well they work in practice. Anyone about with more experience than me?
Nix is actually a pretty good example rather than counter-example; any changes to widely used stdenv packages (glibc, libgcc, etc.) effectively require a full world rebuild, because all build inputs must match versions exactly.
On the other hand, Alpine is not a counter-example, because they actually dynamically link musl in order to avoid world-rebuilds. Breaking the ABI of musl would still introduce a world-rebuild scenario.
I’m not sure these are either examples or counter-examples. It’s clearly possible to replace all of usermode. The question is whether it can be done without causing pain for users, in the form of download size or downtime.
To me the best case for this working is cloud-native containers: there’s a redundant fleet of images, so it’s possible to build a new version of a container that modifies every user package, and start rolling it out. End users will experience no downtime, since the previous version is still in operation until the rollout is complete.
A single-node desktop is the opposite, since there’s no redundancy. The desktop I’m writing this from has a sizeable collection of Steam games. Whether source is available or binaries are downloadable for every conceivable ABI, the impact of a system upgrade would be huge - tens to hundreds of GB. The more software you have, the more impactful an ABI break becomes, regardless of source availability.
…which also tends to raise the question about whether the design of open source systems is now driven by server side cloud workloads, and the concerns of end user devices are no longer a major factor.
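I was specifically discussing this point:
These are clearly not solved issues in Nix or Alpine.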