Most of the rant is about the ABI of dynamic libraries: how ABI is OS/arch-dependent and how difficult it is to get ABI right. Why OP is confusing ABI with actual standard C is unclear. If one is going to go around any language X and interface directly with the binary generated by language X’s compiler, one needs to understand the actual binary.
I don’t think OP “is confusing ABI with the actual standard C”. I think OP understands quite well the difference between C the standardized language and C the ad-hoc ABIs of the popular toolchains. I think OP is ranting because the latter is effectively the only interoperability mechanism available for non-C languages to speak to each other, and thus brings a lot of C-oriented baggage into situations where it’s neither needed nor wanted, coupling languages which aren’t C to particular concepts from C.
Windows has a well defined language interop layer: the widely derided COM ABI :D
It goes further than that. Win32 evolved from win16, which was created at a time when it was unclear whether C or Pascal would win as an application programming language and so all of the types for the APIs are defined as fixed-width things that can be mapped to an IDL or, at least, be defined in multiple languages. These types differentiate things like buffers and null-terminated strings, for example. More recently, SAL annotations add length and ownership information for pointers that allow them to be extracted.
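To make that concrete, here’s the shape of a Win32-flavored prototype with SAL annotations (the function and parameter names are invented for illustration); the fixed-width typedefs and the annotations carry direction and buffer-length information that a bare C prototype would not:

```c
#include <windows.h>  // BOOL, HANDLE, DWORD, LPVOID: fixed-width, language-neutral typedefs
#include <sal.h>      // SAL annotation macros (_In_, _Out_, ...)

// Hypothetical API. _Out_writes_bytes_(cbBuffer) ties the output buffer's
// size to another parameter, so a binding generator can recover the length
// relationship mechanically instead of guessing from naming conventions.
BOOL ReadGadget(
    _In_ HANDLE device,                          // input: opaque handle
    _Out_writes_bytes_(cbBuffer) LPVOID buffer,  // output buffer of cbBuffer bytes
    _In_ DWORD cbBuffer,                         // buffer size in bytes
    _Out_ LPDWORD bytesRead);                    // output: bytes actually written
```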
Apple also has a thing called BridgeSupport that generates property lists for all of its system libraries, which include more metadata than a standard C function declaration carries.
In FreeBSD, the syscall ABI is actually defined in a C-like IDL with SAL annotations and then the C wrappers and userspace assembly stubs are generated from this. I’d love to see more libraries follow a similar approach and generate C headers from a more language-agnostic IDL.
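For a flavor of that, an entry in FreeBSD’s sys/kern/syscalls.master looks roughly like the following (quoted from memory, so details may be off): an annotated C prototype from which the userspace stub and the C wrapper are generated.

```c
/* Approximate shape of a syscalls.master entry, not verbatim: */
3	AUE_NULL	STD {
		ssize_t read(
		    int fd,
		    _Out_writes_bytes_(nbyte) void *buf,
		    size_t nbyte
		);
	}
```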
One problem I found while working with a somewhat similar IDL derived from a C implementation (the XML files that define the X11 protocol) was that it still carried a lot of C baggage. A lot of information that would be useful for generating bindings for higher-level languages was only informally specified: it could usually be derived with heuristics, but required special-casing in some situations.
For example, variable-sized strings and buffers usually have a corresponding length field, marked as an attribute of the string field, sometimes via an arithmetic expression that has to be manually inverted if you don’t want callers to specify the length explicitly in the API. And there’s the concept of “switches”, which are sometimes informally-specified discriminated unions whose discriminant may be derived from multiple other fields in the containing struct, or even a parent struct, and other times encode the presence of optional fields through a bitmask.
Basically, even a homegrown, nominally language-agnostic IDL is still very much tainted by C, and a significant amount of work has to be done on top of it to make it palatable to higher-level languages.
I think it’d be extremely hard to define an IDL that allows enough expressiveness to work around the edge cases of some of the more spiky APIs while also providing enough information to allow generating somewhat idiomatic code in different languages that use different paradigms.
Even if idiomatic generated bindings are not the priority, you still need enough expressiveness to encode all of C’s type system, and all the quirks the OP describes in their post, in a format general enough to generate headers for a lot of C libraries. Every C project defines its own soup of aliases for basic types and custom attributes; some, like GObject, even define their own type system on top of C’s; and most use compiler-specific directives and piles of macros. And what about inline functions? Header-only libraries? Even if such a perfect IDL did exist, I doubt you’d be able to convince many people to adopt it for existing libraries and APIs.
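To picture just the alias layer, the typedef soup looks something like this (real-world-style aliases, quoted from memory and approximate):

```c
/* Every project re-spells the basic types its own way: */
typedef int           gboolean;   /* GLib-style boolean              */
typedef unsigned int  guint;      /* GLib-style unsigned             */
typedef unsigned long DWORD;      /* Windows-style 32-bit (on ILP32) */
typedef float         GLfloat;    /* OpenGL-style float              */
```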
Sorry if I’m being too negative. Perfect is the enemy of good and maybe there is an 80% solution to be reached, I just don’t think it’d be easy.
I agree it’s very hard - if it were easy, someone would have done it already. I think it helped things like the COM IDL that they were designed from the start to support non-C languages. I also think that aiming to support all of C is the wrong approach: you should aim to support enough that C libraries can define efficient public interfaces in terms of it (and so can other languages). As the article says, nothing short of a full C compiler gives full C interop (for both C and C++, the code I’ve written for Verona’s interop layer uses all of clang to generate LLVM IR functions with a simple calling convention that call functions / methods and set / get struct fields, which can then be inlined into the Verona code, picking up all of the excitement of the C/C++ type system). That works for interop with C libraries, but what I (and the author of the blog) want is interop with non-C libraries without going via C as the lowest-common-denominator interface.
In other words it sounds like a lot of the complaining here is about how Linux describes its ABI, rather than every other platform :D
Great idea. Maybe someone should do a cross-platform variation. They could call it XPCOM (more here)
XPCOM always struck me as an odd name. An XPCOM component isn’t cross-platform (it’s compiled for a single platform) and is no more cross-platform than COM (which has been implemented for multiple platforms and in multiple languages).
Not to be confused with XCOM :D
There is no “C the ad-hoc ABI”. An ABI is an ABI, which is, by construction, unrelated to any programming language, and depends on OS and arch. Yes, C toolchains are the most ubiquitous. But there are still Fortran, Pascal, and a plethora of Windows conventions. It’s a mess. But so is making a syscall on different OSes.
There is no “C the ad-hoc ABI”. An ABI is an ABI, which is, by construction, unrelated to any programming language,
The argument being made, in its purest form, appears to me to be:
When binaries of Language 1 and Language 2, neither of which is C, need to communicate with each other, one of, if not the, simplest and most reliable ways to accomplish this is to hook both languages into an existing C compiler’s toolchain to take advantage of that toolchain’s ABI, or otherwise to emulate the ABI of an existing popular C toolchain.
This is the case because there is no portable cross-language interface, other than “every language sooner or later has a use case for C FFI support, which also gets you FFI to every other language that has C FFI”.
And I believe the author wants to register their distaste for this and for the consequences it wreaks on languages, on toolchains, and on the resulting executables.
Nitpicking about the definition of “C” or whether there is or is not a formally-specified ABI is fundamentally not relevant to this argument.
On top of that OP seems to be looking at “non-standard” C, specifically __int128 and how intmax_t isn’t 128 bit. I guess the OP got hurt by this somehow in a real program.
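Concretely, the mismatch looks like this (a minimal sketch, assuming x86-64 Linux with GCC or Clang):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // intmax_t is nominally "the widest integer type", but widening it to
    // 128 bits would break the ABI of every function and printf format that
    // already traffics in it, so it stays 64-bit on x86-64 Linux...
    printf("sizeof(intmax_t) = %zu\n", sizeof(intmax_t));  // prints 8
#ifdef __SIZEOF_INT128__
    // ...even though the compiler extension __int128 is wider than it.
    printf("sizeof(__int128) = %zu\n", sizeof(__int128));  // prints 16
#endif
    return 0;
}
```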
I wonder what the ratio is of C programs that really only use standard conforming C without UB to the ones that don’t. Furthermore, how many of those programs are unwittingly relying on compiler-specific behavior that would never be revealed until another compiler without that behavior is used? Would make for interesting data.
If I understand correctly, declaring two externally visible identifiers that have the same first 6 characters is undefined behavior. So probably not many.
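A sketch of what that rule implies (hypothetical names):

```c
/* C89 only guarantees 6 significant characters in external names (and case
 * may be ignored), so these two may silently end up as the same symbol: */
extern int create_widget(void);  /* a minimal linker may see "create"...    */
extern int create_window(void);  /* ...for this one too: undefined behavior */
```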
C89 had the six character limit for external names. C99 raised that limit to 31.
Wow, what a generous limit! </s>
I wonder why they decided to stick with a ridiculous limit like that. Are there any non-toy C compilers that are that constrained in their identifier lengths in practice, anyway?
A lot of the decisions in C89 were an attempt not to exclude any C compiler of the time (the late 80s) from implementing the standard. Given that there were sign-magnitude and ones’-complement machines that could support standard C, it wouldn’t surprise me if some older systems in the late 80s only supported 6 significant characters in external identifiers.
I get that, but I was wondering more about C99; anything that was modern at the time and willing to implement C99 over the years that followed wouldn’t be so constrained, I would guess.
Well, the 31 character limit (and even the earlier 6 character limit) was the minimum a compiler had to support. Compilers could (and often did) exceed said limits, but if you wanted maximum portability, you had to be aware of it.
In my 30 years of C programming, I had only one job where management took the limit seriously (in the mid-90s, so it was the 6 character limit of C89), and even then, it was a silly thing to do due to the platforms we ran on (MS-DOS, Windows and several Unix variants excluding Linux [1]).
[1] Funny, because the Unix development was primarily done on Linux, but management knew they couldn’t charge 4 figures for the software on Linux, though they could for the other Unix variants.
Correct - a bug was reported to rustc, which treats u128 as a fundamental type that needs to work on all platforms.
FWIW, I wrote about this back in March: https://www.theregister.com/2022/03/23/c_not_a_language/
Repeat/Duplicate? https://lobste.rs/s/w9sotc/c_isn_t_programming_language_anymore
Sure looks like it, the URL in the older submission now redirects to the one here. That’s why the dupe detector didn’t catch it.
Reposts after ~6 months are ok however. I personally wouldn’t have submitted it if it had been detected though.
Yes, I am genuinely asserting that parsing C is basically impossible.
No, but you need to know something about parsing and lexing; parsing C is no harder (even easier) than many other languages. And it can be done even without LLVM/Clang libraries. Here is e.g. a tool I use to generate Oberon+ FFI import libraries from C headers: https://github.com/rochus-keller/C2OBX. I first used chibicc which is a fine small C compiler, but finally I implemented my own parser version because my transpiler required more information about the source code than required by the C compiler. Oberon+ implements a C FFI for all backends (CLR, LuaJIT and C) and it was not that difficult to implement this (even exceptions via jmp_buf are supported).
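As an aside on the exceptions point, here’s a minimal sketch of the setjmp/longjmp pattern such an FFI layer can build on (illustrative only, not actual code from C2OBX):

```c
#include <setjmp.h>
#include <stdio.h>

static jmp_buf handler;

static void might_fail(int x) {
    if (x < 0)
        longjmp(handler, 1);    // "throw": unwind back to the setjmp point
}

int main(void) {
    if (setjmp(handler) == 0) { // "try": setjmp returns 0 on the direct call
        might_fail(-1);
        puts("no exception");
    } else {                    // "catch": reached via longjmp's return value
        puts("caught exception");
    }
    return 0;
}
```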
Yes and no. If, when you say C, you mean the language specified by WG14, then you are correct. If, when you say C, you mean the language used by people who self-identify as C programmers, then you are wrong. C headers contain compiler-specific attributes and pragmas, for example, which affect how parameters are passed or structures are laid out in a per-platform and per-architecture manner. Any tool that is able to parse these and generate code that interoperates with them is at least 90% of a C compiler.
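For instance, a generator pointed at real headers has to digest decoration along these lines (a made-up declaration, but representative of what each toolchain emits):

```c
/* Hypothetical export as it might appear per platform: */
#if defined(_MSC_VER)
#include <sal.h>
__declspec(dllimport) int __stdcall frobnicate(_In_z_ const char *name);
#elif defined(__GNUC__)
__attribute__((visibility("default"), nonnull(1)))
int frobnicate(const char *name);
#endif
```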
Parsing 80-90% of C headers is easy, but somehow I always want to depend on a library that’s in the remaining 10-20%.
Not only that, but ABI can vary depending on compiler flags (e.g. availability of AVX), so even if you parse 100% of C correctly, you still may not fully know the ABI.
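A concrete case of flags changing the ABI (sketch; assumes x86-64 with GCC or Clang):

```c
#include <immintrin.h>

// Compile this translation unit with -mavx: the __m256 arguments arrive
// in YMM registers. A caller built *without* -mavx sees the identical
// prototype, but the compiler warns ("AVX vector argument without AVX
// enabled changes the ABI") and passes the values through memory - the
// header alone doesn't tell you which convention you'll get.
__m256 sum8(__m256 a, __m256 b) {
    return _mm256_add_ps(a, b);
}
```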
Headers that declare the structs and typedefs relevant to an ABI may vary depending on which macros are defined, and the macros can be defined by an arbitrary Turing-complete build system that may poke at the environment to decide which library features to enable. So you don’t just need a C parser; you need to run arbitrary software on a full OS.
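A minimal sketch of that pattern (the config macro and typedef are invented for illustration):

```c
/* Identical header text, different ABI, depending on what the build
 * system wrote into config.h after probing the environment: */
#include "config.h"   /* generated by autoconf/CMake at build time */

#if HAVE_WIDE_OFFSETS
typedef long long lib_offset_t;  /* one struct layout...                */
#else
typedef long lib_offset_t;       /* ...or another, from the same source */
#endif

struct lib_span {
    lib_offset_t begin;
    lib_offset_t end;
};
```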
C headers contain compiler-specific attributes and pragmas
If they are specified, it is evidently also possible to implement a parser for them; but the argument was about the C programming language, and that is the language I referred to. Compiler-specific extensions are by definition not part of the language, unless specified in the standard.
but somehow I always want to depend on a library that’s in the remaining 10-20%.
Maybe you can give specific examples of things which are “basically impossible to parse”, as the author suggests.
They’re not part of the language, yes, but they are part of the platform. If you want to parse a C header that includes string.h on Windows, you need to parse MSVC-specific attributes. If you want to do it on macOS, that’s Apple clang attributes (note: distinct from the clang attributes on Linux). If you want to do it on Linux, you need to parse whatever intersection of gcc/clang attributes the distribution thought was reasonable to rely on.
If you want an example of the stuff that’s impossible to parse, all you need to do is fully expand <stdlib.h> on all three platforms.
all you need to do is fully expand <stdlib.h> on all three platforms
Thanks. But these libraries belong to the specific compiler, and I can still parse them with a standard-conforming parser when I don’t add the compiler-specific defines, but treat them as standard-conformant. And even if the compiler-specific defines are enabled, I don’t see why it should be “basically impossible” to parse those syntax constructs. It is a different case with e.g. C++ or SystemVerilog, where the syntactic ambiguity is overwhelming.
Yep, this sounds entirely correct.
At least it’s not C++. Try to bind to Qt sometime.
I wonder if the article title’s a homage to “A Few Billion Lines of Code Later”, from https://cacm.acm.org/magazines/2010/2/69354-a-few-billion-lines-of-code-later/