1. 46

A compiler I’ve been working on, now at the point where I think it might actually be useful.

  1.  

  2. 4

    It’s very impressive that it can build GCC, it’s about the same size as 8cc (if we just count the cc code, not the additional code from the qbe backend) but far more capable. It shows that the task of compiling C code can be implemented in a modular way (separating frontend and backend) in an approachable codebase. gcc and clang are roughly 100x bigger in scale (lines of code) but they do include very powerful linkers and extra programming languages/frontends.

    Can it build any libc?

    I’m not sure how to use it to compile my own stuff though, I get an error that “random” function is unknown but my code includes the header.

    1. 6

      Can it build any libc?

      Not yet. The main missing pieces for building musl are inline assembly, volatile, and long double, all of which require some new features in QBE. I’m hoping to work with the author to implement them. Building musl is definitely on the roadmap.

      I’m not sure how to use it to compile my own stuff though, I get an error that “random” function is unknown but my code includes the header.

      By default, on glibc targets we pass -D __STRICT_ANSI__ to cpp, since otherwise glibc headers will use statement expressions, which are not yet supported. This has the side-effect of disabling the default feature-test macros defined by the preprocessor. My guess is you haven’t defined the right feature-test macros to get your libc to expose random, and are relying on your compiler defaults. I think you’d hit the same error if you built with CC='gcc -std=c99'. Try adding -D _DEFAULT_SOURCE to your CFLAGS.

      However, now that I think about it, this was because we used to define __GNUC__=1 which ended up causing more problems then it solved and has since been reverted. I think it should be okay to also remove -std=c11 and -D __STRICT_ANSI__ from config.h to get the behavior you expect by default. I’ll look into it.

      Thanks for trying it out!

    2. 6

      Amazing work - everyone remember to star the github mirror

      https://github.com/michaelforney/cc :D

      This compiler actually builds lots of real software - we can dethrone C++ based compilers like gcc and clang again :D

      1. 2

        Nice work. If you get around to it, “restrict” is a great C feature that gcc/clang have ignored in favor of dangerous “optimizations”. Can “while(n– >0) *x = *y” be optimized by using vector registers or similar? Traditional C says, no because we don’t know if for example y = x +1 or something - if the pointers overlap. Modern GCC/Clang say “Oh boy, overlapping pointers is undefined, so we can just assume they do not” and compile to something that fails if they do”. But there is no reason why a C compiler can’t force the programmer to make it explicit: to assume no overlap, see if the programmer told us * (restrict)x = * (restrict )y or similar.

        1. 1

          Thanks for the suggestion. I think at the moment, there is quite a bit more lower hanging fruit in QBE, but it’s something to keep in mind.

          I think we’d need to come up with some syntax in QBE IL to communicate the restrict information from the frontend to QBE, and I’m not quite sure how that would look. I’d have to spend some more time reading and understanding C11 6.7.3.1 (formal definition of restrict).

          1. 1

            unfortunately, the “restrict” definition in the C11 standard is absolute murk. I think it actually makes no sense - but the examples and notes give a clear indication of what it is supposed to mean. For example

            EXAMPLE 2 The function parameter declarations in the following example
            void f(int n, int * restrict p, int * restrict q)
            {
            while (n-- > 0)
            *p++ = *q++;
            }
            assert that, during each execution of the function, if an object is accessed through
            one of the pointer
            parameters, then it is not also accessed through the other.
            
            1. 1

              While I am asking for features - sorry - there is a similar issue with “volatile” which is a key feature, but needs to be passed into the back end.

              1. 2

                Yep, I’m well aware of this one. Right now there is no way to tell QBE not to optimize loads and stores, and I’ve been talking to the author about adding this feature.

          2. 1

            How portable is your compiler, and how easy is it to target a new architecture?

            I’m utterly unfamiliar with QBE. Looking at QBE page, I see it says “Only x64 platforms are currently supported. ARM support is planned”, but I see you have AArch64 working, at least in your branch.

            Without having to do a deep dive, does QBE or the compiler make non-portable assumptions? I’m thinking of things like big endianness, odd byte width (9-bit bytes), NULL != 0, 36-bit registers, etc.

            I’m thinking specifically of the Honeywell 6000, but many of the same concerns would apply to other systems, such as the PDP-10.

            1. 4

              Unfortunately for your 36-bit dream, QBE strongly assumes 8-bit bytes, and I don’t think they want to change it. I also agree assuming 8-bit bytes is the right choice for QBE.

              1. 1

                Thanks for the response - we are currently exploring other options for our project, but we are always looking at all new compiler technology that comes around.

              2. 3

                The compiler itself (excluding the driver) should be completely standard C, no POSIX functions are used. I believe this is true of QBE as well.

                I think the QBE page is out of date, since aarch64 is indeed supported. Judging by the size of the existing amd64 and arm64 directories in QBE, it looks like it takes ~1500 lines for a new architecture.

                Regarding non-portable assumptions, QBE was designed for machines with 32/64 bit integer and floating point registers, and this is reflected in the basic data types of the IL (see the IL doc). So, I expect that it would not work so well with the type of machines you’re describing.

                In the frontend, currently the primitive data types have hard-coded sizes, since these match between x86_64 and aarch64. There are also a few isolated places that assume little-endian in the bit-field implementation. But both of these are pretty easily fixable.

                For NULL != 0, conversion from integer to pointer is implemented as a copy, so yes, I do make the assumption that null pointers are 0. However, I’m curious about how that works. Do you mean there are platforms where the C expression NULL != 0 evaluates to 1, or just where the underlying integer for a null pointer is non-zero? C defines a null pointer constant as an integer constant 0, possibly cast to void *. That means this expression is a comparison between two 0 values of integer type, or two null pointers which the standard says must compare equal. So I think the result is 0 in all cases.

                1. 2

                  The H6000 is an odd beast.

                  There are existing C implementations, but we are looking for (possible) new options.

                  Reflecting the design of the hardware, the standard C implementation has char types that are 9 bits, ints and floats that are 36 bits and longs that are 72 bits. There are no 18-bit (short) types. The standard signed char type is actually 8-bits and a high-order sign bit, representative of a range of -256 to 254. The unsigned char type is 9 bits, representing a value of 0 to 512. Importantly, the pointer is 72 bits wide (so it does not fit in an integer type, a common cause of porting headaches), as it consists of (if my memory is correct) a 36 bit offset location and a 36 bit segment reference.

                  However, I’m curious about how that works. Do you mean there are platforms where the C expression NULL != 0 evaluates to 1, or just where the underlying integer for a null pointer is non-zero? C defines a null pointer constant as an integer constant 0, possibly cast to void *. That means this expression is a comparison between two 0 values of integer type, or two null pointers which the standard says must compare equal. So I think the result is 0 in all cases.

                  On the Honeywell, the NULL is a pointer to void, which would be “0|-1” (location 0 of segment -1, or vice versa, I’d have to look) and is not equal to integer 0, but is an “impossible” value.

                  1. 2

                    Not to mention that the frontend itself could probably quite easily be ported to emit assembly directly (like 8cc), or emit llvm text or link against the llvm C api.