1. 27

  2. 10

    I have great respect for the author of QBE, it does have some slow parts, but I know they are being addressed. QBE does lots of nice things like packing values really tightly in memory. I won’t be surprised if QBE is far far faster than LLVM.

    1. 4

      The claims about Jai’s speed are indeed very impressive, I’d be curious to see how/why they got there, especially if it’s about the architecture of the compiler. I think that compilers should be implemented as a set of tree traversals that are then weaved together using a meta-compiler. There’s a ton of potential optimizations when you know how to combine traversals and wether you can defer a computation or not. The Nanopass compiler framework is a good step in that direction. What we need is a language to work with trees, and that is able to combine/collapse tree transformation together. This will guarantee an efficient compiler

      1. 8

        I would take those claims with a grain of salt until Jai is more widely used. Go was just as fast when it came out. I think one of the first demos of Go that Rob Pike did was compiling the entire standard library (100K lines of code) in a second or so on an underpowered Macbook. There is a video online somewhere.

        Go still compiles quickly, but not that quickly, simply because it generates faster code now, and has to handle edge cases in production (e.g. the memory model WRT concurrent reads and writes seems to be a big issue from LLVM that you don’t see in naive compilers [1]). Jai will have to move in that direction once it gets usage, and it will become slower.

        BTW I believe the Go compilers are based on the Plan 9 C compilers by Ken Thompson describe here:


        I didn’t see anything special that Jai is doing in terms of speed that Go compilers didn’t already do. I think it’s basically careful C coding and taking a few shortcuts, which is also the style of Ken Thompson :)

        [1] https://www.reddit.com/r/ProgrammingLanguages/comments/b22tw6/papers_and_algorithms_in_llvms_source_code/

        1. 3

          The go compilers have long since been totally rewritten in go.

          1. 4

            Yeah, that’s true, but they were automatically translated using this bespoke “grind” tool:


            And then polished by hand. I watched the talk about this translation, and also the one about the SSA back end. I recall them saying that the Go compilers in Go ended up slower partly because of escape analysis failures, which they have since improved. And there were issues like the lack of unions in Go.

            They also said that the SSA back end slowed things down a bit, but it’s worth it because the generated code is faster.

            So basically the point is that if Jai ever gets as popular Go, I won’t be surprised if it goes through a similar evolution.

            EDIT: I might have missed your point, which is that Go compilers might always be slower than Jai compilers because they’re written in Go rather than C++? That’s possible. C++ is still crazy fast if you really know and control all the code …

            I think Go eventually reached the point where it wasn’t a “fit it in one or two people’s head” project anymore, so they accepted the speed hit in favor of maintainability by programmers other than the original authors. I think most compiler projects never reach that point and the original authors are doomed to maintain it forever … :-/

            1. 2

              My main point was the compiler has been through so many iterations and rewrites of different sections that is barely has anything in common with the plan9 compilers anymore. The parsers were rewritten, the C parts removed, and the backend was rewritten. All that is left is maybe the assembler syntax.

        2. 2

          I think for faster compiler you should do less work, nano pass compilers do more work by definition, since the whole design is many passes, coordinating those passes is more work. LLVM even has a class called “pass manager”, while a small fast compiler does as few passes as possible. A beefy slow compiler has frameworks.

          LLVM and clang started it’s life trying to be faster than GCC, the best intentions of those engineers failed, because the answer to speed is almost never ‘do more things’.

          1. 6

            That’s exactly the idea: you can think of a compiler pass as a tree transformation. If {T0..Tn} is your set of tree transformations, you should be able to create a derivative set {T'0..T'm} where m <= n and where any T' transform is a composition of a subset of the {T0..Tn} set. In other words, a meta-compiler should be able to analyze the transforms, collapse them and reorder them (it’s more like weaving, as you can break down a pass into independent sub-passes).

            I’m working on a language (called TLang, for tree-language) that aims to do just that: you define transformation passes declaratively, and the meta-compiler works out the best way to combine these passes so that the number of traversal is minimized. Another important aspect is to support incremental updates, so that the compiler is going to translate a tree delta (or tree patch) into the corresponding changes when applied through the passes. This means that you can have live/incremental compilation for free.

            1. 3

              This sounds cool and something I’d like to hear / learn more about.

              I printed out this paper on stream fusion recently, but haven’t gone through it yet. It claims some pretty impressive performance. It uses staged programming techniques.


              I guess I would call what you are doing “tree pass fusion” ? Are there existing systems or papers that you were inspired by?

              1. 5

                Stream fusion seems a bit like a “transducer algebra”, ie. composing transforms together – but streams are easy compared to trees, there’s just a lazy sequence that you pull from. It’s definitely closely related, and I haven’t read that paper, so thanks for the link!

                In TLang, a tree transform is expressed as templates (rewriting parts or trees) and synthetic attributes (attributes that are computed based on queries that extract data from the tree). One big part of the compiler is to minimize the number of traversal given a set of tree queries, and the other is to determine when is the best time to compute an attribute value (sometime it’s better to have eager evaluation, sometimes lazy is better). I’m still in the early stage, but what I’m trying to do is for the compiler to emit a state machine that walks the tree, emitting new branches and updating attributes. I have a feeling it’s going to end up like a mini-VM where the instruction set is dedicated to tree traversal.

                BTW, the Oil shell blog was a super useful source of information, and ASDL is one of the projects I took a close look at. Your idea of serializing the AST according to its in-memory layout is very clever, and something I’d definitely like to use in TLang to support cross-invocation and cross-process incremental compilation/processing (because there’s no parsing/transform to load the data structure).

                1. 2

                  OK that sounds like an interesting approach. I’m interested in seeing a repo link / demo when it’s ready.

                  It sounds like you are imposing fairly tight restrictions on the input programs? It reminded me of this work which I link from my wiki, but to be honest I don’t remember exactly what their limitations were.



                  I found that when I was experimenting with the “OHeap” encoding for ASDL trees.

                  Also, your comment triggered some Googling, and now I remember I first heard about the “tree pass fusion” idea from a video on Scala, I think by Martin Odersky.

                  Have you seen this work? It’s definitely encouraging that it’s starting to be used in production!


                  Production compilers commonly perform dozens of transformations on an intermediate representation. Running those transformations in separate passes harms performance. One approach to recover performance is to combine transformations by hand in order to reduce number of passes. Such an approach harms modularity, and thus makes it hard to maintain and evolve a compiler over the long term, and makes reasoning about performance harder. This paper describes a methodology that allows a compiler writer to define multiple transformations separately, but fuse them into a single traversal of the intermediate representation when the compiler runs. This approach has been implemented in a compiler for the Scala language. Our performance evaluation indicates that this approach reduces the running time of tree transformations by 35% and shows that this is due to improved cache friendliness. At the same time, the approach improves total memory consumption by reducing the object tenuring rate by 50%.

                  I think being able to transform a many-pass compiler over pointer-rich data structures into a single pass over compact encoded nodes is a cool idea, although it’s too ambitious for me right now! I will continue to lurk and bookmark though :)

                  1. 2

                    I’ll definitely publish something when I have more working code code, right now I’m a bit following in your footsteps and trying to document the process as much as I can. At this stage, I’m spending most of the time designing and prototyping until I arrive to a clear concept and architecture that I can then implement. In the end, I’d like to follow your radical transparency approach and publish everything, including the research notes and prototypes – on a side note, the Curv project does a similiar thing, see here.

                    And yes, TLang’s input is restricted: it is essentially a (declarative) DSL to define tree transformations, but the actual computation of synthetic attributes is delegated to a host language. That means that you don’t have to specify computations or effect in TLang, just their interface. In terms of output, TLang’s compiler will produce the tree data types and a state machine to process changes (in batch or in sequence). A big aspect that I’m probably going to leave for V2 is to synthesize data structures that best match the type of tree transformations that the TLang program does.

                    One a side note, TLang is not meant to be only for compilers: I’m writing examples to show how it could be used to do XSLT-type document formatting, a make-like build system and a structured editor engine with schema validation. In that sense it’s really meant to be a generic tool for tree transformation, not necessarily a compiler tool, although it’s definitely the most obvious application.

            2. 2

              Chez Scheme is a nanopass compiler. It definitely makes the compiler slower[1], but it can make it easier to reason about and to produce more optimized code.

              [1] Compile times are slower by about half. This might have improved since, this video is 5 years old now.

              1. 1

                A constant factor slowdown is not bad for what you get in terms of ease-of-use and correctness checks which come with a nanopass framework. Especially considering it’s against a compiler like Chez that has been many years of optimizations placed there and in particular which strives to have only compiler passes which can run in time linear wrt the size of the program.

              2. 2

                I think the argument is you combine nanopass with some meta-compiler to turn it into a system that does less passes for an equivalent result. What’s curious is that we often don’t have a great dialog over what passes pay for themselves.

                1. 1

                  nano pass compilers do more work by definition

                  I think that’s an overstatement. The compilers have to do all kinds of work anyway. They might do it jumping around all over the codebase. Nanopass gets it more organized with potential cache or parallelization benefits. It also might add some overhead due to the extraction into passes. I think we should just measure the implementations instead of assume anything high-level about traditional vs nanopass. Only thing I assume is verification might be easier for nanopass compilers since they’re a series of smaller, local computations.

                  1. 2

                    Well, the core idea of nanopass is many passes over the data with many intermediate steps. I don’t see how that could be faster. Even with the OP’s proposed automatic collapsing, it would just collapse into what most fast compilers already do. It may be easier for verification, or implementing passes, but i am doubtful it would be faster.

                    1. 2

                      You’re probably only thinking of it as runtime with associated overheads. The integrations or calls could be optimized away at compile-time such as a DSL that got translated into something super-efficient. Doing stuff like that at compile time is normal in the Lisp scene. We’ve been seeing some of that kind of stuff in Rust, too, given it has macros with focus on high-efficiency in core language. Passes might just become jumps with the data structures being pointer passing. Underlying representation after compile-time might be really efficient.

                      1. 2

                        That’s spot-on, in my experiments I’m trying to derive a state machine from a set of passes/transforms expressed as a DSL, so that we get code hyper locality while minimizing queries/traversals, so this should be faster than hand written implementations. Next is finding data structures that give high data locality wrt the type of traversals done over them. I’m looking at data structure synthesis based on a set of constraints/properties. This part is a bit ambitious for the moment but I’m keeping it on my radar. Let me know if you find any good paper on the that ;)

                  2. 0

                    I don’t think LLVM ever tried to be faster than GCC in compile time. Clang was faster than GCC when it came out. It’s just that GCC rapidly catched up.

                    1. 6

                      It was definitely a goal of Clang to be faster GCC, and in my experience it IS noticeably faster. Here it compiles parts of CPython in in 13.9 seconds vs. 18.0 for GCC on one machine, and in 40.3 seconds vs. 49.5 for GCC on another machine.


                      It’s not a 2x difference, but it is noticeable and consistent IME.

                      Chris Lattner’s 2007 tech talk on LLVM talks a lot about Clang performance. It was definitely a design goal.


                      There are also some talks about LLVM’s internals like this that are great for learning C++ performance tips:


                      Overall, the problem of compiling C++/C/Objective C is very hard, and they are going to great lengths to make it fast while still generating fast code.

                      This list gives a flavor of the kinds of things it does, and some of those are inherently expensive.


                      Also, if a codebase uses a lot of templates, there’s basically no way to compile it quickly. You are essentially simulating a Turing machine … I recall a talk about speeding up compile times at Facebook by changing compile-time template code.

                      1. 1

                        I suspect nuance was lost. LLVM is a slow backend. Clang is a fast frontend. LLVM and Clang are two different things.

                        1. 1

                          Sure, but being faster than GCC was always a goal for both projects, as far as I know. Not sure what you’re trying to say here.

                          1. 1

                            I am trying to say that as far as I know, it has never been a goal of LLVM backend to be compile time competitive with GCC backend, both debug and release build. In fact, LLVM knowingly uses algorithms inferior in compile time. Survey on Instruction Selection has details, but in summary, GCC uses macro expansion, but LLVM uses DAG covering. DAG covering is inherently slower than macro expansion.

                            1. 1

                              I’m sure the #1 goal is the speed of the binary, and the two goals are in tension. But saying “it’s never been a goal” is silly.

                              I remember when Google switched from GCC to Clang, and they would have never switched if there were significant regressions in build speed. The #1 goal again is code speed, but people complain all the time about build time and it was absolutely a goal.

                              Try this query in Google

                              compile time gcc site:lists.llvm.org

                              And it’s not hard to find comparisons to GCC compile time, along with patches provided to bring it up to speed.

                              Making C++ compile fast is hard, and it’s just silly to think that they would have accidentally made it faster than GCC without aiming for it. Clang is consistently faster than GCC in my experience (numbers given above).

                              Also Chris Lattner is giving the 2007 talk above about Clang performance, and he is also the designer of LLVM. It makes no sense that he would aim to compete with GCC’s compile speed in the front end but not the back end. There is a lot of performance talk in the video – it’s nontrivial, and it’s also early in the history of both Clang and LLVM.

                              1. 2

                                I agree Google cares about build speed. I opine that LLVM project as a whole cares about build speed substantially less than Google does. (Source: I’ve been on LLVM lists and contributed to it for last 10 years.)

                        2. 1

                          Also, if a codebase uses a lot of templates, there’s basically no way to compile it quickly. You are essentially simulating a Turing machine …

                          I think it’s more how C++ templates just compile slowly versus things like Scheme than simulating Turing machines. Just not designed well for high performance.

                        3. 2

                          Ah maybe you are right, though I’m not sure if GCC caught up, or if Clang slowed down.

                      2. 1

                        I submitted some tree languages and papers after someone requested them. A few are here.

                        1. 2

                          Thanks, this one Fast: a Transducer-Based Language for Tree Manipulation seems exactly like what I was looking for. I’ve been looking at tree automata for a while (like the TATA book), but I found them hard to get it because they’re too math focused, and I’m approaching TLang more from a soft.eng. than a CS perspective. On the other hand, it’s pretty clear that the best way to automatically optimize and compose tree transformations is to have clearly specified set of semantics and operations. What got me started was “attribute grammars” and “incremental programming”, and then more specific topics such as “tree automata” and “transducers” popped up as I was exploring the field. It seems to be pretty active right now, which is good sign!

                      3. 3

                        Link to Cone language if anyone was curious about it.

                        1. 2

                          Good article!

                          From my standpoint, choosing C improves the odds, as it makes it easier to get closer to the metal

                          I don’t see how.

                          Although some might argue C has slowed my productivity, bloated the code base, and made it susceptible to all manner of unsafety bugs


                          that has not been my experience

                          If not done already, fuzzing the compiler and using ASAN might be interesting.

                          1. 1

                            I’m confused about Jai. I’m sure I’ve seen streams where Jonathan Blow demoed their LLVM backend.

                            Did that change at some point?

                            1. 1

                              I believe they have more than one backend (the first one was even slower as it compiled to C which was then compiled and linked again). Their fast compilation is meant for development and debugging. It produces reasonable code but then release builds can take much longer to regain the extra performance left on the table.

                              1. 1

                                Yeah I remember the C backend, but haven’t seen any other besides the LLVM one.

                                Just curious if there’s more info on that