It’s still quite easy to craft LLVM IR that will not cause optimisations to loop forever, but will cause exponential growth in memory or time. I even did this accidentally. Before we had linker support for CHERI, I made clang emit a function to initialise all globals that contained pointers. These ended up all being inlined into one massive function, which had a single basic block with thousands of instructions. The register allocator was quadratic in the number of live values (I believe this specific case is fixed by falling back to linear scan if the input is too large), which normally didn’t matter because the average length of a basic block is 7 instructions and most outliers are under a hundred. At 10,000 instructions, we hit our CI timeouts.
The key thing for most compilers is that, contrary to popular belief, the programmer is not actually the adversary. It doesn’t matter whether it is possible to craft input that causes the compiler to take forever; it matters whether it’s easy to do accidentally.
E-graphs change the game. In exchange for not necessarily finding the best answer, an e-graph solver always terminates and can be given a hard time limit.
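For a concrete illustration: egg’s Runner lets you cap iterations, node count, and wall-clock time, and extraction then just picks the best term found so far. A minimal sketch (the rules and limits here are arbitrary toys, not taken from any real compiler):

```rust
use egg::{rewrite as rw, *};
use std::time::Duration;

fn main() {
    // A few toy rewrite rules over egg's built-in SymbolLang.
    let rules: &[Rewrite<SymbolLang, ()>] = &[
        rw!("commute-add"; "(+ ?a ?b)" => "(+ ?b ?a)"),
        rw!("add-zero";    "(+ ?a 0)"  => "?a"),
        rw!("mul-one";     "(* ?a 1)"  => "?a"),
    ];

    let start: RecExpr<SymbolLang> = "(+ (* x 1) 0)".parse().unwrap();

    // The runner stops at whichever limit it hits first, so the search
    // always terminates even if saturation is never reached.
    let runner = Runner::default()
        .with_expr(&start)
        .with_iter_limit(30)
        .with_node_limit(10_000)
        .with_time_limit(Duration::from_millis(500))
        .run(rules);

    // Pick the cheapest equivalent expression discovered so far.
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (cost, best) = extractor.find_best(runner.roots[0]);
    println!("best (cost {}): {}", cost, best);
}
```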
How exactly do you propose to use e-graphs to implement a register allocator?
More generally, I’m not sure how much of a game-changer e-graphs are. I mean, a traditional term-rewriter can also be given a time limit. Whereas, many abstract interpretations not only admit termination proofs, but fairly reasonable time bounds (this is based on the finite descending chain property—the type lattice in use should have a finite and preferably small height—of course, performance must be traded off with precision, but you always have that tradeoff). Like I said in the linked answer, I think the main thing that’s interesting about e-graphs is the ability to better explore locally non-monotonic changes in search of a global optimum.
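To make the time-bound point concrete, here is a toy flat lattice (not any particular compiler’s analysis): joins only move values up the lattice, and the lattice has finite height, so a round-robin fixpoint over n program points can change something at most n × height times.

```rust
// Toy flat lattice for constant propagation: Bottom < Const(n) < Top.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Flat {
    Bottom,
    Const(i64),
    Top,
}

impl Flat {
    // Least upper bound: monotone, and only ever moves values upwards.
    fn join(self, other: Flat) -> Flat {
        match (self, other) {
            (Flat::Bottom, x) | (x, Flat::Bottom) => x,
            (Flat::Const(a), Flat::Const(b)) if a == b => Flat::Const(a),
            _ => Flat::Top,
        }
    }
}

fn main() {
    // Three made-up "program points" with made-up transfer functions.
    // Each point can climb the lattice at most twice (its height), so
    // the loop is guaranteed to reach a fixpoint in a bounded number of rounds.
    let mut state = vec![Flat::Bottom; 3];
    loop {
        let old = state.clone();
        state[0] = state[0].join(Flat::Const(42)); // constant assignment
        state[1] = state[0];                       // straight-line copy
        state[2] = state[1].join(state[0]);        // control-flow merge
        if state == old {
            break;
        }
    }
    println!("{:?}", state);
}
```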
When the number of registers is small, then I think we could use a coloring approach. Each e-graph rule would perform a distinctness check, and we could enforce a register ordering to prevent the exponential explosion of register permutations.
That said, I’m not sure how we would handle dozens of general-purpose registers without dozens of corresponding register-by-register rules, nor how spilling and overflowing concepts map onto e-graphs. If you forced me to provide a complete answer today, I’d probably perform rewriting as a separate stage prior to register allocation.
I think the most promising approach to register allocation is Unison, which uses a constraint solver to implement a combined register allocator and instruction scheduler (combining with instruction selection is still an open problem). (In particular, its approach to spilling/rematerialisation is to add a logical spill instruction after every instruction that produces a result, and a logical unspill before every use; these can later be made redundant or realised otherwise, pursuant to the global cost model; see the sketch below.) An e-graph is not scheduled; if you don’t combine register allocation with scheduling, then you have to do it afterwards (or do them in interleaved stages, as LLVM does; but the stages still have to be totally ordered). I guess you could try to schedule the e-graph directly by adding spurious control dependencies until the schedule is totally determined, but I can’t imagine anything good coming of doing that.
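To sketch that spill/unspill insertion step (a schematic toy IR, not Unison’s actual data structures or solver):

```rust
// Toy three-address IR: virtual registers are just u32 ids.
#[derive(Clone, Debug)]
enum Inst {
    Op { def: u32, uses: Vec<u32> }, // an ordinary instruction
    Spill(u32),                      // logical spill of a definition
    Unspill(u32),                    // logical reload before a use
}

// Insert a logical unspill before every use and a logical spill after
// every definition. A later cost-model-driven pass (a constraint solver
// in Unison's case) decides which of these become real memory traffic
// and which are deleted as redundant.
fn insert_spill_candidates(body: &[Inst]) -> Vec<Inst> {
    let mut out = Vec::new();
    for inst in body {
        if let Inst::Op { def, uses } = inst {
            for u in uses {
                out.push(Inst::Unspill(*u));
            }
            out.push(inst.clone());
            out.push(Inst::Spill(*def));
        } else {
            out.push(inst.clone());
        }
    }
    out
}

fn main() {
    let body = vec![
        Inst::Op { def: 1, uses: vec![] },
        Inst::Op { def: 2, uses: vec![1, 1] },
    ];
    println!("{:#?}", insert_spill_candidates(&body));
}
```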
What compilers use them? I’ve only ever seen them in some research papers.
There’s a list of egg users here, and Herbie is a great example of their effectiveness. I used egg in my reference toolchain for Cammy because I was tired of dealing with phase-ordering issues; the resulting optimizer is only a few hundred lines of Rust and commentary.
I see a lot of papers but no big compilers using it. There are lots of ideas that look great in small languages but end up not scaling to large systems and I can’t tell if e-graphs are in that category.
I think that you’ve shifted the goalposts so that you don’t have to revise your top-level opinion. Also, I can’t help but notice that you’ve conflated three different types of objects (compilers, languages, systems) without clarifying why size issues with one type of object would impact another.
Sorry, I assumed my question was obvious from context. I know that e-graphs exist and so must be used by something, if only by a toy implementation for a paper.
My experience with 99% of compiler papers is that they provide things that don’t work in large compilers, and that’s what I meant. Has anyone shipped an optimising compiler that runs on a nontrivial codebase with them? Either they end up with complexity that doesn’t scale to the kind of inputs you see, or they rely on some aspect of type systems that is true only for pure functional languages, or they have a pathological case that turns out to be common (e.g. in 1% of codebases, which makes it unusable in practice), or they use a crappy old version of LLVM as a baseline and their speedup goes away once you move to a version with better heuristics for the same passes, or their speedup is entirely due to random noise from perturbing code order that happens to win for their compiler on the small corpus they tested, or… and so on.
This is why I am deeply sceptical of any academic compiler project that claims a game-changing idea and can’t show me a demonstration in a real compiler that I can throw a massive pile of existing code at.
E-graphs don’t handle mutable references well. Their input should be in SSA form, I suppose. I deliberately designed Cammy without variables, partly to dodge this problem, so I think it’s a fair critique.
Anyway, Cranelift has an e-graph “midend”, although I would not think of Cranelift as a “big compiler” but more of a medium-sized compiler construction kit. If you check the dates, you’ll see that most of that work was merged last year. E-graphs are being added to projects as fast as possible, from my perspective.
Cranelift’s e-graphs are neutered, applying only per basic block.
ISTR the paper ‘SSA translation is an abstract interpretation’ mentions the possibility of applying e-graphs to the (abstract) cyclic term graph; that seems more sensible to me than applying them directly to the concrete terms because of how critically important incrementality is to sophisticated analyses and optimisations (as well as debugging, which I mention there too)—although the specifics here are not yet entirely clear to me. That also avoids your SSA problem, because it enables iterative discovery of ‘SSA’ terms in combination with other optimisations, avoiding the need to do a translation up-front and get it right (abstract interpretations can be combined and therefore don’t have phase-ordering problems).
(If one is not interested in sophisticated analyses and optimisations, only basic ones, then e-graphs are too complex and a complete distraction; the Pareto-optimal choice is still sea of nodes from 1995, plus whichever you care for of the non-outdated optimisations from 1971’s ‘catalogue’.)
Thanks. Requiring SSA is an issue for anything that deals with memory. The MemorySSA work is interesting (effectively treating every store as returning a new memory) but it’s not clear how you model things like atomics with it. It also looks like it doesn’t (yet?) scale well enough to be enabled by default.
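Roughly, the ‘store returns a new memory’ encoding looks like the sketch below; the types are purely illustrative (not LLVM’s MemorySSA API), and, as noted, atomics and other ordering constraints are not modelled:

```rust
// An SSA "memory state" token: stores produce a fresh one, so memory
// dependencies become ordinary dataflow edges that a term graph or
// e-graph can represent.
#[derive(Debug, Clone, Copy)]
struct Mem(u32);

#[derive(Debug)]
enum Node {
    // A load observes a memory state and an address and yields a value.
    Load { mem: Mem, addr: u32 },
    // A store consumes a memory state and produces a new one.
    Store { mem_in: Mem, addr: u32, val: u32, mem_out: Mem },
}

fn main() {
    // p[0] = 1; x = p[0];  -- the load depends on memory state 1,
    // not on program order.
    let st = Node::Store { mem_in: Mem(0), addr: 0, val: 1, mem_out: Mem(1) };
    let ld = Node::Load { mem: Mem(1), addr: 0 };
    println!("{:?}\n{:?}", st, ld);
}
```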
Cranelift definitely counts as ‘real’. I’ll have to take a look.
Mathematically, you can prove that an iterated process will terminate by finding a ‘measure’ and proving that the value of that measure always decreases.
A simple example could be ‘number of optimizations left to apply in the source code’; that is a function m : source_code -> positive_int.
If the optimization pass always maps optimize(s1) = s2 with the condition that m(s1) > m(s2), then you know for sure that at some point there will be no optimizations left to apply, and the process can terminate there.
This particular example of a measure is actually a very strict constraint. It doesn’t let you apply any steps that manipulate code in a way that produces a bigger piece of code (e.g. inlining an unoptimized block of code). To support that you would need a more complex measure.
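A minimal sketch of that argument as a driver loop (Code, measure, and optimize_step are placeholders, not any real compiler’s API): the loop accepts a step only if the measure strictly decreases, so it can iterate at most m(initial) times.

```rust
type Code = Vec<String>;

// Placeholder measure: "number of optimisation opportunities left".
fn measure(code: &Code) -> usize {
    code.iter().filter(|line| line.contains("redundant")).count()
}

// Placeholder pass that removes one remaining opportunity, if any.
fn optimize_step(code: &Code) -> Code {
    let mut out = code.clone();
    if let Some(i) = out.iter().position(|line| line.contains("redundant")) {
        out.remove(i);
    }
    out
}

fn main() {
    let mut code: Code = vec![
        "x = 1".into(),
        "redundant copy".into(),
        "redundant load".into(),
    ];
    loop {
        let next = optimize_step(&code);
        // The termination condition m(s1) > m(s2) from above: only keep
        // rewriting while the measure strictly decreases.
        if measure(&next) < measure(&code) {
            code = next;
        } else {
            break;
        }
    }
    println!("{:?}", code);
}
```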
The real answer: heuristics, hacks, and hope.