This is super cool! I wrote a continuation-passing interpreter last year, and a switch-statement-based one the year before, so I’m nodding my head sagely at this exposition.
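For anyone who hasn’t seen the two styles side by side, here’s a minimal sketch on a made-up three-opcode stack machine (toy code of my own, nothing from the article):

```c++
// Two toy dispatch styles; purely illustrative.
#include <cstdint>
#include <cstdio>
#include <vector>

enum Op : uint8_t { OP_PUSH, OP_ADD, OP_HALT };

// Style 1: classic switch-in-a-loop dispatch.
int64_t run_switch(const std::vector<uint8_t>& code) {
    std::vector<int64_t> stack;
    size_t pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH: stack.push_back(code[pc++]); break;
        case OP_ADD: {
            int64_t b = stack.back(); stack.pop_back();
            stack.back() += b;
            break;
        }
        case OP_HALT: return stack.back();
        }
    }
}

// Style 2: continuation-passing / tail-call dispatch. One function per
// opcode; each handler ends by calling the next opcode's handler, so the
// "loop" is the chain of jumps itself.
struct VM { const uint8_t* pc; int64_t* sp; };
using Handler = int64_t (*)(VM);

static int64_t op_push(VM vm);
static int64_t op_add(VM vm);
static int64_t op_halt(VM vm);
static const Handler dispatch[] = { op_push, op_add, op_halt };

static int64_t op_push(VM vm) {
    *vm.sp++ = *vm.pc++;          // push the immediate operand
    uint8_t next = *vm.pc++;
    return dispatch[next](vm);    // tail call into the next handler
}
static int64_t op_add(VM vm) {
    vm.sp[-2] += vm.sp[-1];
    --vm.sp;
    uint8_t next = *vm.pc++;
    return dispatch[next](vm);
}
static int64_t op_halt(VM vm) { return vm.sp[-1]; }

int main() {
    // push 2; push 3; add; halt  ->  5 in both styles
    std::vector<uint8_t> code = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
    int64_t frame[64];
    std::printf("%lld %lld\n",
                (long long)run_switch(code),
                (long long)dispatch[code[0]](VM{ code.data() + 1, frame }));
}
```

The second style only pays off if the compiler reliably turns those calls into jumps; clang’s `[[clang::musttail]]` exists to guarantee that.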
It’s worth noting, though, that it isn’t just the novel interpreter generator that made this faster than LuaJIT — they also added shape-based table lookup caching:
A lot of LJR’s speedup over LuaJIT interpreter comes from our support of inline caching. We have rewritten the Lua runtime from scratch. In LJR, table objects are not stored as a plain hash table with an array part. Instead, our table implementation employed hidden classes, using a design mostly mirroring the hidden class design in JavaScriptCore. …
LuaJIT’s hand-written assembly interpreter is highly optimized already. Our interpreter generated by Deegen is also highly optimized, and in many cases, slightly better-optimized than LuaJIT. However, the gain from those low-level optimizations are simply not enough to beat LuaJIT by a significant margin, especially on a modern CPU with very good instruction-level parallelism, where having a few more instructions, a few longer instructions, or even a few more L1-hitting loads have negligible impact on performance.
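To make the quoted part concrete, an inline cache for a field load like `t.x` boils down to roughly this. This is a generic hidden-class sketch of my own, not LJR’s actual data structures:

```c++
// Generic sketch of hidden-class (shape) based inline caching for a field
// load. Names and layout are illustrative only.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Shape {
    // Maps property name -> slot index in the object's storage.
    std::unordered_map<std::string, uint32_t> slots;
};

struct TableObject {
    const Shape* shape;            // hidden class, shared by tables with the same layout
    std::vector<int64_t> storage;  // property values, addressed by slot index
};

// One inline cache site, living next to the bytecode that performs the load.
struct GetFieldIC {
    const Shape* cached_shape = nullptr;
    uint32_t cached_slot = 0;
};

int64_t get_field(TableObject& t, const std::string& name, GetFieldIC& ic) {
    // Fast path: if this site keeps seeing tables with the same shape,
    // the lookup collapses to one pointer compare plus one indexed load.
    if (t.shape == ic.cached_shape) {
        return t.storage[ic.cached_slot];
    }
    // Slow path: do the real hash lookup, then remember shape + slot so the
    // next execution of this bytecode can take the fast path.
    uint32_t slot = t.shape->slots.at(name);
    ic.cached_shape = t.shape;
    ic.cached_slot = slot;
    return t.storage[slot];
}
```

The win is that, after the first lookup at a given site, tables that keep the same shape pay a pointer compare and an indexed load instead of a hash lookup, and that works in a plain interpreter, no JIT required.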
IMHO this is problematic for a research report — they’ve changed multiple variables at once, making it hard to draw conclusions. I would have liked to have seen additional figures for how their interpreter alone compares with LuaJIT.
I also noted that the article refers to “LuaJIT’s interpreter,” so it sounds like they did not compare against JITted code. Which is understandable, as that would not really be a fair comparison, JITs being intrinsically faster.
Anyway, it is cool to see progress being made on non-JIT interpreters, because there are several important platforms where JITs are not a viable option, such as iOS (OS security policy prevents apps from mapping pages as executable) and small embedded CPUs (too little RAM to spend on generated code).
The Alphabet Soup work from Sun Labs (which, I think, was released as Grail?) on optimising AST interpreters was similarly interesting. They showed that they could do type feedback in the interpreter to produce an optimised AST and then have a naive AST interpreter that performed as well as most hand-optimised bytecode interpreters. It also had the nice benefit of being a great input to a compiler, because it was already aggressively specialised for the dynamic types that it encountered.
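The core trick, as I remember it (my own sketch below, not Grail’s code): a generic AST node counts the operand types it actually sees and rewrites its own slot in the tree with a specialized node, so the naive tree-walk afterwards does almost no type dispatch:

```c++
// Sketch of type-feedback-driven AST specialization; names are made up.
#include <cstdint>
#include <memory>
#include <variant>

using Value = std::variant<int64_t, double>;

struct Node {
    virtual ~Node() = default;
    virtual Value eval() = 0;
};

struct IntLiteral : Node {
    int64_t v;
    explicit IntLiteral(int64_t v) : v(v) {}
    Value eval() override { return v; }
};

// Specialized node produced by feedback: assumes int operands, keeps a
// cheap guard so a wrong guess still computes the right answer.
struct IntAddNode : Node {
    std::unique_ptr<Node> lhs, rhs;
    IntAddNode(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    Value eval() override {
        Value a = lhs->eval(), b = rhs->eval();
        if (auto* x = std::get_if<int64_t>(&a))
            if (auto* y = std::get_if<int64_t>(&b))
                return *x + *y;                      // the common, fast case
        return to_double(a) + to_double(b);          // guard failed: go generic
    }
    static double to_double(const Value& v) {
        return std::holds_alternative<int64_t>(v)
            ? static_cast<double>(std::get<int64_t>(v)) : std::get<double>(v);
    }
};

// Generic node: does the full type dispatch, records what it sees, and once
// the evidence is consistent enough rewrites its own slot in the tree with
// the specialized node above.
struct AddNode : Node {
    std::unique_ptr<Node> lhs, rhs;
    std::unique_ptr<Node>* self_slot;   // the parent's pointer to this node
    int int_int_seen = 0;

    AddNode(std::unique_ptr<Node> l, std::unique_ptr<Node> r,
            std::unique_ptr<Node>* slot)
        : lhs(std::move(l)), rhs(std::move(r)), self_slot(slot) {}

    Value eval() override {
        Value a = lhs->eval(), b = rhs->eval();
        if (std::holds_alternative<int64_t>(a) &&
            std::holds_alternative<int64_t>(b)) {
            int64_t result = std::get<int64_t>(a) + std::get<int64_t>(b);
            if (self_slot && ++int_int_seen >= 8) {
                // Rewrite the tree in place; `this` is destroyed here, so only
                // the local `result` is touched afterwards.
                *self_slot = std::make_unique<IntAddNode>(std::move(lhs),
                                                          std::move(rhs));
            }
            return result;
        }
        return IntAddNode::to_double(a) + IntAddNode::to_double(b);
    }
};
```

And because the rewritten tree has its types pinned down, it has exactly that “great input to a compiler” property.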
In addition to the places you list, I’d also add places where latency is more important than throughput. JITs generally have better overall throughput but at the expense of either start-up times, jitter as things are JIT’d, or both. Interpreters can have stable performance and can start executing immediately.
Fascinating area that I wish I had more time to play with.
FWIW, there is a reference to this article, which I posted a year ago: https://lobste.rs/s/ffssdz/parsing_protobuf_at_2_gb_s_how_i_learned
Interesting exchanges by the authors and a few other people here: https://news.ycombinator.com/item?id=33711583
This seems like an awesome way of writing faster interpreters – i.e. not in assembly, but in C++ snippets you stitch together with a tool.
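I haven’t dug into its internals, but conceptually the input is something like the snippet below (every name here is made up for illustration; deegen’s real API is certainly different):

```c++
// Purely illustrative sketch of the "write each bytecode as an ordinary C++
// snippet, let a tool generate the interpreter around it" idea. None of
// these names are deegen's real API.
#include <cstdio>

// A toy boxed value: either a double or "something else" (string, table...).
struct TValue {
    bool is_double;
    double num;
};

// What the VM author writes: just the semantics of one bytecode, as plain
// C++, with an explicit fast path and a call out to a slow path.
static TValue slow_path_add(TValue, TValue) {
    // metamethod lookup etc. would live here
    return TValue{false, 0.0};
}

static TValue add_bytecode(TValue lhs, TValue rhs) {
    if (lhs.is_double && rhs.is_double) {
        return TValue{true, lhs.num + rhs.num};   // fast path
    }
    return slow_path_add(lhs, rhs);
}

// What the generator would add around snippets like the one above (working
// on their LLVM IR): operand decoding from the bytecode stream, tail-call
// stitching from one handler to the next, hot/cold splitting of the slow
// path, and inline-cache plumbing -- the glue a hand-written assembly
// interpreter has to redo for every architecture.
int main() {
    TValue r = add_bytecode(TValue{true, 1.5}, TValue{true, 2.25});
    std::printf("%g\n", r.num);   // 3.75
}
```

The appeal is that the per-bytecode semantics stay portable C++ while the architecture-specific glue is generated.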
I did peek at the deegen tool a bit, and it seems quite large? https://github.com/luajit-remake/luajit-remake/tree/master/deegen
I would be interested in an overview of all the analysis it has to do, which, as I understand it, is basically “automated Mike Pall”.
FWIW I think this is the hand-written equivalent with LuaJIT’s dynasm tool: https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/vm_x64.dasc (just under 5000 lines)
Also there are several of these files, one per architecture, with none of the sharing you would get from deegen:
https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/vm_x86.dasc
https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/vm_ppc.dasc