Hi sillycross! Ever since your first post in this series I have been eagerly watching the repo. This is such a neat project and I am excited to see where it goes!
I’m pretty sure I’ve heard of the “copy and patch” technique before in JIT implementations, although not copying from a .o file but copying actual functions from the running code. (I guess it would take the address of a function that implements one bytecode instruction, magically find the end of the function, and copy that range of memory to the output.) So I’m not clear what makes this implementation so much faster.
By “fast” do you mean compilation time or execution performance? For compilation time, look at the ‘Add’ bytecode example in the post and you’ll see why it is fast: the whole thing is branchless.
For execution performance, the main reason is that we can burn runtime constants into the instruction flow (roughly as in the sketch below), which is something your approach cannot achieve.
Then, for the original copy-and-patch paper (Wasm/SQL use case), additional perf comes from simple register allocation and non-local instruction selection.
For luajit-remake, some additional perf comes from inline caching.
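For anyone who hasn’t read the copy-and-patch paper, here is a hand-wavy sketch of what burning a constant looks like (made-up names, not Deegen’s actual stencil machinery). A stencil is an ordinary C function compiled ahead of time into a .o, with an undefined extern symbol standing in for the constant; at JIT time the code bytes are memcpy’d into the output buffer and the relocation is filled in with the runtime value:

```c
#include <stdint.h>

/* Placeholder symbol, never defined. Wherever its "address" is used, the
 * compiler emits a relocation in the .o file, and the JIT later patches
 * that site with a runtime value. */
extern char OPERAND_PLACEHOLDER;

/* Stencil for something like "result = lhs + <runtime constant>". The code
 * bytes are copied into the JIT buffer and the relocation for
 * OPERAND_PLACEHOLDER is rewritten to the actual operand, so the constant
 * ends up directly in the instruction stream: no load, no branch. */
int64_t stencil_add_const(int64_t lhs)
{
    return lhs + (int64_t)(uintptr_t)&OPERAND_PLACEHOLDER;
}
```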
A similar very-unportable thing I’ve seen someone do in a just-for-fun project was to take the address of a label in C code at the start and another at the end of the code implementing each operation.
This relied on some GCC extensions and lots of assumptions about how the code would happen to be laid out.
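For the curious, it looked roughly like this (a deliberately simplified sketch with hypothetical names; the real thing had all the same caveats):

```c
#include <string.h>

/* Fragile sketch of the "copy the code between two labels" trick. It relies
 * on GCC's labels-as-values extension (&&label) and on the compiler keeping
 * the handler contiguous and position-independent, none of which is
 * actually guaranteed. */
size_t copy_add_handler(unsigned char *out)
{
op_add_begin:
    /* ... code implementing the ADD operation would go here ... */
    __asm__ volatile("" ::: "memory");   /* discourage the optimizer from moving/merging it */
op_add_end:
    {
        size_t len = (size_t)((char *)&&op_add_end - (char *)&&op_add_begin);
        memcpy(out, &&op_add_begin, len);   /* splat the handler into the output buffer */
        return len;
    }
}
```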
There have been commercially successful products heavily reliant on that unportable thing – mainly Dalvik (the JVM in Android), which used it for many years in its pre-ART days. It has been a fairly common pattern in emulator circles as well – my memory is rusty on the matter, but I would put the use of that GCC extension back in the early 00s or so.
Very cool! It would be nice if Deegen provided an alternative to Truffle or RPython. It might be hard to match RPython’s ergonomics, though.
Is Deegen available anywhere? I didn’t see a link in the blog and it sounds like a fun tool to play with.
It’s not a standalone tool right now, as it is still under active development. Just look at the ‘deegen’ folder in the LuaJIT Remake repo.
Thanks! I want to write a high-performance CHERIoT emulator soon. If there’s a good bit in the code to look at for getting started, pointers would be very welcome.
If I understood what you said correctly, you are likely looking for the wrong thing. The VM you describe (i.e., a virtual machine that emulates an ISA) is not the VM in this post (a virtual machine that executes a dynamic language).
By the way, this terminology confusion is also why LLVM is officially no longer an acronym, “but the name of the project”.
I don’t see a distinction. You have built a tool that implements high-performance interpreters and simple JITs that execute sequences of instructions. I have an architecture specified in terms of instructions that are executed sequentially. Exactly the same techniques used for implementing interpreters apply.
I have been an LLVM contributor since 2008, I am well aware of its history.
Yes, but for dynamic languages. Most of the hard part of Deegen is to support dynamic-language-specific optimizations (inline caching, type-related opts, slow paths, etc). These complications do not exist for a VM that emulates an ISA.
Also, Deegen makes the fundamental assumption of a register-based bytecode machine model, where each virtual register holds a boxed value. Again, the assumption is valid for dynamic language use cases, but not elsewhere (why would you want boxed values for ISA emulation?).
Copy-and-Patch might still be useful for your use case to generate code (in fact, QEMU did something similar [1], though I didn’t know about this until I had finished writing the Copy-and-Patch paper), but Deegen is likely not.
[1] https://www.usenix.org/conference/2005-usenix-annual-technical-conference/qemu-fast-and-portable-dynamic-translator
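In case “boxed value” is unfamiliar: the idea is roughly the following NaN-boxing-style sketch (generic, not LuaJIT Remake’s actual value layout, and it glosses over the corner cases a real implementation handles). Every virtual register is one 64-bit word that encodes both a type tag and a payload: exactly what a dynamic language needs, and what an ISA emulator has no use for.

```c
#include <stdint.h>
#include <string.h>

/* Rough sketch of a boxed dynamic-language value. Every virtual register is
 * a single 64-bit word carrying both a type tag and a payload. */
typedef uint64_t BoxedValue;

#define BOX_TAG_MASK  0xFFFF000000000000ull
#define BOX_TAG_NIL   0xFFF9000000000000ull
#define BOX_TAG_INT   0xFFFA000000000000ull

static BoxedValue box_double(double d) {
    BoxedValue v;
    memcpy(&v, &d, sizeof v);             /* doubles are stored as their own bit pattern */
    return v;
}

static BoxedValue box_int(int32_t i) {
    return BOX_TAG_INT | (uint32_t)i;     /* tag in the NaN space, payload in the low bits */
}

static int is_int(BoxedValue v) {
    return (v & BOX_TAG_MASK) == BOX_TAG_INT;
}
```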
Thanks. The copy and patch infrastructure was the thing that I was most interested in. Specifically, we have a formal model of our ISA and so it might be easy to mechanically generate C functions from that. For my specific use case, I don’t care about self-modifying code (I want to simulate immutable firmware images) so I can use LLVM’s object-code parsing and disassembly to get a complete dump of everything I want to simulate.
Some of the slow-path optimisation might also be interesting, because an interrupt might arrive on any instruction (but usually won’t) and some loads and stores might trigger faults (again, most won’t).
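To make the slow-path point concrete, the shape I have in mind is roughly this (hypothetical hand-written C with made-up names, not anything from an existing codebase): the common case stays straight-line, and the rare interrupt/fault cases jump to cold out-of-line helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical emulator state and helpers; the names only illustrate the
 * fast-path / slow-path split. */
typedef struct {
    uint64_t regs[32];
    uint64_t pc;
    volatile bool irq_pending;
} CpuState;

bool guest_read_u32(CpuState *cpu, uint64_t addr, uint32_t *out); /* false on fault */
void take_interrupt_slowpath(CpuState *cpu);                      /* out-of-line, cold */
void raise_load_fault_slowpath(CpuState *cpu, uint64_t addr);     /* out-of-line, cold */

/* Handler for a hypothetical "load word" instruction, the kind of thing one
 * might mechanically generate from a formal ISA model. The common case is
 * straight-line code; interrupts and faults branch out to cold slow paths. */
void exec_load_word(CpuState *cpu, int rd, int rs1, int32_t imm)
{
    if (__builtin_expect(cpu->irq_pending, 0)) {                  /* rare: deliver interrupt */
        take_interrupt_slowpath(cpu);
        return;
    }
    uint64_t addr = cpu->regs[rs1] + (int64_t)imm;
    uint32_t value;
    if (__builtin_expect(!guest_read_u32(cpu, addr, &value), 0)) { /* rare: load fault */
        raise_load_fault_slowpath(cpu, addr);
        return;
    }
    cpu->regs[rd] = value;
    cpu->pc += 4;
}
```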
In that case you should definitely read the QEMU paper above :)
I have read it before. This isn’t my first JIT…
Is this your PhD project?
Reminds me of jitter.
The results (including those of the interpreter) are impressive, and the chances that it eventually outperforms LuaJIT are good, not least because closure instantiation (FNEW) is still NYI in LuaJIT and breaks traces, which is where a lot of performance is lost (see e.g. https://github.com/rochus-keller/Som/blob/master/Readme.md).