    This was an interesting read! I don’t have much background in implementing efficient VMs, so I’m curious: given this was written in 2008, how much has the current state-of-the-art in VM instruction dispatch evolved? Does anyone know of a more modern reference for this?

      It looks as if the high-level techniques are still pretty much the same. If you’d written this in the ’80s then it would have looked similar. The differences are in the subtleties of implementation. For example:

      • Code patching is trivial if you have single-threaded execution. In a multithreaded VM, you need to make sure that updates become visible in an order that keeps the code correct at every point (often this is done by limiting patch points to a single machine word; see the first sketch after this list).
      • Specialising early is really useful. Graal is really great for this and specialises the AST before you ever get to bytecode dispatch (second sketch below).
      • Modern implementations typically have multiple tiers of interpreter / JIT, and on-stack-replacement costs are important when moving between them.
      • Lots of software-engineering bits that don’t affect the underlying abstractions.
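
      A minimal sketch of the single-machine-word patch point idea (all names here are illustrative, not from any particular VM): because each dispatch slot is one pointer-sized word, a single atomic store is a safe patch, and racing threads observe either the old handler or the new one, never a torn mix of the two.

      ```c
      #include <stdatomic.h>
      #include <stdio.h>
      #include <stddef.h>

      /* Each slot of the threaded-code stream is a single pointer-sized
       * word holding a handler address, so one atomic store suffices to
       * patch it while other threads keep dispatching through it. */
      typedef void (*handler_t)(void);

      static void op_add_generic(void) { puts("generic add (type checks)"); }
      static void op_add_int(void)     { puts("specialised int add"); }

      static _Atomic(handler_t) code[4] = { op_add_generic };

      /* Patch one instruction while other threads may be executing it. */
      static void patch(size_t pc, handler_t h)
      {
          atomic_store_explicit(&code[pc], h, memory_order_release);
      }

      int main(void)
      {
          handler_t h = atomic_load_explicit(&code[0], memory_order_acquire);
          h();                  /* runs the generic handler            */
          patch(0, op_add_int); /* single-machine-word patch           */
          h = atomic_load_explicit(&code[0], memory_order_acquire);
          h();                  /* runs the specialised handler now    */
          return 0;
      }
      ```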
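      And a sketch of the self-specialising AST idea (Graal/Truffle does this in Java with far more machinery; everything below is a made-up miniature): a node starts with a generic execute function and rewrites itself to a type-specialised one once it has observed its operands.

      ```c
      #include <stdio.h>

      /* An Add node carries its own execute function; the generic version
       * observes operand types once, rewrites the node in place, and all
       * later executions take the specialised fast path. */
      struct node;
      typedef long (*exec_fn)(struct node *);

      struct node {
          exec_fn execute;   /* rewritten when the node specialises */
          long    lhs, rhs;  /* stand-ins for child subtrees        */
      };

      static long exec_add_int(struct node *n)
      {
          return n->lhs + n->rhs;       /* no type checks left */
      }

      static long exec_add_generic(struct node *n)
      {
          /* A real VM would inspect operand tags here; this miniature
           * just assumes ints were seen and rewrites the node. */
          n->execute = exec_add_int;
          return exec_add_int(n);
      }

      int main(void)
      {
          struct node add = { exec_add_generic, 2, 40 };
          printf("%ld\n", add.execute(&add));  /* specialises, prints 42 */
          printf("%ld\n", add.execute(&add));  /* fast path, prints 42   */
          return 0;
      }
      ```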

      My favourite example of the last two bullets is WebKit’s JavaScriptCore, whose first two tiers use the same code. The interpreter is written in a portable macro assembler. This is done for two reasons:

      • It keeps the code size small (the entire interpreter fits in the L1 i-cache of a modern CPU).
      • It gives complete control over stack layout.

      The first is important because if the interpreter is too slow then you need to move to a JIT more aggressively. The second is important because it lets the interpreter and the tier-1 JIT share the same stack layout (and register usage), so moving between them is just a jump. The tier-1 JIT is built from exactly the same code as the interpreter: it emits direct calls to the handlers for large bytecodes and pastes the handlers’ bodies inline for small ones. There are some corner cases where this can cause a slowdown: if the interpreted code fits nicely into the L1 cache and the JIT’d code doesn’t.
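
      For a feel of why a compact interpreter core matters, here is a rough portable-C analogue of a threaded dispatch loop whose handlers all live in one small function (computed goto is a GCC/Clang extension; the opcodes are made up). The real thing uses macro assembly precisely because C gives you no control over stack layout or register assignment, which is the other half of the trick.

      ```c
      #include <stdio.h>

      /* The whole dispatch loop plus all handlers live in one small
       * function, which is what lets an interpreter core stay resident
       * in the L1 i-cache. */
      enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

      static void run(const int *pc)
      {
          static void *handlers[] = { &&do_push, &&do_add,
                                      &&do_print, &&do_halt };
          long stack[64], *sp = stack;

      #define DISPATCH() goto *handlers[*pc++]
          DISPATCH();
      do_push:  *sp++ = *pc++;            DISPATCH();
      do_add:   sp--; sp[-1] += sp[0];    DISPATCH();
      do_print: printf("%ld\n", sp[-1]);  DISPATCH();
      do_halt:  return;
      #undef DISPATCH
      }

      int main(void)
      {
          int program[] = { OP_PUSH, 2, OP_PUSH, 40,
                            OP_ADD, OP_PRINT, OP_HALT };
          run(program);   /* prints 42 */
          return 0;
      }
      ```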

      Early tiers in any such system are expected to collect profiling information.
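
      Concretely, that can be as cheap as a per-site counter bumped on each dispatch, with a hotness threshold (the value below is made up) that hands the site to the next tier; real profiles also record observed operand types and branch directions for the JIT to exploit.

      ```c
      #include <stdio.h>

      /* A per-site execution counter bumped on every dispatch; crossing
       * a (made-up) hotness threshold hands the site to the next tier. */
      #define HOT_THRESHOLD 1000

      struct call_site {
          unsigned count;            /* bumped by the interpreter     */
          void   (*tier_up)(void);   /* invoke the next tier's JIT    */
      };

      static void compile_with_next_tier(void) { puts("tiering up"); }

      static void profile(struct call_site *site)
      {
          if (++site->count == HOT_THRESHOLD)
              site->tier_up();       /* fires exactly once per site   */
      }

      int main(void)
      {
          struct call_site site = { 0, compile_with_next_tier };
          for (int i = 0; i < 2000; i++)
              profile(&site);        /* prints "tiering up" once      */
          return 0;
      }
      ```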