A key assumption in assessing the runtime performance of the lambda set compilation technique is that loading and switching on an integer (e.g. implemented as a jump table) is faster than loading and dereferencing a function pointer.
This is a reasonable assumption in general (especially when there is nothing to switch over, i.e. the switch is constant). However, whether it holds for a specific use case on a specific operating system and architecture is impossible to determine without benchmarks.
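To make the comparison concrete, here is a minimal sketch of the two dispatch strategies in C++ (illustrative code with made-up functions, not Roc's actual output):

#include <cstdio>

static int addOne(int x) { return x + 1; }
static int timesTwo(int x) { return x * 2; }

// Defunctionalized dispatch: the "closure" is a small integer tag, and
// the call site switches over every function it could possibly name.
enum class Lambda { AddOne, TimesTwo };

static int applySwitch(Lambda f, int x) {
    switch (f) {
        case Lambda::AddOne:   return addOne(x);
        case Lambda::TimesTwo: return timesTwo(x);
    }
    return x; // unreachable for well-formed tags
}

// Conventional dispatch: load a function pointer and call through it.
static int applyPtr(int (*f)(int), int x) { return f(x); }

int main() {
    std::printf("%d %d\n", applySwitch(Lambda::AddOne, 41), applyPtr(timesTwo, 21));
}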
If you’re familiar with computer architecture, you may recognize that the fundamental tradeoff in the defunctionalization approach boils down to the efficacy of a CPU’s branch predictor (for the conditional-dispatch case) relative to its branch target predictor (for the pointer-chasing case). I am not aware of any existing studies that measure the relative efficacies of these predictors across CPUs; if you are aware of any, please let me know!
This is all nonsense.
The advantages to this approach, if any, seem to be that:
1. A call to a closure can know statically the set of functions which might be called, enabling some interprocedural optimisations, and
2. It is possible to avoid an indirection for the environment in some cases where it would otherwise be desirable but not possible.
Both of these are quite marginal; heap allocation is quite fast (if you have a good gc!), and compilers are not in the habit of performing much in the way of interesting interprocedural optimisations. And most of the interesting cases of #1 (i.e. where the set is not too large) are likely to be caught anyway, without the need for a technique like this, because wanting to know what values a term might take on at runtime is completely general, and not specific to functions.
EDIT: in particular, all of this seems strictly worse than an inline cache.
I think there’s an interesting distinction between languages that are traditionally JIT-compiled (like Java or JavaScript) vs languages that are traditionally AOT-compiled (like C++ or Rust). Let’s take the example of mapping over a list.
In Java, you’d write the code like:
import java.util.ArrayList;
import java.util.function.Function;

// One compiled map method; f is invoked through an interface (virtual) call.
public static <T, U> ArrayList<U> map(ArrayList<T> x, Function<T, U> f) {
    ArrayList<U> out = new ArrayList<U>();
    for (T t : x) out.add(f.apply(t));
    return out;
}
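In C++, there are two ways to write the code: take the callable as a template parameter, or erase its type behind std::function. A sketch of both (the names map_template and map_erased are illustrative):

#include <functional>
#include <vector>

// Approach 1: f's type is a template parameter, so a separate copy of
// the function is instantiated (and f can be inlined) per callable.
template <typename T, typename U, typename F>
std::vector<U> map_template(const std::vector<T>& xs, F f) {
    std::vector<U> out;
    out.reserve(xs.size());
    for (const T& t : xs) out.push_back(f(t));
    return out;
}

// Approach 2: std::function erases the callable's type, so only one
// instance exists per (T, U) pair and every call goes through indirection.
template <typename T, typename U>
std::vector<U> map_erased(const std::vector<T>& xs, std::function<U(T)> f) {
    std::vector<U> out;
    out.reserve(xs.size());
    for (const T& t : xs) out.push_back(f(t));
    return out;
}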
In the first approach, a new version of the map function is instantiated for each function we pass into map. In the second approach, std::function contains a function pointer to the function that we want to use to map. Only one instance of the map function is generated (for each T, U pair).
There are trade-offs for each of the C++ approaches, and which approach you'd prefer depends on a number of factors, and there isn't a one-size-fits-all solution. Regardless, in C++ (just like in Rust), the programmer is forced to choose between the two approaches.
In Java (and other JVM languages), the runtime system chooses, not the programmer. By default, the JVM will use an indirect function call (like the second approach). However, if the function gets called enough (and other optimization criteria have been met), the JVM will switch to using the first approach, which allows f to be inlined into the map function.
I think this represents an interesting distinction between static and dynamic language runtimes. Neither the C++ nor Java compiler has enough information to make this decision at compile-time, so the JVM pushes it off to runtime, where it can have enough info to make the decision.
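Note that Roc doesn't have a tracing garbage collector. Reference counting is used.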
I don't see how this is strictly worse than an inline cache. In this model, the cache set is known completely statically; at the least, you can populate the cache at compile time. Could you let me know if I'm missing something?
I agree with your argument to an extent, but I strongly disagree at the limit. For (1), an interprocedural optimization that this enables is inlining, which is quite common and aptly called the mother of all optimizations. I agree that (2) is marginal, especially relative to the cost of a branch. Heap allocation is fast, but not relative to a stack allocation. It may be the case that the cost is about the same if most of these allocations are ephemeral and end up in the young generation of a GC/popping off a free list. I do think, however, across a very large number of allocations, there is something to be said for knowing your values don't touch the heap at all.
I don’t see how this is strictly worse than an inline cache
You've got 10 functions you might call; turns out only one of them is hot, so your inline cache bubbles that one up to the top and inlines it. Or three are hot, and you can make direct calls to them and help out your BTB, etc.
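(To sketch the mechanism: a monomorphic inline cache remembers the last target seen at a call site and guards on it. The struct below is illustrative only; a real JIT patches the call site's machine code rather than consulting a struct.)

#include <cstdio>

using Fn = int (*)(int);

static int addOne(int x) { return x + 1; }
static int timesTwo(int x) { return x * 2; }

// Illustrative monomorphic inline cache: the hot path is a cheap
// equality guard followed by a direct (and inlinable) call.
struct InlineCache {
    Fn cached = nullptr;

    int call(Fn target, int x) {
        if (target == cached) {
            return cached(x); // hit: guard passed, direct call
        }
        cached = target;      // miss: re-seed the cache
        return target(x);     // fall back to the indirect call
    }
};

int main() {
    InlineCache ic;
    std::printf("%d\n", ic.call(addOne, 1));   // miss, seeds cache
    std::printf("%d\n", ic.call(addOne, 2));   // hit
    std::printf("%d\n", ic.call(timesTwo, 2)); // miss, re-seeds
}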
an interprocedural optimization that this enables is inlining
Once again: you should in general be able to find out as precisely as you can what values any term might take on; and if a function call is known to be to a particular function, sure, turn it into a direct call and maybe inline it. That's well precedented. If there are two potential targets? Meh… could be worthwhile, but see again the inline cache; it's highly counterproductive to do it if the call is not hot enough.
young generation of a GC/popping off a free list
(Aside: those have very different performance profiles…)
Like Lua, Roc’s automatic memory management doesn’t require a virtual machine, and it’s possible to call Roc functions directly from any language that can call C functions
Was reading up on Roc and saw this, does anyone know what this refers to in practice?
Roc uses reference counting for memory reclamation, which is a local technique, instead of a tracing garbage collector, which is a global technique (“requires a virtual machine”).
I’m curious what you mean by this? Are you referring to something more like newlisp, which ensures local memory is freed immediately after use? Or did you have something else in mind?
which is a global technique (“requires a virtual machine”)
Nothing about tracing garbage collection requires a virtual machine. A VM does make it easier to discover "roots" and to be more precise, but as a counterexample, the Boehm-Demers-Weiser GC just scans the stack for anything that might look like a pointer, and all you, as a programmer, have to do is call GC_{malloc,realloc,free} instead of the typical malloc, realloc, free. It's incremental and generational, but not precise; it can miss things. (This is a very simplified explanation of Boehm; there are a ton more details here.)
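(A minimal sketch of what that looks like in practice, assuming libgc is installed and linked with -lgc; the header location may vary by platform:)

#include <gc.h>    // Boehm-Demers-Weiser collector
#include <cstdio>

int main() {
    GC_INIT(); // recommended before the first allocation

    // Allocate from the collected heap; there is no matching free call.
    int* xs = static_cast<int*>(GC_malloc(1000 * sizeof(int)));
    xs[0] = 42;
    std::printf("%d\n", xs[0]);

    // Once xs is unreachable, a conservative scan of stacks and registers
    // for pointer-shaped words lets the collector reclaim it.
    return 0;
}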
Tracing garbage collectors do not require a virtual machine; that statement (and not just that statement) is confused.
Feedback: it would be better (and I know it takes time) to explain why it’s confused instead of pointing it out and leaving it at that.