One particular program where this effect is notable is the implementation of a simple stack-based virtual machine (or abstract machine), such as the SECD or the ZINC machine. Variables can be represented as just integers (De Bruijn indices), representing indices into a stack-like environment, with 0 denoting a reference to the most recently pushed/defined variable.
A naive implementation just represents environments as lists, and uses the linear-time List.nth to retrieve the value of a variable. This feels horribly inefficient, so it is natural to try a dynamic array instead (you don't even need a hash table). It is at first surprising to discover that, for most programs one may want to write, the list-based implementation is noticeably faster than all the others. This comes from the fact that most programs access the most recently defined variables (the local variables and function parameters) very often, and the rest much less often, so the accessed indices are very small in practice.
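A minimal sketch of the list-based approach described above, in OCaml (the function names lookup and extend are illustrative, not from any particular implementation):

```ocaml
(* A De Bruijn-indexed environment represented as a plain list.
   Index 0 refers to the most recently bound variable. *)
type 'a env = 'a list

(* Linear-time lookup, but fast in practice since indices are
   usually small (recently bound locals and parameters). *)
let lookup (env : 'a env) (i : int) : 'a = List.nth env i

(* Binding a new variable just conses onto the front: O(1),
   and the old environment is shared, not copied. *)
let extend (env : 'a env) (v : 'a) : 'a env = v :: env

let () =
  let env = extend (extend (extend [] "x") "y") "z" in
  assert (lookup env 0 = "z");  (* innermost binding *)
  assert (lookup env 2 = "x")   (* outermost binding *)
```

Note that the cheap, persistent extend is part of why lists win here: entering a scope is a single cons, and closures can capture the current environment by reference with no copying.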
One of the MIT Lisp Machines used pairs (frame, index) when compiling variables. The binding stack was a list, with each entry being a De Bruijn-indexed list of variables, so for any static context the (frame, index) pair uniquely identified a single variable.
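A hypothetical sketch of that (frame, index) scheme, assuming the binding stack is a list of frames and each frame is itself a list of values (the name lookup_fi is made up for illustration):

```ocaml
(* The binding stack: a list of frames, each frame a list of values.
   Frame 0 is the innermost scope; index 0 is the first variable
   bound in that frame. *)
type 'a frames = 'a list list

(* Two nested linear lookups, but both coordinates are fixed at
   compile time for any static context. *)
let lookup_fi (env : 'a frames) (frame : int) (index : int) : 'a =
  List.nth (List.nth env frame) index

let () =
  (* e.g. an inner scope binding a and b over an outer scope binding c *)
  let env = [ ["a"; "b"]; ["c"] ] in
  assert (lookup_fi env 0 1 = "b");  (* inner frame, second variable *)
  assert (lookup_fi env 1 0 = "c")   (* outer frame, first variable *)
```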
Oh, very neat explanation.
Replied to the wrong comment. Sorry.
Neat, never heard of De Bruijn indices, but it sounds pretty close to lexical addressing. Anyone know if there's a relation between the two?
Hadn’t heard of lexical addressing, but it looks similar. The intent of De Bruijn indices is that they simplify compiler implementation because you don’t have to worry about names conflicting as you compose subtrees, nor do you have to do any substitution. It looks as though lexical addressing is pretty much equivalent in that respect.
It’s one of those things that become obvious as soon as you learn about them, so they’re often left unexplained for generations of engineers to come. This post is very valuable in this regard, as it explains one thing well in a simple, hands-on manner. Instantly linkable!
Measuring the performance of data structures can be a lot of fun and can totally get you hooked for long periods of time. You just have to be very careful when stating where the crossover points between different structures are. That said, it’s still worth it to measure for the systems you’re working on.
Here’s a report showing how the performance of C++ vector implementations on various platforms, with different element sizes, can be very hard to predict. The conclusion states,
There is too much complexity. The cache hierarchies in the hardware, the operating system used, the C library and the standard library for the language used; all of them conspire to introduce effects that may even invert the results that you could expect.
This reminded me of a post about how CoreFoundation’s Array isn’t actually an array. Whether the data structure morphs based on access pattern, or just on size, I have no idea.