I appreciate the detail you’ve gone into in presenting the concepts. But, some honest feedback: I found it very confusing. I kept waiting for “why shouldn’t I bump from the bottom? I’ve been doing it for years just fine,” and when I got to the assembly code, I stopped reading:
Can bumping downwards do better?
Do better than what? I don’t understand what the problem is. A few conditional branches? Modern CPUs are incredibly efficient, with very deep pipelines that do speculative execution. A couple conditional branches is nothing.
But again, maybe I’m misreading or misunderstanding. In general, though, unless your conditional branches are in the inner loop of an N^2 operation that happens every frame of a simulation, it just isn’t worth worrying about. Even if you allocate millions of objects (tens of millions? hundreds of millions?) you still probably couldn’t measure the difference.
I got hopeful when I reached the benchmarks section. Aha, now you’ll discover that this doesn’t really affect the overall execution time! But… no, you’re measuring the improvement for each allocation. Sure, that’s nice, but allocations in general don’t happen enough to impact overall execution time, unless you’re talking about large-scale allocations like string allocations throughout an entire codebase. And in those cases, the performance gain from switching from malloc to a small block allocator is so great that the timing difference of a few conditionals might be a rounding error.
Not trying to be negative here, fwiw.
The author of a bump allocation library found that bumping downwards improves allocation throughput by 19% compared to bumping upwards. Why is that not significant? A bump allocator may not be a bottleneck for you, but it is a bottleneck for some workloads.
Of course it’s significant. I’m not trying to minimize the contribution. I am saying, as someone who has experience with a variety of large codebases, that when you profile to find out where the time is spent, the allocator itself is almost never the issue except for strings. So a 19% throughput improvement in the allocator would translate to almost no overall time savings.
To put it differently, pick any random section of your codebase. You can improve that by 19%. But unless that section is actually the bottleneck at runtime, there will be no difference.
That doesn’t mean it’s a waste of time. For example, I learned more about Rust from reading this.
Also, I didn’t realize that he was the author of a library. Yes, I agree that library authors should generally try to make their libraries more efficient if it’s easy to do. I was just surprised that this particular workload would make a difference at all. It sounded like this code was picked out of thin air, but that was likely where I misread.
A very interesting dig into the details, but I wonder how specific it is to x86-family code. Do ARM or other targets also benefit from this?
It should: bumping downwards wins mostly by requiring fewer overflow checks. If anything, this probably makes less of a difference on most x86 chips, due to smarter branch prediction.
It should even help on non-pipelined chips (where the relative cost of branching is smaller because everything is slower) just by having fewer instructions.
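To make the “fewer overflow checks” point concrete, here is a minimal sketch of the two fast paths. This is not the library’s actual code; the function names and the toy address range are made up, and plain `usize` addresses stand in for real pointers. The key asymmetry: rounding a pointer *up* to an alignment boundary is an addition that can overflow and so needs its own check, while rounding *down* is a pure mask that cannot overflow.

```rust
/// Bump upwards: align-up can overflow, the size addition can overflow,
/// and then there is the capacity check -- three branches in total.
fn bump_up(ptr: usize, end: usize, size: usize, align: usize) -> Option<usize> {
    // Round `ptr` up to `align`; the addition may overflow, so check it.
    let aligned = ptr.checked_add(align - 1)? & !(align - 1);
    // Adding the allocation size may also overflow: a second check.
    let new_ptr = aligned.checked_add(size)?;
    // Finally, the capacity check against the end of the region.
    if new_ptr > end {
        None
    } else {
        Some(aligned) // allocation lives at `aligned`; bump pointer becomes `new_ptr`
    }
}

/// Bump downwards: one overflow check for the subtraction, then aligning
/// down is a branch-free mask -- only two branches in total.
fn bump_down(ptr: usize, start: usize, size: usize, align: usize) -> Option<usize> {
    // Subtracting the size may underflow, so check it.
    let new_ptr = ptr.checked_sub(size)?;
    // Rounding down to `align` is just a mask: it cannot overflow.
    let new_ptr = new_ptr & !(align - 1);
    // Capacity check against the start of the region.
    if new_ptr < start {
        None
    } else {
        Some(new_ptr) // allocation lives at `new_ptr`, which is also the new bump pointer
    }
}

fn main() {
    // Toy 64-byte region at addresses 0x100..0x140.
    assert_eq!(bump_down(0x140, 0x100, 16, 8), Some(0x130));
    assert_eq!(bump_up(0x100, 0x140, 16, 8), Some(0x100));
    // Exhaustion: asking for more than remains fails in both schemes.
    assert_eq!(bump_down(0x110, 0x100, 32, 8), None);
    println!("ok");
}
```

Counting branches this way also explains the last comment above: even on a chip with no branch prediction at all, the downward path simply executes fewer instructions.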