And yet none of these options include what most good C libraries do, which is let the programmer worry about allocation.
That’s not really a good fit for a high-level language, nor for exposing functionality that may need to do allocation internally. I do think that the module approach (where the programmer specifies the representation) is morally close.
Wait, why do we want programmers to worry about allocation? Isn’t that prone to error and therefore best automated?
Because the programmer theoretically knows more about their performance requirements and memory system than the library writers. There are many easy examples of this.
Theoretically, yes. In practice, it is an enormous source of bugs.
In practice, all programming languages are enormous sources of bugs. :)
But, here, from game development, here are reasons not to rely on library routines:
Being able to audit allocations and deallocations
Knowing that, at level load, slab allocating a bunch of memory, nooping frees, and rejiggering everything at level transition is Good Enough(tm) and will save CPU cycles
Having a frame time budget (same as you’d see in a soft real-time system) where GCing or even coalescing free lists takes too long
Knowing that some library (say, std::vector) is going to be doing lots of little tiny allocations/deallocations and that an arena allocator is more suited to that workload (a sketch of that pattern follows below)
Like, sure, as a dev I don’t like debugging these things when they go wrong–but I like even less having to rewrite a whole library because they don’t manage their memory the same way I do.
This is also why good libraries let the user specify file access routines.
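For concreteness, a minimal C sketch of that pattern: a library-style allocator interface that the application fills in, wired up to a trivial level-load arena whose free is a no-op. The struct and function names (lib_allocator, arena_alloc) are hypothetical, not any particular library’s API.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* The shape many C libraries expose: the application hands in its own
 * alloc/free callbacks and the library never touches malloc directly.
 * (Names here are hypothetical, not any particular library's API.) */
typedef struct {
    void *(*alloc)(size_t size, void *userdata);
    void  (*free)(void *ptr, void *userdata);
    void  *userdata;
} lib_allocator;

/* A slab grabbed at level load: bump-allocate out of it, make free a
 * no-op, and reclaim the whole thing at level transition. */
typedef struct {
    unsigned char *base;
    size_t         used;
    size_t         capacity;
} arena;

static void *arena_alloc(size_t size, void *userdata) {
    arena *a = userdata;
    size_t aligned = (size + 15u) & ~(size_t)15u;  /* keep 16-byte alignment */
    if (a->used + aligned > a->capacity) return NULL;
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

static void arena_free(void *ptr, void *userdata) {
    (void)ptr; (void)userdata;                     /* intentionally a no-op */
}

int main(void) {
    arena a = { malloc(1 << 20), 0, 1 << 20 };
    if (!a.base) return 1;
    lib_allocator alloc = { arena_alloc, arena_free, &a };

    /* The application, not the library, decides the policy. A library
     * call would go through alloc.alloc / alloc.free; here we just
     * exercise the callbacks directly. */
    void *tmp = alloc.alloc(100, alloc.userdata);
    if (tmp) memset(tmp, 0, 100);
    alloc.free(tmp, alloc.userdata);               /* costs nothing */

    a.used = 0;        /* level transition: reclaim everything at once */
    free(a.base);
    return 0;
}
```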
It’s not the allocation that’s error-prone, it’s the deallocation.
And not even the deallocation at time of writing. The problems show up ten years later with a ninja patch that works and passes tests but fails the allocation in some crazy way. “We just need this buffer over here for later….”
How would a library take control of deallocations without also taking control of the allocations?
As I understand, a library does not allocate and does not deallocate. All users are expected to BYOB (Bring Your Own Buffer).
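A minimal sketch of that BYOB shape in C (the function is hypothetical; the size-query-then-fill pattern is the same one snprintf-style interfaces use): the library writes into a caller-supplied buffer and reports how many bytes it needed, so both allocation and deallocation stay with the application.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical BYOB-style call: never allocates, writes at most
 * buf_len bytes, and returns the number of bytes it needs. */
size_t encode_thing(const char *input, char *buf, size_t buf_len) {
    size_t needed = strlen(input) + 3;         /* '[' + input + ']' + NUL */
    if (buf && buf_len >= needed) {
        buf[0] = '[';
        memcpy(buf + 1, input, needed - 3);
        buf[needed - 2] = ']';
        buf[needed - 1] = '\0';
    }
    return needed;
}

int main(void) {
    /* Ask how much space is needed, then bring our own buffer. */
    size_t n = encode_thing("hello", NULL, 0);
    char buf[64];
    if (n <= sizeof buf) {
        encode_thing("hello", buf, sizeof buf);
        puts(buf);                             /* prints: [hello] */
    }
    return 0;
}
```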
In which case, it really doesn’t matter (in this context) whether it’s allocation or deallocation that’s the error-prone part. The library is leaving both up to the application anyway.
Yeah, we saw what that’s like with MPI. Those bad experiences led to languages like Chapel, X10, ParaSail, and Futhark. Turns out many app developers would rather describe their problem or a high-level solution instead of micromanage the machine.
I 100% sympathize from the perspective of a scientist… But most of computer and program design since the 1950s has been computer engineering, which includes the uncertain art of choosing tradeoffs between perfect science and ugly practical needs. This case is no different, even when we have billions of transistors at our command.
More specifically, what this article discusses is a trade-off in GPU computation overhead vs. aggregate performance. This is an optimization problem. The tradeoffs that make sense now are not the ones that made sense ten years ago, and will not be the ones that make sense ten years from now when the balance of CPU computation speed vs memory bandwidth and GPU computation speed vs CPU<->GPU transfer speed is different.
So what it sounds like, without being critical, is that the compiler writer needs to step back from writing compilers, consider this problem as a more abstract balance of trade-offs, and consider their goals to see where they fall in the spectrum of options. Then go back to writing compilers with that goal in mind.
You can always change the compiler as hardware changes. That’s the point of a compiler - that you can put local, hardware-specific information in it, and then change it as hardware changes, without changing the code that uses the compiler.
I thought the punchline was macros but alas it was map and reduce which still relies on compiler magic. If it was macros then the programmer could decide on the threshold themselves.
There is nothing that prevents a programmer from providing a module that implements map and reduce with some threshold mechanism. It’s as flexible as macros in that regard.

So I can reflect over the structure and count elements to decide how many can be inlined before recursing?
Yes. The only thing that is missing from the vector package is that there is no dynamic value exposing the size of the vector, so you’d have to roll your own. However, you’d have to actually produce code that performs the branch dynamically, and then depend on the compiler doing constant-folding to remove the branch (but this is pretty much guaranteed to work).

It’s certainly not fully as powerful as Lisp-style macros, but good enough for this purpose.
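For illustration, a rough C analogy of that mechanism (this is not the vector package, and the names and threshold are made up): the branch on the size is written as ordinary runtime code, and when the size is a compile-time constant, inlining plus constant folding should leave only one arm.

```c
#include <stddef.h>

#define INLINE_THRESHOLD 8   /* made-up cutoff for the "small" strategy */

/* Small sizes: a plain sequential loop. */
static double sum_small(const double *xs, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) acc += xs[i];
    return acc;
}

/* Stand-in for the strategy you want above the threshold,
 * here a pairwise/recursive reduction. */
static double sum_big(const double *xs, size_t n) {
    if (n <= INLINE_THRESHOLD) return sum_small(xs, n);
    size_t half = n / 2;
    return sum_big(xs, half) + sum_big(xs + half, n - half);
}

/* The branch is dynamic as written; when n is a constant at the call
 * site, an optimizing compiler will typically fold it away and keep
 * only one arm. */
static inline double sum(const double *xs, size_t n) {
    return (n <= INLINE_THRESHOLD) ? sum_small(xs, n) : sum_big(xs, n);
}

int main(void) {
    double xs[16];
    for (int i = 0; i < 16; i++) xs[i] = i;
    return sum(xs, 16) == 120.0 ? 0 : 1;   /* 0 + 1 + ... + 15 = 120 */
}
```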