The zlib-rs implementation relies heavily on instructions that are specific to your CPU. To make use of these instructions, and let the compiler optimize with the assumption that the instructions will exist, it is important to pass the target-cpu=native flag
Wouldn’t it be more appropriate to test whatever most consumers will actually run, that is, whatever is present in their distribution? Maybe x86_64-v3. Or would it be best for the library to determine the capabilities at runtime and use AVX-512 or whatnot if available? I believe this is what optimised math libraries such as OpenBLAS do.
In any case, was zlib-ng compiled with the same march=native flag on the benchmark machine?
Wouldn’t it be more appropriate to test whatever most consumers will actually run, that is, whatever is present in their distribution?
Picking the right target to benchmark is tricky. I think that the one I chose is relevant at least.
Most consumer applications still use stock zlib, which is roughly 2X slower than zlib-rs/zlib-ng because it does not make good use of modern CPU features like SIMD. Either compression performance is just not that important for what they do, or they are not aware that better options exist. The places where performance really matters are, e.g., server racks, where compiling for the specific target (and the performance boost that entails) is worth it.
I just picked my own machine as a pretty capable one, but not exactly top of the line (e.g. it does not support AVX-512; also, AVX-512 is tricky: operations on wider registers are not always faster, it really depends). The AVX2 implementation at least shows that we can create a capable Rust implementation.
Or would it be best for the library to determine the capabilities at runtime and use AVX-512 or whatnot if available?

We also do perform runtime feature detection, but it is slightly slower than when the supported CPU features are statically known. This thread has some extra info: https://hachyderm.io/@decathorpe@mastodon.social/113028060206279172
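The usual shape of that dispatch, as a minimal sketch (the function names and kernel bodies here are hypothetical, not taken from zlib-rs):

```rust
// Sketch of runtime feature dispatch on x86_64. A real AVX2 kernel would
// use core::arch::x86_64 intrinsics; a scalar loop stands in for brevity.

pub fn checksum(data: &[u8]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: we just verified at runtime that AVX2 is available.
            return unsafe { checksum_avx2(data) };
        }
    }
    checksum_scalar(data)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn checksum_avx2(data: &[u8]) -> u32 {
    checksum_scalar(data)
}

fn checksum_scalar(data: &[u8]) -> u32 {
    data.iter().fold(0, |acc, &b| acc.wrapping_add(u32::from(b)))
}
```

The runtime branch at the dispatch point, and the fact that the feature-gated kernel cannot be inlined into callers that don’t know the feature is present, is roughly where the small overhead relative to a static build comes from.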
In any case, was zlib-ng compiled with the same march=native flag on the benchmark machine?
Yes. Specifically, that means that it does not need to perform runtime CPU feature detection and that SIMD intrinsics get properly inlined. I have confirmed this with perf/objdump.
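For reference, the Rust-side counterpart of march=native is the target-cpu codegen flag. Two common (purely illustrative) ways to pass it:

```
# one-off, via an environment variable
RUSTFLAGS="-C target-cpu=native" cargo build --release

# or persistently, in .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]
```

With either of these, the compiler may assume features like AVX2 unconditionally, which is what allows the runtime detection to be skipped and the intrinsics to be inlined.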
Note that the number of instructions is structurally much higher for Rust code, even when that is not reflected in the wall time. Most of this increase is bounds checks.
How did the author come to this conclusion? Is this just a guess, or did they actually instrument the code?
I ran some more experiments a couple of days ago that investigate the impact of bounds checks and overflow arithmetic checks:

https://gist.github.com/folkertdev/23a92e853eb78e9a66829b71a276bd4b#bounds-checks

That shows about a 10% reduction in instructions executed when bounds checks are disabled. Afterwards the instruction count is still higher than the C version though, so there are other aspects of Rust codegen and our implementation that make the instruction count higher.
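For anyone who wants to reproduce this kind of measurement: instruction counts like these typically come from hardware performance counters, e.g. via perf (the binary path below is just a placeholder):

```
perf stat -e instructions,branches,branch-misses ./target/release/my-benchmark
```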
Great investigation! Very few people publishing benchmarks actually go the distance with this sort of thing.

Quick question as a guy who didn’t major in CS: with enough compile-time type checking, isn’t it possible to prove that bounds won’t be exceeded and thereby remove the runtime checks?
It depends :)

E.g. in cases where you do `array[1] + array[2]`, only one bounds check is needed: `array.len() >= 3` (see the sketch below). Furthermore, Rust iterators and std functions are smart about bounds checks, and our custom SIMD logic lifts bounds checks out of inner loops.

But you can see from those measurement results that the instruction count is still quite significant, and that the impact on runtime is so close to zero that spending more effort on removing bounds checks is not actually worth the compile time and implementation effort.
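A minimal sketch of that lifting pattern (hypothetical function, not taken from the zlib-rs source):

```rust
fn sum_two(array: &[u32]) -> u32 {
    // This single explicit check dominates both accesses below, so the
    // optimizer can drop the per-index bounds checks (and their panic
    // branches) from the generated code.
    assert!(array.len() >= 3);
    array[1] + array[2]
}
```

Without the assert, each indexing operation would carry its own check and panic branch.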
It’d be interesting to compare to older or weaker (cell-phone?) CPUs that might not have such insane branch predictors… Then again, embedded CPUs might not have a branch predictor at all.
Future work for someone else maybe, since you have already spoken eloquently on how hard it is to benchmark for all use cases. :-)
Not really. Sometimes, yes. Often, the CPU itself can just do it.

A lot of the time, not easily. You need to bound every single loop, and every kind of recursion, etc., to prove that, which is… Usually it leads to a language developers do not want to use.

It exists though!
For the sake of notifications, people seem to have replied to https://lobste.rs/s/zyketc/current_zlib_rs_performance#c_hd857c instead of this for some reason.
(@pushcx or someone uh I definitely replied to https://lobste.rs/s/zyketc/current_zlib_rs_performance#c_byiw8n and the reply got attached to this post instead – is this a bug?)
I actually noticed this on a post I made the other day; it doesn’t seem to be a you thing.
seems to be a bug: https://github.com/lobsters/lobsters/issues/1313
I see it right in this conversation.
Child comments of parent comments with siblings seem to be visually attached to the wrong parent sibling, but only sometimes.
Maybe folks should consider one of those “languages developers do not want to use” (per @Diana) next time LOL
If both the index and the array length are statically known, then yes, at least usually. The kind of "language developers do not want to use" that supports this overtly in the type system is the family of dependently typed languages, such as Idris.
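Rust already gives a small taste of this without dependent types: with a fixed-size array, both the length and a constant index are compile-time facts. A hypothetical illustration:

```rust
fn first_two(a: &[u32; 8]) -> u32 {
    // The length (8) and the indices (0, 1) are compile-time constants,
    // so no runtime bounds check is emitted; a constant out-of-range
    // index would be rejected at compile time instead.
    a[0] + a[1]
}
```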
Well that’s interesting because Idris is exactly the language I have started studying of late!
(I think it’s going to be a thing. And I have a knack for detecting this, for some reason…)
I wonder what the impact on energy usage is.