This is a great analysis of the benefits and drawbacks of CachyOS. However, I would be very careful to conclude anything about x86_64-v3 from these tests. It’s known that increasing the optimization level from -O2 to -O3 can harm performance; it’s one of the reasons -O2 is generally recommended for release builds. We also don’t know if CachyOS is compiling with the same GCC version as Arch.
It would be interesting to see a test of x86_64-v3 in isolation, where all the packages are compiled from source, once with -O2 -march=x86-64-v3 and once with -O2 -march=x86-64.
I don’t have a link handy, but I know this was tested in isolation on Ubuntu and found to not be worth it. Such benchmarking takes a lot of time though (when done properly) and it’s difficult to reach conclusions.
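To make the shape of that isolated test concrete, here’s a minimal sketch of what I have in mind (the file name, kernel, and flags are my own illustration, not taken from the article or from the Ubuntu test): the same code built twice, with everything held constant except -march.

```c
/*
 * Hypothetical comparison, everything identical except the ISA level:
 *
 *   gcc -O2 -march=x86-64    saxpy.c -o saxpy_baseline
 *   gcc -O2 -march=x86-64-v3 saxpy.c -o saxpy_v3
 */
#include <stddef.h>

/* A loop the compiler can auto-vectorize; -march=x86-64-v3 lets it use
 * AVX2 and FMA, while plain -march=x86-64 restricts codegen to SSE2. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```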
The problem is that the ISA defines theoretical best-case performance but the microarchitecture defines best real-world performance. Using newer instructions can make things faster, but sometimes they’re slow microcode paths or under-provisioned pipelines added for compatibility. For example, lower-end chips often crack vector ops into two halves and so get worse performance in cases where you’re doing a lot of lane masking than if you’d used smaller vectors and skipped some operations entirely. John Baldwin wrote a bunch of memcpy routines a few years back with various different SSE and AVX combinations and there wasn’t a single one that performed better than the others across the board. Some gave a big speedup on some Intel cores and a slowdown on some AMD cores and vice versa. Some performed really well on Atom but poorly on Xeon and vice versa. There was a much bigger difference in performance between different microarchitectures running the same implementation than there was between implementations.
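To make the lane-masking point concrete, here’s a hedged sketch of my own (not any of the memcpy routines mentioned above): a 256-bit loop that handles the tail with a lane mask, next to a 128-bit loop with a scalar tail. On a core that cracks 256-bit ops into two 128-bit halves, the nominal advantage of the first version can disappear.

```c
/* Build with something like: gcc -O2 -mavx2 -c scale.c */
#include <immintrin.h>
#include <stddef.h>

/* 256-bit AVX version: handles the tail with a per-lane mask. */
void scale_avx_masked(float *x, size_t n, float k) {
    __m256 vk = _mm256_set1_ps(k);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), vk));
    if (i < n) {
        /* Mask on the lanes that still have data: lane is active if its
         * index is less than the number of remaining elements. */
        __m256i idx  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        __m256i rem  = _mm256_set1_epi32((int)(n - i));
        __m256i mask = _mm256_cmpgt_epi32(rem, idx);
        __m256 v = _mm256_maskload_ps(x + i, mask);
        _mm256_maskstore_ps(x + i, mask, _mm256_mul_ps(v, vk));
    }
}

/* 128-bit SSE version: smaller vectors, scalar tail, no masking at all. */
void scale_sse(float *x, size_t n, float k) {
    __m128 vk = _mm_set1_ps(k);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(x + i, _mm_mul_ps(_mm_loadu_ps(x + i), vk));
    for (; i < n; i++)
        x[i] *= k;
}
```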
You’re much more likely to get a speedup by shipping LLVM IR and tuning for the specific core at install time, but even that requires a good cost model for the CPU.
For example, lower-end chips often crack vector ops into two halves and so get worse performance in cases where you’re doing a lot of lane masking than if you’d used smaller vectors and skipped some operations entirely.
This isn’t always true; for example, Zen 4 splits most AVX-512 ops into two 256-bit ops, but it is still just as fast as the full-width implementations Intel processors have, because it can run them at full speed and doesn’t need to throttle its clock down, while still getting the instruction density and register pressure benefits.
lower-end chips often crack vector ops into two halves and so get worse performance in cases where you’re doing a lot of lane masking than if you’d used smaller vectors and skipped some operations entirely
You still save on decode and I$, which may be significant.
microcode
Indeed. The most egregious recent example of this: Zen 2 microcoded BMI2 ops, taking potentially hundreds of cycles each.
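For anyone who hasn’t run into this: pdep/pext are the BMI2 instructions in question. A hypothetical sketch of what code relying on them looks like, next to the scalar loop that can actually be faster on those cores (function names are mine, build the first one with -mbmi2):

```c
#include <immintrin.h>
#include <stdint.h>

/* BMI2 path: a single pdep instruction on cores that implement it fast. */
uint64_t deposit_bmi2(uint64_t src, uint64_t mask) {
    return _pdep_u64(src, mask);
}

/* Portable fallback: deposit the low bits of src into the set positions of
 * mask, lowest to highest. On Zen 1/Zen 2 this plain loop can beat the
 * microcoded pdep, whose latency depends on the number of set bits. */
uint64_t deposit_scalar(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & -mask;   /* lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                 /* clear that bit */
    }
    return result;
}
```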
In any case, the results given by the OP could still be valid and interesting for their particular uarch, if not for the smoking gun they mention only briefly at the end: -O2 vs -O3.
much more likely to get a speedup by shipping LLVM IR and tuning for the specific core at install time, but even that requires a good cost model for the CPU
Yeah, I don’t really trust llvm to do a great job here. Certainly, better than with a generic target wrt scheduling and insn selection, but not great. (CF.)
Really, the problem is that no one cares about performance on clients. On servers, HPC, etc., people do care and they do tune (and they generally have fairly homogeneous hardware, so not many practical annoyances). It is interesting to consider the success stories we have seen, though. Widespread JITs for Java and JS are the most obvious one, of course. But also see e.g. games consoles and Apple mobile devices. And although you pay dearly for it in the PLT, GNU_IFUNC is … sort of a thing and … sorta has some users.
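For anyone unfamiliar, a minimal sketch of the IFUNC mechanism (hypothetical names, GNU toolchain on an ELF/glibc target assumed): the resolver runs during relocation, before constructors, and the chosen implementation is what the PLT slot ends up pointing at.

```c
#include <stddef.h>

/* Two implementations; a real AVX2 variant would use 256-bit loads/stores,
 * this one is just a placeholder so the sketch compiles. */
static void *copy_generic(void *dst, const void *src, size_t n) {
    char *d = dst;
    const char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}
static void *copy_avx2(void *dst, const void *src, size_t n) {
    return copy_generic(dst, src, n); /* placeholder */
}

/* Resolver: called by the dynamic linker, returns the chosen implementation. */
static void *(*resolve_copy(void))(void *, const void *, size_t) {
    __builtin_cpu_init(); /* needed here: resolvers run before constructors */
    return __builtin_cpu_supports("avx2") ? copy_avx2 : copy_generic;
}

/* The exported symbol: calls go through the PLT entry the resolver filled. */
void *my_copy(void *dst, const void *src, size_t n)
    __attribute__((ifunc("resolve_copy")));
```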