I’m a bit sad the author doesn’t address that SIMD and vectorization were discussed at the end of the original post and on the /r/rust subreddit. The original also goes into considerable detail about the context in which the code is used and its interactions with the outside world.
http://blog.adamperry.me/rust/2016/07/24/profiling-rust-perf-flamegraph/#update-72516
With followups here: https://github.com/rust-bio/rust-bio/pull/76 https://www.reddit.com/r/rust/comments/4udxmr/rust_performance_a_story_featuring_perf_and/d5plly1/ (and children)
The author clearly states that:
“BTW I want to be clear that I have no idea what context this code is used in or whether there are higher level code changes that would make a bigger difference, I just want to focus on optimising the snippet from the original post.”
Against this backdrop, I find “finishing the job” a little much. The original post was not meant to be a tutorial on chasing the last bit of performance. This post does literally finish the job, but that level of detail would have been unnecessary in the original.
Still, a nice introduction to SIMD in C++.
Kind of a bummer that it goes directly to C++. I was hoping to see how to finish the job in Rust.
TBH it would look almost exactly the same. The intrinsics map 1-1 with assembly instructions and the slow loops are nothing fancy, so it would be identical code but with Rust keywords instead of C keywords.
Actually it would look like bytecount.
Thanks, fixed the link.
It’s kind of weird to be using POPCNT for this. You could simply subtract the comparison mask from a vector of 16 byte-wide counters initialized to 0, in effect adding 1 every time there’s a hit. You could do this at most 255 times before a counter could overflow, so you’d have to refresh the counters at least that often, but this would still be cheaper than two POPCNTs and the associated register transfers. Furthermore, this approach would scale better to AVX-{2, 512}.
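A minimal sketch of that counting trick with SSE2 intrinsics from std::arch (assuming x86_64; the names here are mine, and the refresh interval of 255 keeps the byte-wide lane counters from overflowing):

#[cfg(target_arch = "x86_64")]
unsafe fn count_eq_sse2(haystack: &[u8], needle: u8) -> u64 {
    use std::arch::x86_64::*;
    let needles = _mm_set1_epi8(needle as i8);
    let mut total = 0u64;
    // process 255 blocks of 16 bytes per round so no lane counter overflows
    for round in haystack.chunks_exact(16 * 255) {
        let mut counts = _mm_setzero_si128();
        for block in round.chunks_exact(16) {
            let v = _mm_loadu_si128(block.as_ptr() as *const __m128i);
            // cmpeq yields 0xFF per matching byte; subtracting it adds 1
            counts = _mm_sub_epi8(counts, _mm_cmpeq_epi8(v, needles));
        }
        // horizontal sum of the 16 lane counters via PSADBW
        let sums = _mm_sad_epu8(counts, _mm_setzero_si128());
        total += _mm_cvtsi128_si64(sums) as u64
            + _mm_cvtsi128_si64(_mm_unpackhi_epi64(sums, sums)) as u64;
    }
    // the tail (haystack.len() % (16 * 255) bytes) still needs a scalar pass
    total
}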
Alternatively you could use PMOVMSKB to create an int with a single bit per element, and use POPCNT just once.
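And a sketch of the PMOVMSKB variant, reusing needles and total from the sketch above; no counter refresh is needed, at the cost of one POPCNT per block:

for block in haystack.chunks_exact(16) {
    let v = _mm_loadu_si128(block.as_ptr() as *const __m128i);
    // PMOVMSKB: one bit per byte of the comparison mask
    let mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, needles));
    total += mask.count_ones() as u64; // a single POPCNT per 16 bytes
}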
I will have to try those out!
One optimization the author doesn’t address that can be done in plain Rust is to get rid of the branches in the loop (there are two: one for the bounds check caused by indexing, the other for the if). In fact, I introduced a small optimization to iterators that does the trick (in nightly and possibly beta), so you can simply write:

let mut count = 0;
for x in &bwt[(i * self.k) + 1 .. r + 1] { // look ma, no branches!
    count += (*x == a) as usize; // boolean false is 0 and true is 1
}
count

This gets inlined into roughly:
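(A sketch under the assumption that the slice iterator desugars to raw-pointer stepping; the exact code the compiler emits may differ.)

fn count_eq(slice: &[u8], a: u8) -> usize {
    // the only range check happened when the slice was created, outside this loop
    let mut count = 0;
    let mut ptr = slice.as_ptr();
    let end = unsafe { ptr.add(slice.len()) };
    while ptr != end {
        // (*ptr == a) as usize lowers to a flag read plus an add: no branch
        count += unsafe { *ptr == a } as usize;
        ptr = unsafe { ptr.add(1) };
    }
    count
}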
This is the naive way, and also the fastest for small slices. If you want to go even faster, the bytecount crate has the currently fastest implementation, and is used by both ripgrep and xi. If you are on nightly, you can even enable SSE / AVX for best-in-class performance.
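For reference, the call would be just (assuming the crate’s count function, which takes a byte slice and the needle):

let count = bytecount::count(&bwt[(i * self.k) + 1 .. r + 1], a);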
That’s not actually removing the branch in the loop condition, and I didn’t write about removing the inner loop branch because I thought the compiler was able to do that already. I retested that and apparently not!
Saying it’s branchless because “boolean false is 0 and true is 1” doesn’t really tell the full story; in particular, there are some architectures where that would still be implemented with a branch (MIPS, I’m pretty sure).
The trick is that CMP on x86 is like a SUB that discards its result, and SUB sets a number of flags depending on the result, including the Zero Flag, which gets set to 1 when the result is 0 and to 0 otherwise. So if *x == a, then *x - a == 0 and ZF gets set to 1, and you can then use SETZ to read that out and add it to count.
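Spelled out (a hypothetical lowering of count += (*x == a) as usize, not verbatim compiler output; assume al holds a, rdi points at x, and rsi is count):

cmp   al, byte ptr [rdi]   ; the SUB that discards its result; sets ZF on equality
sete  dl                   ; SETZ: dl = 1 if ZF is set, 0 otherwise
movzx rdx, dl
add   rsi, rdx             ; count += 0 or 1, with no branch anywhere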
OK, I was overgeneralizing – it appeared to remove the branch when I was doing the Rust PR (and yes, I checked the assembly) on x86_64 (I didn’t have other platforms to test this at the time).
Also, I didn’t mean the branch in the loop condition – that one branch is generally unavoidable. I meant the out-of-bounds check that is hoisted out of the loop by iterating over a slice.
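To illustrate with the variables from the snippet above, the two forms side by side:

// indexed: each bwt[j] re-checks bounds, adding a panic branch per iteration
let mut count = 0;
for j in (i * self.k) + 1 .. r + 1 {
    count += (bwt[j] == a) as usize;
}

// slice iteration: one range check when the slice is made, none in the loop
let mut count = 0;
for x in &bwt[(i * self.k) + 1 .. r + 1] {
    count += (*x == a) as usize;
}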