It’s kind of weird to be using POPCNT for this. You could simply subtract the mask from a vector of 16 byte-wide counters initialized to 0, in effect adding 1 every time there’s a hit (a matching lane is 0xFF, i.e. -1, so subtracting it increments the counter). Each 8-bit counter can take at most 255 increments before it wraps, so you’d have to flush the counters into a wider accumulator at least that often, but this would still be cheaper than two POPCNTs and the associated register transfers. Furthermore, this approach would scale better to AVX-{2, 512}.
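A rough sketch of the subtract-the-mask idea, assuming the job is to count bytes equal to some value in a buffer (the post's actual loop may differ). The function name `count_eq_sub` and its parameters are made up for illustration; only the SSE2 intrinsics are real.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts the bytes in buf[0..len) equal to needle. */
static size_t count_eq_sub(const uint8_t *buf, size_t len, uint8_t needle)
{
    const __m128i target = _mm_set1_epi8((char)needle);
    const __m128i zero = _mm_setzero_si128();
    size_t total = 0;
    size_t i = 0;

    while (len - i >= 16) {
        __m128i counters = zero;

        /* An 8-bit lane wraps after 255 hits, so process at most
           255 blocks of 16 bytes between flushes. */
        size_t blocks = (len - i) / 16;
        if (blocks > 255)
            blocks = 255;

        for (size_t b = 0; b < blocks; b++, i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
            __m128i mask = _mm_cmpeq_epi8(v, target); /* 0xFF (-1) on a hit */
            counters = _mm_sub_epi8(counters, mask);  /* c - (-1) == c + 1 */
        }

        /* Flush: PSADBW against zero sums each group of 8 byte counters
           into one 64-bit lane; add the two lanes to the scalar total. */
        __m128i sums = _mm_sad_epu8(counters, zero);
        total += (size_t)_mm_cvtsi128_si32(sums)
               + (size_t)_mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
    }

    /* Scalar tail for the last len % 16 bytes. */
    for (; i < len; i++)
        total += (buf[i] == needle);

    return total;
}
```

The inner loop is just a load, a compare and a subtract, with nothing leaving the vector registers until the flush. Using PSADBW for the flush is my choice here, not something from the post; any horizontal sum would do.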
Alternatively you could use PMOVMSKB to create an int with a single bit per element, and use POPCNT just once.
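For comparison, a sketch of the PMOVMSKB variant under the same assumption (counting bytes equal to a value). `count_eq_movemask` is a hypothetical name, and `__builtin_popcount` is the GCC/Clang builtin, which only becomes a single POPCNT instruction when the compiler is allowed to target it (e.g. `-mpopcnt`).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts the bytes in buf[0..len) equal to needle. */
static size_t count_eq_movemask(const uint8_t *buf, size_t len, uint8_t needle)
{
    const __m128i target = _mm_set1_epi8((char)needle);
    size_t total = 0;
    size_t i = 0;

    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        /* PMOVMSKB packs the top bit of each byte lane into the low
           16 bits of an int: one bit per element. */
        int bits = _mm_movemask_epi8(_mm_cmpeq_epi8(v, target));
        /* One population count per 16-byte block. */
        total += (size_t)__builtin_popcount((unsigned)bits);
    }

    /* Scalar tail for the last len % 16 bytes. */
    for (; i < len; i++)
        total += (buf[i] == needle);

    return total;
}
```

This still moves a value from a vector register to a general-purpose register on every block, which is the cost the counter approach avoids.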
I will have to try those out!