This makes me nostalgic: I reviewed the PR that implemented this feature: https://github.com/rust-lang/rust/pull/8410
You can just use target-cpu=native if you don’t want to figure out or hardcode your architecture.
-C target-cpu=your_target as a rustc flag to take advantage of CPU-specific instructions. Definitely handy when you know you've got goodies and the binary doesn't need to be portable.
Is there a way to figure out the target? It would be useful when I don’t build on the same machine.
Just use -C target-cpu=native.
Even if I’m not compiling and running on the same processor?
Well, you’ll have problems running on older processors (that don’t support some instruction set that’ll be used in the binary).
I wrote about this and more on my blog
This is interesting! Anyone know if Go has any way to do something similar?
Through build tags. The standard library uses them a lot for system calls; I'm sure there are optimizations for specific architectures in there as well.
Not really. All architecture optimizations in Go (except the ones controlled by GOARM) are activated at runtime. The idea is that binaries should run everywhere. Also, the compiler does not do auto-vectorization like LLVM.
To expand on this, Go binaries can contain instructions that the current processor does not support (like, say, AES acceleration). Support for those instructions is detected at runtime using CPUID. https://golang.org/src/crypto/internal/cipherhw/asm_amd64.s
Works out reasonably well for core stdlib code that has been heavily optimized, like byte searching. Unfortunately it doesn't really solve the problem of making arbitrary Go source code leverage CPU features.
But why? What does this change?
He used to represent a cell with a btree but changed to packing the info (which of these 9 values is still possible) into a bitfield. He says 32 bits, so presumably he ignores 23 bits and stores the 9 true/false possibilities in the remaining 9 bits. Big wins for allocating less, fitting in caches, following fewer pointers.
Solving Sudoku is all about finding which single possibility remains. Popcount is an instruction for counting the number of 1 bits in a register, so it directly optimizes a key operation.
When I specify the architecture, count_ones() is translated to the popcnt instruction. I haven't explored the issue in much detail, but since I use the population count to check whether a cell is solved, and do this a lot, I suspect that this native processor instruction contributes to the speed-up of the program.
This is equivalent to -march=native in GCC; -mcpu=native is the ARM spelling (on x86, GCC treats -mcpu as a deprecated alias for -mtune, which doesn't enable new instructions).