I’m actually surprised. I completely did not expect the smaller allocation to make any difference. Since the only difference should be the initialisation time… that sounds like a lot of time for simple clears. Unless there’s more page faulting than I would expect.
The next improvement then would be to preallocate that slice and clear it instead of making a new one.
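A minimal sketch of what that could look like (illustrative names; sharing one buffer also makes this unsafe for concurrent use):

```go
// Per-call allocation: a fresh, already-zeroed slice every time.
func hasDuplicatesMake(window []byte) bool {
	seen := make([]bool, 256)
	for _, b := range window {
		if seen[b] {
			return true
		}
		seen[b] = true
	}
	return false
}

// Reuse one preallocated buffer and clear it at the start of each call.
var buf = make([]bool, 256)

func hasDuplicatesReuse(window []byte) bool {
	for i := range buf {
		buf[i] = false
	}
	for _, b := range window {
		if buf[b] {
			return true
		}
		buf[b] = true
	}
	return false
}
```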
So the accounting of “allocs” seems weird to me. Here are 4 extra variants:
1 is original, 2 is original with clear instead of make, 3 is slice, 4 is slice with clear instead of make.
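The benchmarks were nothing fancy; a sketch of the shape, reusing the hypothetical hasDuplicatesMake / hasDuplicatesReuse from the previous comment (the allocs column comes from b.ReportAllocs() / -benchmem):

```go
package dupes

import "testing"

var input = []byte("abcdefghijklmnop") // placeholder window

// Run with: go test -bench=. -benchmem
func BenchmarkSliceMake(b *testing.B) { // variant 3: make per call
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hasDuplicatesMake(input)
	}
}

func BenchmarkSliceClear(b *testing.B) { // variant 4: clear a reused slice
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		hasDuplicatesReuse(input)
	}
}
```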
I feel like this will need a deep dive to understand how make is better optimised than a loop clearing the values. And why does benchmark 4 have the same number of allocs as 3?

And why is the difference so large for clearing? Clearing 256 bytes of data should be trivial, unless Go doesn’t optimise that loop at all…
Looks like Go is silly at (not) optimising trivial loops. make for the slice becomes DUFFZERO $276, which I assume is unrolled. Clearing the same slice becomes a really standard, basic, byte-at-a-time loop, which… why are you like this, Go?

The funny thing is that this actually ends up going against the title of the post. Go can’t optimise a basic slice clear, and a (potentially) memory-wasteful version stays 10x faster, because it’s got known optimisations hardcoded.
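You can see this by dumping the compiler’s assembly (a sketch; the exact lowering depends on Go version and GOARCH):

```go
// Save as clear.go and dump the assembly with:
//   go tool compile -S clear.go
package clearbench

// Non-escaping make of a constant size: the zeroing is emitted
// inline, which shows up as DUFFZERO on amd64 at this size.
func fresh(window []byte) bool {
	seen := make([]bool, 256)
	for _, b := range window {
		if seen[b] {
			return true
		}
		seen[b] = true
	}
	return false
}

var buf = make([]bool, 256)

// The index-based clear compiles to a plain byte-at-a-time loop;
// only the for-range form gets pattern-matched into a memclr call.
func clearBuf() {
	for i := 0; i < len(buf); i++ {
		buf[i] = false
	}
}
```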
The alloc count turns out to not include stack allocations, which makes sense, so the count is equivalent for both the make variant and for a reused global.
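You can check which makes stay off the heap with the compiler’s escape-analysis output (a sketch; the diagnostic wording varies a bit by version):

```go
// Check with: go build -gcflags=-m .
// A non-escaping make is reported as, e.g.:
//   ./x.go:8:14: make([]bool, 256) does not escape
package x

func f(window []byte) bool {
	seen := make([]bool, 256) // stack-allocated here, so not counted as an alloc
	for _, b := range window {
		if seen[b] {
			return true
		}
		seen[b] = true
	}
	return false
}
```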
I thought one of the points of keeping Go a simple language was that they could make exactly these kinds of compiler optimizations?
That also explains why the dumbest thing I could think of in Rust was 3 times faster :-/
Compile times. The compiler team has been up front that they are willing to trade performance for faster compile times.
As of a couple of years ago, the rules for what kinds of functions could be inlined were a great example of this. Only the simplest of functions were eligible.
They kept everything simple. For example, only adopting a register-based calling convention two years ago.
https://www.infoq.com/news/2020/08/go-register-calling-convention/
That hasn’t been my experience. I’m having difficulty writing code that clears a slice without having it be optimized to a runtime call that does vectorized clearing, etc. See https://go.godbolt.org/z/e9hnbcaKd. What code did you use to clear the slice?
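For reference, the shape that gets recognized is the range-clear idiom; a sketch of what the compiler pattern-matches into a runtime memclr call:

```go
// The compiler recognizes this exact loop form and lowers it to a
// single runtime.memclrNoHeapPointers call instead of a byte loop.
func clearSlice(s []bool) {
	for i := range s {
		s[i] = false
	}
}
```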
So both a basic for i:=0... and a for i:=range... with slice[i]=false are slower (10x and 2x respectively) than the DUFFZERO that Go uses for clearing a new local.

It’s possible to exploit the fact that the inputs to the hasDuplicates function are always of a known fixed size, and avoid the extra allocation like this:
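A sketch of the idea (assuming, as in the puzzle input, that every byte in the window is a lowercase letter, so one bit per letter fits in a single uint64):

```go
// hasDuplicates reports whether any byte occurs twice in window.
// Assumes every byte is in 'a'..'z'.
func hasDuplicates(window []byte) bool {
	var seen uint64
	for _, c := range window {
		bit := uint64(1) << (c - 'a')
		if seen&bit != 0 {
			return true
		}
		seen |= bit
	}
	return false
}
```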
Same result, but we use a single uint64 as a bitmap.
I would’ve liked to see [256]bool included, to see whether avoiding the heap allocation entirely helped.
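A sketch of what that variant could look like (hasDuplicatesArray is a hypothetical name; a local array the compiler proves non-escaping lives on the stack and is zeroed on entry):

```go
// Fixed-size array instead of make: no heap allocation as long as
// seen doesn't escape, though the zeroing still happens every call.
func hasDuplicatesArray(window []byte) bool {
	var seen [256]bool
	for _, b := range window {
		if seen[b] {
			return true
		}
		seen[b] = true
	}
	return false
}
```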
That’s similar to what I was trying to test in the other comment. It seems that the whole slice ended up on the stack anyway, even with the size only mentioned in the make arguments.

Yeah, I did a test as well and there’s no difference in performance.
This is a repost.
https://lobste.rs/s/7jtird/neat_xor_trick
Not a repost, actually: just the same AoC problem and solution.
I’m sorry to hear that. I don’t even check; I just count on the duplicate detection to let me know. I would delete the post, but now this thread has comments.
EDIT: ah, good—not a repost.