“It gets slower when I run it on a bigger machine” immediately sounds like contention over ownership of a single cache line or mutex.
Synchronisation costs grow on a server big enough to have multiple sockets, because cache-coherence traffic then has to cross the socket interconnect instead of staying on one die.
As a recovering frontend developer, I can’t quite wrap my head around why that is.
I can understand it for latency, but not throughput.
When cores share a cache line and writes to it can’t be lazily propagated, every write effectively has to “stop the world” for that line: the coherence protocol invalidates the line in every other core’s cache, and each of them has to refetch it before touching it again.
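To make that concrete, here’s a minimal sketch (not from the thread; the struct and counts are my own) of two threads that never touch each other’s counter, yet still fight over the same cache line because the two atomics sit next to each other in memory:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Two logically independent counters that almost certainly land in the
// same 64-byte cache line. Each increment by one thread invalidates the
// line in the other thread's cache, even though the data never overlaps.
struct Contended {
    a: AtomicU64,
    b: AtomicU64, // sits right next to `a` in memory
}

fn run_contended(n: u64) -> (u64, u64) {
    let c = Arc::new(Contended {
        a: AtomicU64::new(0),
        b: AtomicU64::new(0),
    });

    let c1 = Arc::clone(&c);
    let t1 = thread::spawn(move || {
        for _ in 0..n {
            c1.a.fetch_add(1, Ordering::Relaxed);
        }
    });
    let c2 = Arc::clone(&c);
    let t2 = thread::spawn(move || {
        for _ in 0..n {
            c2.b.fetch_add(1, Ordering::Relaxed);
        }
    });
    t1.join().unwrap();
    t2.join().unwrap();

    (c.a.load(Ordering::Relaxed), c.b.load(Ordering::Relaxed))
}

fn main() {
    // The results are always correct; only the throughput suffers,
    // and it suffers more the more coherence domains are involved.
    println!("{:?}", run_contended(1_000_000));
}
```

This is the classic “false sharing” pattern: correctness is unaffected, but the line ping-pongs between caches, and on a multi-socket box each ping-pong crosses the interconnect.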
See also here for something similar, where repr(C) was used specifically to avoid cache-line synchronization every time one of the cores accesses the data structure.
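A minimal sketch of the fix (my own illustration, assuming a 64-byte line size, which holds on most x86-64 parts; some ARM chips use 128): give each hot counter its own cache line via alignment, so writes by one core never invalidate the other’s line. Real code often reaches for `crossbeam_utils::CachePadded` instead of hand-rolling this:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Force each counter onto its own 64-byte cache line. The alignment
// padding costs memory but removes the coherence ping-pong entirely.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

fn run_padded(n: u64) -> (u64, u64) {
    let counters = Arc::new([
        PaddedCounter(AtomicU64::new(0)),
        PaddedCounter(AtomicU64::new(0)),
    ]);

    let mut handles = Vec::new();
    for i in 0..2usize {
        let c = Arc::clone(&counters);
        handles.push(thread::spawn(move || {
            // Each thread only ever writes its own line.
            for _ in 0..n {
                c[i].0.fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }

    (
        counters[0].0.load(Ordering::Relaxed),
        counters[1].0.load(Ordering::Relaxed),
    )
}

fn main() {
    // Each PaddedCounter occupies a full 64-byte line.
    println!("{:?}", run_padded(1_000_000));
}
```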
Thank you! I really appreciate that link. Very good read that taught me something new.