I dunno, I think the takeaway is if you care about performance, use literally anything but Python?
Go famously does not aggressively optimize. It does some simple stuff, but no auto-vectorization to SIMD, for example. Writing out the dumb loop version, I got 110ms, which is faster than all of the Python versions except for basel_chunks, which was 90ms in TFA (could just be hardware differences). Then I hacked together a version with goroutines and it dropped to 60ms. Go should really be considered a performance floor. If you can’t do as well as Go, you should use something faster, because Go itself is using the hardware in pretty much the least optimized manner possible.
This is a very nice empirical breakdown of different techniques. The computation itself is very suitable for JIT-compilation and parallelization with Numba:
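from numba import njit, prange

# parallel=True is what lets prange split the loop across cores
@njit(parallel=True)
def basel_numba(N: int) -> float:
    result = 0.
    for x in prange(1, N):
        result += (1.0 / x) ** 2
    return result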
Running the author’s multicore processing method on my M2 MBP (and timing with IPython’s %timeit):
In : %timeit basel_multicore(N=1_000_000_000, chunk_size=50_000)
409 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and this Numba method:
In : %timeit basel_numba(N=1_000_000_000)
94.8 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
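If anyone wants to sanity-check the result before trusting the timings: the sum should approach π²/6, and since the loop runs over range(1, N) the truncation error should be roughly 1/N. Below is a minimal pure-Python sketch of that check. basel_reference is my own hypothetical name (not from TFA), and it deliberately avoids Numba so it runs anywhere. A parallel reduction adds the terms in a different order, so expect the last few digits to differ from this serial version.

```python
import math


def basel_reference(N: int) -> float:
    """Serial sum of (1/x)**2 for x in 1..N-1, mirroring range(1, N)."""
    result = 0.0
    for x in range(1, N):
        result += (1.0 / x) ** 2
    return result


N = 1_000_000  # kept small on purpose: plain Python is slow here
approx = basel_reference(N)
exact = math.pi ** 2 / 6

# The missing tail, sum of 1/x**2 for x >= N, lies between 1/N and 1/(N-1),
# so the gap to pi**2/6 should be very close to 1/N.
error = exact - approx
print(f"approx={approx:.9f} exact={exact:.9f} error={error:.3e}")
```

The same comparison against math.pi ** 2 / 6 should work for basel_numba, as long as the tolerance is loosened to allow for the reordered floating-point additions.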