Instead of falling back to a single goroutine, the author could’ve used GOMAXPROCS=$(nproc), so that there’s one processor core for every goroutine. The article correctly noted that the measurements were wrong because a lot of goroutines were preempted and not scheduled back in time to finalize the timing measurement after the IO operations were done, probably because there was no free core available.
Does Go guarantee that if you have N procs and N cores, it will distribute one proc to each core with perfect affinity? Would you never end up preempting a proc in favor of another on the same core due to scheduling vagaries at the runtime and OS level?
The Go runtime explicitly does not provide any goroutine/OS thread affinity guarantees unless you call runtime.LockOSThread.
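To make that concrete, here is a minimal sketch of what runtime.LockOSThread does and does not give you. The helper name runLocked is made up for illustration; only runtime.NumCPU, runtime.GOMAXPROCS, runtime.LockOSThread and runtime.UnlockOSThread are real runtime calls. Note that locking ties a goroutine to an OS thread, not to a core: the OS is still free to move that thread between cores, and pinning it would need an OS-specific affinity call that is not shown here. Also, as far as I know, GOMAXPROCS has defaulted to the number of logical CPUs since Go 1.5, so setting it to $(nproc) usually just restates the default.

```go
package main

import (
	"fmt"
	"runtime"
)

// runLocked is a hypothetical helper (not part of the Go runtime).
// It runs fn on a fresh goroutine that stays wired to a single OS thread
// for the duration of the call, then returns fn's result.
func runLocked(fn func() int) int {
	result := make(chan int, 1)
	go func() {
		// From here until UnlockOSThread, this goroutine always runs on the
		// same OS thread, and no other goroutine runs on that thread.
		// This says nothing about which CPU core the thread runs on.
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		result <- fn()
	}()
	return <-result
}

func main() {
	// GOMAXPROCS(0) only queries the current value; it does not change it.
	fmt.Println("logical CPUs:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:  ", runtime.GOMAXPROCS(0))

	// Assumption for illustration: the "work" is a trivial computation.
	sum := runLocked(func() int {
		total := 0
		for i := 1; i <= 10; i++ {
			total += i
		}
		return total
	})
	fmt.Println("sum computed on a locked thread:", sum)
}
```

So even with GOMAXPROCS equal to the core count and every goroutine locked to a thread, a timing measurement can still be delayed by the OS scheduler running something else on that core.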