
That context manager is a nice snippet of code. I’ll be using that for sure!

Small nit:

the rest of the performance improvement is likely due to a significant reduction in cache misses

This might be more readily explained by the fact that there are half as many bytes processed, regardless of cache. Similarly, AVX instructions can process 2x as many fp32 operations compared to fp64.

Fair point. I ended up just rewriting that whole section, fp32 is probably a distraction.
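The bytes-processed point is easy to check directly; a minimal sketch (array size and variable names are mine, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
a64 = rng.random(1_000_000)      # float64 by default: 8 bytes per element
a32 = a64.astype(np.float32)     # 4 bytes per element: half as many bytes to stream

print(a64.nbytes, a32.nbytes)    # prints 8000000 4000000: the fp32 array is exactly half the size
```

Whether the speedup then comes from memory bandwidth, cache footprint, or wider SIMD lanes is exactly the ambiguity being discussed; the size halving itself is unambiguous.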

I thought numpy was supposed to use BLAS and LAPACK under the hood for everything. Apparently it has fallback implementations (that weren’t always using SIMD).

Docs say:

NumPy does not require any external linear algebra libraries to be installed. However, if these are available, NumPy’s setup script can detect them and use them for building

It uses those for more complex linear algebra stuff, but not for simple things like computing means or adding numbers. For those it has its own code (~30% of the code is in C).
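A quick way to see the split in an interpreter (a sketch; `np.add` and `mean` stand in for the "simple things", `numpy.linalg` for the routines documented to dispatch to LAPACK/BLAS):

```python
import numpy as np

x = np.arange(10, dtype=np.float64)

# Elementwise ops and reductions like sum/mean go through NumPy's own
# ufunc machinery (C code inside NumPy itself):
print(isinstance(np.add, np.ufunc))   # True
print(np.add.reduce(x))               # 45.0, same as x.sum()
print(x.mean())                       # 4.5

# numpy.linalg routines, by contrast, are documented to call LAPACK/BLAS:
print(np.linalg.norm(x))
```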

So this is about NumPy code, not BLAS fallback code.

Also it’s quite difficult to get a copy of NumPy that doesn’t use BLAS; every easy method of installation gives you either OpenBLAS or `mkl` as backend.

Does BLAS not include routines for the simple stuff like pointwise addition? I would’ve thought that would be entirely in its wheelhouse.

I could be wrong, but docs suggest only stuff mentioned in `numpy.linalg` uses BLAS (https://numpy.org/doc/stable/reference/routines.linalg.html).

UPDATE: Some playing around with ltrace suggests that:

Interesting, makes sense, presumably there’s some FFI cost to calling into BLAS. Thanks!

Yep it does, it’s called axpy and it’s level 1 BLAS.

https://en.m.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms

I don’t know why you linked me the wikipedia article to tell me that: it doesn’t actually have a list of subroutines in it.

It does in the “Functionality” section. I was on mobile and couldn’t figure out how to link it directly.
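For reference, axpy computes y ← a·x + y; a minimal NumPy rendering of the same operation (SciPy exposes the actual BLAS routine as `scipy.linalg.blas.daxpy`, if SciPy is installed):

```python
import numpy as np

# axpy: y <- a*x + y (Level 1 BLAS), written in plain NumPy:
a = 2.0
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 10.0, 10.0])
y = a * x + y
print(y)   # [12. 14. 16.]
```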

The article talks a little about reducing total cache misses, but a ~35% miss rate for even the numpy implementation seems exceedingly high for logic like this, that essentially has ideal temporal and spatial locality characteristics. What am I missing?

Nevermind, on consulting what that `perf` event means, it occurs to me that misses/instruction is the salient ratio here.
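To make the distinction concrete, a toy calculation with invented counter values (not measurements from the article):

```python
# Invented counters, purely to illustrate the two ratios:
cache_references = 1_000_000
cache_misses     =   350_000
instructions     = 50_000_000

# Misses per cache *reference*: looks alarming (~35%), but references
# can be rare if most loads hit in registers or L1.
miss_rate = cache_misses / cache_references

# Misses per *instruction*: the ratio that better reflects overall impact.
misses_per_instruction = cache_misses / instructions

print(miss_rate, misses_per_instruction)   # prints 0.35 0.007
```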