    The AMX accelerator is shared between multiple cores (though not all of them), afaik, so ‘on a single core’ is a bit misleading here.

      That’s a good point. I meant that you don’t need to mess with threads (which has implications for cache use and ideal problem size).

      I added this bit: “An important distinction is that the AMX:CPU ratio is not 1:1; not every core has its own AMX co-processor.”