Threads for bwasti

    1. 2

      The amx accelerator is shared between multiple cores (though not all), afaik, so ‘on a single core’ is a bit misleading here.

      1. 2

        that’s a good point. I meant it to mean you don’t need to mess with threads (which has implications for cache use and ideal problem size).

        I added this bit “An important distinction is that the AMX:CPU ratio is not 1:1; not every core has its own AMX co-processor.”

    2. 2

      What an awesome idea! Keep up the great work!

    3. 4

      This seems like a great idea at first, but for anything in production it’s much better to build a library of primitives and write scripts in something with better error checking than bash.

      What ends up happening is you end up with hundreds of these tiny scripts like in Moses and it quickly becomes very cumbersome to refactor and debug.

      1. 1

        It’s a good point that command-line pipes won’t scale with training complexity. An escape hatch here is the fact that learn.ts is just a regular TypeScript file that can be used for more advanced logic (import model from './learn' will work with everything shown in the post).

        The target use cases (for my work at least) are very much akin to the second example presented. Lots of simple data + small models that can be used for embedded heuristics. These certainly have their place in production!

        1. 1

          I think for prototyping it’s great, and definitely a slick alternative to using a notebook. Maybe the best of both worlds is writing libraries but if you must have commands make them amendable to piping like in your post.

    4. 7

      cat /usr/share/dict/words | grep -e '^a..$' | wc -l

      As everyone knows, this is the worst demonstration of pipes:

      grep -c '^a..$' /usr/share/dict/words

      1. 12

        I think this is actually a good demonstration of pipes, because it shows that even if you don’t know the -c flag you can kludge together the same thing with pipes. That makes them a lot more accessible!

      2. 5

        grep -c '^a..$' /usr/share/dict/words

        I’d say this is an even worse demonstration of pipes 😉

        If you happen to have a nice introductory demonstration of pipes that you’d be willing to share, I’d really appreciate it!

    5. 3

      I’ve been working on shumai, a javascript tensor library, for the past month ( and hoping to get some compelling example use cases out!

    6. 1

      not mentioned is the utilization of cache

      not a very fast matrix multiplication, then :P

      1. 1

        Well it’s nearly identical reasoning to register saturation (modulo the often inscrutable but largely benign eviction policies of varying hardware) - so I felt it was fair to leave it as an exercise to the reader :)

        1. 1

          On a single core, a good matrix product is 99% cache management, 1% parallelism. The post does its readers a disservice by minimising this point.

          1. 1


            1. 1

              Personal experience… :P

              Here’s a simple example. I get the following timings:

              • naive: 5.742471
              • simple vectorised: 0.150027
              • simple local: 0.316796
              • blas singlethreaded: 0.038781

              ‘Simple local’ just does the same loop transposition as the simple vectorised approach.

              1. 1

                results for me (writeup below)

                • naive: 2.381266
                • simple vectorized: 0.346979
                • 12 registers used: 0.180736
                • 15 registers used: 0.168493
                • numactl -C2 blas: 0.158050

                You might want to double check that blas is running single threaded (maybe use numactl or taskctl) - 55gflops with double precision is extremely fast for avx2 code. My machine’s blas runs in 0.157807s or ~15gflops. If you have an avx512 machine then it make sense, but you should be using the 512 intrinsics for comparison :). Also, you should use fmadd, which is around 2x faster than mul+add as separate instructions.

                Disregarding the FMA instruction, the simple vectorized code doesn’t quite capture everything in the animations.

                You’ll want to use more registers in the inner loop. Right now your code uses a single broadcast register (the red matrix in the animation) and then immediately loads and stores partial sums of the blue matrix. We’ve got 16 registers to work with, and we’re using ~2-3, so there’s some speed to be had. That’s what the last animation is all about.

                Instead, we should load some red registers and some blue registers, compute their outer product (in yellow registers) and then write that out to memory. This can be done by loading 4 times from the red matrix (broadcasting to 4 registers), 2 times from the blue matrix (contiguous loads) and then storing the outer product in 8 registers. This adds up to 14 registers, which is just shy of the 16 register limit that we have for avx2. That code can be found below in the function called “mm_12_registers.” It runs in about 0.180736s, or 12gflops per core.

                But why is it called 12 registers when I clearly use 14? Well avx2 fma instruction actually have the ability to use a different port on the CPU to pull directly from memory. So gcc should emit instructions that only use 12 registers (if we massage it right… I prefer using asm directly for this kinda stuff).

                That means we have 4 extra registers - which we can almost use if we load 5 broadcast red matrix values and store 10 running outer products. That can be found in mm_15_registers and that runs in 0.168493s for me (12.7gflops).

                The remaining ~15% of performance is certainly cache management. The trick is to actually write intermediate memory into a local buffer to do the exact same “reuse logic” we just did with registers but instead with real memory.


                1. 1

                  use more registers in the inner loop

                  Yes … definitely more performance to be had. I just wanted to show that having good locality is what makes the biggest difference, vs naive traversal order.

                  Also, you should use fmadd, which is around 2x faster than mul+add as separate instructions.

                  2x better throughput. Comparable latency. Which is sort of proving my point; the simple vectorised code I showed is mainly limited by latency, and runs at the same speed if I use fma.

                  double check that blas is running single threaded (maybe use numactl or taskctl) - 55gflops with double precision is extremely fast for avx2 code. My machine’s blas runs in 0.157807s or ~15gflops. If you have an avx512 machine then it make sense, but you should be using the 512 intrinsics for comparison :)

                  I have avx2—no avx512, sadly—and definitely running on one thread. I get the same result with numactl -Ck as with the envvar, and a much faster result if I leave off any thread limiting. blas on one thread is 2.3x faster than your mm_15_registers. Is your blas specialised to your uarch?

                  Anyway, call it ~28 fma per ns. That is 4 elements per vector×2 fma per cycle×3.5 cycles per ns; seems about right. I think my CPU boosts to 4.2ghz or so, so that leaves space for housekeeping.

                  write intermediate memory into a local buffer to do the exact same “reuse logic” we just did with registers but instead with real memory

                  A big part of the advantage from multi-level blocking comes from reducing cache pollution.

                  massage it right… I prefer using asm directly for this kinda stuff

                  Yeah, c is annoying. But slightly less hassle for a quick hack.

    7. 4

      This is cool.

      Loading each function individually is fine grained, but ends up making lots of requests (and hence overall slower).

      Grouping functions into related blocks would likely help a lot. Which is pretty much what people used to do to make a program fit in ram:

      I guess the same batching/grouping algorithms would help here.

      1. 1

        I do not miss real mode 🤣

      2. 1

        Grouping functions into related blocks would likely help a lot.

        Absolutely! This would also help with the performance overhead associated with calling functions in different modules.

    8. 8

      Great work! One small nitpick: A 8450% performance improvement is 85.5 times faster or a decrease in runtime by 98,8%. I believe the latter two convey the speed-up better than the first.

      1. 8

        decrease in runtime by 98,8%

        haha, I find this extremely confusing. E.g. 98% reduction -> 99% reduction is a 2x speed up

    9. 3

      Looks really interesting, the whole caching & reactive invalidation stuff sounds super cool. Had a look at the docs and couldn’t actually find much about the actual caching aspect though, are there any other resources that go into more detail on that?

      1. 3

        I briefly interacted with this code base at one point and remember the key beauty (for me) of the language runtime is this file

        The comment at the top explains how it works in great detail. Basically interning graphs makes it possible to get reactive things very very efficient

    10. 2

      This tool seems neat! I haven’t done much linear algebra optimization, so I am still wrapping my head around the optimization examples in the github repo.

      I might not be your target audience, but a little comment about why/how the optimization example works might be useful. But it is also possible that this tool is only really meant for people that already understand the example.

      In the github example code are all of the assert np.allclose(...) not supposed raise an error? In the github code example, only the first one is asserting without error.

      1. 2

        Thanks for the feedback! I’ve added a section that hopefully helps clear things up (“Tuning”, but please let me know if anything is unclear.

        Re: assertion failures: 😨. Would you be able to highlight which section fails? I may have included old features that don’t work anymore

        1. 2

          That explanation was super helpful, thank you! I will submit an issue on github with the asserts that are failing on my machine as to not clog up this thread.

    11. 1

      Interesting project! However, I find the following snippet a little confusing

      a_np = np.random.randn(128, 128)
      b_np = np.random.randn(128, 128)
      a = lt.Tensor(a_np).to(s.m, s.k)
      b = lt.Tensor(b_np).to(s.k, s.n)
      c = (a * b).sum(s.k)

      Which parameters am I tuning? Clearly I cannot change s.m and s.n because these are my input/output specification, but I cannot tune s.k either since s.m * s.k and s.k * s.n are fixed to be the number of elements in both matrices.

      In addition, I assume by (a * b) you are actually matrix multiplication (it cannot be element-wise multiplication, because the number of elements in a and b may not match when, for example, you are multiplying a 2 * 3 matrix with a 3 * 5 one). In that case, I think the notation (a @ b) would make more sense to most numpy users, because (a * b) does element-wise multiplication in numpy.

      1. 2

        yep, the sizes of s.k, s.m, and s.n are all fixed so those can’t be tuned. However, the loops needed to execute this computation (e.g. looping in different orders) can be tuned. I wrote up a quick notebook to clarify:

        (a * b) is actually point-wise multiplication. You’re right that the elements don’t match, so there’s implicit broadcasting (much like a.reshape(128, 128, 1) * b.reshape(1, 128, 128) in numpy). This is immediately followed by a reduction (sum) over the shared dimension s.k. In numpy this is inefficient, because the whole 128,128,128 array needs to be created. In loop tool you can order the loops to skip that big intermediate array entirely. It allows you to craft complex operations without worrying about performance ahead of time (as they can always be made fast later).

        1. 4

          Thanks for the clarification! Just want to add that (128, 128) * (128, 128) is probably not the best example for matrix multiplication. Maybe you can switch it to something like (200, 300) * (300, 500) to show a more general case, and more importantly, the difference in dimensions can help readers like me to identify which one is which :)

          1. 3

            That’s a great suggestion - updated the post and the notebook with unique sizes!

    12. 3

      i’m kinda surprised this just tosses the polytopal model and Banerjee inequalities in with other, lesser optimization methods? i thought polytopes were pretty old hat now. they’re definitely covered at length in Kennedy’s “optimizing compilers for modern architectures” and were de rigueur in GT’s first semester of graduate compilers back in 2009. just mentioning them seems like talking about various heuristics for code transformation and then mentioning “oh yeah and SSA is used sometimes.” GCC acquired polyhedral modeling in GRAPHITE a decade ago iirc.

      really cool tool, though!

      1. 1

        Oh I should add a reference to graphite, thanks! Polyhedral models are a deep and well explored topic, but I wanted to keep the text approachable. I think empirically Polly and Graphite don’t actually achieve the same performance as the other two methods (due to strict functionality guarantees), but I could be mistaken.

        1. 5

          I’ve not looked at Graphite, but Polly often gives orders-of-magnitude better performance than other LLVM optimisations for loop nests. Polytope optimisations encompass an arbitrary mechanism for reordering loop iterations, whereas other techniques are very specialised (e.g. loop unrolling simply exposes the opportunity to parallelise across sequential iterations of the inner loop, loop rotation provides a somewhat similar set of opportunities). Optimisations contain three components:

          • A transform.
          • A set of preconditions that define whether the transform is sound.
          • A set of heuristics that define whether the transform is beneficial.

          Polytope encompasses arbitrary loop-reordering transforms. This needs to be coupled with alias analysis to determine whether they’re sound for any given loop. Julia and Fortran often get more benefit from Polly than C/C++ because they’re better able to give rich information about which stores will definitely not alias other loads. The more structured your input is, the better you can do here (if you’re doing matrix multiplication without in-place modification then you can guarantee that all loads and stores are independent, which is expands the space of valid polytope transforms to basically any loop order).

          The last part is still a very large open research area and one that depends a lot on your target. For a GPU, a massively parallel streaming operation is better, whereas on a CPU you want to pull things into caches, operate on blocks, and prefetch the next blocks into cache while you’re operating on the current one (and, if you’re parallelising across CPU cores, then you want to try to have multiple cores reading the same data at the same time so that they can snoop shared lines from other caches if they miss locally, but writing to different areas of the output so they aren’t playing cache-line ping-pong with exclusive lines).

          To make things worse, the trade off between memory and compute changes depending on the target and the size. If you can recompute a value based on data in an L1 cache or GPU local memory, then that may be faster than fetching it from main memory / texture memory, depending on the complexity of the calculation. If your schedule defines too much recomputation then it can be slower than doing the naïve thing.

          1. 1

            great points! Generalized loop-reordering is something loop_tool attempts to help out with, so it’s good hear I’m not barking up the totally wrong tree.

            W.r.t GPU I’d like to note that cache utilization (shared memory) is also important and there are many advanced techniques to avoid things like bank conflicts, e.g.

            Recompute tradeoffs are very interesting the domain of ML training pipelines as well (RAM residency rather than L1 cache). Since functions and their derivatives often use the same inputs but are executed at different times, it often makes sense to recalculate inputs rather than keep them around.

        2. 2

          huh, that would surprise me, but it’s certainly within the realm of possibilities. probably not too difficult to explore, though no doubt involving arcane command line options out the ass.

          either way, cool tool.

          1. 2

            I haven’t done any precise benchmarking myself, but I was looking at recent comparisons to MKL e.g. here:

            Since TVM/XLA are free to actually use MKL for inner-loops, I think polyhedral methods are relatively rare in the ML-compiler space.

    13. 2

      Great for graphics, but I’m itching for browser access to the more modern compute capabilities of GPGPUs. Anyone know what’s up with WebGPU?

      1. 1

        what legitimate use cases exist for this? the only thing I can think of is websites that mining shitcoins on users’ systems… which isn’t a legitimate use case in my book.

        1. 3

          My use case is that students in graphics courses have different operating systems on their laptops. This makes it hard to create exercises everyone can run and modify. The browser is the perfect sandbox

        2. 2

          I’m interested in running neural networks. Here’s a list of around 30 other applications:

          1. 2

            I know why GPGPUs are useful, but not why a web browser needs to use them. That’s what I was specifically asking about.

          2. 1

            Why would you do want to do this in the browser instead of a standard process?

    14. 4

      What is the server-side configuration for something like this? Does sshd have some kind of inted-like functionality or does the code have to implement its own ssh server?

      1. 6

        The latter I believe. The project seems to use this library

        1. 2

          Although the former would also be possible. This would be similar to how git is usually set up where there is only one allowed command.

    15. 4

      I’ve always enjoyed this test registry + macro paradigm! gtest does it in cpp, nice to make a pure C version

      You may want to free your calls to malloc when the tests complete :). Alternatively, check out BSD’s famous queue.h header

    16. 1

      Maybe specify 1 dimensional slicing in the title? The article is missing ideas related to multiple slices as the key to __getitem__ (common in data science applications)

    17. 5

      That context manager is a nice snippet of code. I’ll be using that for sure!

      Small nit:

      the rest of the performance improvement is likely due to a significant reduction in cache misses

      This might be more readily explained by the fact that there are half as many bytes processed, regardless of cache. Similarly, AVX instructions can process 2x as many fp32 operations compared to fp64.

      1. 2

        Fair point. I ended up just rewriting that whole section, fp32 is probably a distraction.

    18. 1

      I thought numpy was supposed to use BLAS and LAPACK under the hood for everything. Apparently it has fallback implementations (that weren’t always using SIMD).

      Docs say:

      NumPy does not require any external linear algebra libraries to be installed. However, if these are available, NumPy’s setup script can detect them and use them for building

      1. 2

        It uses those for more complex linear algebra stuff, but not for simple things like computing means or adding numbers. For those it has its own code (~30% of the code is in C).

        So this about NumPy code, not BLAS fallback code.

        Also it’s quite difficult to get a copy of NumPy that doesn’t use BLAS, every easy method of installation gives you either OpenBLAS or mkl as backend.

        1. 1

          Does BLAS not include routines for the simple stuff like pointwise addition? I would’ve thought that would be entirely in its wheelhouse.

          1. 2

            I could be wrong, but docs suggest only stuff mentioned in numpy.linalg uses BLAS (

            UPDATE: Some playing around with ltrace suggests that:

            1. BLAS is not used for tiny arrays.
            2. Even for large arrays, BLAS is not used for e.g. adding an integer to an array.
            1. 1

              Interesting, makes sense, presumably there’s some FFI cost to calling into BLAS. Thanks!

          2. 1

            Yep it does, it’s called axpy and it’s level 1 BLAS.


            1. 1

              I don’t know why you linked me the wikipedia article to tell me that: it doesn’t actually have a list of subroutines in it.

              1. 1

                It does in the “functionality” section. I was on mobile and couldn’t figure out how to link it directly

    19. 1

      I’m not sure if it’s good or bad that there are so many different ways of interacting with / generating WebAssembly. I’ve been evaluating them for a side project, and I hadn’t even heard of Wasmblr.

      The main thing I learned from this post is that the tool can actually have quite a big effect on performance, because they have different models for the code generation. e.g. emscripten creates a malloc implementation that uses WebAssembly linear memory, but wasemblr allows you to generate WebAssembly vs. just compile your implementation, so you can control the memory allocation more directly. Super interesting approach. I’m not an emscripten expert, so I wonder if there are similar features in it as well?

      1. 1

        I really like your take away. I think that question really gets to the heart of what I was going for!

        With respect to your exact example about memory allocation, Emscripten actually does expose the entire heap (for example, check out Module.HEAPF32).

        That being said, the way memory is used by the compiled code is not exposed (you just have to trust LLVM will do a good job). It’s a classic tradeoff between high and low level coding.

        However, there are more ways in which wasmblr (or any fast-to-recompile tool) can come in handy. If you determine (at runtime) that a certain value will always live at an address in memory, you can hard code that value directly into your web assembly. These types of situations are somewhat common: e.g. if a user resizes their window, you may want to use that size directly for rendering a bit faster.

    20. 3

      I was curious about that typedarrays part, but at least in my browser it’s the expected way around for every step: (i5-9300H, FF, linux)

       benchmarking vec add of size 4
         pure javascript:         15384615.38 iters/sec (0.74 GB/s)
         typed arrays:            19230769.23 iters/sec (0.92 GB/s)
       benchmarking vec add of size 131072
         pure javascript:         2820.08 iters/sec (4.44 GB/s)
         typed arrays:            3835.68 iters/sec (6.03 GB/s)

      Not an intel thing either - FF on Android:

      benchmarking vec add of size 4
         pure javascript:         5000000 iters/sec (0.24 GB/s)
         typed arrays:            9090909.09 iters/sec (0.44 GB/s)

      Looks like a V8 issue. Same Android on Chrome:

      benchmarking vec add of size 4
         pure javascript:         7733952.12 iters/sec (0.37 GB/s)
         typed arrays:            4940711.57 iters/sec (0.24 GB/s)

      Maybe worth reporting as a bug?

      1. 1

        ah, interesting. Testing it out now, SpiderMonkey (Firefox’s engine) is definitely showing consistent results for me as well

        Good catch I’ll add this to the writeup!