Threads for bwasti

  1. 1

    not mentioned is the utilization of cache

    not a very fast matrix multiplication, then :P

    1. 1

      Well it’s nearly identical reasoning to register saturation (modulo the often inscrutable but largely benign eviction policies of varying hardware) - so I felt it was fair to leave it as an exercise to the reader :)

      1. 1

        On a single core, a good matrix product is 99% cache management, 1% parallelism. The post does its readers a disservice by minimising this point.

        1. 1

          Citation?

          1. 1

            Personal experience… :P

            Here’s a simple example. I get the following timings:

            • naive: 5.742471
            • simple vectorised: 0.150027
            • simple local: 0.316796
            • blas singlethreaded: 0.038781

            ‘Simple local’ just does the same loop transposition as the simple vectorised approach.

            1. 1

              results for me (writeup below)

              • naive: 2.381266
              • simple vectorized: 0.346979
              • 12 registers used: 0.180736
              • 15 registers used: 0.168493
              • numactl -C2 blas: 0.158050

              You might want to double-check that blas is running single threaded (maybe use numactl or taskset) - 55gflops with double precision is extremely fast for avx2 code. My machine’s blas runs in 0.157807s or ~15gflops. If you have an avx512 machine then it makes sense, but you should be using the 512 intrinsics for comparison :). Also, you should use fmadd, which is around 2x faster than mul+add as separate instructions.

              Disregarding the FMA instruction, the simple vectorized code doesn’t quite capture everything in the animations.

              You’ll want to use more registers in the inner loop. Right now your code uses a single broadcast register (the red matrix in the animation) and then immediately loads and stores partial sums of the blue matrix. We’ve got 16 registers to work with, and we’re using ~2-3, so there’s some speed to be had. That’s what the last animation is all about.

              Instead, we should load some red registers and some blue registers, compute their outer product (in yellow registers) and then write that out to memory. This can be done by loading 4 times from the red matrix (broadcasting to 4 registers), 2 times from the blue matrix (contiguous loads) and then storing the outer product in 8 registers. This adds up to 14 registers, which is just shy of the 16 register limit that we have for avx2. That code can be found below in the function called “mm_12_registers.” It runs in about 0.180736s, or 12gflops per core.
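
              Roughly, the inner loop ends up shaped like the sketch below (just an illustration of the 4-broadcast / 2-load / 8-accumulator shape, not the linked mm_12_registers code; it assumes double precision, row-major layouts, a caller that tiles M, N and K down to a 4x8 tile, and -mavx2 -mfma):

              #include <immintrin.h>

              /* 4x8 micro-kernel: 4 broadcasts from A ("red"), 2 contiguous loads from
               * B ("blue"), 8 accumulators for the 4x8 tile of C ("yellow") - 14 of the
               * 16 ymm registers. */
              static void kernel_4x8(const double *A, const double *B, double *C,
                                     int K, int lda, int ldb, int ldc) {
                __m256d c00 = _mm256_setzero_pd(), c01 = _mm256_setzero_pd();
                __m256d c10 = _mm256_setzero_pd(), c11 = _mm256_setzero_pd();
                __m256d c20 = _mm256_setzero_pd(), c21 = _mm256_setzero_pd();
                __m256d c30 = _mm256_setzero_pd(), c31 = _mm256_setzero_pd();
                for (int p = 0; p < K; p++) {
                  __m256d b0 = _mm256_loadu_pd(&B[p * ldb + 0]);     /* blue: contiguous */
                  __m256d b1 = _mm256_loadu_pd(&B[p * ldb + 4]);
                  __m256d a0 = _mm256_broadcast_sd(&A[0 * lda + p]); /* red: broadcast */
                  __m256d a1 = _mm256_broadcast_sd(&A[1 * lda + p]);
                  __m256d a2 = _mm256_broadcast_sd(&A[2 * lda + p]);
                  __m256d a3 = _mm256_broadcast_sd(&A[3 * lda + p]);
                  c00 = _mm256_fmadd_pd(a0, b0, c00); c01 = _mm256_fmadd_pd(a0, b1, c01);
                  c10 = _mm256_fmadd_pd(a1, b0, c10); c11 = _mm256_fmadd_pd(a1, b1, c11);
                  c20 = _mm256_fmadd_pd(a2, b0, c20); c21 = _mm256_fmadd_pd(a2, b1, c21);
                  c30 = _mm256_fmadd_pd(a3, b0, c30); c31 = _mm256_fmadd_pd(a3, b1, c31);
                }
                _mm256_storeu_pd(&C[0 * ldc + 0], c00); _mm256_storeu_pd(&C[0 * ldc + 4], c01);
                _mm256_storeu_pd(&C[1 * ldc + 0], c10); _mm256_storeu_pd(&C[1 * ldc + 4], c11);
                _mm256_storeu_pd(&C[2 * ldc + 0], c20); _mm256_storeu_pd(&C[2 * ldc + 4], c21);
                _mm256_storeu_pd(&C[3 * ldc + 0], c30); _mm256_storeu_pd(&C[3 * ldc + 4], c31);
              }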

              But why is it called 12 registers when I clearly use 14? Well, avx2 fma instructions actually have the ability to use a different port on the CPU and pull one operand directly from memory. So gcc should emit instructions that only use 12 registers (if we massage it right… I prefer using asm directly for this kinda stuff).

              That means we have 4 extra registers - which we can almost use if we load 5 broadcast red matrix values and store 10 running outer products. That can be found in mm_15_registers and that runs in 0.168493s for me (12.7gflops).

              The remaining ~15% of performance is certainly cache management. The trick is to write intermediate values into a local buffer and do the exact same “reuse logic” we just did with registers, but with real memory instead.
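
              (Hand-wavy sketch of that packing idea, continuing the hypothetical kernel_4x8 from above rather than the linked code: copy a K x 8 panel of B into a small contiguous buffer once, then sweep every 4-row tile of A over it while it’s still hot in cache. Assumes M % 4 == 0, N % 8 == 0 and K <= 1024.)

              /* Pack a K x 8 panel of B, then reuse it for all row tiles of A. */
              static void mm_packed(const double *A, const double *B, double *C,
                                    int M, int N, int K) {
                static double Bpack[1024 * 8];
                for (int n = 0; n < N; n += 8) {
                  for (int p = 0; p < K; p++)        /* pack: one pass over the panel */
                    for (int j = 0; j < 8; j++)
                      Bpack[p * 8 + j] = B[p * N + n + j];
                  for (int m = 0; m < M; m += 4)     /* reuse: many passes, all cached */
                    kernel_4x8(&A[m * K], Bpack, &C[m * N + n], K, K, 8, N);
                }
              }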

              Code: http://ix.io/3YjF

              1. 1

                use more registers in the inner loop

                Yes … definitely more performance to be had. I just wanted to show that having good locality is what makes the biggest difference, vs naive traversal order.

                Also, you should use fmadd, which is around 2x faster than mul+add as separate instructions.

                2x better throughput. Comparable latency. Which is sort of proving my point; the simple vectorised code I showed is mainly limited by latency, and runs at the same speed if I use fma.

                double-check that blas is running single threaded (maybe use numactl or taskset) - 55gflops with double precision is extremely fast for avx2 code. My machine’s blas runs in 0.157807s or ~15gflops. If you have an avx512 machine then it makes sense, but you should be using the 512 intrinsics for comparison :)

                I have avx2—no avx512, sadly—and definitely running on one thread. I get the same result with numactl -Ck as with the envvar, and a much faster result if I leave off any thread limiting. blas on one thread is 2.3x faster than your mm_15_registers. Is your blas specialised to your uarch?

                Anyway, call it ~28 fma per ns. That is 4 elements per vector×2 fma per cycle×3.5 cycles per ns; seems about right. I think my CPU boosts to 4.2ghz or so, so that leaves space for housekeeping.

                write intermediate memory into a local buffer to do the exact same “reuse logic” we just did with registers but instead with real memory

                A big part of the advantage from multi-level blocking comes from reducing cache pollution.

                massage it right… I prefer using asm directly for this kinda stuff

                Yeah, c is annoying. But slightly less hassle for a quick hack.

    1. 4

      This is cool.

      Loading each function individually is fine grained, but ends up making lots of requests (and hence being slower overall).

      Grouping functions into related blocks would likely help a lot. Which is pretty much what people used to do to make a program fit in ram:

      https://en.m.wikipedia.org/wiki/Overlay_(programming)

      I guess the same batching/grouping algorithms would help here.

      1. 1

        I do not miss real mode 🤣

        1. 1

          Grouping functions into related blocks would likely help a lot.

          Absolutely! This would also help with the performance overhead associated with calling functions in different modules.

        1. 8

          Great work! One small nitpick: An 8450% performance improvement is 85.5 times faster or a decrease in runtime by 98.8%. I believe the latter two convey the speed-up better than the first.

          1. 8

            decrease in runtime by 98.8%

            haha, I find this extremely confusing. E.g. 98% reduction -> 99% reduction is a 2x speed up

            https://en.wikipedia.org/wiki/Potato_paradox

          1. 3

            Looks really interesting, the whole caching & reactive invalidation stuff sounds super cool. Had a look at the docs and couldn’t find much about the actual caching aspect though; are there any other resources that go into more detail on that?

            1. 3

              I briefly interacted with this code base at one point and remember the key beauty (for me) of the language runtime is this file https://github.com/skiplang/skip/blob/master/src/runtime/native/src/intern.cpp

              The comment at the top explains how it works in great detail. Basically, interning graphs makes it possible to make reactive things very, very efficient.

            1. 3

              i’m kinda surprised this just tosses the polytopal model and Banerjee inequalities in with other, lesser optimization methods? i thought polytopes were pretty old hat now. they’re definitely covered at length in Kennedy’s “optimizing compilers for modern architectures” and were de rigueur in GT’s first semester of graduate compilers back in 2009. just mentioning them seems like talking about various heuristics for code transformation and then mentioning “oh yeah and SSA is used sometimes.” GCC acquired polyhedral modeling in GRAPHITE a decade ago iirc.

              really cool tool, though!

              1. 1

                Oh I should add a reference to graphite, thanks! Polyhedral models are a deep and well explored topic, but I wanted to keep the text approachable. I think empirically Polly and Graphite don’t actually achieve the same performance as the other two methods (due to strict functionality guarantees), but I could be mistaken.

                1. 5

                  I’ve not looked at Graphite, but Polly often gives orders-of-magnitude better performance than other LLVM optimisations for loop nests. Polytope optimisations encompass an arbitrary mechanism for reordering loop iterations, whereas other techniques are very specialised (e.g. loop unrolling simply exposes the opportunity to parallelise across sequential iterations of the inner loop, loop rotation provides a somewhat similar set of opportunities). Optimisations contain three components:

                  • A transform.
                  • A set of preconditions that define whether the transform is sound.
                  • A set of heuristics that define whether the transform is beneficial.

                  Polytope encompasses arbitrary loop-reordering transforms. This needs to be coupled with alias analysis to determine whether they’re sound for any given loop. Julia and Fortran often get more benefit from Polly than C/C++ because they’re better able to give rich information about which stores will definitely not alias other loads. The more structured your input is, the better you can do here (if you’re doing matrix multiplication without in-place modification then you can guarantee that all loads and stores are independent, which expands the space of valid polytope transforms to basically any loop order).
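
                  (Concretely, the kind of promise that widens the legal schedule space - a toy C sketch; in C the programmer has to assert non-aliasing explicitly with restrict, which is exactly the information Fortran/Julia-style semantics hand the compiler for free:)

                  /* With restrict the compiler may assume c never overlaps a or b, so
                   * reordering these loop iterations cannot change which values are read
                   * or written; the polyhedral scheduler is then free to pick any order
                   * it likes (modulo floating-point reassociation). */
                  void matmul(double *restrict c, const double *restrict a,
                              const double *restrict b, int n) {
                    for (int i = 0; i < n; i++)
                      for (int j = 0; j < n; j++)
                        for (int k = 0; k < n; k++)
                          c[i * n + j] += a[i * n + k] * b[k * n + j];
                  }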

                  The last part is still a very large open research area and one that depends a lot on your target. For a GPU, a massively parallel streaming operation is better, whereas on a CPU you want to pull things into caches, operate on blocks, and prefetch the next blocks into cache while you’re operating on the current one (and, if you’re parallelising across CPU cores, then you want to try to have multiple cores reading the same data at the same time so that they can snoop shared lines from other caches if they miss locally, but writing to different areas of the output so they aren’t playing cache-line ping-pong with exclusive lines).

                  To make things worse, the trade off between memory and compute changes depending on the target and the size. If you can recompute a value based on data in an L1 cache or GPU local memory, then that may be faster than fetching it from main memory / texture memory, depending on the complexity of the calculation. If your schedule defines too much recomputation then it can be slower than doing the naïve thing.

                  1. 1

                    great points! Generalized loop-reordering is something loop_tool attempts to help out with, so it’s good to hear I’m not barking up the totally wrong tree.

                    W.r.t GPU I’d like to note that cache utilization (shared memory) is also important and there are many advanced techniques to avoid things like bank conflicts, e.g. https://github.com/NVIDIA/cutlass/blob/master/media/docs/implicit_gemm_convolution.md#shared-memory-layouts

                    Recompute tradeoffs are very interesting in the domain of ML training pipelines as well (RAM residency rather than L1 cache). Since functions and their derivatives often use the same inputs but are executed at different times, it often makes sense to recalculate inputs rather than keep them around.

                  2. 2

                    huh, that would surprise me, but it’s certainly within the realm of possibilities. probably not too difficult to explore, though no doubt involving arcane command line options out the ass.

                    either way, cool tool.

                    1. 2

                      I haven’t done any precise benchmarking myself, but I was looking at recent comparisons to MKL e.g. here: https://tiramisu-compiler.org/#comparison-with-polyhedral-compilers

                      Since TVM/XLA are free to actually use MKL for inner-loops, I think polyhedral methods are relatively rare in the ML-compiler space.

                1. 2

                  This tool seems neat! I haven’t done much linear algebra optimization, so I am still wrapping my head around the optimization examples in the github repo.

                  I might not be your target audience, but a little comment about why/how the optimization example works might be useful. But it is also possible that this tool is only really meant for people that already understand the example.

                  In the github example code, aren’t all of the assert np.allclose(...) calls supposed to pass without raising an error? When I run it, only the first one passes.

                  1. 2

                    Thanks for the feedback! I’ve added a section that hopefully helps clear things up (“Tuning” https://jott.live/markdown/interactive_optimization#tuning), but please let me know if anything is unclear.

                    Re: assertion failures: 😨. Would you be able to highlight which section fails? I may have included old features that don’t work anymore

                    1. 2

                      That explanation was super helpful, thank you! I will submit an issue on github with the asserts that are failing on my machine so as not to clog up this thread.

                  1. 1

                    Interesting project! However, I find the following snippet a little confusing

                    a_np = np.random.randn(128, 128)
                    b_np = np.random.randn(128, 128)
                    a = lt.Tensor(a_np).to(s.m, s.k)
                    b = lt.Tensor(b_np).to(s.k, s.n)
                    
                    c = (a * b).sum(s.k)
                    

                    Which parameters am I tuning? Clearly I cannot change s.m and s.n because these are my input/output specification, but I cannot tune s.k either since s.m * s.k and s.k * s.n are fixed to be the number of elements in both matrices.

                    In addition, I assume by (a * b) you actually mean matrix multiplication (it cannot be element-wise multiplication, because the number of elements in a and b may not match when, for example, you are multiplying a 2 * 3 matrix with a 3 * 5 one). In that case, I think the notation (a @ b) would make more sense to most numpy users, because (a * b) does element-wise multiplication in numpy.

                    1. 2

                      yep, the sizes of s.k, s.m, and s.n are all fixed so those can’t be tuned. However, the loops needed to execute this computation (e.g. looping in different orders) can be tuned. I wrote up a quick notebook to clarify: https://colab.research.google.com/drive/1Uf0DCdyUXLGica6vjb3rG-ntD4b_ZWUD?usp=sharing

                      (a * b) is actually point-wise multiplication. You’re right that the elements don’t match, so there’s implicit broadcasting (much like a.reshape(128, 128, 1) * b.reshape(1, 128, 128) in numpy). This is immediately followed by a reduction (sum) over the shared dimension s.k. In numpy this is inefficient, because the whole 128,128,128 array needs to be created. In loop tool you can order the loops to skip that big intermediate array entirely. It allows you to craft complex operations without worrying about performance ahead of time (as they can always be made fast later).
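
                      (If it helps, “order the loops to skip that big intermediate array” amounts to the difference between these two C loop nests - a rough sketch with hypothetical sizes M, K, N, not loop_tool’s actual generated code:)

                      /* numpy-style: materialize the full M x K x N broadcasted product, then reduce. */
                      void unfused(const float *a, const float *b, float *tmp, float *c,
                                   int M, int K, int N) {
                        for (int m = 0; m < M; m++)
                          for (int k = 0; k < K; k++)
                            for (int n = 0; n < N; n++)
                              tmp[(m * K + k) * N + n] = a[m * K + k] * b[k * N + n];
                        for (int m = 0; m < M; m++)
                          for (int n = 0; n < N; n++) {
                            float s = 0;
                            for (int k = 0; k < K; k++)
                              s += tmp[(m * K + k) * N + n];
                            c[m * N + n] = s;
                          }
                      }

                      /* reordered/fused: identical arithmetic, no M x K x N intermediate at all. */
                      void fused(const float *a, const float *b, float *c, int M, int K, int N) {
                        for (int m = 0; m < M; m++)
                          for (int n = 0; n < N; n++) {
                            float s = 0;
                            for (int k = 0; k < K; k++)
                              s += a[m * K + k] * b[k * N + n];
                            c[m * N + n] = s;
                          }
                      }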

                      1. 4

                        Thanks for the clarification! Just want to add that (128, 128) * (128, 128) is probably not the best example for matrix multiplication. Maybe you can switch it to something like (200, 300) * (300, 500) to show a more general case, and more importantly, the difference in dimensions can help readers like me to identify which one is which :)

                        1. 3

                          That’s a great suggestion - updated the post and the notebook with unique sizes!

                    1. 2

                      Great for graphics, but I’m itching for browser access to the more modern compute capabilities of GPGPUs. Anyone know what’s up with WebGPU?

                      1. 1

                        what legitimate use cases exist for this? the only thing I can think of is websites mining shitcoins on users’ systems… which isn’t a legitimate use case in my book.

                        1. 3

                          My use case is that students in graphics courses have different operating systems on their laptops. This makes it hard to create exercises everyone can run and modify. The browser is the perfect sandbox

                          1. 2

                            I’m interested in running neural networks. Here’s a list of around 30 other applications: https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units#Applications

                            1. 2

                              I know why GPGPUs are useful, but not why a web browser needs to use them. That’s what I was specifically asking about.

                              1. 1

                                Why would you want to do this in the browser instead of a standard process?

                                1. 1

                                  convenience

                          1. 4

                            What is the server-side configuration for something like this? Does sshd have some kind of inetd-like functionality or does the code have to implement its own ssh server?

                            1. 6

                              The latter I believe. The project seems to use this library https://github.com/charmbracelet/wish

                              1. 2

                                Although the former would also be possible. This would be similar to how git is usually set up where there is only one allowed command.

                            1. 4

                              I’ve always enjoyed this test registry + macro paradigm! gtest does it in cpp, nice to make a pure C version

                              You may want to free your calls to malloc when the tests complete :). Alternatively, check out BSD’s famous queue.h header
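
                              (For anyone curious what the pattern looks like, a minimal hypothetical sketch - not the posted project’s code - that registers tests via a macro and a static intrusive list, so nothing is malloc’d at all; <sys/queue.h>’s SLIST macros could replace the hand-rolled linking. Relies on GCC/Clang’s constructor attribute.)

                              #include <stdio.h>

                              struct test {
                                const char *name;
                                void (*fn)(void);
                                struct test *next;
                              };

                              static struct test *test_head;

                              /* TEST() declares the test function, a statically allocated registry node,
                               * and a constructor that links the node in before main() runs. */
                              #define TEST(tname)                                              \
                                static void tname(void);                                       \
                                static struct test tname##_node = { #tname, tname, NULL };     \
                                __attribute__((constructor)) static void tname##_reg(void) {   \
                                  tname##_node.next = test_head;                               \
                                  test_head = &tname##_node;                                   \
                                }                                                              \
                                static void tname(void)

                              TEST(addition_works) {
                                if (1 + 1 != 2) printf("FAIL: %s\n", __func__);
                              }

                              int main(void) {
                                for (struct test *t = test_head; t; t = t->next) {
                                  printf("running %s\n", t->name);
                                  t->fn();
                                }
                                return 0;
                              }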

                              1. 1

                                Maybe specify 1 dimensional slicing in the title? The article is missing ideas related to multiple slices as the key to __getitem__ (common in data science applications)

                                1. 1

                                  I thought numpy was supposed to use BLAS and LAPACK under the hood for everything. Apparently it has fallback implementations (that weren’t always using SIMD).

                                  Docs say:

                                  NumPy does not require any external linear algebra libraries to be installed. However, if these are available, NumPy’s setup script can detect them and use them for building

                                  1. 2

                                    It uses those for more complex linear algebra stuff, but not for simple things like computing means or adding numbers. For those it has its own code (~30% of the code is in C).

                                    So this is about NumPy code, not BLAS fallback code.

                                    Also it’s quite difficult to get a copy of NumPy that doesn’t use BLAS; every easy method of installation gives you either OpenBLAS or MKL as the backend.

                                    1. 1

                                      Does BLAS not include routines for the simple stuff like pointwise addition? I would’ve thought that would be entirely in its wheelhouse.

                                      1. 2

                                        I could be wrong, but docs suggest only stuff mentioned in numpy.linalg uses BLAS (https://numpy.org/doc/stable/reference/routines.linalg.html).

                                        UPDATE: Some playing around with ltrace suggests that:

                                        1. BLAS is not used for tiny arrays.
                                        2. Even for large arrays, BLAS is not used for e.g. adding an integer to an array.
                                        1. 1

                                          Interesting, makes sense, presumably there’s some FFI cost to calling into BLAS. Thanks!

                                        2. 1

                                          Yep it does, it’s called axpy and it’s level 1 BLAS.

                                          https://en.m.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
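
                                          (For reference, the level-1 routine through the CBLAS interface; with alpha = 1.0 it is exactly an element-wise add, y := x + y. Rough sketch assuming an OpenBLAS-style cblas.h is available.)

                                          #include <cblas.h>  /* e.g. from OpenBLAS; link with -lopenblas */

                                          int main(void) {
                                            double x[4] = {1, 2, 3, 4};
                                            double y[4] = {10, 20, 30, 40};
                                            /* daxpy: y := alpha * x + y, walking both vectors with stride 1 */
                                            cblas_daxpy(4, 1.0, x, 1, y, 1);
                                            /* y is now {11, 22, 33, 44} */
                                            return 0;
                                          }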

                                          1. 1

                                            I don’t know why you linked me the wikipedia article to tell me that: it doesn’t actually have a list of subroutines in it.

                                            1. 1

                                              It does in the “functionality” section. I was on mobile and couldn’t figure out how to link it directly

                                    1. 5

                                      That context manager is a nice snippet of code. I’ll be using that for sure!

                                      Small nit:

                                      the rest of the performance improvement is likely due to a significant reduction in cache misses

                                      This might be more readily explained by the fact that there are half as many bytes processed, regardless of cache. Similarly, AVX instructions can process 2x as many fp32 operations compared to fp64.

                                      1. 2

                                        Fair point. I ended up just rewriting that whole section, fp32 is probably a distraction.

                                      1. 1

                                        I’m not sure if it’s good or bad that there are so many different ways of interacting with / generating WebAssembly. I’ve been evaluating them for a side project, and I hadn’t even heard of Wasmblr.

                                        The main thing I learned from this post is that the tool can actually have quite a big effect on performance, because the tools have different models for code generation. E.g. emscripten creates a malloc implementation that uses WebAssembly linear memory, but wasmblr lets you generate WebAssembly directly rather than just compiling your implementation, so you can control the memory allocation more directly. Super interesting approach. I’m not an emscripten expert, so I wonder if there are similar features in it as well?

                                        1. 1

                                          I really like your takeaway. I think that question really gets to the heart of what I was going for!

                                          With respect to your exact example about memory allocation, Emscripten actually does expose the entire heap (for example, check out Module.HEAPF32).

                                          That being said, the way memory is used by the compiled code is not exposed (you just have to trust LLVM will do a good job). It’s a classic tradeoff between high and low level coding.

                                          However, there are more ways in which wasmblr (or any fast-to-recompile tool) can come in handy. If you determine (at runtime) that a certain value will always live at an address in memory, you can hard-code that value directly into your WebAssembly. These types of situations are somewhat common: e.g. if a user resizes their window, you may want to use that size directly for rendering a bit faster.

                                        1. 3

                                          I was curious about that typedarrays part, but at least in my browser it’s the expected way around for every step: (i5-9300H, FF, linux)

                                           benchmarking vec add of size 4
                                             pure javascript:         15384615.38 iters/sec (0.74 GB/s)
                                             typed arrays:            19230769.23 iters/sec (0.92 GB/s)
                                          
                                          ...
                                          
                                           benchmarking vec add of size 131072
                                             pure javascript:         2820.08 iters/sec (4.44 GB/s)
                                             typed arrays:            3835.68 iters/sec (6.03 GB/s)
                                          

                                          Not an intel thing either - FF on Android:

                                          benchmarking vec add of size 4
                                             pure javascript:         5000000 iters/sec (0.24 GB/s)
                                             typed arrays:            9090909.09 iters/sec (0.44 GB/s)
                                          

                                          Looks like a V8 issue. Same Android on Chrome:

                                          benchmarking vec add of size 4
                                             pure javascript:         7733952.12 iters/sec (0.37 GB/s)
                                             typed arrays:            4940711.57 iters/sec (0.24 GB/s)
                                          

                                          Maybe worth reporting as a bug?

                                          1. 1

                                            ah, interesting. Testing it out now, SpiderMonkey (Firefox’s engine) is definitely showing consistent results for me as well

                                            Good catch, I’ll add this to the writeup!

                                          1. 2

                                            @bwasti I wonder if the plain JS array performs so well because the values are always 0? Perhaps if you actually stuff it with well-distributed floating point numbers, TypedArray will be more competitive? I figure you are comparing TypedArray float math to JIT’d int math.

                                            1. 3

                                              that’s a good point, the values aren’t zero (set to random values https://github.com/bwasti/wasmblr/blob/main/emscripten_example/benchmark.js#L125-L139), but they aren’t changing at all, which V8 may leverage to some degree.

                                              I’d suspect that if the JIT knew it was genuinely a no-op the measured “performance” at that point would be much higher.

                                              1. 1

                                                It’s good that you’re using random floats, as modern JS engines distinguish between arrays of ints, floats, etc. to shortcut various checks and conversions. It would be interesting to compare int vs float arrays, but you’d probably need to change downstream code to avoid measuring int->float promotion on load. These days I have no idea of the cost of float conversion on loads, however :)

                                            1. 5

                                              One thing I learned while working at Sourcefire VRT on the ClamAV core dev team: never write “clever code.” The reason is that clever code tends to introduce unneeded complexity, long-term maintenance burdens, and potential vulnerabilities. When we write clever code, we sabotage our future selves in addition to other developers, who may not understand the WHY behind the clever code.

                                              If there’s no way of getting around having to write clever code, then an explicit comment describing not only what the code does, but why it was written that way, should be a requirement.

                                              Writing clear and concise code, which would include insightful comments, relaxes all the burdens listed above. It also enables efficient development since brain cycles don’t need to be wasted understanding the complexity of that particular part of the codebase and its ramifications elsewhere.

                                              I would consider every single part of this article as demonstrating clever code: things to avoid. I don’t think the article intends to read that way, but that’s what I take away from it.

                                              1. 5

                                                Cleverness for its own sake can be bad, but the constructs here have valid uses.

                                                • The array bound designators make APIs more expressive to the reader, and enable compile- and run-time safety checks that can prevent buffer overruns.
                                                • Precise specification of float values, down to the bit, is important in some math/numerics code.
                                                • Being able to declare data structures at compile time so they’re built into the binary reduces code size and RAM usage and is important in embedded development.
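
                                                (Rough illustrative snippets for the three bullets above - not from the article:)

                                                #include <stddef.h>

                                                /* 1. Array bound designators: the caller must supply at least n elements,
                                                 *    which documents the contract and lets compilers flag misuse. */
                                                double sum(size_t n, const double values[static n]) {
                                                  double total = 0.0;
                                                  for (size_t i = 0; i < n; i++) total += values[i];
                                                  return total;
                                                }

                                                /* 2. Exact float constants: hexadecimal float literals pin the value down
                                                 *    to the bit, with no decimal-rounding ambiguity. */
                                                static const double pi = 0x1.921fb54442d18p+1; /* nearest double to pi */

                                                /* 3. Compile-time data structures: designated initializers build a table
                                                 *    that lives in the binary's read-only data, with no startup code. */
                                                static const char *const error_names[] = {
                                                  [0] = "ok",
                                                  [1] = "invalid argument",
                                                  [2] = "out of memory",
                                                };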

                                                IMO if you value clear and concise code you shouldn’t be using C in the first place. The lack of safety and abstraction contributes to verbosity and fragility.

                                                1. 2

                                                  Preprocessor as a functional language is heavily employed in C++ projects that define many operations (LLVM, PyTorch).
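
                                                  For anyone who hasn’t seen it, the usual shape is an “X macro” list that gets expanded several times - a generic sketch, not LLVM’s or PyTorch’s actual headers:

                                                  #include <stdio.h>

                                                  /* One central list of operations... */
                                                  #define FOR_EACH_OP(X) \
                                                    X(add)               \
                                                    X(sub)               \
                                                    X(mul)

                                                  /* ...expanded once to declare an enum... */
                                                  #define AS_ENUM(name) OP_##name,
                                                  enum op { FOR_EACH_OP(AS_ENUM) OP_COUNT };

                                                  /* ...and again for a matching name table, so the two can never drift apart. */
                                                  #define AS_STRING(name) #name,
                                                  static const char *op_names[] = { FOR_EACH_OP(AS_STRING) };

                                                  int main(void) {
                                                    for (int i = 0; i < OP_COUNT; i++)
                                                      printf("%d -> %s\n", i, op_names[i]);
                                                    return 0;
                                                  }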

                                                  1. 2

                                                    Writing clear and concise code

                                                    The preprocessor is your greatest friend in this regard :)

                                                  1. 2

                                                    How hard is calculating edit distance?

                                                    Assuming a computer can execute a billion steps in a second and an edit distance calculation takes about 100 steps(?) or so, that’s an ability to scan 10,000,000 words per second. Most full dictionaries seem to have 1,000,000 words, so that’s 10 full scans per second. Could probably cheaply maintain a topK, but full sorting would take ~20,000,000 steps which can happen at a rate of 50 full sorts per second and we’re still bounded by the edit distance calculation
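
                                                    (For scale, the textbook dynamic-programming version touches len(a) * len(b) table cells, so a pair of short words lands in the same ballpark as that ~100-step guess. A plain C sketch:)

                                                    #include <string.h>

                                                    /* Classic Levenshtein distance: O(m*n) time, one row of scratch space.
                                                     * For dictionary-sized words this is on the order of ~100 simple ops. */
                                                    static int edit_distance(const char *a, const char *b) {
                                                      size_t m = strlen(a), n = strlen(b);
                                                      if (m >= 64 || n >= 64) return -1;  /* sketch only handles short words */
                                                      int row[64];
                                                      for (size_t j = 0; j <= n; j++) row[j] = (int)j;
                                                      for (size_t i = 1; i <= m; i++) {
                                                        int prev = row[0];                 /* row[j-1] from the previous row */
                                                        row[0] = (int)i;
                                                        for (size_t j = 1; j <= n; j++) {
                                                          int cur = row[j];
                                                          int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                                                          int best = prev + cost;                           /* substitution */
                                                          if (row[j] + 1 < best) best = row[j] + 1;         /* deletion */
                                                          if (row[j - 1] + 1 < best) best = row[j - 1] + 1; /* insertion */
                                                          row[j] = best;
                                                          prev = cur;
                                                        }
                                                      }
                                                      return row[n];
                                                    }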

                                                    1. 3

                                                      Hyyro 2003 shows how to compute edit distances pretty quickly with the 1999 Myers bit vector algorithm. References are HERE.

                                                      This is also coded up in the Nim implementation already mentioned. One of the graphs there shows sub-ms time to scan 80,000 words with a hot cache. Really, that is 80,000*6 because that graph shows “6 batches” - the concept being modeled is that one first does vanilla hash lookups to identify possible misspells in some document, gets (say) 6 words, and then wants the closest K matches for all 6, which you can get with just a single IO sweep over the dictionary - amortizing the IO cost. Anyway, roughly 1/2 million distance calculations per millisecond is probably a simple ballpark for words with English-like statistics.

                                                      This batch method is a pretty natural way to go since the vast majority of words in the vast majority of documents will not be misspelled. It also scales well to very high edit distances like 5 because it does not really depend upon “how far apart” the words are from dictionary words. SymSpell does not scale well to high edit distance - either in space usage, where it basically starts blowing up, or time usage, where it gets worse more gradually.

                                                      The Peter Norvig idea that is “a million times slower” was just some did-it-on-an-airplane-in-an-hour implementation. That might almost define “strawman” except for Norvig’s fame/status/captive audiences.

                                                      If you have further interest then reading the whole README may be enlightening and/or interesting.

                                                    1. 19

                                                      In general when someone compares Git to a blockchain they are missing the whole point. Merkle trees have been around and understood for a long time. The thing that makes Bitcoin technologically interesting is that it provides a proven solution for byzantine distributed consensus (within some parameters, and very inefficiently).

                                                      Now this is also why it is hype. We have seen very few actual use cases for byzantine distributed consensus, and banks very likely don’t need it. They achieve consensus in simpler ways and have their own root of trust.

                                                      That being said, I think Bitcoin was a technological breakthrough and I hope that people do keep experimenting with this type of technology (particularly reducing the costs). I see it as a very early prototype that needs way more research to be market-ready. In the meantime I would appreciate it if the hype died down a lot and people stopped signing up for scams and Ponzi schemes, which would hopefully drop the resource consumption to a much lower level.

                                                      1. 8

                                                        Well it’s been 10 years and nobody uses bitcoin as a currency…

                                                        If anything, people’s behaviour in response to the technology is way more interesting than the technology itself.

                                                        1. 2

                                                          El Salvador does?

                                                      1. 12

                                                        astrology, inane as it is, does seem to equip a lot of people with a very simple and rich language for discussing the human condition. The parallels with functional programming and computation are striking