i’m kinda surprised this just tosses the polytopal model and Banerjee inequalities in with other, lesser optimization methods? i thought polytopes were pretty old hat now. they’re definitely covered at length in Kennedy’s “optimizing compilers for modern architectures” and were de rigueur in GT’s first semester of graduate compilers back in 2009. just mentioning them seems like talking about various heuristics for code transformation and then mentioning “oh yeah and SSA is used sometimes.” GCC acquired polyhedral modeling in GRAPHITE a decade ago iirc.
really cool tool, though!
Oh I should add a reference to graphite, thanks! Polyhedral models are a deep and well explored topic, but I wanted to keep the text approachable. I think empirically Polly and Graphite don’t actually achieve the same performance as the other two methods (due to strict functionality guarantees), but I could be mistaken.
I’ve not looked at Graphite, but Polly often gives orders-of-magnitude better performance than other LLVM optimisations for loop nests. Polytope optimisations encompass an arbitrary mechanism for reordering loop iterations, whereas other techniques are very specialised (e.g. loop unrolling simply exposes the opportunity to parallelise across sequential iterations of the inner loop, loop rotation provides a somewhat similar set of opportunities). Optimisations contain three components:
Polytope encompasses arbitrary loop-reordering transforms. This needs to be coupled with alias analysis to determine whether they’re sound for any given loop. Julia and Fortran often get more benefit from Polly than C/C++ because they’re better able to give rich information about which stores will definitely not alias other loads. The more structured your input is, the better you can do here (if you’re doing matrix multiplication without in-place modification then you can guarantee that all loads and stores are independent, which expands the space of valid polytope transforms to basically any loop order).
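A toy sketch of that last point in plain Python (the matmul_loops helper is made up for illustration, not anything from Polly or loop_tool): because c only accumulates and never aliases a or b, every permutation of the three loops computes the same result.

```python
import numpy as np

def matmul_loops(a, b, order="mnk"):
    """Naive matmul whose three loops can run in any order.

    c[m, n] += a[m, k] * b[k, n] has no aliasing between the loads
    (a, b) and the store (c), so every permutation of the m/n/k loops
    is a valid schedule -- the "basically any loop order" case.
    """
    M, K = a.shape
    K2, N = b.shape
    assert K == K2
    c = np.zeros((M, N))
    ranges = {"m": range(M), "n": range(N), "k": range(K)}
    for i in ranges[order[0]]:
        for j in ranges[order[1]]:
            for l in ranges[order[2]]:
                idx = dict(zip(order, (i, j, l)))
                c[idx["m"], idx["n"]] += a[idx["m"], idx["k"]] * b[idx["k"], idx["n"]]
    return c
```

Any of the six orderings gives the same answer; which one is *fastest* is a separate question entirely.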
The last part is still a very large open research area and one that depends a lot on your target. For a GPU, a massively parallel streaming operation is better, whereas on a CPU you want to pull things into caches, operate on blocks, and prefetch the next blocks into cache while you’re operating on the current one (and, if you’re parallelising across CPU cores, then you want to try to have multiple cores reading the same data at the same time so that they can snoop shared lines from other caches if they miss locally, but writing to different areas of the output so they aren’t playing cache-line ping-pong with exclusive lines).
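The CPU side of that (pull things into cache, operate on blocks) can be sketched roughly like this; blocked_matmul and the tile size of 16/32 are arbitrary choices for illustration, not anything prescriptive:

```python
import numpy as np

def blocked_matmul(a, b, bs=32):
    """Blocked (tiled) matmul: work on bs x bs sub-blocks so each
    tile of a, b, and c stays cache-resident while it's reused."""
    M, K = a.shape
    _, N = b.shape
    c = np.zeros((M, N))
    for i0 in range(0, M, bs):
        for j0 in range(0, N, bs):
            for k0 in range(0, K, bs):
                # one tile's worth of the reduction; np.dot stands in
                # for the register-level inner loops
                c[i0:i0+bs, j0:j0+bs] += a[i0:i0+bs, k0:k0+bs] @ b[k0:k0+bs, j0:j0+bs]
    return c
```

The result is identical to the naïve order; the only thing that changed is the memory access pattern.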
To make things worse, the trade off between memory and compute changes depending on the target and the size. If you can recompute a value based on data in an L1 cache or GPU local memory, then that may be faster than fetching it from main memory / texture memory, depending on the complexity of the calculation. If your schedule defines too much recomputation then it can be slower than doing the naïve thing.
great points! Generalized loop-reordering is something loop_tool attempts to help out with, so it’s good to hear I’m not barking up the wrong tree entirely.
W.r.t GPU I’d like to note that cache utilization (shared memory) is also important and there are many advanced techniques to avoid things like bank conflicts, e.g. https://github.com/NVIDIA/cutlass/blob/master/media/docs/implicit_gemm_convolution.md#shared-memory-layouts
Recompute tradeoffs are very interesting in the domain of ML training pipelines as well (RAM residency rather than L1 cache). Since functions and their derivatives often use the same inputs but are executed at different times, it often makes sense to recalculate inputs rather than keep them around.
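A toy sketch of that recompute-instead-of-keep idea (a hypothetical two-layer example, not any real framework’s API): the backward pass needs the hidden activation, and you can either keep it resident or drop it and recompute it from the inputs.

```python
import numpy as np

def forward(x, w1, w2, remat=False):
    """Toy two-layer forward: either cache the hidden activation h
    for the backward pass, or drop it to save memory."""
    h = np.maximum(x @ w1, 0)      # ReLU hidden layer
    y = h @ w2
    saved = None if remat else h   # remat=True trades memory for compute
    return y, saved

def grad_w2(x, w1, w2, dy, saved):
    """Gradient w.r.t. w2; recomputes h if it wasn't kept around."""
    h = saved if saved is not None else np.maximum(x @ w1, 0)
    return h.T @ dy
```

Both paths give the same gradient; which one is faster depends on how expensive the recompute is relative to keeping h resident.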
huh, that would surprise me, but it’s certainly within the realm of possibilities. probably not too difficult to explore, though no doubt involving arcane command line options out the ass.
either way, cool tool.
I haven’t done any precise benchmarking myself, but I was looking at recent comparisons to MKL e.g. here: https://tiramisu-compiler.org/#comparison-with-polyhedral-compilers
Since TVM/XLA are free to actually use MKL for inner-loops, I think polyhedral methods are relatively rare in the ML-compiler space.
This tool seems neat!
I haven’t done much linear algebra optimization, so I am still wrapping my head around the optimization examples in the github repo.
I might not be your target audience, but a little comment about why/how the optimization example works might be useful. But it is also possible that this tool is only really meant for people that already understand the example.
In the github example code, aren’t all of the assert np.allclose(...) checks supposed to pass without raising an error? When I run it, only the first one passes on my machine.
Thanks for the feedback! I’ve added a section that hopefully helps clear things up (“Tuning” https://jott.live/markdown/interactive_optimization#tuning), but please let me know if anything is unclear.
Re: assertion failures: 😨. Would you be able to highlight which section fails? I may have included old features that don’t work anymore
That explanation was super helpful, thank you! I will submit an issue on github with the asserts that are failing on my machine, so as not to clog up this thread.
This is really interesting and looks like a useful tool.
I spent some time working on ideas for tools to help hand-tuning, ~15 years ago. This was in the HPC context, but trying to publish in more mainstream venues, I got feedback that not enough people cared about hand tuning at that level. It’s cool to see that there are new important use cases for this sort of thing now.
Interesting project! However, I find the following snippet a little confusing
a_np = np.random.randn(128, 128)
b_np = np.random.randn(128, 128)
a = lt.Tensor(a_np).to(s.m, s.k)
b = lt.Tensor(b_np).to(s.k, s.n)
c = (a * b).sum(s.k)
Which parameters am I tuning? Clearly I cannot change s.m and s.n because these are my input/output specification, but I cannot tune s.k either since s.m * s.k and s.k * s.n are fixed to be the number of elements in both matrices.
In addition, I assume by (a * b) you actually mean matrix multiplication (it can’t be element-wise multiplication, because the number of elements in a and b may not match when, for example, you are multiplying a 2 * 3 matrix with a 3 * 5 one). In that case, I think the notation (a @ b) would make more sense to most numpy users, because (a * b) does element-wise multiplication in numpy.
yep, the sizes of s.k, s.m, and s.n are all fixed so those can’t be tuned. However, the loops needed to execute this computation (e.g. looping in different orders) can be tuned. I wrote up a quick notebook to clarify: https://colab.research.google.com/drive/1Uf0DCdyUXLGica6vjb3rG-ntD4b_ZWUD?usp=sharing
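For a rough plain-numpy sketch of the idea (separate from loop_tool’s actual API; the tile size KI and the split itself are just one point a tuner could explore): the sizes are fixed, but splitting the reduction loop k into outer/inner tiles is a different schedule for the same computation.

```python
import numpy as np

# Fixed problem sizes -- these are NOT tunable.
M, K, N = 64, 128, 32
KI = 32  # tunable: tile size for the reduction loop

rng = np.random.default_rng(3)
a = rng.standard_normal((M, K))
b = rng.standard_normal((K, N))

# One schedule: split k into (k_outer, k_inner). The k_inner loop is
# handled by np.dot; a tuner could instead reorder or re-split these.
c = np.zeros((M, N))
for k0 in range(0, K, KI):  # k_outer
    c += a[:, k0:k0+KI] @ b[k0:k0+KI, :]
```

Same values as a @ b, just produced by a differently-shaped set of loops.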
(a * b) is actually point-wise multiplication. You’re right that the elements don’t match, so there’s implicit broadcasting (much like a.reshape(128, 128, 1) * b.reshape(1, 128, 128) in numpy). This is immediately followed by a reduction (sum) over the shared dimension s.k. In numpy this is inefficient, because the whole 128,128,128 array needs to be created. In loop tool you can order the loops to skip that big intermediate array entirely. It allows you to craft complex operations without worrying about performance ahead of time (as they can always be made fast later).
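To make the broadcast-then-reduce equivalence concrete, here’s the plain-numpy version (this is the inefficient path that materializes the full 128×128×128 intermediate, which loop_tool’s loop ordering avoids):

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.standard_normal((128, 128))  # shape (m, k)
b = rng.standard_normal((128, 128))  # shape (k, n)

# Broadcasted point-wise product: shape (m, k, n) -- the big
# intermediate -- then a reduction over the shared dimension k.
c_broadcast = (a.reshape(128, 128, 1) * b.reshape(1, 128, 128)).sum(axis=1)
```

The result matches a @ b exactly; only the memory traffic differs.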
Thanks for the clarification! Just want to add that (128, 128) * (128, 128) is probably not the best example for matrix multiplication. Maybe you can switch it to something like (200, 300) * (300, 500) to show a more general case, and more importantly, the difference in dimensions can help readers like me to identify which one is which :)
That’s a great suggestion - updated the post and the notebook with unique sizes!