I’ll be writing up a blog post soon that goes into a lot of detail about exactly how I made this library, step by step.
LMK if there is anything that you’d like covered in the post, or if you have any questions.
> Do you think that a rust-based library can outperform LibTorch? If so, in what ways?
You should be able to get close to PyTorch on CPU pretty easily; l2 is already within about 3x of PyTorch's speed for most ops. (I don't think there's much of a speed difference between PyTorch and LibTorch, since they should share the same C++ backend.)
I added BLAS-accelerated matmul to l2 at the last minute and hacked together some really messy code with a whole bunch of transposes (since I store my tensors in row-major order and the Fortran BLAS lib wants them in column-major). It made my matmul about 100x faster than a naive for-loop implementation, and within about 30us of PyTorch, IIRC.
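As an aside, you can usually avoid the transposes entirely: the bytes of a row-major matrix X are exactly the column-major layout of Xᵀ, and C = A·B implies Cᵀ = Bᵀ·Aᵀ, so you can call a column-major GEMM with the operands swapped and get a row-major result for free. Here's a sketch of the trick with a naive Rust loop standing in for the actual Fortran routine (the function names are mine, not l2's):

```rust
// Stand-in for a Fortran/column-major GEMM (e.g. sgemm): computes C = A * B
// where all three matrices are stored column-major. A is m x k, B is k x n.
fn gemm_col_major(m: usize, k: usize, n: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0;
            for p in 0..k {
                // column-major: element (row, col) lives at col * nrows + row
                acc += a[p * m + i] * b[j * k + p];
            }
            c[j * m + i] = acc;
        }
    }
}

// Row-major matmul via the column-major routine, with no copies or transposes:
// row-major A's memory *is* column-major A^T, and C = A*B <=> C^T = B^T * A^T,
// so we swap the operands and the dimensions.
fn matmul_row_major(m: usize, k: usize, n: usize, a: &[f32], b: &[f32]) -> Vec<f32> {
    let mut c = vec![0.0; m * n];
    // Compute C^T (n x m, column-major) = B^T (n x k) * A^T (k x m):
    gemm_col_major(n, k, m, b, a, &mut c);
    c // this buffer is C (m x n) in row-major order
}
```

With a real BLAS binding you'd make the same swapped call into `sgemm` instead of the inner loops, so the 100x speedup stays and the transpose copies go away.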
As for GPUs, if you use CUDA and cuDNN kernels properly (that's a little out of my depth rn), it should be pretty close to PyTorch's speed as well.
The really interesting part (imo), where Rust could shine, is the problem all the major ML libraries are working on rn: making fast, efficient code run on the GPU without the need for carefully handwritten cuDNN kernels.
PyTorch has its JIT, which builds an IR of the scripted function and optimizes it. The TF team is working on XLA, another IR that's supposed to optimize graphs by fusing ops together and preventing a whole bunch of expensive reads/writes to the GPU's memory. JAX is doing something similar: it builds on top of XLA and gives you a kinda functional approach to ML. Swift4TF was (is?) a really ambitious project that wanted to make autodiff and XLA optimizations a core part of Swift. I think Julia is also working on approaches that are quite a bit like Swift4TF, but I haven't looked into it too much.
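The fusion these IRs do can be shown with a toy example: computing `relu(a + b)` as two separate "kernels" materializes an intermediate buffer and makes an extra full pass over memory, while the fused version does one pass. This pure-Rust sketch (my own illustration, nothing to do with l2's internals) shows the difference; on a GPU the intermediate would be a full round trip through device memory, which is exactly what XLA-style fusion avoids:

```rust
// Unfused: materializes an intermediate Vec for (a + b), then reads it
// back to apply relu -- two passes over memory and an extra allocation.
fn add_relu_unfused(a: &[f32], b: &[f32]) -> Vec<f32> {
    let sum: Vec<f32> = a.iter().zip(b).map(|(x, y)| x + y).collect();
    sum.iter().map(|&x| x.max(0.0)).collect()
}

// Fused: one pass, no intermediate buffer. This is (roughly) the
// transformation an op-fusing IR applies automatically to chains of
// elementwise ops.
fn add_relu_fused(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| (x + y).max(0.0)).collect()
}
```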
Now to Rust: Rust has a big emphasis on safety, and I think it has a lot of potential in ML, where you could take it down the XLA/Swift4TF route and try to make a “compiler” for tensor ops.
Another thing that I wanted to work on, but really didn’t have the time for in this version of l2, was using Rust’s const generics (still in nightly, but hopefully being finalized soon) to run compile-time shape checks on tensors and on the outputs of tensor ops.
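A minimal sketch of what I mean (hypothetical API, not what l2 actually has): encode the shape in the type, and matmul's signature only accepts an operand whose row count matches our column count, so mismatched shapes become type errors instead of runtime exceptions.

```rust
// Hypothetical sketch: compile-time shape checks via const generics.
struct Matrix<const R: usize, const C: usize> {
    data: Vec<f32>,
}

impl<const R: usize, const C: usize> Matrix<R, C> {
    fn zeros() -> Self {
        Matrix { data: vec![0.0; R * C] }
    }

    // (R x C) * (C x K) -> (R x K). The inner dimensions must agree for
    // this to typecheck, so the compiler enforces the shape rule.
    fn matmul<const K: usize>(&self, other: &Matrix<C, K>) -> Matrix<R, K> {
        let mut out = Matrix::<R, K>::zeros();
        for i in 0..R {
            for p in 0..C {
                for j in 0..K {
                    out.data[i * K + j] += self.data[i * C + p] * other.data[p * K + j];
                }
            }
        }
        out
    }
}
```

With this, `Matrix::<2, 3>::zeros().matmul(&Matrix::<3, 4>::zeros())` compiles fine, while passing a `Matrix<4, 5>` is rejected at compile time, exactly the class of bug Python only catches when the op runs.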
Python is a really nice language since it lets you focus more on making your code do what you want and less on how to write that code. But I’ve found a lot of bugs in my Python code where I end up getting a runtime exception about two tensors not being compatible/broadcastable, and imo catching that at compile time would be a really neat thing to build as a proof of concept, to encourage more people to work on building the future of ML libraries and infra.
This is just a brain dump of a whole lot of things that I’ve been thinking about lately, and I’m not 100% sure that everything I’ve said is correct, so I’d really appreciate it if anyone could correct me on the things that I’ve got wrong :)