I focused on Ollama in this article because it’s the easiest option, but I also managed to run a version of Llama 3.3 using Apple’s excellent MLX library, which just celebrated its first birthday.
As someone who likes to play around with little effort, no experience and modest hardware, is there a comparison of what’s available and doable on, let’s say…
A raspberry pi, an X1 thinkpad, a 16gb MacBook pro M1, a maxed out M4, a gaming workstation with a recent-ish nvidia gpu?
The biggest limiting factor is memory. The Llama 3.3 model I ran from Ollama seems to need just over 40GB. The Llama 3.2 1B model should run in 2GB or 4GB of RAM, and I’d expect the 3B model to run in 4GB or 8GB.
There are so many models and so many ways to run them (lots of different quantized versions) that it’s very hard to provide an answer more useful than “try it and see”.
Disk space is tricky too, I have 1TB and constantly run out of space for experiments!
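A rough rule of thumb, if you want numbers before downloading anything: the weights take about parameter count × bits per weight ÷ 8 bytes, plus headroom for the KV cache and runtime. A minimal sketch in Python, where the 1.2 overhead factor is my own assumption rather than a measured figure:

    # Back-of-envelope memory estimate for a quantized model.
    # The 1.2 overhead factor for KV cache / runtime is an assumption.
    def approx_gb(params_billion, bits_per_weight=4, overhead=1.2):
        weight_bytes = params_billion * 1e9 * bits_per_weight / 8
        return weight_bytes * overhead / 1e9

    for name, size in [("Llama 3.2 1B", 1), ("Llama 3.2 3B", 3), ("Llama 3.3 70B", 70)]:
        print(f"{name}: ~{approx_gb(size):.0f} GB at 4-bit")

That lands around 42GB for the 70B model, which lines up with the “just over 40GB” figure above; real usage climbs further once you add a long context.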
Are you talking about video memory or CPU RAM? Are you running AI on the CPU only? Is the performance acceptable with a decent CPU? How many tokens per second are we talking about on a reasonably modern computer? I always thought that using the CPU is so painfully slow that it’s not worth bothering with. At least, that seemed to be the case with Stable Diffusion.
In the context of the article, they’re the same thing: Apple’s M-series silicon has a unified memory model.
Right - that’s the great thing about Apple Silicon for running models (now that the various libraries are in good shape - llama.cpp and MLX both made all the difference there).
Is there any way to run a large model on discrete graphics cards without a unified memory model? Ordinary commercial cards only seem to go up to 24GB of VRAM. I really don’t like Apple’s ecosystem, but I have to admit, 64GB of potential VRAM is an extraordinary thing on a commercial device.
Some inference engines (TGI, vLLM, etc.) support sharding (through e.g. tensor parallelism), so you can do inference across multiple GPUs. Though for best performance you want NVLink, since the cards need to synchronize at several points in each layer (e.g. after the down projection in a feed-forward block). I have written a bit about how it works at the model level here:
https://danieldk.eu/Tensor-Parallelism
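For intuition, here’s a toy NumPy sketch of the feed-forward split described above, with the “devices” simulated as plain array slices (no real GPUs or NVLink involved): split the up-projection by columns and the down-projection by rows, then sum the partial outputs, which is the all-reduce step.

    # Toy tensor parallelism for one feed-forward block, simulated on CPU.
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff, n_dev = 8, 32, 2

    x = rng.standard_normal((1, d_model))
    W_up = rng.standard_normal((d_model, d_ff))
    W_down = rng.standard_normal((d_ff, d_model))

    def relu(a):
        return np.maximum(a, 0)

    # Reference: the whole layer on one "device".
    ref = relu(x @ W_up) @ W_down

    # Sharded: each device holds a column slice of W_up and the matching
    # row slice of W_down, so the activation never needs to be gathered.
    shard = d_ff // n_dev
    partials = []
    for i in range(n_dev):
        up_i = W_up[:, i * shard:(i + 1) * shard]
        down_i = W_down[i * shard:(i + 1) * shard, :]
        partials.append(relu(x @ up_i) @ down_i)

    # The "all-reduce" after the down projection: sum the partial outputs.
    out = sum(partials)
    assert np.allclose(ref, out)

The assert passing is the whole point: the sharded computation is exact, and the only communication needed is that final sum, which is why the interconnect (NVLink vs. plain PCIe) matters so much.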
How about outside the context of an Apple? A Mac Mini with 64GB of RAM is $2000, but I can add another 32 GB of RAM to my Linux workstation-class laptop for $60.
One option would be the NVIDIA Orin AGX, although they’re not cheap either. The CPU cores aren’t great compared to the M4 Mac Mini but the GPU is pretty good. You can get them with 64GB of unified RAM. I don’t have any benchmarks or prices handy, but I’d guess you’d still get a lot more bang/$ with the Mac Mini.
Edit: One potential advantage, though, is that the Orin AGX just runs plain old CUDA and TensorRT, so it might be easier to port existing models to it.
You can’t really run models from CPU RAM. As I understand it, the memory bus is too slow and current CPUs don’t have purpose-built inference hardware. Your best bet is PCIe NVIDIA cards: 48GB of VRAM for just under €8,900. Yeah.
There are some GitHub repos that manage to run them on the CPU, but it’s very slow and power-hungry.
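To put rough numbers on the bandwidth argument: token generation is memory-bound, since every new token has to stream all the weights through the chip once, so tokens per second is capped at roughly memory bandwidth divided by model size. A sketch with ballpark bandwidth figures (assumptions, not benchmarks):

    # Crude upper bound: tokens/sec <= memory_bandwidth / model_size.
    # Bandwidth numbers below are ballpark assumptions, not measurements.
    def tokens_per_sec_upper_bound(bandwidth_gb_s, model_gb):
        return bandwidth_gb_s / model_gb

    model_gb = 42  # roughly a 4-bit 70B model
    setups = [
        ("dual-channel DDR5 (CPU)", 80),
        ("M-series Max/Ultra unified memory", 400),
    ]
    for setup, bw in setups:
        print(f"{setup}: <= {tokens_per_sec_upper_bound(bw, model_gb):.1f} tokens/sec")

That’s roughly 2 tokens/sec on a CPU at best versus something usable on Apple Silicon, which matches the general experience people report.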
If you want to play with this model and don’t have a Mac with 64GB of RAM, you can run it on together.ai. They also have all the other open-source LLMs. There is a generous free tier, and after that the costs are extremely low (orders of magnitude cheaper than the big companies).
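If you go that route, they expose an OpenAI-compatible endpoint, so something like the sketch below should work. The base URL and model id are my assumptions from memory, so check the together.ai docs and model listing for the current values, and set TOGETHER_API_KEY in your environment.

    # Minimal sketch of calling Llama 3.3 on together.ai via the
    # OpenAI-compatible API. Base URL and model id are assumptions.
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["TOGETHER_API_KEY"],
        base_url="https://api.together.xyz/v1",
    )
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # assumed model id
        messages=[{"role": "user", "content": "Say hello in five words."}],
    )
    print(resp.choices[0].message.content)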
Did you notice a speed difference between ollama and mlx?
They felt about the same to me but I didn’t measure, I really need to start recording that.
On the topic of speed, am I interpreting correctly that the last code snippet took a bit less than 1.5 minutes (end to end, excluding the time you spent writing the prompt, of course) to generate?
Yeah, that’s about right. I left it running while I was making the tea.
I tried both for the recent Qwen2.5-Coder-32B-Instruct, and MLX was faster by maybe 20%, but I was using different 4-bit quantizations based on what was recommended by Ollama (q4_K_M) and what was most popular on the mlx-community Hugging Face org (q4_0). I don’t know how much of the effect can be explained by the quantization, but the MLX version also used a bit less memory.
For those wanting to try MLX using a friendly interface, LM Studio has MLX support.
I second this; it’s a really convenient way to run a local OpenAI-compatible MLX server to use with your tools of choice.
If you prefer simple to easy, make a python venv, add the mlx-lm package, and run mlx_lm.server.
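Roughly like the sketch below. The port (8080 is the default, as far as I recall) and the model name are assumptions worth double-checking against mlx_lm.server --help and the mlx-community org on Hugging Face.

    # Shell setup (run once):
    #   python -m venv venv && source venv/bin/activate
    #   pip install mlx-lm
    #   mlx_lm.server --model mlx-community/Llama-3.3-70B-Instruct-4bit
    #
    # Then talk to the local OpenAI-compatible endpoint:
    import json
    import urllib.request

    payload = {
        "messages": [{"role": "user", "content": "Write a haiku about unified memory."}],
        "max_tokens": 200,
    }
    req = urllib.request.Request(
        "http://127.0.0.1:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as response:
        print(json.loads(response.read())["choices"][0]["message"]["content"])

Since the endpoint speaks the OpenAI chat format, the same client code works whether you point it at LM Studio, mlx_lm.server, or a hosted provider.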
Now, what’s the best setup for someone who wants to train models (though much smaller than GPT class, maybe a few million parameters)?
A few million parameters is nothing; an old NVIDIA GPU or an Apple Silicon GPU will do.
Pay for some time on H100s somewhere like replicate or another GPU farm. The power of doing this quickly and then being able to try something else quickly is probably worth it.
Wishing I went for the 64GB model too. Though 24GB has been useful for trying out smaller models, especially while offline or with low connectivity.
I got a heavily discounted Mac Studio with M1 Ultra. It’s really great for running local LLMs (I have the 64GiB version).