This is very cool. The M1/M2 systems happen to have large amounts of very high-bandwidth, low-latency RAM. That kind of memory is available on GPUs, but those are extremely expensive (think $5,000+ per GPU), and Nvidia has a lot of incentive to keep them expensive since it wants to price-discriminate between AI users and videogame users. LLMs need tons of high-bandwidth RAM (running on a normal Intel CPU will be RAM-bandwidth-limited).
Apple may have inadvertently made LLMs much more accessible to many more people by making cheap CPUs with high-bandwidth RAM.
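To make the bandwidth point concrete, here is a rough back-of-envelope sketch (the specific sizes and bandwidth figures below are my own illustrative assumptions, not benchmarks): generating one token means streaming essentially all of the model's weights through memory, so a memory-bound machine tops out at roughly bandwidth divided by model size in tokens per second.

```python
# Back-of-envelope: tokens/sec upper bound ~= memory bandwidth / model size,
# since each decoded token reads (roughly) every weight once.
# All numbers below are illustrative assumptions.

model_size_gb = 3.9  # e.g. a 7B-parameter model quantized to ~4 bits

bandwidth_gb_s = {
    "typical desktop DDR4": 50,
    "Apple M1 Max unified memory": 400,
    "high-end datacenter GPU (HBM)": 2000,
}

for name, bw in bandwidth_gb_s.items():
    print(f"{name}: ~{bw / model_size_gb:.0f} tokens/sec upper bound")
```

The exact numbers do not matter much; the point is that the ceiling scales with memory bandwidth, which is why unified high-bandwidth RAM on cheap Apple silicon is such a big deal.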
I wrote about why I think this represents a “Stable Diffusion moment” for large language models here: https://simonwillison.net/2023/Mar/11/llama/
I’m very happy this is possible now, and the surveillance-loving Silicon Valley vampire doesn’t have a monopoly on this anymore.
This gives bad-quality generations.
Yeah, the model hasn’t been instruction-tuned like GPT-3 was, so you need to know how to prompt it - some tips here (I’m still trying to figure out good prompts myself): https://github.com/facebookresearch/llama/blob/main/FAQ.md#2-generations-are-bad
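One trick that seems to help with a base (non-instruction-tuned) model is phrasing the prompt as the start of the text you want continued, rather than as a question or command. A minimal sketch (the prompt strings are my own examples, not from the FAQ):

```python
# With a base completion model, a bare instruction often rambles:
instruction_style = "Write a short poem about the ocean."

# Framing the prompt as a prefix to be continued usually works better,
# because the model is just predicting what text comes next:
completion_style = "Here is a short poem about the ocean:\n\n"
```

The second prompt gives the model an obvious continuation to latch onto, which is roughly what the linked FAQ is getting at.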
Almost as if they know exactly what it will be used for…