I’m a bit confused about layering. I’ve played with llama.cpp and it has Metal offload and so mostly runs on my GPU. Is that something done in llama.cpp or in GGML?
I believe it’s GGML, but it’s not automatic. I think GGML will use the CPU by default unless it’s explicitly told to offload to the GPU.
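To make the "explicitly told" part concrete, here is a rough sketch of how the offload request looks from the llama.cpp command line, assuming a build where GPU offload is opt-in as described above. The `-m`, `-p` and `-ngl` (`--n-gpu-layers`) flags are llama.cpp's; the `./main` binary name, the model path and the helper function are placeholders for illustration, not anything from the project itself.

```python
import shlex


def llama_cpp_command(model_path, prompt, gpu_layers=0, binary="./main"):
    """Build an argv list for a llama.cpp run (hypothetical helper).

    gpu_layers=0 adds no offload flag, so the run stays on the CPU in an
    opt-in build. A positive value asks llama.cpp to offload that many
    layers to whichever GPU backend it was compiled with (Metal here).
    """
    argv = [binary, "-m", model_path, "-p", prompt]
    if gpu_layers > 0:
        argv += ["-ngl", str(gpu_layers)]
    return argv


# Placeholder model path; substitute whatever file you actually have.
cpu_only = llama_cpp_command("models/model.bin", "Hello")
offloaded = llama_cpp_command("models/model.bin", "Hello", gpu_layers=1)

# Print the command lines instead of running them, so this works
# without llama.cpp installed.
print(shlex.join(cpu_only))
print(shlex.join(offloaded))
```

The only point is that offload is a per-run request passed down from the front end; nothing here shows how GGML decides what ends up on the GPU.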
How’re you finding Metal performance? For 13B 4-bit quantized models on my 16GB M1 MacBook Air, I only see a minor improvement, but I’m considering an upgrade to a newer machine so that I can toy around with larger models locally.
I haven’t done extensive profiling, but on the M2 Max the CPU is mostly idle. The GPU has brief spikes to 100% at the start of generating output, then sits at under 50% while emitting the rest. Stable Diffusion has CoreML offload and takes 10-30 seconds to generate an image.
I had a chance to try it. The instructions were a bit painful: in-tree builds for everything (cmake . is a terrible idea). Substituting pipenv for pip at least meant I didn’t pollute my whole system with Python things.
I tried uploading an image and asking about it, but the chat just said ‘I’m sorry, but I cannot see any image in the provided code. Please provide a valid image code so I can assist you better’ (the image is displayed correctly and was uploaded as a standard JPEG, nothing special).
While running, it seems to use a single CPU core. Other things using GGML happily scale out across multiple cores, so I’m not sure what’s going on there.
Trying it with some text-only things was a bit disappointing. After the fun I had with LLaMa trying to translate ‘twelve bread rolls please’ into French, I tried it here. The answer was ‘Encore mes fraises s’il vous plait’ (more strawberries for me please!). Apparently open source LLMs have something against petits pains. This one doesn’t have an opinion on brioche though, and I know from LLaMa that nothing is better than brioche.