1. 24
  1.  

    1. 4

      Note that macOS reserves some of the “unified memory” as system RAM.

      By default, macOS allows up to 2/3 of this RAM to be used by the GPU on machines with up to 36 GB of RAM, and up to 3/4 on machines with more than 36 GB.

      I haven’t tried the mlx library itself yet, but when I was using LM Studio with its MLX backend, I had to run sudo sysctl iogpu.wired_limit_mb=<something> to raise the VRAM limit. I wonder if it’s possible to run mlx-community/Llama-3.3-70B-Instruct-4bit on a 48 GiB machine with this “trick”. Be aware, though, that the system will panic when you try to load a big model or use a large context size if <something> is too big.
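
      For reference, the “trick” looks roughly like this (the value here is only an illustrative assumption for a 48 GiB machine; set it too high and the machine can panic):

      # illustrative: leave ~4 GiB of the 48 GiB for macOS itself; adjust for your machine
      $ sudo sysctl iogpu.wired_limit_mb=45056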

      BTW: how do I install llm with llm-mlx under nix-darwin and home-manager? I got the following message, and llm-mlx does not seem to be available in nixpkgs (yet).

      $ llm install llm-mlx
      Install command has been disabled for Nix. If you want to install extra llm plugins, use llm.withPlugins([]) expression.
      
      1. 3

        If anyone is looking for guidance on how high it’s reasonable to set this, take a look at the configure script that ships with exo.
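
        Roughly, the idea is to read the machine’s total RAM and wire a limit that leaves some headroom for macOS. A hedged sketch of that approach (the 6 GiB of headroom is my assumption, not exo’s exact numbers):

        # hw.memsize reports total RAM in bytes; convert to MB
        $ TOTAL_MB=$(($(sysctl -n hw.memsize) / 1048576))
        # leave ~6 GiB for the OS; the headroom value is an assumption, tune it for your machine
        $ sudo sysctl iogpu.wired_limit_mb=$((TOTAL_MB - 6144))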

        1. 2

          Huh, I didn’t know that about Nix. I’d love to understand what that message means so I can mention it in the LLM documentation.

        2. 3

          I’ve been wondering when a plugin might connect these tools :) I’ll start using it straight away!

          By the way, I found that if you already have models in ~/.cache/huggingface/hub (from having used mlx-lm directly), the llm mlx download-model mlx-community/… command will skip the download and just add them to llm’s index.
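
          For example (using the model named upthread; any mlx-community model already in the cache behaves the same way):

          # models fetched by mlx-lm live in the shared Hugging Face cache
          $ ls ~/.cache/huggingface/hub
          # this reuses the cached files instead of downloading them again
          $ llm mlx download-model mlx-community/Llama-3.3-70B-Instruct-4bit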

          1. 3

            I would really like to see a speed (and memory usage) comparison of running mlx vs just ollama out-of-the-box.

            1. 6

              Ollama is built on llama.cpp, so this should be relevant:

              When comparing the popular q4_K_M quant on llama.cpp to MLX 4-bit, on average MLX processes tokens 1.14x faster and generates tokens 1.12x faster. q4_K_M is what most people would be using.

              That was before the small model performance improvement mentioned in the article.
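
              If you want rough numbers on your own machine, an unscientific sketch (the model names below are just assumptions; substitute whatever you actually run on each side):

              # time the same prompt through the MLX plugin and through Ollama
              $ time llm -m mlx-community/Llama-3.2-3B-Instruct-4bit "Write a haiku about Macs"
              $ time ollama run llama3.2:3b "Write a haiku about Macs"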

            2. 1

              Assuming you’re still on the Mac with 64 GB of RAM you previously mentioned, what’s the largest context window you’ve managed to use so far with this solution?