It sucks that Nvidia is the king of AI hardware, because it's such a horrible company that I don't care to give it money, especially in the huge amounts AI requires. Democratising AI will involve developing cross-platform backends for running inference and training.
If you just want to run models, AMD is a viable option as well.
What makes it so bad in your opinion? Just curious
How about a well-priced consumer-grade card that gives similar performance? Is there any alternative to the Nvidia Tesla P40? Something slightly more modern, with lower power draw and without all the hacky stuff?
The RTX series has a desktop form factor and comparable memory, but it isn't exactly cheap.
You can run inference using CPU only, but you’ll have to use smaller models since it’s slower. But the P40 is the best value right now given the amount of VRAM it has.
There are several options for a consumer-grade card, but it all gets incredibly expensive really fast. I just checked for my country (the Netherlands), and the cheapest 24GB card is 949 euros new. And that is an AMD card, not an Nvidia one. While I am sure the hardware is just as good, the fact is that software support for AMD is currently not at the same level as Nvidia's.
Second-hand, one can look for RTX 3090s and RTX 4090s. But a quick check shows that a single second-hand 3090 would cost over 600 euros at a minimum here. And that does not account for those cards being really power hungry and often taking up three PCIe slots to make space for the cooling, which would have been an issue in this workstation.
Since the workstation is limited to PCIe 3.0 speeds anyway, this seemed like the best option to me. But of course, check the markets available to you to see whether there are better deals for your particular situation.
$1700 is quite a large budget. If the total cost were halved, that would still be a sizeable budget. I feel like tech writers these days are forgetting what the phrase “on a budget” implies.
I agree, it is a big sum of money. I interpret “on a budget” as “relatively cheap”, not as “nearly free”. I think it is pretty cheap compared to what one normally needs to pay for that amount of VRAM. To me, the term is more justified here than in a post where someone buys a second-hand Apple laptop for $1100 and claims it is a cheap solution to browse the web.
I really hope AMD catches up and prices come down, because AI-capable hardware is not nearly as accessible as it should be.
Maybe it would be more accurate to say “on a budget” is a form of weasel word. Its interpretation depends on your familiarity with current prices and your socioeconomic status.
From my very subjective (and probably outdated) PoV:
$1000 is a fancy high-end laptop.
$2000 buys a laptop only extremely well-paid people can justify.
$800 buys a high-end desktop.
You can imagine the surprise I felt (or was that shame?) when seeing a $1700 price tag on a “budget” desktop PC that can do AI.
“Building a personal, private AI computer for $1700” would communicate the intent a little better, without suggesting to the reader anything about their ability to afford it.
Btw, I don’t mean to imply any wrong was committed. I’m just pointing out that the wording on the post had some unintended effects on at least this reader. To a large degree that is unavoidable, no matter what a person publishes on the web.
A few weeks ago I saw a reference to someone on ex-Twitter speccing an LLM workstation for $6,000, so $1,700 is on a budget compared to that.
They get 5 tokens per second on the 70B models.
Agreed. I’m running local models with an RTX 3060 12 GB that costs about $330 on NewEgg or 320€ new at ebay.de, and it’s actually useful. The context sizes must be kept tiny but even then it can provide basic code completion and short chats.
The code they write is riddled with subtle bugs, but making my computer program itself never seems to get old. Luckily they also make it quicker to write throwaway unit tests. The small chat models are useful for language tasks such as translating unstructured article title + URL snippets into clean markdown links. They also act as a thesaurus, which is very useful for naming things, and can unblock progress when you're really stuck on some piece of code (rubber-duck debugging). Usually the model just tells me to "double-check you didn't make any mistakes" though :)
On the software side I use ollama for running the models, continue.dev for programming (it’s really janky), and the PageAssist Firefox extension for chat.
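As a concrete sketch of that "title + URL to markdown link" workflow, here is roughly what a small script against a local ollama server could look like; the model name, snippet, and prompt are purely illustrative:

```python
# Rough sketch: ask a small local model served by ollama to rewrite a messy
# "title + URL" snippet as a markdown link. The model name is a placeholder.
import json
import urllib.request

snippet = "Building a personal, private AI computer on a budget  https://example.com/post"
payload = {
    "model": "llama3.2:3b",  # any small chat model you have pulled
    "prompt": f"Rewrite this as a single markdown link and output nothing else:\n{snippet}",
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",  # default ollama endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```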
Apparently the Commodore Amiga 500 was introduced at 699 USD in 1987, just shy of 2,000 USD inflation-adjusted.
Guess that says more about how much prices for computers have come down than anything else.
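(Rough check, assuming US CPI: consumer prices have risen by a factor of about 2.7 to 2.8 since 1987, and 699 × 2.75 is roughly 1,920, which matches the "just shy of 2,000 USD" figure.)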
If you look at "modern" gaming graphics cards, that is so cheap I was actually surprised (even compared to a 30xx from some years ago).
If the median price of a thing is high, then absolute values don’t matter. A new car for under 10k EUR would still be “on a budget”, even if it’s a lot of money.
TLDR: don't use an HP workstation or you'll have to spend more trying to keep the BIOS happy.
I wonder how this setup would compare with a CPU-only Ryzen system with 128 gigs of memory. Sometimes people forget that LLMs can be run on CPU.
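For anyone who wants to try that, a minimal CPU-only sketch using the llama-cpp-python bindings might look like the following; the GGUF path, context size, and thread count are placeholders for whatever you have locally:

```python
# CPU-only inference sketch using llama-cpp-python (pip install llama-cpp-python).
# The GGUF path below is a placeholder; point it at any quantized model on disk.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=0,  # 0 = keep all layers on the CPU
    n_ctx=4096,      # modest context to keep RAM usage in check
    n_threads=16,    # tune to your core count
)

out = llm(
    "Explain in one sentence why CPU inference trades speed for cheap RAM capacity.",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```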
It’s funny when people look at the Apple machines and ignore other integrated graphics systems. They will be slower at the compute than a high-end NVIDIA card, but a mid-range NVIDIA card is limited by RAM (they can swap to main memory, but it’s slow). I’d expect the embedded GPU in a modern AMD system to be able to run these things much faster than the CPU and to just slow down as models get more complex rather than having an abrupt performance cliff when they don’t fit in on-board VRAM.
I bought a Ryzen system with only the integrated GPU, on the assumption that it would work as you said. But Ollama falls back to the CPU anyway. I wonder why.
It DOES work… sort of. I get about 3-5 extra tokens/sec on my new Ryzen 7840U when running on the integrated GPU versus plain CPU inference. And it hangs the GPU half the time. And you have to compile a fork of ollama to take advantage of the RAM. And I have to hold back my kernel because AMD broke their drivers again on the latest kernel. And it simply fails to load models that take up too large a share of my free RAM. But it DOES work! Here are the build instructions: https://nocoffei.com/?p=342 and here are some of my comments on how well it works in the unmerged PR: https://github.com/ollama/ollama/pull/6282#issuecomment-2629402216
Remember that ROCm is a completely broken disaster and has been for the last 5+ years. Also, memory bandwidth is a limitation with these iGPUs: things will only slow down more and more as you get into larger models, since you simply can't load all the weights quickly enough over these limited 128-bit LPDDR5/DDR5 buses.
Strix Halo is a very interesting part for this reason, as it has a MUCH larger iGPU with a 256-bit bus and up to 128GB of RAM. But I expect that when it comes out, AMD's engineers will be unable to make ROCm work on it, again, and everyone will go buy Nvidia Digits boxes instead.
I've seen people suggest that, for a similar price, you can run the entire model (670B) in RAM at around 1-2 tokens per second.
Is there a point? 1-2 tokens per second is still too slow.
Too slow for whom? I'm fine getting 1-1.5 response tokens per second on a 32B model. I can get ~3 t/s on a 14B model, and that's about as fast as I can read anyway.
The point is that you can run it, and it works. Why does it have to be instantaneous?
I feel very silly for blowing all my internship money on a tensorbook lmao
Live and learn I guess?
Just looked up what a tensorbook is. Looks sort of like my gaming laptop.
Yeah, it is more or less a gaming laptop marketed towards machine learning people. One thing that is somewhat cool about it is its proprietary library management system. Also, they assume you are good at tech, so when you call in and ask for help fixing something, they will often just send you instructions to do it yourself.
Not worth 4k, however.
When "ollama" uses the model id "mistral-small:24b", what exactly is the model and what quantization does it use? Does it use a single GPU or both?
Apparently a single-precision, fp16/bf16, or even 8-bit 70B is not going to fit in 2x24GB, so what exactly is used here and how?
mistral-small:24b points to 24b-instruct-2501-q4_K_M, so definitely not full precision. It is only 14GB and fits well on one card. You can find the available versions here: https://ollama.com/library/mistral-small/tags (I have not tried the other versions yet).
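If you want to check this yourself for any tag, the ollama CLI can print the model details, including the quantization level; a tiny wrapper might look like this (it assumes the CLI is installed and the model has been pulled):

```python
# Print the details ollama reports for a tag, including the quantization level.
# Assumes the ollama CLI is installed and the model has already been pulled.
import subprocess

tag = "mistral-small:24b"
result = subprocess.run(["ollama", "show", tag], capture_output=True, text=True)
print(result.stdout)
```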
I thought I read that P40 software support was uncertain moving forward. These seemed nice, but I don't want to fight with software incompatibilities. Anyone have recent personal experience they can share about that? Hopefully I read it wrong.
Tesla P40 has compute capability 6.1. A lot of newer kernels require 7.5 (e.g. FlashInfer) or 8.0 (e.g. Marlin). It depends on what you want to do, but I’d now get something with compute capability 8.0 or higher.
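If you are not sure what your card reports, one quick way to check (assuming PyTorch with CUDA support is installed) is something like this:

```python
# List each visible CUDA device and its compute capability.
# The 7.5 / 8.0 thresholds mentioned above are what FlashInfer / Marlin expect.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(f"{torch.cuda.get_device_name(i)}: compute capability {major}.{minor}")
else:
    print("No CUDA device visible")
```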
The Teslas are intended to crunch numbers, not to play video games with. Consequently, they don't have any ports to connect a monitor to. The BIOS of the HP Z440 does not like this; it refuses to boot if there is no way to output a video signal.
Is this a case where one could use a KVM switch? Would that be detected as a video output by the BIOS?
A KVM switch is connected to an existing video output; how would that help if there isn't one?
You’re right! Thanks for clarifying.
Just now realized what you might have been thinking of: there are cases where systems won't run, or won't unlock full performance, if no screen is connected ("there is no monitor, so nobody needs 3D rendering"). In such cases, connecting something that pretends to be a screen does help. Aside from KVMs, they also make dedicated HDMI dummy dongles you plug in that connect just what's needed to pretend to be a screen.
No, I was actually quite fuzzy on what a KVM plugs into, so your first response was correct. I thought they could somehow connect to a non-video port and emulate (or embed) a graphics card, so that the BIOS would believe one was present in the system.