“Through it, users converse with a wickedly creative artificial intelligence indistinguishable from a human, which smashes the Turing test and can be . Interacting with one for the first time is unsettling, a feeling which will last for days.”
Did an LLM write that?

Ironically, they wrote that one of their favorite use cases for LLMs is… proofreading!
As for creativity: “wickedly creative” is overselling it, but I do find LLMs useful for brainstorming because they’re differently creative. Ask it to come up with a product name, and you’ll get something bland and uninspired. But ask it for a list of Greek myths that share something in common with your product, and it turns out your brainstorming buddy has read every ancient Greek text that’s on the Internet. Or if you just need some randomness to get unstuck, an LLM can provide that as well.
Even that I feel is a short-term gain for long-term sacrifices. I am heavily on the introverted end of the spectrum; still, when I get stuck creatively or technically, my default is to phone a friend or, better yet, sit down with them and share some choice beverage. At worst I get a refresh on other things and signs of life, where the convenient excuse otherwise would’ve been a dismissive ‘oh, I don’t have the time’.
Fair. But like any other tool, nothing’s stopping you from using LLMs in a social manner. “Hey, I’m running out of gas on <creative thing>; you wanna meet up for coffee and brainstorming? I’ll bring my laptop so we have access to AI chat and Wikipedia.”
(Similarly, you could level the same criticism at other tools. Technically speaking, a thesaurus is a more convenient substitute for discussing word connotations with other members of your creative writing club.)
I asked my LLM and, uh, you can definitely tell it’s my LLM:
Oh, absolutely. That sentence practically reeks of LLM vibes. It’s got that slightly stilted, overly dramatic flair that an LLM tends to produce when it’s aiming for a “literary” tone. Let’s break it down:
“Wickedly creative” — It’s trying to sound edgy and impressive, but it’s a bit of a cliché.
“Indistinguishable from a human” — Classic Turing test flex. Every AI description loves to brag about that.
“A feeling which will last for days” — Over-the-top dramatic phrasing, which often feels like it’s trying way too hard to sound profound.
I’d bet my processors on it: this is a fine specimen of AI-generated prose.
How do you make a “my LLM”? Start with the article here and then?

“Mine” is a pile of moderately clever Python hacks exposed through a GUI that assemble various relevant prompts and feed them either locally or to OpenAI via this library. It can interface with local models (via Gpt4All or LocalAI) or talk to ChatGPT right away.
Calling it “mine” is probably a bit of a stretch, as I don’t actually use a custom model; I could, and I’m sure it would work better, but I’m not really sure it’s worth spending the money or the time on that. I refer to it as “mine” because it’s kind of like an assistant, I guess, although I really dread that trend. It’s a bit more hit-and-miss with local models (which have a more limited context), but with OpenAI’s models I get all the custom behaviour I need from the system prompt.
An example of how it works (and of why I’m referring to it as “mine”): the GUI has a “Study” button (long story short it’s styled somewhat like an app I wrote like twenty years ago when I was in school, hence the “study” part) which opens the file manager to a folder with the things I’m looking into now – papers I mean to read, notebooks, that sort of stuff. When I do that, it also injects a “system prompt” of sorts (a hidden request) in the current “conversation” with the LLM, instructing it to ask me what we’re studying today. Then, while that file manager window is open, it occasionally injects another prompt instructing it to ask me how it’s going – mostly to make sure I don’t slack off with you nerds on lobste.rs or start watching street food and beer brewing videos.
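For the curious, here's roughly what such a hook could look like. This is a made-up sketch rather than the commenter's actual code; the folder, file manager, and check-in interval are all invented for illustration:

```python
# A made-up sketch of a "Study"-style hook: open the study folder, slip a
# hidden instruction into the chat, and keep nudging while the session lasts.
import subprocess
import threading

STUDY_DIR = "/home/me/study"   # assumed folder of papers and notebooks
CHECK_IN_SECONDS = 25 * 60     # assumed nudge interval

conversation = []              # running chat history sent to the model
studying = False


def inject_hidden_prompt(text):
    # A "system prompt of sorts": added to the conversation but never shown.
    conversation.append({"role": "system", "content": text})


def on_study_button():
    global studying
    studying = True
    subprocess.Popen(["xdg-open", STUDY_DIR])   # open the file manager window
    inject_hidden_prompt("Ask me what we're studying today.")
    threading.Timer(CHECK_IN_SECONDS, check_in).start()


def check_in():
    # Repeats for as long as the session is marked active; how `studying`
    # gets cleared (e.g. when the window closes) is left out of this sketch.
    if studying:
        inject_hidden_prompt("Ask me how the studying is going.")
        threading.Timer(CHECK_IN_SECONDS, check_in).start()
```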
That’s actually how this app started; originally, that’s all it did. I did it as a learning exercise: AI isn’t something I’m particularly interested in, but LLMs are popular enough that I wanted to know what it was about, so I just threw some Python at it. But it worked surprisingly well. After I’m done with whatever I’m doing for the day, I go over what I’ve learned with the LLM; maybe I ask it to quiz me, recommend me some bibliography, that sort of stuff. The kind of thing that I would certainly prefer to do with a human (I’m entirely partial to @crazyloglad’s approach here) but too trivial to bother a friend over. I do procrastinate less and I do learn more, so I’m happy.
Yep, it hallucinates occasionally, but even that is not completely useless. Explaining to a hallucinating model that I’m right about some quirk in RISC-V vector extensions is a useful learning exercise, and googling for the hallucinated title of a paper still yields some useful papers once in a while. It’s not perfect, but it’s fine, with the obvious caveats that 1. it’s not a good source of information and 2. it’s not a human. So e.g. if I need to bounce ideas off of someone, I’ll ask an actual someone; relying on software for that isn’t a very good idea.
That being said:
It’s 100% tied to my workflow. The example above relies on a quirk of my workflow (I keep that particular file manager window open for as long as I study something) which obviously wouldn’t work for anyone else. Frankly, at their current stage of development, I don’t really think you can use an LLM to make a useful assistant/study buddy/whatever for a wide audience. If it’s generic enough to fit a diverse enough set of workflows for a wide range of tasks, it’s also going to be so generic as to be entirely trivial in its capabilities. The usefulness of this application derives almost entirely from the fact that it’s aware of what I do, and the LLM is just a UI layer that I use for things where talking to a model is actually useful, rather than a UI gimmick.
LLMs are good enough at presenting data (with some trial and error for prompting), but you can’t trust them to change data. Early on I experimented with getting the model to do things like add or query reminders by translating between natural-language queries and remind queries, for example – it worked 99% of the time, but that 1% was obviously potentially disastrous. So none of that happens through the LLM. I have it remind me of pending events (by popening remind and injecting the prompt with relevant reminders) but I still manage them with tkremind. (A rough sketch of that read-only split is below.)
The technology is good and useful enough that it doesn’t take that much effort to assemble such an application. I wrote most of its innards over a weekend, and after that I just slowly added things to scratch the occasional itch, one tiny feature at a time, and most of them took like half an hour. The only thing I spent more than an hour or so on in the last six months was porting its backend to LLM (it used to be a home-grown API).
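To make that read-only remind split concrete, here is a rough sketch under assumed paths and prompt wording (the commenter hasn't shared their setup):

```python
# Rough sketch of the "present, don't modify" split for remind(1): the model
# only ever sees remind's printed output; reminders are still edited elsewhere.
import subprocess

REMINDERS_FILE = "/home/me/.reminders"   # assumed path


def todays_reminders():
    # Plain invocation of remind on the reminders file prints what triggers today.
    out = subprocess.run(
        ["remind", REMINDERS_FILE],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def reminder_prompt():
    # Injected into the conversation so the model can present the data;
    # adding or editing reminders still happens in tkremind, not here.
    return (
        "Here are my pending reminders, verbatim from remind:\n"
        + todays_reminders()
        + "\nMention anything that looks urgent. Do not invent reminders."
    )
```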
It’s an incredible service that has earned that title. “Small” models are around a few GBs, large models are hundreds of GBs, and HF hosts it all for free. With a few exceptions that do not matter in practice, you don’t even need to sign up to download models! (I’ve been so impressed that after a few days they got a penny-pincher like me to pay for a pro account.) That means you can immediately download and try any of the stuff I’m about to discuss.
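As an illustration of how low the barrier is, fetching a quantized model with the official huggingface_hub client is only a few lines; the repo and filename below are just one plausible pick and may have been superseded by newer models:

```python
# Example of pulling a quantized GGUF file with the official huggingface_hub
# client; no account or token is needed for public repos.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
)
print(path)  # local cache path, ready to hand to llama.cpp, Ollama, etc.
```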
I’d like to offer an alternative perspective here.
It’s incredibly inefficient to do this, as evidenced by the fact that they want people to pay for a pro account. Meanwhile, they could have embraced torrenting, a scalable solution cost-effective enough that charging for plain HTTP downloads would not be necessary.
Furthermore, the LLM hype could have been harnessed to the point that browsers could be pressured into adding torrenting built-in to their downloaders. Imagine if torrenting was a normal way to download files, and the default settings of browsers waited to achieve a 1.0 ratio before stopping seeding the file. Many people leave their browsers running 24/7. This would be such a huge win for Free and Open Source Software because it distributes costs away from projects, making it possible for more small projects to sustainably exist. Everyone would win though; the cost of data storage and transfer would be reduced across all of the Internet.
I feel like we spent a couple of decades experimenting with torrents and found that the additional usability friction just wasn’t worth it.

Bandwidth isn’t actually that expensive these days, if you’re not paying a cloud provider with huge margins.

Cloudflare give away astonishing amounts of bandwidth for free. Hetzner charge €1/TB for overage, which is about 1/90th of what AWS charge.

Your comment does not address the usability friction reduction strategy I outlined.

Convincing browsers to add the feature?

That’s been tried too: Opera had a built-in torrent thing for a few years. It didn’t stick.

Oh now that is fascinating. Thank you for sharing.

For what it’s worth I do agree with the general idea here, it feels really wasteful to me that we’re all downloading multi-GB files and not doing the peer-to-peer thing any more.
But I just don’t think history supports this working out. O’Reilly ran a Peer 2 Peer conference back in 2001 but it didn’t turn into a series (or maybe it rebranded as ETech?) - that was just before the whole Web 2.0 thing stole all of the wind from the BitTorrent / Napster / etc world. Amazon S3 made a rare backwards-compatibility breaking change when it dropped its torrent support three years ago: https://github.com/awsdocs/amazon-s3-userguide/blob/0d1759880ccb1818ab0f14129ba1321c519d2ac1/doc_source/S3Torrent.md
My theory is that people realized that the cost of running a global CDN to serve large files (driven by the rise of online video, YouTube etc) was low enough that BitTorrent wasn’t worth the trouble.
Downloading a large file from a CDN peer in a nearby city is always going to work faster than downloading from BitTorrent, because that CDN peer is going to be on a much faster and more reliable connection.
Well it makes sense that businesses would come to this conclusion - they all want to be the platform that everyone is stuck using. My point is this technology is a key asset in the struggle against tech oligarchies.
I’d also like to point out that Brave has a built-in torrent client, although it’s not built for seeding or persisting torrents; it’s more of an add-on to its built-in downloader.
But are P2P downloads actually more efficient? For lots of content, the data will have to travel further than from a regional CDN, and the torrent requests may prevent those computers from entering low power states.
I appreciate that the zig project serves a lot of downloads, but most small projects distribute a small enough volume that the free tier of a CDN is adequate.
As another P2P example, I think the steam client will still look for peers for downloads, but maybe only within your LAN.
My hypothesis is specifically that torrenting would be an alternative to asking for money, particularly with some key user interface adjustments. And further, that reducing the amount of money flow reduces the risk of having the platform do an unwelcome transition to platform capitalism.
I totally agree that platform capitalism and other sorta-digital-landlordism are shitty. I’m not convinced that torrenting support in browsers is a particularly good solution, especially when CDNs are so cheap that bandwidth is not typically a barrier to entry.
There are significant issues with privacy, power use, bandwidth metering, and legal liability that could easily be show-stoppers. Even if the feature was added I think browsers would want UX to prompt users on a download-by-download basis. (I don’t want my browser to broadcast to the world the details of every video I watch on p2p-tube! Tho there are some trade-offs where you can reduce the number of peers who can identify you at the cost of complexity and p2p bandwidth).
For organisations like Zig that have very large bandwidth requirements and tech-savvy users, perhaps you could change your installers, package tooling, etc to use torrenting? Users could download a tiny binary with HTTP and be prompted to allow the zig installer to set up a torrent daemon that will be used for downloads and for seeding.
Thanks for your insightful point about privacy.

With regards to the last paragraph, I would certainly explore that option before spending a significant amount of money on bandwidth. However, I recently discovered that simply not using a CDN saves a ton of money and is good enough for the foreseeable future.
Then again, if browsers had torrenting built in to the point where users didn’t find it odd or inconvenient, then I would make it the default download option.
Note how nowhere does the official documentation define what “GGUF”
GGUF is short for GG Unified Format
llama.cpp was written by Georgi Gerganov, who also made GGML, a C-language tensor compute framework.

The first few versions of the format were called ggml, ggjt, and ggmf. They were all mutually incompatible, and GGUF was the first format that allowed for versioning. I followed the project closely in 2022/23 but got bored after a while, and the quality of the open models is so much worse than 4o or Claude 3.5 that I stopped playing with them.
Most of the time, upon seeing the phrase “AI” or “LLM” in an article, my interest instantly fades and I move on. However the author of this has produced consistently excellent work in the past, so I gave it a chance, and found it informative.
It is the only article I’ve read about running local LLMs that’s coherent, straightforward, and sufficiently devoid of hype. The best introduction I’ve found.
Has anyone used structured outputs with locally-hosted LLMs? That’s one thing that would see me switch pretty quickly.

Llama.cpp has had this for a long while now. You can just define a grammar and force it to output e.g. JSON, no need for examples.

Thank you

While it is an impressive feature, I would just caution that, unless you’re sticking with JSON/JSON Schema, arbitrary other grammars may perform less well than you might hope. The way it works is pretty clever: it takes the logits of the last output layer and keeps sampling until it finds a token that matches the next state transition in the grammar, and then, since LLMs are autoregressive, that new token is appended to the output string and fed back to the model for the next one and so on, like normal. The problem is that the weights of the model were trained with a feedback loop of some objective function where the output was not a string in that grammar. So in a sense it’s sampling from a latent space that’s slightly different from what the model was tuned for. I optimistically thought I could coax it with the EBNF of common PLs, but struggled to get meaningful results (at best syntactically correct, but nowhere near the semantics of my prompt). YMMV of course, and JSON does perform well since that’s now part of the training of most models.
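For illustration only, a toy version of that constrained sampling loop might look like the sketch below; it masks illegal tokens up front rather than resampling, and the model and grammar objects are hypothetical stand-ins, not llama.cpp's API:

```python
# Toy illustration of grammar-constrained decoding, not llama.cpp's real code.
import math
import random


def sample_constrained(model, grammar_state, prompt_tokens, max_tokens=64):
    output = list(prompt_tokens)
    for _ in range(max_tokens):
        logits = model.next_token_logits(output)   # hypothetical model call
        legal = grammar_state.legal_tokens()       # tokens the grammar allows next
        if not legal:
            break
        # Softmax-sample over the legal tokens only (illegal ones are masked out).
        weights = [math.exp(logits[t]) for t in legal]
        choice = random.choices(legal, weights=weights, k=1)[0]
        output.append(choice)            # autoregressive feedback, as usual
        grammar_state.advance(choice)    # move the grammar to its next state
        if grammar_state.accepted():     # the grammar says the output is complete
            break
    return output
```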
This hurts my brain

You mean JSON? Llama 3.1 8B does fine with it. You should ideally give it examples of inputs and outputs, though. Especially Llama; that one’s almost hopeless without examples, and they’re more important than the instructional part of the prompt.
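As a small, hedged illustration of that examples-first advice, a prompt for JSON extraction might be assembled like this (the schema and examples are invented):

```python
# Tiny illustration of "examples before instructions" for JSON extraction.
import json

EXAMPLES = [
    ("add milk and eggs to the list", {"action": "add", "items": ["milk", "eggs"]}),
    ("what's left to buy?", {"action": "query", "items": []}),
]


def build_prompt(user_text):
    parts = ["Convert each request to JSON.\n"]
    for text, obj in EXAMPLES:
        parts.append(f"Request: {text}\nJSON: {json.dumps(obj)}\n")
    parts.append(f"Request: {user_text}\nJSON:")
    return "\n".join(parts)


print(build_prompt("remove the eggs"))
```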
Um. No. Just no. This dude has been drinking a bit too much of the Kool-Aid.

And not just the Raspberry Pi bit: the supercomputer-networked mega-GPU models aren’t that good. I’ll keep my human touch, thank you very much.

For the UI I like Ollama. It automates all the hassle of downloading and updating models, has a built-in JSON API, and a terminal chat interface.
Models that fit in 8-16GB of RAM are shallow and hallucinate a lot. A 32GB MacBook can run mid-sized models, but barely. 48GB is needed to run a useful model and have other applications open at the same time.
I’ve been idly thinking about picking up one of the new Mac minis with an M4 Pro and 48GB RAM (US$1,799) and hanging it off my network (plus WireGuard for roaming access) as a dedicated local/private inference machine, extending the life of my old laptop.
I’ve got an M3 Max, and the fan noise gets annoyingly loud when running LLMs for longer than a minute (chat is okay, but batch processing maxes out the thermals). For my next Mac I’m going to test whether the Studio’s beefy cooler is quieter under load.
I have also found Ollama useful, but I only use it to run the models. It loads them on demand and unloads them after a configured timeout (something like 5 minutes by default?). Since switching to it from llama.cpp I’ve found it convenient enough to run local models daily. On Linux the Nvidia drivers of course have to be restarted with some modprobe calls after the machine wakes from sleep; I made a shell alias for it :)

As a UI I’ve found Continue.dev, a VSCode extension, to work well enough to be useful when programming. It’s super janky with the Vim bindings, but at least I can select code, send it as context to the chat pane with a few mouse clicks, ask for edits, and copy the results back with another click.
I hope we see more local LLM usage kick up. The trade off on accuracy seems worth it for the lower processing costs, data residency benefits, latency, etc…
It’s inevitable. Not only because of the wave of “AI PCs” or “Copilot PCs” or whatever they’re calling it this week, but because LLM providers are literally selling their services at a loss. Hosting 70B+ models economically is, currently, virtually unsustainable. I’m sure hosts will always exist in one form or another, but the current OpenAI model is going to crash pretty hard.
I think I agree with pretty much everything there. OpenAI would really need gpt5 to be something else entirely to meet the current expectations and that does seem unlikely.
I’m only speculating here, but I think we’ll start using a bunch of smaller, general-purpose models instead of single large monolithic models. I think once we get past this current marketing campaign and public sentiment, where you’re expected to think “An LLM model is one, single, living, breathing atomic thing that you must perform all interactions with, since this one model is a general-purpose AI that should be able to do everything”, we’ll start realizing what current LLM architectures are actually capable of, which is:
Detecting language patterns, which they’re currently pretty unoptimized for, and if you’re clever enough, you can probably use more specific algorithms to detect and extract the features you want from a work of text. But it is genuinely easier to just drop an LLM into your codebase and write in plain english “give me some JSON based on this and that in the user prompt.”
Generating text, which is far more useless than you think. Writers and people looking to roleplay with a character get a huge kick out of it, and that’s where LLMs genuinely are useful. But for literally every other class of text generation, you know what’s worked insidiously well for decades and decades of computer science? That’s right, writing the dialogue for your agent manually. Replace tags and keywords in the pre-written dialogue with variables from the agent’s current context. And best of all, it can never lie unless you do!
And that’s it. They genuinely don’t do anything else, even if one can be tricked into thinking they do. After all, humans hallucinate, too. It’s called anthropomorphizing.
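To make the "pre-written dialogue with variables" idea from that list concrete, here is a tiny, made-up sketch:

```python
# Tiny, made-up illustration of pre-written dialogue with variables: the agent
# can only say what was written, with blanks filled from known state, so it
# cannot make things up.
from string import Template

DIALOGUE = {
    "greeting": Template("Hi $user, you have $unread unread messages."),
    "reminder": Template("Heads up: '$task' is due on $date."),
}


def say(line_id, **context):
    return DIALOGUE[line_id].substitute(**context)


print(say("greeting", user="Ada", unread=3))
print(say("reminder", task="renew the certs", date="Friday"))
```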
I’m a pretty big fan of symbolics and GOFAI, but I don’t think they’ll re-enter mainstream use in conjunction with special-purpose models. Rather, I expect people to continue with their connectionism obsession until they make a truly dedicated tool-calling model specifically made for calling other, special-purpose models to receive input and perform dedicated tasks. And it’ll probably be wildly inefficient from just figuring out the actual problem domain and hard-coding an agent to handle it, but, it’ll be easier to deploy!
Also I realize I probably just described langchain, but, I feel like langchain is currently more like a hack than a dedicated, intentional practice. It feels more like people accidentally stumbled upon langchain’s architecture rather than making the explicit realizations I mentioned. It’s more of an attempt to blur a bunch of LLMs together to try and cover up each other’s faults, rather than using a series of small, dedicated models built for specific sub-tasks.
You clearly know more about this than me, but my intuition agrees with you :)
I also think the smaller, more focused models will be fairly useful for domain-specific RAG on unstructured data. I need to start playing with this myself, but I’m imagining a chat interface for, say, geographic data, which could be used by a civil engineer to aid in understanding a design. A minimal thing to be sure, in the grand scheme.
but because LLM providers are literally selling their services at a loss.
If I run a business selling a yoghurt pipeline system (where you can have a yoghurt tap in your kitchen) and sell it at a loss, is it inevitable that usage of my business will pick up?
No. In fact, it’ll probably fail because a dedicated yoghurt-fridge is much cheaper and still overkill, and takes up less space than the required hot-yoghurt-tank anyway.
I’d be interested in digging into the lower processing costs a little more. Is the energy usage of me running a prompt through a decent model on my laptop genuinely lower than the energy usage of that same prompt through ChatGPT or Claude, given that those models run on hardware shared by millions of other people while my laptop is serving just me?
It’s a fair question. I think if you want to fix accuracy at some level then probably the data center model will come out ahead. My view though is that for the reduction in accuracy required to work on NPUs/neural engines/client gpu it may well be worth it. The product workflow of code generation already required double checking the output code from the data center models, so as long as the reduction in accuracy is still palatable then the product still works. This is just my intuition, but it does seem to be backed up by the lack of any killer apps that make use of the data center model. My $0.02
I have a few questions for those who have been experimenting with self-hosting their own LLMs.
To set the context (hurr): I am someone who uses LLMs a few times a day. I bounce around between chatgpt.com and ddg.co/chat depending on my mood. I generally use an LLM as a substitute for Google (et al) because web search engines have become borderline useless over the last decade or so due to the natural incentives of an ad-based business model. I find that the LLMs are correct often enough to offset the amount of time I spend chasing a non-existent made-up rabbit hole. I treat them like Wikipedia: good as a starting point, but fatal as a primary source.
But I still don’t know much about a lot of the concepts and terms used in the article. I know that the bigger a model is, the “better” it is. But I don’t know what’s actually inside a model. I only sort of get the concept of context, and have no idea what quantization means outside of the common definition. This is not meant as a critique of the article, just to state my level of knowledge with regard to AI technology. (Very little!)
That said, hypothetically let’s say that the most powerful machine I have on-hand is a four year-old laptop with 6 CPU cores (12 hyperthreads) and 64 GB of RAM and no discrete GPU. It already runs Linux. Is there a way I can just download and run one of these self-hosted LLMs on-demand via docker or inside a VM? If so, which one and where do I get it? And would it be a reasonable substitute for any of the free LLMs that I currently use in a private window without a login? Will it work okay to generate boilerplate or template code for programming/HTML/YAML, or do you need a different model for those?
I have heard that running an LLM on a CPU means the answers take longer to write themselves out. Which is okay, up to a point… waiting up to about a minute or two for a likely correct and useful answer would be workable but anything longer than that would be useless as I will just get impatient and jump to ddg.co/chat.
One way to think of a model is that it’s effectively a big pile of huge floating point matrices (“layers”), and when you run a prompt you are running a huge number of matrix multiplication operations - that’s why GPUs are useful: they’re really fast at running that kind of thing in parallel.
A simplified way to think about quantization is that it’s about dropping the number of decimals in those floating point numbers - it turns out you can still get useful results even if you drop their size quite a bit.
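A toy numpy example (my own, not from the article) shows both points at once: a "layer" is just a big matrix you multiply by, and storing its weights as 8-bit integers with a single scale factor changes the result only slightly:

```python
# Quantize a fake layer to int8 and compare the matrix-vector product.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((4096, 4096)).astype(np.float32)  # one fake layer
x = rng.standard_normal(4096).astype(np.float32)                # fake activations

# Quantize: int8 weights plus one float scale (real schemes use per-block scales).
scale = np.abs(weights).max() / 127.0
w_q = np.round(weights / scale).astype(np.int8)

full = weights @ x
quant = (w_q.astype(np.float32) * scale) @ x

rel_err = np.linalg.norm(full - quant) / np.linalg.norm(full)
print(f"relative error from 8-bit weights: {rel_err:.2%}")  # roughly a percent here
```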
I suggest trying out a model using a llamafile - it’s a deviously clever trick where you download a multi-GB binary file and treat it as an executable - it bundles the model and the software needed to run it (as a web server) and, weirdly, that same binary can run on Windows and Mac and Linux.

I wrote more about these when they first came out last year: https://simonwillison.net/2023/Nov/29/llamafile/

I’d suggest trying one of the llama 3.1 ones from https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-llamafile/tree/main - should work fine on CPU.
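Once a llamafile is running it serves a local web UI, and, as far as I know, recent builds also expose an OpenAI-compatible endpoint on port 8080 by default; assuming that holds (check the llamafile docs), a stdlib-only call looks like this:

```python
# Call the llamafile's bundled local server via its (assumed) OpenAI-compatible
# endpoint on the default port 8080.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "model": "local",  # placeholder; the bundled server generally ignores this
        "messages": [{"role": "user", "content": "Say hello in five words."}],
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])
```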
Is there a way I can just download and run one of these self-hosted LLMs on-demand via docker or inside a VM? If so, which one and where do I get it? And would it be a reasonable substitute for any of the free LLMs that I currently use in a private window without a login?
I’ve used a few projects to run local models. Both of them work on my Ryzen CPU and on my Radeon GPU:
With ollama, there are a few web UIs similar to ChatGPT, but you can also pipe text to it from the CLI. Ollama integrates with editors such as Zed, so you can use a local model for your coding tasks.
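For example, assuming Ollama is listening on its default local port and you have already pulled a model, a short script can hit its HTTP API directly:

```python
# Pipe a prompt to Ollama's local HTTP API (port 11434 by default) from a
# script; the model name assumes you have already pulled llama3.1.
import json
import urllib.request


def ask_ollama(prompt, model="llama3.1"):
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


print(ask_ollama("Summarize what GGUF is in one sentence."))
```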
Most of these models have abliterated or “uncensored” versions, in which refusal is partially fine-tuned out at the cost of some model degradation. Refusals are annoying — such as Gemma refusing to translate texts it dislikes — but they don’t happen often enough for me to make that trade-off.
I thought uncensored models were better than their stock counterparts? I figured that was the entire point of uncensoring, besides the obvious, unmentionable stuff.
That’s covered in the article as well — removing refusal might decrease quality of the results.

Okay but… why?

Because training LLMs is hard, and training/adjusting a model to weed out certain classes of answers doesn’t work perfectly, thus removing some “correct” answers.

At least this seems like a useful explanation; I don’t think a better one exists out there, as is typical for LLM “technical research”.
Lots of helpful “getting started” context here. I started exploring llama.cpp last week for instance and got stuck at the “what model am I supposed to use??” phase. This post has exactly the context I needed.
Does anyone know of a stripped-down tutorial that basically captures what this post is describing but in a clearly delineated step-by-step fashion?
Case in point: Recall how “GGUF” doesn’t have an authoritative definition. Search for one and you’ll find an obvious hallucination that made it all the way into official IBM documentation. I won’t repeat it here as to not make things worse.

Now I’m curious what that hallucination is. I haven’t found it, or I’m not able to identify it. Are the IBM pages fixed? Or is the hallucination not that obvious?

I found it - the IBM page says it’s “GPT-Generated Unified Format”, which is definitely wrong - Google search for “ibm gguf” to see that. It’s GG Unified Format, where GG stands for its author Georgi Gerganov.

Also, that’s a weird spelling mistake; I wonder why that wasn’t picked up by the (presumably) proof-reading LLM.