While this is true to some extent, it’s important to note that when you type on a keyboard, you are also essentially predicting and adding the next word.
I’m genuinely offended by this. I try to synthesise something closest to my desired result using words in order. The only “predicting” is that the next word which comes to mind is most useful for that synthesis.
It’s fascinating that so many of the folks who get into these models seem to think that this is how language works, as if language were just a statistical pattern isolated from behavior and environment and our other mental faculties.
It’s the result of combining the positivist, behaviourist approach of “we’re all biological deterministic computing machines” (which is a fair-enough premise for a model of human behaviour) with the classic tech-bro attitude of not having understood the problem domain but posturing just enough to convince people you know what you’re talking about.
Actually, no. I wrote it the way I did for a specific reason: it was designed as a jab at people who don’t realize that the low-level behavior of predicting the next token doesn’t mean there are no higher-level abstractions or processes behind it. That attitude is quite common, especially among people who know a bit about Markov chains and other similar outdated techniques for generating text (“it just completes text it saw somewhere” is a response I’ve gotten many times while discussing LLMs). My hope was that it would catch their attention and make them think about the next part of the blog instead of just dismissing it.
Only a few folks appreciate that predicting the next word accurately requires a healthy measure of ‘thinking ahead’.
Sure, it comes out one word at a time; that’s probably why it can accurately model real writing, where the cursor is at the end and doesn’t jump around everywhere (at least on a first draft, and a first draft by an excellent draftsman is the only draft).
Don’t let them get to you.
Is an LLM a Markov chain or not?
edit: I believe that it is a Markov chain, technically.
Depends on the architecture. GPTs might not count, but RWKV does. You’re right to check for the Markov property directly.
Just thinking about a basic decoder-only autoregressive LLM. I’m certain that it is Markov, but googling around, people seem to strongly believe that it is not, based on personal desires. I do not understand why one would be emotionally attached to LLMs not being Markov.
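For what it’s worth, the claim holds if you take the “state” to be the entire (truncated) context window rather than a single token. A minimal sketch of that framing in Python, where `next_token_distribution` is a hypothetical stand-in for a real decoder-only model, not any actual API:

```python
import random

# Illustrative sketch only (not any real model's API): next_token_distribution
# is a hypothetical stand-in for a decoder-only LLM that maps a context
# (a tuple of token ids) to a probability distribution over the vocabulary.
def next_token_distribution(context: tuple) -> dict:
    vocab = [0, 1, 2, 3]
    return {tok: 1.0 / len(vocab) for tok in vocab}  # toy: uniform, ignores context

def step(state: tuple, window: int = 8) -> tuple:
    """One transition: the next state depends only on the current state
    (the last `window` tokens), which is exactly the Markov property."""
    probs = next_token_distribution(state)
    tokens, weights = zip(*probs.items())
    nxt = random.choices(tokens, weights=weights, k=1)[0]
    return (state + (nxt,))[-window:]  # append the sampled token, slide the window

state = (0,)
for _ in range(5):
    state = step(state)
print(state)
```

The state space is astronomically large (every possible window of tokens), so it is a Markov chain only in the technical sense, not in the bigram-lookup-table sense people usually picture.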
Well, on the token level, I think you could argue that it has some similarities in how the temperature works, but apart from that, I wouldn’t say so.
Absolutely not, because they have a context window and because the user inserts steps. What they generate depends on your prompts and what they previously generated in that session.
You’re telling me that an LLM is not a Markov Chain (definition 1.2)? Why is it not?
Because the indices of its ‘transition matrix’ don’t deterministically correspond to states in its (reduced) state space. Even at a temperature of 0 it won’t always generate the same completion twice.
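For readers following along, the property under debate (presumably what “definition 1.2” refers to; the standard discrete-time statement is assumed here) is:

```latex
% First-order (discrete-time) Markov property:
% the next state depends on the present state only, not on the earlier history.
P(X_{t+1} = x \mid X_t = x_t, X_{t-1} = x_{t-1}, \ldots, X_0 = x_0)
  = P(X_{t+1} = x \mid X_t = x_t)
```

Whether an LLM satisfies it hinges on what you call the state: if $X_t$ is the full truncated context window, the next-token distribution is a fixed function of it; if $X_t$ is just the last token, it clearly is not. The temperature-0 non-determinism mentioned above is usually attributed to floating-point and batching effects in the serving stack rather than to the model’s mathematical definition.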
There’s plenty of that at all levels of academia and industry; let’s put away the tired old meme of the tech bro.
I would posit the differentiating factor is tech bros move fast to make money and “hustle”, for better or for worse. Off-topic though.
From the human perspective, what else is there? From the memetic perspective, of course, memes are evolving in a way akin to the RNA world (a selfish meme theory), but I’m not sure if there’s more to it than statistical mechanics and biological evolution.
How does language actually work?
Is this a rhetorical question?
Edit to clarify, I am genuinely curious.
I’m not an expert in LLMs, but having read what I’ve read about them, the conclusion I’ve drawn is that “statistically predicting the next word” is a good description of their training process, but it’s not actually a good description of how they work. We don’t really understand how they work on a very deep level. From probing and whatnot, it seems like they’re building world models in there, but who knows. It’s like saying human brains are just “trying to reproduce their genes.” That is how the brain evolved, but it’s not what brains do.
Is it factual? If not, is there something factually incorrect about this description?
How they work seems to involve some kind of world model having been encoded into the system by the training: https://thegradient.pub/othello/.
I’m not an expert either, but it doesn’t seem to me your link contradicts the “they just predict the next word” assertion. It might be that the best way to predict the next word involves learning a superficial world model during training, but the way the text is constructed is still by picking the most likely word.
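As a concrete picture of “picking the most likely word”: decoding is typically a loop over a next-token distribution, with temperature controlling how greedy the pick is. A hedged sketch, where `model_logits` is a hypothetical stand-in for a real model’s forward pass:

```python
import math
import random

def model_logits(context: list) -> list:
    # Hypothetical stand-in for a real forward pass: returns one score
    # (logit) per vocabulary entry, given the context so far.
    return [0.1, 2.0, -1.0, 0.5]

def sample_next(context: list, temperature: float = 1.0) -> int:
    logits = model_logits(context)
    if temperature == 0.0:
        # Greedy decoding: literally "pick the most likely word".
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise sharpen or soften the distribution and sample from it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exp = [math.exp(l - m) for l in scaled]  # numerically stable softmax
    total = sum(exp)
    probs = [e / total for e in exp]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

context = [0]
for _ in range(5):
    context.append(sample_next(context, temperature=0.8))
print(context)
```

At temperature 0 this reduces to always taking the argmax; higher temperatures spread the probability mass and make the output more varied.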
This pretty much reflects my personal experience. I offload most of the process of typing to my autonomous nervous system, which (for most words) handles the spelling and predicts the next word. If my attention moves to something else, I don’t immediately stop typing but I do stop making sense after a few words. I can’t imagine how slow and painful it would be if I had to use the bit of my brain that I use for conscious thought for this.
I was recommended to use micromamba instead of Miniconda3.sh and would continue on that recommendation.
I’ll look into that, thanks.
but why?
Why not?