I’m not an expert in the field, but to me it seems obvious that an LLM needs some sort of DB lookup for established facts in order to produce reliable factual output. Why is this not obvious to experts?

For example, if you ask ChatGPT how to get from Dallas to Paris, it will tell you a lot about how to get to France. It won’t bother to clarify which Paris you actually want to get to; maybe it’s the one 100 miles away. All just because, statistically, Paris, France comes up more often in the training data set.

Why would an LLM behave any differently in science? Pi stands for all sorts of things (nucleotide diversity in genetics, the pion particle in physics, population proportion in statistics, the prime-counting function in maths, to name a few), but most often it’s the ratio of a circle’s circumference to its diameter. Would we expect an LLM to reliably use the pion in a physics context? Would we expect it to always pick correctly between the pion and the circle constant, given that both have a place in the related maths?

Statistically plausible text generation may be good for coming up with technobabble for your next sci-fi, but I don’t see why experts in the AI field thought it might produce good science.

I wonder what I’m missing that made them confident enough to release this Galactica model.
You’re saying this like humans don’t have the exact same issue. Ask me about pi and I’ll tell you the math answer (because it’s statistically likely). Start a conversation about physics and maybe I’ll expect you mean the pion instead. Yet our science does just fine (usually).
You can construct prompts with enough context to tell GPTs what kind of lookup is available and how to use it too. They just can’t guess what you’re thinking about.
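For what it’s worth, a minimal sketch (in Python) of what that kind of prompt construction might look like; the LOOKUP convention, the source names, and the prompt layout here are all made up for illustration and are not any particular product’s API:

```python
# Hypothetical illustration: spell out in the prompt which lookups exist and
# how to request them, instead of hoping the model guesses what you mean.

def build_prompt(question: str) -> str:
    tool_help = (
        "Before answering, you may request a lookup by emitting a line like:\n"
        "  LOOKUP <source> <query>\n"
        "Available sources: wikidata, wolframalpha.\n"
        "If the question is ambiguous (e.g. which 'Paris'), ask the user to\n"
        "clarify instead of guessing.\n"
    )
    return f"{tool_help}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    print(build_prompt("How do I get from Dallas to Paris?"))
```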
Concrete knowledge is the same issue for humans. One obvious difference, though, is that humans are pretty good at knowing when they don’t know something or have low confidence in their knowledge, especially in a scientific/research environment. That’s why every paper has a tonne of citations and why most papers have a whole section restating the previous findings the paper builds on. LLMs, though, are way too happy to make stuff up to be useful for novel research.
Two things are obvious to experts (Experts in what? Ontology?):

Wikipedia, Wikidata, etc. do not “produce reliabl[y] factual output”; they encode many biases of humanity, and even their lists of incorrect beliefs are likely biased.
For example, Bing’s chat product can query Wikipedia, but even if the LLM dutifully summarized Wikipedia, there are biases in both the LLM’s training data and in Wikipedia’s own text.
Well, yes, but LLMs often hallucinate things completely divorced from reality, as opposed to the merely biased datasets of Wikipedia or whatever. The discourse would’ve been very different if LLMs were biased but factually correct at the level of Wikipedia. As of right now we’re very far from that, and yet some people still think Galactica is a good idea.
There is ongoing work to solve that, for instance: https://arxiv.org/abs/2305.03695
It’s not as easy as “just throw a DB at it”. I expect this problem will eventually be solved. Companies like Google or Meta were once careful not to release early, untested models, but the competition from OpenAI changed that. Things are just moving so fast at the moment that we will see issues like this for a while.
Part of the issue is that this is fundamentally impossible: LLMs as an approach cannot do anything even remotely like this. There are many other approaches that either already do something like this or could (with sufficient research) plausibly be made to, but the LLM approach fundamentally cannot. The closest thing to this that I have seen plausibly demonstrated for LLMs is usually euphemistically called ‘fallback’; basically, you either have:
some form of (non-LLM) recognizer that scans the input for things that can actually be solved by some real AI (or just regular comp sci) technique; it replaces the thing with some kind of placeholder token, the external system solves it, and the solution gets substituted in where the LLM has emitted a ‘placeholder answer’ token,

or

you have some (non-LLM) system to detect that ‘the LLM has generated something with problems’, and the prompt + answer (or some subset of it) gets sent to a human in a call center somewhere, who writes the actual response and does some data-entry follow-up.
And neither of these actually connects the LLM to a DB or gives it the ability to look up facts; they are both ways to hide the fact that the LLM can’t actually do any of the things that are advertised, by using non-LLM systems as an ad-hoc patch over the top.
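A toy sketch of the first of those two patterns, as described above; `call_llm` is a hypothetical stand-in for the model, and the ‘recognizer’ here only knows about simple arithmetic:

```python
import re

# The non-LLM recognizer spots a span a conventional solver can handle
# (here: a single arithmetic expression), masks it with a placeholder,
# and the solver's result is substituted into whatever the model emits.

ARITHMETIC = re.compile(r"\d+\s*[-+*/]\s*\d+")


def call_llm(prompt: str) -> str:
    # Hypothetical model call; pretend it dutifully echoes the placeholder.
    return "The answer to your question is [PLACEHOLDER]."


def answer(user_input: str) -> str:
    match = ARITHMETIC.search(user_input)
    if not match:
        return call_llm(user_input)
    solved = str(eval(match.group(0)))  # the external, non-LLM solver
    masked = user_input.replace(match.group(0), "[PLACEHOLDER]")
    return call_llm(masked).replace("[PLACEHOLDER]", solved)


print(answer("What is 12 * 34?"))  # -> The answer to your question is 408.
```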
The correct way to proceed is to mine LLM development for the one particularly interesting bit of math it produced, which could be of a lot of use to statistics if it actually turns out to be mathematically valid (basically a way to make high-dimensional paths with roughly arbitrary waypoints that are differentiable, and/or a way to do sampling plus something like correlation very nonlocally, depending on how you look at it), then throw the rest away as dangerous but also useless garbage, and pour the billions of dollars that have been mustered into any of the many actual areas of AI that are real and useful.
What are you referring to when you say “this”? What is fundamentally impossible for LLMs?
Edit: apologies for lack of clarity in my initial reply to you
I wonder why it’s fundamentally impossible. At least on the surface it appears LLMs are capable of some form of reasoning, so why can’t they know they’re making an inference and need to look stuff up?
LLMs do not reason, and they do not ‘know’. You are misunderstanding what the technology is and how it works (unfortunately aided in your misunderstanding by the way the major promoters of the technology consistently lie about what it is and how it works). They are a technology that produces output with various different kinds of ‘vibe matching’ to their input, and where ‘fallback’ (a fundamentally non-LLM technology) is used in an ad-hoc way to patch over a lot of the most obvious nonsense that this approach produces.
Edit: That is, LLMs are fundamentally a different approach to the general problem of ‘knowledge stuff’ or ‘machine intelligence’: instead of doing all the difficult work around knowledge and belief and semantics and interiority, complicated tensorial or braided linear logic, solvers and query planners, knowledge-and-state representation and transfer to connect all of these different things, plus a whole bunch of difficult metacognition stuff, etc., all of which would mean that connecting to a knowledge DB is something that could actually work, you just… don’t do any of that difficult stuff, and then lie and say you did.
I don’t disagree with your assessment, but I wonder if there is a way of tweaking LLMs to do this without altering their fundamental architecture. If you added a load of ‘don’t know’ tokens to the training data, I would expect their predictions to be full of these in any output where the other things in the input did not provide a stronger signal. It wouldn’t be completely reliable, but you could probably count the proportion of ‘don’t know’ tokens that end up interleaved with your output to get a signal of how accurate the response is, then train another model to generate searches based on the other tokens, and have the tandem system loop, feeding lookup results from the second model’s output back into the first until it drops below a threshold ratio of ‘don’t know’ to other tokens.
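A rough sketch of how that tandem loop could be wired up, just to make the idea concrete; `generate`, `search_model`, and `lookup` are hypothetical callables, and `<dk>` is the assumed ‘don’t know’ token:

```python
# Hypothetical sketch: keep feeding lookup results back into the context
# until the proportion of 'don't know' tokens in the model's output drops
# below a threshold (or we give up).

DK = "<dk>"          # assumed 'don't know' token added at training time
THRESHOLD = 0.05     # acceptable ratio of <dk> to other tokens
MAX_ROUNDS = 3


def dk_ratio(tokens: list[str]) -> float:
    return tokens.count(DK) / max(len(tokens), 1)


def answer_with_lookups(prompt, generate, search_model, lookup):
    context = prompt
    tokens = generate(context)
    for _ in range(MAX_ROUNDS):
        if dk_ratio(tokens) <= THRESHOLD:
            break
        # The second model proposes searches from the non-<dk> tokens;
        # results are appended to the context and the first model retries.
        for query in search_model([t for t in tokens if t != DK]):
            context += "\n[lookup] " + lookup(query)
        tokens = generate(context)
    return " ".join(t for t in tokens if t != DK)
```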
Do you have a recommendation for a shortish explainer on LLMs that can clear up this apparent reasoning ability (and other wonderful capabilities) for me? Or materials on what the technology actually does and why it might look like it has some capabilities that it actually doesn’t have?
Part way through a long answer and there was a power cut. Bleh. So, a short answer it will have to be.
Basically, there are no good short explainers out there that I have been able to find. The short explainers are basically all lies, and the good stuff is all zoomed-in critique that assumes a lot of background and only covers the specific bit under examination.

I am actually partway through writing my own paper about why programmers seem to get taken in by these things more than they ‘should’, but it is a real slog and I’m not sure I’ll be able to get it into a form where it is actually useful; I’ve had to start by working on a phenomenology of programming (which is itself a large and thankless undertaking) that satisfies me, just to have the tools to talk about ‘why it looks like it has capabilities’.
The two papers I’d suggest looking at to get a bit of a starting point on the deceptiveness of the ‘LLM capabilities’ propaganda are “Language modelling’s generative model: is it rational?” by Karen Spärck Jones and “Are Emergent Abilities of Large Language Models a Mirage?” by Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo.
Thank you.
Potentially misguided advice: maybe aiming for something less formal than a paper would make it less of a slog.
Alas, the issue is that I need to establish a bunch of pretty foundational stuff about what programming is to a programmer, what the nature of the current technology stack is, and how we think and talk about that stack (the “‘secretly two layer’ nature of contention about whether some technology ‘works’” thing) before I can use that to show how it opens programmers up to being mindeaten.
But also, the paper “AI and the Everything in the Whole Wide World Benchmark” by Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna, has some good stuff about ‘construct validity’ that lays out some of the formal case behind why I say that LLM boosters are just lying when they make claims about how LLMs can do all kinds of things.
OpenAI spent six months testing GPT-4 before releasing it… There might be a hint there of what should and shouldn’t be done…
They trained it on the internet and were surprised it “mindlessly spat out biased and incorrect nonsense.” There might be a hint there of what should and shouldn’t be done… 😉
I think it is obvious to most people trying to build applications out of LLMs, but it seems some people, like researchers, have a harder time with this. Most practitioners are using models to produce embeddings, which are used in conjunction with vector databases to find similar information. It works fairly well.
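For anyone unfamiliar with that pattern, a bare-bones sketch of the retrieval step; `embed` is a hypothetical stand-in for whatever embedding model is used, and a real system would swap the brute-force cosine search for a vector database:

```python
import numpy as np

# Embed the documents once, embed the query, return the k most similar
# passages by cosine similarity; those passages then get stuffed into the
# prompt alongside the user's question.


def cosine_top_k(query_vec, doc_vecs, k=3):
    docs = np.asarray(doc_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)[:k]


def retrieve(query, documents, embed, k=3):
    doc_vecs = [embed(d) for d in documents]  # usually precomputed and stored
    top = cosine_top_k(embed(query), doc_vecs, k)
    return [documents[i] for i in top]
```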
https://www.wikidata.org/ maybe?
Also see http://lod-a-lot.lod.labs.vu.nl/
There are plenty of options and it doesn’t take much effort to find them. The issue is that LLMs don’t use any of them.
> an LLM needs some sort of DB lookup for established facts in order to produce reliable factual output.

Wolfram Alpha and Wikidata have been earlier attempts at making such DBs, both done the hard way. Maybe the next killer application will be an LLM instruction-trained to use them?
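The lookup half is already easy to script; the hard part would be training a model to emit queries like this reliably. A small sketch against Wikidata’s public SPARQL endpoint (the query asks for the capital of France, item Q142 via property P36):

```python
import requests

# Query Wikidata's public SPARQL endpoint directly. The structured data is
# there today; what's missing is a model that reliably decides when and how
# to ask for it.

SPARQL = """
SELECT ?capitalLabel WHERE {
  wd:Q142 wdt:P36 ?capital .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def wikidata_lookup(query: str) -> list[str]:
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "llm-lookup-sketch/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    return [row["capitalLabel"]["value"]
            for row in resp.json()["results"]["bindings"]]


if __name__ == "__main__":
    print(wikidata_lookup(SPARQL))  # expected: ['Paris']
```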
I wish. It seems to me that currently LLMs don’t use either, because they have a very small input buffer (i.e. Wikipedia doesn’t fit into it) or don’t do multi-step inference (they can’t look up missing data and put it into context for another try).

Things like AutoGPT might be a viable approach even with a smaller context if they didn’t try to pursue the task from the get-go and instead did a few rounds of “what do I need to know before I can give an answer?” before replying to the initial prompt.

But there was that paper that promised very large/virtually unlimited inputs, so maybe that one’s going to work. I’m sceptical, though, because it would probably take a lot of time for a GPT-4-sized LLM to chew through the whole of Wikipedia on every prompt.
ChatGPT can do this: it will perform searches online to find information to help it answer prompts. It requires the beta, though.

This is an insane assertion, considering that no LLM that I have seen can reliably provide the provenance of its output.