I’m not an expert in the field, but to me it seems obvious that an LLM needs some sort of DB lookup for established facts in order to produce reliable factual output. Why is this not obvious to experts?
For example, if you ask ChatGPT how to get from Dallas to Paris it will tell you a lot about how to get to France. It wouldn’t care to clarify which Paris you actually want to get to. Maybe it’s the one 100 miles away. All just because statistically Paris, France comes up more often in the training data set.
Why would an LLM show any different behaviour in science? Pi stands for all sorts of things (nucleotide diversity in chemistry, the pion particle in physics, population proportion in statistics, the prime counting function in maths, to name a few) but most often it’s the circle circumference-to-diameter ratio constant. Would we expect an LLM to reliably use the pion in the physics context? Would we expect an LLM to always properly pick either the pion or the ratio constant given that both have a place in the related math?
Statistically plausible text generation is maybe good for coming up with technobabble for your next sci-fi, but I don’t see why experts in the AI field thought it might produce good science.
I wonder what I’m missing that made them confident enough to release this Galactica model.
You’re saying this like humans don’t have the exact same issue. Ask me about pi and I’ll tell you the math answer (because it’s statistically likely). Start a conversation about physics and maybe I’ll expect you to mean the pion instead. Yet our science does just fine (usually).
You can construct prompts with enough context to tell GPTs what kind of lookup is available and how to use it too. They just can’t guess what you’re thinking about.
Concrete knowledge is the same issue for humans. One obvious difference, though, is that humans are pretty good at knowing when they don’t know something or have low confidence in their knowledge, especially in a scientific/research environment. That’s why every paper has a tonne of citations and why most papers have a whole section restating the previous findings that the paper builds on. LLMs, though, are way too happy to make stuff up to be useful for doing novel research.
Two things are obvious to experts (Experts in what? Ontology?): Wikipedia, Wikidata, etc. do not “produce reliabl[y] factual output”; they encode many biases of humanity, and even their lists of incorrect beliefs are likely biased.
For example, Bing’s chat product can query Wikipedia, but even if the LLM dutifully summarized Wikipedia, there are biases in both the LLM’s training data and in Wikipedia’s own text.
Well, yes, but LLMs often hallucinate things completely divorced from reality, as opposed to merely reflecting the biases of datasets like Wikipedia. The discourse would’ve been very different if LLMs were biased but factually correct at the level of Wikipedia. As of right now we’re very far from that, and yet some people still think Galactica is a good idea.
There is ongoing work to solve that, for instance: https://arxiv.org/abs/2305.03695
It’s not as easy as “just throw a DB at it”. I expect this problem will eventually be solved. Companies like Google or Meta were once careful not to release early, untested models but the competition from OpenAI changed that. Things are just going so fast currently that we will see issues like this for a while.
Part of the issue is that this is fundamentally impossible. LLMs as an approach cannot do anything even remotely like this. There are many other approaches that either already do do something like this or could (with sufficient research) plausibly be made to do so, but just fundamentally the LLM approach cannot ever do anything like this. The closest thing to this that I have seen plausibly demonstrated for LLMs is usually euphemistically called ‘fallback’ - basically, you either have:
some form of (non-LLM) recognizer that scans the input for things that can actually be solved by some real AI (or just regular Comp Sci) technique; it replaces the thing with some kind of placeholder token, the external system solves the thing, and then the solution gets substituted in where the LLM has emitted a ‘placeholder answer’ token,
or
you have some (non-LLM) system to detect ‘the LLM has generated something with problems’, and the prompt + answer (or some subset of it) gets sent to a human in a call center somewhere, who writes the actual response and does some data-entry follow-up.
And neither of these is actually connecting the LLM to a db or giving it the ability to look up facts - they are both ways to hide the fact that the LLM can’t actually do any of the things that are advertised by using non-LLM systems as a patch over the top in an ad-hoc way.
The correct way to proceed is to mine the LLM development for the one particular interesting bit of math that they developed that could be of a lot of use to statistics if it actually turns out to be mathematically valid (basically a way to make high dimensional paths with ~ arbitrary waypoints that are differentiable and/or a way to do sampling+something like correlation very nonlocally, depending on how you look at it) and then throw the rest away as dangerous but also useless garbage, and pour the billions of dollars that have been mustered into any of the many actual areas of AI that are real and useful.
What are you referring to when you say “this”? What is fundamentally impossible for LLMs?
Edit: apologies for lack of clarity in my initial reply to you
I wonder why it’s fundamentally impossible. At least on the surface it appears LLMs are capable of some form of reasoning, so why can’t they know they’re making an inference and need to look stuff up?
LLMs do not reason, and they do not ‘know’. You are misunderstanding what the technology is and how it works (unfortunately aided in your misunderstanding by the way the major promoters of the technology consistently lie about what it is and how it works). They are a technology that produces output with various different kinds of ‘vibe matching’ to their input, and where ‘fallback’ (a fundamentally non-LLM technology) is used in an ad-hoc way to patch over a lot of the most obvious nonsense that this approach produces.
Edit: That is, LLMs are fundamentally a different approach to the general problem of ‘knowledge stuff’ or ‘machine intelligence’ where, instead of doing all the difficult stuff around knowledge and belief and semantics and interiority and complicated tensorial or braided linear logic and solvers and query planners and knowledge+state representation and transfer to connect all of these different things plus a whole bunch of difficult metacognition stuff and etc etc that would mean that connecting to a knowledge db is something that could actually work, you just… don’t do any of that difficult stuff and then lie and say you did.
I don’t disagree with your assessment, but I wonder if there is a way of tweaking LLMs to do this without altering their fundamental architecture. If you add a load of ‘don’t know’ tokens to the training data, then I would expect their predictions to be full of these in any output where the other things in the input set did not provide a stronger signal. It wouldn’t be completely reliable, but you could probably count the proportion of ‘don’t know’ tokens that end up interleaved with your output to get a signal of how accurate the response is, then train another model to generate searches based on the other tokens, and have the tandem system loop, feeding more lookup results from the second model’s output into the first until it is below a threshold ratio of ‘don’t know’ to other tokens.
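A minimal sketch of the loop proposed above, with the uncertainty-aware generator, the search generator, and the lookup all stubbed out (generate_with_uncertainty, generate_search and lookup are invented placeholders, not any real API):

    #include <iostream>
    #include <string>

    struct Generation {
        std::string text;
        int dont_know_tokens = 0;   // count of 'don't know' tokens in the output
        int total_tokens = 1;       // avoid division by zero in this sketch
    };

    // Hypothetical stand-ins for the two models and the external lookup.
    Generation generate_with_uncertainty(const std::string& prompt) { return {"stub answer", 1, 10}; }
    std::string generate_search(const std::string& prompt) { return "stub search query"; }
    std::string lookup(const std::string& query) { return "stub lookup result"; }

    std::string answer(std::string prompt, double threshold = 0.05, int max_rounds = 5) {
        for (int round = 0; round < max_rounds; ++round) {
            Generation g = generate_with_uncertainty(prompt);
            double ratio = double(g.dont_know_tokens) / g.total_tokens;
            if (ratio < threshold) return g.text;   // confident enough: stop looping
            // Otherwise ask the second model what to look up and stuff the
            // result back into the prompt for the next round.
            prompt += "\nContext: " + lookup(generate_search(prompt));
        }
        return "I don't know";                      // give up after max_rounds
    }

    int main() { std::cout << answer("How do I get from Dallas to Paris?") << "\n"; }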
So, there are a couple of problems with the LLM architecture that make these kinds of schemes (ways to LLM-ly trigger lookups and incorporate the results) not work, which is why basically everyone uses ‘fallback’ as I described above.
For ‘triggering lookups LLM-ly’, your problem has the same issues that make it impossible to have the LLM recognise when it ‘does not know something’ and respond appropriately with ‘I don’t know’ - you can use stuff you put in the dataset plus initial prompt stuff to bluntly close certain things off, and you can have ‘fallback’ mechanisms that recognise ‘possibly problematic’ things in the question or the answer and use some non-LLM system to answer, but the whole point of LLMs is not doing the difficult work you would need to do to make this even a thing that it is possible to think about how to do LLM-ly - that is, the point of the ‘make a differentiable million+ dimensional curve from some points’ thing is that ~ all of the measure of the curve is ‘not supported’, so if you take that away you end up with the degenerate case where the curve is only defined at the points, so it isn’t differentiable, and you can’t do any of the interesting sampling over it, and the whole thing becomes a not very good document retrieval system.
For ‘incorporating lookup results’, there is no mechanism in the LLM architecture that would allow this. The closest two are ‘adding the information to the corpus somehow and retraining’ and ‘prompt stuffing’, and of those the latter is basically already used for some forms of ‘fallback’ but it falls short of actually working the way you indicate.
Finally, when I talk with people about this I do see a frustrating amount of ‘just train another model to generate searches based on the other tokens’ and then the idea that the results can somehow just go into the LLM LLMly, and like, the closest thing to this that could possibly work is fallback as I described above, which has irreducible issues, and also the ‘just train a model to x’ bit is also doing a lot of work there.
Do you have a recommendation for a shortish explainer on LLMs that can clear up this apparent reasoning ability (and other wonderful capabilities) for me? Or materials on what it actually does and why it might look like it has some capabilities that it actually doesn’t have?
Part way through a long answer and there was a power cut. Bleh. So, a short answer it will have to be.
Basically, there are no good short explainers out there that I have been able to find. The short explainers are basically all lies, and the good stuff is all zoomed in critique that assumes a lot of background and only covers the specific bit under examination.
I am actually part way through writing my own paper about why programmers seem to get taken in by these things more than they ‘should’, but it is a real slog and I’m not sure I’ll be able to get it into a form where it is actually useful - I’ve had to start by working on a phenomenology of programming (which is itself a large and thankless undertaking) that satisfies me in order to even have the tools to talk about ‘why it looks like it has capabilities’.
The two papers I’d suggest looking at to get a bit of a starting point on the deceptiveness of the ‘LLM capabilities’ propaganda are “Language modelling’s generative model: is it rational?” by Karen Spärck Jones and “Are Emergent Abilities of Large Language Models a Mirage?” by Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo.
Thank you.
Potentially misguided advice: maybe aiming for something less formal than a paper would make it less of a slog.
Alas, the issue is that I need to establish a bunch of pretty foundational stuff about what programming is to a programmer, what is the nature of the current technology stack, and a bunch of stuff around how we think about and talk about the current technology stack (the “‘secretly two layer’ nature of contention about whether some technology ‘works’” thing) before I can use that to show how it opens programmers up to being mindeaten.
But also, the paper “AI and the Everything in the Whole Wide World Benchmark” by Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna, has some good stuff about ‘construct validity’ that lays out some of the formal case behind why I say that LLM boosters are just lying when they make claims about how LLMs can do all kinds of things.
OpenAI spent 6 months testing GPT-4 before releasing it… There might be a hint there of what should and shouldn’t be done…
They trained it on the internet and were surprised it “mindlessly spat out biased and incorrect nonsense.” There might be a hint there of what should and shouldn’t be done… 😉
I think it is obvious to most people trying to build applications out of LLMs, but it seems some people, like researchers, have a harder time with this. Most practitioners are using models to produce embeddings, which are used in conjunction with vector databases to find similar information. It works fairly well.
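The retrieval half of that setup can be sketched in a few lines. This is only a toy illustration: embed() below is a made-up stand-in for a real embedding model, and a brute-force cosine-similarity scan stands in for an actual vector database with an approximate-nearest-neighbour index.

    #include <cmath>
    #include <iostream>
    #include <string>
    #include <vector>

    using Vec = std::vector<float>;

    // Hypothetical embedding function; a real system would call an embedding model here.
    Vec embed(const std::string& text) {
        Vec v(8, 0.0f);
        for (size_t i = 0; i < text.size(); ++i) v[i % 8] += float(text[i]);
        return v;
    }

    float cosine(const Vec& a, const Vec& b) {
        float dot = 0, na = 0, nb = 0;
        for (size_t i = 0; i < a.size(); ++i) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-9f);
    }

    struct Doc { std::string text; Vec embedding; };

    int main() {
        std::vector<Doc> store;
        for (std::string t : {"Paris, France is the capital of France.",
                              "Paris, Texas is about 100 miles from Dallas.",
                              "Pi is the ratio of circumference to diameter."})
            store.push_back({t, embed(t)});

        Vec query = embed("How do I get from Dallas to Paris?");
        const Doc* best = nullptr;
        float best_score = -1.0f;
        for (const Doc& d : store) {                 // brute-force nearest neighbour
            float s = cosine(query, d.embedding);
            if (s > best_score) { best_score = s; best = &d; }
        }
        std::cout << "Most similar stored text: " << best->text << "\n";
    }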
https://www.wikidata.org/ maybe?
Also see http://lod-a-lot.lod.labs.vu.nl/
There are plenty of options, and it doesn’t take much effort to find them. The issue is that LLMs don’t use any of them.
Wolfram Alpha and Wikidata have been earlier attempts at making such DBs, both done the hard way. Maybe the next killer application will be an LLM instruction-trained to use them?
I wish. It seems to me that currently LLMs don’t use either, because they have a very small input buffer (i.e. Wikipedia doesn’t fit into it) or don’t do multi-step inference (can’t look up missing data and put it into the context for another try).
Things like AutoGPT might be a viable approach even with smaller context if they didn’t try to pursue the task from the get-go and instead did a few rounds of “what do I need to know before I can give an answer?” before replying to the initial prompt.
But there was that paper that promised very large/virtually unlimited inputs so maybe that one’s going to work. I’m sceptical though because it probably would take a lot of time for a GPT-4-sized LLM to chew through the whole Wikipedia on every prompt.
ChatGPT can do this - it will perform searches online to find information to help it answer prompts. It requires beta though.
So what is this mysterious link between Lisp and OOP?
I was really hoping that this would talk about meta-object protocols, COLAs, and the equivalence of lambdas and objects.
The C++ code with an explicit release made me cringe. Why not just use std::shared_ptr, since you’re implementing reference counting? Ideally, your Element would be a smart pointer that either contained an inline integer value or a pointer to some real refcounted value. This is a C++ model of how most real Lisp implementations work.
Of course… I could have. But I did not. I did some experiments when I first discovered all these new structures such as unique_ptr or shared_ptr, and I must say I was not very impressed by their implementation efficiency.
Second, shared_ptr is quite heavy to use and, as is the case for many of these modern structures, pretty horrible to read…
std::shared_ptr<Element> p = std::make_shared<Element>();
Finally, release can be used in many ways:
- It can delete an object
- It can push the object back into a pool of reusable objects
- It does nothing on constant values
It has the advantage of readability and simplicity, which is my main objective here. Using shared_ptr would completely muddy most of my code.
You don’t need to use the built in shared pointer type, which is generic and so not optimised for various cases, to write idiomatic and clean C++. You write a LispValue class that is a tagged union of either a pointer to some rich structure or an embedded value. Most Lisp implementations use tagging on the low bit, with modern CPUs and addressing modes it’s useful to use 0 to indicate an integer and 1 to indicate a pointer because the subtract 1 can be folded into other immediate offsets for field or vtable addressing. If the value is a pointer, the pointee exposes non-virtual refcount manipulation operations in the base class. Your value type calls these in its copy constructor and copy assignment operator. For move assignment or move construction, your value type simply transfers ownership and zeroes out the original. In its destructor, you drop the ref count. When the refcount reaches zero, it calls a virtual function which can call the destructor, put the object back into a pool, or do anything else.
Implemented like this, you have no virtual calls for refcount manipulation, and so all of your refcount operations are inlined. All refcount operations are in a single place, so you can switch between atomic and non-atomic ops depending on whether you want to support multiple threads. More importantly, it is now impossible by construction to have mismatched refcount operations. Any copy of the value object implicitly increments the refcount of the object that it refers to; any time one is dropped, it does the converse. Now you can store LispValues in any C++ collection and have memory management just work.
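A condensed sketch of that scheme; the names (RefCounted, LispValue, Cons) are illustrative rather than taken from LispE, and the pointer tag is the low bit, with heap objects assumed to be at least 2-byte aligned.

    #include <cstdint>
    #include <iostream>
    #include <utility>
    #include <vector>

    // Heap objects expose non-virtual refcount manipulation in the base class;
    // only the "refcount hit zero" policy is virtual.
    struct RefCounted {
        int refcount = 0;
        void retain() { ++refcount; }                  // non-virtual, inlinable
        void release() { if (--refcount == 0) onZero(); }
        virtual void onZero() { delete this; }         // could instead return to a pool
        virtual ~RefCounted() = default;
    };

    struct Cons : RefCounted {                         // an example heap value
        int64_t car, cdr;
        Cons(int64_t a, int64_t d) : car(a), cdr(d) {}
    };

    // Tagged value: low bit 0 => small integer (stored shifted), low bit 1 => pointer.
    class LispValue {
        uintptr_t bits;
        bool isPointer() const { return bits & 1; }
        RefCounted* ptr() const { return reinterpret_cast<RefCounted*>(bits - 1); }
    public:
        explicit LispValue(int64_t n) : bits(uintptr_t(n) << 1) {}
        explicit LispValue(RefCounted* p) : bits(reinterpret_cast<uintptr_t>(p) + 1) { p->retain(); }
        LispValue(const LispValue& o) : bits(o.bits) { if (isPointer()) ptr()->retain(); }
        LispValue(LispValue&& o) noexcept : bits(o.bits) { o.bits = 0; }              // 0 is the integer 0
        LispValue& operator=(LispValue o) { std::swap(bits, o.bits); return *this; }  // copy-and-swap
        ~LispValue() { if (isPointer()) ptr()->release(); }
        int64_t asInt() const { return int64_t(bits) >> 1; }
    };

    int main() {
        std::vector<LispValue> stack;                  // refcounts handled by construction
        stack.push_back(LispValue(21));
        stack.push_back(LispValue(new Cons(1, 2)));    // refcount becomes 1
        stack.pop_back();                              // Cons released and deleted here
        std::cout << stack.back().asInt() << "\n";     // prints 21
    }

The virtual onZero hook is where the ‘delete it, return it to a pool, or do nothing for constants’ policy from the earlier comment would plug in.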
I don’t use C++ collections. For instance, my List implementation is based on a template of my own, which is designed to make CDR as fast as possible. Besides, if you think your solution would work better, you can implement it yourself and see if it brings any benefits compared to my own implementation. The code of LispE compiles on all platforms; you might be able to tweak the Element class itself, which contains the reference counter that is shared by all classes.
What are “COLAs”? Internet searches for various flavors of “lisp cola” and “plt cola” yielded nothing.
Combined Object-Lambda Architectures. They were fashionable for a while about 20 years ago as VMs built around the equivalence of object models and lambdas, as a way of building different object models (class vs prototype, single vs multiple inheritance, and so on) on the same core substrate. Most of what they could do is also expressible in CLOS.
As far as I know they’re part of the “Reinventing Computing” research from VPRI. Ian Piumarta is one of the researchers on this and has a series of languages (Coke, Jolt, Pepsi) that relate to it.
Good article! But with the caveat that tree-walking interpreters are the slowest type, so most “real” (practical) language implementations move to a bytecode representation, or else compile to native code or transpile to a faster language.
(I guess pure LISP interpreters are by necessity tree-walking. Or are there any that compile to bytecode?)
Common LISP actually compiles to a binary.
Compiling to bytecode does not ensure a faster interpreter, as demonstrated by Python. :-)
What makes other approaches faster is very often the fact that some very efficient heuristics will detect recurrent patterns in the tree that can be collapsed into faster routines. Whatever your compiler, you still have to run your execution tree one way or another.
Common Lisp is a language with several different implementations. Some of them compile to native code binaries and some don’t.
SBCL can produce native code?
yes, it does compile to native code
CLISP compiles to bytecode.
As a simple example, consider an expression in tree form like 1+2. It’s easy to move to a linearized form suitable for a stack machine. You can write it in postfix notation as 1 2 +: push 1, push 2, add.
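A toy version of that linearization, with made-up opcode names: the tree for 1+2 becomes the sequence PUSH 1, PUSH 2, ADD, which a trivial loop executes.

    #include <iostream>
    #include <vector>

    enum class Op { Push, Add };
    struct Instr { Op op; int value = 0; };

    int run(const std::vector<Instr>& code) {
        std::vector<int> stack;
        for (const Instr& i : code) {
            switch (i.op) {
                case Op::Push: stack.push_back(i.value); break;
                case Op::Add: {                        // pop two operands, push their sum
                    int b = stack.back(); stack.pop_back();
                    int a = stack.back(); stack.pop_back();
                    stack.push_back(a + b);
                    break;
                }
            }
        }
        return stack.back();
    }

    int main() {
        // The tree 1+2 linearized to postfix: 1 2 +
        std::vector<Instr> code = {{Op::Push, 1}, {Op::Push, 2}, {Op::Add}};
        std::cout << run(code) << "\n";                // prints 3
    }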
I don’t know what “pure” means here. Here is eval in a tiny interpreter/compiler: https://github.com/udem-dlteam/ribbit/blob/d0082a08df12efca921d37daf04d64fb4ad63b78/src/lib/max-tc.scm#L1078-L1079
The paper: http://www.iro.umontreal.ca/~feeley/papers/YvonFeeleyVMIL21.pdf
And previous discussions: https://lobste.rs/s/8vtuhs/small_portable_scheme_implementation
reminds me of https://nas.sr/%D9%82%D9%84%D8%A8/
Actually, someone published a similar link above (see technomancy).
This is why https://hedycode.com/ supports many different languages, so that students can spend their time focusing on learning programming ideas and not learning English.
???
I’m wondering if that’s a typo. Felienne spoke at Strange Loop 2022 and demoed Hedy running in many non-English languages, including RTL languages.
Great resource. Thanks
Since you’re using non-ASCII characters anyway, why not have proper operators like ≠ instead of ASCII substitutes?
I think it’s somewhat obvious: if you consider that these characters are not available on English/Latin keyboards, why would they be on the Greek (QWERTY) keyboard? (As a Greek layout user, I can confirm they are not.)
So typing them would not be convenient.
I was wondering if some Greek readers would eventually have a look at this blog, as the title was somewhat misleading. Did you have the same kind of feeling as the colleague who helped devise this experiment?
Indeed I do, the English and Greek versions read the same to me. It feels like the same programming language with different visuals: Greek letters are more round hence less dense, so the Greek keywords make the code have less color.
Thanks. This is very interesting. I did not know this concept before
It is actually pretty easy to do in LispE:
(link "≠" 'neq)
(≠ 19 29)
true
You can modify anything the way you want. However, the only issue is simply to access it on the keyboard. :-)
In Tamgu, another language that I implemented, you can even write 2x² and it works…
When I read these lines of instructions, I immediately switch to a particular mode where they cease to be English and become something else: they become code.
Isn’t this the same for everyone? How else?
This is the question I ask myself. Is the experience different for someone who speaks English compared to someone who doesn’t?
Unlike @4ad I do subvocalize when reading code. I even structure my code so it’s more “literate.” I don’t know how exactly to describe that, I suppose in the same way that people have descriptive function names, I try to also have descriptive code structure, so it can be read in 1 pass as often as possible.
When I read code, I just ingest it, there’s no English “translation”, same with math.
I don’t do subvocalization in general, if there’s any relation.
Also, I don’t do syntax highlighting. Whichever process I use for consuming code, syntax highlighting breaks it. I can’t read highlighted code.
This is what is most interesting. It is like listening to music or reading a book: the experience is different for each of us. For instance, I cannot work on code without color highlighting. I need these landmarks on screen to detect and recognize structures.
I had this conversation with one of my friends, who used to listen to classes without ever taking notes, while I would need the process of note-taking to stabilize my memory.
And I do read the code on screen with subvocalization, as I have always read novels and articles.
See also Wenyan: https://wy-lang.org/
The documentation is gorgeous on desktop: https://book.wy-lang.org/ I’ve never seen syntax highlighted Chinese before. I wish I could read it!
Just waste seven years of your life on a PhD in Asian philosophy and you too can read “I have a number. It says 3. It is called by the first cosmic stem.”
Nice!!!
I’m amused it’s in Classical Chinese, and even more amused the name of the language is literally the Mandarin word for Classical Chinese.
I wrote a bit about this particular rabbit hole. But I think this is genius, actually making the language and not just presenting a hypothetical example.
And it is in Lisp… :-) However, this Lisp is the one that I implemented myself, based on arrays and not linked lists.
What is quite amusing is that the first computer I ever programmed was a TRS80 clone, which used the same exact BASIC language…
Actually, if you edit the grammar file and give it a different name, you can create your own programming language. The Greek version took us about 1h to create.
If this line of thinking is interesting to you, then you should definitely watch this talk from Deconstruct about programming for non-English speakers: https://www.deconstructconf.com/2019/ramsey-nasser-a-personal-computer-for-children-of-all-cultures
Thank you for the link. Actually, I followed the same kind of reasoning as him. The Lisp I implemented is based on Unicode to this effect.
To search for a sub-string in another one, we often use strstr in C or std::string::find in C++. The least we can say is that it is far, very far from optimal…
But is it? Shouldn’t those library functions be heavily optimized since they’re so widely used? (And if they aren’t, why not?)
I guess it is a very simple case of one size fits all.
Actually, the mod by a power of two is always a simple AND with 2^n - 1… If you compute the mod of a value by 11, all your values will be between 0 and 10… If you compute the mod of v by 8, all your values will be between 0 and 7, which is v & 0b111.
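A quick sanity check of that identity (it only holds for power-of-two moduli, so 11 would not work this way):

    #include <cassert>
    #include <cstdio>

    int main() {
        for (unsigned v = 0; v < 100; ++v)
            assert(v % 8 == (v & 0b111));   // 8 == 2^3, so the mask is 2^3 - 1 == 7
        std::printf("29 %% 8 = %u, 29 & 0b111 = %u\n", 29u % 8, 29u & 0b111);
    }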
What is LispE?
If you go up a couple of directories in that github you’ll find the ReadMe for the language.
Seems to be an interpreted, lazy lisp with pattern matching & arithmetic types à la Haskell. Sounds fun!
And array instructions à la APL… It is fun… :-)
What is LispE? I can’t find anything about it using google. Also not mentioned in the Lisp wikipedia article anywhere.
The github repo says:
This is a Lisp that is being developed at Naver Labs Europe (France). You have pre-compiled versions in the binaries section for Windows and macOS. You can also download it and compile it for Linux…
It almost feels like Python was written in a deliberately slow way. Like:
sum([w * e for (w,e) in zip(a,x)])
Simply removing square brackets to avoid creation of intermediate lists in memory would speed this line up very significantly.
And that’s not even touching on the subject of numpy: most people actually using Python for statistics would use it for such a task, because why not? So it’s not really “LispE vs. Python”, it’s “LispE vs. deliberately unoptimized pure-Python code”.
For your second point, the problem with punting to a library is that the same (or similar) libraries can be used from Lisp, making the comparison pointless. I’m not sure about LispE, but Common Lisp at least has bindings for a lot of math libraries (BLAS, GSL, LAPACK, etc.).
On the other hand, using one of those seems like overkill for the examples in the article. And if a language is so slow that any meaningful work needs to pull in third-party libraries like NumPy, then that’s worth knowing, too.
Didn’t seem to have much effect when I tried it out. If anything the code became slightly slower, 63ish ms vs 59ish on my computer. Why? No idea. The variance on either was +/- 3 ms anyway though, just from eyeballing it, so it’s a little difficult to know whether that difference was real.
That’s a good observation, thanks for going through the trouble :-). It is true that on small lists generator expressions can be slower than list comprehensions. It was just something that jumped out at me from a cursory look.
Yeah, list comprehensions are faster below a certain size; think of the machinery needed to run generators. Unfortunately, I think that size is pretty large :)
This seems like a pretty incurious reading of the article. Mine is that it’s quite explicit on exactly this question: the answer is, this algorithm was chosen as an exemplar of a sufficiently computationally complex yet still-understandable benchmark case for comparing the two languages as general compute languages. Not as a demonstration of why you should use this lisp to do your data engineering. The fact that the domain of this algorithm, in the real world, tends to revolve around some other highly optimized library is completely neither here nor there.
Thank you, you are absolutely right
Hum… I wonder if you have had a look at Haskell? Function composition is very elegant in this language:
filter . map (+) [1..10]
It even has an operator to flip arguments…
Look at K and Mathematica (aka Wolfram) for two languages that integrate the ideas from APL and Lisp in a more interesting way. In these languages, multidimensional arrays are represented by nested lists, and all scalar operations are generalized to operate on lists. This is a powerful idea whose benefits go well beyond the linear algebra implemented by LispE.
The idea is most powerful if you have only one kind of list (like in K). So a character string is just a list of characters. A matrix is a list of lists of numbers. Instead of representing bit strings as integers (as LispE does with the bitwise operators), a bit string is a list of booleans. All of your scalar operations (on numbers, booleans, and characters) are generalized to work on lists.
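A rough sketch of that ‘scalar operations generalize over nested lists’ idea; the Value type and the add function below are invented for illustration, not how K or LispE actually implement it.

    #include <iostream>
    #include <vector>

    // A value is either a scalar or a list of values (a nested list models a matrix).
    struct Value {
        double scalar = 0;
        std::vector<Value> items;        // non-empty => this value is a list
        bool isList() const { return !items.empty(); }
        Value(double s) : scalar(s) {}
        Value(std::vector<Value> v) : items(std::move(v)) {}
    };

    // Generalized addition: scalars add; lists add element-wise; a scalar extends over a list.
    Value add(const Value& a, const Value& b) {
        if (!a.isList() && !b.isList()) return Value(a.scalar + b.scalar);
        if (a.isList() && b.isList()) {
            std::vector<Value> out;
            for (size_t i = 0; i < a.items.size(); ++i) out.push_back(add(a.items[i], b.items[i]));
            return Value(out);
        }
        const Value& list = a.isList() ? a : b;
        const Value& scal = a.isList() ? b : a;
        std::vector<Value> out;
        for (const Value& item : list.items) out.push_back(add(item, scal));
        return Value(out);
    }

    void print(const Value& v) {
        if (!v.isList()) { std::cout << v.scalar << " "; return; }
        std::cout << "( ";
        for (const Value& i : v.items) print(i);
        std::cout << ") ";
    }

    int main() {
        Value matrix(std::vector<Value>{ Value(std::vector<Value>{1.0, 2.0}),
                                         Value(std::vector<Value>{3.0, 4.0}) });
        print(add(matrix, Value(10.0)));   // the scalar 10 distributes over the whole matrix
        std::cout << "\n";                 // prints ( ( 11 12 ) ( 13 14 ) )
    }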
Do you have any good resources for learning K? I have a few bookmarked here but in everything I’ve found there seems to be a lack of practical exercises/projects so I end up forgetting everything I read.
I’m not sure if any of my resources are “good”. Another problem is that there are many dialects of K.
Thanks. The last one is the one I’m trying to follow this time.
K crash course was good. It refers to k7, which is no longer current, though it can still be run via web.
Q for mortals is decent, though very slow.
I believe this is the same as https://github.com/kparc/kcc
Latter is incomplete, in the process of being updated to k9.
Oh, good to know.
LispE operates on tensors, which are implemented as nested lists.
Tensors do not an apl make. Tensors, a concise notation, a rank operator…
In apl, everything is an array, and functions operate homogenously on the same. Power emerges from that.
In lisp, everything is an atom or a pair of things, and functions operate homogenously on the same. Power emerges from that, too.
Newer lisps add things like arrays, but an array is little more than a concession to performance; its role is similar to that of the cons: a container for other objects. Lisp’s language is abstraction, and it speaks it well.
When you try to blend the two, you end up with something that is neither a very good lisp nor a very good apl. (It may be good at being something else. I do not mean to make a value judgement.) Tensors feel more scalar than lisp’s aggregates and more aggregate than lisp’s scalars, and the homogeneity of each language is lost.
Actually, you should have a look at how lists are implemented in LispE (see https://github.com/naver/lispe/blob/master/include/listes.h). They are implemented as arrays, not as linked lists. I worked with many versions of Lisp over the years, and I had an issue with the idea of pushing elements at the beginning of the list, which is usually quite unnatural in most languages. I opted instead for a different implementation, which is composed of two layers: ITEM (the array implementation) and LISTE, which contains a pointer to an ITEM. LISTE also contains an index to the first element in the array. A CDR, for instance, creates a LISTE object that shares the same ITEM as the initial list, but moves its index one notch over.
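A stripped-down illustration of that two-layer idea; this is not the actual ITEM/LISTE code from listes.h (reference counting, growth and element types are all omitted), just the CDR-as-index-bump trick.

    #include <iostream>
    #include <memory>
    #include <vector>

    // ITEM: the shared, array-backed storage.
    struct Item {
        std::vector<int> values;
    };

    // LISTE: a view over an Item starting at index `home`. CDR just bumps the index.
    struct Liste {
        std::shared_ptr<Item> item;   // shared with every list derived by cdr
        size_t home = 0;              // index of the first element of this view

        static Liste fromValues(std::vector<int> v) {
            return {std::make_shared<Item>(Item{std::move(v)}), 0};
        }
        bool empty() const { return home >= item->values.size(); }
        int car() const { return item->values[home]; }
        Liste cdr() const { return {item, home + 1}; }   // O(1), no element copying
    };

    int main() {
        Liste l = Liste::fromValues({10, 20, 30});
        for (Liste rest = l.cdr(); !rest.empty(); rest = rest.cdr())
            std::cout << rest.car() << " ";              // prints 20 30
        std::cout << "\n";
    }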
I don’t see why something’s being unnatural in one language should have any bearing on that thing’s role in another language.
That being said, you can push elements to the end of a list by keeping a pointer to the final cons. Another very common pattern is to add elements to the head of a list and reverse it once done.
That’s all mostly beside the point, though; again, the semantic role of arrays in lisp is almost identical to that of conses in this context.
You may be interested in gamelisp for a different take on how to do a “lisp” with a vector rather than cons cells.
Actually, there is a macro implementation: defmacro (see https://github.com/naver/lispe/wiki/6.1-Description-of-methods-and-operators#defmacro)
Why was this implemented in C instead of in terms of a macro then?
I don’t see why this had to be implemented in C++ when you’re already working with a Lisp. The nice thing with Lisp is that the macro system allows for stuff like this to be added.
Examples, CL: https://github.com/mrcdr/polisher and https://www.cliki.net/ugly-tiny-infix-macro
Scheme even has a SRFI dedicated to this: https://srfi.schemers.org/srfi-105/srfi-105.html
LispE is described as a pretty minimal LISP, so it may not have macros; I didn’t find any mention of them when I skimmed the language description.
I’m wondering if this could be applied to concatenative (Forth-like) languages. I have a toy one I’m working on, that I might do some syntactic experiments with.
defmacro: https://github.com/naver/lispe/wiki/6.1-Description-of-methods-and-operators#defmacro
Then that should probably be implemented as soon as possible, at least much sooner than infix operator features.
If you’re referring to infix operators, it seems to be possible: https://arduino-forth.com/article/FORTH_exemples_convertInfixToPostfix
Also this one: https://github.com/rigetti/cmu-infix
because with macros you cannot easily handle operator precedence
But it has already been done in the examples I’ve shown.
Anyways, whatever. I would implement the entire thing in Lisp from the ground up, like SBCL has done, but judging from your codebase it looks like you have a strong preference for C. I won’t argue with you any further, but I urge you to explore Lisp itself a bit before taking the C route.
Gambit scheme has SIX: https://www.iro.umontreal.ca/~gambit/doc/gambit.html#Scheme-infix-syntax-extension
You don’t even need macros; you can write a metacircular interpreter that can be switched to infix or postfix notation. Changing how a lisp is interpreted is the whole point of a lisp.
Ah, I’m showing my ignorance it seems.
I haven’t had the ability to explore metacircular evaluators yet, but I am excited to see they are a topic of SICP which I am currently reading. Do you know any other introductions to the topic?
This in particular is what piqued my interest, as I love Lisp but don’t have any experience “changing how it’s interpreted”.
Reading the documentation of racket is probably the best place to start. It is the lisp that I would advise anyone starting off to look at. It feels like python felt in the early 00s.
Thanks for the tip, I’ll check it out.
I was implementing a programming language with multi-threading, and I was heavily testing it when I started having weird crashes that I could not understand. It took about a week to discover that the problem was linked to my reference counting mechanism. Sometimes the reference would drop to 0 and the object would be deleted while still in use by another thread. Sometimes the reference would stay at 2 and the object would never be destroyed, creating massive memory leaks (it was implemented in C++). Eventually, I managed to understand that it was a concurrency problem and I implemented the reference counter as an atomic value, which saved the day.
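The fix described boils down to something like the following minimal sketch (not the actual LispE Element class): the counter becomes a std::atomic<int>, so concurrent increments and decrements from several threads cannot lose updates.

    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    struct Element {
        // With a plain `int` counter, two threads doing ++/-- at the same time can
        // lose updates, giving exactly the symptoms described: premature deletion
        // (the count reaches 0 too early) or leaks (the count never reaches 0).
        std::atomic<int> refcount{1};

        void increment() { refcount.fetch_add(1, std::memory_order_relaxed); }
        void decrement() {
            // The thread that brings the count to zero is the only one that deletes.
            if (refcount.fetch_sub(1, std::memory_order_acq_rel) == 1) delete this;
        }
    };

    int main() {
        Element* e = new Element;              // initial count 1, owned by main
        std::vector<std::thread> threads;
        for (int i = 0; i < 8; ++i) {
            e->increment();                    // one reference per worker thread
            threads.emplace_back([e] { e->decrement(); });
        }
        for (std::thread& t : threads) t.join();
        e->decrement();                        // drop main's reference; deletes exactly once
        std::cout << "done\n";
    }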