1. 19

    1. 20

      Many years ago, I was at a talk by Robin Milner where he argued that artificial intelligence is the wrong goal and that we should focus on augmented intelligence. Making computers do things that humans can do is not especially useful; making computers and humans able to do things that neither can do in isolation is hugely beneficial. Things like Copilot fit well into that lens: they don’t replace the human, they let the human accomplish more.

      To the notes at the end, I don’t find it worrying that an LLM can reproduce college essays, at least in the sense that’s implied. I see this as evidence that these forms of assessment are not measuring the right thing. I hope LLMs will be a disaster for the kind of assessment that doesn’t require original thinking yet purports to measure that faculty.

      1. 6

        I don’t find it worrying that an LLM can reproduce college essays, at least in the sense that’s implied. I see this as evidence that these forms of assessment are not measuring the right thing.

        I agree. When these LLMs reach such high scores on standardized school/university tests, the shocking thing is not how far computers advanced, but how primitive our assessment of skill truly is. That it can be fooled by what amounts to a probabilistic mechanism that generates likely chains of tokens. That is not the kind of skill and resourcefulness you look for when you’re hiring people. True resourcefulness is navigating uncharted waters, not doing what is familiar but doing what is logical, given the circumstances.

        1. 9

          It’s tricky. When you design an assessment there are four key things you need to think about:

          • Validity. Are we actually measuring the right thing? In the simplest case, are we actually measuring something at all? Internal consistency metrics measure this.
          • Reliability. Are our results reproducible? If two cohorts of equal ability sit the same test, will we see the same distribution of results?
          • Impact. Does the thing that we’re measuring correlate with the thing that people looking at the test score care about? For example, if the test is a qualification used for getting a job or getting into university, does the test score correlate with your ability to do well after being accepted?
          • Practicality. Can we actually run this test at the scale necessary? Having an expert follow the candidate around for a few weeks while they demonstrate the skill works really well, but tends not to scale (especially if you want reliability, so have to calibrate your experts’ judgement).
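
          As an illustration of the internal consistency metrics mentioned in the list above, here is a minimal sketch of Cronbach's alpha, one commonly used metric of that kind. The function name and the toy score matrix are hypothetical and not taken from any real assessment.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.

    scores: list of rows, one row per candidate, one column per test item.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Variance of each item (column) across candidates.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of each candidate's total score.
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: three candidates, two items.
consistent = [[1, 1], [2, 2], [3, 3]]   # items rank candidates identically
noisy = [[1, 2], [2, 1], [3, 3]]        # items partly disagree
alpha_consistent = cronbach_alpha(consistent)
alpha_noisy = cronbach_alpha(noisy)
```

          A value near 1 suggests the items move together; it says nothing about whether they measure the right construct.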

          A lot of university exams don’t even start with a construct against which they can measure validity. They have some aggregate hand waving towards impact (do people who graduate our course get good jobs and do well in them?) but often the sample sizes are too low. They do care about reliability because that’s the kind of thing that they need to demonstrate for accreditation, but only very loosely. They care a lot about practicality because the people marking assessments are not doing that full time; they’re also doing research.

          Even with those caveats, a number of the things where LLMs do well are the kind of assessments that are intended to drive learning rather than give a final score. They are intended to tell the student what they need to study more. Cheating at these is like cheating at exercise.

      2. 4

        Many years ago, I was at a talk by Robin Milner where he argued that artificial intelligence is the wrong goal and that we should focus on augmented intelligence. Making computers do things that humans can do is not especially useful; making computers and humans able to do things that neither can do in isolation is hugely beneficial. Things like Copilot fit well into that lens: they don’t replace the human, they let the human accomplish more.

        I wasn’t at the talk but I think I saw either a recording of it, or he made that point in a later talk that I saw, and it was a pretty defining moment for my outlook on programming, and on technology in general. Based on that, and on my personal observations, I also think rumours of programming being on its way to the urn because of “AI” (by which commenters usually mean LLM) are greatly exaggerated. This article is probably a good example: it has a few good ideas, but the examples are tone-deaf.

        I don’t remember where I read it first (I think Norvig quotes it in his big book on the topic, but I mostly skimmed it – my interest in AI is mostly confined to writing things that allow people who are better at math than I am to write AI tools) but at some point, someone noted that mankind finally achieved its ancient dream of flying when it stopped trying to emulate the flight of birds, and tried to make things that fly starting from the principles of flight but without attempting to emulate the form of flight.

        The examples in this article miss that: they get hung up on code monkey “tasks” (which may ultimately be fixed by incremental progress in LLMs; ChatGPT 4 fares better on a lot of things, who knows what ChatGPT 6 will be able to do) or pointless programming puzzles. With any luck, something good may come out of that, like the death of interview programming puzzles, but IMHO that’s not where the potential of LLMs lies.

        I’m (slowly – the web is really not my platform) writing a tool that uses LLMs for very practical problems. tl;dr wife runs a marketing agency, I’m writing a tool that assists with the analysis of various types of promotional text. It’s very much an analysis tool; it has an interface that vaguely alludes to virtual assistants but the interaction is not chat-driven at all.

        After hacking at it for a couple of evenings I’m fully convinced that LLMs will not spell the end of programming, and I’m really enthusiastic about what they can do:

        1. Getting that kind of analysis from a language model would’ve been absolutely impossible 10 years ago, and would’ve required concerted effort from several graduate students to come up with a very narrow model. Getting a very useful analysis (I’m talking things like identifying elements of marketing/sales strategy) is something I’ve been able to do with an investment of like thirty minutes, ten of which were spent yelling at Docker.
        2. Turning that into a useful program absolutely requires programming effort. This isn’t something you can do with a no-code tool. You can get some of the output the program produces with nothing but the ChatGPT interface but it’s extremely unwieldy. Properly organising it and correlating various bits and pieces of output is not something an LLM can do, and being able to do it efficiently and repeatably – which is definitely done by old-fashioned programming – is a big deal.

        The examples in the article are relevant in one way – I’ve found LLMs to be remarkably good at doing things that otherwise require tedious and somewhat “machine-like” human effort. For example, despite extracting and correlating a bunch of data from freeform text, and then showing it in standard format, my code has basically no serialization/deserialization logic. If I need ideas extracted from a text, and I need to hand them off in JSON format to the UI layer, I just request them in JSON format. If I need to show formatted text, I just request it in Markdown format. It’s not foolproof: chatgpt-3.5 occasionally hands me invalid JSON or tries to be nice and say “Here’s your JSON first”, so there’s some basic sanitizing logic in there; and there are security implications to it that I haven’t fully come to terms with. However, these look like problems that are perfectly tractable between incremental improvements in LLMs (or, for that matter, some format awareness) and some real-life experience.
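
        The commenter's actual sanitizing code is not shown, so the following is only a minimal sketch of what such logic might look like. It assumes the model wraps valid JSON in some friendly prose; `extract_json` and the sample reply are hypothetical names and data.

```python
import json


def extract_json(reply: str):
    """Return the first JSON object or array embedded in `reply`.

    Scans for an opening brace or bracket and lets the standard decoder
    parse from there, so chatty text before or after the payload is ignored.
    """
    decoder = json.JSONDecoder()
    for start, ch in enumerate(reply):
        if ch not in "{[":
            continue
        try:
            value, _end = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this position, keep scanning
        return value
    raise ValueError("no JSON object or array found in reply")


# Hypothetical model reply with prose around the payload.
reply = 'Here\'s your JSON first:\n{"ideas": ["urgency", "social proof"]}\nHope that helps!'
ideas = extract_json(reply)
```

        This only recovers well-formed JSON from a noisy reply; truly invalid JSON still raises, and the result should be validated before it reaches the UI layer, given the security concerns mentioned above.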

        This part of writing an application may go the way of the Dodo – all that code that takes these two JSON objects and produces this other JSON object in a slightly different format, or takes this CSV document uploaded by the customer and produces this standard BOM table. But bloody hell, good riddance. If anyone dreads a future in which we’ll have to do less CRUD and object mapping, they’re suffering from Stockholm syndrome.

      3. 2

        Seeing how prescient Milner was in general, I’ll go ahead and agree with him.

      4. 1

        […] we should focus on augmented intelligence

        I think things could head in this direction? LLM-generated code is probably difficult to trust. There is some interesting work on synthesis based on contracts / advanced type systems using SAT/SMT. I expect both approaches might converge. We may end up designing programs, with AI helping to fill in the gaps.

        1. 4

          Some things are. This is explicitly the goal of Copilot. I am not convinced that it’s going in the right direction, though, because it’s acting as a glorified autocomplete, and if typing is the bottleneck when programming then I’m doing something wrong. The kinds of things that I would love to see from LLMs are:

          • Give me LLM-generated code review comments, trained on code review comments and fine-tuned on my repo, before I push. Even if these have a moderately high false positive rate, if they can flag things that I should double check, they may be useful.
          • Natural-language-driven refactoring tools. There are a bunch of mechanical refactorings that I do that are easy to say but hard to automate.
          • Naming style checking. clang-format can do things like ‘this must be camel case’, but not things like ‘interfaces must be adjectives, classes that implement these interfaces should be noun phrases incorporating the interface’. That kind of thing would be fantastic for automated tooling to do since it’s one of the most time consuming things to get right in good APIs.
          • Automated flagging of APIs that don’t match (or, worse, almost match) design patterns used elsewhere in the library. This requires fuzzy pattern matching, the kind of thing that LLMs are really good at, but also requires quite a large context window. Humans are often bad at spotting this because their equivalent of a context window is smaller and doesn’t find the places where the library has exposed an interface that’s confusingly similar. In the simplest case, this can just flag the same kinds of parameters in different orders but there’s a lot more that it could do.

          The main thing here is that they’re all low stakes: it won’t break the code if the LLM is wrong (except the refactorings, but that can be coupled with testing) but it will make the output much nicer for the same human investment.
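
          The simplest case in the last bullet, the same parameters in different orders, does not need an LLM at all and can be sketched with exact matching. This is only an illustration under that assumption: the function name and the toy signatures are hypothetical, and the fuzzy "almost match" cases are exactly what this approach would miss.

```python
from collections import defaultdict


def flag_inconsistent_parameter_order(signatures):
    """Find functions that take the same parameters in different orders.

    signatures: mapping of function name -> sequence of parameter names.
    Returns a sorted list of groups, each a sorted list of function names.
    """
    by_param_set = defaultdict(list)
    for name, params in signatures.items():
        # Group functions by the set of parameters, ignoring order.
        by_param_set[frozenset(params)].append((name, tuple(params)))

    flagged = []
    for group in by_param_set.values():
        orders = {params for _, params in group}
        if len(orders) > 1:  # same parameters, more than one ordering
            flagged.append(sorted(name for name, _ in group))
    return sorted(flagged)


# Hypothetical library surface.
signatures = {
    "copy_file": ("src", "dst"),
    "move_file": ("dst", "src"),
    "delete_file": ("path",),
}
flagged = flag_inconsistent_parameter_order(signatures)
```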

      5. 1

        assuming that the human brain is not the maximum possible intelligence a mind can have, and that artificial intelligences can surpass our intelligence by orders of magnitude, and that they can be run on faster processors with larger data stores, that AIs could be reproduced with a simple ctrl-c ctrl-v – the question should not be “what is AI good for?”, but “extrapolating current trends, how much longer will humans be useful for anything at all?”

        1. 8

          assuming that the human brain is not the maximum possible intelligence a mind can have,

          That is probably true, though it may be that it devolves into discussions about forms of intelligence. The human brain is an interesting balance of compute vs I/O and it may be that increasing compute power alone is not useful, while increasing both compute and I/O gives something that is unrecognisable. Given that we currently have a sample size of one for building models of kinds of intelligence, there’s a lot of margin for error in anything that we build.

          and that artificial intelligences can surpass our intelligence by orders of magnitude,

          That’s a nice hypothetical but no one knows how to build one that even reaches the level of a human yet. The article shows problems that young children can solve but that state of the art LLMs can’t. To exceed by orders of magnitude, they’d first have to reach parity.

          That said, there are some things that computers can already do orders of magnitude faster than humans. They can compute the sum of a list of integers many orders of magnitude faster than humans, for example. A human with a spreadsheet can achieve far more than a human without a computer, but the spreadsheet by itself can achieve nothing without human agency framing the problems. There are a lot of places where human+computer could be vastly more useful now than they are, but are being overlooked in the rush to make computer-without-human systems.

          and that they can be run on faster processors with larger data stores

          Again, that’s also not clear. Human brains are probably not quantum computers, but they are highly connected neural networks. Neurons in the brain can connect to several thousand other neurons. Transistors on silicon chips can connect to single digits of other transistors. Simulating this with matrix multiplication is not space efficient and slows down as the connectivity grows (especially when you use back propagation to represent feed-backwards neural networks). Adding larger data stores increases latency.
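
          To make the space argument concrete, here is a rough back-of-the-envelope sketch. It assumes a dense square weight matrix versus storing only the connections that exist; the neuron count and the figure of 5,000 connections per neuron are illustrative placeholders for "several thousand", not measurements.

```python
def weight_counts(neurons: int, connections_per_neuron: int):
    """Number of stored weights for a dense matrix vs. only real connections."""
    dense = neurons * neurons                   # every neuron paired with every other
    sparse = neurons * connections_per_neuron   # only the connections that exist
    return dense, sparse


# Hypothetical network: one million neurons, 5,000 connections each.
dense, sparse = weight_counts(1_000_000, 5_000)
overhead = dense // sparse  # how many times more entries the dense form stores
```

          With these made-up numbers the dense form stores 200 times more entries than there are connections, and the gap widens linearly as the neuron count grows while per-neuron connectivity stays fixed.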

          So, your argument starts with three axioms, none of which is proven and at least one of which may not be true in the short to medium term. Logical arguments that start from unproven axioms are not generally helpful.

          extrapolating current trends, how much longer will humans be useful for anything at all?

          Extrapolating from current trends, forever. That may not be good extrapolation because this kind of development often happens in bursts. Extrapolating from the difference engine, humans would be useful for menial calculation for a long time, but extrapolating from the first electronic computers gave a clear end point for that utility. AI systems running on quantum computers may be able to express forms of conscious intelligence that are so far beyond humans that they’re unrecognisable, given some clever bit of mathematics that no one has thought of yet.

        2. 3

          I think that you’ve begged the setup. What is intelligence? Or, at least, what does it mean to compare intelligences? Is there a lattice of intelligence?

    2. 6

      This was a very well written example of quite a common genre of article: find a strawman claim that AI will replace humans. Then dive deep into some experiments with LLMs that show that claim not to be true… while incidentally demonstrating an astonishing array of capabilities that would have been seen as completely stunning just a couple of years ago.

      To this author’s credit, unlike many other instances of this genre they didn’t conclude with “…. and therefore LLMs are a waste of time”.

      1. 7

        I’m not sure I understand what you mean by “strawman” here. Matt Welsh seems like a real person who really did claim that “In situations where one needs a “simple” program […] those programs will, themselves, be generated by an AI rather than coded by hand.” I don’t think the author “intentionally misrepresented [the] proposition” (as per the definition of strawman) that Welsh made.

        At best, if there’s any misrepresentation, it is that the word ladder finding program is not “simple”? I think even if you disagree that it is “simple” according to Welsh, a charitable reading would say that this is an unintentional misrepresentation, since I could not easily find Welsh’s definition of “simple” which disagrees with word ladder being “simple”. If anything, Welsh’s abstract is that “[t]he end of classical computer science is coming, and most of us are dinosaurs waiting for the meteor to hit” – unless “most of us” are people who cannot write programs more than 200 lines, word ladder would fall squarely inside of the definition.

        Welsh, according to my quick review, holds a PhD in CS from University of California, Berkeley, a respected university and far from a diploma mill. While not being a specialist in AI, this doesn’t even point to any cherry picking on the part of the writer of the article: he argued with a claim made in earnest by a professional in the field and published in a journal with over 50 years of history and an impact factor of over 20.

        I think disputing claims made by CS PhDs in respected journals should not qualify as “arguing with a strawman”.

        1. 4

          That’s fair: calling this a “strawman” doesn’t hold up to the definition of strawman.

          The rest of my comment still stands. There are plenty of articles that start from what I think is a bad proposition - that AI is an effective replacement for human intellectual labour - and then argue against that in a way that also highlights quite how incredible the capabilities of modern LLMs are.

        2. 4

          Yes good point. FWIW the article by Welsh was discussed here


          Welsh is an accomplished person, but “the End of Programming” is obviously using his academic credibility to promote a viewpoint beneficial to his AI coding startup.

          It’s a little disappointing to see the obvious “troll” title and conflict of interest.

    3. 5

      Fabulously written and engaging article. Loved it. The prose was superb. I wish I could write even 1/4 as well as the author.

      Prior to reading this, I thought I was firmly in the camp of “LLMs are fake”, but I appreciate the nuance of the argument a bit more.

      What I don’t understand (and I don’t pretend to understand LLMs well) is this: when I ask a programmer to write a program, I expect the writer to employ creativity, experience, and perspective. Where does that fit into an LLM’s response? The author mentioned new algorithms, which is pretty much what I’m asking: if LLMs have a “mental” model of the world, can they also use that model to generate creative and novel solutions? If they have a model of the world, I think they should be able to. I’ve never seen a creative solution by LLMs.

      I guess I’m still in the camp of “LLMs are fake”.

      1. 3

        What do you mean by “creative solutions” here?

        I get what I think are creative solutions from LLMs genuinely several times a day - because I’ve developed intuition as to what kind of prompts will get the most interesting results out of them.

        “Give me twenty suggestions for ….” is a good format, because the obvious ideas come first but the interesting stuff will start to emerge once those have been covered.

        Just in the last 24 hours: “20 ideas for exciting and slightly unusual twists on backyard grill burgers” https://chat.openai.com/share/cc72358b-8915-4f40-98e7-684b9987ef0d

        And I got it to brainstorm Spanish nicknames for my mischievous dog: https://chat.openai.com/share/eb8bec31-76d0-464f-a123-a9d0823ad1f8 - I particularly enjoyed “Desenrolladora de Papel” and “Deslizadora de Alfombras”.

        I also got an interesting optimization out of it for loading the graph when I tried the word chain example myself: https://chat.openai.com/share/c2b2538e-4d8b-40e2-a603-b9808b932000

        1. 5

          I think this falls into the “kaleidoscope for text” bucket. There are interesting and novel juxtapositions of existing ideas, but no truly new ideas. Which TBF, humans rarely produce as well. “Genius” is the term for some work that goes outside of the ordinary bounds and then manages to bring a critical consensus behind it to endorse it. Most people have no genius, so it’s not fair to ask an LLM to have it. But genius is crucial to the movement of history. Without it, there’s just stagnation: the rearrangement of existing ideas.

          1. 2

            Oh I really like “kaleidoscope for text”.

          2. 1

            Arguably “Genius” by this definition can still be the product of a trickle of innovations building upon one another over decades or centuries. The example that comes to mind is Television, which arguably was invented entirely independently by several people roughly simultaneously around the world. It could be argued that the invention of television was an inevitability, brought about by the confluence of a multitude of innovations over the previous century by thousands of individual humans.

            I agree that it’s not fair to expect LLMs to have “genius”, or even to expect them to innovate on their own (absent human input). But could they still be said to “innovate” by putting together the creations of humans (or past LLMs!) in novel ways, the same way humans often do? I think it’s possible, and quite useful.

      2. 2

        Building on @simonw’s excellent reply, which you should respond to first:

        The whole point of LLMs is that they produce novel, “creative” solutions. The most amazing thing about them is that despite being trained on a corpus of mostly human-generated text, the best ones are still able to generate surprisingly sophisticated results. Here’s another prompt I generated today, to write Raspberry Pi code for cheese-finding robotic mice. Perhaps not the most complicated example, and I had to hold its hand a bit, but certainly relatively novel.

        I’m also not sure at all what you could mean by “LLMs are fake”. Clearly they are a real thing that exists. It’s not a “Mechanical Turk” where a human is actually generating responses for you. They’re just programs that take input and produce output, based on a complex algorithm manipulating a huge corpus of text. In what way are they “fake”?

        For fun, I asked ChatGPT what it thought you meant. Let me know if it got it right.

    4. 3

      Nice article. I got interested and ran a little experiment. I asked Gepeto “can you write a javascript program that adds two numbers together and prints the result?”, and then kept asking “Can you find the three bugs in the last iteration of the program?” and as expected, it just goes on forever and ever, always finding three “problems” - sometimes real, sometimes not related to the code, and the solutions also only sometimes related to the problem.

      I mean, I expected that result, this was just a fun experiment. I wish more people realized that all of it is nonsense that needs to be double and triple checked.

      1. 4

        That’s a good experiment, and illustrative of how little ChatGPT is reasoning about what you give it.

        It’s a little bit reminiscent of how older chatbots (going all the way back to Eliza, really) depended on the human input to provide meaningful content, which the programs would mine for guidance on how the conversation ought to proceed, thus creating the illusion that the chatbot is meaningfully engaged with the topic. When this content is withheld (or, as in your experiment, undermined), their output abruptly becomes much less impressive.

      2. 2

        I’ve had problems with this where I ask ChatGPT to proofread a paragraph, and then it will suggest I “fix” the paragraph by removing problems not in the actual text. :-(

    5. 2

      I agree with Sutskever that there is some world model formation, both because I buy his philosophical argument and because I think we have some evidence that supports it. But that model is very primitive. Partly that’s because the technology is still very primitive (anyone who says differently is selling something). Partly that’s because other forms of experience than text are beneficial for building a world model. I’m an avid reader and I believe in the power of text to immerse the reader in a world, but it sure works a lot better if the reader has prior direct experience of something at least somewhat similar that you can build on.

      Sure, I buy the argument that LLMs’ language acquisition is similar to babies’, but babies are experiencing the world in other ways even before they start talking. They’re eating and drinking, they’re looking around, they’re sticking things in their mouths to find out how they feel and taste, they’re wiggling around and learning what it’s like to have limbs, how it feels to move them, what gravity is, what it means for something to be soft or hard. They have the “mama” concept before they have the “mama” word. We have some visual and audio “AIs” at least, but they exist separately from LLMs. LLMs are brains-in-jars, to the extent that they’re brains at all.

      And I simultaneously agree with 90% of what the stochastic-parrot folks say. You can write a college essay that will get a good grade with only superficial understanding of a subject and no deep thought — in fact you can get an entire undergrad degree that way. Popular music is formulaic. And most people do need the exact same advice as millions of other people who have been in the same situation — but they prefer to believe that their case is special, so you have to break past that barrier to get them to accept advice.