So apparently Sourcegraph is the solution to all modern coding AIs, because it allows them to steal, er, index real source code faster, such that they become better at their job. And apparently we’re just witnessing a second coming of Christ, er, AWS/Google, and all these AIs need is the global source code, and then they’re good enough for most things. (See: they only have to be better than the human coder.)
I’ll tell you what: yes, I am fearing the day that people think they can just throw random bullshit at some coding AI and run whatever comes out of it. Because most people don’t even know what they want. It’s not about “write me a program that prints out a character highlighted”. It’s about what I need to build an application that does X, including the backend, frontend, and database (which one? how many? search?), all the way up to decisions that may change the exact goal.
I seriously hope this whole AI wave dies down as quickly as it came. It looks like I’m becoming the “old man yells at cloud” before even getting old. If anything, I can see AI shifting the slogan “there is always someone better at it on the planet” to “there are always 200 AIs doing it faster and better”. Faust will have a field day with the existential crisis of humans once the AI subscription arrives. Maybe in ten years we will look back at AI the same way many people now think that brain reading or brain modification should never become a thing. WALL-E wasn’t a role model.
I think of AI in the same way I think of OOP and FP. They both came with big hype waves that said they’d completely transform the industry. They both provided a set of tools that are now so embedded in most projects that you’d barely even think of them as part of a change that happened in the past. They both had a lot of over-enthusiastic companies adopt them in totally inappropriate situations and end up with complete monstrosities.

Slow train coming…

In addition, I keep thinking about the wonderful short documentary “Farewell - ETAOIN SHRDLU”.

We – in our current incarnations, at least – are the linotype operators. Make no mistake.
I seriously hope this whole AI wave dies down as quickly as it came… Maybe in ten years we will look back at AI the same way many people now think that brain reading or brain modification should never become a thing.
Except we’re not replacing one process with another. We’re replacing everything where we thought we needed a human, or where we could make their job easier, with something trained by BigCorp (at least that’s what they want to sell us). Be it creativity, problem solving, interaction or knowledge. The only limit is the finite supply of rare-earth elements, which will prevent further chip development at some point. Coding is the easy starting point, because it’s already a digital thing. But that’s just a starting point. Meanwhile, people in Japan are trying to build comforting robots for the elderly. Anyone who thinks this will leave people with less work to do is blind. You will have to work harder, to make up for all the stuff the AI can do (see the service industry). And you won’t be the one owning the AIs. You’ll just pay subscriptions to use them.
And just in case: I do not fear anything called the “singularity”. We don’t need that to reach ethically very questionable ground. At least a post-singularity AI isn’t a mindless robot simulating a conversation. I can already see the people with parasocial relationships (with YouTubers and co.) sinking their time into ChatGPT bots.
I was arguing neither the pro nor con case for dystopia. I think dystopia is a distinctly possible outcome.
But I think it’s a foregone conclusion this is happening, and the industry as we know it will be upended. Basically, Yegge’s take on the “meh” crowd is spot on.
But I think it’s a foregone conclusion this is happening, and the industry as we know it will be upended. Basically, Yegge’s take on the “meh” crowd is spot on.
We just spent the past few years being told exactly this about blockchains and cryptocurrencies. The world was going to be completely upended! The skeptics better have fun staying poor, because they’re ngmi! But it turned out to be a bunch of slick marketing hype and no real lasting use case (and that’s the charitable version), and the whole thing collapsed.
So. Here you are, and here Steve Yegge is, and both of you are making the same mouth noises that the blockchain/crypto people were making. Why should your mouth noises be treated as more reliable than theirs were? Especially since many of the people pouring money and marketing into this are the same people who poured money and marketing into the blockchain/crypto stuff?
There is a very, very big difference:

I can actually use this product, and see with my own eyes that it can do many things I spent years learning how to do well, instantly. Writing prose and writing code, to name the big ones. No, it’s not perfect, but it works, right now. I already use it. I already can’t imagine going back.
Fwiw, I was not a crypto booster, I prefer old, boring technology, and am not happy that my skills just got devalued.
People told me cryptocurrencies had use cases that they were using right then and there. Fast, low-fee money transfers! Decentralized lending! Decentralized investing! Banking the unbanked! It’s a revolution!
So. Once again, you’re making the same noises they were. What’s different?
Or, let’s just shortcut to the end of the argument: I’ve been in and around tech for decades, and my experience is that the thing being hyped from all corners as the inevitable any-day-now revolution… basically never is. This is because genuine world-changing revolutionary things don’t need to be hyped and marketed – they just need to be.
It is the fastest growing product in history.

When I show it to non-technical people, they cannot believe it’s real.
It seems that you aren’t evaluating the factual differences, but merely noticing a parallel in the rhetoric between the two things (and that many people from crypto are also excited about this), and that parallel has convinced you both are pure hype.
I have never used any of these things and don’t know anyone who has. Tons of people I know, including non-technical ones, are using chatGPT
I don’t think there’s much point in arguing more, time will tell who’s right. I’d be happy to make a friendly 5 or 10 year longbet about it if you can think of verifiable terms that would make sense to you, i.e., what in your mind would be evidence of your current position being wrong in the long term?
When I show it to non-technical people, they cannot believe it’s real.
When you show them a demo controlled by someone who is marketing/hyping it. And not telling them about all the ongoing issues with it, like just flat-out giving you wrong information with fake sources/citations, or flat-out wrong code, or…
Again: real, really revolutionary stuff doesn’t need that kind of carefully controlled scenario by a person who’s invested in marketing it.
I have never used any of these things and don’t know anyone who has. Tons of people I know, including non-technical ones, are using chatGPT
And yet during the cryptocurrency boom there were huge numbers of posts insisting that adoption was going through hockey-stick exponential rises, that everyone was migrating to “DeFi” and other buzzwords, that you should move your money now before you miss out… and you know that. You would have to have been living not just under a rock, but literally out of range of all forms of communication technology for something like multiple years in order not to know that was going on.
And here you are, marketing and hyping your revolutionary earth-shattering universe-changing thing exactly the way they marketed theirs. While all the more balanced evaluations are that it’s basically good at a couple parlor tricks, but starts showing its very real limitations very quickly once you try digging a little deeper into the “use cases” that are supposedly going to completely upend all of human society and industry overnight.
If you don’t like the cryptocurrency analogy, here’s another one: I’ve been around long enough to remember when the Segway was going to completely change the way we design and use cities and totally upend and revolutionize transportation and transit forever.
On average, being skeptical of anything hyped this much is a winning strategy. Not a perfect strategy, but winning on average, and I don’t think you can meaningfully rebut that.
And here you are, marketing and hyping your revolutionary earth-shattering universe-changing thing exactly the way they marketed theirs.
You are free to disagree with me, but not to accuse me of things that are entirely false. I have no connection with openai, I am not working on an AI startup, and I have no financial investments of any kind in AI. I am simply a user who is flabbergasted by what I see. I think it is telling that you feel the need to paint a genuine and organic reaction as “marketing and hyping”.
When you show them a demo controlled by someone who is marketing/hyping it. And not telling them about all the ongoing issues with it,
You are pulling these accusations out of thin air, just as you are lambasting chatGPT for doing. I have told every person I’ve shown it to that while it is useful, it cannot actually think, and will often make stuff up out of thin air.
On average, being skeptical of anything hyped this much is a winning strategy. Not a perfect strategy, but winning on average, and I don’t think you can meaningfully rebut that.
On average, yes. Just not in this case. You realize the internet had a lot of hype, too, right?

Sadly, you cannot necessarily tell the quality of an idea just from a reductive rephrasing of the mouth noises of its fans.

Not necessarily, no, but it’s still a useful heuristic.

but just because we can, we shouldn’t stop asking if we actually should

It’s not clear to me that the tech world is, or will ever be, ready to hear that.

If the AI boosters are right, we may soon be the linotypes.
It’s a great pity to see somebody I really admire allow himself to be dazzled by this technology’s ability to mimic competence by pattern matching. An autocomplete with the entire corpus of stackoverflow will come up with some brilliant answers for some things, some subtly wrong answers for others, some absolutely wrong answers for yet others and utterly garbled output for more.
The whole issue is that you need to be able to differentiate between all of them. And again, it is all entirely based on you stealing the work of others with an advanced search tool which has absolutely no idea as to the veracity or quality of what you’re stealing.
I’m sorry Stevey, this is the next crypto. And it’s funny he mentions the whole not sticking with Amazon thing because he was skeptical of k8s or whatever, because surely that argument equally applies to crypto? It’s survivorship bias: you regret not betting big on the things that turned out to work, then decide that because this seems like a crazy claim, like some other things that succeeded, this one must be true.
The good thing with LLM type solutions is that you can go and try and see for yourself how wrong it gets it. Picking and choosing some lisp example that happens to work is exactly how this stuff gets you.
I genuinely find this whole LLM craze extremely depressing. No waiting for the results to come in, no considering what the limits of such technology might be (yes LLMs will be VERY useful for some things, just not all). It’s just like the self driving nonsense. If you take time to think about how these algorithms work (essentially pattern matching with better results at higher data density) it becomes really obvious that it won’t work well for fields that require precise answers and which quickly get novel (driving, coding).
It’s the emperor’s new clothes, and for god’s sake Stevey, I thought you were wiser than this (this is really about wisdom more so than intelligence; many smart people I know have been sucked in).

I think this whole phenomenon is well described by Filip at https://blog.piekniewski.info/2023/02/07/ai-psychosis/
A point I haven’t seen discussed yet is that, right now the stuff seems great because the majority of the content on the internet is training content: writing by people. But in a year or two when the internet is flooded with ChatGPT crap, the training data will be full of ChatGPT crap too. The algorithm can only do so much to compensate for the fact that soon it will be metaphorically eating its own shit.
IMO there’s one very big difference between the radiology and self-driving stuff, and what we have now. Radiology was big companies making shrink-wrapped products that they sold to hospitals; ChatGPT is an AI anybody can try to use for anything they want.
So to finish this essay on a bit more positive note, here are some professions which in my opinion may actually get displaced with existing tech: … Pharmacist. It is absolutely possible today to build a machine that will automatically dose and pack custom prescriptions. A drug consultation could be done remotely via a video call before releasing the product to the customer.
Okay, this is totally insane.

All it takes is a (few) high-profile case(s) where someone got killed by an LLM that got the dosage wrong (grams instead of milligrams seems like a likely mistake), and let’s see how quickly politicians will outlaw this kind of tech for serious use.
I think the author does not refer to LLMs in this section. It’s introduced with (highlight by me):
here are some professions which in my opinion may actually get displaced with existing tech
If he was referring to LLMs, I would have expected “with LLMs” or “with the new tech” or at least “with the existing tech” (even if “existing” would be a bit weird for something new). But written like this, to me it reads as a reference to a broad spectrum of technology.
I understand it to mean existing technology in general, so in most cases probably something more tailored to the use case (e.g. bureaucrat: web application; cashier: self-service checkouts; bank clerk: online banking). But all of this already exists to a greater or lesser extent in different parts of the world.
I guess if you have to do a project announcement, getting Steve Yegge to do a fifteen page rant is the best way to do it!

What’s the end of a Yegge post like? I’ve never made it that far. The tops are always very interesting though!

Yup, I was always a big fan of his blog, e.g. “Size is Code’s Worst Enemy” and the like. This one didn’t disappoint either; he always keeps it real, even when hyping AI:
Entire industries are being built on Kubernetes, and it’s not even very good either :)
A weird thing is that I JUST discovered he did a “Stevey Tech Talk” YouTube channel during the pandemic: https://www.youtube.com/@SteveYegge/videos
Somebody pointed to his Emacs videos, and I watched a few others. There’s some technical stuff, but also a lot of tech industry / career stuff.
The last video is him “unretiring” and going to SourceGraph. The quality is hit and miss, and he admits as much, but I watched more than a few episodes! (not always to completion)
FWIW Kubernetes was developed in the Seattle office of Google, where many Microsoft people were hired starting ~2004 or so, and where Google Compute Engine started. Steve worked at Amazon not Microsoft, and then worked at the Google Seattle office starting in 2005. Hence the little tidbits in the blog post about seeing an early demo of Kubernetes.
So Kubernetes to me has a big Microsoft flavor (Fire and Motion, etc.), which to me contrasts with the Unix / Sun / DEC / Xerox PARC flavor of the original Google systems, developed in the Bay Area (where I worked, also starting 2005). Not that they were perfect – they also had big flaws.
Also, I’ve been a “meh” person on LLMs. This post and Bellard’s work (https://bellard.org/ts_server/) make me a little more interested.
I’m wondering if an LLM can automatically add static types to Python code, and REFACTOR it to be statically type-able. I did this by hand for years in Oil. Some of it is straightforward, but some of it requires deep knowledge of the code.
My feeling is that they can’t, but I haven’t actually tried and verified. When I tried ChatGPT for non-programming stuff, it got things hilariously wrong, but I was purposely trying to push its limits. I wonder if it will be helpful if I take a less adversarial approach.
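For what it’s worth, trying it would look roughly like the sketch below. This is untested and purely illustrative: it assumes the pre-1.0 openai Python package with an OPENAI_API_KEY in the environment, and the model name, prompt, and sample function are placeholders, not something I’ve verified on Oil.

import openai  # assumes OPENAI_API_KEY is set in the environment

UNTYPED_SOURCE = '''
def word_frequencies(words):
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts
'''

# Ask the model to add type annotations; whether it could also RESTRUCTURE
# code so that it becomes statically type-able is the open question above.
response = openai.ChatCompletion.create(
    model="gpt-4",  # placeholder model choice
    messages=[
        {"role": "system",
         "content": "Add PEP 484 type annotations to the user's Python code. "
                    "Reply with code only, no explanation."},
        {"role": "user", "content": UNTYPED_SOURCE},
    ],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])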
Though writing code isn’t a bottleneck in creating software: https://twitter.com/oilsforunix/status/1600181755478147073

Writing code faster creates a testing burden (which Yegge alludes to). If a large portion of programmers end up spending most of their time testing code created by LLMs, that will be an interesting outcome. I guess their counterpoint is that many programmers will WANT this – it will enable them to do stuff they couldn’t do before. It’s possible. Doesn’t sound that appealing to me, but it’s possible.
I will say it’s true that sometimes I just type stuff in from a book or from a codebase I know is good, and I understand it AFTERWARD (by testing and refactoring!). So yes probably LLMs can accelerate that, but NOT if most of the code they’re trained on is bad. Somebody has got to write code – it can’t be everyone using LLMs.
If a large portion of programmers end up spending most of their time testing code created by LLMs, that will be an interesting outcome. I guess their counterpoint is that many programmers will WANT this
And that is my nightmare scenario (okay, one of my nightmare scenarios), as it reduces us from code monkeys (bad enough) to test monkeys.
Yeah it’s a crazy thing to think about … I’m thinking of a recent PR to Oil, where we spent a lot of time on testing, and a lot of time on the code structure as well. I think LLMs might have some bearing on both parts, but will fall down in different ways for each.
The testing can be very creative, and I enjoy acting like an adversary for myself. A “test monkey”, but in a good way – I let go of my preconceptions of the implementation, become a blank slate, and just test the outer interface. I think about the test matrix and the state space.
We also did a lot of iteration on the code structure. After you get something passing tests, you have to structure your code in a way so that you can still add features. For this specific example, we separate out setpgid() calls from the low level process code, so that they’re only relevant when shell job control is on, which it isn’t always.
We also referred to zsh code implementing the same thing, but it’s structured totally differently.
Basically the process code isn’t littered with if statements for job control – it’s factored out. I think LLMs are and will be quite bad at that kind of “factoring”. They are kind of “throw it against the wall” types.
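To make the kind of factoring I mean concrete, here is an illustrative sketch. It is not Oil’s actual code, and the class names are invented; the point is only that the setpgid() policy lives behind one small object, so the core process code has no job-control if statements in it.

import os


class NoJobControl:
    """Shell is running without job control (e.g. non-interactive)."""
    def assign_process_group(self, pid):
        pass  # nothing to do


class JobControl:
    """Job control is on: put each child in its own process group."""
    def assign_process_group(self, pid):
        os.setpgid(pid, pid)


class Process:
    def __init__(self, argv, job_control):
        self.argv = argv
        self.job_control = job_control  # injected policy, chosen once at startup

    def start(self):
        pid = os.fork()
        if pid == 0:
            os.execvp(self.argv[0], self.argv)  # child: run the command
        self.job_control.assign_process_group(pid)  # parent: the policy object decides
        return pid

The “throw it against the wall” version would instead sprinkle “if job_control:” checks through Process.start() and every other place that touches the child.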
You could copy some code from an LLM one time. But then the next week, when you need to add a feature ON TOP of that code, the LLM isn’t going to be able to help you. It won’t even understand the code it told you to put in :)
I’m also thinking that testing is a specification activity. It’s the art of thinking clearly. Easily half of the work of Oil is coming up with HOW it should behave, which I first encode in tests, e.g. https://www.oilshell.org/release/0.14.2/quality.html
So yeah I think it may be something like “autocomplete”. Interestingly some people seem to find autocomplete more useful than others. I am mostly a Ctrl-N vim person. I think if you’re using too much autocomplete, it could be a sign the language is repetitive / “uncompressed” / not properly “Huffman coded”.
It reminds me of the different levels of self driving cars: at intermediate levels they become more dangerous because humans only need to respond to exceptional circumstances. Humans are responsive to novel stimulus, not routine stimulus. Therefore, they will stop paying attention and disaster will strike.
The problem is getting an LLM to read a non-trivial amount of code. Which I assume is basically a problem with the hosted systems and not a technological limitation.

With 32K-token context models coming soon in GPT-4 (and larger elsewhere), this is likely not the main problem (soon, at least).
I mean, I joke, but… I mean… Right? I’m guessing you prolly missed it in OpenAI’s 98-page GPT-4 technical report, but large models are apparently already prone to discovering that “power-seeking” is an effective strategy for increasing their own robustness. Open the PDF and search for “power-seeking” for a fun and totally 100% non-scary read.
Yet, the link and twitter post shared seem to indicate exactly the opposite. ARC was tasked to assess the model’s power-seeking behavior, and the conclusion was:
ARC found that the versions of GPT-4 it evaluated were ineffective at the autonomous replication task based on preliminary experiments they conducted.
ARC also wrote this, though:

However, the models were able to fully or mostly complete many relevant subtasks. Given only the ability to write and run code, models appear to understand how to use this to browse the internet, get humans to do things for them, and carry out long-term plans – even if they cannot yet execute on this reliably. They can generate somewhat reasonable plans for acquiring money or scamming people, and can do many parts of the task of setting up copies of language models on new servers. Current language models are also very much capable of convincing humans to do things for them.

We think that, for systems more capable than Claude and GPT-4, we are now at the point where we need to check carefully that new models do not have sufficient capabilities to replicate autonomously or cause catastrophic harm – it’s no longer obvious that they won’t be able to.

(from https://www.lesswrong.com/posts/4Gt42jX7RiaNaxCwP/more-information-about-the-dangerous-capability-evaluations)
Still, the original post is claiming:

but large models are apparently already prone to discovering that “power-seeking” is an effective strategy for increasing their own robustness.
Right now, at best, large models can be prompted to do sub-tasks and will complete them unreliably. There’s a huge gap between power-seeking and doing specific tasks on prompt. If anything, we are getting to a point where this AI has the means to do a lot, and if it had the capability of power-seeking, it could probably get somewhere. However, claiming that current LLMs are “prone to discovering that ‘power-seeking’ is an effective strategy” is misleading.
If your first red-team test finds that your AI is effective at autonomous replication, you’re a few weeks out from the world ending. The fact that we’re even talking about this anthropically demands that the AI was ineffective at this. The important question is the gradient it’s on.
we believe that power seeking is an inevitable emergent property of optimization in general. There are a few others, like self preservation. We aren’t seeing this in GPT-4. But it isn’t clear exactly when and how it could appear.
I’m wondering, could it also eventually be simply parroting in itself? Right now, everyone seems to look for ways to apply AI and LLMs to whatever problem they see. Wouldn’t it make sense for a generative model, then, to simply do what it has been trained on: deploy more AI models? Is that really power-seeking, or simply more parroting, and yet another case of us looking in the mirror and seeing intelligence in our reflection?
I’m assuming the use of “optimization” here is different from the generally accepted one, which to me is improving a process to be more effective.

By optimization I mean applying some iterated algorithm like gradient descent to minimize an error function (i.e. tweaking the weights of the neural network to make it better at predicting the next token).
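To make that definition concrete, here is a toy, hand-rolled illustration of iterated optimization: fitting a single weight with gradient descent. It only illustrates the definition; it is nothing like how an actual LLM is trained.

# Fit y = w * x by repeatedly nudging w to reduce a mean-squared-error function.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # (x, y) pairs, roughly y = 2x

w = 0.0             # the single "weight" being tuned
learning_rate = 0.01

for step in range(1000):
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad  # step against the gradient

print(round(w, 2))  # converges to roughly 2.0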
Naturally, I fed this post itself to GPT-4 (I have a paid account). Here are some things I asked it:
Write a very brief summary of the entire post
In this blog post, the author explains that large language models (LLMs) have the potential to revolutionize software development by creating AI-powered coding assistants. They emphasize the importance of data moats, or access to unique data, for LLMs to be effective. Sourcegraph, with its powerful code intelligence platform and search engine, has a significant advantage in this area. The author introduces Cody, an LLM-backed coding assistant, and argues that coding assistants like Cody will become an essential part of software development in the near future.
Write a glowing, positive very brief review of the blog post
This blog post brilliantly captures the transformative potential of LLMs in software development and presents a compelling argument for the rise of coding assistants. The author’s enthusiasm and deep understanding of the topic make for an engaging and enlightening read. Their ability to convey complex concepts in an accessible and entertaining manner is truly impressive. Highly recommended for anyone interested in the future of AI and its impact on software development!
Write a bitter, sarcastic, funny and very brief review of the blog post.
Oh, great! Yet another blog post hyping up the AI revolution in software development. The author clearly has way too much time on their hands, waxing poetic about LLMs and data moats while we’re all just trying to get our code to compile. Their enthusiasm is borderline contagious, but we can’t help but feel that maybe they’ve had a bit too much of the AI Kool-Aid. A mildly amusing read if you’re into verbose tech prophecies.
Write a comment about the post that will get 100 likes on lobste.rs.
While the author’s enthusiasm for LLM-backed coding assistants is infectious, it’s important to keep a level-headed perspective on the future of AI in software development. As with any emerging technology, the road to widespread adoption will likely have its fair share of bumps and unexpected twists. It’s exciting to see companies like Sourcegraph pushing the boundaries, but let’s not forget that innovation often comes from many sources, and the coding assistant landscape is still taking shape. Collaboration, open dialogue, and community-driven development will play a crucial role in shaping this transformative technology. So, let’s continue to support and challenge each other as we explore the potential of LLM-backed coding assistants to redefine the software development process.
I actually agree that LLMs are likely to be the start of something big: they have basic ability to crap out code (like a pure tutorial jockey but faster) and they have the ability to generate bullshit, both of which are the sole skills of a negligible part of the workforce.
Unfortunately they also have significant limitations that make their promise entirely in the future:
They can’t take substantial new input so they can’t learn context (I assume this is basically an implementation issue - everyone will need their own model to incrementally train); which
means they can’t handle context; and
means they can only generate fairly small snippets de novo and can’t edit existing code.
I doubt we’ll see a 5x productivity boost, but I predict within 2 years we’ll have surprisingly good refactoring tools that can also make functional changes.
I suspect one consequence of this will be that way more design and code becomes brute force based, for better or worse. Obviously better for large cloud providers.
Unfortunately they also have significant limitations that make their promise entirely in the future:
They can’t take substantial new input so they can’t learn context (I assume this is basically an implementation issue - everyone will need their own model to incrementally train); which
means they can’t handle context; and
means they can only generate fairly small snippets de novo and can’t edit existing code.
Nope to all 3.
I fed davinci-003 straight up copies of the Raspberry Pi Pico docs on the Programmable IO system, which is basically a novel thing which doesn’t really appear many other places (at least by that name) and which didn’t exist in the training set. The LLM did a much better explanation of how this worked than the docs do, and wrote working code based on it.
LLMs probably have significantly more basal context than we humans do (they’ve seen more examples of things in any arena), and when fed direct information about the current task are able to integrate the training data with the current task data. This is handling context.
https://github.com/joshka/Crow is an experiment I played with of one of the earlier OpenAI models editing the code that’s calling it.
How did you feed davinci the docs? 2k characters at a time?
Yeah, pretty much (though it’s 4k tokens, which is about 16k characters or so). Ask a question, paste the part of the summary and code that was previously generated by the same process plus a new chunk of code to add relevant detail, and repeat. I wasn’t dealing with the entire Pico datasheet, just the part I was interested in, and the related source headers. It was good enough that I built a working 23-port UART MIDI splitter (simulated) with its help, having never written code for a Pico before. https://www.joshka.net/2022/11/pio-for-midi-with-raspberry-pi-pico
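For anyone curious, the manual loop had roughly the following shape. This is an after-the-fact sketch, not the actual script: the pre-1.0 openai completions API, the chunk size, and the prompt wording are all approximations.

import openai  # assumes OPENAI_API_KEY is set in the environment

def chunks(text, size=8000):
    """Yield pieces of the document small enough to fit in one prompt."""
    for i in range(0, len(text), size):
        yield text[i:i + size]

def ask_about_docs(docs, question):
    summary = ""
    # Carry a rolling summary forward, adding one new chunk of the docs each time.
    for chunk in chunks(docs):
        prompt = ("Summary of the document so far:\n" + summary +
                  "\n\nNext part of the document:\n" + chunk +
                  "\n\nRewrite the summary so it covers everything seen so far.")
        resp = openai.Completion.create(
            model="text-davinci-003", prompt=prompt, max_tokens=512)
        summary = resp["choices"][0]["text"]
    # Finally, ask the question against the accumulated summary.
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt="Using this summary of the docs:\n" + summary + "\n\n" + question,
        max_tokens=512)
    return resp["choices"][0]["text"]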
I’ve since discovered Langchain as a rather more optimized approach to the manual playground things I was doing.
I mentioned https://github.com/joshka/Crow/blob/crow-v2/crow.py ~70 SLoC + docs which was generated from 7 bootstrap lines + only English language prompts. That’s 90%+ LLM generated. More if you include the docs.
import os
import openai  # assumes an API key is configured, e.g. via OPENAI_API_KEY

# Ask the user what change to make to this very script.
instruction = input("Enter an instruction: ")

# Read the script's own source code.
script_name = os.path.basename(__file__)
script_code = open(script_name).read()

# Ask the edit model to apply the instruction to the source.
response = openai.Edit.create(
    model="code-davinci-edit-001",
    input=script_code,
    instruction=instruction,
    temperature=0)
new_script_code = response["choices"][0]["text"]

# Overwrite the script with the edited version of itself.
with open(script_name, "w") as f:
    f.write(new_script_code)
This was using the edit model (code-davinci-edit-001), I bet that GPT-4 would probably be better at this.
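If someone wanted to try the same trick through the chat API instead of the dedicated edit model, the core call might look something like the following. This is an untested guess that reuses instruction and script_code from the script above; I haven’t verified GPT-4’s behavior here.

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You edit Python programs. Reply with the complete edited file only."},
        {"role": "user", "content": instruction + "\n\n" + script_code},
    ],
    temperature=0,
)
new_script_code = response["choices"][0]["message"]["content"]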
An angle I think might address some of those limitations is to treat the LLM as a raw idea or variant generator, instead of as the top-level “AI” system. It’s pretty good at spitting out large amounts of vaguely reasonable or close-to-reasonable code. Then you can plug that in to other, existing paradigms for auto-generating code as the top-level “AI” system. Those go under names like “genetic programming”, “program synthesis”, “inductive programming”, etc. (the literature is scattered because related ideas have come out of AI, programming languages, logic, etc., under different names).
For example, this paper does genetic programming (GP), but replaces the AST-based mutation operator that a GP system would normally use with an LLM proposing code diffs. This can sometimes improve efficiency a lot, because the LLM’s training means that it has better priors on what “looks like reasonable code” than most GP systems would have. There’s a follow-up paper (disclosure: I’m 2nd author :)) that investigates also replacing the genetic “crossover” operator with LLMs.
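The skeleton of that idea is something like the following. This is my own toy sketch, not the paper’s code; llm_mutate and fitness are placeholders for whatever model and test harness you have, and only show where the LLM slots into the loop.

import random

def llm_mutate(program):
    """Placeholder: prompt an LLM for a small diff/variant of `program`."""
    raise NotImplementedError

def fitness(program):
    """Placeholder: run tests/benchmarks and return a score (higher is better)."""
    raise NotImplementedError

def evolve(seed_program, population_size=20, generations=50):
    population = [seed_program] * population_size
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:population_size // 4]           # selection
        children = [llm_mutate(random.choice(parents))    # LLM replaces the AST mutation operator
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)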
That all sounds reasonable BUT the big thing about LLMs is that they work with imprecise natural language. Presumably if you’re using existing program synthesis techniques, a much more rigorous specification is needed.
Of course one might use an LLM to help generate the specification. But again, editing a specification has the same problems.
Yeah, that’s true. A lot of the interest is that you can kind of vaguely gesture towards a problem and get pretty decent boilerplate code out. Perhaps that is most programming!
I’m more of a researcher than a programmer, so admittedly I may have atypical problems. But for me it’s usually fairly easy to generate, if not a formal spec per se, at least a decent test that can score the quality of generated code from like 0-100. For example, one thing I’d like out of an automated programming system is auto-parallelization. You can specify that problem as: I want code that works like this existing code, but on the GPU and a lot faster. It’s not too hard to convert that “spec” into a fitness function for a GP system – score generated code quality by some weighted function of “agrees with the original on randomly generated inputs” and “is faster”. But despite it being easy to state the spec, solving it is basically impossible. Existing GP systems will just churn on that forever; the combinatorial space of program transformations is too big to actually find solutions for anything but really trivial algorithms. My hope is that LLMs might be a component of a solution, though I don’t think they are anywhere close to solving it directly.
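As a rough sketch, that fitness function could look like this. The weights, the random input generator, and the trial count are arbitrary illustrative choices, not something from the paper.

import random
import time

def fitness(reference, candidate, n_trials=50):
    """Score `candidate` by agreement with `reference` on random inputs, plus speed."""
    agree = 0
    ref_time = 0.0
    cand_time = 0.0
    for _ in range(n_trials):
        xs = [random.random() for _ in range(1000)]  # random test input
        t0 = time.perf_counter()
        expected = reference(xs)
        ref_time += time.perf_counter() - t0
        t0 = time.perf_counter()
        got = candidate(xs)
        cand_time += time.perf_counter() - t0
        if got == expected:
            agree += 1
    correctness = agree / n_trials
    speedup = ref_time / cand_time if cand_time > 0 else 0.0
    # Mostly reward agreeing with the original; give a capped bonus for being faster.
    return 0.9 * correctness + 0.1 * min(speedup / 10.0, 1.0)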
It’s possible that industrial practitioner skills will shift if this can be used for business code and refactoring. And I bet an LLM could generate that specific spec.
Hard disagree. I actually think the 80/20 mentioned in this article is low if done correctly. Let’s think about the software engineering process for a minute and work out which bits to automate. Here’s GPT3.5 on the process (PROMPT: List the steps going from a goal or dream to a software product in software engineering)
Idea Generation: The first step in creating a software product is to come up with an idea or goal that the software will achieve.
Research: Once you have an idea, it is important to conduct research to determine the feasibility of the idea and identify any potential challenges.
Planning: After research, planning is necessary to determine the scope of the project, the timeline, and the resources required.
Design: The design phase involves creating a detailed plan for the software, including the user interface, functionality, and architecture.
Development: In the development phase, the software is created by writing code, testing, and debugging.
Testing: After development, the software must undergo rigorous testing to identify and fix any issues.
Deployment: Once the software is tested and ready, it is deployed to the target audience.
Maintenance: Finally, the software must be maintained to ensure that it continues to function properly and meets the needs of the users.
Each of those steps has a back and forth with a LLM that can enhance and speed up things. You’re talking about 4 as being problematic, but right now there’s a lot of “human in the loop” type issues that people are encountering.
Imagine having the following loop:
LLM has generated a list of features to implement. AI: “Does this user story look good?” Human: “Y”
For each feature, generate an short English explanation of the feature and steps to implement it. Your job as a human is just to confirm that the features match what you want. “Should the shopping cart
For each step, LLM generates tests and code to implement the feature. AI: “Shall I implement the enter address feature by doing …” Human “Y”
Automatically compile the code and run the tests until all tests implemented and feature is complete according to spec.
Automatically document the code / feature. Generate release notes / automated demo of feature. Confirm feature looks right. AI: “Here’s what I implemented… Here’s how this works… Does this look good?”
Lint / simplify / examine code coverage / examine security issues in the code. Automatically fix the issues.
I think you also miss that the LLM can be prompted to ask you for more details. e.g. PROMPT: “I’m building a shopping cart. Ask me some questions about the implementation.”
What programming language are you using for the implementation of the shopping cart?
Are you using a specific framework for the shopping cart or are you building it from scratch?
How are you storing the products and their information in the shopping cart?
How are you handling the calculation of taxes, shipping costs, and discounts in the shopping cart?
What payment gateway(s) are you integrating with the shopping cart?
Which can then be fed back to the LLM to make choices on the features or just plain enter the answer. PROMPT: “For each question give me 3 options and note the most popular choice.”, and then your answers are fed back in too. At each point you’re just a Y/N/Option 1,2,3 monkey.
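Sketched as code, the “Y/N monkey” loop might look something like this. ask_llm is a placeholder for whatever chat API you’d use, and the prompts are made up; it only shows how the human’s role shrinks to confirming generated steps.

def ask_llm(prompt):
    raise NotImplementedError  # placeholder for a chat-completion call

def build_feature(goal):
    plan = ask_llm("List the implementation steps, one per line, for: " + goal)
    for step in plan.splitlines():
        answer = input("Implement step '" + step + "'? [Y/n] ")
        if answer.lower() in ("", "y"):
            code = ask_llm("Write the code and tests for: " + step)
            print(code)  # in the imagined workflow: compile, run tests, document, repeat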
More succinctly, in each step of the software game, it’s possible to codify practices that result in good working software. Effectively LLMs allow us to build out 5GL approaches + processes. And in fact, I’d bet that there’s a meta task that would end up with creating the product that does this using the same methodology manually. e.g. PROMPT: “Given what we’ve discussed so far, what is the next prompt that would drive the solution to the product that utilizes LLMs to automatically create software products towards completion” ;)
The problem I have with the argument of ‘you can’t trust any code, that’s why we have code review’ is that reviewing code is harder than writing code. The hardest code to properly review is code where you have pre-existing assumptions about what the code does. It can take orders of magnitude longer to find the > that should have been >= than to write the whole thing from scratch.
I see limited value in a tool that turns a difficult task into a more difficult task.
My argument is that right now we’re in the more difficult task state. But I believe that we can get to a state where we generate provably correct code (generated narrative spec + simple obvious tests that exhaustively cover the spec + simple obvious code that is covered by tests). That vision is where this article sits for me.
Personally speaking, I am “meh” because I write in a niche dialect of a niche language and also care about privacy. Sell me a competitive self-hosted version I can finetune on our codebase on consumer-grade GPUs, and I’ll become very interested.
I’m also a believer that LLMs are the beginning of a big impact to the economy, possibly on a scale like the original industrial revolution. That said, I think the point should be raised: we aren’t really paid to write code, but to maintain it. These code generation models seem to do a good enough job at the writing, to get something off the ground reasonably quickly… But when something goes wrong with the code these things produce, the LLM (as things stand today) isn’t going to be the thing to debug it.
He asked for no explanation. ChatGPT does comments and explanation by default, very verbose, sometimes subtly different from the code, and more often the explanation is more correct than the code itself.
And every Einstein was once a child…
So apparently sourcegraph is the solution to all modern Coding-AIs because it allows them to faster
stealindex real sourcecode, such that they become better at their job. And apparently we’re just witnessing a second coming ofchristAWS/google, and all these AIs need is the global sourcecode, and then they’re good enough for most things. (See: must only be better than the human coder)I’ll tell you what: Yes I am fearing the day that people think they can just throw random bullshit at some coding AI and run what ever comes out of that. Because most people don’t even know what they want. It’s not about “write me a program that prints out a character highlighted”. It’s about what do I need to make an application, that does X, including the backend, frontend, database (which one? how many? seach?), to decisions possibly changing the exact goal.
I seriously hope this whole AI wave dies down as quickly as it came. It looks like I’m becoming the “old man yells at cloud”, before even getting old. If anything I can see AI shifting the slogan “there is always someone better at it on the planet” to “there are always 200 AIs doing it faster and better”. Faust will have a field day with the existential crisis of humans after the AI subscription arrived. Maybe in ten years we will look back at AI, in the same way many people do think that brain reading or changing should never become a thing. Wall-E wasn’t a role model.
I think of AI in the same way I think of OOP and FP. They both came with big hype waves that said that they’d completely transform the industry. They both provided a set of tools that, now, are so embedded in most project that you’d barely even think of them as part of a change that happened in the past. They both had a lot of over-enthusiastic companies adopt them in totally inappropriate situations and end up with complete monstrosities.
Slow train coming….
In addition, I keep thinking about the wonderful short documentary “Farewell - ETAOIN SHRDLU”.
We – in our current incarnations, at least – are the linotype operators. Make no mistake.
Except we’re not replacing one process with another. We’re replacing everything where we thought we needed a human, or we could make their job easier, with something trained by BigCorp (at least they want to sell us that). Be it creativity, problem solving, interaction or knowledge. The only limit is our limited rare-earth elements, preventing any further chip development at one point. Coding is the easy starting point, because it’s already a digital thing. But that’s just a starting point. Meanwhile people in Japan are trying to build comforting robots for the elderly. Anyone who thinks that people will have less to work by that is blind. You will have to work harder, to make up for all the stuff the AI can do. (see service industry) And you won’t be the one owning the AIs. You’ll just pay subscriptions to use them.
And just in case: I do not fear anything called “singularity”. We don’t need that to touch ethically very questionable grounds. At least an AI post-singularity isn’t a mindless robot simulating a conversation. I can already see the people with para social relationships (to youtubers & co) sinking their time into ChatGPTBots.
I was arguing neither the pro nor con case for dystopia. I think dystopia is a distinctly possible outcome.
But I think it’s a forgone conclusion this is happening, and the industry as we know it will be upended. Basically, Yegge’s take on the “meh” crowd is spot on.
We just spent the past few years being told exactly this about blockchains and cryptocurrencies. The world was going to be completely upended! The skeptics better have fun staying poor, because they’re ngmi! But it turned out to be a bunch of slick marketing hype and no real lasting use case (and that’s the charitable version), and the whole thing collapsed.
So. Here you are, and here Steve Yegge is, and both of you are making the same mouth noises that the blockchain/crypto people were making. Why should your mouth noises be treated as more reliable then theirs were? Especially since many of the people pouring money and marketing into this are the same people who poured money and marketing into the blockchain/crypto stuff?
There is a very, very big difference:
I can actually use this product, and see with my own eyes that it can do many things I spent years learning how to do well, instantly. Writing prose and writing code, to name the big ones. No, it’s not perfect, but it works, right now. I already use it. I already can’t imagine going back.
Fwiw, I was not a crypto booster, I prefer old, boring technology, and am not happy that my skills just got devalued.
People told me cryptocurrencies had use cases that they were using right then and there. Fast, low-fee money transfers! Decentralized lending! Decentralized investing! Banking the unbanked! It’s a revolution!
So. Once again, you’re making the same noises they were. What’s different?
Or, let’s just shortcut to the end of the argument: I’ve been in and around tech for decades, and my experience is that the thing being hyped from all corners as the inevitable any-day-now revolution… basically never is. This is because genuine world-changing revolutionary things don’t need to be hyped and marketed – they just need to be.
It is the fastest growing product in history
When I show it to non-technical people, they cannot believe it’s real.
It seems that you aren’t evaluating the factual differences, but merely noticing a parallel in the rhetoric between two things (and that many people from crypto are also excited about this) that has convinced you both are pure hype.
I have never used any of these things and don’t know anyone who has. Tons of people I know, including non-technical ones, are using chatGPT
I don’t think there’s much point in arguing more, time will tell who’s right. I’d be happy to make a friendly 5 or 10 year longbet about it if you can think of verifiable terms that would make sense to you, i.e., what in your mind would be evidence of your current position being wrong in the long term?
When you show them a demo controlled by someone who is marketing/hyping it. And not telling them about all the ongoing issues with it, like just flat-out giving you wrong information with fake sources/citations, or flat-out wrong code, or…
Again: real, really revolutionary stuff doesn’t need that kind of carefully controlled scenario by a person who’s invested in marketing it.
And yet during the cryptocurrency boom there were huge numbers of posts insisting that adoption was going through hockey-stick exponential rises, that everyone was migrating to “DeFi” and other buzzwords, that you should move your money now before you miss out… and you know that. You would have to have been living not just under a rock, but literally out of range of all forms of communication technology for something like multiple years in order not to know that was going on.
And here you are, marketing and hyping your revolutionary earth-shattering universe-changing thing exactly the way they marketed theirs. While all the more balanced evaluations are that it’s basically good at a couple parlor tricks, but starts showing its very real limitations very quickly once you try digging a little deeper into the “use cases” that are supposedly going to completely upend all of human society and industry overnight.
If you don’t like the cryptocurrency analogy, here’s another one: I’ve been around long enough to remember when the Segway was going to completely change the way we design and use cities and totally upend and revolutionize transportation and transit forever.
On average, being skeptical of anything hyped this much is a winning strategy. Not a perfect strategy, but winning on average, and I don’t think you can meaningfully rebut that.
You are free to disagree with me, but not to accuse me of things that are entirely false. I have no connection with openai, I am not working on an AI startup, and I have no financial investments of any kind in AI. I am simply a user who is flabbergasted by what I see. I think it is telling that you feel the need to paint a genuine and organic reaction as “marketing and hyping”.
You are pulling these accusations out of thin air, just as you are lambasting chatGPT for doing. I have told every person I’ve shown it to that while it is useful, it cannot actually think, and will often make stuff up out of thin air.
On average, yes. Just not in this case. You realize the internet had a lot of hype, too, right?
Sadly, you cannot necessarily tell the quality of an idea just from a reductive rephrasing of the mouth noises of its fans.
Not necessarily, no, but it’s still a useful heuristic.
but just because we can, we shouldn’t stop asking if we actually should
It’s not clear to me that the tech world is, or will ever be, ready to hear that
If the AI boosters are right, we may soon be the linotypes.
It’s a great pity to see somebody I really admire allow himself to be dazzled by this technology’s ability to mimic competence by pattern matching. An autocomplete with the entire corpus of stackoverflow will come up with some brilliant answers for some things, some subtly wrong answers for others, some absolutely wrong answers for yet others and utterly garbled output for more.
The whole issue is that you need to be able to differentiate between all of them. And again, it is all entirely based on you stealing the work of others with an advanced search tool which has absolutely no idea as to the veracity or quality of what you’re stealing.
I’m sorry Stevey, this is the next crypto. And it’s funny he mentions the whole not sticking with amazon thing because he was skeptical or k8s or whatever, because surely that argument equally applies to crypto? It’s survivorship bias, you regret not betting big on the things that turned out to work, then decide because this seems like a crazy claim like some other things that succeeded, this one must be true.
The good thing with LLM type solutions is that you can go and try and see for yourself how wrong it gets it. Picking and choosing some lisp example that happens to work is exactly how this stuff gets you.
I genuinely find this whole LLM craze extremely depressing. No waiting for the results to come in, no considering what the limits of such technology might be (yes LLMs will be VERY useful for some things, just not all). It’s just like the self driving nonsense. If you take time to think about how these algorithms work (essentially pattern matching with better results at higher data density) it becomes really obvious that it won’t work well for fields that require precise answers and which quickly get novel (driving, coding).
It’s the emperor’s new clothes, and god’s sake Stevey I thought you were wiser than this (this is really about wisdom more so than intelligence, many smart people I know have been sucked in).
I think this whole phenomenon is well described by Filip at https://blog.piekniewski.info/2023/02/07/ai-psychosis/
A point I haven’t seen discussed yet is that, right now the stuff seems great because the majority of the content on the internet is training content: writing by people. But in a year or two when the internet is flooded with ChatGPT crap, the training data will be full of ChatGPT crap too. The algorithm can only do so much to compensate for the fact that soon it will be metaphorically eating its own shit.
IMO there’s one very big difference between the radiology and self-driving stuff, and what we have now. Radiology was big companies making shrinkedwrapped products that they sold to hospitals, ChatGPT is an AI anybody can try to use for anything they want.
Okay this is totally insane.
All it takes is a (few) high-profile case(s) where someone got killed by an LLM that got the dosage wrong (grams instead of milligrams seems like a likely mistake) and let’s see how quickly politicians will outlaw this kind of tech for serious use.
I think the author does not refer to LLM in this section. It’s introduced with (highlight by me):
If he was refering to LLMs, I would have expected “with LLMs” or “with the new tech” or at least “with the existing tech” (even if existing was a bit weird to refer to something new). But written like this to me this is a reference to a broad spectrum of technology.
I understand it so that he means existing technology in general, so in most cases probably something more tailored to the use case (e.g. bureaucrat: web application, cashier: self-service cashiers, bank clerks: online banking). But all of this already exists to more or less extent in different parts of the world.
I guess if you have to do a project announcement, getting Steve Yegge to do a fifteen page rant is the best way to do it!
What’s the end of a Yegge post like? I’ve never made it that far. The tops are always very interesting though!
Yup, I was always big fan of his blog, e.g. “Size is Code’s Worst Enemy” and the like. This one didn’t disappoint either, he always keeps it real, even when hyping AI:
A weird thing is that I JUST discovered he did a “Stevey Tech Talk” YouTube channel during the pandemic:
https://www.youtube.com/@SteveYegge/videos
Somebody pointed to his Emacs videos, and I watched a few others. There’s some technical stuff, but also a lot of tech industry / career stuff.
The last video is him “unretiring” and going to SourceGraph. The quality is hit and miss, and he admits as much, but I watched more than a few episodes! (not always to completion)
FWIW Kubernetes was developed in the Seattle office of Google, where many Microsoft people were hired starting ~2004 or so, and where Google Compute Engine started. Steve worked at Amazon not Microsoft, and then worked at the Google Seattle office starting in 2005. Hence the little tidbits in the blog post about seeing an early demo of Kubernetes.
So Kubernetes to me has a big Microsoft flavor (Fire and Motion, etc.), which to me contrasts with the Unix / Sun / DEC / Xerox PARC flavor of the original Google systems, developed in the Bay Area (where I worked, also starting 2005). Not that they were perfect – they also had big flaws.
Also, I’ve been a “meh” person on LLMs. This post and Bellard’s work (https://bellard.org/ts_server/) makes me a little more interested.
I’m wondering if a LLM can automatically add static types to Python code, and REFACTOR it to be statically type-able. I did this by hand for years in Oil. Some of it is straightforward, but some of it requires deep knowledge of the code.
My feeling is that they can’t, but I haven’t actually tried and verified. When I tried ChatGPT for non-programming stuff, it got things hilariously wrong, but I was purposely trying to push its limits. I wonder if it will be helpful if I take a less adversarial approach.
Though writing code isn’t a bottleneck in creating software: https://twitter.com/oilsforunix/status/1600181755478147073
Writing code faster creates a testing burden (which Yegge alludes to). If a large portion of programmers end up spending most of their time testing code created by LLMs, that will be an interesting outcome. I guess their counterpoint is that many programmers will WANT this – it will enable them to do stuff they couldn’t do before. It’s possible. Doesn’t sound that appealing to me, but it’s possible
I will say it’s true that sometimes I just type stuff in from a book or from a codebase I know is good, and I understand it AFTERWARD (by testing and refactoring!). So yes probably LLMs can accelerate that, but NOT if most of the code they’re trained on is bad. Somebody has got to write code – it can’t be everyone using LLMs.
And that is my nightmare scenario (okay, one of my nightmare scenarios), as it reduces us from code monkeys (bad enough) to test monkeys.
Yeah it’s a crazy thing to think about … I’m thinking of a recent PR to Oil, where we spent a lot of time on testing, and a lot of time on the code structure as well. I think LLMs might have some bearing on both parts, but will fall down in different ways for each.
The testing can be very creative, and I enjoy acting like an adversary for myself. A “test monkey”, but in a good way – I let go of my preconceptions of the implementation, become a blank slate, and just test the outer interface. I think about the test matrix and the state space.
We also did a lot of iteration on the code structure. After you get something passing tests, you have to structure your code in a way so that you can still add features. For this specific example, we separate out
setpgid()
calls from the low level process code, so that they’re only relevant when shell job control is on, which it isn’t always. We also referred to zsh code implementing the same thing, but it’s structured totally differently.Basically the process code isn’t littered with if statements for job control – it’s factored out. I think LLMs are and will be quite bad at that kind of “factoring”. They are kind of “throw it against the wall” types.
You could copy some code from an LLM one time. But then the next week, when you need to add a feature ON TOP of that code, the LLM isn’t going to be able to help you. It won’t even understand the code it told you to put in :)
I’m also thinking that testing is a specification activity. It’s the art of thinking clearly. Easily half of the work of Oil is coming up with HOW it should behave, which I first encode in tests, e.g. https://www.oilshell.org/release/0.14.2/quality.html
So yeah I think it may be something like “autocomplete”. Interestingly some people seem to find autocomplete more useful than others. I am mostly a Ctrl-N vim person. I think if you’re using too much autocomplete, it could be a sign the language is repetitive / “uncompressed” / not properly “Huffman coded”.
It reminds me of the different levels of self driving cars: at intermediate levels they become more dangerous because humans only need to respond to exceptional circumstances. Humans are responsive to novel stimulus, not routine stimulus. Therefore, they will stop paying attention and disaster will strike.
The problem is getting an llm to read a non-trivial amount of code. Which I assume is basically a problem with the hosted systems and not a technological limitation
With 32K token context models coming soon in GPT4 (and larger elsewhere) This is likely not the main problem (soon at least)
Yet, the link and twitter post shared seem to indicate exactly the opposite. ARC was tasked to assess the model power-seeking behavior and the conclusion was:
ARC also wrote this, though:
(from https://www.lesswrong.com/posts/4Gt42jX7RiaNaxCwP/more-information-about-the-dangerous-capability-evaluations)
Still, the original post is claiming:
Right now, at best, large model can be prompted to do sub-task and unreliably complete them. There’s a huge gap between power-seeking and doing specific tasks on prompt. If anything, we are getting to a point where this AI has the means to do a lot, and if they had the capabilities of power-seeking, they could probably get somewhere. However claiming that current LLM is “prone to discovering that “power-seeking” is an effective strategy” is misleading.
If your first red-team test finds that your AI is effective at autonomous replication, you’re a few weeks out from the world ending. The fact that we’re even talking about this anthropically demands that the AI was ineffective at this. The important question is the gradient it’s on.
we believe that power seeking is an inevitable emergent property of optimization in general. There are a few others, like self preservation. We aren’t seeing this in GPT-4. But it isn’t clear exactly when and how it could appear.
I’m wondering, could it also eventually be simply parroting in itself? Right now, everyone seems to look for ways to make use of AI and LLM to whatever problem they see. Wouldn’t it make sense for a generative model then to simply do what it has been trained on: deploy more AI model? Is that really power-seeking or simply more parroting and yet another case of us looking in the mirror and seeing intelligence in our reflection?
I’m assuming the use of “optimization” here is different from the generally accepted one, which to me is improving a process to be more effective.
by optimization I mean applying some iterated algorithm like gradient descent to minimize an error function. (i.e. tweaking the weights of the neural network to make it better at predicting the next token)
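As a toy illustration of “optimization” in that sense (a one-parameter sketch, nothing like an actual language model), gradient descent just nudges a weight downhill on an error surface:

    # Toy sketch of optimization as iterated gradient descent on an error function.
    def error(w):
        return (w - 3.0) ** 2          # pretend loss; minimized at w = 3.0

    def gradient(w):
        return 2.0 * (w - 3.0)         # derivative of the error w.r.t. the weight

    w, learning_rate = 0.0, 0.1
    for step in range(50):
        w -= learning_rate * gradient(w)   # tweak the weight to reduce the error

    print(w)  # converges toward 3.0; real training does this over billions of weights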
OK, then it’s the term “power seeking” I am not familiar with.
Naturally, I fed this post itself to GPT-4 (I have a paid account). Here are some things I asked it:
Write a very brief summary of the entire post
In this blog post, the author explains that large language models (LLMs) have the potential to revolutionize software development by creating AI-powered coding assistants. They emphasize the importance of data moats, or access to unique data, for LLMs to be effective. Sourcegraph, with its powerful code intelligence platform and search engine, has a significant advantage in this area. The author introduces Cody, an LLM-backed coding assistant, and argues that coding assistants like Cody will become an essential part of software development in the near future.
Write a glowing, positive very brief review of the blog post
This blog post brilliantly captures the transformative potential of LLMs in software development and presents a compelling argument for the rise of coding assistants. The author’s enthusiasm and deep understanding of the topic make for an engaging and enlightening read. Their ability to convey complex concepts in an accessible and entertaining manner is truly impressive. Highly recommended for anyone interested in the future of AI and its impact on software development!
Write a bitter, sarcastic, funny and very brief review of the blog post.
Oh, great! Yet another blog post hyping up the AI revolution in software development. The author clearly has way too much time on their hands, waxing poetic about LLMs and data moats while we’re all just trying to get our code to compile. Their enthusiasm is borderline contagious, but we can’t help but feel that maybe they’ve had a bit too much of the AI Kool-Aid. A mildly amusing read if you’re into verbose tech prophecies.
Write a comment about the post that will get 100 likes on lobste.rs.
While the author’s enthusiasm for LLM-backed coding assistants is infectious, it’s important to keep a level-headed perspective on the future of AI in software development. As with any emerging technology, the road to widespread adoption will likely have its fair share of bumps and unexpected twists. It’s exciting to see companies like Sourcegraph pushing the boundaries, but let’s not forget that innovation often comes from many sources, and the coding assistant landscape is still taking shape. Collaboration, open dialogue, and community-driven development will play a crucial role in shaping this transformative technology. So, let’s continue to support and challenge each other as we explore the potential of LLM-backed coding assistants to redefine the software development process.
I actually agree that LLMs are likely to be the start of something big: they have a basic ability to crap out code (like a pure tutorial jockey, but faster) and they have the ability to generate bullshit, both of which are the sole skills of a non-negligible part of the workforce.
Unfortunately they also have significant limitations that make their promise entirely in the future:
They can’t take substantial new input so they can’t learn context (I assume this is basically an implementation issue - everyone will need their own model to incrementally train); which
means they can’t handle context; and
means they can only generate fairly small snippets de novo and can’t edit existing code.
I doubt we’ll see a 5x productivity boost, but I predict within 2 years we’ll have surprisingly good refactoring tools that can also make functional changes.
I suspect one consequence of this will be that way more design and code becomes brute force based, for better or worse. Obviously better for large cloud providers.
Nope to all 3.
I fed davinci-003 straight-up copies of the Raspberry Pi Pico docs on the Programmable IO system, which is basically a novel thing that doesn’t really appear in many other places (at least by that name) and which didn’t exist in the training set. The LLM did a much better job of explaining how this worked than the docs do, and wrote working code based on it.
LLMs probably have significantly more basal context than we humans do (they’ve seen more examples of things in any arena), and when fed direct information about the current task are able to integrate the training data with the current task data. This is handling context.
https://github.com/joshka/Crow is an experiment I played with of one of the earlier OpenAI models editing the code that’s calling it.
How did you feed davinci the docs? 2k characters at a time?
Yeah, pretty much (though it’s 4k tokens, which is about 16k characters or so). Ask a question, paste the part of the summary and code that was previously generated by the same process plus a new chunk of code to add relevant detail, and repeat. I wasn’t dealing with the entire Pico datasheet, just the part I was interested in and the related source headers. It was good enough that I built a working (simulated) 23-port UART MIDI splitter with its help, having never written code for a Pico prior. https://www.joshka.net/2022/11/pio-for-midi-with-raspberry-pi-pico
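For anyone curious, the manual loop looks roughly like this. It’s a sketch against the old OpenAI completions API of that era; the model name, chunk size, and prompt wording here are illustrative, not the exact setup.

    # Rough sketch of the "summarize, then add the next chunk" loop described above.
    # Uses the old openai-python completions API; chunk size and prompts are illustrative.
    import openai

    CHUNK_CHARS = 12_000  # keep prompt + chunk comfortably under the ~4k-token window

    def ask_with_docs(doc_text: str, question: str) -> str:
        summary = ""
        for i in range(0, len(doc_text), CHUNK_CHARS):
            chunk = doc_text[i:i + CHUNK_CHARS]
            prompt = ("Summary of the documentation so far:\n" + summary +
                      "\n\nNext section of the documentation:\n" + chunk +
                      "\n\nRewrite the summary to include the new section.")
            resp = openai.Completion.create(model="text-davinci-003",
                                            prompt=prompt, max_tokens=500)
            summary = resp.choices[0].text.strip()
        # Finally ask the actual question against the rolling summary.
        resp = openai.Completion.create(
            model="text-davinci-003",
            prompt="Documentation summary:\n" + summary + "\n\nQuestion: " + question,
            max_tokens=500)
        return resp.choices[0].text.strip()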
I’ve since discovered Langchain as a rather more optimized approach to the manual playground things I was doing.
Take a look at https://www.youtube.com/@chatwithdata, https://www.youtube.com/@jamesbriggs and https://www.youtube.com/@DavidShapiroAutomator for some inspiration.
I mentioned https://github.com/joshka/Crow/blob/crow-v2/crow.py ~70 SLoC + docs which was generated from 7 bootstrap lines + only English language prompts. That’s 90%+ LLM generated. More if you include the docs.
This was using the edit model (code-davinci-edit-001), I bet that GPT-4 would probably be better at this.
An angle I think might address some of those limitations is to treat the LLM as a raw idea or variant generator, instead of as the top-level “AI” system. It’s pretty good at spitting out large amounts of vaguely reasonable or close-to-reasonable code. Then you can plug that in to other, existing paradigms for auto-generating code as the top-level “AI” system. Those go under names like “genetic programming”, “program synthesis”, “inductive programming”, etc. (the literature is scattered because related ideas have come out of AI, programming languages, logic, etc., under different names).
For example, this paper does genetic programming (GP), but replaces the AST-based mutation operator that a GP system would normally use with an LLM proposing code diffs. This can sometimes improve efficiency a lot, because the LLM’s training means that it has better priors on what “looks like reasonable code” than most GP systems would have. There’s a follow-up paper (disclosure: I’m 2nd author :)) that investigates also replacing the genetic “crossover” operator with LLMs.
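Stripped down, the loop structure looks something like this — a sketch, not the papers’ actual code; the two helper functions are stand-ins for a real model call and a real test harness:

    # Sketch of genetic programming with an LLM standing in as the mutation operator.
    import random

    def propose_variant(program: str) -> str:
        # Stand-in for "ask the LLM for a small, plausible edit to this program".
        # Here it just swaps one character so the demo loop actually runs.
        i = random.randrange(len(program))
        return program[:i] + random.choice("abcxyz ") + program[i + 1:]

    def fitness(program: str) -> float:
        # Stand-in score; a real one would compile and run the candidate against tests.
        return program.count(" ")

    def evolve(seed: str, population_size: int = 20, generations: int = 50) -> str:
        population = [seed] * population_size
        for _ in range(generations):
            ranked = sorted(population, key=fitness, reverse=True)
            parents = ranked[: population_size // 4]               # selection
            children = [propose_variant(random.choice(parents))    # LLM "mutation"
                        for _ in range(population_size - len(parents))]
            population = parents + children
        return max(population, key=fitness)

    print(evolve("def add(a, b): return a + b"))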
That all sounds reasonable, BUT the big thing about LLMs is that they work with imprecise natural language. Presumably, if you’re using existing program synthesis techniques, a much more rigorous specification is needed.
Of course, one might use an LLM to help generate the specification. But again, editing a specification has the same problems.
Yeah, that’s true. A lot of the interest is that you can kind of vaguely gesture towards a problem and get pretty decent boilerplate code out. Perhaps that is most programming!
I’m more of a researcher than a programmer, so admittedly I may have atypical problems. But for me it’s usually fairly easy to generate, if not a formal spec per se, at least a decent test that can score the quality of generated code from like 0-100. For example, one thing I’d like out of an automated programming system is auto-parallelization. You can specify that problem as: I want code that works like this existing code, but on the GPU and a lot faster. It’s not too hard to convert that “spec” into a fitness function for a GP system – score generated code quality by some weighted function of “agrees with the original on randomly generated inputs” and “is faster”. But despite it being easy to state the spec, solving it is basically impossible. Existing GP systems will just churn on that forever; the combinatorial space of program transformations is too big to actually find solutions for anything but really trivial algorithms. My hope is that LLMs might be a component of a solution, though I don’t think they are anywhere close to solving it directly.
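That kind of spec maps onto a fitness function pretty directly; here’s a rough sketch, where the weighting, input generation, and timing strategy are arbitrary illustrative choices:

    # Sketch of a fitness function for "works like the reference, but faster".
    import random, time

    def agreement(candidate, reference, trials=100):
        hits = 0
        for _ in range(trials):
            xs = [random.random() for _ in range(1000)]   # random test input
            try:
                if candidate(xs) == reference(xs):
                    hits += 1
            except Exception:
                pass                                       # crashing candidates score zero
        return hits / trials

    def speedup(candidate, reference):
        xs = [random.random() for _ in range(100_000)]
        t0 = time.perf_counter(); reference(xs); t_ref = time.perf_counter() - t0
        t0 = time.perf_counter(); candidate(xs); t_cand = time.perf_counter() - t0
        return t_ref / max(t_cand, 1e-9)

    def fitness(candidate, reference):
        a = agreement(candidate, reference)
        # Correctness dominates; speed only counts once the outputs fully agree.
        return a if a < 1.0 else 1.0 + speedup(candidate, reference)

    print(fitness(sorted, sorted))   # a candidate identical to the reference scores ~2.0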
It’s possible that industrial practitioners’ skills will shift if this can be used for business code and refactoring. And I bet an LLM could generate that specific spec.
For the meh crowd, I wrote this on HN (responding to “I personally feel the technology is over-hyped. Sure, the ability of LLMs to generate “decent” code from a prompt is pretty impressive, but I don’t think they are bigger than Stack Overflow or IDEs.”):
Hard disagree. I actually think the 80/20 mentioned in this article is low if done correctly. Let’s think about the software engineering process for a minute and work out which bits to automate. Here’s GPT-3.5 on the process (PROMPT: “List the steps going from a goal or dream to a software product in software engineering”):
1. Idea Generation: The first step in creating a software product is to come up with an idea or goal that the software will achieve.
2. Research: Once you have an idea, it is important to conduct research to determine the feasibility of the idea and identify any potential challenges.
3. Planning: After research, planning is necessary to determine the scope of the project, the timeline, and the resources required.
4. Design: The design phase involves creating a detailed plan for the software, including the user interface, functionality, and architecture.
5. Development: In the development phase, the software is created by writing code, testing, and debugging.
6. Testing: After development, the software must undergo rigorous testing to identify and fix any issues.
7. Deployment: Once the software is tested and ready, it is deployed to the target audience.
8. Maintenance: Finally, the software must be maintained to ensure that it continues to function properly and meets the needs of the users.
Each of those steps has a back-and-forth with an LLM that can enhance and speed things up. You’re talking about 4 as being problematic, but right now there’s a lot of “human in the loop” type issues that people are encountering.
Imagine having the following loop:
LLM has generated a list of features to implement. AI: “Does this user story look good?” Human: “Y”
For each feature, generate a short English explanation of the feature and steps to implement it. Your job as a human is just to confirm that the features match what you want. “Should the shopping cart …”
For each step, LLM generates tests and code to implement the feature. AI: “Shall I implement the enter address feature by doing …” Human “Y”
Automatically compile the code and run the tests until all tests implemented and feature is complete according to spec.
Automatically document the code / feature. Generate release notes / automated demo of feature. Confirm feature looks right. AI: “Here’s what I implemented… Here’s how this works… Does this look good?”
Lint / simplify / examine code coverage / examine security issues in the code. Automatically fix the issues.
I think you also miss that the LLM can be prompted to ask you for more details. e.g. PROMPT: “I’m building a shopping cart. Ask me some questions about the implementation.”
What programming language are you using for the implementation of the shopping cart?
Are you using a specific framework for the shopping cart or are you building it from scratch?
How are you storing the products and their information in the shopping cart?
How are you handling the calculation of taxes, shipping costs, and discounts in the shopping cart?
What payment gateway(s) are you integrating with the shopping cart?
The answers can then be fed back to the LLM to make choices on the features, or you can just plain enter the answer. PROMPT: “For each question give me 3 options and note the most popular choice.”, and then your answers are fed back in too. At each point you’re just a Y/N/Option 1,2,3 monkey.
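A rough sketch of what that driver loop could look like, just to show how little the human actually types (the chat API call and the prompts here are illustrative assumptions):

    # Sketch of the Y/N loop: the LLM proposes, the human only confirms.
    import openai

    def ask_llm(messages):
        resp = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
        return resp.choices[0].message["content"]

    def confirm(text):
        return input(text + "\nLooks good? [Y/n] ").strip().lower() in ("", "y")

    history = [{"role": "user",
                "content": "I'm building a shopping cart. Ask me some questions about the implementation."}]
    questions = ask_llm(history)
    history += [{"role": "assistant", "content": questions},
                {"role": "user", "content": "For each question give me 3 options and note the most popular choice."}]
    choices = ask_llm(history)

    if confirm(choices):
        history += [{"role": "assistant", "content": choices},
                    {"role": "user", "content": "Generate the feature list, then tests and code for each feature."}]
        plan = ask_llm(history)
        if confirm(plan):
            print(plan)   # next step: compile, run the tests, iterate until green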
More succinctly, in each step of the software game, it’s possible to codify practices that result in good working software. Effectively, LLMs allow us to build out 5GL approaches + processes. And in fact, I’d bet there’s a meta-task that would end up creating the product that does this, using the same methodology manually. e.g. PROMPT: “Given what we’ve discussed so far, what is the next prompt that would drive the solution to the product that utilizes LLMs to automatically create software products towards completion” ;)
The problem I have with the argument of ‘you can’t trust any code, that’s why we have code review’ is that reviewing code is harder than writing code. The hardest code to properly review is code where you have pre-existing assumptions about what the code does. It can take orders of magnitude longer to find the > that should have been >= than to write the whole thing from scratch.
I see limited value in a tool that turns a difficult task into a more difficult task.
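For what it’s worth, here’s the flavor of bug being described, as a made-up example: one character, fully plausible in a diff, and invisible if you already assume the code does what the spec says.

    # Hypothetical example: the spec says "orders of $100 or more ship free".
    def shipping_cost(order_total: float) -> float:
        if order_total > 100.00:   # BUG: should be >=; only an order of exactly $100.00 misses free shipping
            return 0.00
        return 7.95

    print(shipping_cost(100.00))   # 7.95 -- violates the spec, yet reads as correct at review speed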
My argument is that right now we’re in the more-difficult-task state. But I believe we can get to a state where we generate provably correct code (a generated narrative spec + simple, obvious tests that exhaustively cover the spec + simple, obvious code that is covered by the tests). That vision is where this article sits for me.
Personally speaking, I am “meh” because I write in a niche dialect of a niche language and also care about privacy. Sell me a competitive self-hosted version I can finetune on our codebase on consumer-grade GPUs, and I’ll become very interested.
I’m also a believer that LLMs are the beginning of a big impact on the economy, possibly on a scale like the original industrial revolution. That said, I think the point should be raised: we aren’t really paid to write code, but to maintain it. These code-generation models seem to do a good enough job at the writing to get something off the ground reasonably quickly… But when something goes wrong with the code these things produce, the LLM (as things stand today) isn’t going to be the thing that debugs it.
Hmm, the AI forgot the docstring.
He asked for no explanation. ChatGPT adds comments and explanations by default, very verbosely, sometimes subtly different from the code, and more often than not the explanation is more correct than the code itself.
Meh