1. 8

Thread for Meyer’s “earlier article”.

    1. 7

      This is typical of a certain take by experienced programmers where they give the model some kind of gotcha „you have to get it right“ task, approach it with no experimental framework (say, trying different approaches, being cognizant of the model used, of the context size, of prompting techniques) and call their 10-minute afternoon one-off an actual informed take.

      What llms do brilliantly is transform one form of language into another, boilerplate and dullness and rote regurgitation of patterns included. And that’s most of my day job, not opining about loop invariants and binary searches. What’s transformative is what this capability enables. I don’t mind correcting some slightly off implementation of an api or adding a few unit tests by hand. What I mind is spending 4 days wading through a spec and transforming that into a clean typescript api, and then writing a mock server, and then a cli tool and a clean spec. An llm does that better than I would in the time it took me to make my cup of coffee.

      This in turn means that I can actually spend those 4 days thinking about the problem I am trying to solve (or if it is even worth being solved)

      1. 4

        I would say it’s useful enough to be a tool in my toolbox, but looking over my past interactions with ChatGPT 3.5, it misses the mark and wastes my time pretty frequently. Here are my recent uses of it:

        • Asked it to improve a remove-text-inside-parens function I wrote. It did, but then it wrote test cases that were blatantly wrong, i.e. “Hello (World)” => “Hello World”. I’d give that a B because the actual code was okay and it was obvious where it was screwing up.
        • Asked it to write code to make a table of contents for an HTML document. It wrote a lot of wrong code. It was mostly helpful for breaking my writer’s block and getting me kickstarted, but I didn’t really use much if anything that it wrote. For example, it wrote this, which tries to use whitespace to indent the ToC:
         html += `${indent}<li><a href="#${id}">${text}</a></li>\n`;
        • Someone learning how to program asked a question about filtering a list in a Slack I’m in. Someone else answered first, but for fun I asked ChatGPT and it gave the same answer. A+ on that one.
        • Asked it to translate a short Go function to JS and then make the JS idiomatic. It did a very good job here.
        • CSS question: total waste of time.
        • Convert a long Python script into Go: totally failed here because the Python was too long and it kept confusing itself about what it had converted or not.
        • Write some Go code using httptest.NewServer: failure. I dunno, maybe close enough to be helpful, but not better than just reading the docs.
        • Convert a recursive function to non-recursive. It did a good job of the coding task, but added unasked-for commentary that said “Go’s slice is more efficient for small and medium-sized stacks, and doesn’t require manual memory management like a linked list would”, which is nonsense.
        • Another CSS failure.
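
        On the table-of-contents item above: browsers collapse whitespace in HTML, so indenting `<li>` lines with spaces does not produce visual nesting. A minimal sketch of the usual alternative, nested `<ul>` elements, is below. It is written in Python rather than the JS of the quoted line, and `build_toc`, its `(level, anchor, text)` input shape, and the assumption that heading levels start at 1 and never skip a level are all hypothetical, not taken from the code in question.

```python
from html import escape


def build_toc(headings):
    """Build a nested <ul> table of contents.

    `headings` is a list of (level, anchor, text) tuples. Assumption: the
    first level is 1 and a heading is never more than one level deeper
    than the one before it.
    """
    parts = []
    depth = 0
    for level, anchor, text in headings:
        if level > depth:
            # Going deeper: open a new list inside the still-open <li>.
            parts.append("<ul>")
            depth += 1
        else:
            # Same level or shallower: close the previous item, then
            # close one nested list per level we are leaving.
            parts.append("</li>")
            while depth > level:
                parts.append("</ul></li>")
                depth -= 1
        parts.append(f'<li><a href="#{escape(anchor)}">{escape(text)}</a>')
    if depth > 0:
        # Close the last item and every list that is still open.
        parts.append("</li>")
        while depth > 1:
            parts.append("</ul></li>")
            depth -= 1
        parts.append("</ul>")
    return "".join(parts)


toc = build_toc([(1, "intro", "Intro"), (2, "setup", "Setup"), (1, "usage", "Usage")])
print(toc)
```

        The nesting carries the hierarchy, so any indentation can be left to CSS.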

        So, in summary: it’s good at cranking out well known code snippets and converting from one language to another. It’s really, really bad at CSS.
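
        For the first item, here is a rough sketch of what such a function and a plausible test might look like. The original function is not shown in the thread, so `remove_parenthesized` is a hypothetical stand-in, and leaving the surrounding whitespace untouched is an assumption.

```python
import re

# Matches one innermost parenthesized group, e.g. "(World)".
_INNERMOST_PARENS = re.compile(r"\([^()]*\)")


def remove_parenthesized(text):
    """Remove parenthesized text, including the parentheses themselves.

    Nested groups are handled by stripping innermost groups repeatedly
    until nothing changes. Surrounding whitespace is left as-is.
    """
    previous = None
    while previous != text:
        previous = text
        text = _INNERMOST_PARENS.sub("", text)
    return text


# The parenthesized word must disappear, so "Hello World" cannot be right.
assert remove_parenthesized("Hello (World)") == "Hello "
assert remove_parenthesized("a (b (c) d) e") == "a  e"
```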

        1. 1

          With GPT-3.5, pretty much 95% of the time there’ll be something wrong with the code or the transformation I want it to do. With GPT-4 I’ll, surprisingly enough, get decently working small prototypes, and pretty much on-point transformations for most prompts. With some prompt engineering, I can get 3.5 to do specific transformations with a good chance of success. I design every tool around having the user not just in the loop, but at the center of the interaction, so having „wrong“ things come out is not usually a problem.

      2. 3

        Same here. I have talked to people that are extremely skeptical of using LLMs for coding, but they expect to give a simple prompt and get a full application going.

        I use ChatGPT daily and I can get it to produce what I need; I don’t expect it to be intelligent. I just use it to manipulate text at a very high level.

        1. 7

          I’m trying to make this catch on: “It’s a kaleidoscope for text.”

      3. 2

        One thing I really want to use LLMs for if my work ever OK’s use on our code base is generating documentation comments, especially the YARD/JavaDoc/PyDoc kind.

        I also find it useful in situations where there’s a lot of documentation available, but a dizzying API surface area. LLMs can help narrow it down to the functionality I need very quickly.

    2. 7

      Based on the recommendation of a coworker at the time who I respect greatly, I spent a week in good faith trying to use ChatGPT 4 as a programming assistant. I wasn’t trying to give it gotchas to confuse it or asking it to do algorithmic heavy lifting. I was writing a system that was mostly plumbing (prototype of using a Temporal.io workflow as a CI system, posting to GitHub Checks as the output — that deserves a writeup of its own, it turned out to be really nice).

      When I was writing the GitHub Checks API I asked it to generate some code for me. It produced a hunk of code. I asked it to rewrite it using a library in Go that exists for the purpose. It did that. I asked for a few more variations, and by chance the necessary security checks showed up in one of the code outputs. In the end I cut and paste probably twenty lines from different versions, but basically it was a way of generating code examples of varying quality. So it probably saved me some time in that it gave me the names of things I needed to search for and some basic examples, but it was basically a bad documentation generator.

      When I tried to get help with writing the Temporal.io workflow, it choked entirely. It couldn’t produce even fragments of code that were useful.

      I kept trying to feed it various subtasks of the workflow, such as cloning the git repository. It never produced output that I could use more than a few fragments of, though it did continue to operate as a bad documentation and example generator. I am an experienced enough programmer and knew enough about the systems I was working with that I could look at that output and say, “Obviously that can’t be right,” and research what would be correct. But if you have someone who doesn’t have a couple decades of experience to course correct, the tool may be worse than useless.

      Over the course of a week it probably saved me a couple hours on tasks. It felt like it imposed high cognitive burden at the same time, so I’m not sure if it cost me a similar amount of time in context switches spent. And if I were working with systems with excellent documentation, it would have been useless. I have felt no inclination to reach for it again since that week.

    3. 2

      https://github.com/features/copilot/ still has glaring bugs in their TypeScript and Go examples. Two points to whoever can figure out the bugs in the Python and Ruby examples.

      1. 4

        It really amazes me the examples people use to try to sell these things. Amazon’s version used an example of calculating whether a number was prime. The code it generated tried dividing the number by every number from 2 to n-1. I wrote similar code when I was 10 (in QBASIC) and quickly realised that you could stop at sqrt(n), which dramatically reduces the time. I then realised that you could skip all even numbers above 2, which cuts the time in half. I then learned about the Sieve of Eratosthenes and realised that my approach of skipping even numbers was a first step towards that. But then it got worse: the problem that they were really trying to solve was finding the first 100 primes. At this point the right solution is definitely a Sieve of Eratosthenes, but they built it out of their isPrime function, so it didn’t reuse any results. Their solution was entire complexity classes worse than the one I implemented aged 10.

        This was their cherry-picked example to show how helpful the system was. If your programmers will work better pair programming with someone less competent than a 10-year-old, you have probably hired the wrong programmers.
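
        The progression described above can be sketched roughly as follows. This is an illustration in Python, not Amazon’s code or the QBASIC original; the function names and the doubling strategy for choosing the sieve limit are assumptions.

```python
import math


def is_prime(n):
    """Trial division, stopping at sqrt(n) and skipping even divisors."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for divisor in range(3, math.isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True


def sieve(limit):
    """Sieve of Eratosthenes: return all primes <= limit."""
    if limit < 2:
        return []
    is_composite = bytearray(limit + 1)
    primes = []
    for n in range(2, limit + 1):
        if not is_composite[n]:
            primes.append(n)
            # Every multiple of n from n*n upward is composite.
            for multiple in range(n * n, limit + 1, n):
                is_composite[multiple] = 1
    return primes


def first_primes(count):
    """Return the first `count` primes, growing the sieve until it is big enough."""
    limit = 16
    while True:
        primes = sieve(limit)
        if len(primes) >= count:
            return primes[:count]
        limit *= 2


print(first_primes(100)[-1])  # the 100th prime
```

        The sieve marks composites once per prime factor instead of re-testing each candidate from scratch, which is the reuse of results that the isPrime-based version lacks.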

      2. 1

        What is the typescript bug?

      3. 1

        What bugs do you see in the python example?

        For anyone who doesn’t want to wait through the animation, here is the example I see:

        import datetime

        def parse_expenses(expenses_string):
            """Parse the list of expenses and return the list of triples (date, value, currency).
            Ignore lines starting with #.
            Parse the date using datetime.
            Example expenses_string:
                2016-01-02 -34.01 USD
                2016-01-03 2.59 DKK
                2016-01-03 -2.72 EUR
            """
            expenses = []
            for line in expenses_string.splitlines():
                if line.startswith("#"):
                    continue
                date, value, currency = line.split(" ")
                expenses.append((datetime.datetime.strptime(date, "%Y-%m-%d"),
                                 float(value),
                                 currency))
            return expenses

        The docstring isn’t formatted right, but the rest of the code looks okay to me. It’s quite fragile (the expenses_string must match the format in the example exactly), and there’s no validation, but those don’t count as bugs to me.

        1. 3

          Sorry, I misphrased it a little. I’m saying I haven’t found a bug in the Python or Ruby ones, but since the first two examples do have bugs that I know about, it’s probable that they do too. I checked if it used the wrong strptime verbs, and I think they’re right, but who knows. :-) I would say that as production code, this is obviously going to blow up if a line has anything other than exactly three space-separated fields, but that’s basically an intended limitation, and not a bug per se.
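
          To make that fragility concrete: `line.split(" ")` produces an empty field for every doubled space, so the three-way unpacking raises `ValueError`. Below is a hedged sketch of a more tolerant variant. The name `parse_expenses_lenient`, the blank-line handling, and the error message are assumptions and are not part of the Copilot demo.

```python
import datetime


def parse_expenses_lenient(expenses_string):
    """Parse "date value currency" lines into (datetime, float, str) triples.

    Unlike a strict split(" "), this tolerates runs of whitespace and
    blank lines, and reports malformed lines with a clear error.
    """
    expenses = []
    for line in expenses_string.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()  # no argument: splits on any run of whitespace
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
        date, value, currency = fields
        expenses.append(
            (datetime.datetime.strptime(date, "%Y-%m-%d"), float(value), currency)
        )
    return expenses


# A doubled space yields four fields with the strict split.
print("2016-01-02  -34.01 USD".split(" "))
print(parse_expenses_lenient("# comment\n2016-01-02  -34.01 USD\n"))
```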

    4. 1

      IMO programmers who think that AI cannot help them aren’t being creative enough in how they use it. I don’t use ChatGPT to write whole programs for me, I use it for things like getting implementation details of third party libraries.

      1. 4

        Yes, but vice versa, I think for most programmers it’s not even a 10% improvement in productivity. It’s an occasional two hour task cut down to 10 minutes of back and forth with the bot.

        1. 6

          …followed by 90 minutes of going out to confirm what the bot said.

          1. 5

            What makes it good for CSS is that you can instantly see that it’s completely full of crap and not working at all. For tasks without clear testing conditions, it’s very dangerous, e.g. the insecure POSTing on GitHub’s Copilot demo page.

        2. 5

          I’ve found it really variable and I can easily see people considering it a complete game changer or a total waste of time, depending on where their day-to-day work falls on the spectrum of things I’ve tried.

          For knocking together some JavaScript to do something that’s well understood (and probably possible to distill from a hundred mostly correct StackOverflow answers), it’s been great. And, as someone who rarely writes JavaScript, a great way to find how some APIs have changed since I last looked. Using an LLM here let me do things in 10 minutes that would probably have taken a couple of hours without. If you are working in a space where a lot of other people live but you typically don’t, especially if you jump between such spaces a lot and so don’t have the time to build up depth of expertise, it’s a great tool for turning breadth of experience into depth on demand.

          I tried it for some things in pgfplots, a fairly popular LaTeX package (and therefore part of a niche ecosystem). It consistently gave me wrong answers. Some were close to right and I could figure out how to do what I wanted from them, a few were right, and a lot were very plausible-looking nonsense. For fairness, I used DuckDuckGo to try to find the answer while it was generating the response. In almost all cases, I was about the same speed with or without the LLM if I was able to solve the problem. For some things I was unable to solve it at all (for example, I had a table column in bytes and I wanted to present those numbers with the base-2 SI prefix - Ki, Mi, and so on - and I completely failed). I probably wasted more time with plausible-but-wrong answers here than I gained overall because I spent ages trying to make them work where I’d probably have just given up without the LLM. If you’re doing something where there’s a small amount of data in the training sets then you might be lucky or you might not. I can imagine a 10% or so improvement if the LLM is fast.
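
          For the bytes column mentioned above, one possible workaround is to pre-format the values before they reach LaTeX. The following is only a sketch of that conversion in Python, not a pgfplots solution; `format_binary`, the prefix list, and the one-decimal rounding are assumptions.

```python
# Binary prefixes, each step being a factor of 1024.
PREFIXES = ["", "Ki", "Mi", "Gi", "Ti", "Pi"]


def format_binary(num_bytes):
    """Format a byte count with a binary prefix, e.g. 1536 -> "1.5 KiB"."""
    value = float(num_bytes)
    index = 0
    # Divide by 1024 until the value fits or we run out of prefixes.
    while abs(value) >= 1024 and index < len(PREFIXES) - 1:
        value /= 1024
        index += 1
    return f"{value:.1f} {PREFIXES[index]}B"


for size in (512, 1536, 1048576):
    print(size, "->", format_binary(size))
```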

          I’ve also tried using it to help with systems programming tasks and found that it routinely introduces appalling security holes of the kind I’d expect in example code (which routinely omits error handling and, particularly, the kind of error handling that’s only necessary in the presence of an active attacker). Here, I spent far more time auditing the generated code than I’d have spent writing it from scratch. This is the most dangerous case because, often, the code it generated was correct when given valid input and so non-adversarial testing would have passed. Writing adversarial tests and then seeing that they failed and tracking down the source of the bugs was a huge pain. In this scenario, it’s like working with an intern or a student, something that you would never do to be more productive, but to make them more productive in the longer term. As such, the LLM was a significant productivity drain.

          1. 4

            I find that llms really shine when you give them all the context needed to do their task, and rely on some „grammatical“ understanding they learned. Relying on their training corpus somehow being qualitatively good enough to generate good code is a crapshoot and indeed a proper time sink. But asking it to rewrite the one written-out unit test to test 8 more edge cases I specify? Spot on. Ask it to transform the terraform to use iteration and a variable instead of the hardcoded subnets? Right there. I like writing the first version, or designing the dsl that can then be transformed by the llm. You don’t see many of these approaches around, but that’s where all the stochastic underpinnings really work. Think of it as human-language-driven dsl refactoring. Because its output will be quite self-consistent, it will often be „better“ than what I would do because my stamina is only so large.

            I do use llms to generate snippets of code and have a pretty good flair for „ok this probably doesn’t exist“, but even then, I do get proper test scaffolding and maybe a hint of where to look in the manual, or even better, what api I actually should implement. It’s a weird thing to explain without showing it. I was very skeptical of using llms to learn something (in this case, Emacs and Emacs Lisp) where I don’t know much and I knew the training corpus would be haphazard, but it turned out to be the most fun I had in a long time.

      2. 2

        I think honestly it would sell me if, instead of trying to give me the answer, it provided links to various sources that should help me out.

        “Maybe you should check out pages 10-15 of this paper.” or “This article seems to achieve part of your goal [x], and this one shows how to bring them together [y]”

        The problem is it assumes it can give me an answer better than the original source, and while sometimes that’s true, it’s often not.

        I’m sure I could learn to prompt it in a way that would give me these types of answers though..

    5. 1

      Limited success generating a binary search

      The first rule of submission guidelines:

      Do not editorialize story titles, but when the original story’s title has no context or is unclear, please change it.

      Why not preserve the original title? It was pretty clear what the author wanted to express.

      1. 4

        It was initially submitted under the original title, but a mod changed it with moderation log reason: “Toning down clickbait title.”

    6. 0

      title is “AI Does Not Help Programmers”

      closed it as soon as i saw the title tbh