1. 26
  1. 32

    Classic Ironies of Automation: You need a human operator to check that the machine is operating correctly and solve the complex cases, but humans get bored and distracted when they are checking stuff that is 99% okay, and their ability to check complex cases atrophies from lack of use.

    1. 6

      Like falling asleep at the wheel of a self-driving car

    2. 20

      Someone at work used ChatGPT to generate some code for doing interrupt handling of floating point registers. It was subtly wrong. At first glance it looked correct, but upon closer inspection it had a bug that, if not fixed, would cause confounding errors in ways that are incredibly hard to track down. It is the kind of error that corrupts data and would show up far away (in space and time) from the actual problem. In other words, a nightmare bug.

      The generated code would save no time here. You still have to go read the spec and understand how everything works to ensure that the code is doing the right thing. That’s where all the work is.

      1. 13

        I still find the ‘it’s great for boilerplate’ claim in the article depressing (though true). If a large language model has enough data points to sample that it can fill in what you want by making some tiny tweaks to something it’s seen in thousands of examples, why do you need that in your source code? Whether a human or an algorithm writes the code, humans still have to understand and maintain it, so the cost is still there (if anything, being able to write it more easily makes it easier to write larger, and less maintainable, codebases). It really shows how far we still have to go in producing usable programming languages and frameworks.

        1. 6

          I heard the ‘great for boilerplate’ part often enough that I figured I’d test it a while back, based on an observation from an acquaintance of mine that ChatGPT can come up with remarkably correct code when asked for a minimal IIO driver. A device driver for a somewhat obscure Linux kernel subsystem seemed like a good test case for that – it’s something I’m familiar with, something I also had to implement in alternative ways, and also there’s a lot of boilerplate involved, at several levels, from “this is a Linux module” to “this is an SPI device attached to this interface” to “this is an IIO driver which will present values on these channels”.

          Thing is, most of that is “true” boilerplate only for simple devices. All those innards are exposed primarily in order to support more advanced use cases, like sharing device driver code between multiple variants of the same device (e.g. same ADC with different channel counts and sampling frequencies) or allowing the programmer to enforce things that the hardware expects to be enforced at the other side of the bus (e.g. no new requests while the hardware is working on another request, minimum time between consecutive requests etc.).

          For the most simple of devices, you don’t really care about any of that, and you’re left with boilerplate you copy-paste around – something you could, indeed, presumably tuck under a generic “make me an IIO driver” language macro, or a generic IIO device driver that you just hand off a pair of read, write functions to.

          For more advanced devices, though, all that is “just” initialization logic, not generic boilerplate. Where one ends and where the other begins is hard to say. For every step in a more advanced driver, I can sigh and think you know, I wish this was just a config parameter somewhere. But a config definition format that can really describe all these quirks would just be a really awkward DSL with imperative features. We all know how that goes.

          Thing is, ChatGPT is great at the former kind of boilerplate, but years away from the latter. If I ask for a minimal IIO driver, it gives me one (literally one that just probes but, hey, I did ask for a minimal one). If I ask for a minimal IIO driver for an SPI device, it gives me one that just probes, but also reads and writes some junk on the SPI bus (which it doesn’t use in any way). By the time we get to specific things, like write me an IIO driver for this particular device attached via this particular interface, ChatGPT comes up with “boilerplate” that’s loosely related to the prompt but otherwise firmly into “well, at least it compiles” land – it writes made-up values at made-up addresses, sometimes on made-up buses. (This isn’t unexpected, and is obviously an artefact of training data availability: if I ask for the driver for a device that’s already supported in the kernel, it basically just gives me the existing driver code, minus the license :-P).

          Linux is probably not the pinnacle of solid, reusable kernel frameworks, but it makes a trade-off that’s familiar to all of us – it trades verbose, but mostly default code for simple cases, in order to make it less painful to allow customization for more advanced cases. ChatGPT is overwhelmingly biased towards the former because they are bound to be more represented in its training data and because it’s inherently better at generating that kind of data. However, that’s not necessarily representative of what a human programmer needs.

          1. 2

            If a large language model has enough data points to sample that it can fill in what you want by making some tiny tweaks to something it’s seen in thousands of examples, why do you need that in your source code?

            I think the same thing, and it’s also why I don’t consider it all that impressive. If it’s generating boilerplate, why is that code required in the first place?

          2. 7

            I needed to write a thing to turn a 1D array into an N-dimensional array two days ago. I tried Googling “golang N-D array” but couldn’t find it. So then I tried Googling Python N-D array and got solutions that were the opposite of what I wanted (going from N-D to 1) or just 1 to 2-D. So I tried asking ChatGPT to do it in Python and it gave me a good solution, and then I asked it to translate it to Go, and it did a good job. Maybe the code has a subtle bug I haven’t noticed yet, but it probably cut 2 hours of noodling with a pen and paper down to half an hour of searching and reviewing.

            Here’s the version after I cleaned it up by making it generic and removing an unneeded else branch:

            // inflate recursively reshapes the flat slice lst into nested []any
            // slices according to the dimensions given in shape.
            func inflate[T any](lst []T, shape []ArrayDimension) any {
            	if len(shape) == 1 {
            		return lst
            	}
            	dimSize := shape[0]
            	dimShape := shape[1:]
            	result := make([]any, dimSize.Length)
            	sublistSize := len(lst) / int(dimSize.Length)
            	for i := 0; i < int(dimSize.Length); i++ {
            		startIdx := i * sublistSize
            		endIdx := (i + 1) * sublistSize
            		result[i] = inflate(lst[startIdx:endIdx], dimShape)
            	}
            	return result
            }
            

            This is pretty much the ideal use for ChatGPT because it really is just a semantic search of existing code. I would have been just as happy if Google had given me the webpage of someone else writing out the algorithm, but since a keyword based search can’t distinguish converting 1 to N from N to 1, Google-style search failed.
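
            For what it’s worth, a quick sanity check (a sketch of a test, assuming ArrayDimension is pgx’s pgtype.ArrayDimension with its Length field, and that reflect and testing are imported) shows the cleaned-up version behaving as expected:

            // Reshape a flat 6-element slice into a 2x3 nested structure.
            func TestInflate2x3(t *testing.T) {
                dims := []ArrayDimension{{Length: 2}, {Length: 3}}
                nested := inflate([]int{1, 2, 3, 4, 5, 6}, dims)
                want := []any{[]int{1, 2, 3}, []int{4, 5, 6}}
                if !reflect.DeepEqual(nested, want) {
                    t.Fatalf("got %v, want %v", nested, want)
                }
            }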

            1. 2

              looks like this function is broken if len(shape) == 0 or if the lst doesn’t contain the exact number of necessary items

              why isn’t your function defined as

              func inflate[T any](lst []T, shape []int) any {
                  ....
              }
              
              1. 2

                There’s an outer function that calls the inflate function after checking the size is correct. I don’t want to repeat the check on every step of the inflation. See https://github.com/jackc/pgx/pull/1511/files
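
                Something along these lines, presumably; this is only a hypothetical sketch of the idea (not the actual pgx code, and it assumes fmt is imported): validate the total element count once, then let inflate recurse without re-checking at every level.

                // Hypothetical wrapper: check the size once up front, so inflate
                // itself can stay free of per-level validation.
                func inflateChecked[T any](lst []T, shape []ArrayDimension) (any, error) {
                    if len(shape) == 0 {
                        return lst, nil
                    }
                    expected := 1
                    for _, dim := range shape {
                        expected *= int(dim.Length)
                    }
                    if expected != len(lst) {
                        return nil, fmt.Errorf("shape needs %d elements, got %d", expected, len(lst))
                    }
                    return inflate(lst, shape), nil
                }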

            2. 1

              Someone at work used ChatGPT to generate some code for doing interrupt handling of floating point registers.

              Ouch. That is also the last thing an LLM should be used for. One could potentially use it to generate test cases. The future in this area is to have an LLM connected to a theorem prover so that the output can be checked.

            3. 9

              My experience with ChatGPT is that it’s great as a search engine and as a summariser, but middling when it comes to generating stuff, whether that’s prose or code.

              Though once you’re at a senior level - is actually writing the code really the bottleneck? The battle to me seems to be all in understanding existing code, requirements, libraries, tool issues… not actually putting text in the editor.

              1. 7

                A ChatGPT-style AI that can only reply with links to reference docs, libraries, or wikipedia articles

                yes please. and one that can annotate documents

                1. 3

                  This is very good. Presumably if we do go down this route, humans will have to get really good at writing formal specs to automate the “proofreading”.

                  1. 10

                    Writing a specification is the incredibly hard part and implementing a proof is usually the boring and tedious part. We’ve had simple proof searches in proof assistants for a long time; having an AI-based proof search is an obvious win!

                    If programming starts to look like that, it’s going to be a fundamentally different skill to what most programmers have today. Implementing a sorting algorithm is very different to fully specifying that algorithm.
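
                    To make that asymmetry concrete, here is a rough sketch of my own (in Go): the full functional spec of sorting is just two properties, while any implementation has to commit to a whole strategy for producing a sorted permutation.

                    // The spec for sort: the output is ordered and is a permutation of the input.
                    // Checking these properties is easy; writing a sort that satisfies them is the real work.
                    func isSorted(xs []int) bool {
                        for i := 1; i < len(xs); i++ {
                            if xs[i-1] > xs[i] {
                                return false
                            }
                        }
                        return true
                    }

                    func isPermutationOf(xs, ys []int) bool {
                        if len(xs) != len(ys) {
                            return false
                        }
                        counts := make(map[int]int)
                        for _, x := range xs {
                            counts[x]++
                        }
                        for _, y := range ys {
                            counts[y]--
                        }
                        for _, c := range counts {
                            if c != 0 {
                                return false
                            }
                        }
                        return true
                    }

                    // satisfiesSortSpec holds for any correct sort, regardless of how it works.
                    func satisfiesSortSpec(in, out []int) bool {
                        return isSorted(out) && isPermutationOf(in, out)
                    }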

                    And I know that getting requirements from a business can be hard or even impossible. I’m not sure how many people could specify even small parts of the system they work on. I’d love to see it, though; I just imagine things will have to change drastically as an industry. I hope we get there.

                    1. 3

                      One thing I have always wondered about formal proof specifications… being only surface acquainted with them….

                      Doesn’t this just push the burden of error onto the specification step?

                      Meaning, say I am a business owner and hire a FM expert to specify and formally prove some algorithm for part of my business. We write the spec, and it passes all the formal tests. Great, the spec is proved correct!

                      But was the specification itself correct? The expert I hired could have messed up that step, or something could have failed in our communication. My overall confidence in the system is only as good as my trust that these still-human-bound steps had no errors. Now perhaps that is higher than it would be with typical software development, perhaps even much higher. But still… maybe not that high in an absolute sense?

                      Am I missing something?

                      1. 9

                        Doesn’t this just push the burden of error onto the specification step?

                        This is more or less what I remind people when talking about formal verification: it is a tool for ensuring that all of your bugs are present in your specification.

                        That doesn’t mean that it’s worthless, for two very important reasons:

                        • The spec defines requirements and not implementation details and so can be a lot (orders of magnitude) simpler. I have more confidence that something is correct when it’s short because it’s easier to think about the whole thing.
                        • The spec is held in some tool that is designed for proving properties. Once you have a spec that says ‘these five things must be possible’, you can start posing questions like ‘is there a mechanism for an attacker to do this sixth undesirable thing?’ Often automated proof assistants are able to either say ‘no’ or ‘yes, with this sequence of steps’. If they’re not, you still have a good starting point for constructing such a proof.

                        The second one of these is critical to your question. The spec is something that you can prove universally quantified properties over, the implementation is (usually) not.

                        1. 1

                          Thanks for this answer.

                          I have more confidence that something is correct when it’s short because it’s easier to think about the whole thing.

                          Indeed.

                          Only short programs have any hope of being correct. ― Arthur Whitney

                        2. 4

                          I have some limited experience with a formal method called “cleanroom development”, which was developed by a professor at my alma mater. An acolyte of his became a coworker of mine for a while. A fairly simple project which I would have estimated at 3-6 months of effort got spun out into an 18-month process because of this, and was ultimately finished (in about 6 months) by his successor, who did not retain the method. My experience is in line with your hypothesis.

                          The argument in favor of this particular formal method, based on studies done at large corporations, was that you spend more time up-front in the specification phase, but that you don’t suffer very many bugs (“defects”) later on, so the maintenance burden is greatly diminished.

                          I’ve studied software architecture a bit since then. One of the concepts in software architecture is quality attributes. Knowing what the quality attributes are for your project informs what software architecture and development process you can or should use. I would absolutely recommend using cleanroom development or other heavy-weight formal methods for projects that prioritize correctness over cost and time-to-develop: aerospace projects for instance. You typically aren’t going to be able to quickly fix a bug that affects a satellite or autonomous robot on another planet. You have to get it right the first time. So in situations that prioritize correctness, you would want to choose one of these methods.

                          Most of the time, the quality attributes you care about are cost or flexibility or something other than correctness. In those regimes, formal methods are overkill. Although the FM experts (paging Hillel) will say that the costs are not as bad as they are perceived to be. In my experience the customers don’t quite know what they want until they see what you’ve done. I think FM are not going to be a great fit for those situations because of exactly what you’re saying, that the specification has to be very well-understood at the outset.

                          1. 3

                            The difference is largely how readable and terse the spec is.

                            Terseness because there’s a strong correlation between lines of code and bug count.

                            Readability because it doesn’t concern itself with petty details required to make it actually run.

                            1. 5

                              But isn’t this just an argument for terseness and readability for languages? That is, if we consider:

                              1. Errors of specification
                              2. Errors of implementation given a spec

                              We can rule out 2 with a provable spec, but we are still subject to 1. If the only thing saving us from 1 is terseness and readability, then any language that is equally terse and readable will do just as well against errors of type 1. With type-1 errors being equal, a provable spec still gives us an advantage by this reasoning, but unless the argument is that specification languages are better w.r.t. those qualities than all other languages, we might still be better off overall with a standard language that excels in terseness and readability?

                            2. 3

                              The main thing is that specification can be much higher level than the program needs to be, since it doesn’t need to be concerned with implementation details. Take the statement “two people can’t reserve the same room at the same time”. The implementation can be arbitrarily complicated:

                              • How are reservations represented in the system?
                              • How do we query them?
                              • At what part of the reservation flow do we check this?
                              • How do we show the user a time is unavailable?
                              • What do we tell the user if they try to reserve an unavailable time, anyway?
                              • How do we avoid race conditions?

                              On the other hand, the high-level spec is pretty simple: all p1, p2: p1 != p2 => !(some loc, time: Reserved(p1, loc, time) && Reserved(p2, loc, time)). The code that implements reservation can also be spread across multiple modules, each with its own local specification, and it’s possible that the local specs could all be proved correct but the global spec fails. So proving the global spec correct is still useful.
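
                              And even a toy implementation of that one-line property has to make a pile of incidental decisions the spec never mentions; here is a purely illustrative sketch (assuming fmt, sync, and time are imported):

                              // Toy in-memory reservation store. The spec only says two people can't
                              // hold the same (loc, time); the code has to pick a representation,
                              // a locking strategy, and an error-reporting style.
                              type reservations struct {
                                  mu    sync.Mutex
                                  taken map[string]string // key: loc + "|" + RFC3339 time, value: person
                              }

                              func newReservations() *reservations {
                                  return &reservations{taken: make(map[string]string)}
                              }

                              func (r *reservations) reserve(person, loc string, t time.Time) error {
                                  r.mu.Lock()
                                  defer r.mu.Unlock()
                                  key := loc + "|" + t.UTC().Format(time.RFC3339)
                                  if holder, ok := r.taken[key]; ok && holder != person {
                                      return fmt.Errorf("%s is already reserved at that time by %s", loc, holder)
                                  }
                                  r.taken[key] = person
                                  return nil
                              }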

                              1. 1

                                Given these two sets of bugs:

                                • Implementations which don’t do what the implementer thought
                                • Implementations which don’t do what the business needed

                                Specifications won’t always solve the latter but a perfect specification will always solve the former. I think I see the former a lot more often, but I definitely do see the latter too much.

                                1. 2

                                  Maybe this didn’t answer things well. Why do we trust a software specification to be correct more than we trust the implementation?

                                  • More abstract (e.g. a full functional spec for reverse is forall x. reverse [x] = [x]; forall xs ys. reverse (xs ++ ys) = reverse ys ++ reverse xs; see the sketch below this list)
                                  • More broad (e.g. this whole program does not contain an out of bounds error)
                                  • If you make a mistake in the specification, you’ll have to follow through with that mistake in implementation (on topic: AI probably wouldn’t help with this - I imagine it would just do as requested)
                                  • Implementations often change, specifications change less frequently
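
                                  As a sketch of that first point (my own, assuming some reverse function over []int that returns a new slice, plus Go’s standard slices package), the reverse spec translates directly into executable checks that any implementation must pass:

                                  // The two spec properties for reverse, made executable:
                                  // reversing a singleton is a no-op, and reverse distributes over
                                  // append with the operands flipped.
                                  func holdsSingleton(x int) bool {
                                      got := reverse([]int{x})
                                      return len(got) == 1 && got[0] == x
                                  }

                                  func holdsAppend(xs, ys []int) bool {
                                      joined := append(append([]int{}, xs...), ys...) // xs ++ ys
                                      left := reverse(joined)                         // reverse (xs ++ ys)
                                      right := append(reverse(ys), reverse(xs)...)    // reverse ys ++ reverse xs
                                      return slices.Equal(left, right)
                                  }
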
                                  1. 1

                                    It sounds like you are validating what I thought. It is interesting to me that, even with all the mathematical machinery, we are still, at bottom, subject to the human brain’s ability to reason.

                                    1. 3

                                      we are still, at bottom, subject to the human brain’s ability to reason.

                                      But, that’s because the problem set being described is subject to human factors; if there’s a business out there that needs an algorithm to “determine the set of shops to visit in optimal order, given my priorities”, that’s still dependent on some ability to translate from “priorities given by business”, through requirements gathering processes into “priorities understood by developer”, into “priorities encoded in specification”.

                                      Even if you were able to automate out the “getting understanding into the developer” step, you’d still have a human brain going from abstract business priority into specification.

                                      There’s, imo, no way to automate out human reasoning; math may be platonically ideal, but we still have to project into that space before doing things.

                            3. 2

                              proofreading

                              Just reading code is rarely enough to spot bugs. You need to review and test AI-generated code. The regular QA stuff any code needs.

                              1. 6

                                Worse, it’s likely that LLM-generated code will make different kinds of mistakes than humans do, so you have to learn how to do code review from scratch again. Some things that humans frequently get wrong it will consistently get right, because it has a large corpus to sample from. Some things that humans would never get wrong, because anyone with a trivial level of actual understanding of the problem will do the obvious thing, it will get wrong because it is not actually reasoning about the code.