1. 10

I make my money writing Ruby. I like Ruby; I find it pleasant to work with. But what I really like is getting work done, and I theorize that the duck typing, the metaprogramming, and all the other things that make Ruby what it is might make it harder for LLMs to work with.

I found a paper, A Systematic Evaluation of Large Language Models of Code, that judges some models on perplexity across various languages. As I understand it, perplexity represents how easy completions are to generate: lower means the model finds the code more predictable. I pulled out a table with their results, adding the average and median columns:

| Languages  | #tokens   | Average | Median  | Codex* | PolyCoder 2.7B | GPT-Neo 2.7B | GPT-J 6B | GPT-NeoX | CodeParrot |
|------------|-----------|---------|-------|--------|----------------|--------------|----------|----------|------------|
| C          | 55,333    | 5.5     | 2.69  | 2.55   | 2.33           | 3.69         | 2.82     | 2.37     | 19.23      |
| C#         | 67,306    | 3.05    | 2.35  | 1.72   | 2.58           | 2.49         | 2.2      | 2.12     | 7.16       |
| C++        | 69,627    | 3.51    | 2.67  | 1.95   | 2.99           | 2.87         | 2.47     | 2.32     | 8.48       |
| Go         | 79,947    | 3.32    | 2.04  | 1.39   | 2.57           | 2.19         | 1.89     | 1.85     | 10         |
| Java       | 65,484    | 3.23    | 2.64  | 1.94   | 2.92           | 2.78         | 2.49     | 2.47     | 6.79       |
| JavaScript | 54,620    | 3.81    | 2.9   | 2.17   | 3.06           | 3.07         | 2.73     | 2.62     | 9.23       |
| PHP        | 45,682    | 5.74    | 3.21  | 1.98   | 3.7            | 3.61         | 2.81     | 2.45     | 19.91      |
| Python     | 79,653    | 2.65    | 2.82  | 1.47   | 3.18           | 3            | 2.68     | 2.61     | 2.95       |
| Ruby       | 46,537    | 4.9     | 3.45  | 1.39   | 3.96           | 3.77         | 3.13     | 2.89     | 14.26      |
| Rust       | 107,717   | 3.84    | 3.08  | 1.96   | 3.24           | 3.3          | 2.92     | 2.92     | 8.68       |
| Scala      | 65,756    | 4.85    | 3.62  | 1.75   | 3.87           | 3.88         | 3.37     | 3.33     | 12.91      |
| TypeScript | 55,895    | 4.88    | 3.52  | 2.4    | 3.61           | 3.9          | 3.43     | 3.41     | 12.54      |

Table sorted alphabetically. No conclusions implied.

I wasn’t surprised to see Ruby towards the back of the pack, unfortunately. I also remember someone telling me that Google made Go super terse so they could mass-hire undergrads and then not train them too hard, and these results support my previously unsupported prejudices, which is always nice. I like to think of ChatGPT as an enthusiastic intern: quick to respond, but it doesn’t always listen.

Can we identify, or at least guess at, any attributes of the well-performing languages? Or does anyone have any other insight, other discussions of this topic, articles?

    1. 6

      The biggest and most impactful attribute is a large corpus of working code written in a fairly uniform style. That will beat out any feature of the language itself. Lots of data with predictable and regular patterns means easy training.

      1. 2

        This comment is sticking with me, because it’s something we strive for in our internal code, regardless of language. Our codebases all look like they were written by one person, despite something like 100 engineers having worked on them over 5+ years. It’s always super easy to hop into an area of the code you’ve never been in before, quickly get an understanding of what’s happening, and make progress, whether that’s fixing bugs or implementing new features.

        We believe (I wish to say “know”, but we have not proven it) that this leads to quicker onboarding, fewer bugs (for a variety of reasons), a much lower bus/lottery factor, and an overall healthier codebase and team. So it’s really interesting to see that it provides the same benefits to AI. I hope there’s some sort of empirical study about this at some point.

    2. 3

      Chain-of-thought reasoning has seemed very effective at producing more correct or useful output, and I wonder if there is a parallel in static type signatures. Does having foo(int a, int b) instead of foo(a, b) meaningfully impact the completion inside that function?
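
      As a toy illustration of the question (a made-up function, with Python type hints standing in for static types):

      ```python
      # With no annotations, "a + b" could be int addition, string
      # concatenation, list concatenation... the model guesses from names.
      def foo(a, b):
          return a + b

      # The annotations rule out most non-arithmetic completions before
      # the model writes a single line of the body.
      def foo_typed(a: int, b: int) -> int:
          return a + b
      ```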

      Also - what are Average and Mean here? Why are they different?

      1. 2

        My dumb ass meant median. Editing.

    3. 2

      LLMs know natural language as well as code (or at least OpenAI’s do), and I find when using Copilot that there’s a real advantage to using names that are both meaningful and widely understood. Declaring a variable can be almost as good as writing a comment (when combined with the function name and signature). I’m not sure what that means for languages, though I’m guessing it hurts Scala. And probably helps Java: some verbosity is rewarded.
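
      As a tiny example of what I mean (hypothetical names, Python for brevity), the declaration alone reads almost like a comment:

      ```python
      # A model completing this body has the intent spelled out by the
      # function name, the parameter names, and the docstring.
      def remove_expired_sessions(sessions: dict[str, float], now: float) -> dict[str, float]:
          """Keep only sessions whose expiry timestamp is still in the future."""
          return {sid: expiry for sid, expiry in sessions.items() if expiry > now}
      ```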

      Along with verbosity, I’m guessing static typing will help. It can add some mental burden to keep track of, and it increases the length of the code… but I don’t think those will affect LLMs much. To be effective there will have to be tight feedback loops: we have formal systems that are very good with static types, but LLMs are only OK; they aren’t magically perfect at getting types right. Implicit types will be harder to work with than explicit types. Good for Java, bad for Scala. (I’m doing some work with Elm and GPT, and it’s a bit rough.) Though sometimes implicit types will “just work” for an LLM, same as for humans, and to their benefit.

      Having a well understood canonical way to implement something is good; Copilot can and does pick up on local style, but without that it will choose any possible way to do something, and maybe not the way you want to do it. So different loop patterns or whatever other choices might lead to unimportant disagreements.

      We don’t have many tools yet with really good AI error feedback loops (where the AI writes code, gets an error, and then corrects it), but those are presumably coming soon. Good error messages will support that feedback loop. I doubt there’s nearly as much information out there to learn from about fixing errors as about writing code (people don’t publish broken code together with its error messages and fixes very often). So I think LLMs will need to understand errors from first principles more often than through imitation, and well-articulated natural-language errors, especially with contextual suggestions, will help. I think many patterns of development designed for humans will also help LLMs.
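
      A minimal sketch of the kind of loop I mean, in Python; `complete` stands in for whatever LLM call you have (everything here is made up for illustration):

      ```python
      import subprocess
      import sys
      import tempfile

      def run_candidate(code: str) -> str | None:
          """Run a candidate script; return None on success, stderr on failure."""
          with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
              f.write(code)
              path = f.name
          result = subprocess.run([sys.executable, path], capture_output=True, text=True)
          return None if result.returncode == 0 else result.stderr

      def generate_with_feedback(prompt: str, complete, max_rounds: int = 3) -> str:
          """Generate code, then let the model correct itself from real errors."""
          code = complete(prompt)
          for _ in range(max_rounds):
              error = run_candidate(code)
              if error is None:
                  return code
              # The error message is the feedback loop: feed it back verbatim.
              code = complete(
                  f"{prompt}\n\nThis attempt:\n{code}\n\nfailed with:\n{error}\nFix it."
              )
          return code
      ```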

      Stylistically, I think verbose and unclever languages will do best. Good for C, PHP, Java, Go. C will be wildly dangerous, though; an LLM makes exactly the kinds of errors that C doesn’t protect against.

      Not exactly a language attribute, but I think an LLM will do best with semantic outputs. It’s terrible at things like coordinates. Something like an unadorned Unity environment would be very hard (though you could build a more semantically relevant layer on top of Unity). Or you’d use different tools for layout, with some semantic layer in between.

      The size of the standard library will be interesting. I think a large and sophisticated standard library (or a conventionally standard one) will be helpful, as LLMs will be able to solve problems using those libraries without being overwhelmed by their size the way a human would. OTOH, large libraries often mean duplication and available-but-not-recommended practices, which LLMs might not pick up on (but could probably be trained to understand, if there’s a conventional understanding of what’s best practice).

    4. 2

      Update: I had perplexity backwards. Python is apparently the easiest to predict and PHP the hardest.

      I’m pretty sure this is correlated with entropy: how likely is the window of text just examined/generated to precisely determine the next token?
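
      Concretely (my understanding, not from the paper): perplexity is the exponential of the average per-token negative log-likelihood, i.e. exp of the cross-entropy, and you can read it as the effective number of choices for the next token:

      ```python
      import math

      def perplexity(token_logprobs: list[float]) -> float:
          """token_logprobs: natural-log probabilities the model assigned
          to each actual next token. Perplexity = exp(cross-entropy)."""
          cross_entropy = -sum(token_logprobs) / len(token_logprobs)
          return math.exp(cross_entropy)

      # A model that always gives the true next token probability 1/2 is,
      # in effect, choosing between 2 equally likely continuations:
      print(perplexity([math.log(0.5)] * 10))  # 2.0
      ```

      On that reading, the table’s medians say a model faces roughly 2.8 effective choices per token in Python and about 3.45 in Ruby.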

      What’s interesting is that most languages are very close, with C and PHP presumably having the most plausible sequels to a given window of input, and Python the fewest.

      I wonder to what extent this is an artifact of the problem domains in which the code is deployed, versus the language’s inherent repetitiousness.

      In any case, ease of generation is probably neither a good thing nor a bad thing. We’re going to get to the next level when LLMs can start to evaluate their own candidate output and refine it. Properties that make that easy will be important. I’m guessing that syntactic uniformity will either help or hinder a lot, and that compulsory type (and other) annotations that enforce consistency will help. Also structured representations of error messages that precisely identify the nature of the problem and the inconsistency.
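
      Something like this shape, say (a made-up sketch of what a “structured representation of an error” could look like):

      ```python
      from dataclasses import dataclass

      @dataclass
      class Diagnostic:
          """A machine-readable error: enough to localise and classify the
          inconsistency without parsing prose."""
          file: str
          line: int
          column: int
          kind: str        # e.g. "type-mismatch", "unresolved-name"
          expected: str    # e.g. "int"
          actual: str      # e.g. "str"
          suggestion: str  # e.g. "did you mean int(user_input)?"
      ```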

    5. 1

      Is the way to interpret this table that “C is the programming language most amenable to LLM-based completion, and TypeScript the least”? (Of the considered set, of course.)

      1. 1

        Oh nah, the sort is alphabetical. I’m drawing my conclusions from the median score; lower is better. But I’m also not sure of my interpretation. I’m just some guy.

        1. 3

          Ah, I see. It’s an interesting question, which programming languages are more amenable to this sort of completion. I always assumed it would be “whatever language has the most code examples the model can use as a training data set” (thus maybe HTML, CSS, Python & JavaScript would be winners, based on open-source popularity). But then I thought about it a bit more, and I suspect your hunch is right that languages that are more regular / static might have an advantage.

          One thing that was a bit humbling for me to realize was that LLMs, in some ways, have an easier time completing code than natural language, since code follows very strict syntax and parser rules, at least compared to fluent English. Combine that with the eventual (inevitable) step of training code-generation models on actual compiler/linter output, and I imagine LLMs will become even better at generating working, fluent code than they are at generating fluent natural language.

    6. 1

      My complete guesses are:

      • Resiliency to small errors. LLMs pick tokens based on probabilities, so they can produce small inaccuracies. The ability to detect and fix errors may be very useful. This doesn’t have to be limited to the language’s syntax; good error messages can also help, since LLMs are able to read and act on them.

      • A grammar with enough keywords, structure, context, and redundancy to make it easy for an LLM to follow the language’s structure.

      • Probably a language structure resembling English, without too many layers of abstraction and implicit behavior. To me, LLMs look like translators, and not just between human languages: code generation is a translation task from English to a programming language. An imperative language is probably better than a stack language (e.g., English -> Japanese translation has historically been hard for machine learning because the word order is almost reversed).

      • More training data probably beats everything else.

      1. 1

        On the first point, it’s worth noting that the latest LLMs can plug into other data sources. If you can hook a type checker up to your LLM, so that it’s told as soon as it generates something invalid, then the code that wraps the LLM can generate much better code.

        Some mechanism to feed all available declarations into the LLM for the current source file would probably be a huge win. For C, feeding it the preprocessed output is easy (but that loses anything you haven’t included). For languages with a module system, you could probably dump some serialisation of all of the installed modules on the system, though in both cases you’d also want it to be able to suggest things that you don’t have installed.
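
        A rough sketch of the declarations idea in Python, using `inspect` to serialise an installed module’s public signatures into prompt context (the prompt wiring is made up):

        ```python
        import inspect

        def module_declarations(module) -> str:
            """Collect public callable signatures (plus docstring summaries)
            from an installed module, as plain-text context for a prompt."""
            lines = []
            for name, obj in inspect.getmembers(module, callable):
                if name.startswith("_"):
                    continue
                try:
                    sig = str(inspect.signature(obj))
                except (TypeError, ValueError):
                    sig = "(...)"  # some builtins don't expose a signature
                doc = (inspect.getdoc(obj) or "").splitlines()
                lines.append(f"{module.__name__}.{name}{sig}  # {doc[0] if doc else ''}")
            return "\n".join(lines)

        # Example: prepend an installed module's declarations to the prompt.
        import textwrap
        prompt = (
            "Available functions:\n"
            + module_declarations(textwrap)
            + "\n\nWrite code that wraps text to 72 columns."
        )
        ```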