1. 2

Abstract: “Randomized differential testing of compilers has had great success in finding compiler crashes and silent miscompilations. In this paper we investigate whether we can use similar techniques to improve the quality of the generated code: Can we compare the code generated by different compilers to find optimizations performed by one but missed by another? We have developed a set of tools for running such tests. We compile C code generated by standard random program generators and use a custom binary analysis tool to compare the output programs. Depending on the optimization of interest, the tool can be configured to compare features such as the number of total instructions, multiply or divide instructions, function calls, stack accesses, and more. A standard test case reduction tool produces minimal examples once an interesting difference has been found.

We have used our tools to compare the code generated by GCC, Clang, and CompCert. We have found previously unreported missing arithmetic optimizations in all three compilers, as well as individual cases of unnecessary register spilling, missed opportunities for register coalescing, dead stores, redundant computations, and missing instruction selection patterns.”
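The comparison the abstract describes can be sketched roughly like this. This is a guess at the shape of such a tool, not the authors' actual optdiff; the category names and regexes are illustrative and assume x86-64 objdump-style disassembly:

```python
import re
from collections import Counter

# Instruction categories named in the abstract (patterns are illustrative).
CATEGORIES = {
    "mul": re.compile(r"\b(imul|mul)\b"),
    "div": re.compile(r"\b(idiv|div)\b"),
    "call": re.compile(r"\bcall\b"),
    "stack": re.compile(r"\(%rsp\)|\(%rbp\)"),
}

def count_features(disassembly: str) -> Counter:
    """Tally total instructions and per-category counts from objdump-style text."""
    counts = Counter()
    for line in disassembly.splitlines():
        # objdump lines look like "  401000:\t48 89 e5\tmov %rsp,%rbp"
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip headers, labels, and blank lines
        insn = parts[2]
        counts["total"] += 1
        for name, pat in CATEGORIES.items():
            if pat.search(insn):
                counts[name] += 1
    return counts

def interest_score(a: Counter, b: Counter) -> int:
    """A crude 'interestingness' score: the largest per-category gap
    between two compilers' outputs for the same source file."""
    keys = set(a) | set(b)
    return max(abs(a[k] - b[k]) for k in keys) if keys else 0
```

A large gap in, say, `div` counts would suggest one compiler found a strength-reduction the other missed, which is then worth manual investigation.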


  2. 1

    Interesting read, thanks nickpsecurity.

    This really isn’t a scientific paper since there was no formal hypothesis stated nor experiment performed. It probably would’ve made a great presentation, though. What the authors present is a collection of missed optimizations that they found with the help of some software they wrote.

    Would be nice to see if they could find a way to generalize their process or show that differential testing is somehow better at finding these missed optimizations. Currently it seems like they use their tool to generate an “interesting” score for a compiled program, then manually investigate it to look for missed optimizations.

    As a complete side-note, I had to laugh when the author starts talking about lines of code, especially this gem: “At the time of writing, optdiff comprises 573 physical lines of Python code” WTF is a physical line of code?

    1. 1

      I’m glad you found it interesting. I just skimmed it originally since it looked similar to one of my older ideas: using many compilers on one program, with each function or module taking whichever compiler’s output was fastest. I never tried building it. Yet, you said one thing I didn’t get:

      “This really isn’t a scientific paper since there was no formal hypothesis stated nor experiment performed.”

      The scientific method basically says you observe stuff, make a hypothesis, conduct an experiment to refute/support it, belief increases with success (graduating to theory), and peer review follows. Your claim is a strong accusation, so I read the paper in full to assess it. Here’s what’s in the paper:

      1. Observation: running the same program through multiple compilers and comparing the outputs has found lots of compiler bugs. This pattern has strong, empirical evidence thanks to the Csmith paper.

      2. Their hypothesis is that they can do the same thing, even with some of the same tools, to spot missed optimization opportunities.

      3. Another hypothesis is that metrics such as register spills can be used to find those opportunities. Metrics like that are already well-supported by evidence, since they’re what backend writers and assembly coders already try to minimize.

      4. They conduct an experiment: run random code through multiple compilers, use binary analysis to detect those patterns in the output, and score the code based on them.

      5. Their experiments find missed optimization opportunities as predicted. They file bug reports, which are low-priority since they’re not correctness issues, but some eventually get acted on by compiler writers. That’s a little bit of independent replication in itself.
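      Step 4, combined with the reduction tool the abstract mentions, usually boils down to an "interestingness" predicate the reducer re-runs after every shrink. A sketch of what that might look like, with the compiler flags and the five-instruction threshold being my assumptions rather than anything from the paper:

```python
# Hypothetical interestingness test for a reducer such as C-Reduce:
# the reducer keeps shrinking test.c as long as this script exits 0.
import subprocess

def count_insns(asm_text: str) -> int:
    """Count instruction lines in gcc/clang -S output
    (tab-indented lines that aren't assembler directives)."""
    return sum(1 for l in asm_text.splitlines()
               if l.startswith("\t") and not l.lstrip().startswith("."))

def still_interesting(asm_a: str, asm_b: str, gap: int = 5) -> bool:
    """An assumed criterion: compiler A still emits at least
    `gap` more instructions than compiler B."""
    return count_insns(asm_a) - count_insns(asm_b) >= gap

def compile_to_asm(compiler: str) -> str:
    """Compile test.c in the current directory to assembly text."""
    subprocess.run([compiler, "-O2", "-S", "-o", "out.s", "test.c"], check=True)
    with open("out.s") as f:
        return f.read()

# As a reducer predicate, the script's entry point would be something like:
#   sys.exit(0 if still_interesting(compile_to_asm("gcc"),
#                                   compile_to_asm("clang")) else 1)
```

      The reducer then hands back a minimal test case in which the instruction-count gap persists, which is what makes the manual investigation step tractable.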

      So, they have hypotheses, experiments, outcomes supporting the hypotheses, some negative news on loops they’re honest about, and some independent confirmation from compiler teams fixing what they report. From there, one can rate the details, demand replication, run a new study done better, and so on. They do have scientific evidence for their claims, enough to make it worthy of a replication or a clean-slate project by an independent group.

      There was also some science in the tool development. They noted their unstructured-over-time process kept finding problems in the tools themselves, with bug reports filed on those, too. So the tooling itself had a mini-cycle: maybe it will work, try to use it, it didn’t meet expectations, modify it, use it again.

      “As a complete side-note, I had to laugh when the author starts talking about lines of code, especially this gem: “At the time of writing, optdiff comprises 573 physical lines of Python code” WTF is a physical line of code?”

      Lmao! That’s great. They were probably aiming for “uncommented” or something. I wonder if English was a second language or it was just a mental slip-up. I don’t look at lines of code much if we’re talking high-level languages built on complex runtimes; it’s more like it depended on 573 lines of Python plus the native implementation of that runtime. They might only have to write and review 573 lines, though, which can be an argument favoring productivity and/or maintenance. Python is also fairly reliable if one stays away from weird or hardly-used features/libraries.

      1. 2

        Again, I’d like to be clear that I thought this was a great read and very interesting work. I think my original comment sounded condescending. I didn’t mean it to sound that way.

        The point I was trying to make when I said “no formal hypothesis nor experiment performed” is that it wasn’t formal. I agree that there were experiments performed and that they did find legitimate issues with the compilers. The reason I thought it was informal was, to put it stupidly, that there were no numbers.

        The actual technique in question was the process of binary analysis and method of finding “interesting” differences. For example, I would be curious how often that led to false positives, i.e., a high interesting score but without any way to optimize. I would also like to see how the authors’ approach compared to other, existing approaches to locating missed opportunities for optimization (assuming they exist). That might just be future work for this research group.

        Anyway, thanks for posting the paper. It’s the type of technical stuff that I like to read on Lobsters.

        1. 1

          I thought you were being condescending. My tone was aimed at that. If not, I take that much of it back.

          I agree their numbers should be in the paper or at least a supplement. You have good reasons for that. Also glad you enjoyed the paper.