1. 33

A response to the “C Is Not A Low-Level Language” article. (lobste.rs discussion)


  2. 26

    Something clearly got this author’s goat; this rebuttal feels less like a reasoned response and more like someone yelling “NO U” into Wordpress.

    Out of order execution is used to hide the latency from talking to other components of the system that aren’t the current CPU, not “to make C run faster”.

    Also, attacking academics as people living in ivory towers is an obvious ad hominem. It doesn’t serve any purpose in this article and, if anything, weakens it. Tremendous amounts of practical CS come from academia and professional researchers. That doesn’t mean it should be thrown out.

    1. 10

      So, in context, the bit you quote is:

      The author criticizes things like “out-of-order” execution which has led to the Spectre sidechannel vulnerabilities. Out-of-order execution is necessary to make C run faster.

      The author was completely correct here, and substituting in JS/C++/D/Rust/Fortran/Ada would’ve still resulted in a correct statement.

      The academic software preference (assuming that such a thing exists) is clearly for parallelism, for “dumb” chips (because computer science and PLT is cooler than computer/electrical engineering, one supposes), for “smart” compilers and PL tricks, and against “dumb” languages like C. That appears to be the assertion the author here would make, and I don’t think it’s particularly wrong.

      Here’s the thing though: none of that has been borne out in mainstream usage. In fact, the big failure the author mentioned here (the Sparc Tx line) was not alone! The other big offender of this you may have heard of is the Itanic, from the folks at Intel. A similar example of the philosophy not really getting traction is the (very neat and clever) Parallax Propeller line. Or the relative failure of the Adapteva Parallella boards and their Epiphany processors.

      For completeness’ sake, the only chips with massive core counts and simple execution models are GPUs, and those are only really showing their talent in number crunching and hashing–and even then, for the last decade, somehow limping along with C variants!

      1. 2

        One problem with the original article was that it located the requirement for ILP in the imagined defects of the C language. That’s just false.

        Weird how nobody seems to remember the Terra.

        1. 3

          In order to remember you would have to have learned about it first. My experience is that no one who isn’t studying computer architecture or compilers in graduate school will be exposed to more exotic architectures. For most technology professionals, working on anything other than x86 is way out of the mainstream. We can thank the iPhone for at least making “normal” software people aware of ARM.

          1. 4

            I am so old that I remember reading about the Tomasulo algorithm in Computer Architecture class and wondering why anyone would need that on a modern computer with a fast cache - like a VAX.

          2. 1

            For those of us who don’t, what’s Terra?

            1. 2

              Of course, I spelled it wrong.


        2. 9

          The purpose of out of order execution is to increase instruction-level parallelism (ILP). And while it’s frequently the case that covering the latency of off chip access is one way out of order execution helps, the other (more common) reason is that non-dependent instructions that use independent ALUs can issue immediately and retire in whatever order instead of stalling the whole pipeline to maintain instruction ordering. When you mix this with good branch prediction and complex fetch and issue logic, then you get, in effect, unrolled, parallelized loops with vanilla C code.
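
          To make the dependent vs. non-dependent distinction concrete, here is a minimal sketch in plain C. The function names are made up for illustration. Both functions compute the same sum; the first forms one serial chain of adds, while the second keeps four independent chains that an out-of-order core with several ALUs is free to overlap. Whether the second is actually faster depends on the compiler and the CPU (an optimizing compiler may well transform or vectorize either loop), so treat it as an illustration of the dependency structure, not a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add needs the result of the previous add,
   so the adds form a single serial dependency chain. */
uint64_t sum_chained(const uint32_t *a, size_t n) {
    assert(a != NULL || n == 0);
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: the four adds inside one iteration do not depend
   on each other, so they can be in flight at the same time. */
uint64_t sum_split(const uint32_t *a, size_t n) {
    assert(a != NULL || n == 0);
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

          Because the accumulators are unsigned integers, the two versions always agree exactly; with floating point the reordering could change the rounding.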

          Whether it’s fair to say the reasoning was “to make C run faster” is certainly debatable, but the first mainstream out of order processor was the Pentium Pro (1996). Back then, the vast majority of software was written in C, and Intel was hellbent on making each generation of Pentium run single-threaded code faster until they hit the inevitable power wall at the end of the NetBurst life. We only saw the proliferation of highly parallel programming languages and libraries in the mainstream consciousness after that, when multicores became the norm to keep the marketing materials full of speed gainz despite the roadblock on clockspeed and, relatedly, single-threaded performance.

          1. 1

            the first mainstream out of order processor was the Pentium Pro (1996).


        3. 8

          Your response is much better than the original article (which has lowered my opinion of the ACM) but I do have one correction.
          Tomasulo’s algorithm, the first implementation of out-of-order execution, predates the C language by a few years.

          1. 4

            Your response…

            Quick note: I did not write this, if that’s who you’re addressing. The author is Robert Graham.

            1. 2

              Sorry, my mistake. I’m not sure how I missed that.

          2. 7

            [C is] still the fastest compiled language.

            Well hold on there. This claim might be true when you’re working with small critical sections or microbenchmarks, but when you have a significant amount of code all of which is performance-sensitive, my experience is that’s way off-base. I work with image pipelines these days, which are characterized by a long sequence of relatively-simple array operations. In the good ol’ days this was all written in straight C, and if people wanted stages to be parallelized or vectorized or grouped or GPU offloaded or [next year’s new hotness] they did it themselves. The result, as you might guess, is a bunch of hairy bug-prone code that’s difficult to adapt to algorithm changes.

            Nowadays there are better options, such as Halide. In Halide you write out the computation stages equationally, and separately annotate how you want things parallelized/GPU offloaded/etc. The result is your code is high-level but you retain control over its implementation. And unlike in C, exploring different avenues to performance is pretty easy, and oh by the way, schedule annotations don’t change algorithm behavior so you know the optimizations you’re applying are safe.
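
            For contrast, here is a rough sketch of what that entanglement looks like in straight C. This is not Halide code, and the stage (`brighten`) and function names are hypothetical. Both functions implement the same two-stage pipeline, a per-sample brighten followed by a 3-tap box sum, but switching from "materialize stage 1 in a buffer" to "recompute stage 1 inline" means rewriting the loops by hand and re-checking that the outputs still match.

```c
#include <assert.h>
#include <stddef.h>

/* Stage 1 (hypothetical): a simple per-sample adjustment. */
static int brighten(int v) { return 2 * v + 1; }

/* Staged version: run all of stage 1 into tmp, then run stage 2 over tmp.
   tmp must hold n elements; out must hold n - 2 elements. */
void pipeline_staged(const int *in, int *tmp, int *out, size_t n) {
    assert(n >= 3);
    for (size_t i = 0; i < n; i++)
        tmp[i] = brighten(in[i]);
    for (size_t i = 0; i + 2 < n; i++)
        out[i] = tmp[i] + tmp[i + 1] + tmp[i + 2];
}

/* Fused version: same algorithm, different "schedule". No temporary
   buffer, but each stage-1 value is recomputed up to three times. */
void pipeline_fused(const int *in, int *out, size_t n) {
    assert(n >= 3);
    for (size_t i = 0; i + 2 < n; i++)
        out[i] = brighten(in[i]) + brighten(in[i + 1]) + brighten(in[i + 2]);
}
```

            In C nothing but testing tells you the two versions agree. As I understand it, the Halide equivalent of this change is a one-line difference in the schedule, with the algorithm definition left untouched.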

            Check out the paper for demonstrations of Halide’s effectiveness: using an automatic schedule search, they manage to beat a collection of hand-optimized kernels for real-world tasks, in a fraction of the code size. Also from personal experience, our project might not have even gotten off the ground without Halide, but with it we’ve been able to port a large camera pipeline with relative ease.

            My point is providing primitives only nets you performance until you hit the cognitive ceiling of your programmers. All those high-performance schedules could’ve been coded up in C and CUDA, but they weren’t because at some point all the parts got too complicated for the experts to hand-code optimally. And if no one can manage to actually get high performance out of your language/system, is it really high performance? So I think in order for a language to stay high-performance in the presence of complex programs, it needs to be high-level. C and intrinsics just aren’t good enough.

            Edit: for a great example, check out the snippets on the second page here.

            1. 8

              It’s not really fair to blame the original author’s basic ignorance of processor architecture and language design on him being an “academic”. Hennessy and Patterson are academics. They wrote an undergrad academic text that the author apparently never read.

              1. 3

                Christ, this discussion is way beyond me. I’d still really like to understand it though. Anyone available to compsplain?

                1. 4

                  Back in the day, there was a great series of articles by Jon Stokes on Ars Technica that covered a lot of microarchitectural concepts like this. You can find them with Google or buy them compiled into a book: https://nostarch.com/insidemachine.htm

                  1. 4

                    I managed to find a collection of his explanatory articles, of which the two-part series “Understanding the Microprocessor” seemed to best match what you described.