1. 30

  2. 8

    Ah. This article, which argues that there are no low level languages you can write code in.

    1. 5

      I read this article as saying that computers aren’t improving as much as they could, because hardware and software people keep thinking in the “C” way. People optimize CPUs for the C model, optimize compilers for the CPU, optimize software for compilers… and as a result the entire computing model, despite having multiple processors, fast caches, and accurate branch prediction, still revolves around C. There are some computing models for concurrency (thanks to some geniuses like Alan Kay and great timing). Maybe there can be models for caches or branch prediction too, but nobody really thinks about these models…

      1. 4

        And this is true. CPUs are high-level machines now. In x86 even the assembly/machine code is an emulated language that is dynamically compiled to μops, executed out of order, on multiple complex execution units that you have no direct control over.

        1. 1

          Totally agree, but if you put aside that one rather questionable assertion there’s a whole lot of interesting material here :)

        2. 4

          This basically ignores the factor most important in the success of x86, and now ARM as well and maybe someday RISC-V: The instruction set of a CPU is an API that abstracts away some of the details of the processor. Additionally, it is a pretty stable API. This means that the processor’s internals can get better and your program will run better without having to recompile it, up to a point. For better or for worse, modern processors are now very, very good at doing that sort of thing.

          For examples of instruction sets which cast aside this abstraction to a greater or lesser extent, consider MIPS or Itanium. Itanium in particular hardcoded things like the number of integer dispatch units and the pipeline structure into the instruction set. What in the world would Itanium have done to add more integer dispatch units if and when the technology called for them? Exactly what this article derides: it would have added hardware to emulate the old system’s properties atop the new one.

          1. 1

            The article addresses abstraction multiple times, like here:

            ARM’s SVE (Scalar Vector Extensions), and similar work from Berkeley, provides another glimpse at a better interface between program and hardware. Conventional vector units expose fixed-sized vector operations and expect the compiler to try to map the algorithm to the available unit size. In contrast, the SVE interface expects the programmer to describe the degree of parallelism available and relies on the hardware to map it down to the available number of execution units. Using this from C is complex, because the autovectorizer must infer the available parallelism from loop structures. Generating code for it from a functional-style map operation is trivial: the length of the mapped array is the degree of available parallelism.

            1. 2

              In contrast, the SVE interface expects the programmer to describe the degree of parallelism available and relies on the hardware to map it down to the available number of execution units

              This is also the style of vectorization that is being implemented by the V extension in RISC-V: https://riscv.github.io/documents/riscv-v-spec/#_implementation_defined_constant_parameters

              Each hart supporting the vector extension defines three parameters:

              • The maximum size of a single vector element in bits, ELEN, which must be a power of 2.
              • The number of bits in a vector register, VLEN ≥ ELEN, which must be a power of 2.
              • The striping distance in bits, SLEN, which must be VLEN ≥ SLEN ≥ 32, and which must be a power of 2.

              Platform profiles may set further constraints on these parameters, for example, requiring that ELEN ≥ max(XLEN,FLEN), or requiring a minimum VLEN value, or setting an SLEN value.

              The ISA supports writing binary code that under certain constraints will execute portably on harts with different values for these parameters.

              Video with code examples: https://www.youtube.com/watch?v=GzZ-8bHsD5s
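
              The portability claim quoted above can be sketched in plain C. This is only a simulation under stated assumptions, not real vector code: `set_vl`, `vla_add`, and the `hw_lanes` parameter are hypothetical names, with `hw_lanes` standing in for however many elements the hardware's vector registers can hold and `set_vl` loosely modelling a "set vector length" request. The point is that the loop never hardcodes a vector width, so the same code gives the same answer whether the "hardware" processes 2 or 8 elements at a time.

              ```c
              #include <assert.h>
              #include <stddef.h>

              /* Hypothetical stand-in for a "set vector length" request: the program
                 asks for `remaining` elements, the hardware grants at most `hw_lanes`. */
              static size_t set_vl(size_t remaining, size_t hw_lanes)
              {
                  return remaining < hw_lanes ? remaining : hw_lanes;
              }

              /* Computes out[i] = a[i] + b[i] for n elements, in chunks whose size is
                 chosen by the (simulated) hardware rather than by the program. */
              void vla_add(const float *a, const float *b, float *out,
                           size_t n, size_t hw_lanes)
              {
                  assert(hw_lanes > 0);
                  while (n > 0) {
                      size_t vl = set_vl(n, hw_lanes);
                      /* This inner loop stands in for a single vector instruction. */
                      for (size_t i = 0; i < vl; i++)
                          out[i] = a[i] + b[i];
                      a += vl;
                      b += vl;
                      out += vl;
                      n -= vl;
                  }
              }
              ```

              Calling `vla_add` with `hw_lanes` set to 2 or to 8 walks the arrays in different chunk sizes but fills `out` with the same values, which is the property the quoted sentence about portable binaries describes.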

          2. 3

            As an aside, I like raymii’s strategy of posting two articles representing the two sides of an argument, as well as a third for context in another language.

            1. 2

              I don’t particularly appreciate 9 of the articles on the lobste.rs main page (at least with the exclusions I have set) being posted by the same person, though.

              1. 2

                My gut reaction was the same, but then I reminded myself that (barring bot-interference) what shows on the front page is the result of individual readers… but I totally relate.

                1. 1

                  Yeah I also have a lot of filtered tags - most web or mobile development stuff, anything relating to ruby, java, c# or .net, most practices/culture crap, etc. So if someone posts a few C posts they’re likely to all show up for me :)

            2. 2

              To my reading, the author puts a bunch of blame on C. I don’t think you can sum it up that simply, because most of the things he cites as limitations of C that CPUs had to work around are shared by basically all other modern languages. I’m thinking mostly about micro-parallelism and branch prediction here; I don’t know as much about cache handling and data structure alignment. I think it could be more accurately stated that both programming language authors and CPU designers are still working together around some of the fundamental assumptions of C. There have been many innovations in both, but those assumptions still remain and are key to tying everything together. I’m not even sure what a language designed for maximum CPU efficiency would look like.

              1. 2

                I don’t think the author blames C, I think the author blames the industry for doubling down on C.

                How are the described limitations shared by other modern languages? The author claims ILP (instruction-level parallelism) exists to optimize serial code, because C has minimal language facilities for writing explicitly parallel code.

                For parallelization, C has platform-dependent syscalls. Almost every other modern language has some kind of threading library. Many have abstractions like thread pools. What about Go? Imagine if C had goroutines and channels in 2005 when Sun’s 8 core / 32 thread SPARC was competing with single and dual core chips from IBM / Intel / AMD.

                The article also describes why the C memory model makes vectorization so hard. Aside from C++, no modern language shares those problems. Most of them don’t vectorize loops simply because everyone writing high performance code is already using C, or using explicitly vectorized libraries written in C (like numpy).
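
                One commonly cited part of this is pointer aliasing, and the sketch below assumes that is the kind of problem meant here. The function names `scale_may_alias` and `scale_no_alias` are made up for illustration. In the first function the two pointers may overlap, so a store in one iteration can feed a load in the next, and a compiler has to prove there is no overlap, add a runtime check, or keep the scalar order. In the second, `restrict` is the programmer's promise that they do not overlap.

                ```c
                #include <assert.h>
                #include <stddef.h>

                /* dst and src may overlap, so iteration i can change what
                   iteration i + 1 reads. The scalar order must be preserved. */
                void scale_may_alias(float *dst, const float *src, float k, size_t n)
                {
                    assert(dst != NULL && src != NULL);
                    for (size_t i = 0; i < n; i++)
                        dst[i] = src[i] * k;
                }

                /* restrict promises that dst and src never overlap, so the
                   iterations are independent. Passing overlapping pointers
                   here would be undefined behavior. */
                void scale_no_alias(float *restrict dst, const float *restrict src,
                                    float k, size_t n)
                {
                    assert(dst != NULL && src != NULL);
                    for (size_t i = 0; i < n; i++)
                        dst[i] = src[i] * k;
                }
                ```

                With `float buf[4] = {1, 1, 1, 1}`, the call `scale_may_alias(buf + 1, buf, 2.0f, 3)` leaves `buf` as `{1, 2, 4, 8}`, because each store is read back by the next iteration. Loading all three inputs first, as a naive vectorized version would, gives `{1, 2, 2, 2}` instead.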

                1. 1

                  There has been a lot of work towards making threading smoother and easier for developers to work with. All of it, though, seems to work towards higher-level threading, where the programmer thinks about which sub-tasks are worth the effort of spinning off a thread. Much of the kind of compiler-level optimization I’m talking about is more at a micro level. For example, if a function has 5 lines that translate to 20 CPU instructions, how do we order those 20 instructions so that the CPU’s multiple execution units can execute them mostly in parallel? Things like that, where the compiler actively works to keep as much of the CPU as possible busy, in ways that the developer in any language I know of can’t practically do.
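
                  A rough sketch of that micro-level idea, written by hand in C. The names `sum_one_chain` and `sum_four_chains` are hypothetical, and whether the second version is actually faster depends on the CPU and compiler. The first loop is one long dependency chain, since every add needs the previous result. The second keeps four independent chains, so a core with several execution units has adds it can overlap.

                  ```c
                  #include <assert.h>
                  #include <stddef.h>

                  /* One dependency chain: each add waits for the previous one. */
                  double sum_one_chain(const double *x, size_t n)
                  {
                      assert(x != NULL || n == 0);
                      double s = 0.0;
                      for (size_t i = 0; i < n; i++)
                          s += x[i];
                      return s;
                  }

                  /* Four independent chains: the adds in one loop body do not
                     depend on each other, so they can be issued side by side. */
                  double sum_four_chains(const double *x, size_t n)
                  {
                      assert(x != NULL || n == 0);
                      double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
                      size_t i = 0;
                      for (; i + 4 <= n; i += 4) {
                          s0 += x[i];
                          s1 += x[i + 1];
                          s2 += x[i + 2];
                          s3 += x[i + 3];
                      }
                      for (; i < n; i++)  /* leftover elements */
                          s0 += x[i];
                      return (s0 + s1) + (s2 + s3);
                  }
                  ```

                  The second version adds the numbers in a different order, and floating-point addition is not associative, so the two results can differ in the last bits for general inputs. That is one reason a compiler will not normally make this transformation on its own unless told it may relax floating-point rules.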

                  1. 1

                    Compilers already do that. C just makes it extra hard because the C abstract machine has such incredibly weak guarantees. The article favors explicit parallelism, as opposed to parallelism delicately massaged out of the processor by reordering instructions like you describe.

                    I don’t understand your point. Any language with a for…in loop provides opportunities for compiling to explicitly parallel instructions. Almost no one does it, because all the compiler work goes into optimizing C, and all the processors cater to C.

              2. 1

                Super interesting article! I don’t necessarily agree with the title which feels a bit click-bait-y to me, but it raises some super interesting points that will be much food for thought. For example:

                For a language to be “close to the metal,” it must provide an abstract machine that maps easily to the abstractions exposed by the target platform. It’s easy to argue that C was a low-level language for the PDP-11. They both described a model in which programs executed sequentially, in which memory was a flat space, and even the pre- and post-increment operators cleanly lined up with the PDP-11 addressing modes.

                So, if we put aside the title, and the author’s assertion that modern architectures with their highly optimized multiple execution paths are NOT “close to the metal”, there’s an interesting point to be made here.

                Maybe we need to be re-thinking the entire execution model, from the programming language to the compiler to (perhaps more importantly) the runtime libraries on down to take the current reality into account.

                My tiny brain struggles to envision what that might look like. Perhaps something with parallelism baked in from jump like Erlang?

                1. 1

                  I think highly multithreaded CPUs will finally become popular soon. ILP has essentially stopped scaling for many tasks, so loads of historically single threaded programs are properly multithreaded now. For example, until just a few years ago game engines infamously did almost all non-GPU work in their main loop.

                  POWER9 cores have 8 threads each. ARM started making threaded cores last year.

                  1. 2

                    POWER9 cores have 8 threads each

                    There are SMT4 and SMT8 configurations; the SMT8 configurations seem exclusive to IBM servers. I believe that’s because the SMT4 configuration is actually more efficient, and SMT8 is more useful for licensing regimes and for the specific scheduler in the proprietary hypervisor that IBM’s servers use.

                    1. 1

                      What do you mean by more efficient? It seems to me that any highly parallel, memory bound workload would be more efficient on higher thread count processors. And those workloads are more common on servers.

                      1. 1

                        SMT4 and SMT8 are basically the same processor; the configuration just changes how cache/chiplets are allocated per core/thread. For EE reasons I don’t fully understand, the SMT4 configuration is actually the ideal one.

                        1. 1

                          Interesting. If you have a source I’d love to read it.

                          I still think highly threaded processors will make a comeback. Application servers spend a lot of time stalled on memory access, the slam dunk case for hardware threads.

                2. 1

                  Going by this paragraph:

                  The quest for high ILP was the direct cause of Spectre and Meltdown. A modern Intel processor has up to 180 instructions in flight at a time (in stark contrast to a sequential C abstract machine, which expects each operation to complete before the next one begins). A typical heuristic for C code is that there is a branch, on average, every seven instructions. If you wish to keep such a pipeline full from a single thread, then you must guess the targets of the next 25 branches. This, again, adds complexity; it also means that an incorrect guess results in work being done and then discarded, which is not ideal for power consumption. This discarded work has visible side effects, which the Spectre and Meltdown attacks could exploit.

                  The linked article seems to argue that ILP will inevitably lead to a CISC architecture like the crazy Intel x86 with its 180 instructions in flight. I am a little doubtful though, as RISC-V, ARM, and MIPS also have SIMD/vector extensions. Still, I really appreciate the novel perspective.

                  1. 1

                    On a modern high-end core, the register rename engine is one of the largest consumers of die area and power. To make matters worse, it cannot be turned off or power gated while any instructions are running, which makes it inconvenient in a dark silicon era when transistors are cheap but powered transistors are an expensive resource.

                    Is this why ARM cores have historically been more power-efficient than x86 ones? If so, is ARM sacrificing this advantage in its own high-end cores?
