1. 39
  2. 11

    I enjoyed this, but it did make me wonder – what would a true low-level language designed today actually look like? I’ll hang up and take your answers over the air.

    1. 9

      If I’m reading the article’s premise properly, the author doesn’t even consider assembly language to be ‘low level’ on modern processors, because the implicit parallel execution performed by speculative execution is not fully exposed as controllable by the programmer. It’s an interesting take, but I don’t think anybody other than the author would use “low level” to mean what he does.

      That said, if we were to make a language that met the author’s standards (let’s say “hardware-parallelism-transparent” rather than “low-level”), we’d probably be seeing something that vaguely resembled Erlang or Miranda in terms of how branching worked – i.e., a lot of guards around blocks of code rather than conditional jumps (or, rather, in this case, conditional jumps with their conditions inverted before small blocks of serial code).

      People later in the thread are talking about threading & how there’s no reason threading couldn’t be put into the C standard, but threading doesn’t appear to be the kind of parallelism the author is concerned about exposing. (To be honest, I wonder if the author has a similarly spicy take on microcode, or on firmware, or on the programmability of interrupt controllers!)

      He seems to be saying that, because we made it easy to ignore certain things that were going on in hardware (like real instructions being executed and then un-done), we were taken off-guard by the consequences when a hole was poked in the facade in the form of operations that couldn’t be hidden in that way. I don’t think that’s a controversial statement – indeed, I’m pretty sure that everybody who makes compatibility-based abstractions is aware that such abstractions become a problem when they fail.

      He suggests that the fault lies in not graduating to an abstraction closer to the actual operation of the machine, which is probably fair, although chip architectures in general and x86 in particular are often seen as vast accumulations of ill-conceived kludges, and it is this very bug-compatibility that’s often credited with x86’s continued dominance even in the face of other architectures that don’t pretend it’s 1977 when you lower the reset pin and don’t require trampolining off three chunks of arcane code to go from a 20-bit address bus and 16-bit words to 64-bit words.

      People don’t usually go as far as to suggest that speculative execution should be exposed to system programmers as something to be directly manipulated, and mechanisms to prevent this are literally part of the hardware, but it’s an interesting idea to consider, in the same way that (despite their limitations) it’s interesting to consider what could be done with one of those PPC chips with an FPGA on the die.

      The quick and easy answer to what people would do with such facilities is the same as with most forms of added flexibility: most people would shoot themselves in the foot, a few people would make amazing works of art, and then somebody would come along and impose standards that limit how big a hole in your foot you can shoot, and that would kill off the artworks.

      1. 5

        Probably a parallel/concurrent-by-default language like ParaSail or Chapel with C-like design as a base to plug into ecosystem designed for it. Macros for DSL’s, too, since they’re popular for mapping stuff to specific hardware accelerators. I already had a Scheme + C project in mind for sequential code. When brainstorming on parallel part, mapping stuff from above languages onto C was the idea. Probably start with something simpler like Cilk to get feet wet, though. That was the concept.

        1. 2

          Or maybe it would look like Rust.

          1. 8

            The article’s point is that things are parallel by default at multiple levels, there’s different memories with different performance based on locality, orderings with consistency models, and so on. The parallel languages assume most of that given they were originally designed for NUMA’s and clusters. They force you to address it with sequential stuff being an exception. They also use compilers and runtimes to schedule that much like compiler + CPU models.

            Looking at Rust, it seems like it believes in the imaginary model C does that’s sequential, ordered, and so on. It certainly has some libraries or features that help with parallelism and concurrency. Yet, it looks sequential at the core to me. Makes sense as a C replacement.

            1. 5

              But, Rust is the only new low-level language I’m aware of, so empirically: new low-level languages look like Rust.

              Looking at Rust, it seems like it believes in the imaginary model C does that’s sequential, ordered, and so on.

              To be fair, the processor puts a lot of effort into letting you imagine it. Maybe the reason we don’t have languages that look more like the underlying chip is that it’s very difficult to reason about.

              Talking out of my domain here: but the out of order stuff and all that the processor gives you is pretty granular, not at the whole-task level, so maybe we are doing the right thing by imagining sequential execution because that’s what we do at the level we think at. Or, maybe we should just use Haskell where order of execution doesn’t matter.

              1. 3

                How does Rust qualify as “low level”?

                1. 1

                  From my understanding, being low-level is one of the goals of the project? Whatever “low-level” means. It’s certainly trying to compete where one would use C and C++.

                  1. 3

                    But does Rust meet the criteria for low level that C does not (per the link)?

                    1. [Comment removed by author]

                      1. 3

                        I think you’re probably putting too much faith in Wikipedia. With that said, I must confess, I have no insight into the decision procedure that chooses the terms to describe Rust in that infobox.

                        One possible explanation is that Rust used to bake lightweight threads into its runtime, not unlike Go. Go is also described as being concurrent on Wikipedia. To that end, the terms are at least consistent, given where Rust was somewhere around 4 years ago. Is it possible that the infobox simply hasn’t been updated? Or perhaps there is a turf war? Or perhaps there are more meanings to what “concurrent” actually signifies? Does having channels in the standard library mean Rust is “concurrent”? I dunno.

                        Rust has stuff in the type system to eliminate data races in safe code. Separate from that, there are some conveniences that help avoid deadlock (e.g., you typically never explicitly unlock a mutex). But concurrency is definitely not built into the language like it is for Go.
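
                        A minimal sketch of the mutex point, assuming nothing beyond the standard library (the function name `count_with_guards` is made up for illustration). Locking hands back a guard value, and the lock is released when that guard goes out of scope, so no unlock call appears anywhere:

                        ```rust
                        use std::sync::{Arc, Mutex};
                        use std::thread;

                        // Hypothetical helper: spawn `workers` threads that each add 1 to a shared counter.
                        fn count_with_guards(workers: usize) -> usize {
                            let counter = Arc::new(Mutex::new(0usize));
                            let handles: Vec<_> = (0..workers)
                                .map(|_| {
                                    let counter = Arc::clone(&counter);
                                    thread::spawn(move || {
                                        // lock() returns a MutexGuard; there is no unlock() to call.
                                        let mut guard = counter.lock().unwrap();
                                        *guard += 1;
                                        // The guard is dropped at the end of this closure,
                                        // which is what releases the lock.
                                    })
                                })
                                .collect();
                            for handle in handles {
                                handle.join().unwrap();
                            }
                            // Bind the value in its own statement so the temporary guard
                            // is dropped before `counter` is.
                            let total = *counter.lock().unwrap();
                            total
                        }

                        fn main() {
                            println!("{}", count_with_guards(8));
                        }
                        ```

                        This only shows the convenience; it does nothing about lock ordering, so deadlock is still possible with more than one mutex.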

                        (I make no comment on Rust’s relevance to the other comments in this thread, mostly because I don’t give a poop. This navel gazing about categorization is a huge unproductive waste of time from my perspective. ’round and ’round we go.)

                        1. 1

                          Pretty sure having a type system designed to prevent data races makes Rust count as “concurrent” for many (including me).

                          1. 3

                            The interesting bit is that the type system itself wasn’t designed for it. The elimination of data races fell out of the ownership/aliasing model when coupled with the Send and Sync traits.

                            The nomicon has some words on the topic, but that section gets into the weeds pretty quickly.
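
                            A rough sketch of how that looks in practice, assuming only the standard library. The helper names `require_send`, `require_sync` and `sum_on_other_thread` are hypothetical, invented for this example. The trait bounds are checked at compile time, and `thread::spawn` itself requires its closure to be `Send`:

                            ```rust
                            use std::sync::{Arc, Mutex};
                            use std::thread;

                            // Compiles only when T may be moved to another thread.
                            fn require_send<T: Send>() {}
                            // Compiles only when &T may be shared between threads.
                            fn require_sync<T: Sync>() {}

                            // Hypothetical helper: sum a vector on a second thread while the
                            // first thread keeps its own handle to the same data.
                            fn sum_on_other_thread(values: Vec<u64>) -> u64 {
                                let shared = Arc::new(Mutex::new(values));
                                let worker = {
                                    let shared = Arc::clone(&shared);
                                    // spawn demands a Send closure, so everything captured must be Send.
                                    thread::spawn(move || {
                                        let guard = shared.lock().unwrap();
                                        let total: u64 = guard.iter().sum();
                                        total
                                    })
                                };
                                worker.join().unwrap()
                            }

                            fn main() {
                                require_send::<Arc<Mutex<Vec<u64>>>>();
                                require_sync::<Arc<Mutex<Vec<u64>>>>();
                                // require_send::<std::rc::Rc<u64>>(); // rejected by the compiler: Rc is not Send
                                println!("{}", sum_on_other_thread(vec![1, 2, 3]));
                            }
                            ```

                            Swapping the `Arc` for an `Rc` in `sum_on_other_thread` is refused at compile time for the same reason as the commented-out line.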

                            1. 1

                              I see where you are going with that. The traditional use of it was expressing concepts in a concurrent way. It had to make that easier. The type system eliminates some problems. It’s a building block one can use for safe concurrency with mutable state. It doesn’t by itself let you express things in a concurrent fashion easily. So, they built concurrency frameworks on top of it. A version of Rust where the language worked that way by default would be a concurrent language.

                              Right now, it looks to be a sequential, multi-paradigm language with a type system that makes building concurrency easier. Then, the concurrency frameworks themselves built on top of it may be thought of as similar to DSL’s that are concurrent. With that mental model, you’re still using two languages: a concurrent one along with a non-concurrent, base language. This is actually common in high assurance where they simultaneously wrote formal specs in something sequential like Z and CSP for concurrent stuff. The concurrent-by-default languages are the rare thing with sequential and concurrent usually treated separately in most tools.

                  2. 3

                    If exploring such models, check out HLL-to-microcode compilers and No Instruction Set Computing (NISC).

                  3. 1

                    Interestingly, the Rust Wikipedia page makes a big deal about it trying to be a “concurrent” language. Apparently it’s not delivering if that is the major counter you gave.

                    1. 2

                      Occam is an example of a concurrency-oriented language. The core of it is a concurrency model. The Rust language has a design meant to make building safe concurrency easier. Those frameworks or whatever might be concurrency-oriented. That’s why they’re advertised as such. Underneath, they’re probably still leveraging a more sequential model in base language.

                      Whereas, in concurrency- or parallelism-first languages, it’s usually the other way around or sequential is a bit more work. Likewise, the HDL’s the CPU’s are designed with appear to be concurrency-first with them beating the designs into orderly, sequential CPU’s.

                      So, I’m not saying Rust isn’t good for concurrency or can’t emulate that well. Just that it might not be that at core, by default, and in the easiest style to use. Some languages are. That make more sense?

                      1. 0

                        Yes, I know all that. My point was that the Wikipedia page explicitly states Rust is a concurrent language, which, if true, means it fits into the idea of this post.

                  4. 3

                    Does Rust do much to address the non-sequential nature of modern high-performance CPU architectures, though? I think of it as a modern C – certainly cleaned up, certainly informed by the last 50 years of industry and academic work on PLT, but not so much trying to provide an abstract machine better matched to the capabilities of today’s hardware. Am I wrong?

                    1. 3

                      By the definitions in the article, Rust is not a low level language, because it does not explicitly force the programmer to schedule instructions and rename registers.

                      (By the definitions in that article, assembly is also not a low level language.)

                      1. 1

                        Ownership semantics make Rust higher-level than C.

                        1. 3

                          I disagree:

                          1. Parallelism would make whatever language higher-level than C too but the point seems to be that a low-level language should have it.
                          2. Even if true, ownership is purely a compile-time construct that completely disappears at run-time, so there is no cost, so it does not get in the way of being a low-level language.
                          1. 2

                            Parallelism would make whatever language higher-level than C too but the point seems to be that a low-level language should have it.

                            This premise is false: Parallelism which mirrors the parallelism in the hardware would make a language lower-level, as it would better mirror the underlying system.

                            Even if true, ownership is purely a compile-time construct that completely disappears at run-time, so there is no cost, so it does not get in the way of being a low-level language.

                            You misunderstand what makes a language low-level. “Zero-cost” abstractions move a language to a higher level, as they take the programmer away from the hardware.

                    2. 2

                      I came across the X Sharp high-level assembler recently, I don’t know if it’s low-level enough for you but it piqued my interest.

                      1. 2

                        There’s no point in a true low-level language, because we can’t access the hardware at that level. The problem (in this case) isn’t C per se, but the complexity within modern chips that is required to make them pretend to be a gussied-up in-order CPU circa 1993.

                      2. 7

                        As a particular point where I felt this went a bit sideways, how does the requirement to manually pack and pad structs mean C is not low level? That whole section was weird. Low level languages are expected to support easy compiler optimizations?
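
                        For anyone who hasn’t run into the packing and padding issue, here is a small sketch of it. It is written in Rust only because `#[repr(C)]` asks for the platform’s C layout rules; the struct names are made up, and the byte counts in the comments assume a target where `u32` is 4-byte aligned:

                        ```rust
                        use std::mem::size_of;

                        // #[repr(C)] keeps fields in declaration order, as a C compiler must.
                        #[allow(dead_code)]
                        #[repr(C)]
                        struct Unordered {
                            a: u8,  // 1 byte, then 3 bytes of padding so `b` is 4-byte aligned
                            b: u32, // 4 bytes
                            c: u8,  // 1 byte, then 3 bytes of tail padding
                        }

                        // Same fields, reordered by hand to shrink the padding.
                        #[allow(dead_code)]
                        #[repr(C)]
                        struct Reordered {
                            b: u32, // 4 bytes
                            a: u8,  // 1 byte
                            c: u8,  // 1 byte, then 2 bytes of tail padding
                        }

                        // Default Rust layout: the compiler is allowed to reorder fields itself.
                        #[allow(dead_code)]
                        struct CompilerChoice {
                            a: u8,
                            b: u32,
                            c: u8,
                        }

                        fn main() {
                            println!("declaration order:  {} bytes", size_of::<Unordered>());
                            println!("manually reordered: {} bytes", size_of::<Reordered>());
                            println!("compiler's choice:  {} bytes", size_of::<CompilerChoice>());
                        }
                        ```

                        Under the alignment assumption above, the first two come out as 12 and 8 bytes. The third is whatever the compiler picks, which is the freedom a C compiler does not have.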

                        1. 4

                          I feel like I read far too much about how C interacts with processors just to get to the main point:

                          It’s more accurate to say that parallel programming in a language with a C-like abstract machine is difficult, and given the prevalence of parallel hardware, from multicore CPUs to many-core GPUs, that’s just another way of saying that C doesn’t map to modern hardware very well.

                          1. 3

                            Yeah, it doesn’t match what the machine is actually doing. Therefore, it’s more a high-level language than a low-level one even if it’s low-level in some ways. That’s true for high-performance CPU’s but it might still fit for MCU’s or simple CPU’s. For the former, Cilk is probably closer than C to what the combos of compilers and CPU’s are doing.

                            1. 2

                              It does seem like a decent fit for MCUs. In that world, you’re typically without an OS and threads, so you’re dealing with memory directly (so it’s a giant array), and the CPU probably doesn’t have much in the way of pipelining or caching. The only gotcha is interrupts, since they’re the main source of input. Oddly enough, an MCU is mostly an event-driven device, so parallelism is still a huge concern.

                              The low-level connection can fool you too. If you’re working with an 8-bit MCU, things you’re used to in C can be really slow, such as anything related to floating point or using a short int.

                              1. 1

                                Compared to what?

                            2. 6

                              This article is yet another indication that the Clang/LLVM developer culture is seriously broken. The level of reasoning and technical accuracy would be noticeably bad in an HN rant. The basic premise is nutty: nobody invested billions of dollars and huge amounts of effort and ingenuity to cater to the delusions of C programmers. Processor development is extensively data driven by benchmarks and simulations that include code written in multiple languages. Just to mention one claim that caught my eye, caches are not a recent invention.

                              Consider another core part of the C abstract machine’s memory model: flat memory. This hasn’t been true for more than two decades. A modern processor often has three levels of cache in between registers and main memory, which attempt to hide latency.

                              Wow! “More than two decades” is right. In fact, caches were even present on the PDP-11s, and they were not a new idea back in 1980. Poor Dennis and Ken, developing a programming language in ignorance of the effects of cache memory. The rest of it is scarcely better.

                              The root cause of the Spectre and Meltdown vulnerabilities was that processor architects were trying to build not just fast processors, but fast processors that expose the same abstract machine as a PDP-11. This is essential because it allows C programmers to continue in the belief that their language is close to the underlying hardware.

                              WTF? Where is the editorial function on ACM Queue? Or consider this explanation of ILP in processor design.

                              so processors wishing to keep their execution units busy running C code rely on ILP (instruction-level parallelism). They inspect adjacent operations and issue independent ones in parallel. This adds a significant amount of complexity (and power consumption) to allow programmers to write mostly sequential code. In contrast, GPUs achieve very high performance without any of this logic, at the expense of requiring explicitly parallel programs.

                              Who knew that pipelining was introduced to spare the feelings of C coders who lack the insight to do GPU coding?

                              1. 4

                                nobody invested billions of dollars and huge amounts of effort and ingenuity to cater to the delusions of C programmers

                                People that worked in hardware often said the opposite was true. Mostly due to historical circumstances combined with demand. We had Windows and the UNIX’s in C. Mission-critical code went into legacy mode more often than it was highly-optimized. Then, optimization-oriented workloads like HPC and especially gaming demanded improvements for their code. In games, that was largely in C/C++ with HPC a mix of it and Fortran. Processor vendors responded by making their CPU’s really good at running those things with speed doubling every 18 months without work by software folks. Compiler vendors were doing the same thing for the same reasons.

                                So yeah, just because people were using C for whatever reasons, they optimized for those workloads and C’s style. Weren’t the benchmark apps in C/C++, too, in most cases? That would just further encourage improving C/C++ style along with what patterns were in those workloads.

                                “Who knew that pipelining was introduced to spare the feelings of C coders who lack the insight to do GPU coding?”

                                There were a lot of models tried. The big bets by CPU vendors on alternatives were disasters because nobody wanted to rewrite the code or learn new approaches. Intel lost a fortune on stuff like BiiN. Backward compatibility with existing languages and libraries came before everything else. Those are written in C/C++ that people are mostly not optimizing: just adding new features. So, they introduced other ways to speed up those kinds of applications without their developers using alternative methods. This didn’t stop companies from trying all kinds of things that did boost numbers. They just remained fighting bankruptcy despite technical successes (Ambric), niche scraping by (Moore’s chips), or priced too high to recover NRE (eg FPGA’s w/ HLS or Venray CPU’s). That just reinforced why people keep boosting legacy and high-demand systems written in stuff like C.

                                1. 5

                                  You are not going to find anyone who does processor design who says that.

                                  1. Processors are highly optimized for existing commercial work loads - which include significant C and Java - true
                                  2. “processor architects were trying to build not just fast processors, but fast processors that expose the same abstract machine as a PDP-11. This is essential because it allows C programmers to continue in the belief that their language is close to the underlying hardware.” - not even close.

                                  First is a true statement (obvious too). Second is a mix of false (PDP-11??) and absurd - I’m 100% sure that nobody designing processors cares about the feelings of C programmers and it’s also clear that the author doesn’t know the slightest thing about the PDP-11 architecture (which utilized caches, a not very flat memory model, ILP, etc.)

                                  Caches, ILP, pipelining, out-of-order execution - all those pre-date C. SPEC benchmarks have included measurements of Java workloads for decades, Fortran workloads forever. The claim “so processors wishing to keep their execution units busy running C code rely on ILP (instruction-level parallelism).” is comprehensively ignorant. ILP works with the processor instruction set fundamentals, not at the language level. To keep a 3GHz conventional processor busy on pure Erlang, Rust, Swift, Java, JavaScript loads, you’d need ILP, branch prediction, etc. as well. It’s also clear that processor designers have been happy to mutate the instruction set to expose parallelism whenever they could.

                                  “The key idea behind these designs is that with enough high-level parallelism, you can suspend the threads that are waiting for data from memory and fill your execution units with instructions from others. The problem with such designs is that C programs tend to have few busy threads.”

                                  Erlang’s not going to make your thread switching processor magically work. The problem is at the algorithmic level, not the programming language level. Which is why, on workloads suited for GPUs, people have no problem writing C code or compiling C code.

                                  Caches are large, but their size isn’t the only reason for their complexity. The cache coherency protocol is one of the hardest parts of a modern CPU to make both fast and correct. Most of the complexity involved comes from supporting a language in which data is expected to be both shared and mutable as a matter of course.

                                  Again, the author is blaming the poor C language for a difficult algorithm design issue. C doesn’t say much at all about concurrency. Only in the C11 standard is there an introduction of atomic variables and threading (it’s not very good either) but this has had zero effect on processor design. It’s correct that large coherent caches are a design bottleneck, but that has nothing to do with C. In fact, shared-memory multithreaded Java applications are super common.

                                  etc. etc. He doesn’t understand algorithms or processor design, but has a kind of trendy psycho-babble hostility to C.

                                  1. 3

                                    Good counterpoints. :)

                                    1. 2

                                      The article argues that modern CPU architecture spends vast amounts of die space supporting a model of sequential execution and flat memory. Near as I can tell, he’s absolutely correct. You, otoh, seem to have not understood that, and moved straight to ad hominem. “He doesn’t understand algorithms or processor design”; please.

                                      1. 4
                                        1. The claim that “modern CPU architecture spends vast amounts of die space supporting a model of sequential execution and flat memory” is totally uncontroversial - if you have sensible definitions of both sequential execution and flat memory.

                                        2. The claim that “The features that led to these vulnerabilities [Spectre and Meltdown], along with several others, were added to let C programmers continue to believe they were programming in a low-level language,” is absurd, indicates a lack of knowledge about processor design, and offers a nutty theory about the motivations of processor architects.

                                        3. The claim “The root cause of the Spectre and Meltdown vulnerabilities was that processor architects were trying to build not just fast processors, but fast processors that expose the same abstract machine as a PDP-11.” is similarly absurd, and further comments indicate that the author believes the PDP-11 architecture predated the use of features such as caches and instruction-level parallelism, which is an elementary and egregious error.

                                        4. The claim “Creating a new thread [in C] is a library operation known to be expensive, so processors wishing to keep their execution units busy running C code rely on ILP (instruction-level parallelism)” involves both a basic error about C programming and a basic misunderstanding of the motivations for ILP in computer architecture. It precedes a claim that shows that the author doesn’t understand that C-based code is widely used in GPU programming, which is another elementary and egregious error.

                                        5. Those are not the only errors in the essay.

                                        6. “Ad hominem” involves attempting to dismiss an argument based on claims about the character of the person making the argument. If I had argued that Dave Chisnall’s personal attributes invalidate his arguments, that would be ad hominem. I did the opposite: I argued that the gross technical errors in Chisnall’s argument indicate that he does not understand computer architecture. That’s not use of ad hominem, but is ad argumentum - directed at the argument, not the person.

                                        Thanks.

                                        vy

                                        1. 2

                                          The claim that “The features that led to these vulnerabilities [Spectre and Meltdown] , along with several others, were added to let C programmers continue to believe they were programming in a low-level language,”

                                          Of course, only the author could say that for sure, but my interpretation of this, and other similar sentences in the article, is not that modern processors are somehow trying to satisfy a bunch of mentally unstable programmers, but rather that they are implementing an API that hides a vast amount of complexity and heuristics. As far as I understand, this claim is not in question in this thread.

                                          The second point, which in my opinion is a bit too hidden in somewhat confusing prose, is that a lower level of programming could be possible, by allowing programs to control things like branch prediction and cache placement for example. That could also simplify processors by freeing them from implementing heuristics without having the full context of what the application may be trying to achieve, and grant full control to the software layer so that better optimisation levels could be reached. I think that is a valid point to make.

                                          I don’t really like the connection to Spectre, which I think is purely anecdotal, and I think that framing the exposition as a discussion about whether C is or is not a low-level programming language, and what C programmers believe, muddies what I think is the underlying idea of this article. Most of the article would be equally valid if it talked about assembly code.

                                          1. 1

                                            I think it would be a really good idea for processor designers to allow lower level programming, but if that’s what the author is attempting to argue, I missed it. In fact, his lauding of Erlang kind of points the other way. What I got out of it is the sort of generic hostility to C that seems common in the LLVM community.

                                            1. 1

                                              I think (one of) his point(s) is that if CPU designers abandoned the C abstract machine/x86 semantics straitjacket, they could build devices that utilize the silicon much more effectively. The reason they don’t do that is ofc not that they are afraid of C programmers with pitchforks, but that such a device would not be marketable (because existing software would not run on it). I don’t understand your Erlang remark though. I believe a CPU that did not do speculative execution, branch prediction, etc, but instead, say, exposed thousands of sequential processing units, would be ideal for Erlang.

                                              1. 3

                                                This was a very interesting machine: https://en.wikipedia.org/wiki/Cray_MTA (it did a thread switch on every memory load).

                                                The success of GPUs shows that: 1) if there are compelling applications that fit a non-orthodox processor design, software will be developed to take advantage of it and 2) for many parallel computer designs, C can easily be adapted to the environment - in fact, C is very widely used in GPUs. Also, both AMD and Intel are rushing to extend vector processing in their processors. It is curious that Erlang has had so little uptake in GPUs. I find it super awkward, but maybe that’s just me.

                                                Obviously, there are annoying limitations to both current processor designs (see https://lobste.rs/s/cnw9ta/synchronous_processors ) and to C. My objection to this essay was that it confirmed my feeling that there are many people working on LLVM who just dislike C for not very coherent reasons. This is a problem because LLVM/Clang keep making “optimizations” that make C even more difficult to use.

                                2. 3

                                  There are truly modern CPU designs that use their die space for more, simpler processors, like Tilera and the Adapteva Epiphany, but they’re impossible to actually get in a usable form in exchange for money as a developer. Or you can get boards, but it’s a cut-down version, designed for some specific IoT niche, so they’ve got no high-speed IO, or have half a megabyte of RAM.

                                  1. 2

                                    Oh yeah, I feel you on that: had the same problem when wanting a badass system with those same two devices. Especially Epiphany-V recently since I definitely have uses for 1,000 cores. I followed Tilera on and off since the MIT RAW project. Its non-wide availability has been going on a long time. Also saw three Tilera’s used to power a 100Gbps NIDS one time. Badass. In my desktop or laptop? “So sorry…”

                                  2. 3

                                    This is a dreadful article. There is nothing in the design of C, for example, that requires thread creation to be a big deal. The register renaming operations of CPUs have nothing to do with C semantics. etc.

                                    1. 2

                                      As far as thread creation goes, I thought C didn’t address concurrency at all, with it left to library authors or maybe a later standard. If true, that would mean concurrency had to be bolted on to a PDP-11 abstract machine run by hardware that was nothing like that. That’s two layers of indirection.

                                      Other systems languages like Ada, Modula-3, and Active Oberon did address concurrency. Still not low-level but at least a concept in those languages.

                                      1. 1

                                        The abstract machine that the later C standard has come up with is pretty stupid and does not fit with the design of the language. Essentially they have come up with all sorts of useless and complex constraints. But there is no reason why thread creation has to be hard - this is an advantage that comes from the language not providing any concurrency semantics at all. You could arrange outside the language for multiple threads to run the same code and come up with a data sharing/control strategy that is a good fit for the machine. Or you could follow DJB’s example, and directly program vector operations. What you can’t do is have a compiler automatically adapt to different concurrency regimes, but the compiler often doesn’t know anything useful about these either.