1. 22

As I write this, I’m using just 6% of my 16-core CPU attempting to run Microsoft’s new Application Inspector on the Chromium source tree, which clocks in at 6 GB. This got me thinking about my previous experience with multi-threaded programming, and how difficult it was to write compared to single-threaded applications.

If a language/compiler was designed from the ground-up for multi-core programming, what would it look like? In theory, could a compiler default to multi-threading where possible? I imagine since we as developers can look at something and see that it can be multi-threaded, shouldn’t a compiler be able to do the same? I imagine the compiler could detect parts of the code that can be distributed across multiple cores, and detect when you’re attempting to perform an operation that might result in data corruption. Just as current languages can determine if a variable is in scope, this theoretical language could detect when a given piece of data can be safely accessed across multiple cores.

One language that I think gets pretty close is Go, but it’s still single-threaded by default, and requires quite a bit of effort to translate something from single-threaded to multi-threaded.

While my question is more theoretical, I’d also be interested in hearing about languages that have really great multi-threading support, especially if the implementation details are hidden from the developer.

  1.  

  2. 18

    I would say that both Erlang and OpenCL are made for multicore. They are very different languages, because the problems they solve are very different, but they still target multicore systems. There are different aspects of parallelism that they target.

    As for taking a serial program and making use of parallelism, this is actually possible with C and Fortran. With gcc you can get, if you’re careful, instruction-level parallelism through auto-vectorization. I believe that Intel’s compiler can even give you automatic multithreading.

    This historical presentation by Guy Steele is an amazing introduction to the problem: What Is the Sound of One Network Clapping? A Philosophical Overview of the Connection Machine CM-5. The people at Thinking Machines Corporation had it down to an art.

    1. 3

      Intel ISPC too. (and that supports both SIMD and multicore scaling)

      1. 2

        Thank you for that link, it was incredibly educational.

      2. 13

        Clojure and Erlang were both designed for the ground up around making concurrency the default, although in the former it was just the language design and compiler, not the whole runtime. The main thing Erlang and Clojure have in common is an insistence on immutable data structures.

        Detecting whether a function is safe to parallelize requires way more detailed information than most compilers have; that would take a type system along the lines of Haskell or more expressive. If you want to learn about automated attempts to introduce concurrency I expect the Haskell people are the only ones even capable of beginning to answer that question. (Outside research-only languages)

        1. 4

          Erlang’s concurrency was designed around multiple communicating machines. It happened that when we got multiple cores on single machines, this model still worked very well.

          OTP (Erlang is usually run on OTP) does have shared mutable state available - for optimisation. Erlang Term Storage or ‘ets’ being an example. I believe Clojure has something called Agent but I don’t know if it’s a similar concept.

          1. 3

            Clojure agents are a way to async queue operations on shared data. It’s basically like an atom, but instead of happening immediately the operation happens at some point in the future.

            It also has software transactional memory but I don’t think anyone really uses that.

          2. 2

            Although I do not consider myself one of the Haskell people, I stumbled across a talk on Improving implicit parallelism in Haskell a while ago.

          3. 10

            Occam is what you’re looking for. :)

            Programs are built up of PAR and SEQ blocks. Statements in PAR blocks are executed in parallel and statements in SEQ blocks are executed sequentially. CSP-style channels are used to send data between communicating processes.

            Occam’s IDE was also interesting, being heavily centered around code folding.

            1. 4

              Along those lines, there’s a whole sub-field worth of process calculus theory, too. Concurrency is not parallelism, though. On a real multi-core computer with a relatively small number of processors and a fixed bus topology, efficient scheduling becomes the real challenge.

            2. 9

              ParaSail is Tucker Taft’s attempt at a parallel programming language. Statements and expressions are evaluated in parallel unless there are data dependencies between them.

              1. 1

                Wow. I’ve only read the homepage, but it already is hitting all the points I was thinking about! Particularly this bit:

                Every ParaSail expression is defined to have parallel evaluation semantics. That is, given a ParaSail expression like F(X) + G(Y), the language rules ensure that it is safe to evaluate F(X) and G(Y) in parallel. The compiler makes the decision based on complexity or other criteria whether a given computation should be created as a potentially parallel activity. An underlying scheduler then maps these potentially parallel activities to particular processing resources, by default using a work-stealing approach, which provides load balancing across processors while also providing good locality of reference and minimal cache contention.

                Edit: From the reference manual:

                Parallelism should be built into the language to the extent that it is more natural to write parallel code than to write explicitly sequential code, and that the resulting programs can easily take advantage of as many cores as are available on the host computer.

                Edit 2: here’s another goodie

                parallel activities are identified automatically by the compiler, and language rules prevent data races between readers and writers of the same object, while explicitly concurrent objects can be used for safe synchronization when concurrent access from multiple readers and writers is required, without any need for explicit lock/unlock or signal/wait.

                ParaSail really looks like exactly what I was thinking about.

              2. 6

                Someone asked the “dual” of your question here, i.e. you can look at the same problem from the data side (memory hierarchy) or code side (multiple cores):

                Are there languages that have first-class support for representing/optimizing memory hierarchy characteristics?

                https://news.ycombinator.com/item?id=18009581

                Sequoia is the first thing I thought of, and researcher Elliot Slaughter showed up and dumped a wealth of information on it and the successor Legion. To me there is a strong relation to multicore MapReduce – which I think of as “imperative in the small and functional in the large”.

                https://news.ycombinator.com/item?id=18010382

                Also I looked through the recent Sorbet type checker and to me it very much looks like a multicore (not distributed) MapReduce (or maybe two of them strung together).

                It’s has an “hourglass” shape, with the “thin waist” being the merge of GlobalState.

                https://github.com/sorbet/sorbet/blob/master/docs/pipeline.md

                I think parallel type checking is going to become the norm in the next 10 years and this architecture makes a lot of sense. It makes your code look very very different. They did it in C++ but it’s a little like its own programming language, i.e. eschewing the use of raw pointers so that data can be serialized / cached / merged.

                Multicore systems will be programmed like distributed systems. The only difference is that your your communication paths are more reliable between cores than between machines. But the performance difference is still there (cache and NUMA) and it makes good (fast) multicore programs look like good distributed programs.

                So if you want to see what a language designed for multi-core would look like, then I would look at big data systems.

                This 2009 paper is philosophically related:

                Your computer is already a distributed system. Why isn’t your OS?

                And the obvious follow is “Why isn’t your programming language”? In fact this is one of the reasons I’m working on a shell – because it’s very much suited to using a multicore system like a distributed system. It’s a “coordination language”.


                edit: follow up comment on Sorbet

                1. 5

                  VHDL or Verilog are such beasts practically.

                  1. 4

                    A high-level, abstract perspective: If you write expressions as a dataflow graph, it is trivial to figure out which operations can be computed in parallel. The complicated factor is that parallelism induces some overhead. The million-dollar question becomes: When is it worthwhile to actually spin up a separate thread? If you want to compute (a + b) + (c + d), you don’t want to compute a + b and c + d in separate threads. Unless, maybe, a, b, c, and d represent numbers with a gazillion digits.

                    If you program in a purely functional language, the dataflow graph is easy to obtain. For imperative languages, the relation is a bit more obscure (consider a for-loop, for example).

                    1. 4

                      Chapel sort of does this with parallel blocks and implicit parallelism in for loops.

                      Julia puts a lot of effort into this as well: there are easy to use parallelism tools and a fair amount of work is done to enable SIMD automatically and manually.

                      1. 3

                        There has been a wealth of research in parallel logic languages (Concurrent Prolog, GHC, Parlog, Strand) which unfortunately got lost with the death of Japan’s 5th generation project. I find this approach (implicit fine-grained parallelism) very natural and elegant.

                        1. 3

                          Chapel has lots of thoughts on data and instruction parallelism. I remember that it was the main survivor/winner of the parallel language project/contest with ParaSail/X10/Fortress.

                          1. 3

                            I worked with X10 for a while. It is a research language with the initial goal to replace Fortran and C++/MPI/OpenMP. The target were computer clusters/supercomputers. The language contains syntax for concurrency, distributed message passing, barrier synchronization, and transactional memory. The most interesting aspect is that programs are deadlock-free by construction.

                            While the vision is commendable, the realization is quite broken:

                            • Transactional memory was implemented with a global lock, thus slow and you never used it.
                            • The deadlock-free by construction guarantee disappears once you use conventional locks. Of course locks became available in the standard library at some point as a faster substitute for the transactional memory feature.
                            • The message passing implicitly created closures and serialized them for transport. It is an easy mistake to reference too much and then a lot of unnecessary data is sent around. You notice it is slow but finding the mistake is hard. Too much magic.
                            1. 1

                              Did you have a chance to compare X10 with UPC or Chapel? I’ve not really used any PGAS languages, but they’ve always interested me.

                              1. 1

                                No, not really. The choice was made before I joined.

                                In my biased opinion PGAS is too much magic. It tries to treat shared-memory and non-shared-memory parallelism with the same concept. It looks simple and nice on paper but when performance matters in practice you want to have the clear control.

                                PGAS does provide some interesting research topics but I have not seen a killer application yet.

                            2. 3

                              Go is not single threaded by default in versions after version 1.5. GOMAXPROCS governs this, and it is configurable, and defaults to the number of CPUs available.

                              1. 4

                                Sorry, I meant that unless you use goroutines, Go is single-threaded. I’m imagining a programming language that even the most basic primitives are all multi-threaded.

                                1. 2

                                  Goroutines are basic primitives in Go. That’s the point of the thing.

                                  1. 7

                                    Right, which is why I specifically called out Go as being close. That being said, Go’s concurrency is still opt-in, as in I have to specifically tell go “hey, this section right here, distribute it across multiple cores” as opposed to Go detecting that a particular piece of code can be run across multiple cores without my telling it.

                                  2. 1

                                    Kind of depends what you’re writing; Golang is very good at making you not realize you’re on a separate stack/goroutine from the entry point to prevent blocking and the like.

                                    An example of this is net/http each request is on it’s own routine (which is its own can of worms), but to the view of your caller it’s just a function you write, you don’t ever have to type go ... the standard library does it for you.

                                    But yes, you have to explicitly kick out your own goroutines for parallel processing but there are a lot of primitives to help with this (i.e. channels) and doing so requires you to think through the implications (memory usage) so IMHO it’s a fair trade-off.

                                2. 2

                                  Extreme limits on flow control. In Rust, a sufficiently perverse foo() can allocate memory, recurse, spawn a thread, close stderr, panic, never return, and shoot your dog. The compiler must give consideration to these possibilities. Not so in GLSL.

                                  1. 2

                                    Making things parallel can backfire. Some of my attempts to PARALLEL ALL THE THINGS!!! in C++ and Go ended up making my code slower, because of the overhead of coordination (whether mutexes or channels or dispatch queues or whatever.) It seems you have to be careful of the granularity, letting things run long enough without interruption; I’m not sure compilers are good enough to make that determination yet.

                                    1. 1

                                      Go.

                                      Go is not single-threaded by default. As of Go 1.5, the default value of GOMAXPROCS is the number of CPUs (whatever your operating system considers to be a CPU) visible to the program at startup.

                                      All goroutines are multiplexed through a thread pool and cooperatively scheduled.

                                      1. 1

                                        Except when running CPU-hungry applications (say, number-crunching apps), I’d say that’s very good that a multi-core CPU shows very low usage (and that means also less power wasted).

                                        As usual, you shouldn’t look at the “blue” peaks (CPU) or the “green” ones (I/O), but the “red” ones (I/O wait), if any. Except when you’re compiling LLVM/Firefox/gcc/Linux from scratch, there’s nothing better than old hardware doing its job without having to top out CPU usage and without being bottlenecked by I/O wait.

                                        I’ve got a Core i5 7200U and I feel ashamed it’s constantly wasting ~20% CPU on an idle desktop. Well, there’s the Corsair USB keyboard/mouse driver (ckb-next) constantly pumping RGB LEDs data over USB (wasted 3.5 hours CPU in five days uptime); there’s acpid wasting ~2 minutes CPU per day; there’s X11 on Wayland with its wasted 2.5 hours; there’s a solid 8 hours wasted by gnome-shell (wtf) and I accidentally found that it sometimes does some “wait” by repeatedly calling sleep .1 as an executable (don’t have enough info to file a bug yet, but I wish I knew the name of the moron who invented that thing), and there are Firefox and Opera wasting no less than 7-15% CPU time each while idle on some widgetless Google blog (I uninstalled Chrome long ago and it had the same issues).

                                        That means I’m burning a lot of power (also power bills) over software that keeps wasting CPU cycles when it shouldn’t (OK, Corsair USB RGB peripherals are badly designed: “have the CPU continuously update the LED values”: ). I mean, the desktop applets shouldn’t be “running” when the desktop is in “monitor off, power saving mode”, should they? The browser shouldn’t be “running” anything that is either off the current tab or invisible, should it? (Yet I admit it’s mostly webdeveloper’s fault). In five days of uptime, I see 22% CPU wasted (more than 24 hours)… ouch!

                                        1. 1

                                          Ninja (the build system). It will parallelize all the things unless you restrict parallel jobs to 1.

                                          Make is similar, but has the wrong default.

                                          1. 1

                                            There are a couple different ways, illustrated by languages that already exist.

                                            As some folks have mentioned, functional languages are pretty well suited for implicit parallelism, and some (ex., Erlang) already do so. Making parallelism implicit in a functional language can be done trivially (if not particularly intelligently) by ensuring that side-effecting gets done after all dependent computation has finished & spawning a new thread for every possible branch. (This works even better for logic/constraint languages, especially if they support guards.) You can keep the results of whatever branch succeeds first, kill the others, and merge in. Standard prolog can’t be trivially made to support implicit parallelism because so much existing code relies upon execution order, but if you wrote code that used guards instead of relying upon statement order to fail early, you could stay pretty close to the standard.

                                            Some folks have mentioned Go, but that’s really a less extreme form of the pipelining approach familiar from unix shells. People don’t necessarily realize that every command in a shell pipeline is coresident & running in parallel, and that on an n-core system you can (with careful design of data flow) saturate all n cores with approximately n steps. This is why even naive shell scripts often outperform hadoop: the straightforward and idiomatic way to write scripts will avoid waiting for the largest batch to finish before compiling resultsets & other essentially-unavoidable problems of a general purpose map-reduce framework.

                                            Languages with a smalltalk-style object model are also amenable to implicit parallelism. If every object has its own (green) thread & manages its own inbox, then inter-object communication can be allowed to be asynchronous. You’d just need to distinguish between asynchronous and blocking messages – which Io and Julia do with a sigil.

                                            1. 2

                                              Picking a nit: Erlang is built for parallelism and concurrency, but it’s not implicit; instead the language provides first-class concurrency primitives and allows the developer, and generally not the compiler, to control the execution topology by combining those primitives.

                                              1. 1

                                                Good to know. I was under the impression that different structures were under different threads but messages were synchronous by default, but I never had a full & confident understanding of erlang (and haven’t touched it in 10 years).

                                                1. 1

                                                  All message passing is async in Erlang; synchronous messaging just means you send a message then wait for a reply. All actors are loosely coupled.

                                                  1. 1

                                                    But actors are not running in separate threads by default? I was under the impression that they were.

                                                    1. 1

                                                      They are in separate conceptual threads, as they are fully isolated, but in reality the Erlang VM (aka the BEAM) is running one OS thread per CPU core and the actors (Erlang calls them “processes”) are scheduled to run on these OS threads. Erlang processes are lightweight enough that a single server can run millions of them at once before overwhelming the Erlang scheduler.

                                                      1. 1

                                                        Oh, ok. This is basically what I thought it was doing. I wasn’t trying to distinguish between logical threads & OS threads. This makes the erlang model basically the same as io’s model (though io uses cooperative coroutines & not OS threads, with task switching done by the interpreter, so isn’t in effect actually concurrent at all).

                                            2. 1

                                              This article has both categories and lists of them that’s a mix of parallel/concurrent languages and those that just do it well. Chapel is my recommendation since it’s actually being maintained and works in many types of deployments.

                                              I’ll throw a few random ones in. Cilk was an older one that tried to stay C-like. Lustre, used in safety-critical industry, is good example of dataflow lang. Futhark is functional for GPU’s. Parlog for logical style. SequenceL is a commercial one in two categories.

                                              1. 1

                                                Pony (ponylang.io) is by far the best approach I have seen in this area.

                                                It’s Actor based model basically means that a program will run twice as fast on a 16-core machine compared to an 8-core.

                                                Actors looks almost exactly like classes, only they can define “async” methods, which return immediately and do work concurrently. Easy to understand and easy to model and reason about.

                                                As far back as Smalltalk method call was talked about as “sending messages to objects”, which then can be thought of as running concurrently.

                                                It’s a great mystery to me why almost no major language (Python, Java, C#, etc) does not implement this.