1. 2

    The idea of a language specifically targeting GPUs is interesting. One thing I’d mention here is that such a language actually would not have to be only vector-based.

    A project I’ve been interested in for a bit is Hank Dietz’s MOG (“MIMD On GPU”); it translates general-purpose parallel code (MIMD: multiple instruction, multiple data) to the GPU with at most a modest slowdown (roughly 1/6 speed in the worst case, with vectorizable instructions running at nearly full speed).

    See: http://aggregate.org/MOG/

    1. 3

      GPUs are quite a bit more flexible in their control flow than traditional SIMD machines (NVIDIA calls this SIMT), so I think it’s quite clear that you could have each thread do quite different work. The problem is that this is going to be very inefficient, and I don’t think a 6x slowdown is the worst it can get. Worst-case warp/wavefront divergence is a 32x slowdown on NVIDIA and a 64x slowdown on AMD (or maybe 16x; I find the ISA documentation unclear). Further, GPUs depend crucially on certain memory access patterns (basically, to exploit the full memory bandwidth, neighbouring threads must access neighbouring memory addresses in the same clock cycle). If you get this wrong, you’ll typically face an 8x slowdown.

      Then there’s a number of auxiliary issues: GPUs have very little memory, and if you have 60k threads going, that’s not a lot of memory for each (60k threads is a decent rule of thumb to ensure that latency can be hidden, and if the MOG techniques are used it looks like there’ll be a lot of latency to hide). With MIMD simulation, you probably can’t estimate in advance how much memory each thread will require, so you need to do dynamic memory management, likely via atomics, which seems guaranteed to be a sequentialising factor (but I don’t think anyone has even bothered trying to do fine-grained dynamic allocation on a GPU).
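
      To make the atomics point concrete, here is a rough sketch (my own, in plain C11 rather than GPU code, and assuming a pre-reserved slab) of the kind of fine-grained bump allocation you would end up with; every thread funnels through a single shared counter, which is exactly the sequentialising pressure I mean:

        #include <stdatomic.h>
        #include <stddef.h>
        #include <stdint.h>

        /* Hypothetical shared pool: one big slab carved up with an atomic
         * bump pointer. On a real GPU you would use the device's atomics,
         * but the contention pattern is the same. */
        typedef struct {
            uint8_t        *base;     /* start of the pre-reserved slab    */
            size_t          capacity; /* total bytes available             */
            _Atomic size_t  next;     /* bump offset shared by all threads */
        } pool_t;

        void *pool_alloc(pool_t *p, size_t nbytes)
        {
            /* every caller serialises on this one counter */
            size_t offset = atomic_fetch_add(&p->next, nbytes);
            if (offset + nbytes > p->capacity)
                return NULL;          /* slab exhausted; there is no free() */
            return p->base + offset;
        }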

      Ultimately, you can definitely make it work, but I don’t think there will be much point to using a GPU anymore. I also don’t think the issue is working with vectors, or data-parallel programming in general. As long as the semantics are sequential, that seems to be what humans need. Lots of code exists that is essentially data-parallel in a way roughly suitable for GPU execution - just look at Matlab, R, Julia, or Numpy. (Of course, these have lots of other issues that make general GPU execution impractical, but the core programming model is suitable.)

      1. 1

        Thank you for the reply.

        I assume you have more experience in making things work on a GPU than I do. Still, I’d mention that the MOG project tries to eliminate warp divergence by compiling the code to byte code and running it in a byte-code interpreter. The code is a tight loop of conditional actions. Of course, considerations of memory access remain. I think the main method involved devoting a bit of main memory to each thread.

        I believe Dietz is aiming to allow traditional supercomputer applications like weather and multibody gravity simulations to run on a GPU. One stumbling block is that the people who buy supercomputers work for large institutions and aren’t necessarily that interested in saving every last dime.

    1. 1

      My main question: is it pronounced “futt hark”, “footh ark”, or “futh ark”?

      1. 1

        Etymologically, foo-thark, with th pronounced as in the. But fut-ark is also common.

      1. 2

        This is nice. The best Makefiles are nearly empty and make heavy use of templates and implicit rules. I would make a couple small changes:

        1. I’m not sure why the target that generates dependency Makefile fragments renames the generated file. This should work:

          %.d: %.c Makefile
          $(CPP) $(CPPFLAGS) -M -MM -E -o "$@" "$<"

        2. You might want to prevent generating Makefile fragments for the clean goal. A conditional include can help:

          ifneq ($(MAKECMDGOALS),clean)
          -include $(DEPS)
          endif

        3. Remaking the object files when the Makefile changes can be done simply with:

          $(OBJS): Makefile

        1. 3

          While I also do use templates and implicit rules when convenient (your example is certainly one of these), my experience is that Makefiles are best when they try not to be clever, and simply define straightforward from->to rules with no room for subtlety. As an example, make treats some of the files produced through chains of implicit rules as temporary, and will delete them automatically. In some cases, I have found this will cause spurious rebuilds. There is some strangely named variable you can set to avoid this deletion, but I’d rather such implicit behaviour be opt-in than opt-out.

          Sometimes a little duplication is better than a little magic.

          1. 3

            Yes, the special target .PRECIOUS can be used to mark intermediate files that should be kept. Cf. https://www.gnu.org/software/make/manual/make.html#index-_002ePRECIOUS-intermediate-files

            My recommendation for anyone who wants to learn to effectively use make: Read the manual. All of it. Keep it handy when writing your Makefile.

            People have already done the hard work of getting it to work right under most circumstances. I don’t consider it clever to stand on their shoulders.

        1. 2

          I 100% sympathize from the perspective of a scientist… But most of computer and program design since the 1950s has been computer engineering, which includes the uncertain art of choosing tradeoffs between perfect science and ugly practical needs. This case is no different, even when we have billions of transistors at our command.

          More specifically, what this article discusses is a trade-off in GPU computation overhead vs. aggregate performance. This is an optimization problem. The tradeoffs that make sense now are not the ones that made sense ten years ago, and will not be the ones that make sense ten years from now when the balance of CPU computation speed vs memory bandwidth and GPU computation speed vs CPU<->GPU transfer speed is different.

          So what it sounds like, without being critical, is that the compiler writer needs to step back from writing compilers, consider this problem as a more abstract balance of trade-offs, and consider their goals to see where they fall in the spectrum of options. Then go back to writing compilers with that goal in mind.

          1. 2

            You can always change the compiler as hardware changes. That’s the point of a compiler - that you can put local, hardware-specific information in it, and then change it as hardware changes, without changing the code that uses the compiler.

          1. 13

            I’m upvoting this mostly in the hope that HPC programmers will comment on it. I’m quite curious about how HPC programmers actually see the world, but they appear quite elusive online, or maybe I just can’t find their meeting places. I only ever meet them at conferences and such.

            1. 15

              I have a master’s degree in CS with a focus in HPC. This article is mostly correct for commodity HPC. The lack of fault-tolerance primitives in MPI was a pain: you’d start a job and hope for no serious errors, checkpointing as often as possible, and then if something went wrong (hardware, net, etc.) you’d have to restart. HPC for me was molecular dynamics simulations and things like that; the control MPI gives you was needed if you were going to run your systems on large supercomputer setups like the US government has. Still, that would often require porting and compiler fun to make things work.
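
              For readers who have not touched MPI: the checkpoint/restart dance is typically hand-rolled, roughly like the sketch below (the file naming, checkpoint interval, and state layout are made up for illustration; real codes use their own formats):

                #include <mpi.h>
                #include <stdio.h>
                #include <stdlib.h>

                /* Hand-rolled checkpointing: every CHECKPOINT_EVERY steps, each rank
                 * dumps its local state to its own file. If the job dies, you resubmit
                 * it and reload the newest complete set of checkpoints yourself. */
                #define CHECKPOINT_EVERY 100

                static void checkpoint(int rank, int step, const double *state, size_t n)
                {
                    char path[64];
                    snprintf(path, sizeof path, "ckpt_step%06d_rank%04d.bin", step, rank);
                    FILE *f = fopen(path, "wb");
                    if (f) {
                        fwrite(state, sizeof(double), n, f);
                        fclose(f);
                    }
                }

                int main(int argc, char **argv)
                {
                    MPI_Init(&argc, &argv);
                    int rank;
                    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

                    size_t n = 1 << 20;                  /* this rank's slice of the system */
                    double *state = calloc(n, sizeof(double));

                    for (int step = 0; step < 10000; step++) {
                        /* ... one timestep, halo exchanges, etc. ... */
                        if (step % CHECKPOINT_EVERY == 0) {
                            MPI_Barrier(MPI_COMM_WORLD); /* keep the set of files consistent */
                            checkpoint(rank, step, state, n);
                        }
                    }

                    free(state);
                    MPI_Finalize();
                    return 0;
                }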

              I wouldn’t say HPC is dying; it’s just diffusing from “hard physics” (mostly rote floating-point vectorized calculations) into the worlds of bio and data science, which have different needs and are often just as much about data processing as anything. Fields like astronomy and particle physics have been dealing with scads of data already and have their own 20-30-year-old data formats and processes.

              The article is correct: the sort of big 1000-core simulation groups are limited to maybe 10-25 research groups worldwide, if that (in the multi-disciplinary world of materials science), and in my time in grad school I met most of the big names. That’s not a market of people; that’s a niche user group with their own needs, and they can do what they want with the primitives available. I don’t know much about large-scale simulation (i.e. work that isn’t ‘embarrassingly parallel’ and requires near-lockstep execution across tons of machines) in other fields like civil engineering, molecular bio, etc., but I’m sure their user bases are small as well.

              In the end, the needs of a handful of users won’t sway the direction of the market. See, for instance, how the bet on the Cell processor (as in Roadrunner) didn’t make Sony/Toshiba/IBM pursue the design further (even though it has influenced CPU/GPU designs to this day). There’s your ramble. :)

              1. 2

                I’m here to agree with this. HPC traditionalists are largely struggling to achieve performance in worlds like deep learning, where their tools and architectures are designed for the wrong problem (e.g. Lustre is great for random access to files, not so great for AI applications where you do a whole lot more reading than writing).

                Meanwhile, cloudy novelty fans struggle to achieve performance in areas where traditional HPC both performs well and has been optimised over decades. I remember a fluid simulation demo, though not the domain, where some Apache-stack people wanted to show off how “performant” Apache-stack was. The MPI code was done before the MapReduce thing had finished launching.

            1. 4

              We really have three options open to us:

              And yet none of these options include what most good C libraries do, which is let the programmer worry about allocation.

              1. 6

                That’s not really a good fit for a high-level language, nor if you want to expose functionality that may need to do allocation internally. I do think that the module approach (where the programmer specifies the representation) is morally close.

                1. 4

                  Wait, why do we want programmers to worry about allocation? Isn’t that prone to error and therefore best automated?

                  1. 3

                    Because the programmer theoretically knows more about their performance requirements and memory system than the library writers. There are many easy examples of this.

                    1. 5

                      Theoretically, yes. In practice, it is an enormous source of bugs.

                      1. 3

                        In practice, all programming languages are enormous sources of bugs. :)

                        But here, speaking from game development, are some reasons not to rely on library routines for allocation:

                        • Being able to audit allocations and deallocations
                        • Knowing that, at level load, slab-allocating a bunch of memory, no-oping frees, and rejiggering everything at level transition is Good Enough(tm) and will save CPU cycles (see the sketch at the end of this comment)
                        • Having a frame time budget (same as you’d see in a soft real-time system) where GCing or even coalescing free lists takes too long
                        • Knowing that some library (say, std::vector) is going to be doing lots of little tiny allocations/deallocations and that an arena allocator is more suited to that workload.

                        Like, sure, as a dev I don’t like debugging these things when they go wrong–but I like even less having to rewrite a whole library because they don’t manage their memory the same way I do.

                        This is also why good libraries let the user specify file access routines.
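
                        To make the level-load bullet above concrete, here is a minimal arena sketch (my own illustration, not lifted from any particular engine): one big block up front, a bump pointer, frees as no-ops, and a wholesale reset at the level transition.

                          #include <stddef.h>
                          #include <stdint.h>
                          #include <stdlib.h>

                          /* Level-scoped arena: allocate one big block at level load, hand out
                           * pieces with a bump pointer, make "free" a no-op, and reset the whole
                           * thing at the level transition. */
                          typedef struct {
                              uint8_t *base;
                              size_t   capacity;
                              size_t   used;
                          } arena_t;

                          arena_t arena_create(size_t capacity)
                          {
                              arena_t a = { malloc(capacity), capacity, 0 };
                              return a;
                          }

                          void *arena_alloc(arena_t *a, size_t nbytes)
                          {
                              size_t aligned = (nbytes + 15) & ~(size_t)15;  /* keep 16-byte alignment */
                              if (a->base == NULL || a->used + aligned > a->capacity)
                                  return NULL;               /* budget blown: treat it as a content bug */
                              void *p = a->base + a->used;
                              a->used += aligned;
                              return p;
                          }

                          void arena_free(arena_t *a, void *p)
                          {
                              (void)a; (void)p;              /* intentionally a no-op */
                          }

                          void arena_reset(arena_t *a)       /* call at level transition */
                          {
                              a->used = 0;
                          }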

                    2. 3

                      It’s not the allocation that’s error-prone, it’s the deallocation.

                      1. 6

                        And not even the deallocation at time of writing. The problems show up ten years later with a ninja patch that works and passes tests but fails the allocation in some crazy way. “We just need this buffer over here for later….”

                        1. 3

                          How would a library take control of deallocations without also taking control of the allocations?

                          1. 3

                            As I understand it, such a library does not allocate and does not deallocate. All users are expected to BYOB (Bring Your Own Buffer).
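
                            A tiny made-up example of the style, in the spirit of snprintf (the names and behaviour are mine, not from any real library): the function never allocates, takes a caller-supplied buffer, and reports how many bytes it needs, so the caller can back it with the stack, an arena, malloc, or whatever it likes.

                              #include <stddef.h>
                              #include <string.h>

                              /* Hypothetical BYOB-style call: the library never allocates. The caller
                               * passes a buffer plus its size and gets back the number of bytes
                               * required, so it can call once with NULL to size the buffer first. */
                              size_t greet(const char *name, char *out, size_t out_size)
                              {
                                  const char *prefix = "hello, ";
                                  size_t needed = strlen(prefix) + strlen(name) + 1;  /* incl. NUL */

                                  if (out != NULL && out_size >= needed) {
                                      memcpy(out, prefix, strlen(prefix));
                                      memcpy(out + strlen(prefix), name, strlen(name) + 1);
                                  }
                                  return needed;  /* success iff needed <= out_size */
                              }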

                            1. 2

                              In which case it really didn’t matter (in this context) whether it’s the allocation or the deallocation that’s hard; the library is leaving both up to the application anyway.

                      2. 3

                        Yeah, we saw what that’s like with MPI. Those bad experiences led to languages like Chapel, X10, ParaSail, and Futhark. It turns out many app developers would rather describe their problem, or a high-level solution, than micromanage the machine.

                      1. 2

                        I thought the punchline was macros, but alas it was map and reduce, which still rely on compiler magic. If it were macros, then the programmer could decide on the threshold themselves.

                        1. 3

                          There is nothing that prevents a programmer from providing a module that implements map and reduce with some threshold mechanism. It’s as flexible as macros in that regard.

                          1. 1

                            So I can reflect over the structure and count elements to decide how many can be inlined before recursing?

                            1. 5

                              Yes. The only thing that is missing from the vector package is that there is no dynamic value exposing the size of the vector, so you’d have to roll your own. However, you’d have to actually produce code that performs the branch dynamically, and then depend on the compiler doing constant-folding to remove the branch (but this is pretty much guaranteed to work).

                              It’s certainly not fully as powerful as Lisp-style macros, but good enough for this purpose.

                        1. 3

                          Gods, for a moment I saw REMY.DAT;2 and thought I’d been writing articles about VMS while I thought I was sleeping.

                          VMS is the best OS in history, and nothing has quite matched it. And I say that even after learning to write assembly on a VAX machine. Which was horrible, but that’s not VMS’s fault.

                          1. 2

                            Haha, another Remy here. Shell accounts are available on DECUS if you want to play around again

                            1. 1

                              What makes OpenVMS better than its competitors (mostly Unix I guess)? From the article, it seems fascinatingly different, but ultimately it just looks more complex in terms of feature count.

                              1. 5

                                Admittedly, it’s a bit of hyperbole. But there are a lot of things VMS did before Unix. When I used VMS, it was on a mildly large cluster. Clustering at that scale just wasn’t a thing in Unixland at the time. The filesystem is itself interesting, and the inherent versioning doesn’t make things all that much more complex.

                                But the biggun is its binary formats. VMS had a “common language environment” which specified how languages manage the stack, registers, etc., and it meant that you could call libraries written in one language from any other language. Straight interop across languages. COBOL to C. C to FORTRAN. FORTRAN into your hand-coded assembly module.

                                1. 3

                                  As a newbie, I notice more consistency. DCL (shell) options and syntax are the same for every program; no need to remember if it’s -h, --help, /?, etc. Clustering is easy, consistent, and scales. Applications don’t have to be cluster-aware, and you are not fighting against the cluster (as compared to Linux with keepalived, corosync, and some database or software).

                                  1. 3

                                    I’ll add, on the clustering, that it got pretty bulletproof over time, with clusters running for many years (17 years claimed for one). Some of its features included:

                                    1. The ability to run nodes with different ISAs, for CPU upgrades

                                    2. A distributed lock protocol that others later copied for their clustering.

                                    3. Deadlock detection built into that.

                                    There was also the spawn vs fork debate. The UNIX crowd went with fork for its simplicity. VMS’s spawn could do extra stuff such as CPU/RAM metering and customizing security privileges. The Linux ecosystem eventually adopted a pile of modifications and extensions to do that sort of thing for clouds. Way less consistent than VMS, though, with ramifications for reliability and security.

                                    EDIT: In this submission, I list a few more alternative OSes that had advantages over UNIX. I think UNIX still can’t touch the LISP machines on their mix of productivity, consistency, maintenance, and reliability. The Smalltalk machines at PARC had similar benefits. Those two are in a league of their own decades later.

                                1. 6

                                  I’ve never understood why Scroll Lock doesn’t do something sensible by default on Unix systems. For example, locking the terminal so further command output does not cause further scrolling. This is probably what I would a priori assume was the purpose of Scroll Lock (related to what Ctrl-s does).

                                  1. 1

                                    I think this is troublesome when you have more than one terminal, but it could be fun to cook something up.

                                    1. 1

                                      Looked into this. xterm seems to have pretty intelligent handling: it locks scrolling while active, turns on the LED, etc.

                                    1. 20

                                      Impressive! I particularly like the smoke effects that come from fire. Are you doing some fluid dynamics to get that behaviour?

                                      1. 21

                                        yep, the fluid simulation also runs completely on the GPU, which is one reason it’s so smooth. I adapted this excellent implementation for my needs: https://github.com/PavelDoGreat/WebGL-Fluid-Simulation

                                      1. 4

                                        I don’t understand why the author compares an Amiga (model not mentioned) with a laptop and remarks:

                                        How long do you think you could keep a modern laptop working? Four or five years? Maybe?

                                        When in the next paragraph it states:

                                        While the system has been in service all these years, it hasn’t always been a smooth ride. The monitor, mouse, and keyboard have all broken at one time or another.

                                        So it is not that surprising: one could easily keep a desktop computer operational for all these years, if we disregard any philosophical (Ship of Theseus) paradox that might occur as parts get replaced.

                                        1. 2

                                          Laptops are built with compromises to permit portability. I would not be surprised if they are not as long-lasting as machines built with fewer such compromises. There are also stories of ancient DOS machines still running, so I’m not sure the Amiga was anything special in that regard. Unfortunately, I doubt anyone has done proper studies on the long-term durability of 80s microcomputers!