1. 15

  2. 2

    The idea of a language specifically targeting GPUs is interesting. One thing I’d mention here is that such a language actually would not have to be only vector-based.

    A project I’ve been interested in for a while is Hank Dietz’s MOG; it translates general-purpose parallel code (MIMD: multiple instruction, multiple data) to the GPU with at most a modest slowdown (roughly 1/6 of full speed in the worst case, while vectorizable instructions run at nearly full speed).

    See: http://aggregate.org/MOG/

    1. 3

      GPUs are quite a bit more flexible in their control flow than traditional SIMD machines (NVIDIA calls this SIMT), so I think it’s quite clear that you could have each thread do quite different work. The problem is that this is going to be very inefficient, and I don’t think a 6x slowdown is the worst it can get. Worst-case warp/wavefront divergence is a 32x slowdown on NVIDIA and a 64x slowdown on AMD (or maybe 16x; I find the ISA documentation unclear). Further, GPUs depend crucially on certain memory access patterns (basically, to exploit the full memory bandwidth, neighbouring threads must access neighbouring memory addresses in the same clock cycle). If you get this wrong, you’ll typically face an 8x slowdown. There’s a rough sketch at the end of this comment illustrating these hazards.

      Then there’s a number of auxiliary issues: GPUs have very little memory, and if you have 60k threads going, that’s not a lot of memory for each (60k threads is a decent rule of thumb to ensure that latency can be hidden, and if the MOG techniques are used it looks like there’ll be a lot of latency to hide). With MIMD simulation, you probably can’t estimate in advance how much memory each thread will require, so you need to do dynamic memory management, likely via atomics, which seems guaranteed to be a sequentialising factor (but I don’t think anyone has even bothered trying to do fine-grained dynamic allocation on a GPU).

      Ultimately, you can definitely make it work, but I don’t think there will be much point to using a GPU anymore. I also don’t think the issue is working with vectors, or data-parallel programming in general. As long as the semantics are sequential, that seems to be what humans need. Lots of code exists that is essentially data-parallel in a way roughly suitable for GPU execution - just look at Matlab, R, Julia, or Numpy. (Of course, these have lots of other issues that make general GPU execution impractical, but the core programming model is suitable.)
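
      To make the numbers above concrete, here is a minimal CUDA sketch of the three hazards I mean: a divergent branch, coalesced versus strided loads, and atomics-based dynamic allocation. Everything in it (the demo kernel, the bump_alloc helper, the sizes) is invented for illustration; it is not taken from Futhark’s code generator or any real codebase.

      ```cuda
      #include <cstdio>
      #include <cuda_runtime.h>

      // Toy global heap for a bump allocator. A real allocator would need
      // alignment, the ability to free, and far more care; the point is only
      // that every allocation funnels through one atomic counter.
      __device__ unsigned long long g_heap_offset = 0;
      __device__ char g_heap[1 << 20];

      __device__ void* bump_alloc(size_t nbytes) {
          unsigned long long off = atomicAdd(&g_heap_offset, (unsigned long long)nbytes);
          return (off + nbytes <= sizeof(g_heap)) ? (void*)(g_heap + off) : nullptr;
      }

      __global__ void demo(const float* in, float* out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= n) return;

          // Coalesced: thread i reads element i, so a warp touches a single
          // contiguous segment of memory per load.
          float a = in[i];

          // Strided: thread i reads element 32*i, so a warp touches 32
          // separate segments and most of the fetched bandwidth is wasted.
          float b = in[(i * 32) % n];

          if (i % 2 == 0) {
              // Divergent branch: even and odd lanes of the same warp take
              // different paths, which the warp executes one after the other.
              out[i] = a * a;
          } else {
              // Per-thread dynamic allocation: every lane taking this path
              // contends on the single heap counter.
              float* scratch = (float*)bump_alloc(sizeof(float));
              if (scratch) *scratch = b;
              out[i] = a + b;
          }
      }

      int main() {
          const int n = 1 << 20;
          float *in, *out;
          cudaMalloc(&in, n * sizeof(float));
          cudaMalloc(&out, n * sizeof(float));
          cudaMemset(in, 0, n * sizeof(float));
          demo<<<(n + 255) / 256, 256>>>(in, out, n);
          cudaDeviceSynchronize();
          printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
          cudaFree(in);
          cudaFree(out);
          return 0;
      }
      ```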

      1. 1

        Thank you for the reply.

        I assume you have more experience in making things work on a GPU than I do. Still, I’d mention that the MOG project tries to eliminate warp divergence by compiling MIMD code to a byte-code that is executed by a byte-code interpreter on the GPU. The interpreter is a tight loop of conditional actions; there’s a rough sketch of the idea at the end of this comment. Of course, the memory access considerations remain; I think the main method involved devoting a bit of main memory to each thread.

        I believe Dietz is aiming to let traditional supercomputer applications like weather and multi-body gravity simulations run on a GPU. One stumbling block is that the people who buy supercomputers work for large institutions and aren’t necessarily that interested in saving every last dime.
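
        To illustrate what I mean by a tight loop of conditional actions, here is a toy CUDA sketch of the idea as I understand it. The four-instruction byte-code, the Insn layout, and the interp kernel are all invented for illustration; this is not MOG’s actual design, just the general shape: each GPU thread simulates one MIMD processor with its own program counter and registers, but every thread runs the same interpreter loop, so divergence is confined to which guarded statements are active on a given iteration.

        ```cuda
        #include <cstdio>
        #include <cuda_runtime.h>

        // Invented toy instruction set; MOG's real byte-code is different.
        enum Op : unsigned char { OP_LI, OP_ADD, OP_JNZ, OP_HALT };

        struct Insn {
            unsigned char op;
            unsigned char dst, src, imm;  // register indices / small immediate
        };

        // Each GPU thread simulates one MIMD "processor": it has its own
        // program counter and registers, but all threads run the same loop.
        __global__ void interp(const Insn* progs, int prog_len, int* results) {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            const Insn* prog = progs + tid * prog_len;  // this thread's program

            int reg[4] = {0, 0, 0, 0};
            int pc = 0;
            bool running = true;

            // The "tight loop of conditional actions": every iteration, every
            // lane evaluates the same short list of guarded statements, so the
            // warp only pays for the handlers that some lane actually needs.
            while (__any_sync(0xffffffff, running)) {
                Insn insn = {OP_HALT, 0, 0, 0};
                if (running) insn = prog[pc++];

                if (running && insn.op == OP_LI)   reg[insn.dst] = insn.imm;
                if (running && insn.op == OP_ADD)  reg[insn.dst] = reg[insn.dst] + reg[insn.src];
                if (running && insn.op == OP_JNZ)  { if (reg[insn.src] != 0) pc = insn.imm; }
                if (running && insn.op == OP_HALT) running = false;
            }
            results[tid] = reg[0];
        }

        int main() {
            const int nthreads = 256, prog_len = 4;
            // The same toy program for every simulated processor: compute 2 + 3.
            Insn host_prog[prog_len] = {
                {OP_LI, 0, 0, 2}, {OP_LI, 1, 0, 3}, {OP_ADD, 0, 1, 0}, {OP_HALT, 0, 0, 0},
            };
            Insn* d_progs;
            int* d_results;
            cudaMalloc(&d_progs, nthreads * prog_len * sizeof(Insn));
            cudaMalloc(&d_results, nthreads * sizeof(int));
            for (int t = 0; t < nthreads; t++)
                cudaMemcpy(d_progs + t * prog_len, host_prog, sizeof(host_prog),
                           cudaMemcpyHostToDevice);
            interp<<<1, nthreads>>>(d_progs, prog_len, d_results);
            int r = 0;
            cudaMemcpy(&r, d_results, sizeof(int), cudaMemcpyDeviceToHost);
            printf("simulated processor 0 computed %d\n", r);  // expect 5
            cudaFree(d_progs);
            cudaFree(d_results);
            return 0;
        }
        ```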

    2. 1

      My main question: is it pronounced “futt hark”, “footh ark”, or “futh ark”?

      1. 1

        Etymologically, “foo-thark”, with the “th” pronounced as in “the”. But “fut-ark” is also common.