1. 7

  2. 2

    The curse of VLIW/explicitly parallel architectures is that the “smart enough” compilers never appear. Itanium taught a valuable lesson.

    1. 3

      I wonder how true that is now - LLVM packs an insane amount of intelligence in a compiler.

      The problem, I thought, was more trying to extract parallelism blood from mostly serial general purpose computing stones.

      1. 5

        LLVM has one VLIW target: Hexagon. It’s a DSP, so most of the code that targets it is written by people who are happy to tweak the source to get good performance.

        VLIW architectures are difficult for compilers. Compilers work on basic blocks: straight-line sequences of instructions with a single entry at the top and a single exit at the bottom. It’s fairly trivial for a compiler to pack instructions that don’t have data dependencies within a basic block into VLIW bundles. There are only two problems:

        • In code compiled from C, the average basic block is only about 7 instructions long.
        • Within a block, there are often data dependencies between those instructions.

        To get really good performance out of VLIW, you need to do the same kind of things that you need for autovectorisation. In particular, you need to do good predication so that you can move instructions between basic blocks and execute them speculatively and discard the results if they’re not needed. That, unfortunately, removes a lot of the power advantages of VLIW.

        VLIW architectures do work reasonably well as JIT targets, where you can build traces of common-path instruction streams and optimise those, running more slowly on cold paths. The most widespread general-purpose VLIW chips are nVidia’s ARM cores, which have a VLIW pipeline, a state machine that translates ARM instructions to inefficient VLIW instructions, and a small JIT compiler that takes hot code paths and generates efficient VLIW sequences (with side exits if you leave the hot path). The nVidia VLIW design is quite unusual because the long instructions are slightly offset, so each instruction can accept output of the earlier ones in the same bundle as its input without going via register renaming. That’s quite similar to an EDGE architecture in some ways.

        1. 1

          the average length of a basic block is 7 instructions.

          Interesting. Is there a (publicly available) source for this?

          1. 2

            I came across this heuristic in one of the early Berkeley RISC papers (as one of the motivations behind the RISC design). I wondered if it had changed, so I set the first assignment in the compilers course that I used to teach to test it. Students had to find a bit of source code that they thought was interesting and then modify the compiler to collect these statistics and present a histogram of basic block sizes (just to get them comfortable with hacking on LLVM, before they did anything difficult - this filtered out the students who would not be able to hack on a large existing C++ codebase). They pretty much all found that, whatever codebase they picked, the numbers were about the same.

        2. 4

          LLVM is overfitted to C semantics and to current architectures.

          Rust can’t even give LLVM its full pointer-aliasing information (beyond the minimum C provides), because those code paths in LLVM aren’t battle-tested and have proved too buggy to use. And good aliasing information is a basic requirement for VLIW.

          It’s a chicken-and-egg problem. I’m sure that if there were a big push for VLIW support, LLVM could be made more suitable for it, but it’s nowhere near there yet. Autovectorization barely works.

      2. 1

        The initial and recurring issue with VLIW is getting a smart enough compiler.

        I have worked on another project using VLIW. On paper it is very attractive: you can reach theoretically very high FLOPS at rather low power, and the architecture itself seems pretty simple. You can take an older open architecture and adapt it (the project I worked on was also adapted from SPARC, like this one).

        But you need a lot of firepower on the software side, and a real strategy for getting a good compiler. I have yet to see that succeed.

        All those VLIW projects could maybe pool their efforts and maintain a common optimization pass specialized for VLIW in open-source compilers.

        1. 1