
This is interesting because it examines not just program size or instruction count, but also the Critical Path Length (CPL: the longest chain of instructions in which each one uses the result of the previous one) and the Instruction Level Parallelism (ILP: how wide a CPU you need to get the fastest execution time, i.e. one equal to the CPL, assuming perfect branch prediction).

As such, this is an idealized measure, unrelated to current (or near-future) real-world CPUs.
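To make the two metrics concrete, here is a minimal sketch of how CPL and ideal ILP could be computed from an instruction trace. The trace format (a destination register plus a list of source registers), the function names and the unit-latency assumption are all mine, not the paper's, and it only tracks register dependencies, not memory ones.

```python
# Hypothetical trace format: each entry is (dest_register, [source_registers]).
# Assumes every instruction takes one cycle and only register (read-after-write)
# dependencies matter; memory dependencies are ignored in this sketch.

def critical_path_length(trace):
    """Length of the longest chain where each instruction uses an earlier result."""
    depth_of_reg = {}  # register -> chain depth of the instruction that last wrote it
    cpl = 0
    for dest, srcs in trace:
        # An instruction sits one step deeper than its deepest producer.
        depth = 1 + max((depth_of_reg.get(s, 0) for s in srcs), default=0)
        if dest is not None:
            depth_of_reg[dest] = depth
        cpl = max(cpl, depth)
    return cpl


def ideal_ilp(trace):
    """Average instructions per cycle if execution time equals the CPL."""
    return len(trace) / critical_path_length(trace)


trace = [
    ("a", []),          # a = const
    ("b", []),          # b = const
    ("c", ["a", "b"]),  # c = a + b
    ("d", ["c"]),       # d = c * 2
    ("e", ["a"]),       # e = a + 1 (independent of c and d)
    ("f", ["d", "e"]),  # f = d + e
]

print(critical_path_length(trace))  # 4  (a -> c -> d -> f)
print(ideal_ilp(trace))             # 1.5 (6 instructions / 4 cycles)
```

With an unlimited window and perfect branch prediction, nothing but these dependency chains limits execution time, which is why the result says little about any real core.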

They also repeat the analysis with a limited window of instructions (4, 16, 64, 200, 500, 1000 and 2000), effectively adding the effects of a limited-size ROB (Re-Order Buffer) and limited decode and commit widths (for which, somewhat unrealistically, they only consider half the ROB size).
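One way to picture the effect of a limited window is the rough sketch below. It is my own simplified model, not the paper's method: unit latency, in-order retirement, instruction i cannot start until instruction i - window has retired, and no separate limit on decode or commit width. The trace format and names are hypothetical.

```python
# Hypothetical trace format: each entry is (dest_register, [source_registers]).
# Simplified model (an assumption, not the paper's): unit latency, in-order
# retirement, and instruction i may not start before instruction i - window
# has retired. Decode/commit width is not limited separately.

def windowed_ilp(trace, window):
    finish = []       # cycle in which each instruction completes
    retire = []       # cycle by which each instruction has retired, in order
    last_writer = {}  # register -> index of the latest instruction writing it
    for i, (dest, srcs) in enumerate(trace):
        # Wait for all producers of our source registers.
        ready = max((finish[last_writer[s]] for s in srcs if s in last_writer),
                    default=0)
        # Wait for a free slot in the window.
        if i >= window:
            ready = max(ready, retire[i - window])
        finish.append(ready + 1)
        retire.append(max(retire[-1] if retire else 0, finish[-1]))
        if dest is not None:
            last_writer[dest] = i
    return len(trace) / max(finish)


# Four independent two-instruction chains: x_k = const; y_k = x_k + 1
trace = []
for k in range(4):
    trace += [(f"x{k}", []), (f"y{k}", [f"x{k}"])]

for w in (2, 4, 8):
    print(w, windowed_ilp(trace, w))
```

In this toy trace the available ILP grows with the window until the whole trace fits, at which point it matches the unlimited-window value of 4.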

In all cases, the ISAs track each other closely. The largest difference is for CloverLeaf at a window size of 2000, where RISC-V has 12% less ILP available. The only case where RISC-V has more ILP at large window sizes is STREAM, with a 5.8% advantage. In every case, however, at lower window sizes (500 or less), RISC-V has more ILP available, with AArch64 overtaking at higher window sizes.



      "Methods such as bitwise XOR-ing a register with itself are not detected as breaking the CP"

      An odd comment. On AArch64, an xor-selfie is architecturally not allowed to be dependency-breaking. (Not technically true, but close enough, and unlike x86 there’s no reason for the microarchitecture to treat it specially.)

      "RISC-V performs 460,027,962 branches to complete STREAM. This is almost 15% of all instructions executed. If all of these are conditional branches, this is 460 million compare instructions that don’t have to be executed compared to the equivalent program running on AArch64."

      Another odd comment. Because:

      1. Compare-and-branch is fused on high-end AArch64 parts
      2. Since AArch64 has flags, it doesn’t always need an explicit additional comparison before a branch
      3. AArch64 has some compare-and-branch instructions: cb(n)z and tb(n)z
      4. AArch64 has csel and ccmp, which can reduce instruction count and eliminate branches entirely, improving usable ILP.

      More generally, I found the paper somewhat shallow. I would have liked to see an exploration of where the differences come from. The applications they considered are also quite domain-specific.