1. 63
  1. 5

    The article notes that a never-taken branch doesn’t get an entry in the BTB, and is automatically predicted to not be taken. However, my understanding was that a branch with no BTB entry would be assumed to be not taken only for a forward jump – backward branches would be assumed to be taken (since such are typically loop guards).

    It looks like their test only tests forward branches, so I’m not sure if my information is out of date (or incomplete), or if they just didn’t test this situation.

    1. 4

      Take a look at the code, there is a backward not-taken branch test as well. I did run it and it didn’t show anything intersting. But think about it: the cpu needs to feed instruction before the jmp is decoded. It would still create a bubble if the cpu had to wait for the backward branch to decode.

      1. 2

        Certainly the bubble would still exist; my only question was if backward branches had their cases reversed – i.e. the CPU would only create an entry the BTB for it if it wasn’t taken. It sounds like that’s not the case, however, so presumably my understanding was just wrong.

      2. 3

        It depends a lot. In a big superscalar core, it’s common to do predictions per fetch granule, rather than per instruction. This is why the article sees a benefit from adding nops. The fetch stage grabs a chunk of data from i-cache and the branch predictor then has to tell it what to fetch next. This prediction has to happen before decode, which makes it difficult to key the decision on anything that happens after decode. You may get very small bubbles if decode tells you that there are no branch instructions but the branch prediction came from some stale state and suggested the next fetch granule was somewhere other than immediately following the current one.

        This gets even more complex when you have things like trace caches. Decoding x86 is so complex that it’s common to store the micro-ops in an internal cache, rather than decode the raw instructions every time. If you’re in a hot code segment, you’re probably skipping decode and going straight from the trace cache. There’s an old heuristic (I don’t actually know how true it is today) that 90% of compute time is spent in hot loops (which is why compilers try so hard to optimise loops) and most time in a loop you’ll be fetching from the trace cache. The trace cache can store extra information to hint the branch predictor, including just storing the next most-likely fetch granule in the buffer immediately after the current one so that you don’t need to do prediction at all, you just lay out the trace caches so that the hot code path is contiguous. That kind of thing is a bit less common on Arm implementations, where decode is cheap, but nVidia’s Project Denver series does something even more aggressive and JITs hot traces as contiguous blocks of VLIW code so that the fast paths are really fast but the cost of leaving the trace can be disproportionately high.

        TL;DR: Modern CPUs are indistinguishable from magic. Any heuristic that you think you know is probably wrong.

      3. 2

        This code stresses the BTB to an extreme - it just consists of a chain of jmp +2 statements (i.e. literally jumping to the next instruction)

        Nit: those are jmp +0. You can see that in the encoding (eb 00): eb is a short jump, and 00 is the displacement. The displacement of a branching instruction is added to the instruction pointer, which points to the instruction after that currently being executed.

        The call/ret chart is funny [https://blog.cloudflare.com/content/images/2021/05/pasted-image-0--17-.png]

        Ha, that is amazing!

        1. 4

          I know, but then I speak about block sizes. This is getting too much into details, and explaining that jmp+14 is block size 16 on x86 and jmp+12 is block size 16 on aarch64 would not be fun. Thanks for the nudge though :)

        2. 1

          This is a good article, but the only reason branch prediction and caching are so important is that latency to main memory is not keeping up with increases in CPU speed. You can see graphs of this in the post in https://lobste.rs/s/qwx517/compiler_will_optimize_away

          Corrected in reply :-)

          1. 13

            Branch prediction doesn’t exist because of the latency to RAM. It exists because of pipelined processors. Pipelined processors want to share and overlap a lot of resources between operations, like instruction decoding, register renaming, multiple ALUs with varying capabilities, and yeah, load/store units as well. This is what a modern core looks like https://en.wikichip.org/w/images/7/7e/skylake_block_diagram.svg You can’t share those execution units effectively if you’re going to stall at every branch because you don’t have a branch predictor.

            1. 2

              Stalling is one option; using a branch delay slot is another. Adding more branch delay slots would require recompiling your code, which is one reason they are not used as much any more.

              1. 2

                And of course you can’t come along later and make Processor 2 with more/less delay slots.