1. 27

  2. 4

    I tried the C++ code with and without sorting and both versions ran in about 1.8s on my machine.

    Has something changed in compilers or hardware since 2012 when this was first posted?

    1. 6

      I have a CPU from that era (Intel i7 2600K, still going strong and I’m very happy with it) and I could reproduce it. With sort about 6.3 seconds, without sort about 16.0 seconds. My g++ version is 7.5.0. My only guess is that yes, they must have improved something about the branch prediction (in hardware or microcode?) for this effect to disappear.

      Also, wow, your machine is so much faster than mine. I didn’t know I could expect such a big single core speedup from upgrading my CPU. Maybe I should buy a new CPU after all. However, having an old CPU is really beneficial for development because if it performs well on my machine, it will perform even better on other people’s.

      1. 7

        Using old hardware for the benefit of your users. You, sir, are a hero. ∠(・`_´・ )

        I thought maybe -O2 optimisation was doing something fancy, but no, doesn’t seem to matter without -O2 either. I’m on an i7-7600U CPU @ 2.80GHz with gcc 8.3.0, which doesn’t seem that newer. I wonder if I’m doing something wrong in my test.

      2. 2

        Your compiler is almost certainly using cmov (conditional move) or adc (add with carry), if it hasn’t gone vectorization crazy (e.g. some versions of clang, depending on the flags). These instructions have the same timing regardless of the value being compared, since there is no branch involved.

        If you use objdump, you’ll probably find something along these lines in your binary:

        cmp word ptr [rax], 128
        adc rdx, 0

        In my experience, an unrolled loop using adc is hard to beat in this context.

        1. 2

          Here’s the disassembly of gcc 8.3.0’s output up until the clock call at the end.

          00000000000011a5 <main>:
              11a5:	41 55                	push   %r13
              11a7:	41 54                	push   %r12
              11a9:	55                   	push   %rbp
              11aa:	53                   	push   %rbx
              11ab:	48 81 ec 08 00 02 00 	sub    $0x20008,%rsp
              11b2:	49 89 e4             	mov    %rsp,%r12
              11b5:	48 8d ac 24 00 00 02 	lea    0x20000(%rsp),%rbp
              11bc:	00 
              11bd:	4c 89 e3             	mov    %r12,%rbx
              11c0:	e8 6b fe ff ff       	callq  1030 <rand@plt>
              11c5:	99                   	cltd   
              11c6:	c1 ea 18             	shr    $0x18,%edx
              11c9:	01 d0                	add    %edx,%eax
              11cb:	0f b6 c0             	movzbl %al,%eax
              11ce:	29 d0                	sub    %edx,%eax
              11d0:	89 03                	mov    %eax,(%rbx)
              11d2:	48 83 c3 04          	add    $0x4,%rbx
              11d6:	48 39 eb             	cmp    %rbp,%rbx
              11d9:	75 e5                	jne    11c0 <main+0x1b>
              11db:	e8 80 fe ff ff       	callq  1060 <clock@plt>
              11e0:	49 89 c5             	mov    %rax,%r13
              11e3:	be a0 86 01 00       	mov    $0x186a0,%esi
              11e8:	bb 00 00 00 00       	mov    $0x0,%ebx
              11ed:	eb 05                	jmp    11f4 <main+0x4f>
              11ef:	83 ee 01             	sub    $0x1,%esi
              11f2:	74 1d                	je     1211 <main+0x6c>
              11f4:	4c 89 e0             	mov    %r12,%rax
              11f7:	8b 08                	mov    (%rax),%ecx
              11f9:	48 63 d1             	movslq %ecx,%rdx
              11fc:	48 01 da             	add    %rbx,%rdx
              11ff:	83 f9 7f             	cmp    $0x7f,%ecx
              1202:	48 0f 4f da          	cmovg  %rdx,%rbx
              1206:	48 83 c0 04          	add    $0x4,%rax
              120a:	48 39 e8             	cmp    %rbp,%rax
              120d:	75 e8                	jne    11f7 <main+0x52>
              120f:	eb de                	jmp    11ef <main+0x4a>
              1211:	e8 4a fe ff ff       	callq  1060 <clock@plt>

          So you’re saying cmovg is what’s removing the branching?

          1. 4

            (Butting in)

            Yes, that’s right! The inner loop runs from 11f7 to 120d. It performs the addition every time, but only moves the result into the sum register if the condition is true.

      3. 4

        I hate subtle things like this that have such a major impact on perf. If you do manage to track them down and fix them, a minor update to the code (or change in compiler) might result in a performance regression. Profilers are an absolute necessity for these types of microoptimizations, but I wish they’d flag particularly bad parts of code (i.e. this if statement is throwing off the branch predictor) rather than reporting total aggregates at the function or whole program level (which can be useful, but don’t really point to specific lines and can hide spikes).

        1. 3

          You can see a similar behavior with loads rather than branches, e.g. dereferencing a sorted array of pointers will be faster than an unsorted one. e.g. something like (pseudo code):

          int** array = ...;
          for (p: array)

          A sorted array will be faster due to improved cache locality, improved TLB hits, etc

          It’s also possible that a reverse ordered array would be slower than increasing order as depending on processor model and generation and PT entry settings. Essentially most loaders assume sequential access.

          1. 1

            I don’t understand your statement, and I would appreciate more details.

            Where is the branch predication at work, (at that would be failling too often), in your example ?

            Also could you elaborate about : “A sorted array will be faster due to improved cache locality, improved TLB hits, etc”

            1. 3

              It’s not branch prediction, but rather load prediction.

              To start with the non-prediction portion are the basic L1,2,.. caches as they load in adjacent memory, if you sort your array of pointers then sequential pointers are much more like to be sufficiently close together to be pulled in together, so loads of successive entries in the array are much more likely to point to data that has been loaded into a cache line already. The TLB is a more extreme version of this essentially sequential loads reduces the likelihood that the MMU will have to do a full page table traversal to find the correct physical memory for a pointer load.

              Following from this processors have a bunch of load and prefetch predictors (although these are partially responsible the meltdown/spectre bugs), the most basic predictors have been the same for a few decades at this point: the processor assumes that you are traversing memory sequentially, so if you load the cache line for address A the prefetch unit may choose to load the cache line for the address (A+cacheline size). Traditionally the assumption was purely increasing sequential loads so decreasing sequential would not benefit from the prefetcher.

              Modern processors have multiple levels of exciting prediction units so they will preload all the way through *a[x] by predicting the value (a+x) and will load *(a+x) easily because in this example we are doing an in order sequential load, at this point we have a real address loaded from a predicted value, and if we’ve got that address in a cache line already then we the value from the cache line into a virtual register and carry on processing prior. The predicted loads can fail at any step along the line but each successful step in the prediction helps performance.

              In HPC work a /lot/ of attention is put onto layout of data in memory for exactly these reasons.

              1. 1

                After thinking more about your example, it is indeed an extremely good example for understanding cache misses. And your present explanation is extremely insightful. I was mislead by the original stackoverflow link in which it is the leaf/final/data that are sorted.

                However, in you example, even if sorted, there may be a little distance (offset between pointers) contained in the pointers array.

                Do you have a simple rule, roughly, so as to estimate how much data is pre-loaded for sequential, or close to sequential pointers ?

                It’s been some time I have not dealt with page memory. Is the page size always the same on one computer? Can it be changed or optimized at will ? Am I wrong thinking it is always related to page memory size ?

                And how is it dealt in HPC, a domain in which I have insufficient knowledge and experience ? Do they mimic CPU cache management ? If so in which way?

                I guess it is a lot of questions in one comment. If you think it’s too much, but that I might learn about them starting reading about some keywords, just tell me: that would be tremendously helpful for me.

                1. 2

                  All of the predictor behaviors are super architecture specific - if you want to keep work sizes inside the cache (or even single lines) you need to know the exact architecture you’re on.

                  Memory data structure layout is effected by many factors - SIMD instructions often benefit from using unusual layouts (the classic example being instead of an array of {x,y,z,w} vectors it’s often better to have four arrays, one each for x,y,z, and w. Because reasons. But even that can change depending on the use cases. Another classic is say you have a large rectangular NxM array it can occasionally be faster to split your task into work groups that are sub-blocks of the main array.

                  For the page question, yes I mean physical pages (the TLB is essentially a virtual address to physical page map). Page size is fixed in hardware - I know windows has the idea of 64k pages (vs regular 4k) but I don’t know if there’s any physical awareness of that or if it’s just an OS optimization. But you’re not guaranteed 4k - modern iOS devices have 16k pages.

                  In general people who are really in HPC-land with real need to get as much out as possible out of the hardware will tailor everything to the exact hardware they’re using - for example on multicore systems different pieces of memory may have drastically different perf characteristics depending on the core accessing the memory.

                  I am glad I don’t work in HPC world, but I work on sufficiently perf sensitive things that I do need to consider the higher level predictor behaviors, and rely on the compiler to do the correct instruction sequencing to improve throughput (compilers often model the number of execution units, dependencies, retire buffer sizes, etc).

              2. 2

                olliej’s example has nothing to do with branch prediction. I’ll try to explain cache locality for you.

                You can google that term for more information, but in short: accessing RAM is slow. In order to speed up memory access, basically all modern processors have a small bit of memory to which the processor has direct, fast access, called the cache. The problem is the cache is really tiny, and can’t hold much data.

                The way it tries to speed up memory access is that when accessing a value from main memory (say, at address p), the chunk of memory that this value’s contained in is copied to the cache. The idea is that if you’ve accessed p, then there’s a pretty good chance you’ll be accessing values like p+1, p+2, etc. shortly after. If you don’t, and you instead access another completely random memory address, the cache then has to be flushed, and you fall back to the (very slow) main memory access again.

                And this is where the sorted vs. unsorted pointers question comes in. If your pointers are sorted, chances are the majority of those pointer dereferences are going to be fulfilled by the cache. If they’re not sorted, it’s likely that you’ll be flushing the cache every single time.

                1. 1

                  I have notes about cache-misses to which I would have liked to add that example.

                  However I strongly disagree with what your explaining. To the point I think it is a FALSE statement. I may look insisting, so I would gladly welcome any corrections to my point of view:

                  • In your explanation you are dealing about the way the pointer is incremented, you are dealing with the algorithm used to increment the pointer. You are not dealing with the value of the memory zone the pointer are pointing to.
                  • If you sweep your memory zone, one by one, incrementing your pointer correctly one by one: there won’t be cache misses.

                  => It has NOTHING to do with the fact that value may be sorted or not. The duration of derefencing in itself (what the original comment was about and about which I was asking) is not depending one the value, but rather on the value of the address of the pointer: this is the reason that produces a cache miss or not.

                  In a nutshell: it is the value of the pointer that creates a cache-miss or not.

                  If and only if, you need to cycle through you array in a sorted way… Only in such case you would then peek at one place in memory, then at another maybe to far. And in this case then yes, there would be cache-misses.

                  Do you see my point of view ?

                  EDIT: I finally get your point => the value of the pointers are to be sorted to have an efficient loop, so as to have a sequential cycle-through the memory zone.

                  I put that in my notes ;-) thanks.

                  But I have have to put it clearly, in that example, we are not interested at all if the int values themselves (those that are eventually pointed after two indirection) are sorted or not. I did not understood that from the original comment. And after re-reading it I still think it is misleading.

                  1. 2

                    I think you got it in your edit: I was talking about the addresses contained within the pointers, not the integers (or whatever) the pointers were pointing to. Those addresses are also what we were sorting. Maybe I should have used the word “addresses” rather than “values” in my comment for clarity.

            2. 2

              Even if it is out of date with respect to optimizing compiler output, this illustration is clear and great for intuition.

              1. 1

                Compilers can reorder and schedule to maximize throughput as long as they don’t change program semantics (as defined by the language). A compiler cannot make higher level changes like sorting your array, nor rearranging data structures - those things are stuck on you still :D

              2. 2

                Branch predictors and compilers have both improved quite a bit since 2012.
                I tried Godbolt and current versions of Clang and G++ use conditional moves at -O2.
                I think ICC uses a compare but could be misinterpreting the listing.
                Clang and G++ vectorize the summation loop at -O3.

                1. 2

                  ICC looks to be doing the same trick mentioned in the SO answer, where it’s recognized that the outer loop and the inner loop can be swapped. This effectively makes data[c] a constant value for most iterations.