1. 22
  1. 5

    This is a really good research proposal. Godspeed to you, @lucic71!

    Please let me suggest Bayesian statistics for the data evaluation. It’s a great tool for this kind of analysis and allows you to quantify the impact of UB-optimization.

    To sketch it out: if you run a given test program with UBO on and off, the resulting runtimes will inevitably vary and follow two probability distributions over the space of possible runtimes. If T is the associated random variable and you formulate it as a conditional probability, the two distributions are P(T=t | UBO=true) and P(T=t | UBO=false).

    Compared to the frequentist approach, the question of how much UBO affects the result is flipped around: you ask yourself “Given a certain outcome of T, how likely is it that UBO was on?”, and your null hypothesis becomes “program runtime is independent of UBO”. In the same fashion, in the end you don’t get a p-value (which heavily depends on the sample size due to varying t-scores) or a confidence interval (which is not even a probability distribution) but a Bayes factor. For further study, I recommend Kruschke’s paper “Bayesian estimation supersedes the t test” in the Journal of Experimental Psychology: General from 2013 (doi:10.1037/a0029146), because your experiment takes the form of an equivalence test (where you basically “prove the null hypothesis”), which is always a bit trickier to handle, but it will be worth it to go the Bayesian way!
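
    To put a number on that, the Bayes factor you end up reporting is, roughly, in the notation above (with H1 = “runtime depends on UBO” and H0 = “runtime is independent of UBO”):

      BF_10 = P(observed runtimes | H1) / P(observed runtimes | H0)

    A BF_10 near 1 means the measurements don’t discriminate between the two hypotheses, a large value is evidence that UBO matters, and a value much smaller than 1 is positive evidence for equivalence, which is the sense in which you can “prove the null”.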

    1. 3

      I’d love to see this. A couple of caveats:

      There was a paper at ASPLOS a few years ago that showed that randomising layout can have a 10% (plus or minus) impact on performance. Most CGO papers therefore fail to demonstrate statistical significance. Vikram Adve told me that the same team released a tool that could take two compilers and would build with -ffunction-sections and -fdata-sections and do enough randomisations that you could test for statistically significant performance deltas. I haven’t tried it, but it’s exactly the sort of thing that this analysis needs.

      Second, most of the benefits from UB related analyses appear two or three transforms later in the pipeline than the one that initially takes advantage of it. In particular, autovectorisation can often benefit from them. Most kernels disable the vector units, so it’s a good idea to make sure that you’re comparing userspace code. LLVM can emit optimisation remarks, which tell you a lot about what things are or aren’t vectorised, which should help with root-cause analysis.

      1. 2

        so it’s a good idea to make sure that you’re comparing userspace code

        My initial plan was to compare the whole system (user+kernel space). Kernel code may disable the vector units but I don’t see why it wouldn’t take advantage of other cases of UB to trigger optimizations.

        LLVM can emit optimisation remarks, which tell you a lot about what things are or aren’t vectorised, which should help with root-cause analysis.

        Are you talking about LLVM’s optview?

        1. 3

          My initial plan was to compare the whole system (user+kernel space). Kernel code may disable the vector units but I don’t see why it wouldn’t take advantage of other cases of UB to trigger optimizations.

          It almost certainly will, but they’re much less likely to have a noticeable impact on performance. Vectorisation can easily give a 2-8x speedup on a hot loop, which can then easily translate to double-digit percentage perf deltas over the whole program. Without that, the changes will be much smaller.

          Are you talking about LLVM’s optview?

          Remarks. It looks as if there are a few tools for visualising them.
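
          If it helps, the flags to get at the remarks are roughly these (spellings from memory, so double-check against your clang version):

            clang -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
                  -Rpass-analysis=loop-vectorize foo.c          # remarks printed to stderr
            clang -O2 -fsave-optimization-record foo.c          # YAML for the viewer tools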

      2. 1

        Even where the avoidable UB is “beneficial” for performance, the argument isn’t that good, because the hand-optimized program without UB is often the (only) correct one.

        Case in point, signed overflow: One supposed benefit is to optimize loop indexes from int to size_t. But this is silly, because if you want the debug build to also be correct, size_t is the (only) correct integer type. int is for file descriptors. Actually, there are two schools of thought here, but I don’t agree with the Google style guide one.

        1. 1

          YES please! Though you may find it difficult, since afaict UB tends to not be a decision the compiler makes, but rather an assumption that it implicitly uses in multiple small sneaky places.

          1. 1

            I think this is a pretty neat proposal and I’m super curious to see the results! For a very long time now the discussion around UB optimisation has been stuck in the land of qualitative statements. Everyone “knows” it’s important for performance reasons, but nobody can say how important except by pointing at either trivial benchmarks or fifteen-year-old data.

            Oh wow SDF is on the list of allowed tildes, I might even be able to star this :-D.

            1. 1

              For years, I have been asking if anyone can demonstrate a real slowdown in a real application from making well-defined just a single piece of contentious undefined behaviour: signed integer overflow. No one has been able to do it.

              That said, some ‘ub’ optimisations are mostly legitimate. In particular, null check removal is completely legit, if you can guarantee null accesses trap. This may not be true in kernelspace, but is likely true in userspace, and e.g. hotspot uses it to great effect to eliminate null-checks.
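
              To make “null check removal” concrete, the shape of code where it kicks in is something like this (a made-up minimal example, not from any particular codebase):

                int first_element(int *p)
                {
                    int x = *p;        /* if p is null, this access traps (or is UB)...        */
                    if (p == NULL)     /* ...so the compiler is entitled to delete this branch */
                        return -1;
                    return x;
                }

              Whether deleting the branch is legitimate then hinges entirely on the guarantee that the access on the first line really does trap.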

              1. 2

                While I agree with you, I’d also note I haven’t seen a valid use case for signed integer overflow (whatever you may define it to be).

                I agree well defined deterministic behaviour is better than undefined.

                But most cases I have seen (I’m happy to be proven wrong), signed integer overflow is a bug and I’d rather it deterministically trapped LOUDLY than did something weird and buggy that sometimes worked and sometimes silently just did the wrong thing. Hmm. Maybe if we can get (a+b)-c === a + (b - c) type of thing to just plain work, not sure there is anything sane one can do with multiplication though.

                But that said, trying to get a committee to decide on which well defined behaviour, sigh…

                1. 3

                  But most cases I have seen (I’m happy to be proven wrong), signed integer overflow is a bug and I’d rather it deterministically trapped LOUDLY than did something weird and buggy that sometimes worked and sometimes silently just did the wrong thing.

                  I’d love for it to trap in a way that you can catch sanely, leveraging the ISA where possible for fast jumps. Instead, we’re stuck with having to use -fwrapv and detecting wraparound manually even if the CPU has dedicated instructions for it. This matters, for instance, when implementing automatic promotion to bigints. This alone would speed up lots of real-world Lisp implementations.

                  I’d be happy for the default behaviour of an unhandled overflow to be aborting the program, but it would definitely be so much better if the language better supported handling overflow in a native way.

                  This would require something that looks a bit like exception handling in C, which probably means it’s not going to happen anytime soon.

                  1. 3

                    This matters, for instance, when implementing automatic promotion to bigints.

                    This is the use case I had in mind when we added the overflow-checked builtins to clang (LLVM already had intrinsics for them because they’re useful for some other front ends). These compile down to a single branch-on-carry on most architectures (i.e. anything with a sane way of doing this, so everything except RISC-V).
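
                    For the bigint case the usage looks roughly like this (a sketch; promote_to_bignum is a made-up stand-in for whatever your runtime’s slow path is):

                      #include <stdint.h>

                      int64_t promote_to_bignum(int64_t a, int64_t b);   /* hypothetical slow path */

                      int64_t add_or_promote(int64_t a, int64_t b)
                      {
                          int64_t sum;
                          /* One add plus a branch on the overflow flag on most targets. */
                          if (__builtin_add_overflow(a, b, &sum))
                              return promote_to_bignum(a, b);
                          return sum;
                      }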

                    1. 2

                      Sounds cool, do you have a link to the docs about this?

                      1. 2

                        They’re in the clang manual. I think we added them some time in the 2.x series, they’ve been there for about a decade now.

                2. 1

                  https://youtu.be/yG1OZ69H_-o?t=2355

                  This was in bzip. On 64-bit architectures, performance with unsigned ints (where the compiler had to allow for wraparound) was dramatically lower than after the code was changed to use a signed int instead.

                  Admittedly it may be possible to avoid this problem (eg by using size_t instead of unsigned int) without requiring that signed integer overflow be UB, but, it is a real example in real code where defined overflow caused a significant performance issue.
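
                  For context, the pattern under discussion is roughly of this shape (a simplified sketch from memory, not the actual bzip2 source):

                    /* With 32-bit unsigned indices, i + stride must wrap modulo 2^32, so the
                     * compiler cannot simply widen the index arithmetic to 64 bits and fold it
                     * into the addressing; with int (signed overflow being UB) or size_t it
                     * can. The caller guarantees a[] has at least n + stride elements. */
                    long sum_strided(const int *a, unsigned int n, unsigned int stride)
                    {
                        long s = 0;
                        for (unsigned int i = 0; i < n; i++)
                            s += a[i + stride];
                        return s;
                    }

                  Whether that actually costs anything measurable is, of course, exactly what’s being disputed below.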

                  1. 1

                    Perhaps I missed it, but I do not see anyone measure a real slowdown in a real application.

                    Furthermore, let me suspend my disbelief for a moment and imagine that this particular loop in bzip2 were a major bottleneck for a real application. Then I will point out that:

                    1. As you say, word-sized values would perform favourably vs sub-word-sized values. This should not be surprising.

                    2. Instruction count is not a proxy for performance; I expect the two variants to perform rather similarly (and, again, have not seen a benchmark, just some half-hearted waving at disassembly).

                    3. If (if) this code were actually a bottleneck, it would be a horrible way to write it. A vectorised approach would make mincemeat of it. So the comparands are both strawmen.

                    1. 1

                      imagine that this particular loop in bzip2 were a major bottleneck for a real application

                      Ok, so I think, before we start shifting goalposts, that this would satisfy your original request:

                      if anyone can demonstrate a real slowdown in a real application from making well-defined just a single piece of contentious undefined behaviour: signed integer overflow

                      Unless you’re calling bzip itself a “not-real application”, which I think is contentious. As for actual performance measurements, I don’t have them and I don’t care enough to try and make them myself, but if I recall correctly this issue was discovered specifically because bzip ran noticeably slower on 64-bit architectures than it did on 32-bit architectures. While I agree with the broad statement that higher instruction counts don’t necessarily correspond to increased execution time, it’s obviously the case that they do in general.

                      Edit:

                      If (if) this code were actually a bottleneck, it would be a horrible way to write it.

                      Sure, but if people always wrote good code then undefinedness of signed integer overflow wouldn’t be an issue anyway.

                      1. 1

                        sigh

                        shifting goalposts

                        I do not think I shifted any goalposts.

                        While I agree to the broad statement that higher instruction counts don’t necessarily correspond to increased execution, it’s obviously the case that they do in general

                        No such thing is obvious. It is frequently reasonable to trade off size and speed. (Although the extraneous instructions here are obviously not doing anything particularly helpful.)

                        As for actual performance measurements, I don’t have them and I don’t care enough to try and make them myself

                        In other words, you don’t actually have any reason to believe that it makes any difference.


                        I compiled bzip2, using signed and unsigned indices for the routine under consideration, and timed; the timings were identical. Minutiae:

                        • Host: freebsd 13.1 on a xeon W-2123 with ZFS on an SSD

                        • Compiler: clang 15 -O2

                        • Test case: linux 6.1.3 tarball

                        • Compression runtime: ~87s

                        • No idle CPU suckers

                        • Caches primed

                        Theoretical analysis: if the loop stays in the uop cache, its throughput will be limited by the memory subsystem, so the extra alu ops make no difference. Furthermore, the branches will likely have crap predictability, which will overshadow any other differences. If the loop does not stay in cache, decode will become an issue, but in that case it is likely not hot enough to matter. The larger code likely has second-order effects, but (empirically!) they are insignificant.

                        1. 1

                          I do not think I shifted any goalposts.

                          You said you hadn’t seen an example, I provided one, you added stipulations at that point. In my view that’s shifting the goalposts, but honestly, if you don’t see it that way, then whatever. It’s not a hill I’d choose to die on.

                          No such thing is obvious

                          I think you’re ignoring the “generally” part. There are of course specific cases which are exceptions, but executing 1000 instructions is probably going to take longer than executing 1 instruction (edit: and in case my point is lost here, it’s also likely that 20 instructions will take longer than 10 for example, though not as glaringly certain). If that’s not obvious then I guess we’re really not connecting in this conversation.

                          In other words, you don’t actually have any reason to believe that it makes any difference.

                          As I said, I recall that it made enough of a difference for someone to track down the why of it. I don’t know why that wouldn’t be the case when you ran your own experiment. Maybe it was only observable with specific input, maybe it was only on a particular processor series, maybe it was in a situation with high cache pressure so those extra instructions really hurt. Or maybe my memory isn’t correct. In any case, it would be great to see a write-up of your experiment, if you do have time to put that together.

                          Edit again:

                          Compiler: clang 15 -O2

                          You should probably use -O3 if you really want to test the effect on optimisation. I’m not claiming it will necessarily make a difference, just a point of order.

                          1. 1

                            You said you hadn’t seen an example, I provided one, you added stipulations at that point

                            My initial statement was:

                            For years, I have been asking if anyone can demonstrate a real slowdown in a real application

                            And I still have not seen any such slowdown demonstrated. You implied bzip2 ran slower, but did not provide any benchmarks to demonstrate that, and my own benchmarks showed the opposite. Pointing at cherrypicked disassembly and saying ‘it looks worse!’ is not demonstrating a slowdown (which has also been part of my point from the beginning—obviously you can do the former). So I really do not understand where you think I have been inconsistent.

                            There are of course specific cases which are exceptions, but executing 1000 instructions is probably going to take longer than executing 1 instruction

                            1. Dynamic instruction counts ≠ static instruction counts.

                            2. Instructions have wildly different cost models—more diverse than you give them credit for.

                            3. Even leaving aside all of the above, there are work-span tradeoffs; I expect that, for instance, a well-written sum scan might have 2x the dynamic instruction count of a linear version, and yet run at least 2x faster.

                            -O3

                            Most real code is compiled with -O2, and research has not demonstrated that either produces generally better code than the other. Especially with clang, which is more aggressive than gcc by default. (It might be interesting to compare gcc and clang; I used the latter because I expect that is what Chandler Carruth did.)

                            1. 2

                              So I really do not understand where you think I have been inconsistent.

                              “Shifting the goal posts” is not about inconsistency as such, it’s about tightening your argument after the fact. Because:

                              If (if) this code were actually a bottleneck, it would be a horrible way to write it. A vectorised approach would make mincemeat of it. So the comparands are both strawmen

                              I.e. you’re at this point not happy if the real application takes a performance hit, you need it to be well-written code too. You’ve added a stipulation that wasn’t there in the beginning.

                              And I still have not seen any such slowdown demonstrated.

                              Yep, I grant you this. Like I said, I don’t even care enough to measure it myself. But like I also said, it had come up in a discussion (a long time ago and beers were involved, maybe I’m misremembering) that it made a difference. Maybe that’s wrong. I don’t think your own experiment necessarily proves that it universally is wrong, but I’m also quite happy to admit that I can’t prove it’s not.

                              Most real code is compiled with -O2, and research has not demonstrated that either produces generally better code than the other.

                              Ok, but even if “research has not demonstrated” something in general, that doesn’t mean it won’t make a difference in specific cases. At least with gcc, -O3 enables various vectorisation-related optimisations, which seems pertinent for this example, and its whole raison d’être is to “optimise harder”. I agree that most real code is probably compiled with -O2 (but what is the purpose of the experiment? Would you not accept an example if the application were compiled with -O3?).

                              1. Dynamic instruction counts ≠ static instruction counts.

                              They both increased. Your two points following this one are also still ignoring what I said: more instructions as a general rule will take longer to execute, but I never said the same was true in every case nor made any claim about (lack of) diversity of instruction costs.

                              I guess you are frustrated because I responded to your haven’t-seen-an-example claim with an example which doesn’t really fit the bill: I don’t have measurements or a reproducible experiment to follow up. Mea culpa. It was more intended as a point of interest, but you’re completely right that even a dramatically different (and somewhat longer) sequence of instructions won’t always equate to slower-running code, and might not even do so for the particular example I did give (and definitely doesn’t in your particular experiment).

              2. 0

                To make it brutally obvious….

                Loading the element beyond the end of an array is always undefined behaviour and cannot possibly return a predictable result, so optimizations are the least of your problems.

                Thus I feel your proposal is wasted effort; the effort would be much better deployed on preventing undefined behaviour in the first place, either through better static analysis or better UBSANs.

                For example, in production we’re currently running a very lightweight gcc ubsan, ie. if we index out of bounds, it traps, records a stack trace and resets. We’ve caught a fair number of very hard to track or reproduce bugs red handed this way.

                A far better course of action for your dissertation would be to take the idea of “design by contract” one step further.

                The existence of potential undefined behaviour in a function clearly imposes preconditions on the global variables and parameters of the function, which are, alas, usually not made explicit or explicitly checked.

                A better program of attack would be to make it easier to make these preconditions explicit.

                1. 1

                  so optimizations are the least of your problems

                  Read the proposal carefully. In this context the optimizations are my biggest problem because they affect the structure of the resulting program. People use UB in practice to express various intentions (see Section I): uninitialized reads or int-to-float casts. And that’s fine; they take the risk of using UB in order to gain expressive power.

                  What is not fine is that the compiler makes “smart” assumptions based on these UBs in order to trigger optimizations. In this context the compiler is not my friend that generates the code that I want; instead it is my adversary, because it plays against me. The code that I want falls into the category of C being a high-level assembler. If 10 years later something happens to be wrong with the generated code, it should be the fault of the programmer who wrote that code, not the fault of the compiler that generated “smart” code.

                  At the end of the day I’m not trying to solve a technological problem. The problem has a deep social nature. It’s the problem of how programmers, on one hand, understand UB and how compiler designers, on the other hand, handle UB inside the compiler implementation.

                  either through better static analysis or better UBSANs

                  Read Section II to see why this is an incomplete approach.

                  A far better course of action for your dissertation would be to take the idea of “design by contract” one step further.

                  The existence of potential undefined behaviour in a function clearly imposes preconditions on the global variables and parameters of the function, which are, alas, usually not made explicit or explicitly checked.

                  A better program of attack would be to make it easier to make these preconditions explicit.

                  I don’t understand your course of action. Could you provide some examples?

                  1. 1

                    To make a trivial, purely-for-illustration example…

                    int add( int a, int b)
                    {
                       return a + b;
                    }
                    

                    Has an implied precondition INT_MIN <= (int64_t)a + (int64_t)b <= INT_MAX, which may imply preconditions on any function that invokes it.

                    These preconditions are implicitly all over all C code, but we tend to just make the word size big enough and hope it all works.

                    Not exactly a careful or safe strategy, merely a vaguely pragmatic one, as very few physical things grow as fast as doubling the word size.

                    Ideally a static analysis tool should assist us by producing, where possible, explicit precondition checks.
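
                    For the add example above, the generated explicit check might look something like this (REQUIRE being a hypothetical contract macro, here just mapped to assert):

                      #include <assert.h>
                      #include <limits.h>
                      #include <stdint.h>

                      #define REQUIRE(cond) assert(cond)   /* a tool could also discharge this statically */

                      int add(int a, int b)
                      {
                          REQUIRE((int64_t)a + (int64_t)b >= INT_MIN &&
                                  (int64_t)a + (int64_t)b <= INT_MAX);
                          return a + b;
                      }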

                    If you’re doing Design by Contract (which I’d argue any working programmer should be), ideally we should be able to state a precondition for which, if satisfied, our function is guaranteed to work successfully.

                    Ideally, a static analysis tool should warn us if there exists a set of parameter values satisfying the precondition, which would result in undefined behaviour, or conversely helping us to deduce one.

                    The program-proving crowd are all over this stuff… but I would argue that we don’t need to go as far as full-on program proving to get pragmatic gains in terms of bug finding.

                    We can simplify the task, depending on what we’re doing, by specifying a much simpler, weaker precondition… but we still find a bug if analysis or a unit test shows we violate it.

                    Conversely, if X() calls Y(), we can simplify the precondition for X by making it much stronger and the precondition for Y by making it much weaker; if, by unit test or analysis, we can show a case that passes the precondition of X but fails the precondition of Y, we have a bug.

                    We haven’t proven X(), merely provided a surprisingly useful pragmatic way of finding bugs.

                    To refer to that example from the ip stack… the operator a->b has a precondition that a is not null. It was violated.

                    1. 2

                      I would argue that C is not the best language to integrate your idea into. Given that the philosophy of C is Worse is Better, and what you are describing is basically the MIT approach, I don’t see how something successful could be born out of this combination.

                      That is why fuzzers exist. Because C is not what I will call “the complete solution”, and because people saw the problems that you see, they invented fuzzers to help them spot the programming errors that you describe. They did not go the way that you describe, i.e. integrate preconditions into the language. And now that I think of it, you don’t even need these preconditions integrated into the language; simple if statements do the job.

                      I mean, C is used for writing systems right above the hardware layer; you need simplicity to achieve this task, and by inserting design by contract into this story you suddenly complicate the situation. For other languages this might work, but I’m not convinced that C is the way to go with this approach.

                      I think Ada has this notion of Design by Contract. But that’s another story; the language is designed specifically for critical systems.

                  2. 1

                    either through better static analysis or better UBSANs.

                      The right-hand side of page 2 discusses why they don’t want to go that way and why it is insufficient.

                    1. 2

                      I’d still argue the problem isn’t the optimizations… it’s the bugs.

                      All these cases are common or garden bugs, blaming the optimizer isn’t going to help you.

                      Certainly there can be constructed less bug prone languages, and indeed I’d recommend one uses them, but they are not C.

                      What this program of research is effectively aiming to do is somehow create a safer subset of C without changing the C standard.

                      Or you could define a new C standard that is C but with no undefined behaviour. However, that’s actually a different language requiring a different compiler.

                      Give up now and move to Rust or D or … which will win you far far more.

                      I’m still adamant that focusing on UB optimizations is misguided and not valuable.

                      1. 1

                        I really wouldn’t describe the linked patch (https://marc.info/?l=openbsd-tech&m=164332561831177&w=2) as a common bug. Nobody will spot this as “this will cause null check removal”. For each case like this that is detected by a sanitizer, there’s going to be another one that isn’t.

                        1. 1

                          That is one of the most common C bugs.

                          In fact it’s an instance of the Billion Dollar Mistake.

                          Function returns null, he dereferences it, nasal daemons ensue.

                          1. 1

                            It’s not that simple. The code never dereferences null in practice. It could theoretically do that according to the compiler which decides that in that case all bets are off and removes some other code.

                            1. 2

                              Nope.

                              It clearly applies -> to the result of the function, which is clearly returning null. The fact it later applies ‘&’ is irrelevant.

                              Enjoy your nasal daemons.

                              1. 1

                                /usr/include/sys/param.h:

                                 #define offsetof(s, e) ((size_t)&((s *)0)->e)
                                

                                This is part of the system headers.

                                1. 3

                                  Then that’s a bug in the system headers. Compilers (at least gcc and clang) provide a builtin for this, __builtin_offsetof, so you can implement offsetof without it having undefined behaviour.
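
                                  i.e. the header could simply say something along the lines of:

                                    /* No null pointer is ever formed, so there is no UB for the optimiser to exploit. */
                                    #define offsetof(s, e) __builtin_offsetof(s, e)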

                                  1. 2

                                    https://en.cppreference.com/w/c/types/offsetof

                                    What it expands to is implementation-defined; under gcc it expands to __builtin_offsetof().

                                    In that instance it may be a case of “works for me”, as 0 is a compile-time constant, but it’s not portable code under the standard.

                                    gcc flags that usage as incorrect (I think it’s the -Wnull-dereference warning).

                                    As I keep saying, if you want to create a new and different language that almost looks like C, but isn’t quite… you’re welcome, but probably a dead waste of time versus creating or adopting a better one.

                                    ie. If you want that sort of hack to work, change the standard, don’t fight the optimiser.

                                    1. 4

                                      In practice, all working C code is written in a different language that almost looks like C, but isn’t quite.

                                      1. 1

                                        As I keep saying, if you want to create a new and different language that almost looks like C, but isn’t quite… you’re welcome, but probably a dead waste of time versus creating or adopting a better one.

                                        For the OpenBSD team, investing in a “safer C” dialect seems like a very good use of their time, considering they’re quite expert secure C programmers who happen to get caught by the undefined behaviour stuff here and there. Creating/learning a new language and rewriting their entire operating system would be a bigger waste of time for them.

                                  2. 1

                                    The code never dereferences null in practice.

                                    Actually, there we have the heart of it. If the compiler were smart enough, it could deduce that it never dereferences null in practice. But it’s only “almost” smart enough: it can deduce that the returned value cannot be null (because that would mean dereferencing a null), so it can elide the null pointer check.

                                    And that really is the heart of it, can we make the compiler really smart? No, the halting problem makes that obvious.

                                    Can we make the standard rigid and anal and focused on minutiae, so the somewhat smart compiler can do the best it can with what it has? Yes.

                                    The problem is where we have just plain common or garden bugs that “worked by accident” and we want somehow the standard to be tweaked so they will continue to work. But every time we do that, writing a compiler or optimizer gets much harder…

                              2. 1

                                What this program of research is effectively aiming to do is somehow create a safer subset of C without changing the C standard.

                                Wrong. I’m not trying to change anything. It’s right in the title: Analysis, not Improvement, not Changing, not Modifying, and so on and so on.

                                Give up now and move to Rust or D or … which will win you far far more.

                                The argument Kell gives in his work may give you an idea of why you might not win that much if you make such a move.