1. 113
  1.  

    1. 11

      The labeled-switch feature is a nice optimization for bytecode interpreters — it looks equivalent to the GCC extension for taking the address of a statement label.

      1. 3

        Tokenizers/parsers too. They already converted their tokenizer to use labeled switches.

        1. 2

          How would this be an optimization for bytecode interpreters? It looks like it is just syntax sugar

          1. 2

            See https://lobste.rs/s/ui5fzs/cpython_tail_call_interpreter_merged_for#c_c5fwac

            This zig feature is somewhere between a switch in a loop and GNU C label addresses. The jump table remains implicit in the switch statement, but the dispatch happens at each continue. This eliminates the loop branches and makes each dispatch easier for the CPU’s branch predictor.
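            For reference, the GNU C technique being compared looks something like this minimal sketch (the opcode names and the `run` helper are made up for illustration; `&&label` and `goto *ptr` are the GCC/Clang labels-as-values extension, not standard C):

            ```c
            #include <assert.h>
            #include <stdio.h>

            /* Hypothetical 3-op bytecode: PUSH imm, ADD, HALT. */
            enum { OP_PUSH, OP_ADD, OP_HALT };

            static int run(const int *code) {
                /* GNU C "labels as values": one jump table, with a separate
                   dispatch at the end of every handler instead of a single
                   branch at the top of a loop. */
                static void *dispatch[] = { &&op_push, &&op_add, &&op_halt };
                int stack[16], sp = 0;
                goto *dispatch[*code++];

            op_push:
                stack[sp++] = *code++;
                goto *dispatch[*code++];
            op_add:
                sp--;
                stack[sp - 1] += stack[sp];
                goto *dispatch[*code++];
            op_halt:
                return stack[sp - 1];
            }

            int main(void) {
                const int prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
                assert(run(prog) == 5);
                return 0;
            }
            ```

            Because each handler ends in its own indirect jump, the CPU’s branch predictor gets one predictor slot per opcode transition rather than one shared slot for the whole loop.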

            1. 2

              re-reading your comment on the CPython tail calls, it sounds like this Zig feature is going to approximate “computed goto” in C, which is better than a switch statement, but still worse than the “tail calls with preserve_none” that Python is using. Computed goto is still really hard for the optimizer to reason about and allocate registers for. Is that correct?

              1. 1

                That’s my understanding, yes.

              2. 1

                That is so cool! It looks like it could have the same performance benefits without the need to define custom compiler intrinsics for calling conventions. I wonder if it will be as fast in practice.

              3. 1

                The post described it. Jumping from one case to the next saves some instructions and helps branch prediction. It’s a well known technique in C.

            2. 11

              Optimizers like LLVM may reduce this into a @fence(.seq_cst) + load internally.

              Ah, so we’re repeating the C mistake of having to write the low level construct in the high level language so that the optimizer can turn it into the high level construct in the low level language, because the supposed high level language is insufficiently expressive. I cringe every time I have to write out, e.g., a popcount the “formal” way only to have the optimizer turn it into a single popcnt instruction.
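              The “formal” way in question looks something like this sketch (the `popcount32` name is made up; modern GCC and Clang pattern-match this loop into a single popcnt instruction on targets that have one):

              ```c
              #include <assert.h>
              #include <stdint.h>

              /* Bit-twiddling popcount: each iteration clears the lowest
                 set bit, so the loop runs once per set bit. */
              static int popcount32(uint32_t x) {
                  int n = 0;
                  while (x) {
                      x &= x - 1; /* clear the lowest set bit */
                      n++;
                  }
                  return n;
              }

              int main(void) {
                  assert(popcount32(0) == 0);
                  assert(popcount32(0xFFu) == 8);
                  assert(popcount32(0x80000001u) == 2);
                  return 0;
              }
              ```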

              1. 13

                In exchange you get thread sanitizer.

              2. 7

                Decl literals seem like they’ll be pretty nifty.

                1. 3

                  A few weeks ago (on nightly) I tried calling an init function this way, having no idea this feature was a new idea. It just seemed like it should work, and it did :)

                2. 5

                  Fences removed with no alternatives, neat. Stuff like seqlocks is now less performant with no recourse (except to do them in C…).

                  1. 7

                    At first I was rolling my eyes at this comment, but I have to say, I’m actually really appreciating the ensuing discussion, so thank you (sincerely).

                    1. 5

                      There are other cases to complain about too on this level, e.g. really efficient RCU that needs fences to work on some arches (e.g. liburcu-qsbr).

                      That being said, the current state of affairs here is kind of a hot mess all over. It turns out the C/C++11 memory models are basically useless and broken anyway, and a lot of very smart people are still finding new problems with the whole way we think about these things. In the meantime, things like ThreadSanitizer don’t handle fences well, and it’s reasonable to say that if you want to reduce/eliminate UB, maybe you shouldn’t allow constructs like fences that create situations where you can’t even tell whether UB is happening anymore, or on which architectures.

                      For the few use-cases where you really want to do this stuff, and you’re really sure it will work correctly on every targeted architecture (are you??), you’re probably going to drop into arch-specific (and likely, asm for at least some arches) code to implement the lower level of things like seqlocks or RCU constructs, where you’re free to make specific assumptions about the memory model guarantees of particular CPUs, and then others can just consume it as a library.

                      1. 4

                        you’re probably going to drop into arch-specific (and likely, asm for at least some arches) code to implement the lower level of things like seqlocks or RCU constructs

                        The whole point of having it in the language is so you don’t have to implement fences for N cpu arches, and so the compiler doesn’t go behind your back and try to rearrange loads/stores.

                        1. 5

                          Yes, ideally. But the C11 model isn’t trustworthy as a general abstraction in the first place. In very specific cases, on known hardware architectures, it is apparently possible to craft trustworthy, efficient mechanisms (e.g. seqlock, RCU, etc), as observed in e.g. the Linux kernel and some libraries like liburcu, which do not rely on the C11 memory model. But arguably, it is not possible to do so reliably and efficiently in an architecture-neutral way by staying out at the “C11 model” layer of abstraction. There is perhaps a safe subset of the C11 model where you avoid certain things (like fences which sequence relaxed atomics, and certain mis-uses of acquire/release), but you’re not gonna reach peak hardware efficiency in that subset.

                          The safest subset of the C11 model is just to stick to seq_cst ops on specific memory locations. The deeper you stray, the more sanity questions arise. I “trust”, for example, liburcu, because it is very aware of low-level arch details, doesn’t rely on just the abstract C11 model, is authored and maintained by a real pro at this stuff, and has been battle-tested for a long time.

                          Given this state of affairs, IMHO in the non-kernel C world you expect the compiler or a very solid library like the above to implement advanced efficient constructs like seqlocks or RCU. It’s (IMHO) probably not sane to try to roll your own on top of the abstract C11 atomics model (in C or Zig, either way!) in a maximally-efficient way and just assume it will all be fine.
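                          To illustrate that “safest subset,” a sketch using only default (seq_cst) atomic operations on specific locations, with no standalone fences and nothing relaxed (the `publish`/`consume` names are made up for illustration):

                          ```c
                          #include <assert.h>
                          #include <stdatomic.h>
                          #include <stdbool.h>

                          static atomic_int payload;
                          static atomic_bool ready;

                          static void publish(void) {
                              atomic_store(&payload, 42); /* atomic_store defaults to seq_cst */
                              atomic_store(&ready, true);
                          }

                          static int consume(void) {
                              while (!atomic_load(&ready)) /* seq_cst load */
                                  ;
                              return atomic_load(&payload);
                          }

                          int main(void) {
                              publish();
                              assert(consume() == 42);
                              return 0;
                          }
                          ```

                          Every operation is a seq_cst access to a named atomic object, so the reasoning stays simple; the cost is an unnecessary full ordering on hardware where acquire/release (or less) would have sufficed.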

                          1. 4

                            To be clear, the linux kernel and liburcu don’t use the C11 memory model. They have their own atomics and barriers implemented in assembler that predate C11. liburcu is largely a port of the linux kernel primitives to userland.

                        2. 2

                          However, that being said, the current state of affairs with all related things is kind of a hot mess all over. It turns out the C/C++11 memory models are basically useless and broken anyways

                          How so?

                          1. 8

                            I can’t speak to exactly what the parent comment is saying, but I do know memory_order_consume was finally removed in C++26, having never been implemented correctly despite several attempts since C++11 introduced it. It’s been a lot, and IIRC it, along with a hardware issue, also affected the transactional memory technical specification.

                            There’s also been more than a few cases in the mid-late 10s of experts giving talks on the memory model at CppCon only for someone in the crowd to notice a bug that just derails the whole talk as everyone realizes the subject matter is no longer correct.

                            1. 5

                              If you want the long version that re-treads some ground you probably already understand, there’s an amazingly deep 3-part series from a few years ago by Russ Cox that’s worth reading: https://research.swtch.com/mm .

                              If you want the TL;DR link path out of there to some relevant and important insights, you can jump down partway through part 2 around https://research.swtch.com/plmm#acqrel (and a little further down as well in https://research.swtch.com/plmm#relaxed ) to see Russ’s thoughts on this topic with some backup research. I’ll quote a lengthy key passage here:

                              In their paper “Common Compiler Optimisations are Invalid in the C11 Memory Model and what we can do about it” (2015), Viktor Vafeiadis and others showed that [… bad things happen in an earlier example …]

                              See the paper for the details, but at a high level, the C++11 spec had some formal rules trying to disallow out-of-thin-air values, combined with some vague words to discourage other kinds of problematic values. Those formal rules were the problem, so C++14 dropped them and left only the vague words. Quoting the rationale for removing them, the C++11 formulation turned out to be “both insufficient, in that it leaves it largely impossible to reason about programs with memory_order_relaxed, and seriously harmful, in that it arguably disallows all reasonable implementations of memory_order_relaxed on architectures like ARM and POWER.”

                              To recap, Java tried to exclude all acausal executions formally and failed. Then, with the benefit of Java’s hindsight, C++11 tried to exclude only some acausal executions formally and also failed. C++14 then said nothing formal at all. This is not going in the right direction.

                              In fact, a paper by Mark Batty and others from 2015 titled “The Problem of Programming Language Concurrency Semantics” gave this sobering assessment:

                              Disturbingly, 40+ years after the first relaxed-memory hardware was introduced (the IBM 370/158MP), the field still does not have a credible proposal for the concurrency semantics of any general-purpose high-level language that includes high-performance shared-memory concurrency primitives.

                              Even defining the semantics of weakly-ordered hardware (ignoring the complications of software and compiler optimization) is not going terribly well. A paper by Sizhuo Zhang and others in 2018 titled “Constructing a Weak Memory Model” recounted more recent events:

                              Sarkar et al. published an operational model for POWER in 2011, and Mador-Haim et al. published an axiomatic model that was proven to match the operational model in 2012. However, in 2014, Alglave et al. showed that the original operational model, as well as the corresponding axiomatic model, ruled out a newly observed behavior on POWER machines. For another instance, in 2016, Flur et al. gave an operational model for ARM, with no corresponding axiomatic model. One year later, ARM released a revision in their ISA manual explicitly forbidding behaviors allowed by Flur’s model, and this resulted in another proposed ARM memory model. Clearly, formalizing weak memory models empirically is error-prone and challenging.

                              The researchers who have been working to define and formalize all of this over the past decade are incredibly smart, talented, and persistent, and I don’t mean to detract from their efforts and accomplishments by pointing out inadequacies in the results. I conclude from those simply that this problem of specifying the exact behavior of threaded programs, even without races, is incredibly subtle and difficult. Today, it seems still beyond the grasp of even the best and brightest researchers. Even if it weren’t, a programming language definition works best when it is understandable by everyday developers, without the requirement of spending a decade studying the semantics of concurrent programs.

                          2. 1

                            Do you mean that using @atomicStore/@atomicLoad on the lock’s sequence number with the same AtomicOrder for @fence would not be equivalent? If not, can you say more about why?

                            1. 2

                              I mean stuff like Linux’s seqcount_t. Used for write-mostly workloads like statistics counting. To implement them you at least need a read barrier and a write barrier, since acquire/release operations put the barrier on the wrong side of the load/store.
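                              As a rough illustration of where those standalone barriers sit, here is a minimal single-writer seqlock sketch using C11 fences (the names are invented, and the plain payload accesses are formally data races under the C11 model, which is part of the point):

                              ```c
                              #include <assert.h>
                              #include <stdatomic.h>

                              static atomic_uint seq;   /* even = stable, odd = write in progress */
                              static int data0, data1;  /* protected payload (plain accesses) */

                              /* Single writer. */
                              static void seqlock_write(int a, int b) {
                                  unsigned s = atomic_load_explicit(&seq, memory_order_relaxed);
                                  atomic_store_explicit(&seq, s + 1, memory_order_relaxed);
                                  /* write barrier: the odd count must be visible
                                     before the payload stores */
                                  atomic_thread_fence(memory_order_release);
                                  data0 = a;
                                  data1 = b;
                                  /* release store: payload visible before the even count */
                                  atomic_store_explicit(&seq, s + 2, memory_order_release);
                              }

                              static void seqlock_read(int *a, int *b) {
                                  unsigned s1, s2;
                                  do {
                                      s1 = atomic_load_explicit(&seq, memory_order_acquire);
                                      *a = data0;
                                      *b = data1;
                                      /* read barrier: the payload loads must complete before
                                         the re-check; an acquire *load* orders the wrong side */
                                      atomic_thread_fence(memory_order_acquire);
                                      s2 = atomic_load_explicit(&seq, memory_order_relaxed);
                                  } while (s1 != s2 || (s1 & 1u));
                              }

                              int main(void) {
                                  int a, b;
                                  seqlock_write(10, 20);
                                  seqlock_read(&a, &b);
                                  assert(a == 10 && b == 20);
                                  return 0;
                              }
                              ```

                              The two standalone fences are exactly the spots where an acquire load or release store on the sequence counter cannot express the ordering, which is why removing fences with no replacement hurts this pattern.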

                          3. 4

                            ooh very nice

                            liking some of the syntax changes here, and while a lot of the compiler backend stuff is absolutely not applicable to me (aarch64 macos moment) it’s still exciting

                            some of the build system api changes look neat as well, but i’ll have to play with them to be sure

                            1. 2

                              What’s the status of the non-llvm backend? I see they have a goal of getting the x86 backend ready for debug mode. Anyone know how that is all coming along?

                              1. 3

                                Did you have any specific questions beyond what the dedicated section says?

                                1. 2

                                  Ah thanks. I skimmed the article but it was so large and nestled there as a small update! Glad to hear the progress is going well.

                                  I’m particularly curious about the plan to match LLVM performance. I know that’s not on the roadmap anytime soon, but building a competitive optimizer while retaining fast compilation times seems challenging.

                              2. 1

                                Link to “Optimized memcpy” is broken, @andrewrk