1. 9
  1. 1

    I’m really curious how they determined that an object is shared across threads. In my benchmarking, a load and branch followed by a store cost more than an uncontended atomic RMW, especially outside of microbenchmarks (extra branch predictor state causes performance cliffs earlier). The only way that I can see this working is if you have a static type system that lets you prove at compile time that a code block is never invoked with shared objects.
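
    Concretely, the two retain paths I have in mind look roughly like this (the Object layout and function names are made up for illustration, not taken from any runtime):

        #include <atomic>
        #include <cstdint>

        struct Object {
            std::atomic<uint64_t> refcount{1};
            bool shared = false;  // set once the object escapes to another thread
        };

        // A: load + branch, then (usually) a plain load/add/store sequence.
        inline void retain_branchy(Object* o) {
            if (!o->shared)
                o->refcount.store(o->refcount.load(std::memory_order_relaxed) + 1,
                                  std::memory_order_relaxed);
            else
                o->refcount.fetch_add(1, std::memory_order_relaxed);
        }

        // B: unconditional atomic RMW (a single LOCK ADD on x86), uncontended here.
        inline void retain_atomic(Object* o) {
            o->refcount.fetch_add(1, std::memory_order_relaxed);
        }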

    1. 1

      I would be curious to see your benchmarks. It is true that you can get problems with aliasing or eviction once you venture past microbenchmarks, but agner quotes 20 cycles for a cas on intel, which is at least as much as I would expect for a mispredicted branch. See also biased reference counting for swift.
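
      For context, here is a much-simplified sketch of the biased scheme (illustrative names and layout, not Swift’s actual runtime; ownership handoff and negative counts, which the real algorithm handles, are ignored): the owning thread bumps a plain counter and everyone else falls back to an atomic one.

          #include <atomic>
          #include <thread>

          struct BiasedRC {
              std::thread::id owner = std::this_thread::get_id();
              long biased = 1;              // owner-only count: plain ADD/SUB
              // Non-owner count lives in the upper bits; bit 0 records that the
              // owner has given up its bias ("merged").
              std::atomic<long> shared{0};

              void retain() {
                  if (std::this_thread::get_id() == owner)
                      ++biased;                                        // no lock prefix
                  else
                      shared.fetch_add(2, std::memory_order_relaxed);  // LOCK ADD
              }

              void release() {
                  if (std::this_thread::get_id() == owner) {
                      if (--biased == 0 &&
                          shared.fetch_add(1, std::memory_order_acq_rel) == 0)
                          delete this;   // no non-owner references were ever taken
                  } else if (shared.fetch_sub(2, std::memory_order_acq_rel) == 3) {
                      delete this;       // last non-owner reference, owner already merged
                  }
              }
          };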

      1. 1

        CAS is slow, but the last time I measured it, LOCK ADD cost about three times as much as ADD (it was seven times on the previous generation) in the uncontended case. This makes sense when you think about how it’s implemented: on a TSO architecture, a store has to ensure that the cache line is in the exclusive state anyway, and so the only extra cost is deferring the result of the load (pipeline back pressure) until you have ensured that you own the cache line.
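
        If you want to reproduce the comparison, a loop along these lines is roughly what I mean (x86-64 with GCC/Clang extended asm assumed; the exact ratio varies by microarchitecture, and a tight loop like this deliberately ignores the non-local effects discussed below):

            #include <chrono>
            #include <cstdint>
            #include <cstdio>

            int main() {
                constexpr long iters = 100000000;
                uint64_t counter = 0;

                auto t0 = std::chrono::steady_clock::now();
                for (long i = 0; i < iters; ++i)
                    asm volatile("addq $1, %0" : "+m"(counter));      // plain ADD
                auto t1 = std::chrono::steady_clock::now();
                for (long i = 0; i < iters; ++i)
                    asm volatile("lock addq $1, %0" : "+m"(counter)); // LOCK ADD
                auto t2 = std::chrono::steady_clock::now();

                auto ns_per_op = [&](auto a, auto b) {
                    return std::chrono::duration<double, std::nano>(b - a).count() / iters;
                };
                std::printf("add:      %.2f ns/op\n", ns_per_op(t0, t1));
                std::printf("lock add: %.2f ns/op\n", ns_per_op(t1, t2));
            }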

        The branch mispredict cost is much harder to measure because it isn’t so local. You have two options for the implementation. If the refcount manipulation is in a separate function then it’s going to really hurt, because your branch is now data dependent. Fortunately, most modern branch predictors mix some of the call stack state into the hash for the prediction (or do something equivalent using neural networks), because otherwise vtable-based dispatch performance is terrible (I was sad when I invented this technique to learn that it had been patented a decade earlier, but I think the patents have expired now). This means that you’ll still end up with some false aliasing and use more state. If you inline the refcount manipulations then you increase the chance of I-cache misses (which can easily dominate performance), but reduce the chance of false aliasing. For functions that are called with both shared and unshared objects, you will still confuse the branch predictor, and the misprediction may end up costing you a lot more.
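
        The two placements look roughly like this (GCC/Clang attributes assumed; Object is the same made-up layout as in my earlier sketch):

            #include <atomic>

            struct Object { std::atomic<long> refcount{1}; bool shared = false; };

            // The branch itself, shared by both shapes below.
            __attribute__((always_inline)) static inline void do_retain(Object* o) {
                if (o->shared)
                    o->refcount.fetch_add(1, std::memory_order_relaxed);
                else
                    o->refcount.store(o->refcount.load(std::memory_order_relaxed) + 1,
                                      std::memory_order_relaxed);
            }

            // Out of line: one branch site serves every caller, so whether it is
            // taken becomes data dependent on which caller's objects reach it.
            __attribute__((noinline)) void retain_outofline(Object* o) { do_retain(o); }

            // Inlined: every call site gets its own copy of the branch, which
            // reduces false aliasing in the predictor but costs I-cache space.
            __attribute__((always_inline)) inline void retain_inlined(Object* o) { do_retain(o); }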

        Atomic increment on a modern system is generally executed in speculation as if it were uncontended, and so you are trying to do something in software that the hardware is already doing. If your type system can give you information about which things are shared, then use that, because not doing the branch or the atomic op at all is definitely faster.
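
        For illustration, a compile-time version could look something like this (hypothetical types, not any particular language’s mechanism): the handle type records whether the object may be shared, so the increment is either a plain add or an atomic add, with no runtime branch at all.

            #include <atomic>
            #include <type_traits>

            template <bool Shared>
            struct RefCount {
                // Atomic only when the type says the object may cross threads.
                std::conditional_t<Shared, std::atomic<long>, long> count{1};

                void retain() {
                    if constexpr (Shared)
                        count.fetch_add(1, std::memory_order_relaxed);  // LOCK ADD
                    else
                        ++count;                                        // plain ADD
                }
            };

            using UniqueRC = RefCount<false>;  // provably thread-local
            using SharedRC = RefCount<true>;   // may escape to other threads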

        I’m going to need to read the paper that you link to a lot more carefully, because the quoted costs for reference count manipulations are insane. I’ve never seen anything like that on reference counted systems that I’ve looked at and I can’t believe Swift is doing it that badly.

        1. 1

          CAS is slow, but the last time I measured it, LOCK ADD cost about three times as much as ADD (it was seven times on the previous generation) in the uncontended case

          Agner finds the two to have about the same performance. I would be surprised if either one were much more expensive than the other; why would cas be more expensive? If anything, a work-efficient adder will have slightly greater span than a compare circuit (but the effect will be very marginal either way, and will as like as not make the same cycle count).

          branch

          One thing worth mentioning: agner told me that never-taken branches will not take up btb space (and continue to be correctly predicted) on some uarchs. I have not looked into this closely yet (really should…), and didn’t ask for details, but that may be an important desideratum if true.

          most modern branch predictors mix some of the call stack state into the hash for the prediction … For functions that are called with both shared and unshared objects, you will still confuse the branch predictor

          The former should help with the latter, shouldn’t it? :)

          I’m going to need to read the paper that you link to a lot more carefully, because the quoted costs for reference count manipulations are insane. I’ve never seen anything like that on reference counted systems that I’ve looked at and I can’t believe Swift is doing it that badly.

          I do find it plausible that somebody naively looked at ‘perf’ output (or similar) and saw a lot of cycles spent on rc-twiddling instructions, but that that was just an artefact of reordering. The reduction in overall runtime on benchmarks seems harder to screw up, though.

          1. 1

            Agner finds the two to have about the same performance. I would be surprised if either one were much more expensive than the other; why would cas be more expensive?

            In the memory subsystem, they’re the same. The difference is in the pipeline interaction. Exactly how different they’ll be depends on the design of the rename logic, and that’s so full of trade secrets that I have no idea how Intel’s designs work and can’t talk about the ones that I do understand.

            One thing worth mentioning: agner told me that never-taken branches will not take up btb space

            In the simplest designs, branch predictors use a default prediction (e.g. forwards branches are not taken, backwards branches are) and record only cases where the default mispredicted at least once. With more complex designs, you often have multiple predictors and move branch sites between predictors depending on which one is giving the best predictions, with a default not-taken predictor being part of the pack and the one that you can leave sites parked on when not in use.

            The ‘no state’ aspect is somewhat misleading because of aliasing. There isn’t space for an entry for every branch, so there’s some form of hashing to go from branch site (typically branch site and some other state, such as return address and anything else that the designers feel like mixing in) to a slot. A branch that is never taken may alias one that is always taken and then end up de-training the prediction for that one. This is partly why branch predictors are such a pain to benchmark: you can increase prediction efficiency and make performance worse at the same time due to subtle changes in the aliasing properties.
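
            The aliasing problem is easy to see in a toy model (purely illustrative, nothing like a real design): a small table of two-bit counters indexed by a hash of the branch address and whatever caller state gets mixed in. Counters start at strongly not-taken, and two unrelated branches that hash to the same slot train against each other.

                #include <array>
                #include <cstddef>
                #include <cstdint>

                struct ToyPredictor {
                    // 2-bit saturating counters; 0 = strongly not-taken by default.
                    std::array<uint8_t, 1024> counters{};

                    static std::size_t slot(uint64_t branch_pc, uint64_t caller_state) {
                        // Any cheap mix will do; collisions here are the false aliasing.
                        return (branch_pc * 0x9E3779B97F4A7C15ull ^ caller_state) & 1023;
                    }

                    bool predict(uint64_t pc, uint64_t caller) const {
                        return counters[slot(pc, caller)] >= 2;  // taken if in the upper half
                    }

                    void update(uint64_t pc, uint64_t caller, bool taken) {
                        uint8_t& c = counters[slot(pc, caller)];
                        if (taken) { if (c < 3) ++c; }
                        else       { if (c > 0) --c; }
                    }
                };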

            The former should help with the latter, shouldn’t it? :)

            To a degree, yes. It depends on the depth of the functions. For example, if a calls b, which calls c (where the refcount branch lives), and d also calls b, and a passes a shared object while d passes a private one, then a predictor that uses only a one-deep return address as input into its hashing will get confused, because the call into c always comes from b. One that is two deep will predict correctly, but will end up failing to share state across identical calls.
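
            In code, with the refcount branch living in c so that both chains reach it through the same call site in b (names as in the sentence above; the Object layout is just for illustration):

                #include <atomic>

                struct Object { std::atomic<long> rc{1}; bool shared = false; };

                void c(Object* o) {             // the branch in question lives here
                    if (o->shared)
                        o->rc.fetch_add(1, std::memory_order_relaxed);
                    else
                        o->rc.store(o->rc.load(std::memory_order_relaxed) + 1,
                                    std::memory_order_relaxed);
                }

                void b(Object* o) { c(o); }     // the only call site of c

                void a() { Object s; s.shared = true;  b(&s); }  // branch always taken
                void d() { Object p; p.shared = false; b(&p); }  // branch never taken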

            I believe the Intel predictors use a neural network that takes a load of things and ends up dynamically weighting them in a way that I don’t understand (and, since it’s a neural network, I am not convinced the designers do either).

            I do find it plausible that somebody naively looked at ‘perf’ output (or similar) and saw a lot of cycles spent on rc-twiddling instructions, but that that was just an artefact of reordering

            That’s what surprises me. I’ve done sample-based profiling of refcounted things and never seen anything like this much overhead, even on systems where the refcount manipulations were in a separate function and had all of the function-call overhead added in. I don’t understand how the base case for Swift can be so bad and I wonder if there’s some compilation mode that does very slow things to make debugging easier.