1. 15
  1.  

    1. 5

      I wonder how long this kind of thing will remain a problem. One of the motivations for adding atomic read-modify-write operations in RISCy architectures has been to support operations on remote cache lines. If a line is in another core’s L1, you can build a cache coherency message that does an atomic ALU op on the remote core (borrowing one of its pipelines for a cycle) and sends the result back (you can also send back the original value and do the ALU op on both cores, which makes things faster).
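
      To make that concrete (a minimal C++ sketch, not from the original comment; the names are made up): a single atomic fetch-and-add is exactly this kind of self-contained ALU op. On RISC-V with the A extension it typically lowers to one amoadd instruction, which is the sort of operation a coherence protocol could in principle execute at whichever core currently owns the line instead of migrating the line.

      ```cpp
      #include <atomic>
      #include <cstdint>

      // Shared counter that many cores bump; the line holding it may be
      // resident in some other core's L1 at any given moment.
      std::atomic<uint64_t> counter{0};

      // One atomic read-modify-write. On RV64 with the A extension this
      // typically compiles to a single amoadd.d, i.e. a self-contained
      // ALU op that could be forwarded to the owning core rather than
      // pulling the cache line over to this one.
      uint64_t bump() {
          return counter.fetch_add(1, std::memory_order_relaxed);
      }
      ```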

      Some existing cache coherency protocols support remote stores, where you can write directly into a line in another core’s L1; the feature is intended for producer-consumer workloads. I’m not sure if these are in any mainstream cores yet.
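
      The software pattern those remote stores target looks roughly like this (a hedged C++ sketch assuming a single producer and single consumer; none of the names come from the thread). Today the producer’s writes pull the line into its own cache and the consumer then pulls it back; a remote-store facility would land the data directly in the consumer’s L1.

      ```cpp
      #include <atomic>
      #include <cstdint>

      // Minimal single-producer/single-consumer handoff.
      struct Slot {
          uint64_t payload = 0;
          std::atomic<bool> ready{false};
      };

      void produce(Slot& s, uint64_t value) {
          while (s.ready.load(std::memory_order_acquire)) {}  // wait for the slot to drain
          s.payload = value;                                   // write the data
          s.ready.store(true, std::memory_order_release);      // publish it
      }

      uint64_t consume(Slot& s) {
          while (!s.ready.load(std::memory_order_acquire)) {}  // wait for data
          uint64_t value = s.payload;                          // safe to read after acquire
          s.ready.store(false, std::memory_order_release);     // hand the slot back
          return value;
      }
      ```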

      Doing coherency at a finer granularity than a cache line is very expensive, but there are a bunch of techniques that can reduce the impact. Interestingly, if you do all of the, you ma6 find that 128 byte cache lines make more sense and can reduce overall power.
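
      (Purely as an illustration, not something from the comment above: the reason line granularity is visible to software at all is false sharing, so per-core data usually gets padded out to whatever the line size is. The constant below is an assumption; a 128 byte line would double it, and the padding with it.)

      ```cpp
      #include <atomic>
      #include <cstddef>
      #include <cstdint>

      // Assumed line size -- 64 bytes on most current cores.
      constexpr std::size_t kAssumedLineSize = 64;

      // Two counters packed into one line would "false share": every update
      // by one core invalidates the other core's copy even though they never
      // touch the same bytes. Padding each counter to its own line avoids the
      // line ping-pong, at the cost of memory.
      struct alignas(kAssumedLineSize) PerCoreCounter {
          std::atomic<uint64_t> value{0};
      };

      PerCoreCounter counters[8];  // one per core; the count is illustrative
      ```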

      1. 1

        Interestingly, if you do all of the, you ma6 find that 128 byte cache lines make more sense and can reduce overall power.

        Is there a typo or two here?

        1. 2

          Stupid iPad keyboard. It doesn’t account for parallax in tracking presses and so I often hit , instead of m, and a slight drag on the y makes it a 6.

        2. 1

          I think the intended text is

          Interestingly, if you do all of them, you may find that 128 byte cache lines make more sense and can reduce overall power.

          The “y”→“6” substitution is common on mobile keyboards.