1. 20

  2. 10

    Man, I really wish I understood this blog post.

    1. 6

      This website breaks the back button on Firefox mobile. It’s very annoying.

      Also, am I correct in thinking he is suggesting that Ordering::Relaxed should require unsafe? That would be quite the breaking change.
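      For context, relaxed atomics are usable in safe Rust today, so gating them behind unsafe would indeed break existing code. A minimal sketch of current safe usage (my example, not from the post):

      ```rust
      use std::sync::atomic::{AtomicUsize, Ordering};
      use std::thread;

      // Relaxed ordering: each fetch_add is still atomic, but no
      // happens-before edges are created with respect to other memory
      // operations.
      fn relaxed_count(threads: usize, iters: usize) -> usize {
          let counter = AtomicUsize::new(0);
          thread::scope(|s| {
              for _ in 0..threads {
                  s.spawn(|| {
                      for _ in 0..iters {
                          counter.fetch_add(1, Ordering::Relaxed);
                      }
                  });
              }
          });
          counter.load(Ordering::Relaxed)
      }

      fn main() {
          println!("{}", relaxed_count(4, 1000)); // prints 4000
      }
      ```

      Atomicity is preserved even with Relaxed; what you give up is any ordering guarantee for surrounding non-atomic accesses.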

      1. 2

        If you hold the back button, a list of previous pages should be shown, allowing you to skip the faulty one.

      2. 2

        What I’ve never understood is, who cares? Shared memory concurrency is more abstract, more complex, and less performant (has worse locality) than message passing. So what if these memory models are so complex? You shouldn’t be doing shared memory concurrency anyway, you should be asking CPU vendors for explicit inter-core message passing support that bypasses cache coherency. Then you can build whatever memory model you want in “userspace”.

        1. 13

          > You shouldn’t be doing shared memory concurrency anyway, you should be asking CPU vendors for explicit inter-core message passing support that bypasses cache coherency. Then you can build whatever memory model you want in “userspace”.

          How is your message passing implemented? Most high-performance message-passing systems use shared memory to build zero-copy lock-free data structures for the message passing. Implementing these correctly requires a memory model that you can reason about. So, to answer your ‘who cares?’ question: it’s those of us who are implementing the message-passing abstractions that you want to use.
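          A concrete taste of what those lock-free structures rest on: even a one-slot zero-copy mailbox is only correct because of the memory model’s release/acquire guarantees. A minimal sketch in Rust (the names are mine):

          ```rust
          use std::cell::UnsafeCell;
          use std::sync::atomic::{AtomicBool, Ordering};
          use std::thread;

          // One-slot mailbox: the producer fills the slot with a plain store
          // and publishes it with a Release store; the consumer's Acquire
          // load is what guarantees it then sees the payload. With Relaxed
          // ordering here, the read of `slot` would be a data race.
          struct Mailbox {
              full: AtomicBool,
              slot: UnsafeCell<u64>,
          }
          unsafe impl Sync for Mailbox {}

          fn send_and_receive() -> u64 {
              let mb = Mailbox { full: AtomicBool::new(false), slot: UnsafeCell::new(0) };
              thread::scope(|s| {
                  s.spawn(|| {
                      unsafe { *mb.slot.get() = 42 };         // write payload in place
                      mb.full.store(true, Ordering::Release); // publish it
                  });
                  while !mb.full.load(Ordering::Acquire) {}   // wait for the publish
                  unsafe { *mb.slot.get() }                   // guaranteed to read 42
              })
          }

          fn main() {
              println!("{}", send_and_receive()); // prints 42
          }
          ```

          Every message-passing abstraction built on shared memory bottoms out in reasoning like this.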

          > You shouldn’t be doing shared memory concurrency anyway, you should be asking CPU vendors for explicit inter-core message passing support that bypasses cache coherency. Then you can build whatever memory model you want in “userspace”.

          I agree in principle but it’s not that simple. The hypervisor and kernel are both virtualising the CPU for you. The hypervisor gives the OS VCPUs that map N:M to cores over time. The OS provides threads, which map N:M to VCPUs. A language runtime for a message-passing language provides actors or similar that map N:M to threads. To do efficient message passing, you need a namespace for cores that supports this kind of virtualisation, such that I can say ‘send a message to actor {foo}’ and the hardware knows where to route that message even if the actor is not currently scheduled (or if it is scheduled on a thread but that thread isn’t scheduled, or if the thread is scheduled but on a VCPU that is not scheduled).

          At the moment, the way we identify programmable endpoints is via virtual memory. My actor’s message queue is identified by a virtual memory address, which is bound to a thread purely by software; the thread is bound to a VCPU by the page-table base register, and the VCPU to a PCPU by the second-level address-translation page-table base register. The cache coherency bus is basically a message-passing interface, so when you do a store to that message queue you’re really doing a buffered store to something that the other actor can pick up when it’s scheduled. It’s not at all clear (in spite of researchers looking at this problem for at least 40 years) that there’s a better abstraction.

          1. 2

            > How is your message passing implemented?

            By using the message-passing support that the CPU already supports internally, and which it uses to implement the abstraction of shared memory.

            The other approach - of implementing message passing on top of shared memory on top of message passing - is a ridiculous abstraction inversion. The fact that it requires us to develop a rich memory model for the entirely superfluous step of shared memory in the middle is an indication of how ridiculous it is.

            > The cache coherency bus is basically a message-passing interface, so when you do a store to that message queue you’re really doing a buffered store to something that the other actor can pick up when it’s scheduled. It’s not at all clear (in spite of researchers looking at this problem for at least 40 years) that there’s a better abstraction.

            Wait, I think this is confusing the issues. If the cache coherence bus was literally only an interface to buffered, virtualizable message passing, with programmable endpoints implemented through virtual memory - then everything would be fine. It’s because we also implement shared memory with that bus that we run into problems and have to come up with a memory model.

            At least, that’s how it seems to me? Like I said, I don’t understand why everyone cares about memory models when this seems like a much simpler alternative direction.

            1. 7

              In reverse order:

              > At least, that’s how it seems to me? Like I said, I don’t understand why everyone cares about memory models when this seems like a much simpler alternative direction.

              This was part of my ‘C Is Not a Low-Level Language’ article, and the reason people care about memory models really boils down to the fact that the hardware and ‘low-level’ languages are trying to maintain the illusion that you have a fast PDP-11. If you want to scale up a language whose abstract machine is a virtualised PDP-11 to multiple cores, you need shared memory. If you want to program shared memory, you need a memory model. If you want to interoperate with any code written in this style, you need to care about that memory model.

              > Wait, I think this is confusing the issues. If the cache coherence bus was literally only an interface to buffered, virtualizable message passing, with programmable endpoints implemented through virtual memory - then everything would be fine. It’s because we also implement shared memory with that bus that we run into problems and have to come up with a memory model.

              The cache coherency bus is a message-passing protocol. On top of this is a cache-coherency protocol that works with physical addresses as the identifying tokens defining what data needs to be sent where. The virtual memory abstraction is built on top of this. It’s not clear (in spite of many researchers trying to build systems that expose message-passing primitives over the last few decades) what a virtualised message-queue mechanism should look like. In particular, you also want it to support zero copy, so you actually do want something like shared memory, just with explicit transfer of ownership. The transfer of ownership part becomes a bit tricky if your messages contain complex data structures because you need an O(n) walk of the structure to identify which cache lines (or other memory granules) to include in the message, which is almost as bad as a copy.
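              The ‘shared memory with explicit transfer of ownership’ idea is visible in miniature in Rust’s channels: a send moves the allocation, so nothing is copied, and the unchanged heap address is the tell. (A sketch of the easy single-allocation case; the O(n)-walk problem is precisely about messages that span many allocations.)

              ```rust
              use std::sync::mpsc;
              use std::thread;

              // Ownership transfer instead of copying: sending a heap-allocated
              // message through a channel hands over the pointer. The payload
              // itself is never copied, as the unchanged buffer address shows.
              fn transfer_addresses() -> (usize, usize) {
                  let (tx, rx) = mpsc::channel();
                  let msg = vec![0u8; 1 << 20];                 // ~1 MiB payload
                  let addr_before = msg.as_ptr() as usize;
                  thread::spawn(move || tx.send(msg).unwrap()); // a move, not a copy
                  let received = rx.recv().unwrap();
                  (addr_before, received.as_ptr() as usize)
              }

              fn main() {
                  let (before, after) = transfer_addresses();
                  assert_eq!(before, after); // same allocation on both sides
                  println!("zero-copy transfer confirmed");
              }
              ```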

              > By using the message-passing support that the CPU already supports internally, and which it uses to implement the abstraction of shared memory.

              And this is fine if your messages are 64 bytes or smaller, or can be streamed as chunks of 64 bytes. If you want to be able to prepare a message and then send it to another core and have that core mutate it and pass it somewhere else, then just sending a NoC message for the language’s message send is not sufficient; you also need to serialise the message. Or you can define a memory model that guarantees that the receiving core will see the message if it looks at it after it receives the pointer to the message.
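              That last option (send only the pointer, and let the memory model make the pointed-to bytes visible) might look like this in Rust; a sketch, not anyone’s production code:

              ```rust
              use std::ptr;
              use std::sync::atomic::{AtomicPtr, Ordering};
              use std::thread;

              // The sender prepares a message in ordinary memory, then "sends"
              // just its address. The Release store paired with the Acquire
              // load is the memory-model guarantee that lets the receiver look
              // through the pointer and see the fully written payload, with no
              // serialisation step.
              struct Message {
                  payload: [u64; 4],
              }

              fn send_pointer_and_sum() -> u64 {
                  let mailbox = AtomicPtr::new(ptr::null_mut());
                  thread::scope(|s| {
                      s.spawn(|| {
                          let msg = Box::new(Message { payload: [1, 2, 3, 4] });
                          mailbox.store(Box::into_raw(msg), Ordering::Release); // publish address
                      });
                      loop {
                          let p = mailbox.load(Ordering::Acquire);
                          if !p.is_null() {
                              let msg = unsafe { Box::from_raw(p) }; // take ownership back
                              return msg.payload.iter().sum();
                          }
                      }
                  })
              }

              fn main() {
                  println!("{}", send_pointer_and_sum()); // prints 10
              }
              ```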

              > The other approach - of implementing message passing on top of shared memory on top of message passing - is a ridiculous abstraction inversion. The fact that it requires us to develop a rich memory model for the entirely superfluous step of shared memory in the middle, is an indication of how ridiculous it is.

              You’re not wrong, but I think you’re underestimating the difficulty of building a sensible model that supports sending messages that are actually useful to software, directly on top of the SoC’s network. I’d suggest that you grab one of the open-source RISC-V cores (I like Toooba for this kind of thing), or even gem5, and try building something. If you can make something that scales well to 1024 cores and has a programmer model that is easy to use, it would easily win the best paper award at ASPLOS.

              1. 2

                > The transfer of ownership part becomes a bit tricky if your messages contain complex data structures because you need an O(n) walk of the structure to identify which cache lines (or other memory granules) to include in the message, which is almost as bad as a copy.

                > If you want to be able to prepare a message and then send it to another core and have that core mutate it and pass it somewhere else, then just sending a NoC message for the language’s message send is not sufficient, you also need to serialise the message.

                Yeah, but not doing this is exactly why message passing is more efficient - sure, you can do shared memory over big data structures that are fragmented across your address space, but that’s slow. Sending large messages (through the same mechanism as small messages, anyway) is not desirable, because transferring large amounts of data between cores/main memory is not desirable, whether that’s through shared memory or message passing.

                But I suppose that’s not immediately obvious. So, thanks to your comment, I think I can see where memory models/shared memory are coming from: wanting a single mechanism that scales up to transferring large data structures and down to small messages.

                So in this sense, a memory model is a serialization mechanism - in both meanings of “serialization”! :)

                > the reason people care about memory models really boils down to the fact that the hardware and ‘low-level’ languages are trying to maintain the illusion that you have a fast PDP-11

                I certainly deeply agree with your core idea here. But I do have one nitpick: in retrospect, it’s not clear to me that shared memory was the natural generalization of C-on-Unix-on-PDP-11 to multiple cores. Unix didn’t start with shared memory; it started with pipes and a filesystem and processes. The shared memory idea was added later - of course, we’re stuck with it now, and everyone is trying to speed it up as much as possible.

                1. 2

                  > Yeah, but not doing this is exactly why message passing is more efficient - sure you can do shared memory over big datastructures that are fragmented across your address space, but that’s slow.

                  It’s not very slow on modern processors. In the worst case, you miss in all the caches, but a load from a remote cache in the same node is fairly cheap and you can hide the latency with speculative execution or SMT. Importantly, you do it only for the parts of the data structure that are actually used by the recipient. This lets you have a model of moving the compute to the data, rather than the other way around, which tends to be more ergonomic and more efficient.

                  > Sending large messages (through the same mechanism as small messages, anyway) is not desirable, because transferring large amounts of data between cores/main memory is not desirable, whether that’s through shared memory or message passing.

                  As you point out later, it’s really eliminating the serialisation requirement that’s the hard part. With message sending, you have an implicit copy to implement a move. With shared memory, you can do lazy serialisation: you don’t need to copy the data until just before the data is accessed. You can treat the non-cache memory as a multi-producer, multi-consumer queue with arbitrary reordering, so things that the consumer doesn’t need immediately go there to be picked out later.

                  You might be interested in what we’re doing with Verona. The programmer model fits well with the low-level bits of a cache coherency protocol: all data is either mutable and guaranteed not to alias between concurrent units of execution, or immutable and therefore safe to copy to every core / node in a distributed system. I’m very interested in how we can evolve datacenter memory designs to allow us to scale this kind of design up to entire racks or even entire datacenters.
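                  For readers who know Rust, its ownership rules enforce a roughly similar split, which gives a feel for the model (my analogy, not Verona code):

                  ```rust
                  use std::sync::Arc;
                  use std::thread;

                  // "Mutable xor shared", sketched in Rust terms: mutable data
                  // is moved to exactly one thread (so it can be mutated with
                  // no synchronisation at all), while immutable data behind an
                  // Arc is freely shared with every thread.
                  fn demo() -> u64 {
                      let shared = Arc::new(vec![1u64, 2, 3]);  // immutable, safe to share
                      let exclusive = vec![10u64; 3];           // mutable, never aliased

                      let shared_view = Arc::clone(&shared);
                      let handle = thread::spawn(move || {
                          let mut owned = exclusive;            // exclusive ownership moves in
                          for (dst, src) in owned.iter_mut().zip(shared_view.iter()) {
                              *dst += src;                      // mutate without locks
                          }
                          owned.iter().sum::<u64>()             // 11 + 12 + 13
                      });
                      handle.join().unwrap()
                  }

                  fn main() {
                      println!("{}", demo()); // prints 36
                  }
                  ```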

                  1. 1

                    > This lets you have a model of moving the compute to the data, rather than the other way around, which tends to be more ergonomic and more efficient.

                    For sure - this is very much how I think about it - I was writing out something about this in my previous comment (as a reply to “the difficulty of building a sensible model”) but ended up deleting it.

                    One of my big interests is in making large-scale distributed programming easier: being able to deal with the locations of distributed resources, and solving the code mobility problem, all at the language level using a sophisticated type system.

                    It may be over-applying my one idea, but I think this is also a good way to deal with SoC-scale programming. Having explicit low-level models in the type system of where resources are located, and therefore what code can access them. And make it straightforward to talk about moving code and resources between cores, and orchestrating the network of communicating cores. (And virtualize at the language-level, with object-capability-security and a trusted compiler.) Of course the big difference is that distributed programming is more complicated, because it has multiple failure domains, but I’m more interested in the SoC-scale single-failure-domain case anyway.

                    So I think that would be a good approach to make a model for message-passing-networks-of-cores that really works. But I haven’t really had the chance to work at the SoC scale, and don’t really know anyone working on this.

                    > You might be interested in what we’re doing with Verona. The programmer model fits well with the low-level bits of a cache coherency protocol: all data is either mutable and guaranteed not to alias between concurrent units of execution, or immutable and therefore safe to copy to every core / node in a distributed system. I’m very interested in how we can evolve datacenter memory designs to allow us to scale this kind of design up to entire racks or even entire datacenters.

                    Yes, that is interesting. From reading through https://microsoft.github.io/verona/explore.html, Verona has a potent combination of features. It would be great if memory was designed around being programmed with a Verona-like model, rather than a totally-uncontrolled-arbitrary-sharing C model. That may not be the message passing I want, but it would surely be a vast improvement.

                  2. 1

                    You are the first person I have ever encountered who claims that message passing is more efficient than shared memory. The claim also contravenes my experience. The fundamental difference between the two paradigms is surely that message passing necessarily performs a copy, is that not true? If yes, then how can it be more efficient than the alternative? If no, then is it not by definition shared memory?

                    1. 1

                      It’s all about locality. Read, for example, https://dl.acm.org/doi/pdf/10.1145/3132747.3132771

                      1. 1

                        Well, I can’t read that, unfortunately… does “ffwd” perform a copy?

                        edit: I found The Morning Paper coverage – as I understand it, the approach is similar to the actor model, where data is owned by a single thread, which acts as a natural synchronization point. That’s, I guess, a hybrid approach, neither message passing nor shared memory, right? I’ve used it quite successfully in a number of projects, but it doesn’t fully generalize, in my experience…
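                        ffwd’s specifics aside, the single-owner pattern described above is easy to sketch generically: one thread owns the data outright (so it needs no locks) and everyone else talks to it over channels. A sketch in Rust (the names are mine):

                        ```rust
                        use std::collections::HashMap;
                        use std::sync::mpsc;
                        use std::thread;

                        // Single-owner delegation: the owning thread is the only
                        // one that ever touches the map, so no locking is needed;
                        // other threads interact with it purely via messages,
                        // each Get carrying its own reply channel.
                        enum Request {
                            Put(String, u64),
                            Get(String, mpsc::Sender<Option<u64>>),
                        }

                        fn spawn_owner() -> mpsc::Sender<Request> {
                            let (tx, rx) = mpsc::channel();
                            thread::spawn(move || {
                                let mut data: HashMap<String, u64> = HashMap::new();
                                for req in rx {
                                    match req {
                                        Request::Put(k, v) => { data.insert(k, v); }
                                        Request::Get(k, reply) => { let _ = reply.send(data.get(&k).copied()); }
                                    }
                                }
                            });
                            tx
                        }

                        fn main() {
                            let owner = spawn_owner();
                            owner.send(Request::Put("x".into(), 7)).unwrap();
                            let (reply_tx, reply_rx) = mpsc::channel();
                            owner.send(Request::Get("x".into(), reply_tx)).unwrap();
                            assert_eq!(reply_rx.recv().unwrap(), Some(7));
                            println!("ok");
                        }
                        ```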

          2. 1

            The other entries regarding Rust also explore various aspects of Rust’s memory model from the viewpoint of integrating it into the kernel: https://paulmck.livejournal.com/tag/rust (And I think these are far more interesting problems than any of the “what about Rust and GCC” or “where is the specification” discussions we saw lately.)

            1. 8

              Note that when you say ‘the kernel’ you mean ‘the Linux kernel’. FreeBSD uses the C++11 memory model in the kernel, and more recently kernels such as Zircon have also adopted it, because for any post-2011 system your choices are:

              • Use the C++11 memory model, which has been subjected to far more formal verification effort than any other memory model and which is the target for modern ISAs (ARMv8.3 has instructions specifically added to align the Arm memory model with the C++11 model).
              • Use something different.

              If you pick the second option, be prepared to invest a few million dollars in verifying your memory model, more in understanding how it interacts with other models, and then good luck trying to persuade CPU vendors to optimise for it.