1. 33
  1.  

  2. 5

    This is a fantastic talk! The idea that robust systems are inherently distributed systems seems so simple and obvious in hindsight. Distributed systems are difficult, and I have had upper managers claim that we need “more robust” software and less downtime, yet refuse to invest in projects which involve distributed algorithms or systems (have to keep that MVP!). I think Armstrong was right that in order to really build a robust system we need to design for millions of users, even if we only expect thousands (to start); otherwise the design is going to be wrong. Of course, this is counter-intuitive to modern Scrum and MVPs.

    Additionally, there is so much about Erlang/OTP/BEAM that seems cutting-edge, yet the technology has been around for a while. It will always be a wonder to me that Kubernetes has caught on (along with the absolutely crazy technology stack surrounding it) while Erlang has withered (despite having more features), although Elixir has definitely been gaining steam recently. Having used Kubernetes at the past two companies I’ve been at, it has been nothing but complicated and error-prone, but I guess that is just much of modern development.

    I have also been learning TLA+ on the side (partially just to have a leg to stand on when arguing that a quick and sloppy design is going to have faults when we scale up, and that we can’t just patch them out), and I think many of the ideas Lamport expresses in the TLA+ book mirror Armstrong’s thoughts. It is really unfortunate that the software field has already figured all of these things out, yet for some reason hardly anyone uses this knowledge. It is rare to find systems that are actually designed rather than just thrown together, and that will never lead to robust systems.

    Finally, I think this is where one of Rust’s main features is an under-appreciated super-power. Distributed systems are hard, because consistency is hard. Rust being able to have compile-time checks for data-races is huge in this respect because it allows us to develop small-scale distributed systems with ease. I think some of the projects bringing OTP ideas to Rust (Bastion and Ludicrous are two that come to mind) have the potential to build completely bullet-proof solutions, with the error-robustness of Erlang and the individual-component robustness of Rust.
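
    To make that concrete, here is a minimal sketch (my own example, not from the talk) of what those compile-time checks look like: the commented-out version is rejected by the compiler because two threads would mutate the same counter, and only the explicitly synchronized version builds:

    ```rust
    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        // let mut counter = 0;
        // thread::spawn(|| counter += 1); // rejected at compile time:
        // `counter` would be mutably captured by more than one thread

        // The compiler forces us to state how the data is shared:
        let counter = Arc::new(Mutex::new(0));
        let handles: Vec<_> = (0..4)
            .map(|_| {
                let counter = Arc::clone(&counter);
                thread::spawn(move || {
                    *counter.lock().unwrap() += 1;
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
        assert_eq!(*counter.lock().unwrap(), 4);
    }
    ```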

    1. 4

      No. Rust prevents data races, not race conditions. It is very important to note that Rust will not protect you from the general race-condition case. In distributed systems, you’ll be battling race conditions, which are incredibly hard to identify and debug. It is an open question if the complexity of Rust will get in the way of debugging a race condition (Erlang and Elixir are fantastic for debugging race conditions because they are simple, and there is very little to get in your way of understanding and debugging them).

      1. 2

        The parent post says Rust has compile-time checks for data races and makes no claim about race conditions. Did I miss something?

        1. 2

          When you are working with distributed systems, it’s race conditions you worry about, not data races. Misunderstanding the distinction is common.

          Distributed systems are hard, because consistency is hard. Rust being able to have compile-time checks for data-races is huge in this respect because it allows us to develop small-scale distributed systems with ease.

        2. 1

          Yes, Rust prevents data races, which is (as mentioned by another poster) what I wrote. However, Rust’s type system and ownership model do make race conditions rarer in my experience, since they require data passed between threads to be explicitly wrapped in an Arc and potentially a Mutex. It is also generally easier to use a library such as Rayon or Crossbeam to handle simple multithreaded cases, or to just use message passing.
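
          For instance, a quick sketch of the Rayon case (assuming the rayon crate as a dependency): the closure passed to the parallel iterator only borrows the input immutably, so the borrow checker rules out a data race without any explicit locking:

          ```rust
          use rayon::prelude::*; // assumes rayon as a dependency

          fn main() {
              let inputs: Vec<u64> = (0..1_000_000).collect();
              // Each element is processed on the thread pool; the closure
              // only borrows `inputs` immutably, so no data race is possible.
              let total: u64 = inputs.par_iter().map(|x| x * x).sum();
              println!("{total}");
          }
          ```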

          Additionally, most race conditions are caused by data races, so… yes, Rust does prevent a certain subsection of race conditions, but not all of them. It is no less a superpower.

          It is an open question if the complexity of Rust will get in the way of debugging a race condition (Erlang and Elixir are fantastic for debugging race conditions because they are simple, and there is very little to get in your way of understanding and debugging them).

          I don’t understand this point. Rust can behave just like Erlang and Elixir (in a single-server use case, which is what I was talking about) via message-passing primitives. Do you have any sources for Rust’s complexity being an open question in this case? I am unaware of arguments that Rust’s affine type system is cause for concern in this situation; in fact, it is usually the opposite.
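
          As a hedged illustration of that claim, using nothing but the standard library’s channels (a sketch of the style, not an OTP equivalent): a thread that owns its state and is reachable only through messages behaves much like an Erlang process:

          ```rust
          use std::sync::mpsc;
          use std::thread;

          enum Msg {
              Add(i64),
              Get(mpsc::Sender<i64>),
              Stop,
          }

          fn main() {
              let (tx, rx) = mpsc::channel();
              // The "process": owns its state, reachable only through messages.
              let actor = thread::spawn(move || {
                  let mut state = 0;
                  for msg in rx {
                      match msg {
                          Msg::Add(n) => state += n,
                          Msg::Get(reply) => reply.send(state).unwrap(),
                          Msg::Stop => break,
                      }
                  }
              });

              tx.send(Msg::Add(40)).unwrap();
              tx.send(Msg::Add(2)).unwrap();
              let (reply_tx, reply_rx) = mpsc::channel();
              tx.send(Msg::Get(reply_tx)).unwrap();
              assert_eq!(reply_rx.recv().unwrap(), 42);
              tx.send(Msg::Stop).unwrap();
              actor.join().unwrap();
          }
          ```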

          1. 2

            “most race conditions are caused by data races”

            What definition of “most” are you using here?

            Many people writing distributed systems are using copy or copy-on-write semantics and will never encounter a data race.

            Do I have any sources? Yes. I debug distributed systems, I know what tools I use, and ninjaing them into and out of Rust is not going to be ergonomic.

            1. 5

              Just some quick feedback/level-setting: I feel like this conversation is far more hostile and debate-like than I am interested in/was hoping for. You seem to have very strong opinions, and specifically anti-Rust opinions, so let’s just say I said Ada + SPARK (or whatever language with an affine type system you don’t have a grudge against).

              The point I was making is that an affine type system can prevent data races at compile time, which are common in multi-threaded code. OTP avoids data races by using message passing, but this is not a proper fit for all problems. So I think an extremely powerful solution would be an affine-type-powered system for code on the server (no data races) with an OTP layer for server-to-server communication (distributed system). This potentially gets the best of both worlds: flexibility to have shared memory on the server, with OTP robustness in the large-scale system.

              I think this is a cool idea and concept, and you may disagree. That is fine, but let’s keep things civil and avoid attacking random things (especially points that I am not making!).

              1. 2

                Not the parent:

                In the context of a message-passing system, I do not think affine/linear types hurt you very much, but a tracing GC does help you, since you can share immutable references without worrying about who has to free them. Linear languages can do this with reference-counted objects (maintaining referential transparency, because the objects have to be immutable, so there are no semantics issues), but reference counting is slow.
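
                A small Rust illustration of that trade-off (Arc standing in for the linear-language case): sharing the immutable object is semantically safe, but every hand-off pays an atomic ref-count update that a tracing GC would not:

                ```rust
                use std::sync::Arc;
                use std::thread;

                fn main() {
                    // Immutable, reference-counted data: safe to share, but
                    // each Arc::clone/drop is an atomic ref-count update.
                    let config: Arc<Vec<String>> =
                        Arc::new(vec!["a".into(), "b".into()]);

                    let handles: Vec<_> = (0..4)
                        .map(|_| {
                            let config = Arc::clone(&config); // ref-count bump
                            thread::spawn(move || config.len()) // read-only
                        })
                        .collect();

                    for h in handles {
                        assert_eq!(h.join().unwrap(), 2);
                    }
                } // last owner drops here, freeing the Vec
                ```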

                Since the context is distributed systems, the network is already going to be unreliable, so the latency hit from the GC is not a liability.

                1. 1

                  Interesting point, although I don’t know if I necessarily agree. I think affine/linear types and GC are actually orthogonal to each other; I imagine it’s possible for a language to have both (although I am unaware of any that exist!). I don’t fully understand the idea that affine/linear types would hurt you in a multi-threaded context, as I have found them to be just the opposite.

                  I think you are right that reference-counted immutable objects will be slightly slower than tracing GC, but I imagine the overhead will quickly be made up for. And you’re right: since it’s a distributed system, the actual performance of each individual component is less important, and I think a language like Rust is mainly useful in this context in terms of correctness.

                2. 1

                  Can you give an example of a problem where message passing is not well suited? My personal experience has been that systems either move toward a message passing architecture or become unwieldy to maintain, but I readily admit that I work in a peculiar domain (fintech).

                  1. 2

                    I have one, although only halfway. I work on a system that does relatively high-bandwidth/low-latency live image processing on a semi-embedded system (NVIDIA Xavier). We’re talking, say, 500 MB/s of throughput. An image comes in from the camera and gets distributed to multiple systems that process it in parallel, and the output from those either goes down the chain for further processing or to persistence. What we settled on was message passing, but heap allocation for the actual image buffers. The metadata structs get copied into the mailbox queues for each processor, but each just has a std::shared_ptr to the actual buffer (ref-counted and auto-freed).
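
                    For illustration, a rough Rust sketch of the same design (types and sizes invented): the metadata is cloned into each worker’s queue while the pixel buffer is shared behind an Arc, the moral equivalent of the std::shared_ptr approach:

                    ```rust
                    use std::sync::{mpsc, Arc};
                    use std::thread;

                    #[derive(Clone)]
                    struct Frame {
                        timestamp_us: u64,
                        // Shared buffer: cloning copies a handle, not the bytes.
                        pixels: Arc<Vec<u8>>,
                    }

                    fn main() {
                        let (tx_a, rx_a) = mpsc::channel::<Frame>();
                        let (tx_b, rx_b) = mpsc::channel::<Frame>();

                        let worker_a = thread::spawn(move || {
                            for f in rx_a {
                                let _ = f.pixels.len(); // e.g. run detection
                            }
                        });
                        let worker_b = thread::spawn(move || {
                            for f in rx_b {
                                let _ = (f.timestamp_us, f.pixels.len()); // e.g. persist
                            }
                        });

                        // "Camera": one big allocation per frame, fanned out.
                        for i in 0..3 {
                            let frame = Frame {
                                timestamp_us: i,
                                pixels: Arc::new(vec![0u8; 1920 * 1080]),
                            };
                            tx_a.send(frame.clone()).unwrap(); // metadata copy + ref-count bump
                            tx_b.send(frame).unwrap();
                        }
                        drop(tx_a);
                        drop(tx_b);
                        worker_a.join().unwrap();
                        worker_b.join().unwrap();
                    }
                    ```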

                    In Erlang/Elixir, there’s no real shared heap. If we wanted to build a similar system there, the images would be getting copied into each process’s heap and our memory-bandwidth usage would go way, way up. I thought about it because I absolutely love Elixir, but ended up duplicating “bare minimum OTP” in C++ for the performance.

                    1. 2

                      Binaries over 64 bytes in size are allocated on a shared, VM-wide heap, and only a reference to them is copied around: https://medium.com/@mentels/a-short-guide-to-refc-binaries-f13f9029f6e2

                      1. 2

                        Hey, that’s really cool! I had no idea those were a thing! Thanks!

                      2. 1

                        You could have created a reference, stashed the binary once in an ETS table, and passed the reference around.

                      3. 1

                        It is a little tricky because message passing and shared memory can simulate each other, so there isn’t a situation where only one can be used. However, from my understanding shared memory is in general faster and has lower overhead, which in certain situations is desirable (although there was a recent article about shared memory actually being slower due to cache effects: every update forces the other CPUs to refresh their L1 caches).

                        One instance that I have had recently was a parallel computation context where shared memory was used for caching the output. Since the individual jobs were long-lived, there was a low chance of contention, and the shared cache was used for memoization. This could have been done using message passing, but shared memory was much simpler to implement.
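
                        A minimal sketch of that pattern (the workload is invented): long-lived jobs share one memoization table behind a single lock, which stays cheap as long as contention is low:

                        ```rust
                        use std::collections::HashMap;
                        use std::sync::{Arc, Mutex};
                        use std::thread;

                        // Some expensive, pure computation worth memoizing.
                        fn expensive(n: u64) -> u64 {
                            (0..n).sum()
                        }

                        fn main() {
                            let cache: Arc<Mutex<HashMap<u64, u64>>> =
                                Arc::new(Mutex::new(HashMap::new()));

                            let handles: Vec<_> = [10_u64, 20, 10, 20]
                                .into_iter()
                                .map(|job| {
                                    let cache = Arc::clone(&cache);
                                    thread::spawn(move || {
                                        // Check the shared cache; compute and
                                        // publish on a miss.
                                        if let Some(v) = cache.lock().unwrap().get(&job) {
                                            return *v;
                                        }
                                        let v = expensive(job); // lock not held here
                                        cache.lock().unwrap().insert(job, v);
                                        v
                                    })
                                })
                                .collect();

                            for h in handles {
                                h.join().unwrap();
                            }
                        }
                        ```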

                        I agree in general that message passing should be preferred (especially in languages without affine types). Shared memory is more of a niche solution (although unfortunately more widely used in my experience, since not everyone is on the message passing boat).

              2. 4

                I think a good explanation is that K8s allows you to take concepts and languages you’re already familiar with and build a distributed system out of that, while Erlang is distributed programming built from first principles. While I would argue that the latter is superior in many ways (although I’m heavily biased; I really like Erlang), I also see that “forget Python and have your engineering staff learn this Swedish programming language from the ’80s” is a hard sell.

                1. 2

                  You’re right, and the ideas behind K8s I think make sense. I mainly take issue with the sheer complexity of it all. Erlang/OTP has done it right by making building distributed systems extremely accessible (barring learning Erlang or Elixir), while K8s has so much complexity and bloat it makes the problems seem much more complicated than I think they are.

                  I always think of the WhatsApp situation, where it was something like 35 (?) engineers with millions of users. K8s is nowhere close to replicating this per-engineer efficiency; you basically need 10 engineers just to run and configure K8s!

              3. 4

                One thing that Erlang gets right that other people miss is hot reloading. A self-healing distributed system has to be able to hot-reload new fixes.

                That’s my biggest frustration with the new BEAM compilers in Rust and so on: they choose not to implement hot reloading; it’s often in the list of non-goals.

                In a different video, Joe says to do the hard things first. If you can’t do the hard things, then the project will fail, just at a later point. The hard thing is isolated process hot reloading: getting BEAM compiled in a single binary is not.

                1. 2

                  Hot reloading is one of those features that I have never actually worked with (at least, not how Erlang does it!), so possibly for that reason alone I don’t see the absence of the feature as a major downside of the new BEAM compilers. I wonder if the lack of development in that area is just because it is a rare feature to have, and while it seems like a nice-to-have, it isn’t a paradigm shift in most people’s minds (mine included!).

                  The benefits of it do seem quite nice though, and there was some other lobste.rs member who had written a comment about their Erlang system which could deploy updates in < 5min due to the hot reloading, and it was as if nothing changed at all (no systems needed to restart). This certainly seems incredible, but it is hard to fully understand the impact without having worked in a situation like this.

                2. 3

                  This talk is nice to watch and listen to. It obviously inspires and convinces people.

                  Sigh

                  Therefore, it makes me sad that it contains bits like the following as a basis for why you need concurrent programming:

                  “Concurrency because of real world” argument

                  I don’t want to understand this other world. That other world is a very strange world [meaning sequential programming]. The real world is parallel. […]

                  At about https://youtu.be/cNICGEwmXLU?t=224

                  If you wanted to simulate a group of humans, maybe. But for an ordinary business app? If you take that as a basis, why don’t you also say: “Well, humans also run on bio-chemical processes, why don’t we use that for computing? Let’s replicate brains!”

                  Programming is ultimately about idealized models which are not reality. One of the most important things about programming is to model:

                  1. something useful
                  2. in a way that humans can understand and modify.

                  Concurrency is incredibly hard for humans to understand.

                  Highly Available Data is Hard

                  Further down, Joe basically says that highly available data is really, really hard and you should normally not code something like a consensus protocol on your own. I totally agree. Joe dives into the topic of highly available (HA) data quite a bit, just to drive this point home. I wish he would spend more time on things that would actually help people write better software, but hey. At least we agree.

                  The Architecture Erlang Should Be Benchmarked Against

                  But if you combine the facts that

                  1. Concurrency is hard for humans to reason about and
                  2. You normally shouldn’t do HA data yourself anyway,

                  I arrive at a quite different architecture, at least for typical business software:

                  1. Use lots of stateless processes written in pretty much any programming language.
                  2. Use a highly available data layer for most of the concurrency.
                  3. Limit your own code that deals with real concurrency to the bare minimum.

                  If you say Erlang isn’t made for these kinds of apps, that’s OK. I’d like to see that clearly spelled out. If someone can tell me why Erlang is better than that, or what is better than this for the common case, I am seriously interested.

                  Unreliable message passing

                  In my mind, sequential programming combined with a highly capable and available database is going to be so much simpler than having a programming model with unreliable message sending. The few projects I have been in that used actors tended to just implicitly assume that the messages would indeed arrive, because they mostly did. Not saying it is impossible to write this in a better way, but intelligent people everywhere will normally fail to do so.

                  If you want to deal with unreliable message passing, you have a few options:

                  1. Implement resending messages and waiting for acks in every actor (see the sketch after this list). Maybe use a library for that; if so, it makes you wonder why this is not part of the standard stack.
                  2. For 1, if it is any effort at all and people take correctness seriously, they will try to restrict the number of actors to simplify reasoning. Because, guess what, reasoning about a sequential system is tons easier for humans. But then you lose the parallelizability, and your failures are isolated but in larger chunks.
                  3. Adopt a reconciling-state programming model where you basically sync your state periodically, papering over any lost messages. The result is that most messages are sent unnecessarily, and efficiency errors are hidden because they are silently repaired.
                  4. A magic solution nobody ever talks about that I would really like to finally hear about.
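
                  For option 1, the per-actor machinery would look roughly like this sketch (message types and timeouts invented for illustration): tag each message with an id, wait for the matching ack, and resend on timeout:

                  ```rust
                  use std::sync::mpsc::{Receiver, Sender};
                  use std::time::Duration;

                  struct Envelope {
                      id: u64,
                      payload: String,
                  }

                  // Send-with-retry: resend until the matching ack arrives
                  // or we give up.
                  fn send_reliably(
                      out: &Sender<Envelope>,
                      acks: &Receiver<u64>,
                      id: u64,
                      payload: &str,
                      max_retries: u32,
                  ) -> bool {
                      for _ in 0..=max_retries {
                          out.send(Envelope { id, payload: payload.to_string() }).ok();
                          match acks.recv_timeout(Duration::from_millis(100)) {
                              Ok(acked) if acked == id => return true, // delivered
                              _ => continue, // lost or wrong ack: resend
                          }
                      }
                      false // receiver presumed down; caller must escalate
                  }
                  ```
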
                  Let it crash

                  Then let’s go over “let it crash”. I like the idea in principle, but he doesn’t even touch the problem of crash loops, etc. If you restart things, they do not magically start working.

                  Somehow, the talks always stop at that point and don’t go into details.

                  If restarting things helps, that also means that things are not reproducible, which usually means you’ll have a bad debugging and testing experience. If you chunked your app into small pieces/actors, then at least you have very restricted state to reason about, which is nice. But that is probably hardly useful without considering the state of the other actors.

                  1. 1

                    There’s a lot to unpack here. I think the most important thing for me to say is that the actor model is not only about concurrency or scalability; it can (and does) help with fault tolerance and the maintainability of code.

                    Programming is ultimately about idealized models which are not reality. One of the most important things about programming is…

                    I thought programming was primarily about getting computers to perform the desired calculations. Everything above coding directly in machine code exists to make programming easier for humans, but those layers are a means to the end, not the end itself.

                    Concurrency is incredibly hard for humans to understand.

                    Agreed, but it’s a programming luxury to get to pretend that concurrency doesn’t exist. We’ve approached a point of diminishing returns on single-core clock speed, and it’s more cost effective to work with multiple cores, multiple CPUs, multiple computers/servers, multiple datacenters, etc.

                    What’s nice about the actor model is that it lets you keep thinking about single-threaded execution a lot of the time; instead of thinking about mutexes, semaphores and critical sections, in Erlang, you’re dealing mainly with isolated process state and with the transmission of immutable data between processes.

                    1. Use lots of stateless processes written in pretty much any programming language.
                    2. Use a highly available data layer for most of the concurrency.
                    3. Limit your own code that deals with real concurrency to the bare minimum.

                    This model seems to take for granted that an HTTP server, a relational DBMS, or an operating system is written to execute code concurrently. While you correctly point out that some people might say “Erlang is not for these kinds of apps”, others might say “Erlang is for making the substrate on top of which these apps run”.

                    With respect to point #1, your state has to live somewhere. Of course, that state can exist in a HA database, but there’s a cost associated with such an architecture that might not be appropriate in all situations. Joe talks a lot about the merits of isolated state in the linked talk, which is a very powerful idea that is reflected in languages like Rust as well as in Erlang.

                    Point #2 is widely practiced among Erlang programmers. It’s perfectly possible for an Erlang application to speak to a traditional RDBMS, and there are also data stores written in Erlang, such as mnesia, Riak, or CouchDB.

                    Point #3 is also widely practiced among Erlang programmers. Many of the people writing web apps that run on the BEAM are not always directly dealing with the task of concurrently handling requests or connections; they’re writing callback functions to perform the appropriate business logic, and then these callbacks are executed concurrently by a library such as cowboy.

                    sequential programming combined with a highly capable and available database is going to be so much simpler than having a programming model with unreliable message sending

                    Again, there’s no arguing which of these is simpler, but the simplicity you’re talking about is a luxury. What happens when your single-threaded program has more work to do than a single CPU core can handle? What happens when your single-threaded program has to communicate over an unreliable network with a program running on another computer?

                    Then let’s go over “let it crash”. I like the idea in principle, but he doesn’t even touch the problem of crash loops, etc. If you restart things, they do not magically start working.

                    A shockingly large amount of the time, however, restarting things does work. Transient failures are all over the place. What’s nice about “let it crash” is that it can help eliminate a huge amount of defensive programming to handle these transient failures. Unable to connect to a server? The result of some request is not what you’re expecting? Instead of handling an exception and retrying some number of times, Erlang programmers have the choice to let the process crash, and delegate the responsibility of trying again to a supervisor process. This helps keep business logic code clean of fault-handling code.
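
                    To make that concrete, here is a hedged, single-machine sketch of the supervision idea (plain Rust threads, not OTP itself): the worker holds only business logic and simply panics on a transient failure, while the supervisor loop owns the retry policy:

                    ```rust
                    use std::thread;
                    use std::time::Duration;

                    // The worker contains only business logic; a transient
                    // failure simply panics.
                    fn worker() {
                        // e.g. connect_to_server().expect("connect failed");
                        println!("doing work");
                    }

                    fn main() {
                        let max_restarts = 5;
                        for attempt in 0..max_restarts {
                            let handle = thread::spawn(worker);
                            match handle.join() {
                                Ok(_) => break, // finished normally
                                Err(_) => {
                                    // "Let it crash": the supervisor, not the
                                    // worker, decides to wait and try again.
                                    eprintln!("worker crashed (attempt {attempt}), restarting");
                                    thread::sleep(Duration::from_millis(100));
                                }
                            }
                        }
                    }
                    ```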

                    If restarting things helps, that also means that things are not reproducible, which usually means you’ll have a bad debugging and testing experience

                    Over long periods of execution, you’re inevitably going to encounter situations that are not reproducible. There’s a point of diminishing returns when it comes to testing, and it’s usually a long way away from simulating the kinds of conditions your app might experience over months or years of uptime.

                    If you chunked your app into small pieces/actors, then at least you have very restricted state to reason about, which is nice. But that is probably hardly useful without considering the state of the other actors.

                    My experience using Erlang daily for the last 5 years is that I might have to consider the state of 2-3 actors when debugging an Erlang application, but I have almost never had to consider processes outside of the application I’m debugging, let alone the state of all the running actors. For the most part, decomposing applications into communicating actors helps greatly with writing and debugging concurrent code, rather than hindering it.

                    TL;DR: Yes, concurrent programming is hard, but it’s a luxury to pretend that concurrency doesn’t exist. When it comes to writing concurrent code, I find that Erlang’s actor model helps more than it hinders. And Erlang’s actor model helps with more than just concurrency: it helps with fault tolerance and with code maintenance, perhaps to an even greater degree than it helps with writing concurrent code.