1. 87
  1. 26

    Queueing theory is so nice for understanding almost everything in our machines. Most of my advice in conversations with async library authors boils down to “use unbuffered channels and enforce a global maximum number of in-flight requests; when that number is reached, your TCP acceptor pauses, your TCP backlog fills up, and your load balancer can do its job: send traffic elsewhere or provide feedback to a scaling system”.

    One reason I’m very much in favor of just using fixed-size threadpools for Rust services, instead of any use of async, is that it makes that maximum number of requests simply the number of threads you picked, with no need for complex backpressure mechanisms. I feel like async is really only appropriate for load balancers, as far as its use in Rust goes. When you’re running real workloads that do stuff on the CPU, context-switching costs disappear into the noise, even post-Meltdown/Spectre mitigations that exacerbate their costs (but we’re still better off than in the c10k days).
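
    As a rough sketch of the shape being described (my own illustration with made-up details, not the commenter’s code): a fixed-size pool of worker threads fed by a rendezvous (capacity-0) channel, so that once every worker is busy the accept loop simply stops accepting, the kernel’s TCP backlog fills, and the load balancer gets its signal.

    ```rust
    use std::io::Write;
    use std::net::{TcpListener, TcpStream};
    use std::sync::{mpsc, Arc, Mutex};
    use std::thread;

    const WORKERS: usize = 8; // the global maximum number of in-flight requests

    fn main() -> std::io::Result<()> {
        // Rendezvous channel: a send blocks until a worker is ready to receive,
        // so the accept loop below pauses whenever all workers are busy.
        let (tx, rx) = mpsc::sync_channel::<TcpStream>(0);
        let rx = Arc::new(Mutex::new(rx));

        for _ in 0..WORKERS {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                let conn = rx.lock().unwrap().recv();
                let Ok(mut conn) = conn else { return };
                // Handle the request synchronously; the real workload goes here.
                let _ = conn.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok");
            });
        }

        let listener = TcpListener::bind("127.0.0.1:8080")?;
        for conn in listener.incoming() {
            // When every worker is occupied this send blocks, accept() stops being
            // called, the TCP backlog fills, and the load balancer sees the pushback.
            tx.send(conn?).expect("worker threads exited");
        }
        Ok(())
    }
    ```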

    1. 14

      This is great general advice, not only in Rust. Async should be understood as a niche solution to specific problems, not a default, general-purpose approach.

      More fundamental truths about queueing, which I find myself returning to over and over: https://apenwarr.ca/log/20170814.

      1. 1

        Thank you for the link; the article is quite dense, but the guidelines given there are very helpful!

      2. 1

        Thank you for this explanation. I started working on a TCP client and I couldn’t figure out why (or if) I should use async. I felt like I should, because I have experience mostly in Node.js, but it wasn’t making sense (at least right now) with Rust.

      3. 4

        I have suggested adding the tags “python” and “networking”.

        As a side issue, backpressure is one of the things I think makes async/await more attractive than explicit-callback async I/O: when reviewing code, you just look for places where the word “await” is missing on some output. I wish I had an eslint rule for this, though. :/

        I’m curious how often it turns out you want something like that explicit 503 in Python code, because I would normally expect to be hiding the Python application server behind an instance of Apache (or nginx) and an instance of HAProxy. Either or both of those could implement concurrency limiting; e.g., you can tell HAProxy to only send a fixed maximum number of connections to a given backend.
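
        If you do want the limit inside the application itself, the mechanism can be as small as a counter of in-flight requests that sheds load with a 503 once a cap is hit; it is the same thing the per-backend connection limit in HAProxy does one hop earlier. A minimal sketch of that idea (mine, with a made-up cap and a toy response type):

        ```rust
        use std::sync::atomic::{AtomicUsize, Ordering};

        const MAX_IN_FLIGHT: usize = 64; // hypothetical cap, tuned per service
        static IN_FLIGHT: AtomicUsize = AtomicUsize::new(0);

        /// Wrap a request handler with a hard concurrency limit: either we get a slot
        /// and run the handler, or we answer 503 immediately so the caller (or the
        /// load balancer in front of us) can back off or go elsewhere.
        fn limited<F: FnOnce() -> (u16, String)>(handler: F) -> (u16, String) {
            // fetch_add returns the previous value; if we were already at the cap,
            // undo the increment and shed the request instead of queueing it.
            if IN_FLIGHT.fetch_add(1, Ordering::SeqCst) >= MAX_IN_FLIGHT {
                IN_FLIGHT.fetch_sub(1, Ordering::SeqCst);
                return (503, "try again later".to_string());
            }
            let response = handler();
            // (a production version would also restore the count if `handler` panicked)
            IN_FLIGHT.fetch_sub(1, Ordering::SeqCst);
            response
        }

        fn main() {
            let (status, body) = limited(|| (200, "hello".to_string()));
            println!("{status} {body}");
        }
        ```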

        1. 5

          In most async systems … you end up in a world where you chain a bunch of async functions together with no regard of back pressure.

          Yup. Back pressure doesn’t compose in the world of callbacks / async. It does compose, if designed well, in the coroutine world (see: Erlang).

          async/await is great but it encourages writing stuff that will behave catastrophically when overloaded.

          Yup. It’s very hard, and in larger systems impossible, to do back pressure right with the callbacks / async programming model.

          This is how I assess software projects I look at. How fast the database is is one thing. What does it do when I send it 2 GiB of requests without reading the responses? What happens when I open a bazillion connections to it? Will previously established connections have priority over handling new connections?

          1. 6

            Even in the Erlang world, there are technical risks at every point where you reintroduce asynchrony. See https://ferd.ca/handling-overload.html for a long post on all the ways you can find to handle overload in the Erlang ecosystem, from the simplest one to more complex systematic approaches.

            1. 2

              In Monte, where I/O is managed by callbacks, my “streamcaps” stream library has perfect backpressure. Each callback generates a promise, and each promise is waited upon before more data is generated.

              The main downside to perfect backpressure is that flow is slow and managed. Many APIs have “firehose” or “streaming” configurations that can send multiple packets of data as fast as possible, but such an API is not possible when backpressure is perfectly communicated.

            2. 3

              Sounds important, but the article is too long and complex for me to understand :/ Also, at some point the author mentions that in the Python example, the missing drain in write seems an oversight; so, if the bug was fixed, would the article still have any reason to be written? Is it basically an article about a bug in a Python library’s API? I’m confused.

              1. 7

                The article mentions an introductory article on backpressure, and the Wikipedia article is not bad either.

                The main concept is that, in a data-processing pipeline, we do not want data to pool up in buffers; we want data to flow. It turns out that a key ingredient of good flow is letting each component of the pipeline know how fast data is arriving and how fast it is being read away downstream. Without backpressure, data can linger arbitrarily long in memory, creating effects like bufferbloat.
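
                A toy illustration of that pooling failure mode, with made-up rates: if requests arrive faster than the next stage reads them away and nothing pushes back, both the backlog and the latency of the newest item grow without bound.

                ```rust
                fn main() {
                    // Made-up rates, purely illustrative.
                    let arrival_per_s = 1000u64; // how fast data is coming in
                    let drain_per_s = 800u64;    // how fast the next stage reads it away
                    let mut queued = 0u64;
                    for second in 1..=10 {
                        queued += arrival_per_s - drain_per_s; // 200 extra items pool up each second
                        let extra_latency_ms = queued * 1000 / drain_per_s;
                        println!("after {second:2}s: {queued:4} buffered, newest item waits ~{extra_latency_ms} ms");
                    }
                    // With backpressure, the producer would instead have been slowed to 800/s
                    // and both numbers would stay bounded by the (small) buffer we chose.
                }
                ```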

                1. 2

                  Hmh, so I see I wasn’t precise enough: in fact, I kinda mostly understood the general idea of backpressure as it was explained in the introductory part of the original article; but I got lost immediately afterwards once the article started trying to explain how it applies to async and how it leads to an issue in Python. Sorry for not being clear about that; is there a chance you could help me understand this part as well? Do some of the links you provided above already explain it in terms of async? Thanks!!

                  1. 2

                    Part of the issue is not using await, which causes async actions to be effectively invisible since they are not waited for. async/await sugar is not magical, though, and it’s the same as any future-based or promise-based system.

                    The other part of the issue is not being able to explicitly manage backpressure. The author uses an ad-hoc token bucket to quantify how much backpressure is currently manifest.
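
                    I don’t know exactly what the author’s bucket looks like, but the general shape of explicitly managed backpressure is a counting limiter: take a token before starting a unit of work, give it back when it finishes, and block (rather than queue) when none are left. A minimal threaded sketch, not the article’s code:

                    ```rust
                    use std::sync::{Arc, Condvar, Mutex};
                    use std::thread;
                    use std::time::Duration;

                    /// A counting limiter: callers take a token before starting work and return
                    /// it when done. The number of tokens outstanding is the current pressure.
                    struct Limiter {
                        available: Mutex<usize>,
                        freed: Condvar,
                    }

                    impl Limiter {
                        fn new(tokens: usize) -> Arc<Self> {
                            Arc::new(Self { available: Mutex::new(tokens), freed: Condvar::new() })
                        }

                        /// Blocks until a token is free; the blocking itself is the backpressure
                        /// signal felt by whoever is trying to produce more work.
                        fn acquire(&self) {
                            let mut available = self.available.lock().unwrap();
                            while *available == 0 {
                                available = self.freed.wait(available).unwrap();
                            }
                            *available -= 1;
                        }

                        fn release(&self) {
                            *self.available.lock().unwrap() += 1;
                            self.freed.notify_one();
                        }
                    }

                    fn main() {
                        let limiter = Limiter::new(2);
                        let mut handles = Vec::new();
                        for i in 0..6 {
                            let limiter = Arc::clone(&limiter);
                            handles.push(thread::spawn(move || {
                                limiter.acquire();
                                println!("request {i} admitted"); // at most 2 are ever in flight
                                thread::sleep(Duration::from_millis(50));
                                limiter.release();
                            }));
                        }
                        for handle in handles {
                            handle.join().unwrap();
                        }
                    }
                    ```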

              2. 3

                I wonder if Rust projects like tokio (the new version) or async-std have already made mistakes in this area.

                1. 4

                  In Go, channel communication is synchronous by default: if a send operation can’t proceed because the receiving side is busy, the sending goroutine goes to sleep.

                  This gives you backpressure when working with Kafka or similar systems. I believe you can model a similar idea in Python, though I’ve only done it with threads and the Queue class.

                  I think for other use cases one of the first things to do is to always accept a timeout for requests. Things may get overloaded but at least you won’t end up with tons of zombie requests.

                  Also, all the rage right now are service meshes like Envoy, which can actually implement some of this at the network layer: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/circuit_breaking#arch-overview-circuit-break
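
                  For what it’s worth, both points above (the blocking send and the request timeout) have close analogues in Rust’s standard library: std::sync::mpsc::sync_channel(0) is a rendezvous channel whose send blocks until the receiver is ready (a nonzero capacity gives buffered, Go-style channel behavior instead), and recv_timeout puts a deadline on the consumer. A small sketch of my own, not tied to tokio or async-std:

                  ```rust
                  use std::sync::mpsc;
                  use std::thread;
                  use std::time::Duration;

                  fn main() {
                      // Capacity 0 = rendezvous: send() blocks until the consumer calls recv(),
                      // the same blocking-send backpressure described above for Go.
                      let (tx, rx) = mpsc::sync_channel::<u32>(0);

                      let producer = thread::spawn(move || {
                          for msg in 0..5 {
                              // If the consumer falls behind, this send sleeps instead of
                              // letting messages pile up in memory.
                              tx.send(msg).unwrap();
                              println!("sent {msg}");
                          }
                      });

                      // Consume with a deadline so an overloaded or stuck producer can't
                      // leave us hanging as a zombie forever.
                      loop {
                          match rx.recv_timeout(Duration::from_millis(500)) {
                              Ok(msg) => {
                                  thread::sleep(Duration::from_millis(100)); // simulate slow work
                                  println!("handled {msg}");
                              }
                              Err(mpsc::RecvTimeoutError::Disconnected) => break, // producer finished
                              Err(mpsc::RecvTimeoutError::Timeout) => break,      // give up on it
                          }
                      }
                      producer.join().unwrap();
                  }
                  ```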

                  1. 1

                    Synchronous/blocking communication is a poor model. Good backpressure requires a queue with a maximum fixed size.

                    Even better if the queue allows more complex, pluggable scheduling methods.

                    1. 7

                      Good backpressure requires a queue with a maximum fixed size.

                      Most often, that maximum fixed size should be 1.

                      Queues exist only to handle burstiness. If your system doesn’t need to handle burstiness, communication should be synchronous.

                      Even if you do need to handle burstiness, generally, queues should be provided at the edges of the system, not within it.

                      1. 2

                        Even if you do need to handle burstiness, generally, queues should be provided at the edges of the system, not within it.

                        That’s the purpose of backpressure: to push the queuing to the right location.

                        Often the edge is in a different place depending on where the caller is. The same component could be at the edge or not, depending on the caller or on network conditions. The same queue could be allowed to fill up when needed and shrink back down.

                        Especially during failovers and in very dynamic networks.

                        1. 3

                          Valuing my own ability to model and predict system behavior highly, I would not want to be in charge of a system that contained dynamic queues. But I suspect we work in different domains.

                      2. 4

                        Go also has buffered channels.

                    2. 2

                      Akka Streams has superb support for handling and creating many flavors of back pressure, and very nice abstractions for data streams. I feel like the work I’ve done using Akka Streams is probably the most robust code I’ve ever written.

                      1. 1

                        Niggle: the apostrophe really doesn’t fit the italic headers. Maybe they only provide italic versions of some punctuation (and they use one of the Unicode marks, not the single straight quote).

                        1. 1

                          It is strange for me, because I only knew the challenges of threads from my schoolwork, where they seemed brittle. Then these async etc. frameworks came up and folks started to latch onto their ease of use. Then, to me and the author, and I’m sure some others, the async frameworks papered over what would become the main failure mode: not that they were too hard to write, but that they were too hard to put right once things started going wrong. In a multiply connected system without fixed queue sizes (and/or backpressure), it’s hard to know exactly where something is failing, or whether a network break at point A won’t cause a cascading failure a few systems down the line that have never seen 10-100x the request rate. Everything becomes “try and see”, instead of “do a small calculation and see”.