1. 13
  1. 7

    This article talks at length about the tensions between tokio and mixed workloads that it cannot even theoretically serve well, and goes into a bunch of workarounds that needed to be put in place to mask these tensions, instead of just avoiding the tensions to begin with.

    When a request’s dependency chain contains a mixture of components optimized for low latency (with short buffers in front) and components optimized for high throughput (with large buffers in front), you get the worst-of-all-worlds high-level system behavior.

    It’s like taking a school bus of kids to McDonalds, going through the drive-through, and looping around once for each kid on the bus. Each loop is latency-optimized: the kid whose turn it is to order receives their meal at low latency after the moment they get to order. But their sojourn time, the time spent waiting around doing nothing before being served, explodes. By taking the whole bus’s orders at once, the overall throughput is far higher, and because we have to accomplish a whole bus’s worth of orders anyway, there’s no point optimizing for latency below the whole-bus threshold.

    The idea has strong descriptive power for so many things in life, especially in software. Most people would probably not have a social media web site server kick off an Apache MapReduce job for each GET to the timeline, even if the MapReduce job only looked at a minuscule amount of data, because it is a similar (more exaggerated, but still the same idea) mixing of low-latency components with high-throughput components. Mixing of queue depths in a request chain degrades both latency and throughput.
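    To put rough, made-up numbers on it: 30 kids, 5 minutes per drive-through loop, versus maybe 20 minutes to take and fill the whole bus’s order at once. Looping per kid, the k-th kid’s sojourn time is about 5·k minutes, so the average is around 75 minutes and the last kid waits 150, even though each individual loop still looks “low latency” (5 minutes from ordering to food). Batching serves everyone in about 20 minutes, so both the average sojourn time and the total throughput win.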

    Sure, it’s software, you can mix queue depths, and in this case it will probably actually give you a social advantage by signalling to a wider group of subcommunities that you are invested in the products of their social software activities, but you are leaving a lot on the table from an actual performance perspective. This is pretty important queueing theory stuff for people who want to achieve competitive latency or throughput. I strongly recommend (and give to basically everyone I work with on performance stuff) chapter 2, Methodology, from Brendan Gregg’s book Systems Performance: Enterprise and the Cloud, which goes into the USE method for reasoning about these properties at a high level.

    Drilling more into the important properties of a scheduler: parallelism (what you need for scaling CPU-bound tasks) is, from a scheduling perspective, the OPPOSITE of concurrency (what you need for blocking on dependencies). I love this video that illustrates this point: https://www.youtube.com/watch?v=tF-Nz4aRWAM&t=498s. By spending cycles on concurrent dependency management, you really cut into the basic compute resources available for accomplishing low-interactivity analytical and general CPU-bound work. The mind-exploder of this perspective is that a single-threaded execution is firmly in-between parallelism and concurrency on the programmer-freedom vs scheduler-freedom spectrum. The big con is that people like Rob Pike have managed to convince people (while selling compute resources) that concurrency is somehow an admirable path towards effective parallelism.
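    To make the contrast concrete, here’s a tiny sketch of the two scheduling styles (my own illustration, assuming the rayon and tokio crates; the numbers are arbitrary):

    ```rust
    // Parallelism: carve one CPU-bound job across cores; the scheduler's goal
    // is to keep every core busy on compute.
    use rayon::prelude::*;

    fn sum_of_squares(v: &[u64]) -> u64 {
        v.par_iter().map(|x| x * x).sum()
    }

    // Concurrency: interleave tasks that spend most of their time waiting; the
    // scheduler's goal is to multiplex many of them onto few threads.
    // (tokio::time::sleep stands in for real I/O waits.)
    async fn gather() -> u64 {
        let a = async {
            tokio::time::sleep(std::time::Duration::from_millis(50)).await;
            1u64
        };
        let b = async {
            tokio::time::sleep(std::time::Duration::from_millis(50)).await;
            2u64
        };
        let (x, y) = tokio::join!(a, b); // the two waits overlap: ~50ms, not ~100ms
        x + y
    }
    ```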

    Sure, you can essentially build a sub-scheduler that runs within your async scheduler that runs on top of your OS’s scheduler that runs on top of your cluster’s scheduler that runs on your business’s capex resources, etc… But it’s pretty clear that, at a high level, you can push your available resources farther by cutting out the dueling sub-schedulers for workloads where their tensions drive down the utilization of resources that you’re paying for anyway.

    1. 3

      Basically, they want one process to deal with both latency-optimized requests and throughput-optimized requests. I don’t know if their use case is valid, but basically they try to separate them into different systems, something you actually advocate for.

      It is not unlike having a UI main thread and offloading long-running ops to background tasks which may, e.g., report progress back to the UI.

      That is entirely reasonable, in my opinion.
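      Concretely, the shape I mean (plain std sketch, names mine):

      ```rust
      use std::sync::mpsc;
      use std::thread;
      use std::time::Duration;

      fn main() {
          let (progress_tx, progress_rx) = mpsc::channel();

          // "Background task": long-running, throughput-oriented work that
          // periodically reports progress back.
          thread::spawn(move || {
              for pct in (0..=100).step_by(10) {
                  thread::sleep(Duration::from_millis(100)); // stand-in for real work
                  let _ = progress_tx.send(pct);
              }
          });

          // "UI main thread": stays responsive and just drains progress messages.
          for pct in progress_rx {
              println!("progress: {pct}%");
          }
      }
      ```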

      They even do use OS scheduling for their needs by using low priority threads.
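      For reference, deprioritizing a worker thread can be as small as this (Linux-specific sketch, assuming the libc crate; not the article’s actual code):

      ```rust
      use std::thread;

      fn spawn_low_priority_worker() -> thread::JoinHandle<()> {
          thread::spawn(|| {
              // On Linux, nice values are per-thread, so this only deprioritizes
              // the worker: when the CPU is contended, the OS scheduler prefers
              // the latency-sensitive threads over this one.
              unsafe { libc::nice(10) };
              run_cpu_heavy_work();
          })
      }

      fn run_cpu_heavy_work() {
          // placeholder for the throughput-oriented batch work
      }
      ```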

      The decision whether to use tokio for the long-running ops is more questionable. It might be just a matter of “it works well enough and we prefer using the same APIs everywhere.”

      They also put a big buffer in front with the channel (and a rather slow one, I think).

      I think it can obviously be optimized, but the question is more: do they need to? Or is there a solution that is also simpler once you account for the new APIs it requires devs to learn?

      1. 1

        Just a random thought related to not using OS schedulers to their full extent: in-process app schedulers relate to OS schedulers in a similar way Electron relates to native desktop toolkits.

        Reimplementing part of the OS stack seems silly until you want apps to work cross-platform. Then your limited scheduler is still a compromise, but at least it works similarly everywhere. Overhead for threads, processes, and fibers is pretty different across operating systems, or at least it used to be.

        1. 1

          Most services are deployed on Linux (and if not Linux then usually only one operating system), however, so discrepancies that cause performance drops on other operating systems are not that important.

          1. 1

            I agree about deployment, but the ecosystem would still prefer a cross-platform approach.

            Being able to reproduce problems on your dev machine without a VM is very valuable, though.

            Also, I don’t think an optimized single-OS approach to async in Rust would get adoption.

      2. 6

        I use async even for CPU-bound tasks, because:

        • it supports cancellation nicely (while nothing can abort a tight loop, it can abort on any .await point, and it’s easy to insert extra ones if needed; see the sketch after this list)
        • it supports workloads that mix CPU-bound and network-bound tasks (if you try making a thread pool that efficiently handles both, you’ll re-invent async)
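        Concretely, the cancellation point from the first bullet looks roughly like this (my own sketch; chunk size and numbers are made up):

        ```rust
        use std::time::Duration;

        // A CPU-heavy loop made abortable: tokio can only cancel a task at an
        // .await, so insert an explicit yield point every so often.
        async fn crunch(data: Vec<u64>) -> u64 {
            let mut acc = 0u64;
            for (i, x) in data.iter().enumerate() {
                acc = acc.wrapping_add(expensive_step(*x));
                if i % 4096 == 0 {
                    tokio::task::yield_now().await; // cancellation (and fairness) point
                }
            }
            acc
        }

        fn expensive_step(x: u64) -> u64 {
            x.wrapping_mul(2654435761) // stand-in for the real per-item work
        }

        #[tokio::main]
        async fn main() {
            let handle = tokio::spawn(crunch((0..10_000_000u64).collect()));
            tokio::time::sleep(Duration::from_millis(10)).await;
            handle.abort(); // takes effect at the next .await inside crunch()
            match handle.await {
                Err(e) if e.is_cancelled() => println!("cancelled at a yield point"),
                other => println!("finished before the abort: {other:?}"),
            }
        }
        ```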

        Tips:

        • simply using tokio’s spawn_blocking is not too bad. I’d say that 80% of the time you don’t need to sweat it with moving work to other threads, messages, etc.
        • it’s OK to have more than one tokio runtime in a program. If you really need stable low latencies for your network handling, make a second, sacrificial runtime for the spawn_blocking-heavy tasks.
        • Drop of tokio::Runtime needs to happen on a synchronous thread. If you get mysterious panics about blocking in an async context, it’s because of the Drop. Use ManuallyDrop<Runtime> and, when dropping, send it off to a thread to die there (see the sketch after this list).
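        The second and third tips combined, as a rough sketch (the type and its name are mine, not a library API):

        ```rust
        use std::mem::ManuallyDrop;
        use std::thread;
        use tokio::runtime::Runtime;

        // A second, "sacrificial" runtime for blocking-heavy work, wrapped in
        // ManuallyDrop so it can be torn down on a plain thread instead of
        // inside an async context.
        struct SacrificialRuntime {
            rt: ManuallyDrop<Runtime>,
        }

        impl SacrificialRuntime {
            fn new() -> Self {
                let rt = tokio::runtime::Builder::new_multi_thread()
                    .worker_threads(2)
                    .enable_all()
                    .build()
                    .expect("failed to build runtime");
                Self { rt: ManuallyDrop::new(rt) }
            }

            fn handle(&self) -> tokio::runtime::Handle {
                self.rt.handle().clone()
            }
        }

        impl Drop for SacrificialRuntime {
            fn drop(&mut self) {
                // SAFETY: `self.rt` is never used again after take().
                let rt = unsafe { ManuallyDrop::take(&mut self.rt) };
                // Dropping a Runtime blocks, which panics inside an async
                // context, so ship it off to a synchronous thread to die there.
                thread::spawn(move || drop(rt));
            }
        }
        ```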
        1. 4

          If I’m understanding right, the main point here is to have a separate thread pool for async tasks which are expected to be CPU-heavy. This compares with the standard tokio approach, which is to use its built-in spawn_blocking thread pool that takes a sync closure rather than a future. It probably deserved at least a mention for comparison, although it’s prominently discussed behind a couple of the links.
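          For comparison, the difference in shape between the two, as I understand it (hedged sketch, not the article’s code):

          ```rust
          // Stock tokio: hand a *sync closure* to the built-in blocking pool.
          async fn stock_tokio(input: Vec<u8>) -> usize {
              tokio::task::spawn_blocking(move || cpu_heavy(&input))
                  .await
                  .expect("blocking task panicked")
          }

          // Separate runtime: spawn a *future* onto another runtime's Handle,
          // so the work can still contain .await points of its own.
          async fn separate_runtime(handle: &tokio::runtime::Handle, input: Vec<u8>) -> usize {
              handle
                  .spawn(async move { cpu_heavy(&input) })
                  .await
                  .expect("task panicked")
          }

          fn cpu_heavy(bytes: &[u8]) -> usize {
              bytes.iter().filter(|b| **b == 0).count() // stand-in for real work
          }
          ```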

          This does raise a couple of questions for me:

          1. In the futures spawned onto this alternative executor, what kind of async subtasks are we talking about that aren’t sensitive to latency? Wouldn’t it be easy to accidentally smuggle in some I/O or an async lock that undoes the benefits?
          2. Does tokio have any problems with creating futures in the context of one Runtime and spawning them on another? I’m honestly not sure; it just raises some red flags since Runtimes have their own timer threads and the like.

          BTW: At this time something has gone wrong with the code snippet for setting up the new runtime and the blocks don’t make sense. The code behind the GitHub link looks okay.

          1. 3

            I cannot answer all your questions but here is the problem with spawn_blocking:

            They are meant for code that spends most of its time blocked, waiting on IO. A reasonable strategy for this is to use a large number of threads, much higher than the actual parallelism provided by the available CPUs.

            The article wants a solution for CPU-heavy tasks that are executed in longer chunks for better throughput but would interfere with the low-latency requirements of other tasks. These tasks are also lower priority, apparently.

            E.g. in their use case they have some cheap (at least in terms of CPU) requests that they want to serve with low latency, and some CPU-heavy operations for which higher latency is acceptable or unavoidable and throughput is more important.

            One example of low latency they give is liveness checks. That to me is weird, since we are talking about acceptable latencies in the range of seconds, and if I get it right they always recommend slicing even longer tasks into smaller chunks. Otherwise it also becomes difficult to provide meaningful liveness probe coverage for that part of the app.

            1. 2

              > are meant for code that spends most of its time blocked

              I do utilize them also for any CPU-intense code like bcrypt - though I’m not intentionally relying on tokio’s threadpool for CPU-intense tasks, I just don’t want to block my webserver when performing them.
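              Roughly like this, for the curious (sketch; assumes the bcrypt crate’s hash / DEFAULT_COST API):

              ```rust
              async fn hash_password(password: String) -> Result<String, bcrypt::BcryptError> {
                  // bcrypt is pure CPU work for a noticeable fraction of a second,
                  // so push it onto tokio's blocking pool instead of stalling the
                  // worker thread that is serving other requests.
                  tokio::task::spawn_blocking(move || bcrypt::hash(password, bcrypt::DEFAULT_COST))
                      .await
                      .expect("bcrypt task panicked")
              }
              ```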

              1. 2

                That should usually work, but tokio probably sets the thread limit quite high.

                So if you have a lot of overlapping requests, it might spawn a lot of threads which time-share. Instead of handling a part of the requests fast, you handle all of them slowly while spending quite some resources on threads. That makes everything slower, and then you have even more overlapping requests.

                In the past, in other programming languages, I could increase resilience and throughput by limiting parallelism for expensive operations.

                If you don’t have many overlapping requests and you don’t fear a denial of service attack, all should be fine.
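                If you do want to cap it in tokio, one option (my own sketch, names mine) is a semaphore in front of spawn_blocking, so the excess requests queue instead of time-sharing:

                ```rust
                use std::sync::Arc;
                use tokio::sync::Semaphore;

                // At most `n` expensive jobs (n = Semaphore::new(n)) run at once;
                // the rest wait for a permit instead of fanning out into hundreds
                // of time-sharing threads.
                async fn run_expensive<F, T>(limiter: Arc<Semaphore>, job: F) -> T
                where
                    F: FnOnce() -> T + Send + 'static,
                    T: Send + 'static,
                {
                    let _permit = limiter.acquire_owned().await.expect("semaphore closed");
                    tokio::task::spawn_blocking(job)
                        .await
                        .expect("blocking job panicked")
                    // _permit dropped here, freeing the slot for the next request
                }
                ```

                Callers share one Arc::new(Semaphore::new(4)) (cloning it per call, since acquire_owned consumes the Arc) and wrap their bcrypt/compression/etc. in it.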

                1. 1

                  Yeah, that’s actually a problem I didn’t think about. Maybe I’ll use the actix threadpool for blocking stuff; it already allows me to set an upper limit on threads and integrates nicely into the (tokio) system.

          2. 2

            It’s interesting that the author didn’t mention the block_in_place API, which appears to be playing in the same space as separate runtimes, but perhaps in a finer grained way. Overall, this is an important area of exploration as analytical systems in Rust mature, and with the predominant way of doing IO in Rust being async.
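            For context, a minimal sketch of what that looks like (my own example; it requires the multi-threaded runtime):

            ```rust
            // block_in_place runs the closure on the current worker thread, after
            // asking the runtime to move other queued tasks off that worker. No
            // second runtime or channel involved - but there are no .await points
            // inside the closure, so (as noted below) it cannot be cancelled.
            async fn aggregate(batch: Vec<f64>) -> f64 {
                tokio::task::block_in_place(|| {
                    batch.iter().copied().sum::<f64>() // stand-in for a heavier kernel
                })
            }
            ```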

            It’d be great to get a standardized benchmark along with explorations into different solutions. Perhaps a good starting point for a suitable benchmark is grabbing data in (many) .parquet files from S3, decompressing into Arrow, and then a straightforward aggregation like a filter + count.

            1. 2

              Probably because block_in_place is not cancellable. It becomes just another “regular” thread.