1. 50

  2. 9

    Very interesting read. I wonder how much of the trouble the author had with async comes from the 2000s style of async built on callback hell. Modern async (as implemented in, say, Rust) emphasizes writing async functions in a direct style (with “async”/“await” keywords and compiler support to turn these into state machines, not callbacks) that should be as readable as their purely sequential counterparts. The dispatch might also be different, using a work-stealing scheduler backed by a pool of threads instead of many single-threaded event queues.
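
    To make the "direct style, resumable state machine" idea concrete, here is a rough sketch in Python rather than Rust. Python does not compile coroutines into an enum the way Rust does, so this only illustrates the shape of the idea; `suspend`, `add_two_inputs` and `drive` are made-up names for the example.

```python
import types


@types.coroutine
def suspend(label):
    # Hand control back to whoever is driving the coroutine; resume with
    # whatever value the driver sends in.
    value = yield label
    return value


async def add_two_inputs():
    # Direct style: reads like sequential code, with no callbacks.
    x = await suspend("waiting for x")
    y = await suspend("waiting for y")
    return x + y


def drive(coro, inputs):
    """Step a coroutine by hand, feeding one input per suspension point."""
    trace = [coro.send(None)]  # run up to the first await
    try:
        for value in inputs:
            trace.append(coro.send(value))
    except StopIteration as finished:
        return trace, finished.value
    raise RuntimeError("coroutine still suspended after all inputs")
```

    Calling `drive(add_two_inputs(), [3, 4])` stops at both awaits in turn and finishes with 7. The coroutine object keeps `x` alive between steps, which is the state-machine part.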

    1. 14

      I don’t think it fundamentally changes things. async/await is more syntax than anything, and will not prevent ordering bugs caused by concurrent execution. It also doesn’t change the fact that async code is contagious: when something deep down your stack suddenly becomes async, you have to mark every caller as async as well. Also, I have some doubts about the argument that “with a good enough scheduler, everything can be made async”. I do think it’s still worth carefully considering whether a function should be async. There is probably a good balance to find, and a lot of measurements to do.
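
      A small Python sketch of the contagion, assuming a hypothetical `read_config` deep in the stack that has just become async. Every caller above it has to become async too, and a plain synchronous caller only gets a coroutine object back instead of a result.

```python
import asyncio


async def read_config():
    # Deep in the stack: this used to be synchronous, now it awaits I/O.
    await asyncio.sleep(0)
    return {"retries": 3}


async def build_client():
    # Has to be async only because read_config is.
    config = await read_config()
    return f"client(retries={config['retries']})"


async def main():
    # ...and so does everything above it.
    return await build_client()


def sync_caller_gets_a_coroutine():
    # A synchronous caller cannot use the result directly.
    coro = build_client()
    is_coroutine = asyncio.iscoroutine(coro)
    coro.close()  # tidy up so Python does not warn about a never-awaited coroutine
    return is_coroutine
```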

      1. 4

        Async/await is more than syntax. It enables stackless coroutines, which make more efficient use of threads.

        Part of the problem with dispatch_async is that it’s quite expensive, since it requires creating a heap-allocated block. The first time I tried using it instead of locks to make a class thread-safe, I (fortunately) did some comparative profiling and backed out that change in a hurry.

        Contagion is real, but I think it’s less of an issue when there’s much less runtime and syntactic overhead to making a function async.

        1. 7

          it requires creating a heap-allocated block

          So does async/await in most implementations I’ve seen. You still have local variables. If you’re going to have stackless coroutines, those variables need to exist somewhere in memory, right? That somewhere is on the heap.

          1. 4

            AFAIK, in Rust you only allocate on the heap once for a full “task” (comprising many futures, which are translated by the compiler into an enum representing the state of a state machine). So you don’t have one allocation per future + one allocation per callback on that future. https://aturon.github.io/blog/2016/08/11/futures/ and following posts explain it much better than I can hope to do myself.

            1. 3

              Dammit, you’re right. I’d overlooked that. (But c-cube’s reply implies this can be optimized a lot, which is good.)

              1. 4

                In most cases, the optimizations still don’t make up for the high costs associated with distributing work to executor threads, handling communication around scheduling, and stitching results back together. This is well illustrated by the Universal Scalability Law. async is a very sharp tool that only has a couple of use cases where performance actually improves in Rust, but the downsides are always severe: additional bug classes related to accidental blocking, additional compiler errors because anything in scope that survives across a suspension point must be Send, additional compilation time as large trees of async-specific dependencies are pulled in, and so on. It really eats into the energy that engineers have left to actually be productive, and as a result the async Rust sub-ecosystem suffers from significant performance and quality issues and simply can’t compete with non-async implementations that are not hampered by these costs during their development.

                Like many things in computing, it’s not popular because it’s good; it retains users because of the social connections people form while compensating for its badness. As humans, we avoid acquiring information that could cause us to lose contact with a social group, which manifests as async users either never comparing performance to threads or acting as if they hadn’t seen any information at all when their tests show that async makes things worse.

            2. 3

              Contagion is real, but I think it’s less of an issue when there’s much less runtime and syntactic overhead to making a function async.

              Isn’t the async-contagion cost the issue of “two versions of everything”? (What colour is my function)

              Isn’t this the big idea of golang?

              There is no async contagion. But you get the benefits.

              Coroutines are managed by passing first class objects around (channels), which are lexically scoped so you know which routine is yielding to which.

              OS-level blocking is managed by the runtime so you pretend to write straight-through code but if you call read() then things are managed so that you don’t block anything else.

              1. 5

                Goroutines require custom stack management, which has a high cost in complexity — it took the team many years and two entirely different implementations to fix performance problems. The custom stacks also create interoperability problems with other languages, since the stacks have unusual behaviors like growing on demand and relocating(!). It also creates problems calling out into other-language code that can block, since you don’t want to block the goroutine scheduler.

                Also, all those stacks increase memory usage. Zillions of goroutines doing a task are potentially carrying around a zillion deep stacks.

                Channels are a nice feature but they’re really awkward to use for coroutine / actor implementations; I’ve tried. You have to write a lot of boilerplate to serialize and deserialize the messages, and it ends up being pretty expensive.

                1. 2

                  Do you know of any good articles about how the Go project solved these issues, and what they were, exactly?

                  1. 2

                    Something’s here: https://docs.google.com/document/d/1wAaf1rYoM4S4gtnPh0zOlGzWtrZFQ5suE8qr2sD8uWQ/pub

                    via: https://github.com/golang/go/wiki/DesignDocuments (Contiguous Stacks)

                    for newer design docs, see also: https://github.com/golang/proposal

                    Personally I’m not tracking the proposals since long ago, so I don’t have any idea if more happened since the doc I linked first above, which was around Go 1.3…

                    1. 1

                      Thanks for the links! Definitely worth a read

                  2. 1

                    Also, all those stacks increase memory usage. Zillions of goroutines doing a task are potentially carrying around a zillion deep stacks.

                    I thought this was again a selling point of golang. goroutines are generally cheap?


                    On a mid-end laptop, I’m able to launch 50 million goroutines.

                    It may be possible to arrange things so that an app has many deep stacks, but it’s not a case I’ve heard people coming across in practice.

                    Channels are a nice feature but they’re really awkward to use for coroutine / actor implementations; I’ve tried. You have to write a lot of boilerplate to serialize and deserialize the messages, and it ends up being pretty expensive.

                    I agree channels are cumbersome. I think this is likely to be due to the lack of higher level helpers, which I hope/expect will be addressed once go2 lands generics.

                    Having routines to handle fan-out, actor idiom etc genericised on “chan T” should help a lot with channel ergonomics I think.

                    It also creates problems calling out into other-language code that can block, since you don’t want to block the goroutine scheduler.

                    That’s an argument for all languages, including golang. If you block a thread in another language, the golang scheduler can’t help you. But that kind of proves the point - in golang, that’s a real win?

                  3. 2

                    Java’s Project Loom is another try at providing concurrency and parallelism without dividing the world in two. Time will tell if it succeeds!

            3. 2

              I follow Swift concurrency discussions quite closely especially now I have more time to invest into Swift development. From what I read, developers are quite aware of the thread explosion problem affecting libdispatch and try to not repeat that when implementing the actor model.

              libdispatch suffered from thread explosion because it cannot tell whether code is failing to make progress because of limited concurrency or because of an error in the code (see the PINCache thread from a few years back about the fake deadlock caused by libdispatch’s concurrency limit: https://medium.com/pinterest-engineering/open-sourcing-pincache-787c99925445). That directly led to more complex APIs such as queue targeting, which inherits both QoS and concurrency so that multiple private serial queues can multiplex onto one serial queue.

              I am hopeful that with the actor model, we can figure out something that efficiently multiplexes onto a fixed number of threads without growing the thread count dynamically.

              1. 2

                More worryingly async made our program a lot more unpredictable and hard to reason about: because every time we dispatched async we released the execution context until the work item completed, it was now possible for the program to execute new calls in an interleaved fashion in the middle of our methods. This led to all sort of very subtle and hard to debug ordering bugs

                To me the async vs. threads question is sort of orthogonal to the main issue with concurrency, which is global mutable state, or more generally nonlocal mutable state.

                If you don’t have nonlocal mutable state, then both threads and async are easy. If you do, then they’re both hard.

                “Ordering bugs” means “some state was mutated by someone I didn’t know about”. Well if you have a disciplined program structure, and you don’t pass the state to that piece of code, then it won’t be able to mutate it. I always trace my state from main() to reason about who else can modify it.

                That is, programs need to keep control over their state. But many/most non-concurrent programs aren’t written that way, which makes it hard to graft on concurrency.

                I would call that disciplined program structure “dependency inversion of state”, which means using zero mutable globals. State is an explicit parameter, which is basically functional programming. You can do functional programming in OO languages.
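
                A rough Python sketch of the difference, using asyncio and made-up names (`balance`, `deposit_shared`, `deposit_explicit`). The first version mutates nonlocal state across a suspension point and loses an update; the second takes the state as a parameter and hands the new value back, so nothing else can touch it in between.

```python
import asyncio

balance = {"value": 0}  # nonlocal mutable state (hypothetical)


async def deposit_shared(amount):
    current = balance["value"]
    await asyncio.sleep(0)  # suspension point: other tasks run here
    balance["value"] = current + amount  # may overwrite someone else's update


async def deposit_explicit(current, amount):
    # State comes in as a parameter and goes out as a return value.
    await asyncio.sleep(0)
    return current + amount


async def run_shared():
    balance["value"] = 0
    await asyncio.gather(deposit_shared(10), deposit_shared(20))
    return balance["value"]


async def run_explicit():
    total = 0
    for amount in (10, 20):
        total = await deposit_explicit(total, amount)
    return total
```

                Here `run_shared()` ends up with 20 instead of 30, because both tasks read 0 before either one writes. `run_explicit()` gives 30, at the price of the caller deciding the order itself.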

                Another way to think about it is that some classes are like “threads”, and some are like “data”. If you pass the same piece of data to two different threads, then you need some kind of synchronization. If you don’t, then no synchronization is necessary.

                Oil has zero globals, except for signal handlers and garbage collection state. Shells are only a little bit concurrent though – it has to deal with processes finishing in an indeterminate order, signals, and calling back into the interpreter from GNU readline for completion plugins. But still I find maintaining that discipline useful.


                And I would say that what’s nice about coroutines is that they allow you to turn global / nonlocal state into state on the stack. So they are a useful tool to reduce nonlocal mutable state, but they don’t address every use case. I have only done a little async/await, but it seems like they basically have that flavor, with perhaps a few more exceptions.

                1. 5

                  If you don’t have nonlocal mutable state, then both threads and async are easy

                  My personal experience tells me that threads, at least, are still hard. Here are the issues you have even without shared mutable state:

                  • concurrency scoping: structured concurrency is not default, it’s way too easy to leak some background thread of control
                  • deadlocks due to structured concurrency: if you do wait for threads of control by default, it’s pretty easy to make wait-loops.
                  • deadlocks due to limited queue sizes: if all queues/channels have fixed-size capacity and the communication graph is not a DAG, deadlocks are possible
                  • lack of backpressure due to unlimited queue sizes: actor systems fix the previous problem using unbounded mailboxes, which then require smart tricks to re-apply backpressure.
                  • parallel cooperative cancelation: if you have n tasks, and one dies with an error, the natural semantics of quickly cancelling others is hard to implement.
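
                  For the last bullet, here is roughly the simplest thing I can sketch in Python with plain threads: a shared `threading.Event` that every worker polls between steps. The job format and the name `run_until_first_error` are invented for the example, and real work would need a cancellation check wherever it can block, which is exactly the hard part.

```python
import threading


def run_until_first_error(jobs):
    """jobs is a list of (total_steps, fail_at_step_or_None) pairs."""
    cancel = threading.Event()  # shared cancellation flag
    lock = threading.Lock()
    errors = []
    completed = [0] * len(jobs)

    def worker(index, total_steps, fail_at):
        try:
            for step in range(total_steps):
                if cancel.is_set():  # cooperative check between steps
                    return
                if step == fail_at:
                    raise RuntimeError(f"worker {index} failed at step {step}")
                cancel.wait(0.001)  # stand-in for work; wakes early on cancel
                completed[index] += 1
        except Exception as exc:
            with lock:
                errors.append(exc)
            cancel.set()  # ask the siblings to stop

    threads = [
        threading.Thread(target=worker, args=(i, steps, fail_at))
        for i, (steps, fail_at) in enumerate(jobs)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors, completed
```

                  With `run_until_first_error([(5000, None), (5000, 0), (5000, None)])` the middle worker fails immediately and the other two stop long before their 5000 steps.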

                  When I try to implement what feels like a FizzBuzz level of concurrency in Rust (which fully covers the shared mutable state bit), I inevitably get sucked into scouring the Internet for accumulated wisdom, writing blog posts about things I’ve learned, and asking concurrency gurus. For me this stuff is annoyingly hard :(

                  1. 1

                    Yeah it depends a lot on the application, e.g. see my sibling comment about web apps. They technically have threads, but since there is no shared mutable state, you generally don’t think about it.

                    Many of your points are helped by the dependency inversion structure I try to follow. I think it works well for servers and batch programs. GUI frameworks may thwart attempts to use this structure, but some of it probably still applies.

                    • threads are initialized in main() and passed explicitly to modules that use them.
                    • In other words, your app is a library parameterized by threads or thread pools. Libraries don’t start threads themselves; they take threads.
                    • you wait for / destroy threads and thread pools in the same place that you create them – e.g. in main().
                    • Instantiating all threads and resources in a straight line enforces a DAG. If you don’t have a DAG it’s obvious. Oil’s modules necessarily aren’t a DAG and that’s very clear in main().
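
                    A tiny Python sketch of that structure, assuming a hypothetical library function `word_lengths`. The pool is created and torn down in `main()`, and the library only borrows it.

```python
from concurrent.futures import ThreadPoolExecutor


def word_lengths(pool, words):
    """Library code: takes a pool as a parameter, never starts threads itself."""
    return list(pool.map(len, words))


def main():
    # The pool is created and destroyed in the same place.
    with ThreadPoolExecutor(max_workers=2) as pool:
        return word_lengths(pool, ["async", "threads", "go"])
```

                    Because `pool.map` preserves input order, `main()` returns `[5, 7, 2]` regardless of which thread handled which word.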

                    On the other points, I’ve never used unbounded queues … Only bounded ones in C++, Python (and Go). What language / framework has those or encourages you to use them?

                    Cancellation is indeed hard. I would say that the state of the art is to “have bugs” with respect to cancellation :-/ It’s a “non-happy” path that can also be hard to test, and often relies on time.

                    One way I think of designing sequential programs is using ONLY “functions and data”, a la Rich Hickey, even though I use an imperative style in the small (as Rust encourages you to do). In his mind data is immutable.

                    I think an analogous thing that helps in designing concurrent programs is using ONLY “threads and resources”. That is, use a very regular style where you instantiate resources, instantiate threads, and pass resources to threads that need them.

                    I noticed this distinction in a couple places:


                    Resource types represent live objects with state and behavior, and often represent resources external to the program

                    (kentonv is a pretty opinionated and accomplished programmer, working on Cloudflare Workers now but also did a lot at Google)


                    In this mental model, resources are types which represent “a thing” - something with an identity and a state which can change with time as the program executes

                    So basically resources are the thing where you worry. They have problems with respect to concurrency, whether it’s thread-based, async based, or a mix.

                    And you make them extremely explicit, like in functional programming. That is basically what I’m advocating. Not sure if it helps all applications but I think it’s a good way to think about it, and other people seem to think the same.

                    I will note again that many frameworks and libraries will thwart this style. You have to make an effort to bridge paradigms. I think there is not enough knowledge around composing different concurrency paradigms.

                    DJB’s “self-pipe trick” was one thing that enlightened me with respect to this. Basically what do you do if you want to wait on data from a file descriptor and a process exiting at the same time? Well those two things have totally different APIs. The answer is to write a byte to a pipe in your signal handler, so you can have a single clean event loop.
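
                    Here is a minimal POSIX-only sketch of the trick in Python. It uses SIGUSR1 sent to the process itself as a stand-in for something like SIGCHLD, and the helper names `install_self_pipe` and `wait_once` are invented. Python also ships `signal.set_wakeup_fd`, which does a similar job; the handler is written by hand here to show the idea.

```python
import os
import select
import signal


def install_self_pipe(signum):
    """Return (read_fd, restore); read_fd becomes readable when signum arrives."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # never block inside the handler

    def handler(_signum, _frame):
        os.write(write_fd, b"\0")  # the whole trick: one byte into the pipe

    previous = signal.signal(signum, handler)

    def restore():
        signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
        os.close(read_fd)
        os.close(write_fd)

    return read_fd, restore


def wait_once(data_fd, signal_fd, timeout=5.0):
    """One wait over both event sources; returns the set of sources that fired."""
    readable, _, _ = select.select([data_fd, signal_fd], [], [], timeout)
    fired = set()
    if signal_fd in readable:
        os.read(signal_fd, 1)  # drain the wake-up byte
        fired.add("signal")
    if data_fd in readable:
        fired.add("data")
    return fired


def demo():
    data_r, data_w = os.pipe()
    signal_fd, restore = install_self_pipe(signal.SIGUSR1)
    try:
        os.kill(os.getpid(), signal.SIGUSR1)  # stand-in for e.g. SIGCHLD
        first = wait_once(data_r, signal_fd)
        os.write(data_w, b"hello")
        second = wait_once(data_r, signal_fd)
        return first, second
    finally:
        restore()
        os.close(data_r)
        os.close(data_w)
```

                    Both kinds of event now arrive through the same `select` call, so the loop needs only one code path for waiting.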

                    And ditto for composing events and threads. Most web servers do this, so most application programmers don’t worry about this. But some programs need to compose them at the application level, and that’s usually where you end up with a big mess, which leads to a pile of subtle bugs.

                    Some of the HN comments I linked in the blog post elaborate more on this and related topics of program structure. And you have to be extra paranoid about program structure for concurrent programs. You can’t just iteratively add code, as is the state of the art …

                    These comments basically boil down to “use functional programming” but do it in C++, Python, Rust, etc. Your program doesn’t have to be in a functional language to get the reasoning benefits of the functional style. Functional programming has an obvious relation to concurrency. Pure functions always compose correctly (because they don’t “do” anything). The exceptions to purity are where your program design starts.

                  2. 1

                    “Ordering bugs” means “some state was mutated by someone I didn’t know about”.

                    To me it means “my entry points were called in an unexpected order, triggering a bug when I mutated my state in a way I didn’t expect”. I’ve seen several bugs like this in Actor-based code I wrote & maintain.

                    I don’t think there’s any world where threads and async are easy.

                    1. 1

                      It depends a lot on the application, but what are your “entry points”? My point is that I like to instantiate them with all the state they need to read, all the state they need to write, and nothing more, and nothing less.

                      The “default” should be that the “entry points” don’t share state, but if they do, then you will see it in their signatures.

                      It’s true that this isn’t how most programs are written, and you will have to “fight” most web frameworks or GUI frameworks to do things that way. But I would say if you are starting a new app, it’s a good strategy to experiment with. And it’s basically equivalent to functional programming (so it’s not too exotic).

                      Here is one trivial case of threads being easy: a typical web server, if you use say Django or Rails. You technically have threads, but the framework lets you write straight line code. You basically program them like PHP or Perl.

                      Those types of programs do sometimes have shared state, and tend to grow it once they get large. But it should be very explicit; the “default” is shared nothing.

                      HTTP was designed to be stateless, and that’s one of the main reasons the web apps are so popular, despite the arguable awkwardness. It’s easy to program for and it scales. The state is in the database, which has richer mechanisms to handle concurrency and data integrity.

                  3. 1

                    That Swift is now going the actor route is quite baffling.

                    It’s like “async/await will be as popular as synchronized in 2025”, but for the year 2015.

                    1. 8

                      Not quite sure what you’re getting at. I think I hear you saying “Swift ought to be skating to where the puck is going to be, not where it is now”, but does anyone know where that is? AFAIK, Actors & async/await are the current state of the art in managing concurrency.

                      It’s a shame Swift is kind of behind on handling concurrency well, but it’s a rather young language and it inherited a workable-if-flawed concurrency system from Cocoa / Obj-C so initially they had higher priority things to focus on.

                      1. 1

                        Nobody is forcing you to add more stuff to languages.

                        It’s almost like adding features is not going to improve them.

                        Evidence: all the concurrency constructs added to languages in the past.