1. 21
  1. 22

    Well, I can write

    #pragma omp parallel for

    before a for loop and have my program run 6 times as fast. I don’t see why I shouldn’t do this. Writing my Mandelbrot renderer to use different processes so it can run distributed on multiple machines seems a little silly, to be honest. So yes, I use threads.

    1. 4

      Have a think about what this does. It is a programmer assertion that every iteration of your loop is independent: you have no loop-carried dependencies. For a simple loop, that’s fine. But if you call any functions in your loop, you are asserting that they are pure functions. C gives you absolutely no help in validating this; if you get it wrong, you will have subtle data corruption.

      1. 5

        You’re right, of course. But let’s face it, C is a bad choice if you want a safe language.

        1. 5

          I agree, but threads are a massive amplifier to that unsafety. Debugging complexity is related to the number of possible interactions between different components. In a multithreaded environment with type safety (e.g. C#, Java), any store to a field can interfere with any load of the same kind of field. In a multithreaded environment without memory safety (e.g. C), any store can interfere with any load. That makes debugging concurrency bugs in C incredibly hard.

          We have threads because they’re the simplest concurrency primitive to implement (just add more stacks and a scheduler), not because they’re a good fit for hardware or for programmers. They aren’t either. Shared-everything concurrency is difficult to reason about and concurrent mutation is the worst case for any modern cache coherency system.

    2. 13

      The below essay is Frankfurtian bullshit. I wrote it in the same style as https://shouldiusethreads.com/, because I believe that it is also bullshit of the same type; all of the statements are true, or at least very very hard to falsify, but it’s written primarily to offend, not based on its actual truth or falsehood.

      Should I use the filesystem?


      Some people would have you believe that NULL is the worst mistake in computer science. They make a good case for it, but they’re dead wrong. Shared mutable state is the worst mistake in the history of computers. Ask anyone who has debugged both a segfault and a race condition. They will assure you that the latter is 10-100× more difficult to solve.

      Look at a list of files that your system has touched recently. How many of them have actually been manipulated by more than one program? I’m mostly thinking of the hidden ones, like .bash_history and the .dvdcss directory. Outside of highly-visible but comparatively-rare circumstances, the filesystem is an excessively general abstraction for the “silent majority” of its use. The poorly-defined concept of filesystem atomicity results in a system that simultaneously has both excessive synchronization and not enough synchronization for the purposes of the same application (PostgreSQL, in the case of those two links). It’s impossible to design an fsync() call that provides full-filesystem atomicity while preventing two applications on the same computer from blocking each other, because fsync() is fundamentally a filesystem-global lock.

      What’s worse is that the default behaviour is to allow limitless race conditions. Every process has access to the same filesystem space as the other processes on your machine, and can do whatever it wants with it (as long as it’s running as the same user, which is the default behaviour when you launch processes). You have to take extreme pains to avoid accidentally doing the wrong thing. You’ll probably mess it up, and the symptoms will show up in an unrelated part of the multiprocess system 10 minutes later. There is no shame in making such mistakes — we’re only human, after all. The shame is in believing ourselves super-human, and reaching for a tool that we don’t need, in full knowledge that we’re likely to shoot ourselves in the foot with it. Nine out of ten times, this tool is the filesystem. If you’re asked to dig a hole and you point a pistol between your feet to get the job done… you should have used a shovel.

      But, perhaps your problem requires multiple processes to operate on the same data. In that case — you still shouldn’t use the filesystem! Consider using message passing instead. A good model is an overseer program, which organizes the work to be done and aggregates the results, spawning worker programs to run the actual computations. This is not only much more robust, but it scales better, since you could distribute these programs across multiple computers later on*. This approach is more useful, too, because often the user of the program may have a more novel idea about how to distribute work, and if they can invoke the worker processes themselves, they’re empowered to try it out. This is easier to test and debug for the same reasons. And, in addition to all of these benefits, you get brand new virtual memory scratch space which is much more difficult to fuck up.

      What’s worse is that, while there has been a great deal of attention paid to memory-safety, filesystem-safety in programming languages is essentially nonexistent**. Python, Java, Rust, even Haskell expose almost identical APIs for manipulating files, and that interface is terrible. Database designers since before I was born have known that user data and query code should be connected using prepared statements only, but that’s not how you manipulate paths on the filesystem on any of the major operating systems. If you want to connect user input that might potentially contain a forward slash, the best thing that language abstractions might provide is a function that will error out in the presence of “forbidden characters.” At least HTML has standardized escape sequences; how is it acceptable that, in 2020, we can’t have forward slashes in file names? You want to scrape the IMDB and use it to populate file names? Sorry; those episode titles have forward slashes in them, so you’re going to have to invent some arbitrary, non-standardized character substitution and hope it doesn’t result in a naming conflict. “Don’t fuck with user input” is a pretty low standard for API design, and the filesystem paths fail it.

      Don’t use the filesystem! It’s a trap!

      * NFS is not a solution. Because it’s designed to be transparent to the application, it can’t be both reliable and fast, because the application doesn’t give it enough information to know whether an operation should be retried or failed out if something goes wrong.

      ** The closest you can get is sqlite, which is great, but it can’t really fix the pessimistic synchronization, and it doesn’t work very well with existing tooling like git.

      1. 6

        Honestly, this mostly makes sense to me? The fact that we don’t have any better solutions to the use cases filesystems address doesn’t mean that a global shared filesystem is a good idea to begin with. In fact, if you look at more recent OSs that cared less about backwards compatibility, like the mobile OSs, they all have some sort of per-app scope for the filesystem, and in iOS specifically it’s heavily abstracted away.

        I mean, if you were designing an OS today, without any care for backwards compatibility, without any need to support any sort of POSIX API, would you really provide files and a filesystem as the abstraction for storage? I don’t think I would.

        1. 1

          I can’t decide.

          On the one hand it feels pretty archaic to just store labeled blobs of bits vaguely sorted in a tree and maybe with an extension as a rough type indicator. How are we going to explain this to the kids swiping on the TV? ;) And that in times where we have cloud everything and our phones have share buttons to avoid dealing with files! I am totally on-board with this accidentally true rant.

          On the other hand, pretty much any organization I need I can create with the wonderful filesystems we have, and I prefer that to having my files scattered across sharepoints and onedrives and google docs and whatnot.

          1. 2

            How are we going to explain this to the kids swiping on the TV?

            The same way it was explained to me? They can learn like anyone else, it’s not unlearnable.

            1. 2

              Your second alternative is not the only other option to filesystems though. It might seem like the only feasible one, but I’m not constraining myself with such minor concerns as “is this even remotely viable in the real world at all”.

              What if we could have one (or a couple) rich, programmable, safe, and portable interface that every program could (and did) use? What if you could have even more flexibility (or rigidity) at your fingertips to organize your data, in pretty much any device?

          2. 1

            You nailed it! I would just add a few solutions like “use INT 13h directly” or “print your data on paper”.

          3. 18

            Generally I think this is bad advice that came from legit experiences. Race conditions are 10-100x the cost of debugging segfaults. I’ve spent significant time debugging race conditions in distributed and lock-free systems, and I can sympathize a lot with this. But I disagree with the conclusions and the recommendations. You absolutely should avoid distribution, concurrency and parallelism if at all possible. But a lot of workloads, especially for building foundational multi-tenant infrastructure, simply have no choice but to take advantage of the abundant computational resources available on modern machinery. If you’re writing a microservice, yeah, coroutines can be great, but they still have tons of opportunities for race conditions. I think Go has taught a lot of people about the downsides of gratuitous concurrency, as well as the amazing upsides of tasteful concurrency.

            I think that really what this author wants is “structured concurrency” (which I wish had been called “structured parallelism” but alas we’re stuck with this conflation between concurrency & parallelism) where our libraries and programming languages really do provide clean abstractions for performing parallel work in a far less leaky abstraction.

            EDIT: I got confused when reading several “not” sentences in a row and misread a recommendation to use coroutine-based concurrency instead of threads when NOT cpu bound (which still seems to contradict their desire not to have race conditions). I’ve moved part of my own counter-rant below this line, with misconceptions of mine stricken.

            this person seems to be conflating concurrency with parallelism, saying things like “use event-loops/coroutines if you’re cpu bound”. Concurrency implies specifying some blocking dependencies to a scheduler, and delaying execution until those dependencies are ready. It’s more freedom for code and less freedom for schedulers. Parallelism is the absence of blocking dependencies - the degree to which code may execute without contention or coherence costs that induce queuing/blocking while executing. Parallelism and concurrency are opposites. While some schedulers can run independent concurrent computation across multiple cores and achieve some degree of parallelism, by doing so you are still bringing in a bunch of machinery that will add additional coherence costs to whatever parallel work you may have been able to accomplish independently. Rob Pike kind of misled a generation of programmers with this by claiming concurrency is the Right way to achieve parallelism, but it’s a cheap knock-off compared to abstractions that are designed from the beginning to be parallel while minimizing blocking, memory barriers, cross-core communication, mutexes, etc…

            Threads are just an abstraction for a CPU core. You have to use more CPU cores if you want to increase your CPU parallelism. Over-reliance on concurrent abstractions increases the CPU overhead and may make your already CPU-bound workload MORE CPU-bound, depending on the contention and coherency costs associated with making your workload parallel in the first place.

            1. 2

              this person seems to be conflating concurrency with parallelism, saying things like “use event-loops/coroutines if you’re cpu bound”.

              Where? It says use event-loops/coroutines “If the sum of your threads is <100% of one core”, that is, if it is not CPU bound.

              It quite explicitly says that when you are CPU bound, you should use multiple processes to take advantage of multiple cores.

              1. 1

                you’re right, I’ve misread that part of the article. sometimes I get confused when the word “not” sneaks into several sentences in a row. thanks!

              2. 2

                Go has taught a lot of people about the downsides of gratuitous concurrency

                For me it was Clojure. I was able to utilize a ton of threads without worrying about race conditions and just communicate between them with channels. The improvement between the single-threaded code vs the multi-threaded was enormous. I am not sure if this is the right way to handle single-connection network IO, though.

                1. 1

                  Concurrency implies specifying some blocking dependencies to a scheduler, and delaying execution until those dependencies are ready. It’s more freedom for code and less freedom for schedulers.

                  Indeed. As an example, aren’t user space thread pools required for things like glibc POSIX AIO?
                  With the status quo only very recently having changed on Linux with io_uring (which just uses kernel threads under the hood I think?)?

                2. 9

                  This is very short-sighted in my opinion. Threads have huge advantages over processes:

                  • Spawning a thread is faster than spawning a process;
                  • Threads take less memory than processes;
                  • Classic threading APIs have been well-known and well-supported for decades. Tooling exists. The pthread API is even quite simple. It is true that you will likely mess up your first multithreaded program, but this just does not really hold for someone experienced with threads. Most of the time, data races aren’t complicated to avoid with careful and proper design, and some programming languages make all this really easy.
                  • Sharing memory between threads is easy and lightweight. If it’s immutable, it’s shareable right out of the box without performance overhead; if it’s mutable, you only need a mutex. On the other hand, sharing memory between processes is quite a bit more complex and can be as slow as writing and reading a file on the disk (of course, /tmp may be an in-memory filesystem here, and there are probably faster implementations of shm_open out there). (EDIT: this claim about shm_open is incorrect. See david_chisnall’s comment.)
                  • Sending file descriptors between threads doesn’t require anything special either. On the other hand, doing so between processes requires messing with one of the most obscure APIs I have ever seen.
                  • Managing child processes isn’t always easy. There are lots of quirks related to signal handling, zombies, SIGCHLD, waitpid and so on. Writing async-signal-safe code is a science of its own, and signal-handling bugs can be awfully hard to debug, just like data races.
                  1. 5

                    Spawning a thread is faster than spawning a process;

                    Generally, this doesn’t matter, because spawning a thread is sufficiently expensive that you don’t create them for short-lived tasks. vfork + execve is slower than pthread_create, but not by much. fork is slower because of creating the CoW mappings, but that’s a one-off cost on creation.

                    Threads take less memory that processes

                    Again, that’s variable. There’s a fixed overhead for a process, but two processes created with fork are only consuming physical memory for pages where the host and child diverge. If these were

                    Classic threading APIs are well-known and well-supported since decades.

                    POSIX threads are barely two decades old. POSIX IPC APIs are older.

                    Sending file descriptor between threads doesn’t require anything special as well.

                    Which is one of the problems. Opening a file in a thread adds contention because the kernel must lock the file descriptor table that is shared among all threads in a process. Accidentally writing to the same file descriptor from two threads is another easy mistake.

                    can be as slow as writing and reading a file on the disk

                    Your link doesn’t back up your assertion and shows a lack of understanding of how modern (i.e. late 1980s and later) VM subsystems work. Mapping the same file into two processes does not involve round trips via the filesystem. The same page in the buffer cache is mapped into both processes. Writes are probably eventually written to the backing store, but usually only when there’s memory pressure. This is no different from threads, where some unused memory in a process may be swapped out.

                    Managing child processes isn’t always easy. There are a lot of quirks related to signal handling, zombies, SIGCHLD, waitpid and so on. Signal handling bugs can be awfully hard to debug, just like data races.

                    The same applies to threads, especially across operating systems. For example, Linux and *BSD / macOS have different (POSIX-compliant) rules about which threads asynchronous signals are delivered to. Threads that are created but not detached or joined stick around as zombies.

                    1. 1

                      You’re right about the shared memory. I had in mind that the memory won’t actually be written to the disk unless fsync is called or the kernel decides to, but indeed, this shouldn’t happen often.


                    2. 4

                      If it’s immutable it’s shareable right out of the box without performance overhead, if it is mutable you only need a mutex.

                      “You only need a mutex” is a huge oversimplification.

                      However, rather than a blanket “NO” that the site provides, a better answer would be a flowchart.

                      Should you use threads?
                      Do you have immutable data?    --- yes -->   YES
                      1. 1

                        Do you use rust? --- yes ---> MAYBE

                        (although one could argue that it’s just a twist on immutable data: if you have a &mut Foo, it’s as if it were immutable, since no one else can observe the mutation)

                      2. 1

                        Spawning a thread is faster than spawning a process;

                        Ahh, the Microsoft FUD that never dies….

                        I remember when Microsoft windows got threads… Oh the wonder, non-stop hype and fud at the dev conference… Light Weight Processes, so much faster!

                        And then I started using linux.

                        And strace.

                        And realised it was all FUD.

                        Threads were a wondrous advance and lightweight in the microsoft world because process creation in the M$ world at the time was insanely heavyweight.

                        Because they didn’t do mmap and COW!

                        strace through a fork in linux…. it’s as lightweight as a thread.

                        Even now M$ process creation tends to be heavier than linux. Why? DLLs. Yes, M$ has DLLs, but every app has its own collection of DLLs that they obstinately refuse to share.

                        A linux distro is basically a collection of apps and libraries that “play nice and share”.

                      3. 6

                        There is no mention of Inter Process Communication, which seems something you might want to talk about when defending process pools. In any case you have to decide your strategy to communicate between units: either message passing or shared memory (or both). With shared memory between processes, you still have the same problem of locking.

                        1. 5

                          An attempt at a silver bullet argument for threads v async io v processes. What could possibly go wrong?

                          Hint: there’s no silver bullet that applies to all situations.

                          1. 4

                            One case where I think threads are valuable is when you’re CPU-bound and memory-bound. This happens a lot in complex simulations and model-checking.

                            1. 3

                              It’s a little disingenuous to point out all of the “gotchas” of threads, but ignore all of the downsides to the alternatives. Concurrency bugs in async code, or using multiple processes, can be even more difficult to debug than threads, IME.

                              There’s no panacea, but I’ll take code using OpenMP or TBB over hand rolled multi-processing any day.

                              1. 2

                                Personally I’m not entirely sure that event loops are a strictly better solution (at least if your language does the whole ‘async tasks’ concurrency thing) - if you’re calling poll() yourself, it’s fine, but I’m inclined to believe async threads are almost as bad…

                                1. 4

                                  Async brings its own problems, and scheduling fairness is another big one (as is to be expected with co-operative scheduling). I’m also not at all convinced, at least in Python land, that many applications really are faster under asynchronous IO (whether explicit like asyncio or implicit like gevent). My personal comparisons of web frameworks, for example, show that UWSGI is considerably quicker than any of the uvloop-based approaches, often by a lot.

                                  1. 3

                                    Only a few niche workloads fare better with a userspace scheduler than threads. Usually it’s just load balancers and proxies that have tons of connections sitting around doing nothing. It’s totally fine to spin up 10k threads on most servers to deal with clients that are doing enough work to keep the server busy, in many cases.

                                    A big part of doing a job is feeling intellectually stimulated though, and async probably makes a lot of engineers happier by just doing more work to get their stuff done, like how dogs are often happier when eating food out of those “puzzle bowls” that make them lick all kinds of corners before getting their reward.

                                    1. 1

                                      I observed that uwsgi configured with the right number of workers can go a long way. However, some workloads rely heavily on external APIs (e.g., OAuth2 stuff), where almost everything consists of waiting for HTTP calls to complete. For that kind of app (I call them “gateways” or “proxies”), something like aiohttp (async-based) may make sense. That’s not the typical workload, though.

                                      1. 1

                                        I see the rationale there (I think that is how nginx works) but increasing the number of workers is also pretty easy and means you can keep writing normal python.

                                  2. 2

                                    But, perhaps your problem is CPU bound. In that case — you still shouldn’t use threads! Consider using multiple processes instead. A good model is an overseer program, which organizes the work to be done and aggregates the results, spawning worker programs to run the actual computations.

                                    From extensive experience trying to wrangle services written this way (Ruby/Unicorn, etc.) I can state that it is emphatically not a good model. Balkanizing your application’s logic and state across process boundaries to meet performance requirements is like a single-host version of the microservices fallacy: the knock-on costs to overall system coherence, as well as the impact to secondary concerns like observability, dominate whatever benefit you think the approach gives you.

                                    Bluntly: a single process should always be able to saturate whatever the physical bottleneck is on your system for your use case. If your programming language or runtime or design model doesn’t allow for this, fix it or pick a different one, don’t work around it.

                                    1. 1

                                      Invalid TLS cert, according to Chrome.

                                      Edit working now, some CDN shenanigans no doubt.

                                      1. 1

                                        It’s valid for me…

                                      2. 1

                                        I can spin up a vm with 416 vCPUs on Google Cloud - https://cloud.google.com/compute/docs/machine-types .

                                        These are really old arguments that were postulated by various interpreted languages in the late ’90s to early 2000s, on single-core machines, when Linux threads were not in the shape they are today.

                                        I find this argument completely out of sync with the capabilities of 2020.