1. 23
    1. 3

      It looks like libuv is doing the fork()/exec() dance internally:

      uv_spawn uses execvp(3) internally

      If spawning is part of handling each request this is going to cause you a lot of pain! You’ll have to either keep the heap size small to prevent fork() from becoming too expensive, or delegate spawning to a separate, small process that forks on your behalf (like Chromium’s “zygote”).

      The right way to do this is posix_spawn(3). This creates a subprocess directly without the overhead of CoWing the pages of the current process on fork() and then throwing away those pages on exec().

      1. 3

        Not to be pedantic, but posix_spawn (on Linux) doesn’t “create a subprocess directly”. It does vfork and exec, which is equally problematic since vfork pauses the parent until the child calls exec or exit anyways.

        https://git.musl-libc.org/cgit/musl/tree/src/process/posix_spawn.c#n198

        I don’t think there’s any advantage of using posix_spawn here, it’s just more complicated.

        1. 2

          Pedantry appreciated! The underlying syscalls are important.

          However, the point of vfork() (or clone(CLONE_VM | CLONE_VFORK) in glibc) is that the subprocess has access to the same pages as the parent. You don’t have to create the CoW page tables. That’s the overhead I was focusing on. The subprocess then reports its PID and calls exec.

          Hopefully that doesn’t take too long? But it could if the subprocess isn’t immediately scheduled, such as on an overtaxed system. It does seem like a potential bottleneck if this is all single-threaded, but also unlikely to be a problem in practice? I once worked on a FaaS executor like the system described in the linked article, where even regular fork() + exec() spawning wasn’t a perf problem because it was so small relative to the cost of actually doing anything useful in the subprocess in a dynamic language.

          1. 1

            or clone(CLONE_VM | CLONE_VFORK) in glibc

            Is that supported nowadays? The glibc maintainers used to tell people who wanted to use that they were on their own.

            1. 1

              Do you mean clone3? There are some #ifdef hijinks relating to clone() in the implementation, and the clone(2) man page talks about clone3 as a newer version. There do seem to be glibc wrappers.

              1. 1

                This is what I meant:

                https://sourceware.org/bugzilla/show_bug.cgi?id=10311

                If you use clone() you’re on your own.

                Reading stuff like this led me to start a liblinux project years ago, which then led to a Linux freestanding C runtime and programming language project that I still work on today.

                The glibc attitude seems to have changed for the better in the meantime. That’s good news.

                1. 2

                  Ah, yeah… that’s an Ulrich right there. I don’t think he’s so involved in glibc these days?

      2. 3

        It seems sort of perverse to use process-based parallelism in a JavaScript environment, when JSVMs are optimized for running lots of sandboxed VMs in a single process.

        The V8 C++ API lets you create any number of Isolates, which are independent VMs with their own heaps and threads. Within an Isolate you can create multiple contexts that share the VM but have (by default) separate global namespaces so they can’t detect each other.

        The only advantages I can see for forking are that (a) each process gets more file descriptors, and (b) forks are isolated from catastrophic VM failures like crashes or sandbox escapes.

        1. 4

          How do you use all your cores then? JS has a single-threaded event loop

          I think v8 isolates may only be used within a single tab, or top-level domain/origin in Chrome. Everything else is process-based, I believe


          In GC languages, true parallelism requires multiple OS processes (except if you’re Erlang)

          1. 3

            JavaScript also has worker threads. Also, don’t forget Java has been running full-on parallel threads against a shared GC heap since the 1990s, and somehow its performance isn’t completely terrible.

            1. 3

              I think it’s the browser that has worker threads, not JS. It’s sorta like ANSI C vs. POSIX. (Deno is supposed to be more like the browser than node.js is.)

              Web workers are more like shared-nothing processes - you have to pass JSON-like messages between the browser and worker. I believe they are implemented with v8 isolates and OS threads though, not OS processes.

              (WASM is also single threaded and shared-nothing, I believe it is kinda like Web workers with a different VM)


              Shared GC heaps obviously work and perform fine for many/most situations, but they all are less multicore-scalable than what the OS can provide … there are commercial Java GC’s aimed at specialized markets that need more scalability and performance on big heaps / many cores

              1. 1

                Node.js recently-ish grew worker threads too, slightly different API from the browser but basically the same model.

            2. 1

              Each isolate runs on its own thread. They are entirely independent of each other.

              Chrome is heavily process based, but I believe that’s for security reasons, part of its sandbox. Nothing specific to V8.

              I disagree about GC and parallelism. Separate heaps and separate threads are just as good. Erlang does this, but so does Pony, and Web Workers in browser JS.

              1. 3

                I looked, and I think in addition to

                (a) each process gets more file descriptors, and

                (b) forks are isolated from catastrophic VM failures like crashes or sandbox escapes.

                There is also

                (c) In 2017 (and maybe still now) node.js didn’t have an API for spawning isolates, but it does have one for spawning processes

                Here’s what looks like a separate project for that - https://github.com/laverdet/isolated-vm

                (d) Maybe Deno does not have it either, I’m not sure

                At Val Town we run your code in Deno processes.

                (e) You might want to fork a Python process to run NumPy, or a C++ process to do something else. I guess node does not have the ability to spawn a separate process without forking v8 itself – something similar in spirit to the Chrome zygote could allow that


                edit: Found a better thread

                https://groups.google.com/g/nodejs/c/zLzuo292hX0/m/F7gqfUiKi2sJ?pli=1

                Isolates Removed (2012)

                It was a very informative experiment, but has ultimately turned out to cause too much instability in node’s internal functionality to justify continuing with it at this time. It requires a lot of complexity to be added to libuv and node, and isn’t likely to yield enough gains to be worth the investment.

                1. 1

                  Thanks, that’s interesting! I wonder what the underlying problems were in node’s experiment; I’ve worked with both V8 and libuv, and adding multiple isolates/threads doesn’t seem like it would be much trouble, but I don’t know much about node’s higher levels.

          2. 3

            The linked issue from 2017 is interesting - https://github.com/nodejs/node/issues/14917

            The reporter had a node.js process with a 14 GB heap (RSS)

            And it takes ~300 ms to fork(), which blocks the main loop for 300 ms, which means your performance is limited to like 3 requests/second


            One explanation is:

            The problem is down to http://docs.libuv.org/en/v1.x/process.html#c.uv_spawn being run in the main event loop thread, and not in the thread pool. Forking a new process causes the RSS of the parent process to be copied.

            As RSS increases, spawn time increases.

            I’m aware that fork() is very hard to implement, but I don’t think it should be O(N), where N is RSS

            Does this have something to do with the fact that every node.js process basically has an enormous GCC-like compiler in it at runtime? (v8 is like 1M - 2M lines of code, with multiple compilers and interpreters)

            Are there tons of JIT data structures that are somehow invalidated when you fork?

            I don’t think “normal” C processes should have this big a perf hit, but I could be wrong …


            Oh actually I wonder if the v8 GC is fork() friendly?

            In Oils we made sure that the GC is fork() friendly, since Ruby changed their GC in 2012 because of this “dirty the whole heap” performance bug - https://www.brightbox.com/blog/2012/12/13/ruby-garbage-collector-cow-performance/

            Also CPython has the same issue on large web server workloads, e.g. at Instagram

            I would have thought v8 is designed for this since Chrome is multi-process ….

            1. 3

              Forking is sort of OS-specific, so Chrome can’t rely on it. On macOS, fork and Mach ports are incompatible, so you can’t successfully fork a process that’s done anything but really basic stuff. And apparently Windows doesn’t support forking at all.

              1. 2

                This is not true … at least historically, Chrome has a ton of OS-specific code deep in its bowels, relating to processes, and it uses fork() heavily. It’s different on Windows, Linux, and OS X.

                https://neugierig.org/software/chromium/notes/2011/08/zygote.html

                at startup, before we spawn any threads, we fork off a helper process. This process opens every file we might use and then waits for commands from the main process. When it’s time to make a new subprocess we ask the helper, which forks itself again as the new child.


                Forking is also used as an optimization:

                https://chromium.googlesource.com/chromium/src/+/HEAD/docs/linux/zygote.md

                A zygote process is one that listens for spawn requests from a main process and forks itself in response. Generally they are used because forking a process after some expensive setup has been performed can save time and share extra memory pages.


                Multiprocess is the gold standard for security! https://lobste.rs/s/dmgwip/sshd_8_split_into_multiple_binaries#c_wnaflq

                So back to the original question, I really do wonder if v8 is causing any problems with node.js and forking … I would have thought that is optimized, because of v8’s heritage in a multi-process program

                Although I guess it’s possible that I misunderstand where the forking happens … i.e. the rendering could be forked off, but the v8 parts aren’t forked, etc.

              2. 2

                When I first started working on this blog post I had just returned from Systems Distributed (https://systemsdistributed.com/) in New York.

                I was looking around for any information online about Node spawn performance and found this issue. The author of the issue is Joran Dirk Greef, CEO of Tigerbeetle. I imagine he filed it when he was working on the NodeJS prototype of Tigerbeetle. Tigerbeetle and Joran ran Systems Distributed.

                It was a surreal moment.

                1. 2

                  I don’t think it should be O(N), where N is RSS

                  Kind of. When you call fork the OS has to alias many memory pages, so it’s in part O(N) where N is the number of allocated memory pages, and RSS and memory pages are correlated.

                  1. 1

                    I would have thought you could do page ranges with some kind of tree structure, so it would be more like O(log N) to update that metadata

                    https://en.wikipedia.org/wiki/Page_table#Multilevel_page_tables

                    But I actually don’t know what’s common

                  2. 2

                    I read deeper in the thread, and it looks like there was indeed a 2018 v8 fix motivated by this 2017 issue:

                    https://issues.chromium.org/issues/42210615

                    https://chromium-review.googlesource.com/c/v8/v8/+/4602858

                    Add a new build flag v8_enable_private_mapping_fork_optimization which marks all pages allocated by OS::Allocate as MADV_DONTFORK. This improves the performance of Node.js’s fork/execve combination by 10x on a 600 MB heap.

                    1. 2

                      yeah, in my testing with that fix merged I can no longer reproduce the spawn times using the example script. it’s still slow, but not as slow as what’s outlined in the issue.

                    2. [Comment removed by author]

                    3. 1

                      I’m really confused about what this is about. AFAIK node is a framework for async I/O using JavaScript as its API, piggybacking on V8.

                      Is spawning a new process not outside the scope of node’s main intent and therefore off topic? I understand the functionality is there, presumably for convenience, but that raises the question: what are you using node for, and WHY? Node was created with the specific intent of not relying on forking.

                      I’m not trying to be a node fanboy. I dislike it and think it is a bad idea and a sub-par stack. But it is what it is.

                      1. 1

                        node.js actually has more powerful subprocess APIs than most languages (e.g. Python until asyncio landed), because they are async

                        It’s totally normal and traditional and powerful for servers to spawn processes

                        In theory, nothing about it being a node.js process changes that, though it sounds like there are some performance limitations