It looks like libuv is doing the fork()/exec() dance internally:

uv_spawn uses execvp(3) internally

If spawning is part of handling each request this is going to cause you a lot of pain! You’ll have to either keep the heap size small to prevent fork() from becoming too expensive or delegate spawning to a separate process where you can do so (like Chromium’s “zygote”).

The right way to do this is posix_spawn(3). This creates a subprocess directly without the overhead of CoWing the pages of the current process on fork() and then throwing away those pages on exec().

Not to be pedantic, but posix_spawn (on Linux) doesn’t “create a subprocess directly”. It does vfork and exec, which is equally problematic, since vfork pauses the parent until the child calls exec or exit anyway.
https://git.musl-libc.org/cgit/musl/tree/src/process/posix_spawn.c#n198
I don’t think there’s any advantage of using posix_spawn here, it’s just more complicated.
Pedantry appreciated! The underlying syscalls are important.

However, the point of vfork() (or clone(CLONE_VM | CLONE_VFORK) in glibc) is that the subprocess has access to the same pages as the parent. You don’t have to create the CoW page tables. That’s the overhead I was focusing on. The subprocess then reports its PID and calls exec.

Hopefully that doesn’t take too long? But it could if the subprocess isn’t immediately scheduled, such as on an overtaxed system. It does seem like a potential bottleneck if this is all single-threaded, but also unlikely to be a problem in practice. I once worked on a FaaS executor like the system described in the linked article, where even regular fork() + exec() spawning wasn’t a perf problem, because it was so small relative to the cost of actually doing anything useful in the subprocess in a dynamic language.

Is that supported nowadays? The glibc maintainers used to tell people who wanted to use it that they were on their own.
Do you mean clone3? There are some #ifdef hijinks relating to clone() in the implementation, and the clone(2) man page talks about clone3 as a newer version. There do seem to be glibc wrappers.

This is what I meant:
https://sourceware.org/bugzilla/show_bug.cgi?id=10311
Reading stuff like this led to me starting a liblinux project years ago, which then led to a Linux freestanding C runtime and programming language project that I still work on today.

The glibc attitude seems to have changed for the better in the meantime. That’s good news.
Ah, yeah… that’s an Ulrich right there. I don’t think he’s so involved in glibc these days?
It seems sort of perverse to use process-based parallelism in a JavaScript environment, when JSVMs are optimized for running lots of sandboxed VMs in a single process.
The V8 C++ API lets you create any number of Isolates, which are independent VMs with their own heaps and threads. Within an Isolate you can create multiple contexts that share the VM but have (by default) separate global namespaces so they can’t detect each other.
The only advantages I can see for forking are that (a) each process gets more file descriptors, and (b) forks are isolated from catastrophic VM failures like crashes or sandbox escapes.
How do you use all your cores then? JS has a single-threaded event loop
I think v8 isolates may only be used within a single tab, or top-level domain/origin in Chrome. Everything else is process-based, I believe
In GC languages, true parallelism requires multiple OS processes (except if you’re Erlang)
JavaScript also has worker threads. Also, don’t forget Java has had full-on parallel threads with a shared GC heap since the 1990s, and somehow its performance isn’t completely terrible.
I think it’s the browser that has worker threads, not JS. It’s sorta like ANSI C vs. POSIX. (Deno is supposed to be more like the browser than node.js is.)
Web workers are more like shared-nothing processes - you have to pass JSON-like messages between the browser and worker. I believe they are implemented with v8 isolates and OS threads though, not OS processes.
(WASM is also single threaded and shared-nothing, I believe it is kinda like Web workers with a different VM)
Shared GC heaps obviously work and perform fine for many/most situations, but they are all less multicore-scalable than what the OS can provide … there are commercial Java GCs aimed at specialized markets that need more scalability and performance on big heaps / many cores
Node.js recently-ish grew worker threads too, slightly different API from the browser but basically the same model.
Each isolate runs on its own thread. They are entirely independent of each other.
Chrome is heavily process based, but I believe that’s for security reasons, part of its sandbox. Nothing specific to V8.
I disagree about GC and parallelism. Separate heaps and separate threads are just as good. Erlang does this, but so does Pony, and Web Workers in browser JS.
I looked, and I think in addition to (a) and (b), there is also:

(c) In 2017 (and maybe still now) node.js didn’t have an API for spawning isolates, but it does have one for spawning processes
Here’s what looks like a separate project for that - https://github.com/laverdet/isolated-vm
(d) Maybe Deno does not have it either, I’m not sure

At Val Town we run your code in Deno processes.
(e) You might want to fork a Python process to run NumPy, or a C++ process to do something else. I guess node does not have the ability to spawn a separate process without forking v8 itself – something similar in spirit to the Chrome zygote could allow that
edit: Found a better thread
https://groups.google.com/g/nodejs/c/zLzuo292hX0/m/F7gqfUiKi2sJ?pli=1

It was a very informative experiment, but has ultimately turned out to cause too much instability in node’s internal functionality to justify continuing with it at this time. It requires a lot of complexity to be added to libuv and node, and isn’t likely to yield enough gains to be worth the investment.
Thanks, that’s interesting! I wonder what the underlying problems were in node’s experiment; I’ve worked with both V8 and libuv, and adding multiple isolates/threads doesn’t seem like it would be much trouble, but I don’t know anything about node’s higher levels.
The linked issue from 2017 is interesting - https://github.com/nodejs/node/issues/14917
The reporter had a node.js process with a 14 GB heap (RSS)
And it takes ~300 ms to fork(), which blocks the main loop for 300 ms, which means your performance is limited to like 3 requests/second
One explanation is:

The problem is down to http://docs.libuv.org/en/v1.x/process.html#c.uv_spawn being run in the main event loop thread, and not in the thread pool. Forking a new process causes the RSS of the parent process to be copied.

As RSS increases, spawn time increases.
I’m aware that fork() is very hard to implement, but I don’t think it should be O(N), where N is RSS
Does this have something to do with the fact that every node.js process basically has an enormous GCC-like compiler in it at runtime? (v8 is like 1M - 2M lines of code, with multiple compilers and interpreters)

Are there tons of JIT data structures that are somehow invalidated when you fork?
I don’t think “normal” C processes should have this big a perf hit, but I could be wrong …
Oh actually I wonder if the v8 GC is fork() friendly?
In Oils we made sure that the GC is fork() friendly, since Ruby changed their GC in 2012 because of this “dirty the whole heap” performance bug - https://www.brightbox.com/blog/2012/12/13/ruby-garbage-collector-cow-performance/
Also CPython has the same issue on large web server workloads, e.g. at Instagram
I would have thought v8 is designed for this since Chrome is multi-process …
Forking is sort of OS-specific, so Chrome can’t rely on it. On macOS, fork and Mach ports are incompatible, so you can’t successfully fork a process that’s done anything but really basic stuff. And apparently Windows doesn’t support forking at all.
This is not true … at least historically, Chrome has a ton of OS-specific code deep in its bowels, relating to processes, and it uses fork() heavily. It’s different on Windows, Linux, and OS X.
https://neugierig.org/software/chromium/notes/2011/08/zygote.html

at startup, before we spawn any threads, we fork off a helper process. This process opens every file we might use and then waits for commands from the main process. When it’s time to make a new subprocess we ask the helper, which forks itself again as the new child.
Forking is also used as an optimization:
https://chromium.googlesource.com/chromium/src/+/HEAD/docs/linux/zygote.md

A zygote process is one that listens for spawn requests from a main process and forks itself in response. Generally they are used because forking a process after some expensive setup has been performed can save time and share extra memory pages.
Multiprocess is the gold standard for security! https://lobste.rs/s/dmgwip/sshd_8_split_into_multiple_binaries#c_wnaflq
So back to the original question, I really do wonder if v8 is causing any problems with node.js and forking … I would have thought that it’s optimized, because of v8’s heritage in a multi-process program
Although I guess it’s possible that I misunderstand where the forking happens … i.e. the rendering could be forked off, but the v8 parts aren’t forked, etc.
When I first started working on this blog post I had just returned from Systems Distributed (https://systemsdistributed.com/) in New York.
I was looking around for any information online about Node spawn performance and found this issue. The author of the issue is Joran Dirk Greef, CEO of Tigerbeetle. I imagine he filed it when he was working on the NodeJS prototype of Tigerbeetle. Tigerbeetle and Joran ran Systems Distributed.
It was a surreal moment.
Kind of. When you call fork the OS has to alias many memory pages, so it’s in part O(N) where N is the number of allocated memory pages, and RSS and memory pages are correlated.
I would have thought you could do page ranges with some kind of tree structure, so it would be more like O(log N) to update that metadata
https://en.wikipedia.org/wiki/Page_table#Multilevel_page_tables
But I actually don’t know what’s common
I read deeper in the thread, and it looks like there was indeed a 2018 v8 fix motivated by this 2017 issue:
https://issues.chromium.org/issues/42210615
https://chromium-review.googlesource.com/c/v8/v8/+/4602858

Add a new build flag v8_enable_private_mapping_fork_optimization which marks all pages allocated by OS::Allocate as MADV_DONTFORK. This improves the performance of Node.js’s fork/execve combination by 10x on a 600 MB heap.
yeah, in my testing with that fix merged I can no longer reproduce the spawn times using the example script. it’s still slow, but not as slow as what’s outlined in the issue.
I’m really confused about what this is about. AFAIK node is a framework for async I/O using JavaScript as its API, piggybacking on V8.

Isn’t spawning a new process outside the scope of node’s main intent, and therefore off topic? I understand the functionality is there, presumably for convenience, but that raises the question: what are we using node for, and why? Node was created with the specific intent of not relying on forking.

I’m not trying to be a node fanboy. I dislike it and think it is a bad idea and a sub-par stack. But it still is what it is.
node.js actually has more powerful subprocess APIs than most languages (e.g. Python until asyncio landed), because they are async
It’s totally normal and traditional and powerful for servers to spawn processes
In theory, nothing about it being a node.js process changes that, though it sounds like there are some performance limitations