1. 34
  1. 12

    I don’t understand the musl decision to have such small stacks. FreeBSD uses 1 MiB * sizeof(void*) as the default stack size (I think thread stacks are smaller by default, but they’re controllable via the pthread APIs). On 32-bit platforms, a 4 MiB stack for every thread can be a bit too much overhead. On a real 32-bit system (as opposed to a 32-bit userspace on a 64-bit kernel) you typically have only 2 GiB of virtual address space and so a 4 MiB stack per thread would limit you to at most 512 threads (assuming TLS is allocated in the stack and you have no other consumers of memory such as code or the heap), which is a bit constrained. On a modern CPU with a 48-bit address space, you have space for 2^15 8 MiB stacks, which is plenty (if you have anything like that many kernel threads, you have other problems).

    On a system that can allocate physical memory on demand, there’s no reason to not reserve quite a lot of virtual address space for stacks.

    1. 6

      I think you’ve partially answered your own question: IIRC the original impetus for musl libc was frustration with other “lightweight” libcs at the time, which were used on smaller machines and embedded contexts where glibc was too heavy to be workable. So it was very much targeted at environments where big stacks would, in fact, be a problem.

      Perhaps you could argue that on other platforms, where this is not an issue, the default stack size should be different. I think this is defensible, but a counterargument is that if something must fail on one platform, making it fail on all platforms is a good way to shake out portability problems.

      1. 4

        It seems that some of the design decisions in musl are aimed at minimizing virtual memory requirements, which makes some sense on 32-bit archs but much less on 64-bit archs.

        1. 1

          I wonder what happened in the 30 years since 16M of RAM and 1G of drive space appeared limitless (where just six years prior to that computer was my first computer with 16K of RAM and cassette storage) …

      2. 5

        On architectures with lots of callee-save registers, stack space requirements also increase. For the 32-bit PowerPC JavaScript JIT in TenFourFox, because IonMonkey likes non-volatile registers, we have to save nearly all of them as part of every generated function prologue. This really bloats stack frames and means a fair bit of recursion can run you out of stack space in a hurry. It ran with a 1GB stack space for a very long time, much more than the default on OS X. Sure, we could use less registers, or use more volatile ones, but then it’s still spilling them to the stack with the added performance issues that entails besides fewer registers in flight.

        1. 2

          That just sounds like bad register allocation.

          If you didn’t have enough callee-save registers then you’d just have to allocate values you want to have preserved over function calls in the stack frame in the first place. So it makes no difference to space used. (If they’re used more than once then it’s faster to have them in registers than on the stack, but that’s a different question)

        2. 3

          Did you mean “On a modern CPU with a 48-bit address space, you have space for 2^25 8 MiB stacks”? 2^25 threads * 2^23 bytes = 2^48 bytes.

          1. 3

            Fwiw if I wanted to start hundreds of threads in a 32 bit process I’d call pthread_attr_setstacksize() to lower the limit.

            Also I’d have to ask the obvious questions about why so many native threads in the first place? If they’re all blocking on network then maybe I should use green threads. If they’re all on CPU maybe I should use a work queue. If they’re all blocking on local disk access (which is the one thing I know of where having lots more threads than CPUs helps) then maybe I could use io_uring.

          2. 8

            GCC has -Wstack-usage=BYTES and -Wframe-larger-than=BYTES warnings which can be used to put some guardrails around stack usage at compile time.

            They aren’t perfect, but I’ve started setting them (usually to the lowest power of 2 that still works) in new projects, and found it handy that they will say “hey, that thing you just did increased the stack usage significantly” when necessary, while being easy to adjust or turn off and not requiring any elaborate mechanism in the code to attempt to do this dynamically.

            1. 5

              The other unspecified thing is if you have a thread, it’s impossible to know how much stack it uses unless it doesn’t call into any external library. The moment it calls into libc, there’s now an ABI contract that the function you call can complete in N bytes of stack or less, except libc never promised as much. A fully specified system would need each function to indicate how much stack it uses on entry, with compile time checking that each function it calls still has sufficient remaining stack after the compiler determines that function’s stack usage.

              1. 5

                Ideally, high-level languages would have only one of two stack models, once all of their recursion is desugared and simplified. Either stacks would be explicitly sized and allocated, or stacks would have no limits whatsoever. C seems to sit in a middle ground where stacks are implicitly sized and have no limits until they unpredictably lose correctness.

                1. 3

                  C and many other languages; you can similarly blow the stack in Java, Python, Go, JavaScript and even OCaml (I’m pretty sure), just off the top of my head.

                  Obviously at a certain point you just run out of memory, but the fact that you can e.g. fix a bug in a recursive descent parser by using an explicit stack instead of actual recursion is a bit of a smell.

                  1. 2

                    I can confirm that the OCaml runtime can have stack overflows. We shouldn’t single out C. However, there are flavors of Java and Python which are “stackless” and explicitly allow any sort of recursion, as well as segmented-stack designs like in Go’s runtime. We might hope for flavors of C which address the problem.

                    1. 6

                      We shouldn’t single out C.

                      C is still the only one of those (AFAIK) where you can’t reliably detect and recover from a stack overflow. It’s the combination of having an implicit undetectable stack size and not being able to detect stack overflow that’s so iffy.

                      1. 4

                        I definitely agree languages should just solve this.

                        Tangent re Go: they abandoned segmenting stacks many many years ago, in favor of just allocating a new, larger stack when they run out of space. Though there’s still an artificially enforced limit last I checked; you can still blow the stack, even though in principle it could just keep growing. If I were to guess at the reasoning behind this, I would guess:

                        • Last I checked stack scanning was one of the few stop-the-world parts of their otherwise concurrent, low latency GC, so large stacks could cause latency problems. (disclaimer, I haven’t followed this closely for a couple years, so things may have changed some).
                        • Similarly, as the stack grows copying larger and larger stacks as needed could start to be a latency problem.
                        • Deep recursion isn’t really how Go programs are typically written to start with.
                        • Per the above, it’s status quo, so it’s not like people will be surprised by it.

                        This is of course just my conjecture. I think the problems are surmountable, though I think not being top priority right now is defensible.

                        (Obviously segmented stacks solve the latency issues by making growing non-amortized, but they moved away to solve different performance problems).

                  2. 4

                    It looks like clang outputs this in an ELF metadata section (and this change was originally made by the PS4 team) which could be read by whoever is trying to run the program. I wonder if the best way for you to make your app the most portable™️ would be to provide many different versions using different max stack sizes.

                    quoting the linked article by Ariadne,

                    …or the developer fixes their program to behave correctly only for the Alpine case, and it remains silently broken on other platforms.

                    If I have to special case your platform, then your platform is the one making my app not portable. I’m all for platforms choosing what is best for them (macOS deprecating OpenGL caused many months of work for me) but

                    it is my opinion that if your program is crashing on Alpine, it is because your program is dependent on behavior that is not guaranteed to actually exist

                    is a bit of a stretch. It effectively was guaranteed by being the same for the majority of platforms, or at least the majority of Linux distributions. Is there a reason to change the “default” here? I can see how in some applications a smaller thread stack size could increase performance, but I would have assumed the default is every thread has the same stack space as the main thread. Maybe I’m a little too optimistic with how many threads people are spawning in their applications.

                    1. 3

                      Can somebody explain why it doesn’t just trap and grow the stack, rather than crash? I haven’t written any C in decades, so I’m guessing there’s an obvious reason I’m just not aware of.

                      1. 6

                        Why doesn’t Rust?

                        I think the reason is probably that you can’t just move it around because there are pointers to things on the stack. In a single-threaded world you could devote half your virtual address space to the stack, but if you’ve got an indeterminate number of threads you have to decide at thread creation time how big the stacks can be so how far apart to make the addresses of the stacks.

                        1. 7

                          Note that very early versions of rust did do this, using a segmented approach, but they hit the same performance issues with that that the Go developers did. The Go folks moved to contiguous stacks with a compiler & runtime smart enough to fix up the pointers after copying the data out. The rust folks just dropped the feature. See:

                        2. 4

                          You could do this, but you’re very limited in your implementation strategies; for example, Go does this (up to a point, see https://lobste.rs/s/wddelm/stack_size_is_invisible_c_effects_on#c_ln4kob, so you can still blow the stack but its somewhat artificial afaik), but in doing so it has to move the current contents of the stack, and fix up any pointers that point into said stack. Go can do this because the runtime & compiler know where all the pointers are. It’s even easier in most languages, since e.g. in Java you can’t have pointers into the stack in the first place, and you can find examples of other language implementations that do this.

                          Earlier versions of Go used a “segmented” stack, where they would allocate the stack as a linked-list of pages, rather than moving the whole stack onto a bigger segment of memory. But they hit performance problems with this, which probably could be solved but afaik no one has gone to the trouble of trying to overcome those obstacles, and the Go folks have long since abandoned this approach. You could in principle do this in C I think, but the perf issues are a probable concern.

                          A smaller issue, this requires you to do implicit dynamic memory allocation, which is pretty standard in most languages, but something C programmers would generally find very surprising. It also wouldn’t work in environments without appropriate runtime support, and where it does work it would require fairly tight coupling between the compiler & runtime library. This kindof goes against the grain of how C is supposed to work.

                          1. 4

                            What is below the stack? When you map a large stack with MAP_STACK, the kernel will populate a few pages at the top of the reserved address range but fill the rest with copy-on-write mappings of a page full of zeroes. As the stack grows, the kernel will take CoW faults and populate the rest of the mapping.

                            Typically, there’s a page or two at the end of the stack marked as guard pages so that accesses past the end will trap (stack clash showed that this doesn’t catch everything - if you allocate a 1GiB array on the stack, the start of the array might be in a heap object). The kernel, in theory, could grow the stack past the guard pages but there may well be another mapping underneath. You can’t relocate the stack because C is full of pointers into the stack (particularly on parameters).

                            On Windows, the kernel doesn’t do lazy commit and so most of this process is handled in userspace. Functions that allocate more than one page of local state to a probe read to the bottom of the stack and if the stack needs to grow then the C runtime can do it. This could potentially be used to build segmented stacks, but that’s quite tricky to get right with C calling conventions that expect to find parameters at some offset from the stack pointer (and that may not be copyable).

                            1. 3

                              C doesn’t automagically do anything not explicitly part of the calling convention ABI.

                              1. 2

                                If you have 2 threads with adjacently allocated stacks (because you want them to only take up a small amount of address space) and one of them wants to grow into the space taken up by the other, you have a big problem. You can’t move stacks in C because things may hold pointers into them

                                1. 3

                                  On 64-bit computers you can just start each stack 1TiB apart (OS permitting).

                                  1. 1

                                    Yes, I’ve used similar strategies that leverage demand paging to locate objects of the same type contiguously in a huge sparse array, without requiring any relocation (e.g. if you have say 64-byte objects, a 32TB memory mapping, and 2^16 distinct object types, you can start the subarrays for each type 4MB apart, so you can have up to 2^16 objects for each type).

                                    1. 1

                                      This is true but not germane: I was talking about the case where you really do require small address space allocations for each stack. I failed to convey that. :/

                                2. 2
                                  1. 1

                                    Rust also has this problem but also makes it really easy to put big data structures on the stack.