
  2. 7

    The article highlights “OpenBSD before 4.6” as having the smallest stack space, but OpenBSD 4.6 is from 2009 and essentially not worth considering IMHO. I certainly wouldn’t call a 12-year-old system a “common x86_64 platform”.

    So Alpine actually has the smallest stack space in that table, with 128K, followed by OpenBSD and Darwin/macOS with 512K.

    I also looked up the stack size for some other platforms (quick search, may be outdated for some):

    • Solaris: 2M (64-bit) or 1M (32-bit)
    • z/OS: 1M
    • QNX: 256K (amd64), 512K (arm64) or 128K (32-bit)
    • HP-UX: 256K (but can be changed easily with PTHREAD_DEFAULT_STACK_SIZE env var).
    • Minix: 132K
    • AIX: 96K

    So AIX is actually smaller, and Minix is close. Not sure if either really counts as a “common” platform though.

    Since you can just set the stack size with pthread_attr_setstacksize(), I’m not sure why you wouldn’t do that? I never worked much with pthreads so maybe there’s some disadvantage to that?
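
    For reference, using it looks roughly like this (a minimal sketch; the 1 MiB figure is arbitrary, it just has to be at least PTHREAD_STACK_MIN):

    ```c
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    static void *worker(void *arg) {
        (void)arg;
        /* ... work that needs a predictable amount of stack ... */
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_t tid;
        int err;

        pthread_attr_init(&attr);

        /* Ask for a 1 MiB stack instead of the platform default. */
        err = pthread_attr_setstacksize(&attr, 1024 * 1024);
        if (err != 0)
            fprintf(stderr, "pthread_attr_setstacksize: %s\n", strerror(err));

        err = pthread_create(&tid, &attr, worker, NULL);
        if (err != 0)
            fprintf(stderr, "pthread_create: %s\n", strerror(err));
        else
            pthread_join(tid, NULL);

        pthread_attr_destroy(&attr);
        return 0;
    }
    ```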

    1. 2

      “I’m not sure why you wouldn’t do that?”

      I think mostly people don’t realize it’s a problem unless they run into an issue at runtime. And as most people don’t develop POSIX programs on musl, they don’t see any problems.

      “I never worked much with pthreads so maybe there’s some disadvantage to that?”

      No particular disadvantage. But not all programs use pthreads directly, and the abstractions they do use don’t always expose the stack size.

      For example, as far as I can tell from a cursory glance, glib threads don’t provide a stack size parameter. But I don’t know glib, so I could be wrong.

      C++ std::thread doesn’t provide a portable option either, though you can use native_handle() to set pthread-specific parameters.

      On the other hand, Rust’s std::thread exposes the stack size, and so does the Rayon crate. 😁

    2. 6

      Why does Alpine use such a small stack?

      1. 10

        This is actually from musl libc: https://wiki.musl-libc.org/functional-differences-from-glibc.html#Thread-stack-size

        “This size was determined empirically with the goals of not gratuitously breaking applications but also not causing large amounts of memory and virtual address space to be committed in programs with large numbers of threads.”

        1. 4

          Large thread stacks were/are a problem on machines with a 32-bit address space due to virtual address exhaustion. If Linux allows 2GB for user mode addresses, you only have room for about 200 x 8MB thread stacks. (This used to be a huge problem in the days when servers would use one thread per client, consuming thousands of threads.)
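
          To make that concrete, here’s a quick sketch that keeps creating threads which just block with the default stack size and counts how many fit. Built 32-bit (e.g. -m32 -pthread, with an 8MB default stack) it typically gives up after a couple of hundred threads; on 64-bit you hit other limits long before address space runs out:

          ```c
          #include <pthread.h>
          #include <stdio.h>
          #include <string.h>
          #include <unistd.h>

          static void *park(void *arg) {
              (void)arg;
              pause();               /* block forever so the stack stays mapped */
              return NULL;
          }

          int main(void) {
              pthread_attr_t attr;
              unsigned count = 0;

              pthread_attr_init(&attr);
              pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

              for (;;) {
                  pthread_t tid;
                  int err = pthread_create(&tid, &attr, park, NULL);
                  if (err != 0) {
                      /* On a 32-bit build with 8MB default stacks this typically
                         fails once the address space is full of stacks. */
                      fprintf(stderr, "gave up after %u threads: %s\n",
                              count, strerror(err));
                      break;
                  }
                  count++;
              }
              return 0;
          }
          ```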

          1. 6

            It’s worth noting that the quoted sizes in the article for FreeBSD are for 64-bit architectures: 32-bit ones default to half that size.

            Go did a lot of work on segmented stacks and eventually, I believe, gave up on the approach completely for 64-bit machines. There are a number of problems with it (perf overhead, and trouble if you call into C code with too small a stack), and on a 64-bit machine you can have a huge number of 8 MiB stacks without any problems. The kernel will lazily allocate physical memory, and virtual address space is not a scarce resource.

            The only problem with this approach is that, to my knowledge, no *NIX kernel ever reclaims stack pages while a thread is running. In the overwhelming majority of cases, it would be fine to look at the stack pointer during a system call and reclaim any pages within the stack allocation range below that (minus the red zone). It would be good to do this periodically in cases where memory pressure is high. Otherwise, if you have a very deep call depth and then return, the stack space is allocated and never reclaimed (until thread exit). Colin Percival’s cooperative stacks thing, which uses setjmp / longjmp tricks to switch between threads that use an on-stack allocation, would break this, so it’s not something that you can enable by default.

            Windows does a more complicated dance. The NT kernel philosophy says the kernel shouldn’t make promises it can’t keep. If you commit 8 MiB of memory then the kernel doesn’t necessarily actually give you physical pages, but it does add the pages to your process accounting, and it will not account for more pages than it could allocate between main memory and swap. As a result, Windows requires that the compiler explicitly probe the stack in some places so that the kernel has a well-defined point where it can either allocate the page or deliver an SEH exception. As I recall, this isn’t needed if you allocate less than a page on the stack because the kernel maintains a commit charge of one more page than you’ve ever actually used for the stack.

            1. 1

              “it would be fine to look at the stack pointer during a system call and reclaim any pages within the stack allocation range below that”

              It would be somewhat costly to do this, I feel, because you’d have to unmap the pages – which would presumably require synchronisation between CPUs. It is cheaper to map the pages in, because there was nothing mapped at that address before so a subsequent fault on another CPU won’t result in an access to the wrong physical page.

              1. 1

                “It would be somewhat costly to do this, I feel, because you’d have to unmap the pages – which would presumably require synchronisation between CPUs.”

                Yes, but you’d do it only in conditions of memory pressure. When you identify memory is constrained, you’d go and try to reclaim all of the memory that you have, reverting it to the original CoW copy of the canonical zero page. You’d probably start by doing it for all of the threads that are parked in the kernel waiting for events - if all of the threads for a process are in this state then you can invalidate TLBs on a slow path in syscall return (and since it’s a really infrequent path, you can probably do a complete TLB invalidate, rather than try to track the individual entries that you need to shoot down).

                Note that the cost of unmapping varies hugely across architectures. You typically need some locks for the kernel’s book-keeping entries (though for a stack mapping, these are likely to be uncontended because you generally don’t mess with one thread’s stack from another while the thread is running). You then update the page table entry and invalidate the TLB. This process is much cheaper on architectures such as AArch64 that provide a broadcast invalidate than on x86 where you need to IPI all of the cores and invalidate.

        2. 4

          In a 64-bit address space, why wouldn’t the OS just space the thread stacks really far apart, like 100MB, and map those pages as needed? Then it could budget RAM for the process’s total stack space, instead of a fixed amount per thread.

          1. 1

            Isn’t that just like setting the thread stack size to 100MB? Paging is an independent concept and still happens behind the scenes AFAIK.

            1. 4

              Not exactly: today’s 8mb stack means that the pages are mapped eagerly, when the thread is created. That is, while we don’t allocate physical pages to back the stack space immediately, we do insert 8mb/4k entries into the page table in the kernel.

              An alternative would be to do this mapping lazily: rather than mapping the whole stack, you’d map just a small fraction immediately, with a guard page after that. When a guard page is hit and you get a signal, you’d just map&allocate more of the space.
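
              As a toy sketch of that mechanism, done on an ordinary mmap’d region rather than a real thread stack (and note that calling mprotect from a signal handler isn’t formally async-signal-safe, so this only shows the shape of the idea):

              ```c
              #define _DEFAULT_SOURCE
              #include <signal.h>
              #include <stdio.h>
              #include <string.h>
              #include <sys/mman.h>
              #include <unistd.h>

              #define RESERVED (8u << 20)    /* reserve 8 MiB of address space */
              #define INITIAL  (64u << 10)   /* but make only 64 KiB usable up front */

              static char *region;
              static volatile size_t usable = INITIAL;

              static void on_segv(int sig, siginfo_t *si, void *ctx) {
                  (void)sig; (void)ctx;
                  char *addr = si->si_addr;
                  if (addr >= region && addr < region + RESERVED) {
                      /* Fault inside our reservation: double the usable part and retry. */
                      size_t grown = usable * 2;
                      if (grown > RESERVED)
                          grown = RESERVED;
                      mprotect(region, grown, PROT_READ | PROT_WRITE);
                      usable = grown;
                      return;
                  }
                  _exit(1);                  /* a genuine crash elsewhere: give up */
              }

              int main(void) {
                  region = mmap(NULL, RESERVED, PROT_NONE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                  if (region == MAP_FAILED)
                      return 1;
                  mprotect(region, INITIAL, PROT_READ | PROT_WRITE);

                  struct sigaction sa;
                  memset(&sa, 0, sizeof sa);
                  sigemptyset(&sa.sa_mask);
                  sa.sa_sigaction = on_segv;
                  sa.sa_flags = SA_SIGINFO;
                  sigaction(SIGSEGV, &sa, NULL);

                  /* Touch pages well past the initial 64 KiB; each fault grows the region. */
                  for (size_t off = 0; off < RESERVED; off += 4096)
                      region[off] = 1;

                  printf("region grew to %zu bytes\n", (size_t)usable);
                  return 0;
              }
              ```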

              1. 1

                “Not exactly: today’s 8mb stack means that the pages are mapped eagerly,”

                What makes you think that?

                1. 2

                  My theoretical understanding of the virtual memory subsystem, plus looking at process memory maps in practice.

                  I think there might be some terminological confusion here (and I very well might be using the wrong terminology), so let me clarify. When a page of memory is “used” for the stack, this is a two-step process. First, the page is “mapped”, that is, an entry is created in the page tables for this page saying that it is writable and readable but doesn’t yet have a physical page backing it. Second, an actual physical page of memory is allocated and recorded in the page table entry previously created.

                  The first step, creating page table entries, is done eagerly. The second step, actually allocating the backing memory, is done on demand. If you don’t do the first step eagerly, you need to do something else to describe 8mb of memory as being occupied by a stack.
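
                  You can see the two steps separately with something like this sketch (Linux-specific because of mincore(); the anonymous 8mb mapping stands in for a thread stack):

                  ```c
                  #define _DEFAULT_SOURCE
                  #include <stdio.h>
                  #include <sys/mman.h>
                  #include <unistd.h>

                  /* Count how many pages of [addr, addr+len) currently have physical backing. */
                  static size_t resident_pages(void *addr, size_t len, size_t page) {
                      unsigned char vec[2048];           /* enough for 8mb of 4k pages */
                      size_t n = len / page, count = 0;
                      if (mincore(addr, len, vec) != 0)
                          return 0;
                      for (size_t i = 0; i < n; i++)
                          count += vec[i] & 1;
                      return count;
                  }

                  int main(void) {
                      size_t page = (size_t)sysconf(_SC_PAGESIZE);
                      size_t len  = 8u << 20;            /* 8mb, like a default stack */

                      char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                      if (p == MAP_FAILED)
                          return 1;

                      printf("resident after mmap:   %zu pages\n", resident_pages(p, len, page));

                      for (size_t off = 0; off < len / 2; off += page)  /* touch half of it */
                          p[off] = 1;

                      printf("resident after writes: %zu pages\n", resident_pages(p, len, page));
                      return 0;
                  }
                  ```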

                  1. 1

                    Yep, there was a gap in terminology.

                    Still, if you just scatter the stacks far apart without reserving the space, you run the risk of something else mapping the memory, which would prevent you from growing the stack later. And if you reserve a smaller region of virtual memory just for stacks, you’ve invented a scarce resource.

                2. 1

                  “An alternative would be to do this mapping lazily”

                  Go does this. My knowledge may be out of date but as far as I know Go doesn’t use guard pages—the compiler / runtime explicitly manages Goroutine stacks. I think the default stack size is 8kb. Which makes sense when Goroutines are supposed to be small and cheap to spawn.

                  1. 1

                    Not really, go doesn’t rely heavily on virtual memory to manage stacks and just manually memmoves the data between two allocated chunks of memory:

                    https://github.com/golang/go/blob/c95464f0ea3f87232b1f3937d1b37da6f335f336/src/runtime/stack.go#L899

                    This is a different approach from dedicating a big chunk of address space to the stack, and then lazily mapping & allocating the pages. In this latter approach, the stack is never moved and is always contiguous; it requires neither scanning the stack for pointers nor support for segmented stacks.

                    1. 1

                      Go defaults to well below a single page of memory for the thread stack.

                      1. 1

                        “Well below” is 2kb of memory, half a page.

                3. 1

                  Not if the pager treats stacks differently than other allocations. For example, it could reserve a particular amount of swap space for stacks and segfault if it’s exceeded, and it could unmap pages from a stack that wasn’t using them anymore. This would make it more practical to have large numbers of threads with small stacks (as in Go).

                  1. 2

                    The tricky bit is determining when a stack isn’t using them anymore. In theory, when a thread does a syscall, any pages between the bottom of the red zone and the end of the stack could be reclaimed, but in practice there is nothing in the ABI that says the kernel may do this.

                    1. 1

                      Does the kernel necessarily need to be involved here? Can userspace MADV_FREE the unused stack from time to time?
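
                      Something along these lines is what I have in mind (just a sketch; pthread_getattr_np and MADV_FREE are Linux-specific, and the 64k slack below the stack pointer is an arbitrary safety margin for the current frames and red zone):

                      ```c
                      #define _GNU_SOURCE
                      #include <pthread.h>
                      #include <stdint.h>
                      #include <sys/mman.h>
                      #include <unistd.h>

                      /* Advise the kernel that everything below this thread's current stack
                         pointer is unused; if it does reclaim those pages, later touches just
                         see fresh zero pages. */
                      static void madv_free_unused_stack(void) {
                          pthread_attr_t attr;
                          void *base;                /* lowest address of the stack mapping */
                          size_t size;

                          if (pthread_getattr_np(pthread_self(), &attr) != 0)
                              return;
                          pthread_attr_getstack(&attr, &base, &size);
                          pthread_attr_destroy(&attr);

                          /* Approximate the stack pointer with a local's address, leaving
                             generous slack for this frame and the red zone. */
                          uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
                          uintptr_t hi   = ((uintptr_t)&attr - 64 * 1024) & ~(page - 1);
                          uintptr_t lo   = (uintptr_t)base;

                          if (hi > lo)               /* stacks grow down, so [lo, hi) is unused */
                              madvise((void *)lo, hi - lo, MADV_FREE);
                      }

                      static void *worker(void *arg) {
                          (void)arg;
                          /* ... some deep call tree dirties a lot of stack, then returns ... */
                          madv_free_unused_stack();  /* offer the dirtied pages back */
                          return NULL;
                      }

                      int main(void) {
                          pthread_t tid;
                          if (pthread_create(&tid, NULL, worker, NULL) == 0)
                              pthread_join(tid, NULL);
                          return 0;
                      }
                      ```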

                      1. 1

                        There are a few problems with a pure userspace implementation but the biggest one is policy, not mechanism. The kernel has a global view of memory and knows when memory is constrained. When there’s loads of spare memory, this is pure overhead for no gain and so even a fairly infrequent madvise would probably be measurable overhead. When memory is constrained, you want to reclaim more aggressively, without going through the MADV_FREE dance (which is fast, but not free: it still needs to lock some VM data structures and collect the dirty bit from every page being marked). Userspace doesn’t have visibility into the global state of memory pressure in the system and so can’t make this decision very easily.

                        In particular, the best candidates for this are threads that have been parked in the kernel for a long time. Consider a process like devd: it does a read on a device node and typically blocks there for ages. Ideally, the kernel would come along periodically and note that its stack pages are a good candidate for reclaim. You might have a pass that notices that a thread has been blocked on a system call for a while and does the MADV_FREE thing to its stack pages, so that they can be reclaimed immediately if you hit memory pressure, and then a different mechanism for when you actually hit memory pressure.