With per-thread VDSO, errno probably could now be a kernel abstraction but, in general, this is one of the big differences between the NT and UNIX philosophies. UNIX kernels don’t know about what’s in a userspace process. They don’t know about userspace stacks, userspace threads, userspace globals, or anything else that isn’t passed as a system call argument. In contrast, the NT kernel can walk userspace threading structures, access userspace globals, and so on. This means much looser coupling between kernel and userspace in UNIX systems, which makes it much simpler to move to different ABIs and so on, but it means that things like this need some complex wrappers.
In Linux, the convention is that negative return values from syscalls are negated errno values, and non-negative values are success. FreeBSD does it slightly differently and uses the carry bit, so you branch on carry to get to the set-errno path (which, on most systems, is statically predicted not-taken, so the success return doesn’t consume branch predictor state).
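To make the Linux convention concrete, here’s a minimal sketch of the check a libc syscall wrapper performs. The helper name is invented, not glibc’s actual code; the -4095 cutoff is the range Linux reserves for error returns:

#include <errno.h>

/* Sketch only: raw syscall returns in [-4095, -1] are negated errno
 * values; everything else is success. */
static long check_syscall_result(long ret)
{
    if ((unsigned long)ret >= (unsigned long)-4095L) {
        errno = (int)-ret;  /* e.g. a raw -EINTR becomes errno = EINTR */
        return -1;
    }
    return ret;
}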
UNIX kernels don’t know about what’s in a userspace process. They don’t know about userspace stacks, userspace threads
I think this is perhaps somewhat less universally true than you expect. In illumos we have an extremely tightly coupled libc and kernel. Pieces of the signal handling machinery are split across both; there’s a 1:1 relationship between kernel threads and lightweight processes (LWPs, our notion of a user mode thread). The kernel is, I believe, generally aware of the current extent of the stack for an LWP. When processes are initially set up after exec, the kernel reaches down into user mode to map/inject the runtime loader and assemble the initial data (argv, environ, the aux vector, etc.) that libc uses when getting moving. Our AIO implementation is a cooperative mixture of kernel and libc mechanisms, depending on the kind of file you’re operating upon.
Solaris was always a bit of an outlier here, but is still far less tightly coupled than NT. For example, the NT kernel injects DLLs directly into userspace and is responsible for large parts of linking; on Solaris, that’s all ‘just’ a userspace component. Setting up the initial stack is common to most UNIX systems: you start with a mapped binary (which might be the run-time loader if it’s dynamically linked) and a stack, and then userspace does the rest of the setup. In contrast, the NT kernel is responsible for setting up a huge amount of stuff in a default process environment.
I’m not sure to what degree the kernel is aware of the stack extent on illumos. On FreeBSD, stacks are just mmap mappings with some hints that they grow downwards when faulting in pages (for example, if faulting in a page then the pages above it are probably just about to be used). Does the Solaris kernel push data structures into the stacks of running code outside of signal delivery? NT does in a few fun places.
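For example, allocating a thread stack on FreeBSD amounts to little more than this (a sketch; the size is arbitrary, and MAP_STACK is the grow-down hint mentioned above):

#include <stdio.h>
#include <sys/mman.h>

#define STACK_SIZE (1 << 20)  /* 1 MiB, arbitrary for illustration */

int main(void)
{
    /* Apart from the MAP_STACK hint, this is an ordinary anonymous
     * private mapping. */
    void *stk = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON | MAP_STACK, -1, 0);
    if (stk == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("stack mapping at %p\n", stk);
    return 0;
}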
From what I remember, Doors are the place with the closest coupling but the last time I looked (Solaris 8?) the kernel would send an event to userspace to create new threads. In contrast, in the NT equivalents the kernel can inject new threads into userspace, set up all of the context for the userspace bits of the threading implementation to work, and then run code on a new thread.
NT certainly does a lot more of that sort of thing.
One somewhat novel mechanism we have in illumos is the “agent LWP”, whereby the kernel can reach down into a process and co-opt the stack of an existing LWP for use in making library and system calls. The LWP is revectored to do something in the context of the victim process, and returned afterwards; the results are fished out through the kernel. This is generally done at the request of another process, but it’s the kernel making it happen.
The branded zones infrastructure is also a bit more lopsided, in that some emulation code is injected into a non-native binary. Some of the responsibility of mapping those initial ELF bits and pieces I believe falls to the kernel prior to involvement from the user mode portion of the linking machinery.
It’s just a little blurry in general, and I don’t believe there’s any philosophical reason that would drive us not to make the kernel more aware of something in a process if it improved the system in some way.
One somewhat novel mechanism we have in illumos is the “agent LWP”, whereby the kernel can reach down into a process and co-opt the stack of an existing LWP for use in making library and system calls.
That sounds fascinating. What is it used for? What happens if there isn’t enough stack space?
The branded zones infrastructure is also a bit more lopsided, in that some emulation code is injected into a non-native binary. Some of the responsibility of mapping those initial ELF bits and pieces I believe falls to the kernel prior to involvement from the user mode portion of the linking machinery.
I’d assumed that they worked in the same way as FreeBSD’s ABI layer (which can be used with jails for a similar abstraction), where the kernel provides a different system call table, signal frame layout, and optionally some filesystem redirection. What parts are userspace on illumos?
On the topic of errno in general: there’s an errno binary in moreutils (https://joeyh.name/code/moreutils/) that helps with quickly looking up values:
$ errno 26
ETXTBSY 26 Text file busy
$ errno --search bad
EBADF 9 Bad file descriptor
EFAULT 14 Bad address
EBFONT 59 Bad font file format
EBADMSG 74 Bad message
EBADFD 77 File descriptor in bad state
$ errno --list   # prints every error number
In the future, you can find information like this more quickly by grepping through /usr/include and reading the definition of errno, and/or by reading the glibc source.
If you want to understand libc, I’d recommend you steer away from glibc (it’s a torturous mess of indirection, macros, and incomprehensibility) and instead read musl or a *BSD libc which are much easier to grok.
I agree that glibc is really tough to follow… But if you want to know how this behaves for your system, then you have to read glibc, not musl. And it may even tell you interesting things. For errno, for example, even if we restrict to just Linux on x86_64, it works differently in different places. Follow the breadcrumbs and you’ll eventually find the SYSCALL_SET_ERRNO macro. And we see that there’s a different errno in different contexts: the dynamic linker uses its own copy, which does not appear to be thread-local; the C library uses the __libc_errno symbol; and other parts of the distribution (such as libpthread) use errno (though my guess is that these resolve to the same address most of the time), which are at known offsets from the thread-local-storage base register. This suggests that dlopen (which is largely implemented in dynamic linker code) doesn’t set errno if it fails? Now I feel like testing this… I wouldn’t have wondered if I hadn’t actually gone through my own system’s code.
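If you do feel like testing it, a quick check might look something like this (illustrative; link with -ldl on older glibc, and note the behaviour may vary across glibc versions):

#include <dlfcn.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
    errno = 0;
    /* Deliberately fail: the path is nonsense. */
    void *h = dlopen("/no/such/library.so", RTLD_NOW);
    if (h == NULL) {
        /* dlerror() is the documented way to get the reason; whether
         * errno was also set is exactly what we're curious about. */
        printf("dlerror: %s\n", dlerror());
        printf("errno after failed dlopen: %d\n", errno);
    }
    return 0;
}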
It’s not necessarily clear from header files alone. For example, stuff gets weird with the vDSO and address space mapping. Also, the thread-local variable stuff gets confusing if you’re not familiar with the details. But yes, you are right in theory.
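For reference, the header side is short. This is roughly what the definition in glibc’s <errno.h> boils down to (simplified, with attributes and guards omitted):

/* errno is a macro that dereferences a per-thread location returned by
 * a helper function. */
extern int *__errno_location(void);
#define errno (*__errno_location())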
What I don’t understand is why everyone should have to go through this trouble (which isn’t all that complicated in the end, I realise), instead of this being upfront in documentation/man pages?
cppreference.com is your friend here. It’s the best resource for reading stuff from the C and C++ standards. The actual standards documents are a tough slog.
As for the Linux man pages, errno(3) seems to be pretty clear about it (although it documents C99, not C11):
errno is defined by the ISO C standard to be a modifiable lvalue of type int, and must not be explicitly declared; errno may be a macro. errno is thread-local; setting it in one thread does not affect its value in any other thread.
That doesn’t tell you how it’s implemented. There are at least three plausible ways of implementing it given that description:
1. Have the kernel return unambiguous success or errno values and have libc maintain errno.
2. Have the VDSO expose an initial-exec thread-local variable and have the kernel always write the errno value at that offset from the userspace thread pointer (this could also be done in a completely ad-hoc way, without the VDSO).
3. Have a system call that allows a userspace thread to specify its errno location and have the kernel write into that.
It happens that most (all?) *NIX systems, including Linux, pick the first option from this list. If I were designing a POSIX system today, I’d be somewhat tempted by option 2, so that the system calls could implement the POSIX semantics directly even without libc, at the cost of one extra copyout per failed system call. The main downside is that system calls would then have no mechanism for reporting failure as a result of the thread pointer being invalid, but signals can handle that kind of everything-is-broken failure.
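For concreteness, option 3 might look something like this from userspace. This is entirely hypothetical: SYS_set_errno_address does not exist anywhere; the shape is borrowed from Linux’s real set_tid_address(2):

#define _GNU_SOURCE
#include <unistd.h>

static __thread int my_errno;

/* Invented syscall number, for illustration only. */
#define SYS_set_errno_address 5000

void register_errno_location(void)
{
    /* After this call, the kernel would copyout the error code of every
     * failed system call directly into this thread's my_errno. */
    syscall(SYS_set_errno_address, &my_errno);
}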
True, the documentation doesn’t say anything about implementation (thankfully, at least in the case of the C standard), but as I understood the OP the question was about whether errno is kernel-based or libc-based in general. Given the fact that it is documented as part of the C standard, that should be a big clue that it is libc-based. On the systems I support it can only be libc-based because there is no operating system.
If the OP question was really about whether errno is libc or kernel based on Linux, then there is some room for ambiguity. Perhaps the article should have phrased the question better.
but as I understood the OP the question was about whether errno is kernel-based or libc-based in general. Given the fact that it is documented as part of the C standard that should be a big clue that it is libc-based
Why? Signals are part of the C standard, but are implemented in the kernel on most *NIX systems, for example. The POSIX standard doesn’t differentiate between kernel and libc functionality at all: it is defined in terms of C interfaces, but some things are implemented in the kernel and some in libc. It’s entirely reasonable to ask what the division of responsibilities between kernel and libc is for any part of the C or POSIX standard, particularly a part that is set on system call returns.
On the systems I support it can only be libc based because there is no operating system.
That doesn’t mean that file I/O is a purely libc service in a hosted environment, yet it is also specified in the C standard.
When I was working on a toy kernel, my idea was that syscalls would return with the carry bit clear on success, and with the carry bit set and an opaque error handle on failure.
You could interrogate the kernel and vDSO to learn more: finding out whether an error is retryable would be relatively simple and fast, since that information would be stored in the vDSO, and you could also get stack traces over the various nanokernel services that were touched and tell the user what went wrong. In pseudocode:
let file_handle: FileHandle = loop {
    let result: SyscallResult = syscall_open_file("/etc/passwd");
    if !result.carry_bit() {
        break result.cast();  // success: carry clear, result is a handle
    }
    if vdso_err_retryable(result) {
        continue;  // transient error: retry the syscall
    }
    panic!("could not read file: {reason}\n{stacktrace}",
        reason = syscall_err_message(result),
        stacktrace = syscall_err_stacktrace(result));
};
do_stuff(file_handle);
I keep pondering teaching LLVM about the carry-bit-on-failure calling convention. I think it would be a nice way of implementing lightweight exceptions: set carry on exception return and implement exceptions in the caller as branch-on-carry to the unwind handler. You’d get one extra branch per call, but in exchange for that you don’t need an unwind library.
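Absent compiler support, you can model the shape of it in C with a flag standing in for the carry bit (all names here are invented; a real implementation would use the actual CPU flag and a custom calling convention):

#include <stdio.h>

struct result { long value; int carry; };  /* carry models the CPU flag */

static struct result may_fail(int fail)
{
    struct result r = { .value = 42, .carry = fail };
    return r;
}

int main(void)
{
    struct result r = may_fail(0);
    if (r.carry) {  /* the one extra branch per call */
        fprintf(stderr, "unwind\n");
        return 1;
    }
    printf("ok: %ld\n", r.value);
    return 0;
}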
The extra branch per call is virtually free if you branch to the error case and the error is rare (and it should be), both on big out-of-order superscalar and small in-order microarchitectures.
Also you shouldn’t place a subroutine call in your hot loop 😇.
This calling convention for exceptions was proposed for C++ by Herb Sutter.
I don’t think Herb proposed a calling convention in that document (it’s purely C++, which regards the ABI as a separable concern). I did discuss this as a possibility with him around the time that he wrote that though.
See top of page 17.
Some manual pages do in fact talk about it in more detail; e.g., errno(3C).
GCC has -fno-math-errno, because C floating-point functions normally set errno.
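For example, with the default -fmath-errno behaviour, a domain error is required to set errno, which is also part of why calls like sqrt() can’t always be inlined to a bare hardware instruction. Compile with -lm and without -ffast-math to see it:

#include <errno.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    errno = 0;
    double r = sqrt(-1.0);  /* domain error: sets errno to EDOM */
    printf("sqrt(-1) = %f, errno = %d (EDOM = %d)\n", r, errno, EDOM);
    return 0;
}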