I never understood the Linux decision to try to replace sysctl with a filesystem. It means that reading a value becomes three system calls instead of one (open, read, close vs sysctl) and means that you lose the built-in documentation and type info that sysctls provide.
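On Linux that round trip looks something like this (a minimal sketch, assuming a Linux system with procfs mounted at /proc):

```python
# Reading one kernel parameter through /proc/sys costs three system
# calls: open, read, and close (issued when the context manager exits).
def read_proc_sys(name: str) -> str:
    """Read a value from /proc/sys, e.g. name='kernel/ostype'."""
    with open("/proc/sys/" + name) as f:   # syscall 1: open
        return f.read().strip()            # syscall 2: read; syscall 3: close

print(read_proc_sys("kernel/ostype"))      # prints Linux
```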
I don’t think Linux ever removed the sysctl system call, but I’m not sure whether sysfs exposes sysctls in a filesystem namespace or tries to replace them entirely. In contrast, *BSD (including XNU) use sysctl as a unified, typed interface for this kind of thing. I much prefer that interface for a variety of reasons:
It’s a single system call to read or write any sysctl.
It doesn’t require path resolution or directory traversal; sysctl OIDs are trivial to look up.
Both perform atomic reads and writes, but sysfs accessed via the read and write system calls looks as if it shouldn’t be atomic, and so you have to handle the failure cases.
sysctls are self-describing: you can query a text description of each one, along with its type.
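For comparison, a single-call read on the BSD side can be sketched with ctypes. This is a hypothetical helper, not the *BSD libc API itself; it assumes a libc that exports sysctlbyname (FreeBSD, macOS, …), so it is a no-op on Linux:

```python
import ctypes
import ctypes.util
import sys

def sysctl_str(name: str) -> str:
    """Read a string sysctl (e.g. 'kern.ostype') in one libc call.
    Only works where libc provides sysctlbyname (FreeBSD, macOS, ...)."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    buf = ctypes.create_string_buffer(256)
    size = ctypes.c_size_t(ctypes.sizeof(buf))
    rc = libc.sysctlbyname(name.encode(), buf, ctypes.byref(size), None, 0)
    if rc != 0:
        raise OSError(ctypes.get_errno(), "sysctlbyname failed for " + name)
    return buf.value.decode()

if sys.platform != "linux":           # glibc has no sysctlbyname
    print(sysctl_str("kern.ostype"))  # e.g. 'Darwin' or 'FreeBSD'
```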
I like to think that Linux’s sysfs has some advantages but it feels like they tried to copy Plan 9 without understanding it.
This is unrelated to the substance of your comment, but since there appears to be a fair amount of confusion about this in this thread, note that sysfs (what’s mounted at /sys) has nothing to do with sysctls – the “sysctl filesystem” is a subtree of /proc, specifically /proc/sys, which despite the similar name is an entirely different thing.
I don’t think it was a Linux decision to introduce /proc; Linux merely implemented what was available in other systems (and later extended it).
As I recall, /proc was originally from Solaris but on my *NIX systems it contains information about running processes and little or nothing else. At least on BSD-derived systems, generic kernel configuration is exposed via the sysctl interface. Linux implemented this, but also put things in procfs, then later moved a bunch of things to sysfs. It was never clear to me why they did either, since both are more clunky to use than sysctl.
Linux has a pathological need to invent poorly defined meta-filesystems (proc, debugfs, sisyphus, …) to go with its other sidebands (NETLINK), an effect of the whole “we don’t really control or version the userspace we promised not to break, so syscalls are annoying” attitude.
As I recall, Solaris’ procfs grew to be quite a nice ptrace alternative. XNU uses Mach task ports for its debugging interface, which eliminates some of the nasty race conditions with ptrace (if a thread starts and exits quickly, it may exit before the ptrace call to monitor it and so the debugger tries to attach to the wrong thread). Ever since Capsicum introduced process descriptors, I wanted to add thread descriptors and move debugging into an interface over those, but I never had the time to work on it.
Does Linux not provide ABI stability for netlink? I thought that was where it provided stronger guarantees than FreeBSD. FreeBSD guarantees ABI stability within a major release and will provide compatibility interfaces for system calls beyond that, but is willing to break control-plane interfaces such as ioctls to configure network devices between versions. For example, between 4 and 7 the way that wireless interfaces are exposed (and therefore how ifconfig configured them) changed twice, so ifconfig from FreeBSD 4 won’t work on 7.
“Unfortunately the protocol has evolved over the years, in an organic and undocumented fashion, making it hard to coherently explain. To make the most practical sense this document starts by describing netlink as it is used today and dives into more “historical” uses in later sections.”
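A taste of that organic protocol: even the minimal “list all interfaces” dump means hand-packing an nlmsghdr plus ifinfomsg and hand-walking the aligned reply. A Linux-only sketch, with the constants copied from the uapi headers:

```python
import os
import socket
import struct

# Constants from <linux/netlink.h> and <linux/rtnetlink.h>
NLM_F_REQUEST = 0x1
NLM_F_DUMP = 0x300          # NLM_F_ROOT | NLM_F_MATCH
NLMSG_DONE = 0x3
RTM_NEWLINK = 16
RTM_GETLINK = 18

def list_link_indices():
    """Dump all network interfaces via rtnetlink, returning their indices."""
    s = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    s.bind((0, 0))  # let the kernel assign our port id
    # nlmsghdr (len, type, flags, seq, pid) followed by a zeroed ifinfomsg
    req = struct.pack("=LHHLL", 16 + 16, RTM_GETLINK,
                      NLM_F_REQUEST | NLM_F_DUMP, 1, os.getpid()) + bytes(16)
    s.send(req)
    indices = []
    while True:
        data = s.recv(65535)
        off = 0
        while off < len(data):
            msg_len, msg_type = struct.unpack_from("=LH", data, off)
            if msg_type == NLMSG_DONE:
                s.close()
                return indices
            if msg_type == RTM_NEWLINK:
                # ifinfomsg: family, pad, type, index, flags, change
                indices.append(struct.unpack_from("=BBHiII", data, off + 16)[3])
            off += (msg_len + 3) & ~3   # NLMSG_ALIGN
```

Every byte of that layout is ABI, which is why it tends to stay stable even while remaining awkward to document.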
Following the path through USB HID device hotplug notification over netlink, to udev, to pairing with evdev, to reclassifying / grouping device nodes based on hwdb hardware databases is a fun exercise.
System V /proc as seen in Solaris and AIX is quite different from the Plan 9-ish /proc Linux copied. In System V, the entities in /proc are often binary structures, defined in a header, so you can just mmap/read them in and use them without having to parse.
There’s a lot hidden in that read/mmap: many of these files don’t support mmap at all (they’re not page-based structures); of those that do, some don’t support MAP_SHARED and will eagerly copy. You can’t write to most of them via mmap; you must use write, and that write must supply a complete record of the relevant structure. This is one of the things that I dislike about the Linux approach here: it looks like a filesystem but doesn’t quite behave like one. It’s a structured, typed data store, but not exposed as one.
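The contrast is easy to demonstrate on Linux, where even the per-process records are text you must tokenize (a sketch; note the comm field can itself contain spaces and parentheses, which is why naive whitespace splitting is wrong):

```python
def proc_stat_fields(pid="self"):
    """Parse /proc/<pid>/stat. Unlike System V's fixed binary records,
    this is text, and the comm field may contain spaces and parentheses,
    so you must split around the LAST ')'. Linux-only sketch."""
    with open(f"/proc/{pid}/stat") as f:
        raw = f.read()
    lpar, rpar = raw.index("("), raw.rindex(")")
    pid_num = int(raw[:lpar].strip())
    comm = raw[lpar + 1:rpar]
    rest = raw[rpar + 2:].split()   # state, ppid, pgrp, session, ...
    return pid_num, comm, rest

pid_num, comm, rest = proc_stat_fields()
print(pid_num, comm, rest[0])       # rest[0] is the process state letter
```

With a System V-style binary record you would mmap or read a struct and be done; here every consumer reimplements this parsing.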
Chris writes super prolifically, but every now and then, I read a post like this one where I can just feel that I’ll definitely need this knowledge later. Some unresponsive microservice causing a prod outage might be fixed significantly faster if I spend less time staring dumbfounded at an IO-heavy strace.
There was another of these (about bash) on the frontpage the other day. I’m glad they’re useful to some people and I don’t want to put a stop to them or anything, but they strike me as very straightforward observations that didn’t take any time or false starts to arrive at. No analysis or insight is offered and, if you already know that doing a lot of syscalls can sometimes use a lot of system CPU time, not much to learn except that the admin of this server made some terrible mistakes. I don’t want to seem grumpy, there’s nothing wrong with what’s there… but a bit more depth would be nice.
He writes down the things that he wants to remember so that he can grep his knowledge base (blog) instead of rediscovering them from documentation. It also helps him catalogue decisions made in his organization, ostensibly to help onboard new people and to have firm decisions that can be challenged properly.
Each post is easy. Sometimes they represent the later analysis of earlier posts. They never start at the beginning of the rabbit hole and find the bottom.
His posts make me better at building systems that should survive me, both as an inspiration and as a model for what your average colleague can be expected to read and digest. And of course the content is byte-sized enough to use as a source for decisions I’m trying to make.
I still have the sysctl binary on my system from https://gitlab.com/procps-ng/procps. I guess that uses the filesystems /sys/ and /proc/.
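Indeed: procps sysctl(8) is essentially a walk over /proc/sys, translating dotted names to paths. A rough sketch of both directions (assumes Linux; the real tool also copes with write-only and restricted entries):

```python
import os

def sysctl_read(key: str) -> str:
    """sysctl-style read: 'kernel.ostype' -> /proc/sys/kernel/ostype."""
    path = os.path.join("/proc/sys", *key.split("."))
    with open(path) as f:
        return f.read().strip()

def sysctl_keys(root="/proc/sys"):
    """Enumerate keys the way `sysctl -a` does, by walking the tree."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            yield rel.replace(os.sep, ".")

print(sysctl_read("kernel.ostype"))   # prints Linux
```

(The dotted-name mapping is ambiguous for interfaces whose names contain dots, e.g. VLAN devices under net.ipv4.conf; the real tool has the same problem.)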
Original proc paper is a fun read: https://www.usenix.org/sites/default/files/usenix_winter91_faulkner.pdf It was intended as a debugfs to complement ptrace.
https://kernel.org/doc/html/next/userspace-api/netlink/intro.html
This seems to be an earlier paper describing /proc from Unix 8th Edition: http://lucasvr.gobolinux.org/etc/Killian84-Procfs-USENIX.pdf
Although I’m not sure if it’s the first one or not.