1. 47

Today, I realized that I don’t understand how the write syscall works. While I have a good grasp of the OS-Application interface, I have a very fuzzy understanding of the OS-hardware interface.

What would be the best way to learn that?

My first idea is to open the Linux kernel and read that, but Linux does so many things that it might be hard to identify the fraction of the code I am really interested in. Are there better kernels for learning purposes?

    1. 65

      Minix is designed to be readable, but it probably isn’t a good reflection of modern hardware.

      For simple devices, I’d recommend looking at an embedded OS. Embedded systems have many of the same concepts but the hardware that they support is much simpler.

      At the core, there are really only two ways that the OS-hardware interface works. The first is memory-mapped I/O (MMIO)[1]. The device has a bunch of registers and they are exposed into the CPU’s physical address space. You interact with them as if they were memory locations. If you look at an RTOS, you’ll often see these accessed directly. On a big kernel, there’s often an abstraction layer where you say ‘write a 32-bit value at offset X in device Y’s mapping’. These can often be exposed directly to userspace by just mapping the physical page containing the device registers into userspace.
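
      As a rough sketch of what that looks like in C (the device, base address, and register offsets here are invented for illustration; a real driver would get them from the device tree, ACPI, or a PCI BAR):

      ```c
      #include <stdint.h>

      /* Hypothetical UART: base address and register offsets are made up. */
      #define UART_BASE    0x10000000UL
      #define UART_TX      0x00u            /* transmit data register */
      #define UART_STATUS  0x04u            /* status register        */
      #define UART_TX_FULL (1u << 0)        /* TX FIFO full bit       */

      /* volatile so every access really reaches the device, not a cached value */
      static inline void mmio_write32(uintptr_t addr, uint32_t val)
      {
          *(volatile uint32_t *)addr = val;
      }

      static inline uint32_t mmio_read32(uintptr_t addr)
      {
          return *(volatile uint32_t *)addr;
      }

      void uart_putc(char c)
      {
          /* spin until the device reports space in its TX FIFO */
          while (mmio_read32(UART_BASE + UART_STATUS) & UART_TX_FULL)
              ;
          mmio_write32(UART_BASE + UART_TX, (uint8_t)c);
      }
      ```

      On a big kernel the raw pointer arithmetic is usually hidden behind bus_space- or ioremap-style helpers rather than hard-coded addresses.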

      The next step up from that is direct memory access (DMA). This is one of the places you see a lot of variation across different kinds of systems. The big split in philosophies is whether DMA is a thing devices do or a thing that some dedicated hardware does. Most modern busses (including things like AXI in small SoCs) support multiple masters. Rather than just passively exporting MMIO registers, a device can initiate transactions to read or write memory. This is usually driven by commands from the CPU: you write a command to an MMIO register with an address and the device will do some processing there. The best place to look for this is probably VirtIO, which is a simplified model of how real devices work. It has a table of descriptors and a ring buffer, which is how a lot of higher-performance devices work. You write commands into the ring and then do an MMIO write to tell the device that more commands are ready. It will then read data from, and write data to, the buffers described by those descriptors. I think the NetBSD VirtIO drivers are probably the easiest to read. You might also look at DPDK, which has userspace drivers for a bunch of network interfaces, which remove a lot of the kernel-specific complexity.
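
      A minimal sketch of that descriptor-ring idea (the structure layout and doorbell offset are simplified and invented, not the real VirtIO queue format):

      ```c
      #include <stdint.h>

      #define RING_SIZE 256

      /* Simplified descriptor: where a buffer is and how big it is. */
      struct desc {
          uint64_t addr;    /* physical address of the buffer          */
          uint32_t len;     /* length of the buffer in bytes           */
          uint16_t flags;   /* e.g. device-readable vs device-writable */
          uint16_t next;    /* chaining; unused in this sketch         */
      };

      struct ring {
          struct desc desc[RING_SIZE];   /* descriptor table            */
          uint16_t    avail[RING_SIZE];  /* slots the driver has filled */
          uint16_t    avail_idx;         /* producer index              */
      };

      #define DEV_NOTIFY 0x50            /* hypothetical doorbell register */

      static inline void mmio_write16(uintptr_t addr, uint16_t val)
      {
          *(volatile uint16_t *)addr = val;
      }

      /* Post one buffer to the device and ring the doorbell. */
      void submit(uintptr_t dev_base, struct ring *r, uint64_t buf_pa, uint32_t len)
      {
          uint16_t slot = r->avail_idx % RING_SIZE;

          r->desc[slot].addr  = buf_pa;
          r->desc[slot].len   = len;
          r->desc[slot].flags = 0;
          r->avail[slot]      = slot;

          /* A real driver needs a memory barrier here so the descriptor is
           * visible in memory before the device is told about it. */
          r->avail_idx++;

          /* MMIO write meaning "there is new work in the ring"; the device
           * then DMAs the descriptors and the buffers they point to. */
          mmio_write16(dev_base + DEV_NOTIFY, 0);
      }
      ```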

      The other model for DMA is similar but the DMA controller is a separate device. This is quite common on embedded systems and mainframes, less common on things in between. The DMA controller is basically a simplified processor with a small instruction set that’s optimised for loads and stores. If a device exposes an MMIO FIFO, you can use a programmable DMA engine to read from that FIFO and write into memory, without the device needing to do DMA itself. More complex ones let you set up pipelines between devices. More recent Intel chips have something like this now.
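
      Driving a standalone DMA engine tends to look something like this sketch (the register layout is entirely made up, just to show the shape: source, destination, length, go):

      ```c
      #include <stdint.h>

      /* Invented register offsets for one channel of a hypothetical DMA engine. */
      #define DMA_SRC       0x00   /* source address (e.g. a device's MMIO FIFO) */
      #define DMA_DST       0x08   /* destination address in RAM                 */
      #define DMA_LEN       0x10   /* number of bytes to move                    */
      #define DMA_CTRL      0x18   /* control/start register                     */
      #define DMA_START     (1u << 0)
      #define DMA_SRC_FIXED (1u << 1)  /* don't increment the source: it's a FIFO */

      static inline void reg_write64(uintptr_t addr, uint64_t val)
      {
          *(volatile uint64_t *)addr = val;
      }

      static inline void reg_write32(uintptr_t addr, uint32_t val)
      {
          *(volatile uint32_t *)addr = val;
      }

      /* Copy len bytes from a device FIFO into a RAM buffer, without the
       * device itself ever acting as a bus master. */
      void dma_fifo_to_mem(uintptr_t dma_base, uint64_t fifo_pa,
                           uint64_t buf_pa, uint32_t len)
      {
          reg_write64(dma_base + DMA_SRC, fifo_pa);
          reg_write64(dma_base + DMA_DST, buf_pa);
          reg_write32(dma_base + DMA_LEN, len);
          reg_write32(dma_base + DMA_CTRL, DMA_START | DMA_SRC_FIXED);
          /* completion is signalled by an interrupt or a status bit */
      }
      ```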

      Once you get DMA working, you realise that you can’t just expose devices to userspace (or guest VMs) because they can write anywhere in memory. This is when you start to need a memory management unit for IO (IOMMU)[2], which Arm calls a System MMU (SMMU). These do for devices what the MMU does for userspace: let you set up mappings that expose physical pages into the device’s address space for DMA. If you have one of these, you can map some pages into both userspace and a device’s address space and then you can do userspace command submission. Modern GPUs and NICs support this, so the kernel is completely off the fast path. The busdma framework in NetBSD and FreeBSD is a good place to look for this. It’s designed to support both DMA via the direct map and DMA via IOMMU regions.
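
      Conceptually, the kernel’s side of this looks something like the following sketch (a hypothetical IOMMU API, not any real kernel’s; the point is that the device only gets the mappings you explicitly create):

      ```c
      #include <stddef.h>
      #include <stdint.h>

      /* Hypothetical IOMMU API: none of these names come from a real kernel. */
      struct iommu_domain;                       /* one I/O address space */
      struct iommu_domain *iommu_domain_alloc(void);
      int iommu_attach_device(struct iommu_domain *d, int device_id);
      int iommu_map(struct iommu_domain *d, uint64_t iova, uint64_t pa,
                    size_t len, unsigned prot);

      #define IOMMU_READ  1u
      #define IOMMU_WRITE 2u

      /* Give a device DMA access to exactly one buffer and nothing else. */
      uint64_t expose_buffer_to_device(int device_id, uint64_t buf_pa, size_t len)
      {
          struct iommu_domain *dom = iommu_domain_alloc();
          uint64_t iova = 0x100000;   /* the address the device will use */

          iommu_attach_device(dom, device_id);
          iommu_map(dom, iova, buf_pa, len, IOMMU_READ | IOMMU_WRITE);

          /* Mapping the same physical pages into a process as well is what
           * makes userspace command submission safe: the device can only
           * reach memory that was explicitly put into its domain. */
          return iova;
      }
      ```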

      For this kind of kernel-bypass abstraction to be useful, you need the device to pretend to be multiple devices. Today, that’s typically done with Single Root I/O Virtualisation (SR-IOV). There’s a lot here that you don’t need to care about unless you’re building a PCIe device. From a software perspective you basically have a control-plane interface to a device that lets you manage a set of virtual contexts. You can expose one to userspace or to a guest VM and treat it as a separate device with independent IOMMU mappings.
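
      As a concrete aside, on a Linux host (assuming the physical function’s driver supports SR-IOV) the control-plane part can be as mundane as writing a VF count into sysfs; the PCI address below is just an example:

      ```c
      #include <stdio.h>

      /* Ask the PF at this example PCI address to spawn 4 virtual functions.
       * Each VF then shows up as its own PCI device that can be handed to a
       * VM or a userspace driver, with its own IOMMU mappings. */
      int enable_vfs(void)
      {
          FILE *f = fopen("/sys/bus/pci/devices/0000:01:00.0/sriov_numvfs", "w");
          if (!f)
              return -1;
          fprintf(f, "4\n");
          return fclose(f);
      }
      ```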

      To do any of this, you need to have some kind of device enumeration. On simple SoCs, this can be entirely static. You get something like a flattened device tree and it tells you where all of the devices are, which you either compile into the kernel or get from a bootloader. Systems with expansion need to do this via ACPI, PCIe device enumeration, and things like USB. This is universally horrible and I’d recommend that you never look at how any of it works unless you really need to because it will cause lasting trauma. OpenFirmware was the least bad way of doing this, and so naturally was replaced by something much worse.
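
      For the static case, the flattened-device-tree path is the least painful to look at; with libfdt it is roughly this (the compatible string and the single-cell reg layout are assumptions for the sketch):

      ```c
      #include <libfdt.h>
      #include <stdint.h>

      /* Find a hypothetical UART node by its compatible string and pull the
       * MMIO base out of its "reg" property.  Assumes one address cell and
       * one size cell; a real driver would check #address-cells/#size-cells. */
      uint64_t find_uart_base(const void *fdt)
      {
          int len;
          int node = fdt_node_offset_by_compatible(fdt, -1, "example,uart");
          if (node < 0)
              return 0;

          const fdt32_t *reg = fdt_getprop(fdt, node, "reg", &len);
          if (reg == NULL || len < 8)
              return 0;

          /* reg = <base size>; device tree cells are big-endian */
          return fdt32_to_cpu(reg[0]);
      }
      ```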

      Beyond that, the way devices work is in the process of changing with a few related technologies. In PCIe, IDE and TDISP let you establish an end-to-end encrypted and authenticated connection between some software abstraction (for example a confidential VM, or Realm in Arm-speak) and a device context. This lets you communicate with a device and know that the hypervisor and physical attackers on the bus can’t tamper with or intercept your communication. This is probably going to cause a bunch of things to move onto the SoC (there’s no point doing TLS offload on a NIC if you need to talk AES to the NIC).

      The much more interesting thing is what Intel calls Scalable I/O Virtualisation (S-IOV, not to be confused with SR-IOV), and Arm calls Revere. This tags each PCIe message generated from an MMIO read or write with an address-space identifier. This makes it possible to create devices with very large numbers of virtual functions because the amount of on-device state is small (and can be DMA’d out to host memory when a context is not actively being used). This is the thing that will make it possible for every process in every VM to have its own context on a NIC and talk to the network without the kernel doing anything.

      The end of this trend is that the kernel-hardware interface becomes entirely control plane (much like mainframes and supercomputers 30 years ago) and anything that involves actually using the device moves entirely into userspace. The Nouveau drivers are a good place to look for an example of how this can work, though they’re not very readable. They have some documentation, which is a bit better.

      [1] x86 also has IO ports but they’re really just MMIO in a different namespace done using extra stupidity and it’s safe to pretend it isn’t real.

      [2] These weren’t originally created for security. I believe the first ones were in Sun workstations, where Sun wanted to ship cheap 32-bit NICs in a machine with 8 GiB of RAM and wanted the device to be able to DMA anywhere into physical memory.

      1. 3

        x86 also has IO ports but they’re really just MMIO in a different namespace done using extra stupidity and it’s safe to pretend it isn’t real.

        To give some background on this, I/O mapped I/O (which is what this method is called) was used to make computers cheaper by keeping I/O decoding to a minimum (using fewer logic gates) while at the same time allowing as much memory as possible. CPUs with I/O mapped I/O have dedicated instructions (on the x86, these are IN and OUT) with restrictions that other movement instructions don’t have.
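
        On x86 this ends up as the classic inb/outb pairing; a minimal sketch using GCC inline assembly (0x3F8 is the traditional base port of the first PC serial port):

        ```c
        #include <stdint.h>

        /* Port-mapped I/O uses dedicated instructions (IN/OUT) and a 16-bit
         * port number in an address space separate from memory. */
        static inline void outb(uint16_t port, uint8_t val)
        {
            __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
        }

        static inline uint8_t inb(uint16_t port)
        {
            uint8_t val;
            __asm__ volatile ("inb %1, %0" : "=a"(val) : "Nd"(port));
            return val;
        }

        #define COM1 0x3F8   /* legacy PC serial port base */

        void serial_putc(char c)
        {
            /* bit 5 of the line status register: transmit holding register empty */
            while ((inb(COM1 + 5) & 0x20) == 0)
                ;
            outb(COM1, (uint8_t)c);
        }
        ```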

        1. 4

          It really makes sense only when memory is tightly coupled. As soon as you start having something that looks like a memory bus, it’s much easier for the CPU to just send a bus message and let a bridge chip (or block on an SoC) deal with it for regions that are mapped to non-memory devices. This gives a cleaner separation of concerns.

          The other benefit is that, on systems where you control the memory, you know loads respond in a bounded number of cycles, whereas I/O might block and you can handle interrupts in the middle differently. If you have external DRAM and moderately high clocks, this is much less of a benefit but it can be useful on small microcontrollers with tightly-coupled SRAM.

          1. 1

            For another esoteric benefit, I think I’ve seen some QEMU docs that say that PMIO is more performant to dispatch. I think that’s because you end up doing less software address decoding in the virtual machine monitor. I don’t know how relevant these concerns still are, but I thought that was worth mentioning.

            So maybe a para-virtualized device might want its doorbell to the host to use PMIO instead of MMIO when possible.

            1. 5

              MMIO is fairly slow on QEMU because it has an emulated memory map and needs to make normal memory accesses as fast as possible, so anything that leaves that path ends up being very slow. Some other emulators just map I/O pages as no-access and catch the fault. This is fairly fast with most hypervisors, but most PV devices tend to favour hypercalls for prodding their doorbells because hypercalls can avoid saving and restoring most registers (just zero them on return) and so can be faster.

          2. 1

            I’ve seen it called PMIO (Port-Mapped I/O), don’t know if that’s an Intel-ism or not.

          3. 3

            At the core, there are really only two ways that the OS-hardware interface works.

            A third one: some old CPUs — I only know about the 8080 & Z80 — had dedicated instructions to read and write a numbered I/O port. I think they were called IN and OUT.

            1. 4

              I mentioned I/O ports, but you’re much better off pretending that they never existed.

            2. 2

              To do any of this, you need to have some kind of device enumeration. On simple SoCs, this can be entirely static. You get something like a flattened device tree and it tells you where all of the devices are, which you either compile into the kernel or get from a bootloader. Systems with expansion need to do this via ACPI, PCIe device enumeration, and things like USB. This is universally horrible and I’d recommend that you never look at how any of it works unless you really need to because it will cause lasting trauma. OpenFirmware was the least bad way of doing this, and so naturally was replaced by something much worse.

              This is the truth. One thing to note for others (since I’m sure David already knows)… all of the current ‘device tree’ stuff we have today originally comes from OpenFirmware. OpenFirmware was just a standardized Forth environment (a system monitor) used for booting systems.

              1. 4

                There were some other nice things in OpenFirmware. My favourite was that it provided simple device drivers in Forth that were portable across operating systems, so the OS just needed to provide a generic driver to get basic functionality. These were not as fast as a native driver (though, for something like a UART, this might not matter) but they were enough to get the device basically working.

                The downside of this was that the Forth drivers were larger than the tiny amount of BIOS firmware that most devices needed on PCs, and so needed larger ROM or flash chips to hold them, which pushed up the price a lot for simpler devices and required a different board run (with much lower volumes) for other things.

                Vendors then stuck a bigger markup on the OpenFirmware version knowing that you couldn’t just use the PC version. It was quite entertaining that, while PC users complained about Apple’s markup on ATi video cards, Sun users were buying the Apple versions because they were a third the price of the Sun packaging of the identical hardware.

                1. 3

                  The only firmware standard in existence with its own song! (Continuing the theme of OpenFirmware being replaced with worse tech, I can no longer find the .au file download, but https://www.youtube.com/watch?v=b8Wyvb9GotM exists.)

                  1. 3

                    IA to the rescue, in all its “8-bit ISDN mu-law, mono, 8000 Hz” glory.

                    1. 1

                      Lol - I never knew that. This elevates my respect for OpenFirmware even further.

                  2. 1

                    This seems an excellent guide to the history and near future of HW/SW interfaces. Going forward, I’m personally interested in how computation is being distributed through the system. SmartNICs, as I understand the term, include general-purpose CPUs to virtually implement PCI interfaces in physical hardware. GPUs also have heavy compute (obviously) and are becoming more tightly integrated with CPUs through things like Linux’s Heterogeneous Memory Management (I was motivated to post this largely because I’m not aware of other HMM implementations besides Linux’s, though the idea feels generally important – are there other implementations of the idea?), and through CXL. Compute-In-Memory may be interesting, and I posted this here a couple of years ago (!).

                    1. 2

                      There are a bunch of interesting research projects looking at how you properly program heterogeneous distributed systems. The first one I knew about (almost certainly not the first) was Barrelfish, though it didn’t really do much hardware-software co-design.

                    1. 1

                      +1 for plan9

                    2. 8

                      Linux has many abstraction layers and x86 legacy naming schemes, but if you have the time to get past that, it is the only relevant implementation nowadays.

                      OpenBSD is very well structured and easy to get into due to its minimal abstraction layers and clearly written code. If you don’t have enough time to invest in this, I would definitely go with OpenBSD.

                      For example, I teach my students how to write a syscall on OpenBSD, not on Linux, because they only have 2 hours of lab time to do it.
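
                      For reference, the body of a trivial syscall in that style is only a few lines; a rough from-memory sketch (leaving out the syscalls.master entry and regenerated tables that actually hook it up):

                      ```c
                      #include <sys/param.h>
                      #include <sys/systm.h>
                      #include <sys/proc.h>
                      #include <sys/syscallargs.h>

                      /*
                       * Hypothetical "hello" syscall: copy a fixed string out to a
                       * user-supplied buffer.  The args struct is normally generated
                       * from syscalls.master; SCARG() pulls individual fields out.
                       */
                      int
                      sys_hello(struct proc *p, void *v, register_t *retval)
                      {
                          struct sys_hello_args /* {
                              syscallarg(char *) buf;
                              syscallarg(size_t) len;
                          } */ *uap = v;

                          static const char msg[] = "hello from the kernel\n";
                          size_t n = SCARG(uap, len);

                          if (n > sizeof(msg))
                              n = sizeof(msg);

                          *retval = n;
                          return copyout(msg, SCARG(uap, buf), n);
                      }
                      ```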

                      1. 15

                        Linux … is the only relevant implementation nowadays.

                        This would be grim if it were true!

                        1. 15

                          There’s an up and coming one called Darwin that’s being used in a few computers.

                        2. 8

                          If you are reasonably familiar with Unix, I would recommend one of the BSDs. NetBSD is what I learned on, so when I had to read OpenBSD code, the latter was immediately approachable to me. I’ve heard good things about FreeBSD (and it helps that there’s an excellent companion book, which is old enough by now to be quite dated but is still a good enough aid in navigating the source tree, at least).

                          NetBSD is what I learned basic concepts on. It’s a good choice – the source code is small and readable, it doesn’t change a lot, and it’s not particularly esoteric C. There’s a lot of C magic in it (it is portable code dating from the nineties) but most of it is tasteful C magic. OpenBSD is clean and readable, too. It may be a better choice owing to its wider community and install base. I learned a bit about aarch64 from their port years ago and I thought it was just as easy to navigate.

                          Depending on what you’re interested in learning about, it might also make sense to stray into architectures other than x86. There’s a lot of historical baggage in how modern x86_64 CPUs boot, for example. If you want to know more about what happens between when you hit the power switch and when the cool stuff really starts scrolling on the console, looking into a less historically-encumbered architecture (even one with a lot of modern complications, like multiprocessor support on aarch64) might help.

                          Linux isn’t necessarily less readable, or worse, but it’s a much larger kernel that a lot more people are working on, to the point where making it easily approachable for individual developers just isn’t as big a priority anymore. It’s simply a larger tree that’s harder to navigate, as you’ve found.

                          But it does have excellent support for modern hardware, and a lot of “special” subsystems (e.g. iio) that aren’t as well represented elsewhere. For better or for worse, it’s the most popular OS with a (traditional) Unix-inspired design, so knowing your way around it may be worth it in and of itself. It also has a lot of documentation – unofficial, but useful, in the form of blogs and write-ups and whatnot – and good tracing tools that may be useful (e.g. BPF).

                          Plan9 is definitely simple and readable, but it’s in that weird place where it’s Unix but also it’s not. If you know Unix well, navigating it is surprisingly hard because your mental map of a Unix system doesn’t fully apply, even though the source is very Unixy.

                          At the non-Unix end of the spectrum, if you care about that in any way or it’s relevant, I have one good recommendation and two codebases I’ve heard good things about. Maybe other lobsters can chime in on the latter two.

                          A codebase I can recommend is Haiku. It’s not Unix but it is an approachable example of a userland POSIX implementation, it’s clean C++ code, and the community is very friendly. Last time I read it was about two years ago or so – I wanted to do a basic FUSE-based BeFS implementation as a Rust learning project, and I liked it.

                          If you’re more familiar with Windows, I’ve heard that the ReactOS kernel is pretty readable and navigable. I don’t know if that’s the case, I simply don’t know enough Windows to have an opinion on it (and, looking back on it, this Windows blind spot is one of the things I regret the most in how I’ve approached learning things in the last twenty years or so).

                          Genode is a little weird but individual components may be fairly approachable as well. It’s deliberately written as a framework of reusable components, which helps a bit. Years ago it was definitely something I could’ve recommended but I haven’t read or used any Genode code in a very long time, so I don’t know if this is still the case.

                          1. 5

                            Not being a kernel programmer at all, a few years ago I played with xv6 … I recompiled it, and then modified some user space utilities, and ran it in QEMU

                            It was pretty easy as far as I remember

                            They now have a RISC-V version too

                            https://github.com/mit-pdos/xv6-public?tab=readme-ov-file

                            https://github.com/mit-pdos/xv6-riscv

                            https://pdos.csail.mit.edu/6.828/2012/xv6.html

                            write() is mentioned in the first few pages:

                            https://pdos.csail.mit.edu/6.828/2012/xv6/book-rev7.pdf

                            I am pretty sure it runs on real hardware too, though I haven’t done that … maybe there are instructions somewhere

                            1. 5

                              However there’s a big caveat wrt ~matklad’s specific question about write(): xv6 lacks sockets and lacks a VFS layer, so there’s a lot of machinery in post-1980 unix that xv6 does not illustrate.

                              The caveat to the caveat is that this machinery is vtbl-in-C so some of the design is obscured by the limited programming language. And the BSD kernels have some metaprogramming helpers that obscure it further.
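
                              Stripped of those helpers, the vtbl-in-C pattern is just a struct of function pointers; a generic sketch (not the actual BSD or Linux definitions):

                              ```c
                              #include <stddef.h>
                              #include <sys/types.h>

                              /* Each kind of file supplies its own operations table. */
                              struct file;
                              struct fileops {
                                  ssize_t (*read)(struct file *, void *, size_t);
                                  ssize_t (*write)(struct file *, const void *, size_t);
                                  int     (*close)(struct file *);
                              };

                              /* An open file carries a pointer to its ops plus private state. */
                              struct file {
                                  const struct fileops *ops;
                                  void *data;
                              };

                              /* The generic write path doesn't care whether this is a pipe, a
                               * socket, or an on-disk file: it dispatches through the table. */
                              ssize_t file_write(struct file *fp, const void *buf, size_t n)
                              {
                                  return fp->ops->write(fp, buf, n);
                              }
                              ```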

                              Even so, a modern BSD is much easier to navigate – better directory layout, more systematic naming – than Linux. Older BSDs are significantly simpler but share a lot of structure and code with modern BSDs, so it might be worth looking at the 4.3BSD or 4.4BSD source alongside the corresponding “Design and Implementation” book by McKusick et al.

                            2. 4

                              Not joking: I suggest reading the kernel from the v2.4.?? era or the beginning of 2.6.?. It will be simpler than now, and it was already SMP with fine-grained locking.

                              The O(1) scheduler from Ingo Molnar was super readable at that time. The linked-list implementation in the Linux scheduler back then was eye-opening for me.

                              But nowadays I would also consider reading the OpenBSD kernel; their codebase is kept in a more pedantic way regarding safety.

                              That being said, I never read the source of any other kernel.

                              1. 1

                                just for context: 2.4 was already multiple millions of lines of code https://raw.githubusercontent.com/udoprog/kernelstats/master/gfx/kernelstats-millions.png

                              2. 4

                                A bit depends on how modern you want it, how specific to your particular hardware, etc. But for learning how a Unix(-like) kernel is constructed, I usually point at one or more of:

                                NetBSD: runs on everything, so very modular and portable, and by extension, pretty readable and easy to focus on only the “core” parts.

                                Minix: literally the OS you get if you follow the textbook

                                Lions’ commentary on Unix 6E: the classic classic. Not sure how approachable it is these days, but a lot of old Unix was “the simplest thing that might work” and it’s really interesting to see where things came from.

                                1. 2

                                  As ~andyc says, xv6 is a better choice than 6th Edition if you just want a simple kernel. The Lions book is good if you are also interested in it as a historical artefact, but it can be distracting — 6th Edition C is a weird pre-K&R, very loosely-typed, very PDP-11 C.

                                  1. 2

                                    Also, about NetBSD: it’s not perfect and it’s not finished, but it has an internal developer guide.

                                  2. 4

                                    Might be interested in taking a look at Plan 9!

                                    1. 2

                                      The core of Linux and the important arch-specific bits aren’t too tricky to find your way around. You could start by reading the code for the syscall you’re interested in and following it through, or maybe stepping through it in a debugger if you can get it running in qemu. Some of the online syscall tables can link you to the implementations of each syscall, e.g. for write: https://elixir.bootlin.com/linux/v6.11/source/fs/read_write.c#L652

                                      xv6 might be more approachable though. It’s tiny but still a “real” UNIX-like kernel with processes, syscalls, etc.

                                      1. 2

                                        A while ago I printed and read the OpenBSD implementation of pledge and unveil. It was fairly easy to follow. Note that that’s not about the hardware interface, but the kernel source is well structured and it’s easy to find your way around.

                                        1. 2

                                          I am not into kernel development myself, but I’ve watched a video by LiveOverflow where he digs into the source code of SerenityOS. He finds it easier to understand than Linux, but also thinks that the knowledge gained is applicable to Linux, because the two work in broadly similar ways, or at least need to achieve similar things. I didn’t rewatch the video for this comment, so I might misremember.

                                          1. 2

                                            A good first step would be to chase the call stack of the system call you’re interested in, and take note of functions that seem important to you. Sometimes it helps to work backwards from, e.g., a device driver, if you know your system call will eventually poke it.

                                            If you have that, I would highly recommend running the OS of your choice (it doesn’t have to be Linux) in QEMU with gdb, setting breakpoints in the places you care about, and stepping through the kernel functions (this might require some special compilation flags to get a debuggable kernel image).

                                            Alternatively, if you care about networking or NVMe storage, you might instead use a kernel-bypass framework (DPDK/SPDK) and simply debug/instrument your userspace application. If you care about drivers and poking hardware registers, that’d be the easiest way. If you care about the IOMMU and virtual memory management, you’ll have to look at an actual kernel.

                                            1. 2

                                              I’ve only read it a little, while investigating a PS/2 issue, but the Haiku source code seems fairly readable.

                                              1. 2

                                                I don’t know if this helps you, but seL4 is formally verified (with its proofs written in Isabelle). It can be used as a hypervisor as well, so maybe you can experiment with different hardware requirements. https://sel4.systems/Learn/

                                                1. 1

                                                  I highly recommend Tilck, a small educational Linux-compatible kernel. It’s easy to build. It has the write syscall, but the storage and block layers are not implemented yet (as the README says).

                                                  https://github.com/vvaltchev/tilck

                                                  1. 1

                                                    Once I wanted to understand how Linux works, so I wrote a bash script which removes a given feature (i.e. one Kbuild config option) from the Linux source tree.

                                                    I run my script passing some “CONFIG_…” option to it, and the script completely removes all functionality enabled by that option from the Linux sources, i.e. it parses all the #ifdefs and removes them.

                                                    My intention was to run this script for nearly all CONFIG_* options, removing them one by one except for some small set, so that I would end up with a small Linux kernel that is easy to understand.

                                                    Currently my script more or less works, but it is fragile and sometimes fails.

                                                    If you want, I can post it here.

                                                    1. 1

                                                      Did you check RedoxOS, by the folks at System 76?

                                                      1. 1

                                                        RedoxOS is not by System76 (but the desktop environment COSMIC, by System76, has been ported to it, which might be where the confusion comes from)

                                                        1. 1

                                                          I thought one of the System 76 devs started the project. My bad then.

                                                        1. 1

                                                          I think Stevens’ Advanced Programming in the Unix Environment sticks to userland. However, his TCP/IP Illustrated, Volume 2 is an excellent tour of the BSD kernel network stack: https://en.m.wikipedia.org/wiki/TCP/IP_Illustrated

                                                        2. 1

                                                          Maybe Contiki? It’s a very simple OS for IoT, not POSIX-compliant and without many advanced OS features, but it can be a nice starting point if you want to learn about kernel development.

                                                          1. 1

                                                            @david_chisnall mentioned embedded systems. Espressif’s ESP-IDF is quite nicely organized and easy to follow. Like other embedded systems it’s not so much an OS as a big static library you link your application with, to build a firmware binary you upload to the device’s flash. There’s no kernel or memory protection or processes.

                                                            But it’s got everything from a thread scheduler to I/O drivers to network protocols. A filesystem is very much optional in these devices, but there are a couple available, including FAT32, for managing SD cards.

                                                            1. 1

                                                              The “OS” part of ESP-IDF is based on FreeRTOS, right?