1. 28
    1. 13

      The layering for subsystems in NT is quite similar to the notion of ABIs in FreeBSD (and, actually, a bit less flexible). Each FreeBSD process has an associated ABI that describes:

      • How to load the program into memory (this was, I believe, originally added to support both ELF and a.out binaries).
      • How signals are delivered. 64-bit and 32-bit processes will have different signal frame layouts, and foreign compat layers (such as Linux) may have their own.
      • The system call table. 64-bit and 32-bit processes have different syscall tables, but so do processes that enter Capsicum mode, Linux ones, and so on. There’s clean layering in the kernel between sys_ functions that take userspace arguments and kern_ functions that implement the functionality. Linux and FreeBSD versions of the mmap system call, for example, will both call the same kern_mmap, but will map the userspace flags differently.
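
      The sys_ / kern_ split is easy to see in miniature. Here’s a deliberately toy, userspace sketch of the pattern (the FreeBSD names sys_mmap, kern_mmap and linux_sys_mmap are real; the flag values and the translation helper are invented for illustration):

          /*
           * Toy userspace sketch of FreeBSD's sys_ / kern_ layering: one kern_
           * implementation, per-ABI sys_ wrappers that normalise arguments.
           */
          #include <stdio.h>

          #define BSD_MAP_ANON   0x1000   /* invented flag encodings for the example */
          #define LINUX_MAP_ANON 0x0020

          /* "kern_" layer: a single implementation taking normalised arguments. */
          static int kern_mmap(size_t len, int bsd_flags)
          {
              printf("kern_mmap(len=%zu, flags=%#x)\n", len, bsd_flags);
              return 0;
          }

          /* Native ABI entry point: arguments are already in the native encoding. */
          static int sys_mmap(size_t len, int flags)
          {
              return kern_mmap(len, flags);
          }

          /* Linux-compat entry point: translate the foreign encoding, then call
           * the same kern_ function. */
          static int linux_sys_mmap(size_t len, int linux_flags)
          {
              int bsd_flags = 0;
              if (linux_flags & LINUX_MAP_ANON)
                  bsd_flags |= BSD_MAP_ANON;
              return kern_mmap(len, bsd_flags);
          }

          int main(void)
          {
              sys_mmap(4096, BSD_MAP_ANON);         /* a native FreeBSD process */
              linux_sys_mmap(4096, LINUX_MAP_ANON); /* a Linux-compat process */
              return 0;
          }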

      The subsystem abstraction in NT was not sufficient to implement WSL (it uses picoprocesses, which were added for Drawbridge), whereas the FreeBSD ABI support would be sufficient for an NT ABI if someone wanted to write one.

      I’m not sure that the claim about contemporary UNIXes being coupled to a specific architecture is quite true. 4.3BSD predated NT by several years and I think that’s where the VM / pmap separation came from. The VM layer is architecture-independent; the pmap layer provides a lowering to different hardware abstractions. This included things like SPARC and MIPS that had a software-managed TLB. This may have come from Mach, I’m not 100% sure. Both NT and Linux implement something that looks like x86 page tables on these, which is why modern hardware is stuck with radix trees for virtual memory, in spite of the fact that they have awful performance characteristics. If a BSD-style pmap had been more common we’d have faster CPUs now.
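
      For anyone who hasn’t seen it, the shape of that split is roughly the following (simplified placeholder types and prototypes, not FreeBSD’s actual headers):

          /*
           * Rough sketch of the MI/MD split described above.  Types and
           * prototypes are simplified placeholders, not the real pmap.h.
           */
          typedef unsigned long   vm_offset_t;  /* a virtual address */
          typedef struct vm_page *vm_page_t;    /* MI description of a physical page */
          typedef struct pmap    *pmap_t;       /* MD "physical map" for one address space */
          typedef int             vm_prot_t;

          /*
           * The MI VM layer reasons about maps, objects and pages; whenever it
           * needs a real translation it calls down into the pmap layer, which may
           * be backed by x86-style page tables, a software-managed TLB (MIPS,
           * classic SPARC), or whatever else the hardware provides.
           */
          void pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot);
          void pmap_remove(pmap_t pmap, vm_offset_t start, vm_offset_t end);
          void pmap_protect(pmap_t pmap, vm_offset_t start, vm_offset_t end, vm_prot_t prot);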

      The compatibility bit is slightly misleading. Most of the compatibility bits in NT are implemented in userspace. When a process starts, the kernel maps a pile of DLLs into its address space, which provide things like driver APIs. The userspace loader will identify legacy executables and load shim DLLs. This means that there’s a tight coupling between device drivers and userspace ABIs, which is unfortunate.

      The comments on WSL1 performance could do with more detail. The biggest source of performance issues with WSL1 was how the filesystem was implemented. NTFS is a nice design for a filesystem because it has a very low-level on-disk format that provides a key-value store (with special handling for very small things) and then builds other things in layers. Disk I/O is often slow in Windows because things go through a lot of these layers (filter drivers). WSL added even more layers to provide POSIX filesystem semantics on top. The nice property was that accessing the host filesystem was as fast as accessing the Linux world, and you could open a $WSL filesystem in Explorer (and anything else that handled UNC paths). The downside was that everything was slow.

      The other problem was that *NIX apps expect overcommit. It’s common to create large anonymous shared memory objects and use small amounts of them. On WSL, these were all implemented as file-backed mappings (no overcommit in NT, but a file backing satisfies the commit requirements) and so brought in the same performance penalties.
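
      The overcommit pattern in question looks something like the snippet below (a minimal Linux/BSD-flavoured sketch): reserving a huge anonymous shared region is nearly free on a *NIX kernel, whereas WSL1 had to satisfy the commit for the whole thing with a file-backed mapping.

          /*
           * The pattern described above: reserve a very large anonymous shared
           * mapping and only ever touch a tiny part of it.  Cheap with
           * overcommit; expensive when the whole region must be committed.
           */
          #include <stdio.h>
          #include <string.h>
          #include <sys/mman.h>

          int main(void)
          {
              size_t len = 16UL << 30;                  /* reserve 16 GiB ... */
              char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
              if (p == MAP_FAILED) {
                  perror("mmap");
                  return 1;
              }
              memset(p, 0xab, 4096);                    /* ... but touch only one page */
              printf("reserved %zu bytes, touched 4096\n", len);
              munmap(p, len);
              return 0;
          }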

      There are a lot of differences in the virtual memory subsystem that aren’t covered here. The decision to make almost all of the kernel pageable meant that it had to store enough metadata in a not-present page to be able to find the page, whereas Linux and *BSD can allocate memory when paging. This makes supporting MTE or CHERI trivial on *NIX and almost impossible on NT. Similarly, the process abstraction on NT meant that copy-on-write memory was rare and shared memory was unusual. In contrast, systems with fork have both of these as common cases. This makes the NT memory manager a very different design to any *NIX system. On *BSD, private memory is just anonymous shared memory with a single mapping (and can possibly be optimised slightly knowing that the mapping is unique); in NT, private and shared memory regions are totally different.
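
      The two “common cases” referred to here are easy to demonstrate on any fork-based system (a minimal sketch; error handling elided):

          /*
           * After fork(), the parent's private memory becomes copy-on-write,
           * and an anonymous MAP_SHARED mapping is genuinely shared with the
           * child.  Both are routine on *NIX and rare on NT.
           */
          #include <stdio.h>
          #include <sys/mman.h>
          #include <sys/wait.h>
          #include <unistd.h>

          int main(void)
          {
              int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
              int private_val = 1;        /* ordinary private (COW after fork) memory */
              *shared = 1;

              if (fork() == 0) {          /* child */
                  private_val = 2;        /* triggers a COW copy; parent unaffected */
                  *shared = 2;            /* visible to the parent */
                  _exit(0);
              }
              wait(NULL);
              printf("private=%d shared=%d\n", private_val, *shared); /* private=1 shared=2 */
              return 0;
          }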

      *NIX VFS originated in SunOS, as I recall, and came out of the NFS project (the kernel needed a common abstraction over UFS and NFS).

      The discussion of asynchronous I/O would benefit from a discussion of I/O completion ports; Solaris had these as well (and maybe AIX). They were not in NT from the start, though; they were added in 3.5. These are somewhat similar to AIO but are nicer than the poll / select / kqueue mechanism. IOCP kicks off an I/O operation and then tells you when it’s done. The equivalent POSIX mechanism notifies you when an I/O operation may start, which is inherently racy (some other thread may perform an operation that means that you can’t actually do the I/O).
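
      For readers who haven’t used them, a completion-port read looks roughly like this (a minimal Win32 sketch; error handling elided and the file name is just a placeholder). The notification arrives when the operation has finished, not when it may be started.

          /*
           * Minimal sketch of the completion model: start an overlapped read,
           * then wait for the completion packet.  Error handling elided.
           */
          #include <windows.h>
          #include <stdio.h>

          int main(void)
          {
              HANDLE file = CreateFileW(L"test.txt", GENERIC_READ, FILE_SHARE_READ,
                                        NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
              HANDLE iocp = CreateIoCompletionPort(file, NULL, /* CompletionKey */ 1, 0);

              char buf[4096];
              OVERLAPPED ov = {0};
              ReadFile(file, buf, sizeof(buf), NULL, &ov);   /* kick off the I/O ... */

              DWORD bytes;
              ULONG_PTR key;
              LPOVERLAPPED done;
              /* ... and be told when it has completed, not when it may be started. */
              GetQueuedCompletionStatus(iocp, &bytes, &key, &done, INFINITE);
              printf("read %lu bytes\n", bytes);

              CloseHandle(iocp);
              CloseHandle(file);
              return 0;
          }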

      The registry discussion omits some of the big problems. First, the lack of transactional updates (file-based things can use traditional file locking). As I recall, the registry actually supports transactions, it just doesn’t expose them sensibly and so no one uses them. Second, the inability to clean up configuration because it’s scattered across the system. This matters less for file-based config because small files don’t cost much, but registry performance degrades as it gets larger.
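
      For what it’s worth, the KTM-based transacted registry API looks like this (a sketch with error handling elided; the key and value names are made up), which perhaps illustrates why “doesn’t expose them sensibly” is fair:

          /*
           * Sketch of the transactional registry API (KTM).  Error handling is
           * elided; link against ktmw32.lib and advapi32.lib.  Key/value names
           * are placeholders.
           */
          #include <windows.h>
          #include <ktmw32.h>

          int main(void)
          {
              HANDLE txn = CreateTransaction(NULL, 0, 0, 0, 0, 0, NULL);

              HKEY key;
              RegCreateKeyTransactedW(HKEY_CURRENT_USER, L"Software\\ExampleApp", 0,
                                      NULL, 0, KEY_WRITE, NULL, &key, NULL, txn, NULL);
              DWORD value = 42;
              RegSetValueExW(key, L"Setting", 0, REG_DWORD,
                             (const BYTE *)&value, sizeof(value));
              RegCloseKey(key);

              /* Either everything above becomes visible, or none of it does. */
              CommitTransaction(txn);   /* or RollbackTransaction(txn) */
              CloseHandle(txn);
              return 0;
          }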

      Similarly, one of the problems with localisation was that MS jumped in slightly too early. They used UCS-2 for everything and later retrofitted UTF-16. This isn’t too different from *NIX using ASCII and retrofitting UTF-8.

      A lot of the problems with modern NT come from things like pushing a load of Win32 into the kernel (scroll bar rendering should not be in the kernel, neither should font parsing and executing hinting bytecode!), or from persisting in supporting optimisations long past their shelf life.

      1. 7

        That was a great read. Something that wasn’t touched on in the I/O subsystem section that is very different from BSD and Linux architectures (as far as I’m aware!) is that in addition to its asynchronous design it has a highly layered I/O stack where “filters” can be inserted above or below other components. These filters can be provided by third-parties and have been used extensively to provide additional features (or “anti-features” in some cases).

        A prominent example is antivirus software, which typically works in large part by installing a filter driver so it can inspect all I/O requests and block them where they’re determined to be malicious, but there are countless other examples. A few examples which come to mind that help illustrate the breadth of functionality they can provide:

        • OneDrive
          Uses a filter driver for its “on-demand” functionality (i.e. files are downloaded when they’re accessed).
        • Process Monitor
          Ever wondered how it actually manages to display all I/O requests? Well, now you know.
        • Path virtualisation
          Windows has included a filter driver since Vista which transparently redirects certain write operations to privileged paths to user-writeable paths. As you may have guessed, this is for backwards compatibility with (mostly) pre-Windows Vista apps which just assumed you had Administrator privileges or much older systems where file system permissions weren’t even supported. If such apps try to write to e.g. C:\Windows\win.ini they’ll be redirected to C:\Users\<user>\AppData\Local\VirtualStore\.... Take a look if you’re running Windows and you may be surprised (read: horrified) to see various apps which clearly are writing to paths they shouldn’t be.

        While file system filter drivers are very powerful, they have a lot of potential downsides:

        • Complexity
          They’re complex to write, and being kernel drivers, when things go wrong that’s often going to mean a bluescreen.
        • Performance
          Each filter driver comes with a performance cost, and poorly written filter drivers can cause a big hit. The perception of NTFS as slow is in part due to misbehaving filter drivers, or the cumulative cost of having a lot of them.
        • Abuse
          True of all kernel drivers, but filter drivers are particularly well suited for “rootkit” like behaviour. Less of an issue these days with more stringent requirements to be allowed to run in the kernel (e.g. kernel-mode code signing), but there’s been plenty of past cases.

        You might be wondering how it is determined where filter drivers individually sit in the I/O stack. E.g. if vendor A and vendor B both write a filter driver, is the order with respect to their position in the I/O stack deterministic, or a function of load order, or something else? The answer is that it’s deterministic: each filter driver has an assigned “altitude”, with Microsoft maintaining a registry of allocated altitude numbers. Developers can in turn request allocation of a filter driver altitude number. Which brings me to this wonder: Allocated filter altitude.

        A largely historical issue: it used to be the case that a proliferation of loaded filter drivers would cause bluescreens simply due to running out of kernel thread stack. That was partly due to 32-bit systems having a far more constrained kernel stack size, but also because filter drivers worked by each filter driver calling the next in the chain, so the size of the stack would grow in proportion to the number of filter drivers that needed to be called. The move to 64-bit has helped with larger stacks, but the bigger contributor is probably that Windows now has a “Filter Manager” driver which modern filter drivers register with, and the Filter Manager will call each registered filter in the appropriate order. Yes, the Filter Manager is itself a filter driver which manages other filter drivers :-)
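
        For a sense of what “register with the Filter Manager” looks like, a minifilter’s entry point is, in outline, just a registration call (a heavily simplified sketch; a real driver also ships an INF recording its assigned altitude and does real work in its callbacks):

            /*
             * Heavily simplified sketch of a Filter Manager minifilter entry point.
             * The pre-operation callback just passes IRP_MJ_CREATE requests through.
             */
            #include <fltKernel.h>

            static PFLT_FILTER g_filter;

            static FLT_PREOP_CALLBACK_STATUS
            PreCreate(PFLT_CALLBACK_DATA Data, PCFLT_RELATED_OBJECTS FltObjects,
                      PVOID *CompletionContext)
            {
                UNREFERENCED_PARAMETER(Data);
                UNREFERENCED_PARAMETER(FltObjects);
                UNREFERENCED_PARAMETER(CompletionContext);
                return FLT_PREOP_SUCCESS_NO_CALLBACK;   /* pass the request down */
            }

            static NTSTATUS
            Unload(FLT_FILTER_UNLOAD_FLAGS Flags)
            {
                UNREFERENCED_PARAMETER(Flags);
                FltUnregisterFilter(g_filter);
                return STATUS_SUCCESS;
            }

            static const FLT_OPERATION_REGISTRATION Callbacks[] = {
                { IRP_MJ_CREATE, 0, PreCreate, NULL },
                { IRP_MJ_OPERATION_END }
            };

            static const FLT_REGISTRATION Registration = {
                sizeof(FLT_REGISTRATION), FLT_REGISTRATION_VERSION, 0,
                NULL,          /* no context registration */
                Callbacks,     /* which operations we want to see */
                Unload,        /* remaining optional callbacks left NULL */
            };

            NTSTATUS
            DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
            {
                UNREFERENCED_PARAMETER(RegistryPath);
                NTSTATUS status = FltRegisterFilter(DriverObject, &Registration, &g_filter);
                if (NT_SUCCESS(status)) {
                    status = FltStartFiltering(g_filter);   /* Filter Manager now calls us */
                    if (!NT_SUCCESS(status))
                        FltUnregisterFilter(g_filter);
                }
                return status;
            }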

        The curious can run fltmc to see the registered filter drivers and where they sit in the I/O stack.

        1. 4

          Something that wasn’t touched on in the I/O subsystem section that is very different from BSD and Linux architectures (as far as I’m aware!) is that in addition to its asynchronous design it has a highly layered I/O stack where “filters” can be inserted above or below other components

          Netgraph and GEOM in the FreeBSD kernel also have these properties, though they both define graphs rather than stacks. There isn’t really a thing like this at the VFS layer on FreeBSD though.

          Path virtualisation

          FreeBSD has a thing for doing this in the ABI layer, designed to allow different types of programs to look for libraries in different places (so a Linux thing that looks in /usr/lib actually looks in /linux/usr/lib or whatever).

          1. 1

            “file system minifilter fractional value altitude” is a wonderful concept to have learned of; thank you.

            1. 2

              I’d go with cursed, but I’m glad :-)

          2. 2

            Super interesting. I would also have loved to see NT’s various narrow waists, like the kernel object tree, compared to Plan 9, which as I understood it solves the “how do we address files and other things the same way” problem by making all those things implement the file system protocol.