It’s always great to read this author’s posts. He lays out problems simply, finds their essence, and produces simple solutions. The average blog post I read on the web is willing to accept all sorts of complexity in its solution to meet an immediate need, seemingly without much consideration for what maintenance will look like, whether issues will be diagnosable, etc.
Here we see the author going against the grain of raidz, zfs-send/restic/Borg, HomeAssistant/nodered, NixOS, etc.
Bravo to simplicity.
I’ve been considering creating a NAS, so thanks for the tips.
I had also planned to use ESPHome and fork my own https://github.com/stapelberg/regelwerk if I ever jumped into IoT, so it was good to see how simply it came together: https://github.com/stapelberg/regelwerk/commit/8b81d7a808b1d76a0e96bdb4ab43964623d133c4
My daily backups run quicker, meaning each NAS needs to be powered on for less time. The effect was actually quite pronounced, because figuring out which files need backing up requires a lot of random disk access. My backups used to take about 1 hour, and now finish in less than 20 minutes.
zfs-send or btrfs-send is likely a faster option than rsync for users that are happy to rely on those, though I appreciate the robustness of just relying on rsync, as you mention.
Ubuntu Server comes with Netplan by default, but I don’t know Netplan and don’t want to use it.
To switch to systemd-networkd, I ran: […]
👍
IPv6Token=0:0:0:0:10::253
Glad to see IPv6 Tokens in use in homelabs!
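The actual switch-over commands are elided above, but to make the idea concrete, a minimal sketch might look like the following (the interface name, the DHCP choice, and removing netplan.io are all assumptions; the token is the one quoted above):

# sketch only: hand the link to systemd-networkd with the quoted IPv6 token
cat > /etc/systemd/network/10-lan.network <<'EOF'
[Match]
Name=enp1s0

[Network]
DHCP=yes
IPv6AcceptRA=yes
IPv6Token=0:0:0:0:10::253
EOF

systemctl enable --now systemd-networkd   # let networkd manage the link
apt remove --purge netplan.io             # optional: drop netplan entirely
networkctl status enp1s0                  # check the assigned addresses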
My unlock.service constructs the crypto key from two halves: the on-device secret and the remote secret that’s downloaded over HTTPS.
Nice idea.
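To illustrate the two-half idea, here is a sketch only; the paths, URL, and cryptsetup target are invented, and the author’s actual unit may well differ:

#!/bin/sh
# sketch: combine an on-device secret with one fetched over HTTPS, then unlock
set -eu
local_half="$(cat /etc/nas/unlock.local)"
remote_half="$(curl -fsS https://keys.example.com/unlock.remote)"
# concatenate the halves and feed them to the unlock command on stdin
printf '%s%s' "$local_half" "$remote_half" \
  | cryptsetup open /dev/disk/by-partlabel/data cryptdata --key-file=-

Run from a unit ordered after network-online.target so the HTTPS fetch can succeed.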
If you’re okay with less compute power, but want more power efficiency, you could use an ARM64-based Single Board Computer.
For anyone interested in this approach, a good combo is:
x86 will give you fewer troubles though.
Thank you for the compliment!
Interesting write-up. Like David, I would have (and indeed, have) gone with FreeBSD for the simplicity and it-just-all-works-together peace of mind, but one thing I did note about your piece is with regard to the power consumption numbers.
You mention two 10G network card configurations. I don’t know if you’re aware, but the power consumption of these NICs depends massively on the medium. 10GBASE-T (the RJ45 connector version) consumes significantly more power than 10GBASE-SR/LR (multi-mode/single-mode fiber), which in turn uses a little (but non-zero) amount more power than a passive DAC (Twinax/direct attach). The longer the distance, the more power the -T variant will consume, while the others always use the same amount. (I’m talking about a 5-10 watt difference under load.)
You’d need to clarify which you’re using for the power metrics to be useful.
Thanks, I know. The power measurements are only included to give the reader a rough impression of the ballpark power consumption for such a setup. The exact watt figures will depend on many details.
I would use something more ‘low power’ here instead, like an Intel N95/N100/N200.
One can use small USB drives for the system - they can also be mirrored (at least on FreeBSD) - details below:
I am also not sure why the author disabled compression on purpose.
The autounlock thing is beautiful and simple (vs https://github.com/latchset/clevis), thanks for sharing!
I’ve been considering moving from ZFS to an rsync + ext4 + dm-raid + dm-crypt + dm-integrity setup. Since I run my NAS on Linux it seems less risky and more convenient to run in-tree code than ZFS. The theoretical downside is that I lose the ability to snapshot a running system, which I think is a showstopper for me. Have you ever tried that setup @stapelberg?
Or you could use FreeBSD, which has had ZFS in tree for over a decade, and nicely integrated with the base system. A lot of the decisions in the article didn’t make sense to me until I realised that they were specific to Linux.
I built my NAS in 2010, with FreeBSD 9. It had three 2 TB disks in a RAID-Z configuration. I expected to replace the disks periodically, but this was just before the massive floods took out a load of disk factories and so it was a few years before the disks were back to the price I’d paid. If I’d bought a spare set of disks, I’d have been able to sell them six months later and recover the entire cost of the NAS; the price of disks more than doubled.
That machine has been upgraded to FreeBSD 13 (and will get 14 soon) with in-place upgrades. FreeBSD’s update utility automatically creates a new ZFS boot environment when you run it, so if an update fails you can roll back (you can also select alternative boot environments from the loader, so you can recover even if the kernel doesn’t boot).
Over the lifetime of the device, I’ve replaced the motherboard (including CPU and RAM) and replaced the disks with 4 TB ones (SSDs are now cheap enough that I’ll probably move to 8 TB flash disks on the next upgrade). It has four drive bays, so I can use zpool replace to replace each disk with a new one, one at a time, without ever losing redundancy (and without downtime).
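For reference, that one-disk-at-a-time swap is just repeated zpool replace runs; a sketch with made-up pool and device names, with the new disk in the spare bay:

zpool set autoexpand=on tank      # grow the pool once all members are bigger
zpool replace tank ada0 ada3      # resilver old ada0 onto the new disk ada3
zpool status tank                 # wait for the resilver to finish before the next swap
# repeat for the remaining disks; redundancy is never lost because the old
# member stays online until the new one has finished resilvering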
The one upgrade that caused some excitement was the motherboard replacement. The old one was BIOS boot, the new one was UEFI without a BIOS compat layer. Fortunately, I reserved 8 GB at the front of each disk as swap space. Swap on ZFS normally works fine, but disk writes with ZFS can allocate memory and so there are some cases where it will deadlock. I ran with swap on ZFS with the older disks for a few years with no issues but decided to not risk it on the newer disks. The swap partition meant I had space to insert an EFI partition on each disk. This is replicated across all of the disks (sadly, I have to do that manually after each update. I think I could avoid that with gmirror but I haven’t tried) so that the system can boot after any disk fails.
Things like zrepl work out of the box for replicating filesystems. As cloud storage gets cheaper, I have been tempted to add a mirror into something backed by cloud storage, for live backup.
Or you could use FreeBSD, which has had ZFS in tree for over a decade, and nicely integrated with the base system. A lot of the decisions in the article didn’t make sense to me until I realised that they were specific to Linux.
Honestly I run Linux and I don’t see why I would want to do the stuff in this blog post instead of taking advantage of more advanced tools and features. ZFS is provided painlessly by my distro, zrepl Just Works, kopia gives me nice deduplicated backups in the cloud. Hell even setting up upgrade snapshotting was super easy despite not being provided by the distro.
I’ve been tempted to try FreeBSD (vanilla, unraid, etc.), but each time I haven’t been able to justify spending “innovation tokens” (as the author calls it) on it. It’d end up duplicating a bunch of things I have to know:
boot/services
package manager
firewall
debugging
security/isolations
There’s a simplicity/pragmatism in picking some base tools and sticking with them where possible. I’m sure FreeBSD is that tool for many, but if I’m already on a Linux stack, ZFS on Linux is a good choice.
I think ‘innovation tokens’ apply to new technologies, not to mature systems that have been in widespread use for decades. Most of the things on your list are fairly similar:
boot/services
Boot on most hardware is via a UEFI boot manager. You can boot FreeBSD with GRUB, though most people don’t. The FreeBSD loader is simpler, and generally the only thing you’ll need to do with it is wait for it to time out (or press the boot-now button if you’re in a hurry).
Service management is slightly different, though the ‘service {start,stop,…}’ commands are similar to SysV systems (and mentally replacing systemctl with service if you’re in systemd world is not that much mental effort). Enabling and disabling services is different but it’s not much to learn: most services are enabled by adding a single line to either /etc/rc.conf or /etc/rc.conf.d/{service name} (your choice), and you can use sysrc if you don’t want to edit the files by hand.
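For example (sshd here is just a stand-in for any service):

sysrc sshd_enable=YES     # persists the single line in /etc/rc.conf
service sshd start        # start it now
service sshd status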
package manager
s/apt/pkg/ is about all that you need to do here. Oh, and maybe save some typing, because pkg accepts unambiguous shorthands (e.g. ins for install). And pkg automatically runs pkg update when you do pkg upgrade, so you don’t hit the thing that keeps biting me with apt, where an upgrade fails to download a package that has since been replaced because my local copy of the index is stale.
If you want to build a package set yourself, that’s more complicated, but with Poudriere it’s only a few commands (three steps gets you a complete package set that’s identical to the official one, but if you want to configure options, add additional packages, or use upstream for all of the packages that are unmodified, you need to do more).
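A sketch of both: day-to-day pkg use, and the few-command Poudriere flow described above (jail name and release version are only examples):

pkg ins tmux                                  # unambiguous shorthand for "install"
pkg upgrade                                   # refreshes the index, then upgrades

# building your own package set with Poudriere, roughly:
poudriere jail -c -j builder -v 14.2-RELEASE  # 1. create a build jail
poudriere ports -c                            # 2. fetch a ports tree
poudriere bulk -j builder -a                  # 3. build everything into a repository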
firewall
This is quite different, but Linux has changed its firewall infrastructure twice in the time that I’ve been using pf on FreeBSD, so it’s likely to be an investment with more longevity. That’s a running theme. I started using FreeBSD 4. If I hadn’t learned anything in between, the majority of what I learned then would still be relevant for running a modern FreeBSD machine (the main difference is that the package infrastructure is now really nice; back then, doing anything other than building from ports caused pain). Configuring WiFi has changed a bit (WiFi was still new and shiny back then and most machines didn’t have it).
In contrast, I was running a room of Fedora machines at the time (having started with RedHat 4.2) and I had to read the docs to do almost anything on a modern Fedora system. Most importantly, achieving the same result is now done in a different way. The FreeBSD project views the Principle of Least Astonishment (POLA) as a core design principle. If a new feature is added, you need to learn how to use it, but if you’re trying to do something that you could do last year then it should be done the same way (from a user’s perspective, at least. The underlying infrastructure may be totally different).
debugging
That’s basically the same. lldb, gdb, valgrind, sanitizers, and so on all work the same way. The differences come with system introspection things (performance introspection, system-call tracing). Most of these tools on FreeBSD expose DTrace probes, so that’s transferrable to macOS, Solaris, and even Windows.
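For instance, a DTrace one-liner of this kind works essentially unchanged across those platforms (subject to each OS’s restrictions on enabling DTrace):

# count system calls per executable until Ctrl-C
dtrace -n 'syscall:::entry { @[execname] = count(); }'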
security/isolations
This is one of the places where it is quite different. The TrustedBSD MAC framework can be used in a way that’s equivalent to SELinux (and in a few other ways. It’s also the basis for JunOS’s code signing infrastructure and macOS / iOS’s sandbox framework).
There’s no equivalent of seccomp-bpf (largely because attackers love having a tool for injecting Spectre gadgets and code reuse attack gadgets into the kernel).
If you use containers, Podman gives you the same experience on FreeBSD as on Linux (and can even run Linux containers in a lot of cases) and is a drop-in replacement for Docker (alias docker=podman basically works).
If you’re writing compartmentalised software, Capsicum has no real alternative on Linux but is the only thing that actually makes it easy to reason about your security properties.
Thanks for your insightful response!
I think ‘innovation tokens’ apply to new technologies, not to mature systems that have been in widespread use for decades.
Fair point, I suppose I meant the sentiment of “time spent learning a technology that isn’t required to solve the problem at hand”. But I agree and can see there’s crossover with existing tech I’m already familiar with.
Service management is slightly different, though the ‘service {start,stop,…}’ commands are similar to SysV systems (and mentally replacing systemctl with service if you’re in systemd world is not that much mental effort).
I do appreciate the times I find myself on a SysV system, if only because I seem to be able to introspect all parts of the boot process. systemd gives me lots (declarative, easy dependencies, sandboxing), but I do feel “alienated” from the machine.
The FreeBSD project views the Principle of Least Astonishment (POLA) as a core design principle. If a new feature is added, you need to learn how to use it, but if you’re trying to do something that you could do last year then it should be done the same way (from a user’s perspective, at least. The underlying infrastructure may be totally different).
This seems to be one of FreeBSD’s strongest points, as compared to a typical Linux installation, which has n different configuration surfaces and ways of doing things.
If you’re writing compartmentalised software, Capsicum has no real alternative on Linux but is the only thing that actually makes it easy to reason about your security properties.
Capsicum looks great. I need to read up on it more, but the closest Linux equivalent looks to be user namespaces. Sadly, user namespaces are often disabled on distributions, and have little-to-no usage in userspace applications. systemd-managed services often have half-decent sandboxing (which systemd can do without user namespaces since it runs as root). One flaw with the systemd hardening approach is that it relies on declaring which privileges to drop, which feels like the wrong approach. Ideally, the privileges that are needed would be declared instead, which is a lot easier to audit and get right.
I’ll check out Capsicum some more, thanks for the pointer.
I think the closest thing to Capsicum on Linux is a combination of seccomp-bpf and Landlock. I’ve used seccomp-bpf and Capsicum to enforce the same sandboxing model. The Capsicum code was shorter, ran faster, and provided finer control. Oh, and cloud providers are increasingly disabling seccomp-bpf because it turns out to be an excellent way of injecting spectre gadgets into the kernel.
Linux namespaces are closer to jails on FreeBSD. Podman uses user namespaces for unprivileged containers on Linux, normal namespaces for privileged containers. It uses jails on FreeBSD. I remain somewhat unconvinced by user namespaces. FreeBSD restricts jails to root in part because you can nest them and then escape from the inner one by having an unprivileged process race renames in the filesystem. This is not a problem if only root can create jails but it is a problem if the user on the outside is unprivileged. I have not seen any discussion of this kind of attack from the Linux folks working on user namespaces and so I suspect that they have introduced a big pile of security vulnerabilities that are impossible to fix without making the VFS layer painfully slow.
Your ZFS + FreeBSD use description is compelling; sounds like you’ve got a real Ship of Theseus over there. Transitioning to a FreeBSD host is not something I can easily do, unfortunately. Even if I could, and rationally I would, I have to admit that I still feel biased towards the stackable Linux solution, at least in a systems-designer sort of way. There is something satisfying, albeit impractical, about being able to reproduce nearly all of ZFS’s feature set from a set of simpler orthogonal primitives. Though I don’t feel exactly the same way about how Linux containers are implemented.
zpool replace to replace each disk with a new one, one at a time, without ever losing redundancy (and without downtime).
FWIW you can replace disks one by one and grow an mdadm RAID array without losing redundancy and with no downtime either. The hitch is that once the array is finished growing, you have to unmount the actual file system before you can grow it. It’s a relatively quick operation, especially in comparison to growing the array, but the difference between zero downtime and non-zero downtime is virtually infinite.
Linux has supported online resize¹ with resize2fs for ext4 since at least kernel 2.6.
¹ growing only; you still have to unmount in order to shrink.
That is incredible. That makes the primitives-based Linux stack address all of my use cases. Thank you for that info!
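Putting both points together, a sketch of the whole flow with made-up device names (mdadm --replace copies onto the spare before dropping the old member, so redundancy is kept throughout):

mdadm /dev/md0 --add /dev/sdc1                        # new, larger disk as a spare
mdadm /dev/md0 --replace /dev/sda1 --with /dev/sdc1   # copy, then retire the old member
cat /proc/mdstat                                      # wait for the copy to finish
# ...repeat until every member is on a larger disk...
mdadm --grow /dev/md0 --size=max                      # let the array use the new space
resize2fs /dev/md0                                    # ext4 grows online, while mounted
# (if dm-crypt sits between md and ext4, run cryptsetup resize on the mapping first)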
I have tried rsync + ext4 + dm-crypt, yes, but not with dm-raid and dm-integrity. I am using dm-raid elsewhere and have no concerns about it. I don’t have experience with dm-integrity.
Regarding snapshots: I tried using LVM snapshots many years ago and it wasn’t working reliably. It’s probably better these days, though :)
How important is dm-integrity really? I have moved and compared terabytes (using WD Red + dm-raid + LUKS + ext4) and not a single bit was flipped, even when Ethernet and non-ECC RAM were involved (my NAS itself has ECC RAM). Is bit rot a real issue? If one bit flips every 10 years, that means I get no more than a handful of bitflips in my entire life. While it’s not perfect, I think I can live with that - especially since the bitflip will probably be inside a video and might not even be noticed. Of course, it would be catastrophic if a bit flips in the inode of the root directory of the NAS.
The chance of bit flips in your storage isn’t static, and how it’ll change over time isn’t easily predictable. After my NAS power supply died, I couldn’t access those disks for a while, but now I’ve plugged them into my PC with a SATA card. I ran a scrub and 3 of the 4 drives had erroneous data, caught by ZFS’s checksumming. None of the drives actually reported any errors. And thanks to raidz1, the incorrect data was fixed.
This may have been transient, or maybe the drives are just gonna get worse and give me even more bad data. Thanks to checksums and regular scrubbing, I’ll know.
I’m worried about the complexity of ZFS. I feel like the probability of data loss due to a ZFS bug is higher than due to a bitflip, and I know people who have indeed lost data due to ZFS bugs. Also, ZFS has a lot of features I don’t want or need and it requires a fairly fast CPU. I would prefer a stack with ext4, to keep the file system complexity in check and for great performance. And I like the abstraction layers and modularity of chained block devices. So, perhaps my setup could look like this [1]:
In case of a bitflip, the dm-integrity block device would have a read error and RAID1 would make sure that the data in question is read from the other disk. My current setup is like this, but without dm-integrity. The partitions on the bare disks are required to control the size and make sure that the BIOS recognizes the partition and doesn’t do crazy shenanigans [2]. But… Perhaps I just have to give in and use ZFS after all.
[1] Seems like that is roughly what’s proposed here, though they encrypt before the RAID1, which I find silly because you encrypt twice.
[2] https://news.ycombinator.com/item?id=18541493
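For concreteness, a sketch of that stack in the order described (integrity per disk, RAID1 above it, a single dm-crypt layer on top, ext4 last; device names are placeholders):

integritysetup format /dev/sda2 && integritysetup open /dev/sda2 int-a
integritysetup format /dev/sdb2 && integritysetup open /dev/sdb2 int-b
# (format wipes the device by default, which takes a while on large disks)
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/mapper/int-a /dev/mapper/int-b
cryptsetup luksFormat /dev/md0            # encrypt once, above the RAID
cryptsetup open /dev/md0 nasdata
mkfs.ext4 /dev/mapper/nasdata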
That looks a lot more complex and fragile to set up, with many more layers than just setting up a zpool in ZFS. Plus you’re not just getting reliability features, but things like filesystem level snapshots and boot environments, for free.
I’m running ZFS on a 10-year old Atom (J1800) and I haven’t run into performance issues with ZFS - what impacts me more is things like no AES-NI.
To add another data point:
My NAS was using an AMD E-350. I upgraded it to a Pentium Silver J5040, mostly because I wanted more RAM. With the right modules, the Pentium will take 64 GiB of RAM, which is enough that I can turn on deduplication and have a lot of ARC space left over.
For any disk operations, the CPU load remains fairly low, the bottleneck is the spinning rust. Adding a separate log device on an SSD would speed up writes (I’ve done that on another machine, but not for home).
With SSDs, the CPU might be a bit slow, but it can do zstd stream compression fast enough that I probably wouldn’t notice (same with AES, since it has AES-NI), especially since ZFS encrypts and compresses blocks separately and so can use all four cores when doing heavy I/O.
It tends to run fairly read-heavy workloads and zstd and lz4 are both much faster for decompression than compression. With the amount of RAM I have, a lot of reads are hit from compressed ARC and so involve zstd decompression. Performance tends to be limited by GigE or WiFi there.
In my case, the NAS I’m using is small (two bays, so only running a zmirror), and with 8 TB, no dedupe means 8 GB is more than enough (even if I could take it up to 16). I figure by the time I’m due for an upgrade, flash will be cheaper than spinning rust at capacities I can justify (on top of upgrading from ye old Bay Trail).
Splitting the dm-integrity like that is going to weaken your guarantees. Dm-integrity has two modes. If layered on top of dm-crypt, it stores the HMACs from the encryption and uses these for integrity. If used independently, it uses a separate hash function. This is not tamper resistant.
If you’re using it for error detection, it’s the wrong tool for the job: it’s not an ECC scheme, it’s a cryptographic integrity scheme (though one that’s still vulnerable to trivial block-level replay attacks, which are harder with ZFS’s encryption model).
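The two modes differ roughly like this (a sketch; the device name is a placeholder, and the authenticated LUKS2 mode may still be flagged experimental depending on your cryptsetup version):

# standalone: per-sector checksums (crc32c by default) detect corruption, not tampering
integritysetup format /dev/sdX2

# layered under dm-crypt: LUKS2 authenticated mode, tags tied to the encryption key
cryptsetup luksFormat --type luks2 --integrity hmac-sha256 /dev/sdX2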
ZFS has been in production since 2006, so it’s had a lot of years of production use under its belt.
I know people who have indeed lost data due to ZFS bugs. Also, ZFS has a lot of features I don’t want or need
I have an impression that the data loss bugs are mostly or entirely in the newer, optional features from the last ~10 years, so I leave those all disabled (originally also because I wanted to keep my file systems compatible between different ZFS implementations, which had different sets of optional features, but since then all the implementations I might use have been unified as OpenZFS).
and it requires a fairly fast CPU.
I daily-drive ZFS on a desktop computer with a CPU from 2011, and it works fine even with the CPU’s virtual cores disabled to mitigate some vulnerability and ZFS set to use the ‘slow’ gzip-9 compression… but I admit that the CPU is a 3.4 GHz Core i7, not an Atom; while it’s old and on the low end for an i7, I suppose it still counts as fairly fast.
I forgot about LVM snapshots, thank you for the pointer!
This is neat. I’ve wanted to go this route when I get some more cash to fund the project, since I like how much quieter SSDs are and how much less power they draw.
Lastly, I added a systemd override to ensure the node exporter keeps trying to start until tailscale is up: the command systemctl edit prometheus-node-exporter opens an editor, and I configured the override like so […]
Have you tried with the following instead?
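Neither the author’s override nor the commenter’s alternative survives in the quotes above; purely as an illustration of the “keep retrying until tailscale is up” idea, a drop-in along these lines is one common shape (all values are guesses, not the actual configuration):

mkdir -p /etc/systemd/system/prometheus-node-exporter.service.d
cat > /etc/systemd/system/prometheus-node-exporter.service.d/override.conf <<'EOF'
[Unit]
# do not give up after a fixed number of failed start attempts
StartLimitIntervalSec=0

[Service]
Restart=on-failure
RestartSec=5
EOF
systemctl daemon-reload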