Threads for stapelberg

    1.  

      This looks like a solution looking for a problem. I’ve heard lots of things about gRPC being absurdly complex. I think it’s cool, but I think I’ll stick with SSH. Thank you very much.

      1.  

        They said it pretty clearly:

        This is a demo program to show how to use the gokrazy/rsync module over a gRPC transport.

        1.  

          Indeed :) The point is not to get anyone to use rsync over gRPC, but to demonstrate that if you are already working in a corporate (or similar) environment with a landscape of RPC services using gRPC, Thrift, etc., you can now also speak rsync protocol over those channels, if it helps your project.

        2.  

          Upstream SSH with the static buffers or hpn-ssh?

          I just use upstream and accept the inefficiency these days. But if I was doing more performance critical things, it would be an issue.

          1.  

            When scp is too slow, use qcp, which is designed to do well with high-bandwidth, high-latency, lossy links.

        3. 7

          I figured I would share this demo as a way to make folks aware that my gokrazy/rsync module can be used as a library — both its client and its server accept the io.ReadWriter interface type :)
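
          To make that concrete, here is a rough sketch of adapting a bidirectional gRPC stream to io.ReadWriter so it can be handed to the library. The stream interface and chunk framing are illustrative assumptions, not the actual gokrazy/rsync or generated gRPC API:

          package rsyncoverrpc

          // byteStream stands in for a generated gRPC bidirectional stream that
          // sends and receives raw byte chunks (hypothetical, for illustration).
          type byteStream interface {
            Send(chunk []byte) error
            Recv() ([]byte, error)
          }

          // streamRW adapts such a stream to io.ReadWriter, the interface type
          // that both the gokrazy/rsync client and server accept.
          type streamRW struct {
            stream byteStream
            buf    []byte // leftover bytes from the previous Recv
          }

          func (s *streamRW) Read(p []byte) (int, error) {
            if len(s.buf) == 0 {
              chunk, err := s.stream.Recv()
              if err != nil {
                return 0, err
              }
              s.buf = chunk
            }
            n := copy(p, s.buf)
            s.buf = s.buf[n:]
            return n, nil
          }

          func (s *streamRW) Write(p []byte) (int, error) {
            if err := s.stream.Send(p); err != nil {
              return 0, err
            }
            return len(p), nil
          }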

          1.  

            I mostly like gokrazy for stateless systems, though I know it supports persistent storage. I assume the rsync module is for file transfer, to make a gokrazy system more stateful?

            1.  

              Yes, gokrazy is stateless by default, with a permanent partition for data you want to keep.

              One great use-case for rsync in this scenario is to offer downloading the permanent partition contents for backups as described at https://gokrazy.org/userguide/rsync-backups/

              Another use-case I recently had is to copy static website files onto /perm/srv for serving with the Caddy web server. After installing gokrazy/rsync on my router (https://router7.org/), I can now use normal rsync commands to send data from my PC to /perm/srv :)

            1. 6

              Yep, and the “reverse” thing is just a checkbox that updates in real time. I’d never have thought it would be worth a whole article about this feature, as it’s just readily available in Hotspot (and Heaptrack, which does the same but for memory allocations).

              1. 1

                I’ve used flamegraphs but never heard of hotspot, so…

                1. 1

                  I tried Hotspot. It segfaulted on the first use, so I didn’t investigate it further. But it looks enticing; I’ll give it another shot.

                  1. 3

                    Strange, I’ve used it for… like 7ish years at this point and don’t really recall it segfaulting.

                    1. 1

                      I also recently tried Hotspot and when I installed it from AUR on Arch Linux, it wouldn’t read perf.data files — it just kept consuming memory until it exhausted my machine.

                      When I tried it on an Ubuntu 24.04 machine, Hotspot worked fine with the same perf.data file.

                      So double-check whether you might have a broken version / broken installation :)

                      BTW: I ended up not using Hotspot, as for the purpose of annotating source code files with counter values, it did not seem any better than perf report. I would be curious what others are using Hotspot for primarily…?

                      1. 2

                        For flame graphs.

                        I also prefer perf report for annotating source.

                        I would like some tool to detect register / stack thrashing, though.

                2. 1

                  I’m confused by the comment about bit fields. Probably because I haven’t used protobufs in a long time and have forgotten the concepts. I imagine this has something to do with the comment about “uncouple(ing) the Generated Code API from the underlying in-memory representation”.

                  In any case, I’m still confused by the generated code shown after the section mentioning the use of bit fields:

                  package logpb
                  
                  type LogEntry struct {
                    xxx_hidden_BackendServer *string // no longer exported
                    xxx_hidden_RequestSize   uint32  // no longer exported
                    xxx_hidden_IPAddress     *string // no longer exported
                    // …internal fields elided…
                  }
                  
                  func (l *LogEntry) GetBackendServer() string { … }
                  func (l *LogEntry) HasBackendServer() bool   { … }
                  func (l *LogEntry) SetBackendServer(string)  { … }
                  func (l *LogEntry) ClearBackendServer()      { … }
                  // …
                  

                  The only change I see here from previous generated code is the lack of exported fields. The types are standard Go types. Are “bit fields” referring to how presence is modeled in the wire format? I’m sure I’m being dense about something, but after a mention of bit fields for memory savings, I was expecting to see how they’re implemented in generated code, since code was being shown for everything else. Does the section “Opaque structs use less memory” refer only to memory savings in the wire format? It appears evident that those savings are not in generated code, since the above struct is a simple Go struct with elementary Go types and unexported fields.

                  1. 1

                    I think the part that made bitfield make sense to me was elided in the blog post. In the current implementation there’s an XXX_presence [1]uint32 that’s used as a bitfield. The generated Has methods check this bitfield to see whether the field was set on the received message, and the Set and Clear methods mutate this bitfield. This allows the struct to store the field directly (e.g., for the request_size field of the message) and use the type’s zero value instead of a nil pointer.
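
                    A rough sketch of what that could look like (field and constant names here are made up for illustration; the real generated code differs):

                    package logpb

                    type LogEntry struct {
                      xxx_hidden_RequestSize uint32    // value stored directly, no pointer
                      xxx_presence           [1]uint32 // one presence bit per field
                    }

                    const requestSizeBit = 1 // hypothetical bit index for request_size

                    func (l *LogEntry) HasRequestSize() bool {
                      return l.xxx_presence[0]&(1<<requestSizeBit) != 0
                    }

                    func (l *LogEntry) SetRequestSize(v uint32) {
                      l.xxx_hidden_RequestSize = v
                      l.xxx_presence[0] |= 1 << requestSizeBit
                    }

                    func (l *LogEntry) ClearRequestSize() {
                      l.xxx_hidden_RequestSize = 0
                      l.xxx_presence[0] &^= 1 << requestSizeBit
                    }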

                    1. 1

                      The only change I see here from previous generated code is the lack of exported fields

                      The change you are missing is from RequestSize *uint32 (pointer to uint32) to xxx_hidden_RequestSize uint32 (no longer a pointer). Hope this helps make sense of that section of the announcement :)

                    2. 4

                      Why are the methods generated as GetFoo() instead of the idiomatic Foo()?

                      1. 23

                        If we named the accessor method Foo, that would make the Hybrid API impossible: this intermediate API level exposes both accessor methods and fields, which would then both need to be named Foo. The Hybrid API is an important tool when doing a medium to large-scale migration (smaller repositories don’t need to bother with intermediate stages).
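
                        To spell out the collision with a sketch (hypothetical names): Go forbids a field and a method of the same name on one type, so the hybrid stage, which keeps the exported field, needs differently named accessors:

                        package logpb

                        type LogEntry struct {
                          BackendServer *string // still exported in the hybrid stage
                        }

                        // func (l *LogEntry) BackendServer() string { … }
                        // ^ compile error: field and method with the same name BackendServer

                        // GetBackendServer avoids the clash, so field and accessor coexist:
                        func (l *LogEntry) GetBackendServer() string {
                          if l.BackendServer != nil {
                            return *l.BackendServer
                          }
                          return ""
                        }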

                        1. 2

                          I would imagine to match up with the HasFoo() method.

                          Vaguely reminds me of SOAP generated clients from years back.

                        2. 5

                          I think “Auth bypass” is a bit of an exaggeration for this issue.

                          You need to be able to successfully authenticate to the server (i.e. have a valid SSH key) for this attack to work. (Once you authenticate, you can then spoof having logged in as a different user.)

                          1. 3

                            I don’t think that’s right? AIUI, with a server that’s affected you need three things: 1, the name of the user you authenticate as (if the SSH server is set up to check that; some aren’t); 2, the public key of any user that would be let in; 3, any private key at all.

                            1. 5

                              It’s essentially “auth bypass for improperly implemented servers”. Any PublicKeyHandler which was not stateless is potentially vulnerable. This was mainly because the PublicKeyHandler is used for both “checking if a public key would be alright to use” and “actually authenticating with the corresponding private key”.

                              This is complicated by the ssh.Permissions object, which is the recommended way to pass data between the auth handlers and the ServerConn. Unfortunately it only contains map[string]string properties, and does not provide a way to smuggle other values. This often leads to people storing values from the PublicKeyHandler and re-using them later. Under the previous version of the library, when the vulnerability was exploited, this would always be incorrect, as the handler would be called to check whether a key would be valid, but never again to check the key when actually authenticating.

                              This was due to an internal cache in x/crypto/ssh which stored the Permissions result for a given public key. Note that if you were only using Permissions to return data from the PublicKeyHandler, your server should be fine, because the final handler call was only skipped if that key wasn’t seen before. The new behavior makes sure to call the handler one last time when verifying the key as well (if it wasn’t the most recently seen key).

                              It’s not a full bypass, as you do need access to an SSH private key which is authorized for use on that server (to finish the authentication step), and it would need to be a server which was improperly implemented, but that’s unfortunately many of them. In particular, this is a worry for shared services which allow multiple users to authenticate: they should check their code to make sure they’re not vulnerable.
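
                              To make the stateful anti-pattern concrete, here is a sketch using the x/crypto/ssh types (the server type and its state are hypothetical):

                              package main

                              import "golang.org/x/crypto/ssh"

                              type server struct {
                                // DANGER: state shared across callback invocations. The callback
                                // also runs for keys the client merely offers, so this may not be
                                // the key that ultimately authenticates.
                                lastKey ssh.PublicKey
                              }

                              func (s *server) publicKeyCallback(conn ssh.ConnMetadata, key ssh.PublicKey) (*ssh.Permissions, error) {
                                s.lastKey = key // vulnerable pattern: trusted later, outside this call
                                // Safer: derive everything from the Permissions returned for this
                                // exact key, e.g. its fingerprint, and look that up after auth.
                                return &ssh.Permissions{
                                  Extensions: map[string]string{
                                    "pubkey-fp": ssh.FingerprintSHA256(key),
                                  },
                                }, nil
                              }

                              With the fixed library, the callback is guaranteed to run for the key that actually completes authentication, so state derived from the returned Permissions (as in the Extensions map above) stays attributed to the right key.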

                              1. 4

                                Notably, Gitea and Forgejo both released patches for this yesterday.

                          2. 3

                            I’ve never had this problem, but I would have reached for AppleScript first. I mean Automator, or whatever it’s called today.

                            1. 2

                              AppleScript is still around, and it’s separate from Automator. Which is also separate from Shortcuts, as far as I’m aware.

                              1. 2

                                Oh. I knew AppleScript was still technically separate, but I thought it was basically integrated into Automator now. Thanks for the update!

                              2. 2

                                In my case, I want this program to work on Linux as well.

                                Also, reaching for a tool I know very well is much quicker than getting familiar with AppleScript :)

                                1. 1

                                  Sorry, I wasn’t trying to criticize; I apologize if you took it that way. I was just saying how I would have tried to solve the problem.

                                  Of course now you have me thinking if anyone has ever tried to port AppleScript to Linux.

                                  Also, I just looked, Signal doesn’t offer any AppleScript actions, which makes me sad.

                              3. 6

                                Thank god someone did this. Not updating the conversations on the desktop app is Signal’s most annoying feature.

                                @stapelberg, any plans to package this up?

                                1. 1

                                  No plans to package this, no. It’s a single file, simple enough to build/copy around.

                                2. 2

                                  I replaced my little home server with a $200 Intel N100 mini-PC and Proxmox. It’s been doing great; I really like this class of hardware. 7W when idle, max power consumption is 19W. And while the I/O is very limited (single-channel RAM, one SSD, but USB for external disks), it’s just fine for a home workload.

                                  I’m impressed that AMD’s systems have gotten this good too. At $1070 this box is a lot more expensive than mine, but should perform much better: DDR5 RAM, 2x SATA 6Gb/s, and room for 2.5” drives. Plus the two SSDs.

                                  I’d quibble with their choice of RAID 1 for the SSDs. Proxmox is a bit persnickety about hardware RAID. It’s very good at ZFS though and I wonder if there’s a better way to use two fast disks with ZFS.

                                  1. 1

                                    I’d quibble with their choice of RAID 1 for the SSDs. Proxmox is a bit persnickety about hardware RAID. It’s very good at ZFS though and I wonder if there’s a better way to use two fast disks with ZFS.

                                    The page you link to is about hardware RAID, but I’m not using hardware RAID.

                                    In the proxmox installer I selected ZFS RAID-1. I don’t think that’s an unusual setup?

                                    1. 2

                                      Oh, my apologies, that’s definitely the usual setup. I misunderstood and thought you were using hardware RAID.

                                  2. 4

                                    Any reason you didn’t consider syslinux instead? It’s a lot lighter than GRUB, and fairly well supported for your use case (it’s common for CD boot images, and Alpine uses it for hard drive booting).

                                    1. 2

                                      Yeah, syslinux would be my next choice. I’m just not very familiar with it and, IIRC, could not easily figure out how to integrate syslinux for my case. I’m sure it can work; it just wasn’t trivial.

                                      1. 1

                                        I’ve used Syslinux (with both FAT and Ext2 file-systems) exclusively for the last (perhaps?) 10+ years on everything, from laptops, to desktops, and USB sticks. (Though with my latest desktop, booting in “legacy mode” doesn’t work with the built-in video card, thus I need to use UEFI…)

                                        However, getting back to Syslinux, in essence it’s very minimal and non-intrusive:

                                        • I always use the pre-built variant from kernel.org at https://mirrors.edge.kernel.org/pub/linux/utils/boot/syslinux/Testing/6.04/syslinux-6.04-pre1.tar.xz (it requires a 32-bit Glibc library to be installed on the host); (this is needed only for initializing the partition;)
                                        • copy the first 440 bytes to the MBR of the disk or of the partition; (Syslinux doesn’t care about MBR / GPT partitions, but I’ve seen many BIOS/UEFI implementations that, when they see a GPT partition, assume UEFI only and don’t even look at the MBR, even though in the BIOS it’s forced to “legacy only”;)
                                        • copy a few modules into the syslinux folder on the target file-system; (in your case I think it’s enough to take linux.c32, libcom32.c32, libutil.c32 and ldlinux.c32;) (remove a few of them and see if any are not needed; I also use menu.c32 and chain.c32 and don’t remember which depends on which;)
                                        • run the provided tool syslinux or extlinux, which installs the stage 2 loader in the pointed-to folder; (this file shouldn’t be moved;)
                                        • create a simple syslinux.cfg file where you state where the kernel and initrd live, plus the arguments (see the example below);

                                        If, as you describe in the article, you know the offset of each and every file in the file-system, then you don’t even need the Syslinux tooling; you only need their MBR code and their stage 2 file, plus the .c32 files I’ve mentioned.
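
                                        For the last step, a minimal syslinux.cfg might look like the following (paths and kernel arguments are illustrative and need adapting):

                                        DEFAULT linux
                                        PROMPT 0
                                        TIMEOUT 50

                                        LABEL linux
                                          KERNEL /boot/vmlinuz
                                          INITRD /boot/initrd.img
                                          APPEND root=/dev/sda1 rw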

                                    2. 17

                                      Isn’t the 24-bit truncation happening in this quoted portion of the code?

                                      add  word [gdt.dest], 0xfe00   ; advance bits 0-15 of the destination address
                                      adc  byte [gdt.dest+2], 0      ; carry into bits 16-23; bits 24-31 are never updated
                                      

                                      To extend this to 32-bit sizes you’d need to add another line

                                      adc  byte [gdt.dest+5], 0      ; propagate the carry into bits 24-31 of the base
                                      

                                      See https://en.wikipedia.org/wiki/Segment_descriptor#/media/File:SegmentDescriptor.svg for why it is +5 and not +3 like you might otherwise expect.

                                      1. 7

                                        Oh, thank you very much for this suggestion!

                                        To verify, I removed a few instructions from the error function to make space in the MBR and added the adc instruction as you suggested, and this does indeed seem to fix the issue!

                                        I’ll try and get this properly submitted in the coming days.

                                        Edit: I updated the article to include your fix for the benefit of other interested readers: https://michael.stapelberg.ch/posts/2024-02-11-minimal-linux-bootloader-debugging-story/#update-a-fix — thanks again!

                                        1. 4

                                          Glad to help. This is one of those few times when understanding 16-bit 8086 assembly turns out to still have practical applications.

                                          When I was first reading the article and I got to the part where you showed the source for read_protected_mode_kernel, I thought that you had probably already tracked down the problem to somewhere within this block and were challenging the reader to try and spot the bug. When looking for a 24-bit limit, having add word followed by adc byte practically screams “look here!” so I thought I had likely found it. Then I was surprised when I read the rest of the article and there was no definitive answer for where the limit was coming from.

                                          It is always nice when the fix is a single line. And when you can get it correct on the first try. It is too bad there’s no extra space to insert this instruction without removing something else. You could maybe save a byte by using add byte [gdt.dest+1], 0xfe instead of add word [gdt.dest], 0xfe00, since everything is always 256-byte aligned? I’m sure there are other places to save a byte or two, but nothing I can find right now while I’m on my phone.

                                          1. 1

                                            Yeah, my assembly skills are limited to reading and making educated guesses :)

                                            Unfortunately the extra adc instruction seems to take 5 bytes, so changing the error routine seems like the least-invasive change to me to get enough extra bytes :)

                                            Sebastian Plotz (the author of Minimal Linux Bootloader) also emailed me, confirming that your fix looks good to him as well, and saying he wants to publish a fixed version.

                                      2. 3

                                        Very nice read :)

                                        For UEFI, there is systemd-boot, which comes as a single-file UEFI program, easy to include. That’s how gokrazy supports UEFI boot. Unfortunately, the PC Engines apu2c4 does not support UEFI, so I also needed an MBR solution.

                                        I think this is not entirely accurate: there’s official documentation on running coreboot + TianoCore on APUs at https://github.com/pcengines/apu2-documentation/blob/master/docs/tianocore_build.md and I am using this myself without any complaints. Love(d) the APU platform; still sad that they are EOL: https://pcengines.ch/eol.htm

                                        1. 2

                                          Thanks for the hint, I wasn’t aware!

                                          Maybe I’ll eventually upgrade my APUs to work with UEFI, but AFAIK other embedded boards (ODROID) that people run gokrazy on still only support MBR boot — it’s not just the APU for which we need an MBR bootloader.

                                        2. 4

                                          My zsh starts in 14ms :)

                                          I spent a bunch of time back in 2015 to make it fast: https://github.com/stapelberg/configfiles/commits/master/zshrc?after=8cbbf81da4ea02cb3de551f150d34f34cfda14c0+69 contains the commits (Nov 11 + Nov 12), and it was definitely worth it!

                                          1. 3

                                            Go is really well-suited for this kind of thing :)

                                            My solution for scanning in Go, directly pushing PDFs to Google Drive (meaning you get full-text search): https://github.com/stapelberg/scan2drive

                                            1. 3

                                              Interesting write-up. Like David, I would have (and indeed, have) gone with FreeBSD for the simplicity and it-just-all-works-together peace of mind, but one thing I did note about your piece regards the power consumption numbers.

                                              You mention two 10G network card configurations: I don’t know if you’re aware, but the power consumption of these NICs depends massively on the medium. 10GBASE-T (the RJ45 connector version) consumes significantly more power than 10GBASE-SR/LR (multi-mode/single-mode fiber), which in turn use a little (but non-zero) amount more than a passive DAC (Twinax/Direct Attach). The longer the distance, the more power the -T variant will consume, while the others always use the same amount. (I’m talking a 5-10 watt difference under load.)

                                              You’d need to clarify which you’re using for the power metrics to be useful.

                                              1. 1

                                                Thanks, I know. The power measurements are only included to give the reader a rough impression of the ballpark power consumption for such a setup. The exact watt figures will depend on a lot of detail factors.

                                              2. 1

                                                I’ve been considering moving from ZFS to an rsync + ext4 + dm-raid + dm-crypt + dm-integrity setup. Since I run my NAS on Linux it seems less risky and more convenient to run in-tree code than ZFS. The theoretical downside is that I lose the ability to snapshot a running system, which I think is a showstopper for me. Have you ever tried that setup @stapelberg?

                                                1. 9

                                                  Or you could use FreeBSD, which has had ZFS in tree for over a decade, and nicely integrated with the base system. A lot of the decisions in the article didn’t make sense to me until I realised that they were specific to Linux.

                                                  I built my NAS in 2010, with FreeBSD 9. It had three 2 TB disks in a RAID-Z configuration. I expected to replace the disks periodically, but this was just before the massive floods took out a load of disk factories and so it was a few years before the disks were back to the price I’d paid. If I’d bought a spare set of disks, I’d have been able to sell them six months later and recover the entire cost of the NAS; the price of disks more than doubled.

                                                  That machine has been upgraded to FreeBSD 13 (and will get 14 soon) with in-place upgrades. FreeBSD’s update utility automatically creates a new ZFS boot environment when you run it, so if an update fails you can roll back (you can also select alternative boot environments from the loader, so you can recover even if the kernel doesn’t boot).

                                                  Over the lifetime of the device, I’ve replaced the motherboard (including CPU and RAM) and replaced the disks with 4 TB ones (SSDs are now cheap enough that I’ll probably move to 8 TB flash disks on the next upgrade). It has four drive bays, so I can use zpool replace to replace each disk with a new one, one at a time, without ever losing redundancy (and without downtime).

                                                  The one upgrade that caused some excitement was the motherboard replacement. The old one was BIOS boot, the new one was UEFI without a BIOS compat layer. Fortunately, I reserved 8 GB at the front of each disk as swap space. Swap on ZFS normally works fine, but disk writes with ZFS can allocate memory and so there are some cases where it will deadlock. I ran with swap on ZFS with the older disks for a few years with no issues but decided to not risk it on the newer disks. The swap partition meant I had space to insert an EFI partition on each disk. This is replicated across all of the disks (sadly, I have to do that manually after each update. I think I could avoid that with gmirror but I haven’t tried) so that the system can boot after any disk fails.

                                                  Things like zrepl work out of the box for replicating filesystems. As cloud storage gets cheaper, I have been tempted to add a mirror into something backed by cloud storage, for live backup.

                                                  1. 3

                                                    Or you could use FreeBSD, which has had ZFS in tree for over a decade, and nicely integrated with the base system. A lot of the decisions in the article didn’t make sense to me until I realised that they were specific to Linux.

                                                    Honestly, I run Linux and I don’t see why I would want to do the stuff in this blog post instead of taking advantage of more advanced tools and features. ZFS is provided painlessly by my distro, zrepl Just Works, and kopia gives me nice deduplicated backups in the cloud. Hell, even setting up upgrade snapshotting was super easy despite not being provided by the distro.

                                                    1. 2

                                                      I’ve been tempted to try FreeBSD (vanilla, unraid, etc.), but each time haven’t been able to justify spending “innovation tokens” (as the author calls them) on it. It’d end up duplicating a bunch of things I have to know:

                                                      • boot/services
                                                      • package manager
                                                      • firewall
                                                      • debugging
                                                      • security/isolations

                                                      There’s a simplicity/pragmatism in picking some base tools, and sticking with those where possible. I’m sure FreeBSD is that tool for many, but if I’m already on a Linux stack, ZFS on Linux is a good choice.

                                                      1. 2

                                                        I think ‘innovation tokens’ apply to new technologies, not to mature systems that have been in widespread use for decades. Most of the things on your list are fairly similar:

                                                        boot/services

                                                        Boot on most hardware is via a UEFI boot manager. You can boot FreeBSD with GRUB, though most people don’t. The FreeBSD loader is simpler, and generally the only thing you’ll need to do with it is wait for it to time out (or press the boot-now button if you’re in a hurry).

                                                        Service management is slightly different, though the ‘service {start,stop,…}’ commands are similar to SysV systems (and mentally replacing systemctl with service if you’re in the systemd world is not that much mental effort). Enabling and disabling services is different, but it’s not much to learn: most services are enabled by adding a single line to either /etc/rc.conf or /etc/rc.conf.d/{service name} (your choice), and you can use sysrc if you don’t want to edit the files by hand.

                                                        package manager

                                                        s/apt/pkg/ is about all that you need to do here. Oh, and maybe save some typing, because pkg accepts unambiguous shorthands (e.g. ins for install). It also automatically runs pkg update when you do pkg upgrade, so you don’t get into the thing that keeps biting me with apt, where it starts an update but one of the packages has gone away and downloading fails because I had a stale local copy of the index.

                                                        If you want to build a package set yourself, that’s more complicated, but with Poudriere it’s only a few commands (three steps gets you a complete package set that’s identical to the official one, but if you want to configure options, add additional packages, or use upstream for all of the packages that are unmodified, you need to do more).

                                                        firewall

                                                        This is quite different, but Linux has changed its firewall infrastructure twice in the time that I’ve been using pf on FreeBSD, so it’s likely to be an investment with more longevity. That’s a running theme. I started using FreeBSD 4. If I hadn’t learned anything in between, the majority of what I learned then would still be relevant for running a modern FreeBSD machine (the main difference is that the package infrastructure is now really nice; back then, doing anything other than building from ports caused pain). Configuring WiFi has changed a bit (WiFi was still new and shiny back then and most machines didn’t have it).

                                                        In contrast, I was running a room of Fedora machines at the time (having started with RedHat 4.2) and I had to read the docs to do almost anything on a modern Fedora system. Most importantly, achieving the same result is now done in a different way. The FreeBSD project views the Principle of Least Astonishment (POLA) as a core design principle. If a new feature is added, you need to learn how to use it, but if you’re trying to do something that you could do last year then it should be done the same way (from a user’s perspective, at least. The underlying infrastructure may be totally different).

                                                        debugging

                                                        That’s basically the same: lldb, gdb, valgrind, sanitizers, and so on all work the same way. The differences come with system-introspection things (performance introspection, system-call tracing). Most of these tools on FreeBSD expose DTrace probes, so that’s transferable to macOS, Solaris, and even Windows.

                                                        security/isolations

                                                        This is one of the places where it is quite different. The TrustedBSD MAC framework can be used in a way that’s equivalent to SELinux (and in a few other ways; it’s also the basis for JunOS’s code-signing infrastructure and macOS / iOS’s sandbox framework).

                                                        There’s no equivalent of seccomp-bpf (largely because attackers love having a tool for injecting Spectre gadgets and code reuse attack gadgets into the kernel).

                                                        If you use containers, Podman gives you the same experience on FreeBSD as on Linux (and can even run Linux containers in a lot of cases) and is a drop-in replacement for Docker (alias docker=podman basically works).

                                                        If you’re writing compartmentalised software, Capsicum has no real alternative on Linux but is the only thing that actually makes it easy to reason about your security properties.

                                                        1. 2

                                                          Thanks for your insightful response!

                                                          I think ‘innovation tokens’ apply to new technologies, not to mature systems that have been in widespread use for decades.

                                                          Fair point, I suppose I meant the sentiment of “time spent learning a technology that isn’t required to solve the problem at hand”. But I agree and can see there’s crossover with existing tech I’m already familiar with.

                                                          Service management is slightly different, though the ‘service {start,stop,…}’ commands are similar to SysV systems (and mentally replacing systemctl with service if you’re in systemd world is not that much mental effort).

                                                          I do appreciate the times I find myself on a SysV system, if only because I seem to be able to introspect all parts of the boot process. systemd gives me lots (declarative, easy dependencies, sandboxing), but I do feel “alienated” from the machine.

                                                          The FreeBSD project views the Principle of Least Astonishment (POLA) as a core design principle. If a new feature is added, you need to learn how to use it, but if you’re trying to do something that you could do last year then it should be done the same way (from a user’s perspective, at least. The underlying infrastructure may be totally different).

                                                          This seems to be one of FreeBSD’s strongest points, as compared to a typical Linux installation, which has n different configuration surfaces and ways of doing things.

                                                          If you’re writing compartmentalised software, Capsicum has no real alternative on Linux but is the only thing that actually makes it easy to reason about your security properties.

                                                          Capsicum looks great. I need to read up on it more, but the closest Linux equivalent looks to be user namespaces. Sadly, user namespaces are often disabled on distributions, and have little-to-no usage in userspace applications. systemd-managed services often have half-decent sandboxing (which systemd can do without user namespaces, since it runs as root). One flaw with the systemd hardening approach is that it relies on declaring which privileges to drop, which feels like the wrong approach. Ideally, the privileges that are needed should be declared, which is a lot easier to audit and get right.

                                                          I’ll check out Capsicum some more, thanks for the pointer.

                                                          1. 2

                                                            I think the closest thing to Capsicum on Linux is a combination of seccomp-bpf and Landlock. I’ve used seccomp-bpf and Capsicum to enforce the same sandboxing model; the Capsicum code was shorter, ran faster, and provided finer control. Oh, and cloud providers are increasingly disabling seccomp-bpf because it turns out to be an excellent way of injecting Spectre gadgets into the kernel.

                                                            Linux namespaces are closer to jails on FreeBSD. Podman uses user namespaces for unprivileged containers on Linux, normal namespaces for privileged containers. It uses jails on FreeBSD. I remain somewhat unconvinced by user namespaces. FreeBSD restricts jails to root in part because you can nest them and then escape from the inner one by having an unprivileged process race renames in the filesystem. This is not a problem if only root can create jails but it is a problem if the user on the outside is unprivileged. I have not seen any discussion of this kind of attack from the Linux folks working on user namespaces and so I suspect that they have introduced a big pile of security vulnerabilities that are impossible to fix without making the VFS layer painfully slow.

                                                      2. 1

                                                        Or you could use FreeBSD

                                                        Your ZFS + FreeBSD use description is compelling; sounds like you’ve got a real Ship of Theseus over there. Transitioning to a FreeBSD host is not something I can easily do, unfortunately. Even if I could, and rationally I would, I have to admit that I still feel biased towards the stackable Linux solution, at least in a systems-designer sort of way. There is something satisfying, albeit impractical, about being able to reproduce nearly all of ZFS’s feature set from a set of simpler orthogonal primitives. Though I don’t feel exactly the same way about how Linux containers are implemented.

                                                        zpool replace to replace each disk with a new one, one at a time, without ever losing redundancy (and without downtime).

                                                        FWIW you can replace disks one by one and grow an mdadm raid array without losing redundancy and with no downtime either. The hitch is that once the array is finished growing, you have to unmount the actual file system before you can grow it. It’s a relatively quick operation, especially in comparison to growing the array, but the difference between zero downtime and non-zero downtime is virtually infinite.

                                                        1. 1

                                                          Linux has supported online resize¹ with resize2fs for ext4 since at least kernel 2.6.


                                                          ¹ growing only, you still have to unmount in order to shrink.

                                                          1. 1

                                                            That is incredible. That makes the primitives-based Linux stack address all of my use cases. Thank you for that info!

                                                      3. 2

                                                        I have tried rsync + ext4 + dm-crypt, yes, but not with dm-raid and dm-integrity. I am using dm-raid elsewhere and have no concerns about it. I don’t have experience with dm-integrity.

                                                        Regarding snapshots: I tried using LVM snapshots many years ago and it wasn’t working reliably. It’s probably better these days, though :)

                                                        1. 1

                                                          How important is dm-integrity, really? I have moved and compared terabytes (using WD Red + dm-raid + LUKS + ext4) and not a single bit was flipped, even when Ethernet and non-ECC RAM were involved (however, my NAS has ECC RAM). Is bit rot a real issue? If one bit flips every 10 years, that means that in my entire life I get no more than a handful of bitflips. While it’s not perfect, I think I can live with that - especially since the bitflip will probably be inside a video and might not even be noticed. Of course, it would be catastrophic if a bit flipped in the inode of the root directory of the NAS.

                                                          1. 2

                                                            The chance of bit flips in your storage isn’t static, and how it’ll change over time isn’t easily predictable. After my NAS power supply died, I couldn’t access those disks for a while, but now I’ve plugged them into my PC with a SATA card. I ran a scrub, and 3 of the 4 drives had erroneous data, caught by ZFS’s checksumming. None of the drives actually reported any errors. And thanks to raidz1, the incorrect data was fixed.

                                                            This may have been transient, or maybe the drives are just gonna get worse and give me even more bad data. Thanks to checksums and regular scrubbing, I’ll know.

                                                            1. 1

                                                              I’m worried about the complexity of ZFS. I feel like the probability of data loss due to a ZFS bug is higher than due to a bitflip, and I know people who have indeed lost data due to ZFS bugs. Also, ZFS has a lot of features I don’t want or need and it requires a fairly fast CPU. I would prefer a stack with ext4, to keep the file system complexity in check and for great performance. And I like the abstraction layers and modularity of chained block devices. So, perhaps my setup could look like this [1]:

                                                              /dev/sda - partition - dm-integrity \
                                                                                                   >- dm-raid (RAID1) - dm-crypt (LUKS) - ext4
                                                              /dev/sdb - partition - dm-integrity /
                                                              

                                                              In case of a bitflip, the dm-integrity block device would have a read error and RAID1 would make sure that the data in question is read from the other disk. My current setup is like this, but without dm-integrity. The partitions on the bare disks are required to control the size and make sure that the BIOS recognizes the partition and doesn’t do crazy shenanigans [2]. But… Perhaps I just have to give in and use ZFS after all.

                                                              [1] Seems like that is roughly what’s proposed here, though they encrypt before the RAID1, which I find silly because you encrypt twice.
                                                              [2] https://news.ycombinator.com/item?id=18541493

                                                              1. 6

                                                                That looks a lot more complex and fragile to set up, with many more layers than just setting up a zpool in ZFS. Plus you’re not just getting reliability features, but things like filesystem-level snapshots and boot environments, for free.

                                                                I’m running ZFS on a 10-year-old Atom (J1800) and I haven’t run into performance issues with ZFS - what impacts me more is things like no AES-NI.

                                                                1. 3

                                                                  To add another data point:

                                                                  My NAS was using an AMD E-350. I upgraded it to a Pentium Silver J5040, mostly because I wanted more RAM. With the right modules, the Pentium will take 64 GiB of RAM, which is enough that I can turn on deduplication and have a lot of ARC space left over.

                                                                  For any disk operations, the CPU load remains fairly low, the bottleneck is the spinning rust. Adding a separate log device on an SSD would speed up writes (I’ve done that on another machine, but not for home).

                                                                  With SSDs, the CPU might be a bit slow, but it can do zstd stream compression fast enough that I probably wouldn’t notice (same with AES, since it has AES-NI), especially since ZFS encrypts and compresses blocks separately and so can use all four cores when doing heavy I/O.

                                                                  It tends to run fairly read-heavy workloads and zstd and lz4 are both much faster for decompression than compression. With the amount of RAM I have, a lot of reads are hit from compressed ARC and so involve zstd decompression. Performance tends to be limited by GigE or WiFi there.

                                                                  1. 1

                                                                    In my case, the NAS I’m using is small (two bays, so only running a zmirror), and with 8 TB, no dedupe means 8 GB is more than enough (even if I could take it up to 16). I figure by the time I’m due for an upgrade, flash will be cheaper than spinning rust at capacities I can justify (on top of upgrading from ye old Bay Trail).

                                                                2. 4

                                                                  Splitting the dm-integrity like that is going to weaken your guarantees. Dm-integrity has two modes. If layered on top of dm-crypt, it stores the HMACs from the encryption and uses these for integrity. If used independently, it uses a separate hash function. This is not tamper resistant.

                                                                  If you’re using it for error detection, it’s the wrong tool for the job: it’s not an ECC scheme, it’s a cryptographic integrity scheme (though one that’s still vulnerable to trivial block-level replay attacks, which are harder with ZFS’s encryption model).

                                                                  1. 4

                                                                    ZFS has been in production since 2006, so it’s had a lot of years of production use under its belt.

                                                                    1. 3

                                                                      I know people who have indeed lost data due to ZFS bugs. Also, ZFS has a lot of features I don’t want or need

                                                                      I have an impression that the data loss bugs are mostly or entirely in the newer, optional features from the last ~10 years, so I leave those all disabled (originally also because I wanted to keep my file systems compatible between different ZFS implementations, which had different sets of optional features, but since then all the implementations I might use have been unified as OpenZFS).

                                                                      and it requires a fairly fast CPU.

                                                                      I daily-drive ZFS on a desktop computer with a CPU from 2011, and it works fine even with the CPU’s virtual cores disabled to mitigate some vulnerability and ZFS set to use the ‘slow’ gzip-9 compression… but I admit that the CPU is a 3.4 GHz Core i7, not an Atom; while it’s old and on the low end for an i7, I suppose it still counts as fairly fast.

                                                                3. 1

                                                                  Regarding snapshots: I tried using LVM snapshots many years ago and it wasn’t working reliably. It’s probably better these days, though :)

                                                                  I forgot about LVM snapshots, thank you for the pointer!

                                                              2. 5

                                                                It’s always great to read this author’s posts. He lays out problems simply, finds their essence, and produces simple solutions. The average blog post on the web that I read is willing to accept all sorts of complexity in its solutions to solve an immediate need, without seemingly much consideration for what maintenance will look like, whether they’ll be able to diagnose issues, etc.

                                                                Here we see the author going against the grain of raidz, zfs-send/restic/Borg, HomeAssistant/nodered, NixOS, etc.

                                                                Bravo to simplicity.

                                                                I’ve been considering creating a NAS, so thanks for the tips.

                                                                I had also planned to use ESPHome and fork my own https://github.com/stapelberg/regelwerk if ever I jumped into IOT, so it was good to see how simple it came together: https://github.com/stapelberg/regelwerk/commit/8b81d7a808b1d76a0e96bdb4ab43964623d133c4

                                                                My daily backups run quicker, meaning each NAS needs to be powered on for less time. The effect was actually quite pronounced, because figuring out which files need backing up requires a lot of random disk access. My backups used to take about 1 hour, and now finish in less than 20 minutes.

                                                                zfs-send or btrfs-send is likely a faster option than rsync for users that are happy to rely on those, though I appreciate the robustness of just relying on rsync, as you mention.

                                                                Ubuntu Server comes with Netplan by default, but I don’t know Netplan and don’t want to use it. To switch to systemd-networkd, I ran: […]

                                                                👍

                                                                IPv6Token=0:0:0:0:10::253

                                                                Glad to see IPv6 Tokens in use in homelabs!
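
                                                                For readers who haven’t used tokens: they pin the interface identifier of the SLAAC address while the prefix still comes from router advertisements. A sketch of a systemd-networkd unit (the interface name is invented; newer systemd spells this Token= in the [IPv6AcceptRA] section):

                                                                [Match]
                                                                Name=enp1s0

                                                                [Network]
                                                                IPv6AcceptRA=yes
                                                                # stable interface identifier: the address ends in 10:0:0:253
                                                                # regardless of which prefix the router advertises
                                                                IPv6Token=0:0:0:0:10::253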

                                                                My unlock.service constructs the crypto key from two halves: the on-device secret and the remote secret that’s downloaded over HTTPS.

                                                                Nice idea.

                                                                If you’re okay with less compute power, but want more power efficiency, you could use an ARM64-based Single Board Computer.

                                                                For anyone interested in this approach, a good combo is:

                                                                x86 will give you fewer troubles though.

                                                                1. 6

                                                                  I’m glad you like i3!

                                                                  The MacBook is a truly fascinating machine in many dimensions, most notably battery life: https://michael.stapelberg.ch/posts/2021-11-28-macbook-air-m1/

                                                                  I’m a fan of the Asahi project and all the great work the team is doing. It’s very inspiring!

                                                                  But, Asahi Linux just isn’t quite there yet today. Speakers are about to become supported, I think, but connecting external displays isn’t possible (not sure when it will be). Microphone and webcam are not supported, so no video calls unless you connect all peripherals via USB.

                                                                  Aside from all that, I noticed smaller hangs in the day-to-day UX: for example, when scrolling around in Google Maps, my browser’s GPU process would hang and crash from time to time… Probably not too many people are using it as a daily driver yet.

                                                                  Also, keep in mind that Asahi prefers Wayland over Xorg, so you wouldn’t run i3 on Asahi Linux, you would run sway. If your workflow is supported by Wayland, that might be a non-issue, or a dealbreaker — definitely check before buying new hardware.

                                                                  I would recommend buying the framework laptop for now, and maybe check back in every year or so to see if switching to Asahi on a Mac would work (if you’re still interested at that point).

                                                                  1. 1

                                                                    I’m not sure if the author has an older OS release, or took a screenshot while using a non-optimal connection? The screenshot in the article shows the display at the “more space” setting in macOS, producing a “3072x1728” UI.

                                                                    I have the same 6K display on macOS 13.4.1 (i.e. latest stable) and it shows “3072x1728” as the “default” middle option, with two options each to the left and right, the highest “more space” being “3840x2160”.

                                                                    1. 1

                                                                      I’m running macOS 13.4.1, too! Not sure why you get different options.

                                                                    2. 11

                                                                      My husband loves his Dell 8K, but boy is it plagued with quality issues. He puts up with them all, though, because it’s worth it to him.

                                                                      Biggest issue so far: the Dell support experience for a failed display did not feel like we were getting support for a $4k monitor. After many failed replacements, they even lost track of what they sent me, accused me of fraud, and tried to cancel the support ticket while I still had a broken monitor.

                                                                      Another issue, which is more amusing than bad, is sometimes you do something not dpi aware, like run memtest, and you end up with this: https://www.reddit.com/r/Monitors/comments/f96hsy/memtest86_on_8k_dell/

                                                                      1. 3

                                                                        I agree that the 8K monitor has many issues.

                                                                        I had to exchange one, and my experience with Dell support was good overall. They immediately replaced the monitor.

                                                                        For some reason, they even sent a second replacement, though! It was a bit of a hassle to get that sorted out.

                                                                        1. 3

                                                                          FWIW, the bad support experience I mentioned was the second one; the first was flawless. Mind you, we had to replace it twice, so…