Getting 1-2s from boot to interactive desktop on cold machines, and the 300ms that Clear Linux got for VMs, should really be the baseline, so that one can ignore all the state-bleed headaches of suspend/resume.
Back when I was working on Android bringup (around 4.x) we did a lot of investigation into reducing boot times, which could be as bad as 40+ seconds at the time. In the end there were very few excuses other than poor choices and partition signature checks preventing the system from being interactive within 5-10 seconds. Nearly all the time was spent waiting for the PackageManager to scan and verify the infinitely many little APKs, made worse by the boot animation (just look at the thing .. what the pho .. https://android.googlesource.com/platform/frameworks/base/+/master/cmds/bootanimation/BootAnimation.cpp ) that wouldn’t even have had to be there in the first place.
My Stern and Pinball Brothers pinballs (both Linux based) suffer equally today. Older generations self-check and are ready in a few seconds. The Linux ones take at least 30 seconds, often more. The PB one (Alien) even still fights a race condition with display detection thanks to /dev/dri/card0,1 randomly flipping (systemd plus firmware uploads, the same nondeterminism that forced the ‘solution’ of breaking eth0 etc. naming in favour of leaking topology), with the graphics code giving up if the wrong node has the wrong display attached.
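(That particular flip can at least be pinned down with a udev rule keyed on the stable parent device rather than the probe-order cardN name. A rough sketch only — the parent name below is a made-up platform-specific example, and modern systemd already ships /dev/dri/by-path/* symlinks for exactly this reason:)

    # /etc/udev/rules.d/99-pin-display.rules (sketch)
    # Stable symlink for the internal display controller, whether it
    # enumerates as card0 or card1; the graphics code opens this instead.
    SUBSYSTEM=="drm", KERNEL=="card[0-9]*", KERNELS=="1c000000.display", SYMLINK+="dri/card-internal"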
Lots of modern Linux distributions are remarkably bad at this. I’m not trying to bash them. For one thing, it’s not their fault: in 95% of cases systems boot fast enough for their users, so it’s perfectly reasonable to make anyone in the remaining 5% work for it a little. For another, modern hardware is fast enough that they can generally afford to be bad.
The last time I did one of these was around 2017, I think, which I now realise is a hilariously long time ago but we all know that 2020, 2021 and 2022 never happened so it was practically yesterday. I was trying to goad a Yocto image to boot fast (as in, sub-second). Some of the things I learned on that occasion included:
A bunch of udev event handlers exec-ed sed and grep (I think via /bin/sh, even). I shaved a good half-second, I think, just by doing less forking on a fresh, uncached filesystem. Actually, much to my surprise, a good amount of boot-time forking (not just in udev event handling) came from gratuitous cats (i.e. cat foo | grep bar; see the snippet after this list). Boot-time forking, yes, I’ve just said it out loud, too, shut up.
By some uncanny module initialization accident, the MMC driver was one of the last ones to be initialised, and that operation was slow. I realised that pretty late in my process. By that time the board was spending about 10% of the load time just waiting for the MMC devices to start so it could do filesystem stuff.
I don’t know if this is still the case but the kernel insisted on calibrating the udelay loop on every boot on exactly the same hardware. Someone figured out this is plain silly a long time ago, so you can pass the loops-per-jiffy value on the kernel command line (also shown below), but guess how well-documented this is/was.
Some of the whackiest stuff happened via U-Boot, which is commonly left untouched because it’s Weird Bootloader Stuff, so people just do whatever the manufacturer’s boot scripts do. Nine out of ten boards load the kernel and the dtb to whatever location the first working snippet from Serverfault or wherever happens to include. So the first thing that happens afterwards is a dog-slow memmove to the right address. This is probably not relevant for the Pi but I think I’ve seen some Pi clones where it mattered. Then SPL also allocated a huge memory pool (by RAM speed standards), most of which went unused, and reducing it shaved off a surprising amount of time (like half a second IIRC?).
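For the forking, the fix is usually mechanical; a minimal sketch (file and pattern are placeholders):

    # Two forks: one for cat, one for grep.
    cat /proc/cmdline | grep -q quiet
    # One fork: grep reads the file itself.
    grep -q quiet /proc/cmdline
    # Zero forks: shell builtins only.
    read -r cmdline < /proc/cmdline
    case "$cmdline" in *quiet*) : quiet boot ;; esac

And for the udelay loop, the kernel accepts a pre-computed loops-per-jiffy value: take the number from the calibration line in a previous boot’s dmesg and append it to the kernel command line (the value below is just an example; it’s specific to the exact CPU and clocks):

    lpj=4980736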
Some of these are probably just “local” artefacts, e.g. I’m sure there’s hardware where that large memory pool actually makes sense. But lots of what I did by poking at U-Boot and ticking items off in systemd-analyze was the kind of stuff that could be fixed with a better customization interface. Unfortunately, there’s an entire consulting industry that lives off of making all this clunky tooling work somehow, and it employs most of the people who could fix it, so the motivation to do it is somewhat lacking.
I’m at the point now where my boot times are fast enough but my stupid ‘hi-tech’ Dell monitor takes aaaages to fire up. Makes switching between inputs and testing stuff infuriating.
Displays are one of those areas where we get slower because dependent tasks are modelled as independent ones that sabotage each other, compounded by topology (because of Link Training, Bandwidth and HDCP, cable quality and length matter).
Some of this is demonstrated in the article with the gains from disabling various forms of probing, but consider the number of times the displays are modeset: once (BIOS/UEFI), then modeset again (boot animation / splash screen), then modeset again (Display Manager), then modeset again (Display Server) and modeset again (Display Server).
Wait, why the two last modesets? Well, some monitors won’t give you their EDID without something being scanned out. Without the EDID you don’t know which monitor is plugged in where, nor which resolutions / timings + options (HDR, VRR) the user is expecting.
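(For the curious: you can peek at what the kernel already knows per connector without triggering a modeset. Connector names like card0-HDMI-A-1 vary per machine, and edid-decode is a separate tool:)

    # Connection status of every connector:
    for c in /sys/class/drm/card*-*; do
        printf '%s: %s\n' "$c" "$(cat "$c/status")"
    done
    # Decode one connector's raw EDID blob:
    edid-decode < /sys/class/drm/card0-HDMI-A-1/edid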
There’s been a lot of work put in to making it possible to avoid them! With i915.fastboot / amdgpu.seamless and support in plymouth+gdm+mutter, you can get down to zero extra modesets (keeping the mode all the way from UEFI).
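(Both of those are plain module parameters, so trying them is a one-liner; whether the driver can actually keep the firmware-set mode still depends on the hardware:)

    # On the kernel command line:
    i915.fastboot=1 amdgpu.seamless=1
    # or persistently, e.g. via /etc/modprobe.d/fastboot.conf:
    options i915 fastboot=1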
From a distro point of view, one major difficulty with such improvements is that they’re not universal: 99% compatibility is not enough, nor is 99.9% or even 99.99%. Then you need the identification of all the issues, integration, testing, combined testing, all of which adds up as work that takes time.
And to the list of lost-performance causes, I’d add compression, which can be inefficient or poorly balanced.
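(Poorly balanced as in: the ratio you win on storage gets paid back with interest at decompression time on a slow core. Easy to sanity-check on your own artefacts; file names here are placeholders:)

    # Compare unpack cost of the same payload; on slow cores this often
    # dominates whatever I/O time the better ratio saved.
    time gzip -dc image.gz  > /dev/null
    time zstd -dc image.zst > /dev/null
    time xz   -dc image.xz  > /dev/null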
Sure, most of them are very hardware- or configuration-specific. What’s nasty is that both knowing about them and improving them revolve around remarkably obscure knowledge and configuration mechanisms.
E.g. both the kernel and dtb locations are known to the kernel boot code, but not to the bootloader, even though there are mechanisms through which they can be made known (if push comes to shove, the image header). Even with a read-only rootfs, optionally persisting calibrated lpj values would be possible. In both cases, what actually happens is that a well-paid consultant looks at early boot logs and just pops the right values in the right places. About half the time said right places aren’t accessible through a menuconfig interface; you just kind of know what to patch in a Yocto layer.
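(The kernel/dtb half of that usually boils down to a handful of U-Boot environment variables. A sketch only — the addresses and device numbers are board-specific examples, not values to copy:)

    # Load straight to the addresses the image will run from, so U-Boot
    # doesn't have to memmove everything into place afterwards.
    setenv kernel_addr_r 0x40480000
    setenv fdt_addr_r 0x4fa00000
    load mmc 0:1 ${kernel_addr_r} Image
    load mmc 0:1 ${fdt_addr_r} board.dtb
    booti ${kernel_addr_r} - ${fdt_addr_r}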
I used to do that for a living (most of the “embedded” “development” jobs in my area are like that; I learned all the skills I needed for it in that hot summer of 2004 when I installed Gentoo from stage1), or rather I tried to: it was so pointless and boring that it gave me zero motivation to put up with your average corporate assholery and toxic management.
There’s no grand conspiracy of consultants at work; it’s just one of those cases where people’s jobs depend on them not really seeing the point, so they kind of ignore it. E.g. a colleague of mine figured we could try to upstream more generic versions of some of our solutions. Management generally encouraged upstreaming things, and they told him it’s okay if he wants to do it in his spare time, but the company doesn’t really need it since the manual workarounds are pretty easy. He got the same answer when he asked for a few hours to automate some performance and general bring-up hacks.
Thing is, the “workarounds are pretty easy” part wasn’t actually true. It was easy for the one or two of us who knew these obscure things, and for whoever figured they’d ask us; everyone else googled bullshit like this for hours.
Ugh… sigh… you’re reminding me that there’s some stuff I’ve been working on with Zephyr that I really ought to upstream.
What’s SPL here? To me, it’s the Solaris porting layer used by ZFS on Linux, but were you using ZFS in embedded?
It’s U-Boot’s Secondary Program Loader in this context.
This is a good excuse to plug my regular pet peeve that embedded doesn’t necessarily mean low-resources or low-power. An embedded computer is just a specialised computer that has a dedicated function within a larger electrical or electronic system. There are 128-core embedded systems with PCIe accelerator cards out there, and who knows, some of them might be using ZFS :-).
I don’t think I ever used ZFS in an embedded system, but I wouldn’t mind!
https://docs.u-boot.org/en/latest/usage/spl_boot.html
FreeBSD/amd64 used to calibrate the TSC and LAPIC frequencies by doing measurements around one-second sleeps! :D
Now I want to do some of this for my Arcan displays.
First though, I need to port arcan to buildroot.
This will be incredibly handy for a Pi-based car infotainment system I’m planning on making. I always thought that if there was one thing that’d make it unviable, it would be long boot times (my Pi 5, booting NixOS off an NVMe SSD, takes ~20s, but even that is too long for a car system).