1. 24
    1. 27

      Why does systemd give up by default?

      As an embedded software developer: if your service crashes more than 5 times, restarting it is a futile exercise. Either the hardware is broken or you have a severe software bug, one frequent enough that it can and should be analyzed and fixed; the likelihood that the process crashes a 6th time is almost 100%. What you want to catch are those super rare bugs that happen once per leap year: restart the service and, if your system is designed properly, the user won’t even notice.

      You also want to prevent endless restarting of a service, because it causes a whole bunch of activity across the system. For example, it may emit log messages to the flash drive until the flash cells die, or it may toggle relays until the contacts are worn down and break. You generally want to minimize this kind of activity that was not anticipated by the system designers.
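
      For concreteness, the “give up after 5 crashes” behavior corresponds to systemd’s start rate limiting. A minimal sketch of the knobs involved (my example; the values shown are the usual defaults of 5 starts per 10 seconds):

        [Unit]
        # Hypothetical unit: if it gets (re)started more than 5 times within
        # 10 seconds, systemd marks it failed and stops restarting it;
        # systemctl reset-failed clears the counter.
        StartLimitIntervalSec=10s
        StartLimitBurst=5

        [Service]
        ExecStart=/usr/local/bin/example-daemon
        Restart=on-failure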

      1. 8

        for example it may emit log messages to the flash drive until the flash cells die

        This!!!! We want to write logs to persistent storage because it’s vital to be able to debug an issue which happened before a reboot (especially if the issue, say, causes a reboot!), but it creates a painful trade-off between log verbosity and flash lifetime, and tight failure loops become disastrous.

      2. 12

        It’s like the children have forgotten our lessons about exponential backoff. That sound you hear is the herd coming thundering for you.

        1. 11

          I find it interesting that OpenBSD’s people believe in NOT letting services restart, assuming that a service will only crash if it’s under attack, and a stopped service can’t be exploited.

          I wish the daemons I ran were so reliable that failures only happened when under attack.

          1. 9

            Restarting a few times is often fine. There are basically three kinds of attacks:

            • Those that are deterministic. Restarts don’t matter, attacker always wins.
            • Those that are probabilistic. Some small proportion of victims lose each time; restarts increase this.
            • Those with a probabilistic exploration phase. Each restart gives the attacker some data that they then use in the next phase.

            The last category is interesting and is the basis for things like ASLR: you learn something on the way to the crash, but you can’t then use it after the restart. At least in theory, often there are holes in these things.

            Things like the Blind ROP attack need a few thousand restarts on average to succeed. With a 10 ms process creation time, that’s a few seconds. The Stagefright vulnerability was a problem because Android had only 8 bits of entropy and automatically restarted the service, so just using a constant value for your guess and trying repeatedly would succeed more than half the time within 128 guesses, and that took well under a second.
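
            A rough back-of-the-envelope for the 8-bit case (my sketch; the exact odds depend on whether the layout is re-randomized on each restart, which the numbers above don’t spell out):

              # Brute-forcing 8 bits of ASLR entropy, i.e. 256 possible layouts.
              # Scenario (a): layout re-randomized on every restart, attacker sends a
              #               constant guess each time.
              # Scenario (b): layout fixed across restarts, attacker enumerates values.
              def p_constant_guess(guesses, entropy_bits=8):
                  n = 2 ** entropy_bits
                  return 1 - ((n - 1) / n) ** guesses   # chance of at least one hit

              def p_enumerate(guesses, entropy_bits=8):
                  n = 2 ** entropy_bits
                  return min(guesses / n, 1.0)          # each guess rules out one value

              for g in (128, 256, 512):
                  print(f"{g:3d} guesses: constant guess {p_constant_guess(g):.0%}, "
                        f"enumeration {p_enumerate(g):.0%}")
              # Either way, at roughly 10 ms per restart the whole attempt takes seconds.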

            Exponential backoff is often a good strategy for this kind of thing, but if availability is not a key concern, then stopping and doing some postmortem analysis is safer.

            It’s also worth noting that, even when not actively attacked, reconnecting clients will often do the same thing repeatedly. If a particular sequence of operations leads to a crash, restarting may still crash immediately when the client that triggered the bug reconnects and does the same thing again.

            1. 3

              That’s a nice philosophy; I’ve never had a daemon in OpenBSD base crash on me, basically ever.

              Do you have any quotes or discussion that outlines that design philosophy though? Curious to see it from the source.

              1. 3

                Please read the linked email from Theo again. He writes:

                If software has a bug, you want to fix it. You don’t want to keep running it.

                No mention of attacks.

                Bugs can and will happen. If you auto-restart services, you might miss a bug because the system gives you the feeling that everything is all right. I am a BIG fan of fail loud and fail hard. Then you can analyze the situation and fix it.

                1. 3

                  Moreover, Theo says that it is a bad default. If your use case calls for it, go ahead.

                  https://marc.info/?l=openbsd-misc&m=150786327012681&w=2

                  One of the reasons why it is a bad default is that it lets the attacker retry exploiting the bug in a short period of time.

                  https://marc.info/?l=openbsd-misc&m=150795572208356&w=2

                  1. 3

                    Mention of attacks. Thank you for the further links from the rest of the thread.

                  2. 2

                    OpenBSD developers also like to say that the only difference between a bug and a vulnerability is the intelligence of the attacker.

                2. 10

                  Please don’t.
                  You want exponential backoff by default, and to eventually give up entirely.
                  I don’t want to get an email/notification for every restart (potentially several per second) of a completely broken service. I don’t want the log spam, I/O, and CPU load. I don’t want to get rate limited or blocked because some service makes potentially costly API calls during startup, hammering the other side until someone does a systemctl stop. And I don’t want a thundering herd by default.

                  You can still enable it for your systems or services if needed. I do have it on for some, but even then I want systemd to back off, so it’s at most one restart per second.
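
                  For reference, a hedged sketch of what that can look like in a unit file; RestartSteps= and RestartMaxDelaySec= only exist on newer systemd (v254+), so treat those two as optional:

                    [Service]
                    ExecStart=/usr/local/bin/example-daemon
                    Restart=on-failure
                    # Never restart faster than once per second.
                    RestartSec=1s
                    # systemd 254+: grow the delay over several steps up to a ceiling,
                    # which approximates exponential backoff.
                    RestartSteps=10
                    RestartMaxDelaySec=5min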

                  1. 5

                    I don’t think restarting a service in general is the right thing to do.

                    A service should only die in two cases:

                    • if there is a fatal external error, e.g. the disk is bad or the RAM is bad.
                    • if there is a programmer error, e.g. a failing assert or a segfault.

                    In those cases I don’t want the service to restart. I want it to stop indefinitely and notify the operator. Restarting does nothing to solve those problems and can exacerbate them.

                    Transient errors should not cause services to die. The service should have its own domain specific back-off logic for transient errors.
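
                    A sketch of what such in-process handling could look like (names and numbers are mine, purely illustrative):

                      import random
                      import time

                      class TransientError(Exception):
                          """Placeholder for whatever this service considers retryable."""

                      def call_with_backoff(op, base=0.5, cap=30.0, attempts=8):
                          # Retry a transient operation with exponentially growing, jittered
                          # delays instead of exiting and letting the service manager
                          # restart the whole process.
                          for attempt in range(attempts):
                              try:
                                  return op()
                              except TransientError:
                                  delay = min(cap, base * 2 ** attempt)
                                  time.sleep(random.uniform(0, delay))  # full jitter
                          # Still failing after the retry budget: now dying loudly is fair.
                          raise RuntimeError("transient error persisted; giving up")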

                    1. 3

                      There is at least a third, transient case that can cause a service to die: it tried to use more than the currently available resources allow, and the system killed it to prevent that, e.g. the OOM killer got it. In that case, personally, I would prefer to have most services restart, following an alert to the administrator.

                      1. 2

                        If the service dynamically requested resources as a result of an incoming request or similar, and that request for resources failed because none were available, that should be considered a transient failure.

                        If it requested resources as part of initialization and it failed because none were available, that should be considered a permanent failure.

                        If it died because the OOM killer killed it, I would consider that evidence of a system-wide failure and I wouldn’t want the service to restart in that case. It’s indicative of a system-wide misconfiguration, i.e. programmer error.

                        1. 2

                          It’s tricky when requests for resources always succeed due to memory overcommit, but then the OOM killer kicks in when you actually try to use those resources.

                          1. 1

                            If your application is causing overcommit and triggering the OOM killer, I consider that a system-wide misconfiguration. If you allow the application to restart automatically, the same issue will happen again.

                            1. 1

                              For those who don’t know, that behavior can be disabled.
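
                              For reference, one common way to do that on Linux is via the overcommit sysctls; a sketch, with illustrative values that carry their own trade-offs:

                                # Mode 2: refuse allocations beyond swap + overcommit_ratio% of RAM,
                                # so allocations fail up front instead of the OOM killer firing later.
                                sysctl -w vm.overcommit_memory=2
                                sysctl -w vm.overcommit_ratio=80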

                      2. 3

                        IIUC this also affects timer services, where I don’t necessarily want them to retry forever. I think my ideal default would be something like: timer services restart a few times, maybe only once; non-timer services restart forever with some reasonable backoff.

                        Now systemd doesn’t really differentiate between timers and non-timers (timers just start a service). In many cases you can use “oneshot” as a proxy. But for boot-critical oneshot services I would also like them to retry forever.