1. 16

A good rant on debugging a (basic looking!) ops issue that spiralled into systemd-kafka-land. I’ve got a few timesyncd around here, working without any issue but this does not look too good. :/

The Twitter thread where the author explores solutions is interesting too. :)

  1.  

  2. 3

    The problem turns out to be some obscure FUSE mounts that the author had lying around in a broken state, which subsequently broke the kernel namespace system. Meanwhile, I have been running systemd on every computer I’ve owned in many years and have never had a problem with it.

    Does this not seem a bit melodramatic?

    1. 9

      From the twitter thread:

      Systemd does not of course log any sort of failure message when it gives up on setting up the DynamicUser private namespace; it just goes ahead and silently runs the service in the regular filesystem, even though it knows that is guaranteed to fail.

      It sounds like the system had an opportunity to point out an anomaly that would guide the operator in the right direction, but instead decided to power through anyways.

      1. 8

        A lot like continuing to run in a degraded state is a plague that affects distributed systems. Everybody thinks it’s a good idea “some service is surely better than no service” until it happens to them.

        1. 3

          At $work we prefer degraded mode for critical systems. If they go down we make no money, while if they kind of sludge on we make less but still some money while we firefight whatever went wrong this time.

          1. 8

            My belief is that inevitably you could be making $100 per day, would notice if you made $0, but are instead making $10 and won’t notice this for six months. So be careful.

            1. 4

              We have monitoring and alerting around how much money is coming in, that we compare with historical data and predictions. It’s actually a very reliable canary for when things go wrong, and for when they are right again, on the scale of seconds to a few days. But you are right that things getting a little suckier slowly over a long time would only show up as real growth not being in line with predictions.

          2. 2

            I tend to agree that hard failures are nicer in general (especially to make sure things work), but I’ve also been in scenarios where buggy logging code has caused an entire service to go down, which… well that sucked.

            There is a justification for partial service functionality in some cases (especially when uptime is important), but like with many things I think that judgement calls in that are usually so wrong that I prefer hard failures in almost all cases.

            1. 1

              Running distributed software on snowflake servers is the plague to point out.

              1. 1

                Everybody thinks it’s a good idea “some service is surely better than no service” until it happens to them.

                So if the server is over capacity, kill it and don’t serve anyone?

                Router can’t open and forward a port, so cut all traffic?

                I guess that sounds a little too hyperbolic.

                But there’s a continuum there. At $work, I’ve got a project that tries to keep going even if something is wrong. Honest, I’m not sure I like how all the errors are handled. But then again, the software is supposed to operate rather autonomously after initial configuration. Remote configuration is a part of the service; if something breaks, it’d be really nice if the remote access and logs and all were still reachable. And you certainly don’t want to give up over a problem that may turn out to be temporary or something that could be routed around… reliability is paramount.

                1. 2

                  And you certainly don’t want to give up over a problem that may turn out to be temporary

                  I think that’s close to the core of the problem. Temporary problems recur, worsen, etc. I’m not saying it’s always wrong to retry, but I think one should have some idea of why the root problem will disappear before retrying. Computers are pretty deterministic. Transient errors indicate incomplete understanding. But people think a try-catch in a loop is “defensive”. :(

            2. 4

              So you never had legacy systems (or configurations) to support? I read Chris’ blog regularly, and he works at a university on a heterogeneous network (some Linux, some other Unix systems) that has been running Unix for a long time. I think he started working there before systemd was even created.

              1. 3

                Why do you say that the FUSE mounts were broken? As far as we can see they were just set up in a uncommon way https://twitter.com/thatcks/status/1027259924835954689

                1. 3

                  It does look brittle that broken fuse mounts prevent the ntpd from running. IMO the most annoying part is the debugability of the issue.

                  1. 2

                    Yes, it seems melodramatic, even to my anti-systemd ears. It’s a documentation and error reporting problem, not a technical problem, IMO. Olivier Lacan gave a great talk last year about good errors and bad errors (https://olivierlacan.com/talks/human-errors/). I think it’s high time we start thinking about how to improve error reporting in software everywhere – and maybe one day human-centric error reporting will be as ubiquitous as unit testing is today.

                    1. 2

                      In my view (as the original post’s author) there are two problems in view. That systemd doesn’t report useful errors (or even notice errors) when it encounters internal failures is the lesser issue; the greater issue is that it’s guaranteed to fail to restart some services under certain circumstances due to internal implementation decisions. Fixing systemd to log good errors would not cause timesyncd to be restartable, which is the real goal. It would at least make the overall system more debuggable, though, especially if it provided enough detail.

                      The optimistic take on ‘add a focus on error reporting’ is that considering how to report errors would also lead to a greater consideration of what errors can actually happen, how likely they are, and perhaps what can be done about them by the program itself. Thinking about errors makes you actively confront them, in much the same way that writing documentation about your program or system can confront you with its awkward bits and get you to do something about them.