1. 51
  1. 40

    A novice was trying to fix a broken Lisp machine by turning the power off and on.

    Knight, seeing what the student was doing, spoke sternly: “You cannot fix a machine by just power-cycling it with no understanding of what is going wrong.”

    Knight turned the machine off and on.

    The machine worked.

    1. 23

      It sounds like a joke but I have literally done this, for work. Not that often, but definitely more than once or twice.

      If I’m feeling confident I first rest my hands on the machine and intone “The power of Knuth compels you, be healed!”

      1. 6

        Somewhere in recent history a coworker reported the unexpectedly early death of a small handful of machines in a datacenter that we rarely have a person visit. All of them were redundant cattle so it was mostly an accounting loss and an eventual cleanup task.

        They would have been part of a list of machines to get upgrades, and managed to appear on the list of machines that failed to upgrade, so I went to check them out. No ping to the primary address. The BMC pinged, though, so I asked it to power cycle, and then watched the serial console. The machine booted properly.

        All of them in that suspect group came up, in fact. They have each made several reboots since then without issue.

        I conclude that it is possible to do this remotely, no blood sacrifices required.

    2. 13

      The section on granularity alludes to another principle: “reduce the scope and size of your state”. That is, push state to the leaves of your system, so your higher level processes have less opportunity to fall into a broken state.

      Containerisation is a form of this.

      https://grahamc.com/blog/erase-your-darlings is a realisation of this for a Linux distribution: each boot is from the exact same state, with optional allowlisted state to persist between reboots.

      1. 9

        Curious to hear from some Erlang/Elixir devs here. As I understand it, “rebooting” is at the heart of the OTP philosophy. Generally speaking, does it work well? What are some examples of when it doesn’t?

        1. 7

          Generally speaking, does it work well?

          In my experience, it works very, very well. Restarting is the most efficient way to deal with transient failures, and Erlang’s use of supervision trees means that the piece being restarted tends to have the smallest granularity to be useful, while preserving the state in the rest of the system.

          I feel that Erlang/Elixir’s “Let it Crash” philosophy has less to do with restarting during transient failures than many people might think. It has more to do with turning errors and failures into tools, and being forced to really think about failure scenarios when developing. Fred Hebert explains this idea in a really excellent talk/transcription: https://ferd.ca/the-zen-of-erlang.html

          What are some examples of when it doesn’t?

          “Let it crash” does not really work when dealing with data/state that cannot be thrown away or reconstructed. For example, when I need to integrate with some service whose API does not allow for idempotent requests, I often find myself having to be more careful about crashes, and leaning more heavily on other error handling mechanisms. Another commenter mentioned poison messages, but I haven’t really struggled with those, because practicing “Let it crash” will make bugs around poison messages very loud and straightforward to fix.

          Letting things crash+restart is not a panacea, it’s one strategy among several. Erlang also supports exceptions, and there’s a very common pattern of returning ‘Either-shaped’ terms from functions that can fail; {ok, Result} | {error, Term}.
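
A rough illustration of that "Either-shaped" return pattern, transliterated into Python rather than Erlang. The function names `parse_ratio` and `describe` are made up for this sketch; the point is only that the caller receives a tagged tuple and decides what to do, instead of the failure propagating as a crash.

```python
from typing import Tuple, Union

Ok = Tuple[str, float]   # ("ok", value)
Err = Tuple[str, str]    # ("error", reason)


def parse_ratio(text: str) -> Union[Ok, Err]:
    """Return ("ok", value) or ("error", reason) instead of raising."""
    try:
        num, den = text.split("/")
        return ("ok", int(num) / int(den))
    except ValueError:
        # Wrong number of parts, or a part that is not an integer.
        return ("error", "not of the form <int>/<int>")
    except ZeroDivisionError:
        return ("error", "division by zero")


def describe(text: str) -> str:
    """Caller-side handling: branch on the tag, as one would pattern-match in Erlang."""
    tag, payload = parse_ratio(text)
    if tag == "ok":
        return f"{text} = {payload}"
    return f"cannot evaluate {text!r}: {payload}"
```

Expected, recoverable failures go through the tagged return value; anything unanticipated is still free to crash and be restarted.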

          1. 2

            It’s less about rebooting the whole thing and more about having a supervisor that can start the process back in a known state. It works well in many cases…provided you don’t let yourself infinitely reprocess poison messages.
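
A minimal sketch of that idea in Python (not OTP itself): a supervisor loop that replaces a crashed worker with a fresh one in its initial state, and sets a message aside once it has crashed the worker too many times. `Worker`, `supervise`, and `max_attempts` are hypothetical names chosen for this example.

```python
from collections import deque


class Worker:
    """Hypothetical worker: keeps a running total, crashes on bad input."""

    def __init__(self):
        self.total = 0  # the known-good initial state

    def handle(self, message):
        self.total += int(message)  # raises ValueError on a poison message


def supervise(messages, max_attempts=2):
    """Feed messages to a worker, restarting it from scratch on any crash.

    A message that crashes the worker max_attempts times is quarantined
    instead of being retried forever.
    """
    queue = deque((m, 0) for m in messages)
    worker = Worker()
    restarts = 0
    quarantined = []
    while queue:
        message, attempts = queue.popleft()
        try:
            worker.handle(message)
        except Exception:
            worker = Worker()  # restart: all in-memory state is discarded
            restarts += 1
            if attempts + 1 >= max_attempts:
                quarantined.append(message)
            else:
                queue.appendleft((message, attempts + 1))
    return worker.total, restarts, quarantined
```

Note that each restart also throws away the running total, which is exactly the cost to weigh when the state cannot be reconstructed.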

          2. 8

            The 787’s avionics had a software bug that caused them to crash after 248 days of uptime (time the airplane is powered on, not time in flight). Clearly no actual flight lasts anywhere near 248 days, so there would have been no harm in simply resetting the computer automatically every, say, 48 hours. This would guarantee the system doesn’t get into any weird, untested states after being online for months at a time.

            There’s no way to practically test the airplane being on for months at a time either, so if you want to guarantee the product is completely tested, this might be a requirement.

            The FAA sent a notice that the plane should be manually reset between flights as a workaround.
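
The comment does not say where 248 days comes from. One commonly cited explanation, treated here purely as an assumption, is a signed 32-bit counter that ticks in hundredths of a second. The arithmetic under that assumption:

```python
TICKS_PER_SECOND = 100   # assumption: the counter ticks every 1/100 s
SECONDS_PER_DAY = 86_400


def days_until_overflow(bits: int = 32, signed: bool = True) -> float:
    """Days of continuous uptime before a tick counter of this width overflows."""
    max_ticks = 2 ** (bits - 1 if signed else bits)
    return max_ticks / TICKS_PER_SECOND / SECONDS_PER_DAY


def wraps_before_reset(reset_interval_hours: float) -> bool:
    """True if the counter would overflow before the periodic reset happens."""
    return reset_interval_hours / 24 >= days_until_overflow()
```

With these assumptions `days_until_overflow()` comes out at roughly 248.55 days, and a 48-hour reset interval stays far below it.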

            1. 5

              I enjoyed reading this, though I’m not sure I agree entirely with its conclusions.

              Turn it off, and turn it on again. Anything else is less principled.

              I think it depends on the particulars of the situation. If you’re talking about a running process that’s discovered some internal error and can be easily and quickly restarted, yeah, crashing and restarting it probably makes sense. But if restarting is expensive (e.g. 10+ minutes of downtime for a bare-metal server reboot, or worse, the same across an entire cluster) and there’s a fairly simple/obvious fix that can be applied, why not do that?

              It seems like the reasoning in the article is founded on (what I see as) an overly optimistic picture of how well-understood your system’s state really is even when it appears to be functioning as intended. A running process, let alone an entire server or cluster, has many, many bits of state – the portion of those that its authors are aware of and thoroughly understand is a tiny fraction of the whole. Even before any bugginess has (detectably) reared its head, there’s a gigantic iceberg of subsurface state that we just kind of assume is in alignment with the tip of it that we can see. Subtle non-determinism can creep in from all sorts of places and manifest in that hidden state, from ASLR at the OS level to temperature-dependent differences in how many cycles it takes a PLL in your DRAM controller to lock when it comes out of reset (I’ve learned from experience that it’s entirely possible to run the exact same sequence of instructions from system power-on and get different behavior from one run to the next). There is no Mozart; it’s always jazz.

              (I should clarify that this isn’t to say we shouldn’t strive to understand our systems and their states as thoroughly as possible, I just think it’s fair to acknowledge that that understanding is always going to be less than absolutely complete.)

              1. 2

                I’m not sure you disagree with the author. If I’m right about the article’s implicit assumptions and yours, I think we all agree that a system that is functioning as intended is very much capable of concealing latent dysfunction, and it’s only when an error actually occurs that we are informed of the fact that there is a disconnect between our mental model of how the system should behave vs. how it’s actually behaving. But that’s the point: so long as the system is behaving as expected, even if we assume a priori that at least one such possible error state exists, we cannot know its specific nature until it rears its head (assume that we’ve exhausted every avenue for static analysis available to us, since none of those can save us if our spec is incorrect). Once we observe such an error, it’s incumbent on us to investigate its causes and expand our knowledge of how the true system state evolves, even if full knowledge of that evolution will always elude us. Crash-only behavior is valuable both because it surfaces those error states quickly and because, to your concern, it is the strategy that demands the least of us in terms of knowing precisely the ideal state of the system, the current state of the system, and what a viable path between those two might be. So then: because a crash-only strategy is the most resilient to imperfect knowledge, systems should be designed in the first place to minimize the expense of pursuing that strategy.

              2. 5

                The part about “Local crashes and global equilibria” reminds me of the story of how AT&T’s entire long distance network went down for most of a day in 1990: https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse

                1. 2

                  https://twitter.com/SaraBWarf/status/1485469273925505026/photo/1

                  Imagine a guide dog that, whenever you’re lost for too long without getting anywhere, drags you home again. This is a watchdog timer. Whenever some tech seems to lose its marbles and then suddenly goes back to the home screen: the watchdog barked and dragged it home.
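
A toy version of the guide-dog analogy, as a sketch only. Real watchdogs are usually hardware timers; here the clock is simulated with explicit `tick()` calls so the behavior is deterministic. `Watchdog` and `Device` are hypothetical names.

```python
class Watchdog:
    """Counts down on every tick; resets the device if it is not petted in time."""

    def __init__(self, timeout_ticks, on_timeout):
        self.timeout_ticks = timeout_ticks
        self.on_timeout = on_timeout
        self.remaining = timeout_ticks

    def pet(self):
        """Called by the healthy system to signal that it is making progress."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Called by an independent clock; fires the reset when time runs out."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.on_timeout()
            self.remaining = self.timeout_ticks


class Device:
    """Hypothetical device whose only known-good state is the home screen."""

    def __init__(self):
        self.screen = "home"
        self.resets = 0

    def reset(self):
        self.screen = "home"
        self.resets += 1
```

As long as the device keeps calling `pet()`, nothing happens; once it stops, `timeout_ticks` ticks later the watchdog drags it back to `"home"`.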

                  1. 2

                    I was thinking about this and I think I have a nice minimal example of this behavior:

                    Imagine you have a program where the user controls a point on a 2D grid. The user always starts at (0, 0). They can move left, right, up, or down. This is a very simple program, but after a long enough time line it could start to misbehave. For example, let’s say you represent the x and y values with signed 32 bit integers and overflow in your language is undefined. Most runs will never have any problems, but if someone races to an edge they could trigger undefined behavior on integer overflow. The user can’t tell what’s going on. Maybe it maps to their character in a video game and they teleport to a completely different part of the world. They restart the game, go back to origin and everything is fine.
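
A sketch of that grid example. Python integers do not overflow, so this simulates one possible outcome, two's-complement wraparound, via a helper named `wrap_int32` (a name invented here). In a language where signed overflow is undefined, wraparound is only one of the things that could happen.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def wrap_int32(value: int) -> int:
    """Simulate two's-complement wraparound of a signed 32-bit integer."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


class Player:
    """A point on a 2D grid; a fresh Player always starts at the origin."""

    def __init__(self):
        self.x, self.y = 0, 0

    def move(self, dx: int, dy: int) -> None:
        self.x = wrap_int32(self.x + dx)
        self.y = wrap_int32(self.y + dy)
```

Walking one step past `INT32_MAX` lands the player at `INT32_MIN`, the "teleport"; constructing a new `Player()` is the restart that puts them back at (0, 0).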

                    1. 10

                      In Legend of Zelda: Breath of the Wild, there’s a gameplay mechanic where occasionally the “blood moon” will rise and all your defeated enemies will come back to life. What’s actually happening is the game is resetting to a known good state, but they make it seem relevant in game.

                      1. 1

                        Is that really the intent? If so, that’s amazing! I love how video game developers are able to pull the wool over our eyes. Reminds me of classic Hollywood illusions. Trust nothing!

                        1. 6

                          Wing Commander for DOS would crash after exiting the game, before returning to the DOS prompt, with an error message displayed on the screen. The devs just patched the error string to say “Thanks for playing Wing Commander!”

                      2. 1

                        In high school, I learned programming on a Wang 2200-T. 8kB RAM, cassette storage. The instruction manual referenced an operation called “Master Initializing”: turning it off and then back on.

                        1. 1

                          the nice thing about *nix is that you don’t have to restart the machine unless the misbehaving process is the kernel or init (there is kexec etc. but that doesn’t have the well-testedness advantage)

                          1. 8

                            The whole point of this article is that this statement is untrue.

                            It is not uncommon for machines regardless of OS[1] to all get into a state where they’re globally pantsed - obviously some are worse at this than others (or is that better at it?). Oftentimes the result is just terrible performance, sometimes complete inability to make forward progress. It is possible there is a single faulty process, and an OS that has robust process management can deal with that. However, oftentimes you can’t isolate the fault to a single process, so you start having to restart increasingly large amounts of user space. At some point, restarting the system is the most sensible path forward, as it guarantee-ably gets your entire system into a non-pantsed state.

                            A lot of system reliability engineering is the misery of debugging systems once they’re stuck to try and work out how they got into that stuck state, while also being continuously pinged by people wanting things to be running again.

                            [1] I’ve worked with, and encountered “just reboot it” level problems with a variety of linuxes over the years (1.x,2.x,3.x I don’t think I’ve used 4+ in any work situation), macOS 7.* (and people complain about windows), all of the OSX/macOSs at varying levels of stability, windows (weirdly one of the most stable machines I ever had was this compaq windows Me thing), VAX/VMS, freeBSD, and I’m sure at least a couple of others in a general mucking around during uni setting

                            1. 5

                              Ooooh, did I ever tell you that thing about the uptime log :-D?

                              So my first serious computer gig, back in 2002, eventually had me also helping the sysadmin who ran things at $part_time_job, which he graciously agreed to when I told him I wanted to learn a thing or two about Unix. One of the things we ran was a server – aptly called scrapheap – which ran all the things that people liked, but which management could not be convinced to pay for, or in any case, could not be convinced to pay enough. It ran a public-ish (invite-only) mailing list with a few hundred subscribers, a local mirror for distributed updates and packages, and a bunch of other things, all on a beige box that had been cobbled together out of whatever hardware was lying around.

                              Since a lot of people spread around four or five offices in two cities ended up depending on it being around, it had excellent uptime (in fact I think it was rebooted only five or six times between 1999-ish when it was assembled and 2005 – a few times for hardware failure/upgrades, once to migrate it to Linux, and once because of some Debian shenanigans).

                              On the desk under which it was parked lay a clipboard with what we affectionately called “the uptime log”. The uptime log listed all the things that had been done in order to keep the thing running without rebooting it, because past experience had taught us you never know how one of these is going to impact the system on the next boot, and nobody remembers what they did six months ago. Since the system was cobbled together from various parts it was particularly important because hardware failure was always a possibility, too.

                              The uptime log was not pretty. It included things like:

                              • Periodic restart of samba daemon done 04.11.2002 21:30 (whatever), next one scheduled 09.11.2002 because $colleague has a deadline on 11.11 and they really need it running. I don’t recall why but we had to restart it pretty much weekly for a while, otherwise it bogged down. Restarts were scheduled not necessarily so as not to bother people (they didn’t take long and they were easy to do early in the morning) but mostly so as to ensure that they were performed shortly before anyone really needed it, when it was the fastest.
                              • Changed majordomo resend params (looked innocuous, turned out to be relevant: restarting sendmail after an update presumably carried over some state, and it worked prior to the reboot, but not afterwards. That’s how I discovered Pepsi and instant coffee are a bad mix).
                              • Updated system $somepackage (I don’t remember what it was, some email magic daemon thing). Separately installed old version under /opt/whatever. Amended init script to run both correctly but I haven’t tested it, doh.

                              It was a really nasty thing. We called scrapheap various endearing names like “stubborn”, “quirky” or “prickly” but truth is everyone kinda dreaded the thing. I was the only one who liked it, mainly because – being so entangled, and me being a complete noob at these things – I rarely touched it.

                              You could say well, Unix and servers were never meant to be used like that, you should’ve been running a real machine first of all and second of all it was obviously really messy and you could’ve easily solved all that by partitioning services better and whatnot. Thing is we weren’t exactly sh%&ing money, the choice was between this and running mailing lists by avian carriers, so I bet anyone who has to do similar things today, on a budget that’s not exactly a YC-funded startup or investment bank budget, is really super glad for Docker or jails or whatever they’re using.

                              1. 2

                                It is not uncommon for machines regardless of OS[1] to all get into a state where they’re globally pantsed - obviously some are worse at this than others (or is that better at it?)

                                You’re absolutely right, though it’s also the case that Unixes have a lot more scopes that you can restart from initial conditions than other common OSes. Graphical program doesn’t work, and restarting it doesn’t help? Log out and log back in. That doesn’t fix it? Go to console and restart your window system. That doesn’t fix it? Go to single-user mode, then bring it back up to multi-user. Once that’s exhausted is when you need to reboot…

                                Of course, just rebooting would be faster than trying all of these one after another. Usually.

                                1. 1

                                  The whole point of this article is that this statement is untrue.

                                  I can tell you that I almost never restart my whole computer. Certainly, the software I write has been much better tested when restarting just the service, and the “whole computer restart” has not. An easy example is that service ordering may not be properly specified, which is OK when everything is already running, but not OK when booting up.

                                  Unix ain’t Windows. If you aren’t working on the kernel or PID 1 you almost never have to restart.
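
The service-ordering point can be made concrete with a small sketch. It assumes each service declares what it depends on (the service names below are invented) and uses the standard-library `graphlib` module, available from Python 3.9, to derive a boot order. A dependency that was never declared only shows up when starting from a cold boot.

```python
from graphlib import TopologicalSorter


def boot_order(dependencies):
    """Return a start order in which every service comes after its dependencies.

    dependencies maps a service name to the set of services it needs first.
    """
    return list(TopologicalSorter(dependencies).static_order())


def first_unmet_dependency(order, real_needs):
    """Walk a proposed start order and report the first service that would
    start before something it actually needs, or None if the order is safe."""
    started = set()
    for service in order:
        for need in real_needs.get(service, ()):
            if need not in started:
                return (service, need)
        started.add(service)
    return None
```

For example, with `{"web": {"db"}, "db": {"network"}, "network": set()}` the derived order is network, db, web. If the order were instead whatever happened to work while everything was already running, `first_unmet_dependency` is the check a real reboot would perform for you.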

                                  1. 3

                                    Back in the early 2000s, I had win2k/new XP and linux systems. All of them went similarly long periods between reboots, measured in weeks. But even then, manually rebooting any of those was not an uncommon event.

                                    Now these days of course, most systems - including nixes - have security updates requiring reboots at that kind of cadence, which presumably mitigates any potential “reboot fixed” issues.

                                    1. 2

                                      You’d think so, but I have managed to bring GPUs into a broken state, where anything trying to communicate with them just hangs. Restart was the only way out.

                                  2. 1

                                    About 70% of the time that I upgrade libc at least one thing is totally hosed until reboot. And that thing might be my desktop environment, in which case “restarting that process” is exactly the same level of interruption as rebooting, just less thorough.

                                    1. 1

                                      Are you, by any chance, from Ontario? [I ask because I’ve only ever heard Ontarians use the term “hosed” that way & very much want to know if you are an exception to this pattern.]

                                      1. 1

                                        Nope! I’m from Virginia and have lived in Massachusetts for the past 10-ish years. It’s a term I hear people use from time to time, but I haven’t happened to notice any pattern to who. It’s likely that it spread in some subcultures or even in just some social subgraphs.

                                        1. 1

                                          Good to know! Thanks :)

                                  3. 1

                                    Isn’t ECT just doing that to people? :D
