1. 11

    I reiterate the request to add an OS tag: 1 2 3.

    1. 4

      I think it should be a systems tag, to make clear it’s not for anything OS-related (lest people tag, say, Windows posts) but for systems development: not just kernel/OS stuff, but drivers and some embedded too.

      1. 3

        os-dev?

      2. 2

        Second. We need a tag!

        1. 1

          Actually, whenever I post something about Jehanne, I feel the same need.

        1. 9

          This is becoming an increasingly severe problem in HPC, to the point where software needs to be written in an explicitly fault-tolerant fashion, since errors like these, or even hardware failures, will happen on nearly every exaflop run. Even the petaflop machines that are typical today need special handling for hardware failures to avoid crashing constantly.

          1. 2

            Would you mind elaborating on the techniques used when attempting to be fault tolerant of bit flips?

            1. 4

              One place to start is actually Tandem Computers, which were built for fault tolerance, basically by running two computers.

              NASA’s guidance system, among other things, has 3 or 4 computers which all compute the same thing then check with each other if they agree.

              For systems that can’t afford to run the same thing several times over, one can carry a checksum of the data end-to-end, checking it at various places.

              I’m sure other solutions exist, but as a non-expert, those are the ones I’ve come across.
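
              A rough sketch of the end-to-end checksum idea, as my own illustration rather than anything taken from a particular system: the producer attaches a CRC32 to the payload, and any stage along the way can verify it before trusting the data. The names `seal` and `verify` and the 4-byte trailer layout are made up for the example; `zlib.crc32` is from the Python standard library.

              ```python
              import zlib

              def seal(payload: bytes) -> bytes:
                  """Append a 4-byte CRC32 of the payload (hypothetical framing)."""
                  return payload + zlib.crc32(payload).to_bytes(4, "big")

              def verify(frame: bytes) -> bytes:
                  """Return the payload if its checksum matches, else raise ValueError."""
                  payload, stored = frame[:-4], int.from_bytes(frame[-4:], "big")
                  if zlib.crc32(payload) != stored:
                      raise ValueError("checksum mismatch: data was corrupted along the way")
                  return payload

              frame = seal(b"sensor reading 42")
              assert verify(frame) == b"sensor reading 42"

              # Simulate a single bit flip somewhere along the path.
              corrupted = bytearray(frame)
              corrupted[3] ^= 0x01
              try:
                  verify(bytes(corrupted))
              except ValueError as err:
                  print("detected:", err)
              ```

              Note that a checksum like this only detects the corruption; what to do next (retransmit, recompute, give up) is still up to the application.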

              1. 1

                What happens when you get an error? I.e., say computer 4 gets hit by a cosmic ray which flips a bit; what’s the procedure for bringing all the computers back into agreement?

                1. 1

                  If you have multiple computers you can do a quorum. Otherwise, information is lost and what you do depends on the situation. You can either fail and tell the user or, if there is a backup policy, execute that.
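
                  Roughly, the quorum case could look like the sketch below. This is only an illustration under my own assumptions: the `vote` helper is hypothetical, replicas are modelled as plain values in a list, and "resyncing" is just overwriting the outlier with the majority value.

                  ```python
                  from collections import Counter

                  def vote(outputs):
                      """Return (majority_value, dissenter_indices); raise if no strict majority."""
                      value, count = Counter(outputs).most_common(1)[0]
                      if count * 2 <= len(outputs):
                          # No quorum: information is lost, so fail or run a backup policy.
                          raise RuntimeError("no quorum among replicas")
                      dissenters = [i for i, out in enumerate(outputs) if out != value]
                      return value, dissenters

                  # Four replicas compute the same result; the last one suffers a bit flip.
                  outputs = [1000, 1000, 1000, 1000 ^ 0b100]
                  value, dissenters = vote(outputs)

                  # Bring the outlier back into agreement by overwriting its state.
                  for i in dissenters:
                      outputs[i] = value

                  print(value, dissenters, outputs)
                  ```

                  With only two replicas a disagreement can be detected but not resolved, which is why the sketch raises when there is no strict majority.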

              2. 3

                I’m not terribly familiar with this field, but this report should get you started: http://www.netlib.org/lapack/lawnspdf/lawn289.pdf

                1. 2

                  Oh nice! I didn’t have that. Thanks.

                2. 3

                  I second apy’s recommendation of Tandem Computers. I’ll go further with two specific works. The first is by Jim Gray showing how Tandem looked at things systematically to figure out how to eliminate as many error classes as possible. They ended up achieving a five 9’s system. The second is from a competitor, Stratus, covering both hardware and programming techniques for robust systems, including Tandem NonStop.

                  Why Do Computers Stop and What Can Be Done About It?

                  Paranoid Programming: Techniques for Constructing Robust Software

                  Note: First is an old PDF. Second one is a PostScript file from Archive.org since the PDF link is dead with no archive copy.

                  1. 2

                    Thanks for the links! Looks like there is a bunch of goodies in there…

              1. 3

                My favorite feature of tig is the stash browsing mode. I use stashes heavily in my workflow, and at the end of the day I might have a few useful things in my stash. However, I also accumulate a bunch of garbage from testing fixes, debugging random issues, or ideas that went nowhere.

                I have the following key binding in my .tigrc that lets me drop a selected stash (I think I got it from here):

                # Key binding to drop a stash.
                bind stash D !?git stash drop %(stash)
                

                So I can just fire up tig stash and drop anything that doesn’t look useful.