1. 42
  1. 21

    Another sad story proving that RAID is not a backup.

    DIY infrastructure without backups is just asking for trouble. If you need me, I’ll be backing up Home Assistant…

    1. 5

      Yes, backups are very important. I’m happy that I’m mostly confident in my current backup strategy, and have tested it a couple times. Including once when I accidentally deleted my own home directory.

    2. 16

      I’m glad this post is cathartic for the author. For me the inflection point was:

      I ran e2fsck -f on the treefort volumes and hoped for the best. Instead of a clean bill of health, I got lots of filesystem errors. But this still didn’t make any sense to me, I checked the array again, and it was still showing as fully healthy. Accordingly, I decided to run e2fsck -fy on the volumes and hope for the best.

      Whew. Speaking from experience (so many battle scars, so many incidents at home and work), I’ve trained myself so that when something doesn’t make sense during an incident I now tell myself: “move gently, move slowly, don’t change anything; what do I know, why do I think I know it, what should I do next”.

      It reminds me of a coworker at my first job. I once asked them why they typed commands so slowly: they would literally type one character per second, then reread the whole command, so it took a minute to run a single command. They laughed and said, “Speaking from experience, one minute is nothing compared to the consequences of doing the wrong thing.”

      So something as simple as rsync’ing your data off to a remote box would occur to you if you suspected your filesystem was hosed while your RAID still looked “healthy”. Or at least double-checking your backup.
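
      Even something hedged like this one-liner (hostname and paths are made up) gets a copy somewhere else before you start poking at the array:

        rsync -aHAX --progress /mnt/treefort/ backupbox:/srv/treefort-rescue/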

      1. 3

        If I had noticed a mysteriously detached volume claiming to be totally healthy on a re-attach, I would have looked up how to force-check that volume. Or just wipe it and resilver it from scratch. RAID1 makes me immediately worried about which volume is going to have the correct blocks. Without ZFS checksums it seems like it would be hard to know which blocks are the latest. ZFS checksums go all the way up to the root block, so it knows which tree of blocks is valid. I can’t think of how mdadm could possibly do this in RAID1 mode without getting it wrong some of the time. I can’t even see how mdadm would recover from bit rot with only 2 volumes in a mirror. I’m guessing it just… doesn’t?
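
        For what it’s worth, md does let you force a consistency check by hand. Something like this (md0 is a made-up array name) compares the mirror halves and reports how many sectors disagree, though it still can’t tell you which half is the right one:

          echo check > /sys/block/md0/md/sync_action
          cat /sys/block/md0/md/mismatch_cnt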

        1. 2

          I’m guessing it just… doesn’t?

          That’s how I understand it. To get bitrot protection, you need to run mdadm on top of dm-integrity, which provides the checksum validation.
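
          Roughly like this, if I remember the tooling right (device names are made up; integritysetup ships with cryptsetup): each leg gets a dm-integrity layer that returns read errors on checksum mismatch, and md then repairs the bad leg from the good copy.

            integritysetup format /dev/sda1
            integritysetup open /dev/sda1 int-a
            integritysetup format /dev/sdb1
            integritysetup open /dev/sdb1 int-b
            mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/mapper/int-a /dev/mapper/int-b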

      2. 11

        There’s some mild irony in this post coming right after this one. Makes me glad I use ZFS!

        1. 1

          ZFS and btrfs should both be able to flag this. dm-integrity maybe as well? (Depends on whether the wrong data was written cleanly or actually bit-rotted over time.)

        2. 4

          “This wouldn’t have happened with ZFS” is a strange conclusion to come to after a user error. Also: I’d recommend a mundane backup strategy. Having to package something yourself smells of novelty. Although I’ve not heard of the system they mention, it might be fine.

          1. 6

            ZFS would have told you why the drive wasn’t in the array anymore, with a counter showing how many checksums failed (the last column in zpool status; it should be 0). The author would thus have known there was something wrong with the SSD, and thought twice before mindlessly adding it back to the array.

            I’m not entirely sure what would happen if you add the SSD back to the array anyway; at the very least you must give it a clean bill of health with zpool clear. I would also expect that ZFS urges or maybe even forces you to do a resilver of the affected device, which would surface the corruption again. The main problem with mdadm in this case was that when re-adding the device, it found the device had been part of the array before and decided to trust it blindly, not remembering that it was thrown out earlier, or why.
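
            The conservative sequence would be something like this sketch (pool name is made up): clear the errors, scrub, and watch whether the CKSUM column starts climbing again.

              zpool status -v tank    # per-device READ/WRITE/CKSUM counters
              zpool clear tank        # reset the counters and readmit the device
              zpool scrub tank        # re-verify every block against its checksum
              zpool status -v tank    # a rising CKSUM count means the SSD really is bad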

            1. 3

              ZFS should resilver when you add the drive back to the array and verify/update the data on the failed drive.

            2. 5

              The README in the repo for that project says in bold text that it is experimental, which is exactly what I would avoid if I were looking for a reliable backup system… but to each their own.

              1. 5

                How was this user error? The RAID array silently corrupted itself. Possibly because of the RAM?

                the filesystem in the treefort environment being backed by the local SSD storage for speed reasons, began to silently corrupt itself.

                ZFS checksums the content of each block, so it would have been able to tell you that what you wrote is no longer what is there. It could also pick the copy from the disk that was NOT corrupted by matching the checksum. And it would have stopped changing things the moment it hit inconsistencies.

                1. 2

                  The drive failed out of the array and they added it back in.

                  1. 4

                    Yeah, but why did the array think it was fine when it had previously failed out?

                    1. 2

                      I don’t know; it’s a reasonable question, but it doesn’t change that fundamentally it was a user mistake. ZFS may have fewer sharp edges, but it’s perfectly possible to do the wrong thing with ZFS too.

              2. 3

                I’m trying to figure out what caused that Kubernetes issue. Kubernetes without the control plane is allegedly like a server without a sysadmin: everything stays static (no new containers started, no routes updated, secrets/configmaps not updated). If a worker node can’t talk to the control plane, it won’t just kill all the containers by default.

                However, I wouldn’t be surprised if some Kubernetes add-on breaks that assumption. The side effect of so many people using managed Kubernetes clusters is that so few people have full-blown control plane outages like the one in this post. Thus, the disaster recovery behavior never gets tested enough. I’ve had just two power outages and discovered tons of bugs and documentation gaps across the Kubernetes ecosystem each time.
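
                Next time I’d probably verify that assumption directly on a worker node while the control plane is down, with something like this (assuming a containerd-based node with crictl installed):

                  crictl ps                 # workloads should still be running with the API server unreachable
                  journalctl -u kubelet -f  # the kubelet logs connection errors but shouldn’t kill pods on its own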

                1. 3

                  I really do not see the reason for RAID, given that 90% of the articles I’ve seen about it involve RAID failure and data destruction. Is this just some sort of reverse survivor bias?

                  Personally I want a tech stack that is the least complicated, and RAID doesn’t fit the principle of “don’t overcomplicate X” (where X is the disk/filesystem layer in this case). A complex setup makes problems harder to figure out and is more likely to fuck everything up if you have misconfigured something.

                  1. 4

                    I’m sympathetic to this point of view but I’d need to write something really long to go into it. To a very rough approximation, RAID can only really help with uptime and it’s backups that help with safety.

                    1. 4

                      I think a lot of these articles are written because of failures, so you really do only get the “reverse survivor bias”. Plus, the RAID-incident blog posts are usually written by people who blog anyway, and who do something else for a living, not run RAID arrays.

                      Not by the person working a 9-to-5 at a regular non-tech company who just replaces the dead disk, rebuilds the array, and goes home. Maybe someone notices “hey, the disks are slow”, you answer “yes, we’re rebuilding”, they roll their eyes like “ah, the morons from IT are at it again”, and in a day or two everybody forgets the whole thing. I’ve had this happen twice when I was in that line of work. I’ve also had personal disks fail, or disks belonging to people close to me (maybe 5 disks over 20-30 years?). Rarely was any of that data saved, and in the “people close to me” case it often meant a week or so of downtime until a new disk was bought, Windows installed, all the drivers, etc.

                      RAID has its place. And yes, it has issues and it fails. But it also helps a lot with certain classes of problems, in certain environments, and for certain people. We just don’t always know where that line is.

                      1. 5

                        Yeah, you’re not going to get blog posts about “I replaced the drive in my array and everything went as expected”.

                    2. 1

                      Could it be that the “fickle” Kubernetes wasn’t the best tool for the job in this scenario?

                      1. 3

                        The author explicitly says that k8s isn’t right for them, and clearly they hadn’t spent the (inordinate) amount of time necessary to properly manage an HA on-prem cluster.