1. 31

  2. 26

    This is something of a classic…my best resourceful-system-recovery story happened a few years ago (not as dramatic or enjoyably-related as the linked story, but I was sorta proud of myself):

    After making a semi-special request, I had recently been granted root access on my workstation at my (university) department. One afternoon a few days after that, I started mucking around trying to do a local install of GHC – installing a newer glibc into /usr/local (since the system’s C library was too old for it), tweaking some libc/ld.so symlinks, so on and so forth. Despite a few hours of bashing my head against it though, I wasn’t able to get it working and eventually gave up, moved on to other things and basically forgot about it (I had an older GHC available and didn’t need the new one for anything critical).

    Later that night, however, I was busily tapping away at my shell session, when suddenly:

    $ ls
    Segmentation fault
    $ whoami
    Segmentation fault
    $ /bin/true
    Segmentation fault

    Like in the linked story, in my initial semi-panic I almost rebooted the box, but thankfully thought better of it (doing so would have destroyed any chance I had of recovery).

    After some further poking around (I don’t quite remember what clued me in to exactly what was going on) I realized that one of my earlier symlink adjustments had broken my x86-64 ld.so – but only in a delayed fashion that didn’t actually manifest itself until the 4:00AM prelink cron job ran (I’ve always been a bit of a night owl, and was still working at 4:00AM). So any dynamically-linked 64-bit binary I tried to execute would just immediately fall on its face before even reaching main(). That’s basically every binary on the system, very much including the su (actually ksu, since it was a Kerberos environment) binary I’d have needed to actually fix the problem.
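    To illustrate why a broken ld.so takes down every dynamically-linked binary: each such ELF file names its loader in a PT_INTERP program header, and the kernel hands control to that loader before the program's own code runs. The following is only a rough sketch of reading that field; the function name elf_interpreter is made up for illustration and is not anything used in the story.

```python
import struct

PT_INTERP = 3  # program-header type that names the dynamic loader


def elf_interpreter(data):
    """Return the loader path recorded in raw ELF bytes.

    Returns None for non-ELF input or for binaries without PT_INTERP
    (for example statically linked ones). Hypothetical helper, sketch only.
    """
    if len(data) < 6 or data[:4] != b"\x7fELF":
        return None
    is_64bit = data[4] == 2                  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"    # EI_DATA: 1 = little-endian
    if len(data) < (64 if is_64bit else 52):
        return None

    # The fields after the 16-byte e_ident differ in width between classes.
    layout = "HHIQQQIHHHHHH" if is_64bit else "HHIIIIIHHHHHH"
    header = struct.unpack_from(endian + layout, data, 16)
    phoff, phentsize, phnum = header[4], header[8], header[9]

    for index in range(phnum):
        base = phoff + index * phentsize
        if is_64bit:
            # 64-bit layout: p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz
            p_type, _, p_offset, _, _, p_filesz = struct.unpack_from(
                endian + "IIQQQQ", data, base)
        else:
            # 32-bit layout: p_type, p_offset, p_vaddr, p_paddr, p_filesz
            p_type, p_offset, _, _, p_filesz = struct.unpack_from(
                endian + "IIIII", data, base)
        if p_type == PT_INTERP:
            raw = data[p_offset:p_offset + p_filesz]
            return raw.split(b"\0", 1)[0].decode("ascii", "replace")
    return None
```

    Fed the bytes of a 64-bit binary and a 32-bit binary from the same machine, this would be expected to report two different loader paths, which is consistent with the 32-bit programs surviving while the 64-bit ones died.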

    The prospect of sheepishly going back to the department IT folks for help un-breaking my system immediately after convincing them that I could in fact be trusted with root access on it was pretty embarrassing, especially as someone who considers himself a decently competent sysadmin.

    In considering my options, I remembered I actually did have a running root shell on the system, but it was tucked away in a screen session I wasn’t currently attached to – and /usr/bin/screen, needless to say, was a 64-bit dynamic executable and thus not really working. (Yes, perhaps on general principle I should be admonished for having left a root shell lying around unattended, but that’s another matter, and in this case it was critically useful…)

    The other key thing I realized I had at my disposal was the departmental AFS filesystem, which was mounted on the machine. While my x86-64 ld.so was borked, i386 programs continued to work; I just didn’t have i386 versions of anything I actually needed. So I logged in to a 32-bit machine elsewhere in the department, compiled a 32-bit screen from source, dumped the binary into AFS, and hoped like hell screen’s authors had the decency to keep whatever IPC protocol it uses architecture-independent…thankfully, they apparently did, and I was able to re-attach to my 64-bit screen session with my freshly-compiled 32-bit binary, regaining access to my root shell. From there it would have been relatively easy to build 32-bit versions of whatever minimal subset of coreutils I’d need to fix the actual problem, but I realized I wouldn’t even need to do that – there was a 32-bit python interpreter sitting in AFS, so I fired that up as root and manually issued system calls (import os; os.unlink(...); os.symlink(...)) until things in /lib looked like they had before I started messing with them…and then at last:

    $ /bin/true
    $ echo $?
    0

    Success! Admin embarrassment avoided.
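    For anyone curious what the os.unlink/os.symlink repair looks like when there is no working ln or rm, here is a minimal sketch. It runs entirely inside a scratch directory; the file names (ld-good.so, ld-broken.so, ld.so) and the helper repoint_symlink are made up for illustration and are not the real paths from /lib.

```python
import os
import tempfile


def repoint_symlink(link_path, new_target):
    """Remove whatever sits at link_path, then recreate it as a symlink."""
    if os.path.lexists(link_path):   # lexists is also true for dangling links
        os.unlink(link_path)
    os.symlink(new_target, link_path)


# Demonstration in a throwaway directory, never touching the real /lib.
with tempfile.TemporaryDirectory() as scratch:
    for name in ("ld-good.so", "ld-broken.so"):   # hypothetical stand-in files
        open(os.path.join(scratch, name), "w").close()

    link = os.path.join(scratch, "ld.so")
    os.symlink("ld-broken.so", link)              # the "mistake"
    repoint_symlink(link, "ld-good.so")           # the repair
    restored_target = os.readlink(link)

print(restored_target)
```

    There is a brief window between the unlink and the symlink where the link does not exist at all, so on a live system the order and care of each call matters.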

    1. 3

      And this is why I have a statically linked busybox handy.

      1. 2

        Where do you keep copies of busybox to ensure you’re not affected? Surely the parent’s problem would have caused problems for you, even with your copy of busybox. Though! I guess you’d have had a shell open and could have cd’d to the directory it was in… Saved… this time. :)

        1. 1

          /usr, /bin, ~/bin, network shares are all good ideas

      2. 2

        “Like in the linked story, in my initial semi-panic I almost rebooted the box, but thankfully thought better of it (doing so would have destroyed any chance I had of recovery).”

        It would? Couldn’t you have just booted a live cd, mounted your hard drive and done the link changes there?

        1. 4

          On a “normal” system, sure, but in this case no, because the IT-administered machines in the department have their bootloader & BIOS locked down.

        2. 1

          Thank you for sharing this. It would seem that the mechanisms that make you more effective in the face of hunger and fear of embarrassment are alike!

        3. 7

          One way to prevent accidental rm -rf * from deleting your system (at least when using GNU tools, not sure about BSD) is to create a file named -i in important directories.

          The way this works is that * expands to include the -i file, so the command is interpreted as rm -rf -i followed by the other names, and rm then prompts before deleting each of those files.

          A better solution is to use safe-rm, which checks what it’s deleting against a blacklist of important files and directories.
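          A rough sketch of why the -i trick works. It imitates the shell's expansion of * by sorting directory entries in code-point order, which is an assumption that roughly matches the C locale (other locales may sort differently). Nothing is deleted; it only builds the argument lists rm would receive. The helper expand_star and the file names are made up for illustration.

```python
import os
import tempfile


def expand_star(directory):
    """Approximate an unquoted * in `directory`: skip dotfiles, sort the rest."""
    names = [n for n in os.listdir(directory) if not n.startswith(".")]
    return sorted(names)   # code-point order; '-' sorts before letters and digits


with tempfile.TemporaryDirectory() as scratch:
    for name in ("-i", "notes.txt", "src"):       # hypothetical directory contents
        open(os.path.join(scratch, name), "w").close()
    expansion = expand_star(scratch)

# `rm -rf *`: the file named -i arrives as an argument that looks like an option.
unguarded = ["rm", "-rf"] + expansion

# `rm -rf -- *`: everything after -- is an operand, so -i is just another file.
guarded = ["rm", "-rf", "--"] + expansion

print(unguarded)
print(guarded)
```

          The second list shows the limit of the trick: once -- ends option parsing, the -i file no longer protects anything.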


          1. 5

            I don’t always wipe out my drive,

            but when I do, I rm -rf -- *

            1. 1

              This does work on FreeBSD and OpenBSD - I don’t have a NetBSD or DragonFly BSD system to check.