1. 44
  1. 16

    Interesting roundabout way to find out your disk is full. Issues caused by a full rootfs are so crazy that I make a habit of running “df -h /” first any time I come across strange system problems that cannot be explained by anything I or anyone else has done.

    1. 13

      We had disk space monitoring, so my pager would usually go off well before that happened, but we had a few instances of inode exhaustion that caused df -i to be added to my repertoire of “what fresh hell?” debugging tools.
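
      A rough sketch of both checks from Python, assuming a POSIX system where `os.statvfs` is available. The helper name `usage` and the 90% threshold are my own illustrative choices; the percentages only approximate what `df -h /` and `df -i /` print.

      ```python
      import os

      def usage(path="/"):
          """Return (block_pct, inode_pct) used on the filesystem holding path."""
          st = os.statvfs(path)
          # Blocks: used / (used + available to unprivileged users), as df does.
          used_blocks = st.f_blocks - st.f_bfree
          denom = used_blocks + st.f_bavail
          block_pct = 100.0 * used_blocks / denom if denom else 0.0
          # Inodes: some filesystems report 0 total inodes (allocated on demand).
          used_inodes = st.f_files - st.f_ffree
          inode_pct = 100.0 * used_inodes / st.f_files if st.f_files else 0.0
          return block_pct, inode_pct

      if __name__ == "__main__":
          blocks, inodes = usage("/")
          print(f"/: {blocks:.1f}% of blocks used, {inodes:.1f}% of inodes used")
          if max(blocks, inodes) > 90.0:  # illustrative threshold, not a standard
              print("warning: root filesystem is nearly full")
      ```

      Checking both numbers matters because a filesystem can have plenty of free blocks and still refuse to create files once the inodes run out.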

      1. 3

        Ah yes, inode exhaustion is another cause of “strangeness without obvious explanation”.

        1. 2

          I learned this the hard way a few weeks ago!

      2. 12

        Posts like these make me nostalgic for the time I spent using Solaris. The debug tools (iostat, dtrace, mdb), the filesystem (ZFS), the containers & virtualization options (LDOMs, Zones), made it all a truly lovely environment. Best of all, you got to step outside of the x86 monoculture (RIP SPARC).

        1. 10

          The great thing about Solaris is all the debugging tools, the bad thing about Solaris is that you have to use all the debugging tools. :-)

          1. 5

            For better or for worse, debugging is the cornerstone of the modern computing experience. Whether on Windows, Linux, or Solaris, I’ve encountered kernel bugs on every OS I have used!

        2. 8

          I love blog posts like these even if I mostly don’t understand them.

          1. 4

            But why would running stat on a proc entry for a blocked process also block?

            1. 5

              In this case I believe it was because accessing the proc entry touched something within the Solaris kernel that happened to be locked by ZFS, which ZFS couldn’t release because the filesystem was full.

              1. 2

                My guess is there’s at least a third thread in play (not pictured) that might be interacting with the procfs lock and the address space lock; e.g., something like pmap.

              2. 3

                In an old sysadmin life, I learned to put a /BIG ballast file on all servers, so as to have something to quickly recover from this problem while looking for a more permanent solution.

                Of course, this was from the days in which each server had a clever name, and I could count them, etc.
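
                A minimal sketch of the ballast trick in Python. The function names, the path and the size are all hypothetical; the only real point is that the file is written with actual bytes rather than created sparse, so the space is genuinely set aside.

                ```python
                import os

                def make_ballast(path, size_bytes, chunk=1 << 20):
                    """Create a file of size_bytes filled with real zero bytes."""
                    block = b"\0" * chunk
                    remaining = size_bytes
                    with open(path, "wb") as f:
                        while remaining > 0:
                            n = min(chunk, remaining)
                            f.write(block[:n])  # real writes, so the file is not sparse
                            remaining -= n
                        f.flush()
                        os.fsync(f.fileno())  # push the data to disk before relying on it

                def release_ballast(path):
                    """Delete the ballast to free space in an emergency."""
                    os.remove(path)

                if __name__ == "__main__":
                    # Hypothetical location and size; adjust to the server.
                    make_ballast("/BIG", 1 << 30)
                ```

                One caveat: on a filesystem with compression enabled, a file full of zeros may occupy almost no space, so the ballast would need incompressible data or a filesystem-level reservation instead.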

                1. 2

                  And we were singin’ bye bye Miss American Pie, drove my Chevy to the levee but the levee was dry. The good ol’ boys were drinkin’ whiskey and rye, singin’ this’ll be the day I catch fire, this’ll be the day that I catch fire.

                  Did you write the book of love And do you have faith in mods above If the module tells you so? Do you believe in proc and zpool? Can code save your mortal soul? And can you teach me how to debug real slow…

                  1. 1

                    Some things never change. Is there a single UNIX variant that doesn’t fail in some way when it can no longer write to disk? It’s unacceptable for most programs to fail if they can no longer write to disk; Firefox also fails, but of course gives no indication as to what’s happening.

                    Now, of course, no UNIX variant I’m aware of has a real notion of system programs, so there are no programs that get special privileges that would protect them from this. It would be preferable for the system to die rather than permit this debugging nonsense, considering the problem would perhaps actually be fixed if the machine died in this case.

                    Regarding @dsschnau, there should be no need to understand this, because it’s an asinine failure case mired in 1970s malpractice. There is nothing of real educational value in this.

                    Regarding @fbo, I believe every operating system you’ve used has such glaring flaws. Don’t you agree that’s damning and unacceptable? No one has any right to be proud of this mess. There are millions of lines of code and yet basic failure cases aren’t truly accounted for or are handled in the most asinine of ways, such as with the “Out of Memory Killer”.

                    1. 5

                      Many Unixes reserve 5% of disk space so that only root can write to it; unprivileged programs will fail, but system daemons running as root won’t.
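
                      For the curious, a small sketch of how to see that reservation from Python, assuming a POSIX system with `os.statvfs`. The helper name `root_reserve` is made up for illustration. The reserve shows up as the gap between free blocks (`f_bfree`) and blocks available to unprivileged users (`f_bavail`); on ext2/3/4 the default is 5% and can be changed with `tune2fs -m`, while many other filesystems report no gap at all.

                      ```python
                      import os

                      def root_reserve(path="/"):
                          """Return (reserved_bytes, reserved_pct) held back from unprivileged users."""
                          st = os.statvfs(path)
                          # f_bfree counts all free blocks; f_bavail excludes the root-only reserve.
                          reserved_blocks = st.f_bfree - st.f_bavail
                          reserved_bytes = reserved_blocks * st.f_frsize
                          pct = 100.0 * reserved_blocks / st.f_blocks if st.f_blocks else 0.0
                          return reserved_bytes, pct

                      if __name__ == "__main__":
                          nbytes, pct = root_reserve("/")
                          print(f"/: {nbytes} bytes ({pct:.1f}%) reserved for root")
                      ```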

                      1. 2

                        “Is there a single UNIX variant that doesn’t fail in some way when it can no longer write to disk?”

                        I’d look into IBM’s AIX on POWER to see what they do. They were always number 1 in the uptime surveys I read. The main problem I had with those surveys was that they left off AS/400s and VMS clusters. That’s cheating. AIX systems are reportedly pretty reliable in the long term, though. I’m curious if any Lobsters can corroborate or refute that with specific details.

                        1. 3

                          The problem with AIX is that it maliciously complies with POSIX, making developing for it a nightmare.

                          1. 2

                            “Malicious compliance” - that’s a new one.

                            “We’re gonna follow the rules - but EVILLY”.

                      2. 1

                        I read the title and the first thing that came to mind was that one time I let the system disk get 100% full on one of my servers and all hell broke loose.

                        Made for a nostalgic read. I love these kinds of articles.