1. 87
  1. 24

    Hm, from the comments here, it seems that people generally think of this as a dichotomy:

    • Either you provide a single command line the user just blindly copy-pastes into the terminal
    • Or you teach the user how the thing actually works, so that the required spell becomes obvious to them

    I advocate for neither of those things. Rather, I want two command lines I can blindly copy-paste into the terminal, where the first one prints yay! or nay :( depending on whether the second command is needed. The workflow I want:

    • paste the first command, see nay :(
    • paste the second command
    • paste the first command again, see yay!
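
A minimal sketch of that check / fix / re-check workflow, in Python for illustration. The names `check_fix_check`, `is_param_set`, and `set_param` are hypothetical stand-ins, not anything from the article; a real check would inspect the system rather than a dictionary.

```python
from typing import Callable


def check_fix_check(check: Callable[[], bool], fix: Callable[[], None]) -> str:
    """Run the check, apply the fix only if the check fails, then re-run the check."""
    if check():
        return "yay! (nothing to do)"
    print("nay :(")
    fix()
    return "yay!" if check() else "nay :( (fix did not help)"


# Hypothetical stand-ins for "the first command" and "the second command".
state = {"param_set": False}


def is_param_set() -> bool:
    return state["param_set"]


def set_param() -> None:
    state["param_set"] = True


result = check_fix_check(is_param_set, set_param)
print(result)
```

The point of the shape is that the fix never runs when the check already passes, and the same check is reused afterwards to confirm the fix took effect.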
    1. 5

      So your point is that you want some type of verification step included with a “fix” that is posted?

      I’m not sure how useful that would be during the “troubleshooting” step, since you may get false negatives due to some complex chain of dependencies/situations and end up applying the wrong “fixes” for the actual problem. For example, there are an endless number of reasons why you might lose connectivity to the internet, and a check for “is DNS configured” might fail because packets aren’t leaving your NIC, you might have the wrong subnet configured, there may not be any configured route to the desired DNS server, your firewall might be blocking it, and so on. The “fix” for resolving any of those is likely to be very different. So someone who doesn’t have the ability to troubleshoot the problem much deeper than running some “yay or nay” script to confirm a problem (that may not really be the underlying cause for the problem they are having…), applying a fix for the wrong problem may not result in the “yay or nay” passing after, and they now have an additional problem on their hands (reverting the previous fix successfully).

      Anyways, I think you still need at least two brain cells to rub together in order to troubleshoot any system, Linux or otherwise. While it would be nice to have a test case accompany every single fix or workaround, it’s not very plausible. A human is still necessary to provide a reasonable amount of confidence that you’re about to try and fix the thing that might actually be your problem. Scripting that out for all possible situations, and having fellow humans select the right “yay or nay” to run, would be hard.

      A verification like that is useful after you’ve determined the root cause of the problem you are facing and have applied a fix, but if you’re capable of doing that then it’s almost always trivial to come up with your own “test” for that. E.g. in the DNS example, if the root cause was systemd-resolved had crashed because of some malformed configuration file, you can make sure the service is running after you fix the broken config, and use dig or nslookup to test that name resolution is working again.
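
As a rough sketch of the kind of post-fix test described above, assuming the root cause really was systemd-resolved being down: the function names here are made up for illustration, `systemctl is-active --quiet` is used to check the service, and Python's `socket.getaddrinfo` stands in for running dig or nslookup by hand (it goes through the system resolver, so it is not an exact equivalent).

```python
import socket
import subprocess


def service_is_active(unit: str, run=subprocess.run) -> bool:
    # `systemctl is-active --quiet UNIT` exits with status 0 only when the unit is active.
    try:
        return run(["systemctl", "is-active", "--quiet", unit]).returncode == 0
    except FileNotFoundError:
        # systemctl is not installed on this machine.
        return False


def name_resolves(host: str, resolve=socket.getaddrinfo) -> bool:
    # socket.gaierror is a subclass of OSError, so failed lookups land here.
    try:
        return len(resolve(host, None)) > 0
    except OSError:
        return False


def verify_dns_fix(unit: str = "systemd-resolved", host: str = "example.com",
                   active=service_is_active, resolves=name_resolves) -> str:
    """Report yay/nay after a fix; this confirms a fix, it does not diagnose a cause."""
    if not active(unit):
        return f"nay :( {unit} is not running"
    if not resolves(host):
        return f"nay :( {unit} is running but {host} does not resolve"
    return "yay!"
```

Calling `verify_dns_fix()` after repairing the config would exercise both checks; a "nay" here only says the fix did not work, not which of the many other possible causes is at play.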

      1. 2

        Rather, I want two command lines I can blindly copy-paste into the terminal, where the first one prints yay! or nay :( depending on whether the second command is needed.

        I think that’s partly because what you want can only work in a rather narrow set of situations. Like others, I think I just extrapolated from that.

        The example in your post is pretty straightforward, as in, it’s pretty easy to check if a device with that particular dev id was detected, and that the associated module either hasn’t been probed, or that the kernel command line doesn’t have the right incantation to make the module play ball with that device. And even in this case, I think you can’t come up with a comprehensive enough command that prints yay! or nay. For example, there are other mechanisms to force a device ID match (I don’t know if this is specific to the i915 driver though – maybe they’re not relevant?) besides the kernel command line.

        But there are plenty of cases where that’s just not gonna work, not even remotely. If this weren’t Intel but an ARM board without a “real” device probing interface, where it’s all done via a user-supplied device tree, there are thousands of reasons why a device wouldn’t take off. Like, I think there are tens of thousands of dollars billed every day around the world for nothing other than figuring out why that happens.

        “Teach the user how the thing actually works so that the spell is obvious” is obviously extreme. But “provide enough context information so that the user can understand what the problem might be, how to check if they have it, and how the spell would fix it” is doable, and while the first two parts might be easy to automate at times, in my experience, that’s rarely the case.

        1. 1

          Oops! I’m definitely to be partially blamed here. I guess we read what we want to read 🙂

          I see what you’re saying now, and I think it makes sense. My apologies for derailing the conversation.

          1. 2

            Yeah! To be fair, what you are saying also makes sense: as the last paragraph of TFA says, after all these years I still have zero skills in debugging these sorts of issues from first principles, and that is potentially also a problem, just a different one!

            1. 2

              I wonder if there is a larger point to be made about the spectrum of learning.

              • Do X if you are seeing Y.
              • Do X if you are seeing Y and it is due to Z. You can verify you have Z by checking A.
              • Do X if you are seeing Y and it is due to Z. You can verify you have Z by checking A. The reason this is the case blah blah blah.

              Essentially, applying a fix blindly is the lowest form of learning. Applying a fix after verifying the exact problem is slightly better. But understanding the problem and why it occurs before applying the fix is the best. But of course it is the inverse for the amount of effort required on behalf of the answer provider. The second bit provides more learning, and a better debugging experience, at minimal effort to the answer provider, which is a good argument for trying to apply that everywhere.

              It’s also interesting that it almost seems exponential in effort for the answer provider.

        2. 18

          [Edit] I realize I read what I wanted from the article rather than what was actually written — I’m leaving this here but recognize that it’s not quite what the article is about.

          I really like this, and I think this advice extends beyond just Linux troubleshooting. It’s really advice on how to teach people and how people learn. Answers are 20% of the learning process, 80% is understanding how to get to the answer, and it’s so critical for developing skills. I could rant about the US education system teaching-to-the-test which is focusing on that 20% and how terrible it is.

          One of my roles at my current job is helping people learn Rust, and when someone comes to me with a confusing type error I always make an effort to explain how to read the error message, why the error message occurs, and the various ways to fix it. If I instead just provided the fix, there would be no learning, no growth and no development of self-sufficiency. It takes longer, and sometimes people just want an answer, but I stay firm on explaining what is going on (partially because I don’t want to be fixing everyone’s basic type errors). I wonder if part of the issue with Linux troubleshooting advice is that it doesn’t have that same feedback mechanism — someone not learning doesn’t affect the author of the advice in any way so there is no real push for building self-sufficiency.

          Anyway, I think this post was really short and to the point, and I completely agree with the message, but I also think it’s interesting just how much it extends beyond Linux diagnostics and into learning everywhere.

          1. 3

            I agree, it does work as “How to write good troubleshooting advice” in general (which IMHO would be a better title anyways).

            1. 1

              Dave Jones (EEVBlog) does an excellent job of this in the world of electronics, a playlist of his troubleshooting videos and his approach: https://www.youtube.com/playlist?list=PLvOlSehNtuHsc8y1buFPJZaD1kKzIxpWL

            2. 14

              Boy, have we come full circle!

              Twenty years ago, some time before ESR’s dreadful Aunt Tillie meets CUPS rant, there was a tendency in the Linux community to write pretty much the kind of stuff that the author calls for. For example, there were entire HOWTOs for troubleshooting (see https://tldp.org/HOWTO/Online-Troubleshooting-HOWTO/index.html ) and troubleshooting sections included some basic diagnostic information, too (see e.g. https://tldp.org/HOWTO/DSL-HOWTO/tuning.html ).

              (Also, take a moment to experience that peculiar feeling of nostalgia and dread about Linux and DSL in 2001)

              Obviously, there were lots of bad guides with bad advice back then, too. Probably just as many as today. I’m not claiming we had better docs back then, just that this style of documentation was more common than it is today, and that it was considered to be a good kind of documentation to write.

              That went out of fashion because lots of people began to think of it as bad style – “users” don’t need a lesson in Linux device driver management, they just need a simple solution that fixes their problem. There were various reasons for abandoning this style of documentation. Community managers thought it would scare new users away because it made troubleshooting look complicated. UX innuvashun types thought it was too complex for non-technical users to follow, thus preventing them from achieving the actual tasks they set out to do, like setting up a MIPS cross-compiler or using a supported, but unadvertised screen resolution. Customer relations/knowledge base managers felt anything not fixable by failsafe one-line snippets is best handled through paid support channels. l33t power users felt you should be able to figure out these things by yourself, and that if you can’t, you should just use a pleb OS like OS X or Windows. And so on and so forth.

              I’m rooting for it to become fashionable again but I can’t say I have high hopes. An entire generation of developers and admins has been brought up on cheap fixes, so it’ll take about another cycle to have this fixed – if it ever gets fixed.

              (FWIW, this is one of the reasons why I lament the disappearance of the old Gentoo wiki. The Arch wiki probably has more information by now – in part because it’s older, and obviously has more users by now – but IIRC not as much depth.)

              1. 12

                I think there’s a higher-level problem here: if a fix can be applied mechanically without a user understanding what’s going on, then why is the user involved at all? If it takes about as much time to write the instructions for a user as it takes to automate the process, why not add the automatic troubleshooter to the code and just leave something in a log for the cases where it failed to handle the problem? Even if you do need some user interaction (‘is anything visible on the screen?’), this can be wrapped in an interactive troubleshooter. Windows has a surprisingly good sound troubleshooter that mostly just restarts three services in the right order, but tries playing a sound and asks the user if they can hear it at each step (and does some other things like enumerating all of the audio devices and trying with each, to catch cases where sound is working but going to a device with no speakers connected).

                BeOS was great at this kind of thing. The sound subsystem and network stack were unstable, but they had watchdog monitors and would restart them if necessary. Most of the time there was no user interaction.

                Self-healing systems give a great user experience. By the time a user has had to search online for their problem, they’re already frustrated.
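
To make the watchdog idea concrete, here is a small sketch, not how BeOS or Windows actually implement it: a hypothetical `heal` helper that restarts a subsystem a bounded number of times and leaves a log entry when it gives up. The `is_healthy` and `restart` callables are assumptions supplied by the caller.

```python
import logging
from typing import Callable

log = logging.getLogger("watchdog")


def heal(name: str, is_healthy: Callable[[], bool], restart: Callable[[], None],
         max_restarts: int = 3) -> bool:
    """Restart an unhealthy subsystem up to max_restarts times; log if that is not enough."""
    for attempt in range(1, max_restarts + 1):
        if is_healthy():
            return True
        log.warning("%s is unhealthy, restart attempt %d", name, attempt)
        restart()
    # One final check after the last restart.
    healthy = is_healthy()
    if not healthy:
        log.error("%s still unhealthy after %d restarts, giving up", name, max_restarts)
    return healthy
```

The bounded retry count matters: without it, a subsystem that can never recover would be restarted forever instead of leaving a trace for a human to look at.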

                1. 2

                  The example in the article is sort of interesting in this regard because it hints at something even awkwarder. It looks like the problem there is caused by a device that actually has a working driver, it’s just that the PCI device ID isn’t matched with the driver (presumably, in older kernel versions).

                  This isn’t really a fix that you can apply mechanically, a la restarting via a watchdog timer, but it’s also not… really something that absolutely can’t be done without user intervention, either, not for common enough hardware at least, where force probe lists can be maintained separately from the kernel tree, or at least in addition to the kernel tree (versioned per kernel, of course) with relative ease.

                2. 6

                  This sort of advice also permeates throughout the community, and can also be harmful if the people running these commands don’t fully understand them. For example, many Ubuntu troubleshooting forums are full of comments like, ‘I copy-pasted this command and it worked, just try running it’. And who can blame them, if this is how more experienced users troubleshoot issues for newbies?

                  1. 3
                    curl $dead_url | sudo sh

                    works for me! Thanks, denvercoder!

                  2. 3

                    A common problem is someone tries various commands or config changes, then one of them works, but there are all kinds of side effects left in the system, little time bombs that create havoc later. Eventually you just factory reset. It’s not a sane way to live.

                    1. 3

                      This is everything, not just Linux though. I have no idea why I have to enable the IP remote setting in some obscure menu on my TV, but it fixes the issue of the TV restarting 10-30 seconds after I turn it on over HDMI. I wish I did know, but all I can find is people repeating the solution rather than how to diagnose / understand and fix the underlying bug.

                      1. 2

                        Yes, but the advice limits itself to what’s realistically possible to get knowledge of. I think it holds for science and source code, for example, which is well understood by many.

                        But your TV is an anti-example: Your TV might have a glaring firmware bug, but it’s hard for us plebs to set a breakpoint in a closed source blob running on a different machine you have to be a security researcher (or have a jtag dongle) to log into. The bane of firmware is that, while you are stuck with the factory version, the devs have probably fixed it as soon as someone noticed it, 5 years ago, and forgotten it.

                        1. 1

                          No disagreements with that take, though it’s a Bravia from 5 years ago running android, so ¯_(ツ)_/¯

                      2. 2

                        This is about capturing and structuring knowledge. In a perfect world of infinite funding one could see:

                        1. Alice has a problem and posts a vague description.
                        2. Bob replies with the information he gleaned, often with ‘This worked for me once, at least I think this was it.’
                        3. Charlie scans over these documents and tries to put them in a basic form with keywords. This is more easily searched, scanned, and still has the basic “worked for me once” information.
                        4. Debbie knows more about this area and sees many people having the problem. She adds “how to diagnose”, “how to see if worked”, “how to undo it if it didn’t work”, and “where to read more” sections.
                        5. Erin just puts in a pull request to auto-diagnose and fix the issue on new installs.
                        1. 2

                          This is not at all specific to Linux. It’s probably particularly bad there, because many of the potential fixes are one-liners which take hours or days to understand in detail, so even the people who do understand the steps don’t have the time to explain them.

                          1. 1

                            For error messages, I try to state the immediate problem (for example, “NoMethodError mkdir on nil”), then explain why that’s likely a problem: “The variable @foo is nil, which does not have the method mkdir”. I agree with adding in a debugging step. I maintain the Ruby buildpack and if we fail to detect rake tasks we output the command rake -p to the user to have them verify that works locally. I also try to add links to documentation with full explanations.

                            This blog also reminded me a bit of a post I wrote on how to write tutorials: https://schneems.com/post/60359275700/prepare-do-test-make-your-technical-writing-shine. The core idea I was trying to convey is that in any tutorial, after saying “do this command”, you should always pair it with a “check it with this other step”, so that if there’s a mismatch the user isn’t 10 pages away and having to bisect their own work.

                            1. 1

                              @matklad, there’s a bug in your RSS feed and it sends me to URLs with matklad.github.io//path instead of /path.