1. 28

  2. 9

    A function may succeed. Or it may, for example, fail because of disconnected backend. Or it may time out. And that’s it. There are only two failure modes and they are documented as a part of the API. Once you have that, the error handling becomes obvious.

    I must be missing something because it really feels like there are plenty of other ways for a function to fail. Is this limited to a specific context? If it’s only for infrastructure, it still seems woefully pigeonholed.

    As already mentioned, classic exceptions are the worst.

    I’m not clear on why they are “the worst”.

    The discussion does hit on something that makes sense to me: think about and document the error conditions. Frankly, if you have that, the methodology of reporting the error becomes less of a hassle. But still, error handling is plagued by the fact that it is often something non-local that is affecting the computation and there is rarely any useful information or language constructs that make dealing with it anything short of a massive chore. (Correcting it usually means interacting or “conversing” with some other entity to gain the knowledge to proceed.)

    1. 4

      I must be missing something because it really feels like there are plenty of other ways for a function to fail. Is this limited to a specific context? If it’s only for infrastructure, it still seems woefully pigeonholed.

      POSIX is quite a good example of how it could work. Every function can return a few possible error codes and that’s it. The idea is that the implementer of the function deals with the complexity and factors all possible error conditions into a small, neat set of error codes that makes sense from the user’s point of view.

      The rule here should be: If you don’t know what to do with an error condition, don’t just pass it to the caller. The caller understands the problem domain even less than you do.

      But still, error handling is plagued by the fact that it is often something non-local that is affecting the computation and there is rarely any useful information or language constructs that make dealing with it anything short of a massive chore.

      The point is to use encapsulation for errors as well as for normal functionality. If something non-local causes an error somewhere down the stack, the layer that deals with the thing (and every layer above it) should convert it into an error that makes sense in the local context.
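
A minimal Go sketch of that kind of encapsulation might look like the following. Everything here is hypothetical and for illustration only: the `Get` function, the two sentinel errors, and the simulated `lowLevelGet` are assumptions, not part of any real library.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// The only two failure modes this hypothetical store API documents.
var (
	ErrUnavailable = errors.New("store: backend unavailable")
	ErrTimeout     = errors.New("store: timed out")
)

// errDiskGone is a made-up low-level error used only for the demo.
var errDiskGone = errors.New("ioctl: device removed")

// lowLevelGet simulates the layer below; real code would do I/O here.
func lowLevelGet(key string) error {
	switch key {
	case "slow":
		return os.ErrDeadlineExceeded
	case "broken":
		return errDiskGone
	}
	return nil
}

// Get folds every low-level failure into the documented set. Details are
// logged here, at the layer that understands them, not leaked upward.
func Get(key string) error {
	err := lowLevelGet(key)
	switch {
	case err == nil:
		return nil
	case errors.Is(err, os.ErrDeadlineExceeded):
		return ErrTimeout
	default:
		fmt.Fprintf(os.Stderr, "store: get %q: %v\n", key, err)
		return ErrUnavailable
	}
}

func main() {
	for _, key := range []string{"ok", "slow", "broken"} {
		fmt.Printf("%s -> %v\n", key, Get(key))
	}
}
```

The caller only ever has to branch on `ErrUnavailable` and `ErrTimeout`; the raw cause stays in the log of the layer that produced it.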

      1. 1

        If something non-local causes an error somewhere down the stack, the layer that deals with the thing (and every layer above it) should convert it into an error that makes sense in the local context.

        When said this way, I understand the point better. I did not get that from the original post. I think that’s a reasonable way to deal with things, although I don’t think it precludes exceptions as the mechanism for doing it.

        1. 1

          True, but exceptions make it super easy to screw it up. Just forget a catch block in one function and the raw low-level exception escapes up the stack. In the C/Golang style of error handling you have to at least pass it up manually, which will, hopefully, make you consider whether it’s a good idea in the first place.

      2. 3

        (Correcting it usually means interacting or “conversing” with some other entity to gain the knowledge to proceed.)

        That’s why, even though it is relatively heavy-weight for an API, it seems that passing a callback to be called on error is one of the most versatile things you can do. The callback can correct the error and allow the call to proceed or just throw an exception. At a deeper level, doing this allows you to interact with the context at the point of detection, not the point where you expressed your intention: the initial call that led to the error.

        I think this is the closest we can come to approximating Lisp’s condition system in languages without those constructs.
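
As a rough illustration of the callback idea, here is a small Go sketch. The `OnError` type, `SumAll`, and the two handlers are all hypothetical names invented for this example; it only approximates a restart and is not a condition system.

```go
package main

import (
	"fmt"
	"strconv"
)

// OnError is a hypothetical handler called at the point of detection.
// It returns a replacement value, or an error to abort the whole call.
type OnError func(input string, cause error) (int, error)

// SumAll parses and sums the inputs. On a bad input it asks the handler
// what to do instead of unwinding immediately.
func SumAll(inputs []string, handle OnError) (int, error) {
	total := 0
	for _, in := range inputs {
		n, err := strconv.Atoi(in)
		if err != nil {
			n, err = handle(in, err)
			if err != nil {
				return 0, err
			}
		}
		total += n
	}
	return total, nil
}

// useZero "restarts" the computation by substituting 0 for the bad input.
func useZero(string, error) (int, error) { return 0, nil }

// abort gives up and reports which input was bad.
func abort(in string, cause error) (int, error) {
	return 0, fmt.Errorf("bad input %q: %w", in, cause)
}

func main() {
	inputs := []string{"1", "x", "2"}
	fmt.Println(SumAll(inputs, useZero))
	fmt.Println(SumAll(inputs, abort))
}
```

The policy (substitute, skip, abort) is chosen by the caller, but it runs where the problem was detected, with the loop state still intact.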

        1. 3

          Signals and restarts are wonderful things. It’s such a shame no other language or programming system (to my knowledge) has made a serious effort to emulate them, let alone build on them. Callbacks are the best we can do – or what we’re willing to abide – it seems.

          1. 3

            Have you heard of the Zen of Erlang? https://ferd.ca/the-zen-of-erlang.html

      3. 8

        One pattern that I’ve noticed is that many web developers don’t get rigorous error handling, perhaps because the context in which we’re developing does so much work for us. If you:

        1. Don’t share mutable state between requests
        2. Don’t eat exceptions
        3. Have a transactional data store

        there’s a limited amount of damage that you can do by being lazy with error handling. The worst case is that you fail to make progress in an instance where you could’ve continued past some exception, but you’ll never end up in an inconsistent state.

        1. 4

          Yes. Also, if anything goes wrong, the user will just hit the “reload page” button. No big harm done. I think GUI applications are ones that rarely need rigorous error handling.

        2. 6

          Given that most bugs are transient, simply restarting processes back to a state known to be stable when encountering an error can be a surprisingly good strategy.

          ~ Fred Hebert, Erlang in Anger

          1. 3

            This of course is how a great many bots find themselves on my blacklist. If a request returns 404, continuing to pound the same URL does not resolve the error.

            1. 4

              Which is of course why error handling still needs some amount of contextual logic, e.g. for a 404 the resource is not there so stop trying. Or even a general-purpose retry logic like exponential backoff with a failure cutoff.
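
A minimal Go sketch of such retry logic, assuming a hypothetical `Retry` helper and a sentinel `ErrNotFound` standing in for a 404. Neither comes from a real library, and the delays are kept tiny for the demo.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrNotFound stands in for a permanent failure such as an HTTP 404.
var ErrNotFound = errors.New("not found")

// errBusy stands in for a transient failure worth retrying.
var errBusy = errors.New("temporarily unavailable")

// Retry calls op up to maxAttempts times, doubling the delay after each
// transient failure. Permanent failures are returned immediately.
// It reports how many attempts were made.
func Retry(maxAttempts int, base time.Duration, op func() error) (attempts int, err error) {
	delay := base
	for attempts = 1; ; attempts++ {
		err = op()
		if err == nil || errors.Is(err, ErrNotFound) || attempts >= maxAttempts {
			return attempts, err
		}
		time.Sleep(delay)
		delay *= 2
	}
}

func main() {
	calls := 0
	n, err := Retry(5, time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errBusy
		}
		return nil
	})
	fmt.Println("transient:", n, err)

	n, err = Retry(5, time.Millisecond, func() error { return ErrNotFound })
	fmt.Println("permanent:", n, err)
}
```

The contextual part is the classification: which errors count as permanent is a decision only the caller's domain can make.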

          2. 6

            One thing I enjoy about Rust is that error handling is very uniform. To get the result, I need to check for the error (or explicitly discard it.) When the program compiles, I can feel pretty confident that I’ve handled everything in one way or another. I’m sure there are cases where this breaks down, but it’s a nice “default.”

            1. 2

              I agree, option types and mandatory complete pattern matching solve a good portion of the issues raised in the article. It might be interesting to look at Zig, which is built with error handling as a core feature.

            2. 5

              In my experience, there are four types of errors [1]. To summarize, using connect() as the example:

              • It’s a bug: EBADF should not be happening. That it is, is a bug. Once fixed, it should not happen. EFAULT, ENOTSOCK, and EISCONN all fall under this category as well.

              • It’s fixable outside the scope of the program: EACCES, ENETUNREACH and EAGAIN are examples here. Report, and exit, not much else to do.

              • Resource exhaustion, things are probably going bad quickly: EAGAIN might also be in this category. ENOMEM definitely is, but that’s not a possible error for connect().

              • Expected and should be handled by the program: ETIMEDOUT, ECONNREFUSED, EINTR (if using signal()), maybe ENETUNREACH could be considered here as well. Things that can normally happen, and there is some overlap with the second category here.

              It’s now a bit simpler—just check for expected conditions and handle; everything else should be logged (and possibly terminate the program, depending upon the condition).
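
The four categories above could be sketched in Go roughly as follows, assuming a Unix-like platform where the `syscall` package defines these errno constants. `Category`, `Classify`, and `fakeConnect` are hypothetical names for this example. Because EAGAIN and ENETUNREACH are listed under more than one category, the sketch arbitrarily assigns each to a single one.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// Category is a hypothetical label for the four kinds of error.
type Category string

const (
	Bug       Category = "bug"
	External  Category = "fixable outside the program"
	Exhausted Category = "resource exhaustion"
	Expected  Category = "expected"
	Unknown   Category = "unknown"
)

// Classify maps an errno found anywhere in err's chain to a category.
// A nil error or one without an errno is reported as Unknown.
func Classify(err error) Category {
	var errno syscall.Errno
	if !errors.As(err, &errno) {
		return Unknown
	}
	switch errno {
	case syscall.EBADF, syscall.EFAULT, syscall.ENOTSOCK, syscall.EISCONN:
		return Bug
	case syscall.EACCES, syscall.ENETUNREACH:
		return External
	case syscall.EAGAIN, syscall.ENOMEM:
		return Exhausted
	case syscall.ETIMEDOUT, syscall.ECONNREFUSED, syscall.EINTR:
		return Expected
	}
	return Unknown
}

// fakeConnect simulates connect() failing in a chosen way (demo only).
func fakeConnect(outcome string) error {
	errnos := map[string]syscall.Errno{
		"badfd":   syscall.EBADF,
		"denied":  syscall.EACCES,
		"again":   syscall.EAGAIN,
		"refused": syscall.ECONNREFUSED,
	}
	if e, ok := errnos[outcome]; ok {
		return &os.SyscallError{Syscall: "connect", Err: e}
	}
	return nil
}

func main() {
	for _, o := range []string{"badfd", "denied", "again", "refused"} {
		err := fakeConnect(o)
		fmt.Printf("%v: %s\n", err, Classify(err))
	}
}
```

With a classifier like this, the calling code only needs explicit branches for `Expected`; everything else can be logged and, depending on the condition, terminate the program.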

              [1] On that page I list three ways to handle errors. Since then, my thinking has changed somewhat on that but I’ve yet to write it up.

              1. 2

                I like the categorization. It makes it easier to think about the errors.

                One thing that comes to mind is: Can we deal with some of those categories automatically? For example, I’ve never seen ENOMEM handled in a reasonable way. While in theory it looks like it can be handled, things like memory overcommitment and the OOM killer make it futile. Maybe we gave up any chance of handling OOM errors back in the 1960s when we replaced static invocation records with the call stack. Anyway, maybe returning ENOMEM makes no sense at all. Instead, the OOM killer should just kill the process. But I’ve never done embedded programming, so who am I to say?

                1. 3

                  Slightly related, I really like the HaskellWiki page on errors vs exceptions: https://wiki.haskell.org/Error_vs._Exception

                  At least for me, discussing “errors” has always been confusing due to the ways we recycle terminology. I like to use “failure mode” to describe a way in which a system can fail, “exception” to refer to uncommon - but not expected - behavior, and “error” to mean unhandled exceptional behavior; an instance of a failure.

                  Not everyone uses these terms this way, and there could be a better vocabulary for this stuff, but I think it’s worth trying to tease apart fine distinctions in this area.

                  1. 5

                    For what it’s worth, I think we’ve drawn a similar distinction in the Joyent guide to Error Handling. Note that while this document was part of our effort to describe the way we build software using Node, the taxonomy really applies everywhere.

                    We split things into two broad categories: Operational Errors and Programmer Errors. The former appears akin to your notion of “Exceptions”; i.e., something that could reasonably happen and should be handled. The latter is effectively a logic error in the program itself: the program is not correct as written and the only completely safe course of action is to terminate without doing further damage (e.g., scribbling into a data file). In this taxonomy, the failure to account for and handle an operational error (e.g., malloc() returning NULL) is, itself, a programmer error (probably a SIGSEGV in the malloc() case).
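
One way the operational/programmer split might look in Go is sketched below. `NewPool`, `ReadConfig`, and `panics` are hypothetical names for this example, and using `panic` for programmer errors is one possible convention rather than anything the guide prescribes.

```go
package main

import (
	"fmt"
	"os"
)

// Pool is a placeholder type for the demo.
type Pool struct{ size int }

// NewPool panics on size <= 0. A non-positive size can only come from a
// mistake in the calling code, so it is treated as a programmer error.
func NewPool(size int) *Pool {
	if size <= 0 {
		panic(fmt.Sprintf("NewPool: size must be positive, got %d", size))
	}
	return &Pool{size: size}
}

// ReadConfig returns an error. A missing or unreadable file is an
// operational error that a correct program must expect and handle.
func ReadConfig(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("read config: %w", err)
	}
	return data, nil
}

// panics reports whether f panicked; used only to demonstrate the split.
func panics(f func()) (did bool) {
	defer func() { did = recover() != nil }()
	f()
	return false
}

func main() {
	if _, err := ReadConfig("/no/such/app.conf"); err != nil {
		fmt.Println("operational, handled:", err)
	}
	fmt.Println("NewPool(0) panics:", panics(func() { NewPool(0) }))
}
```

In real code the panic would normally be left unrecovered so the process terminates; `panics` exists here only so the demo can show both paths.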

                  2. 2

                    Bug report: looks like broken markdown around the libdill and POSIX connect() links.

                    1. 1

                      Fixed. thanks!

                    2. 1

                      what I’ve ended up doing was logging all the errors from the queue for debugging purposes and converting the entire thing into a single “OpenSSL failed” error.

                      I am not proud of that but I challenge you to come up with a better solution.

                      Here’s what I do.

                      1. Record the exceptional condition with as much detail as is appropriate. In a server application, this means printing the error code to the log.

                      2. Backtrack to the last point of user-input, and tell them to try something else. In a commandline tool (i.e. unattended batch) this could mean exiting, but for a desktop (including console) or mobile application it looks more like this:

                         for(;;) {
                           if(!(s=get_file_from_user())) { alert("I need a file"); continue; }
                           switch(try_file_with_openssl(s)) {
                           case ERR_R_BAD_GET_ASN1_OBJECT_CALL:
                             /* use as much or as little detail as the user can take advantage of */
                             alert("I can't work with this file because OpenSSL says it has a bad ASN1 object in it. That could mean that the file is corrupt. Please try another file.");
                             continue;
                           …
                           };
                           break; /* success */
                         }      
                        

                        This can also be done in a server application, and indeed I might prefer it since a sysadmin can log in and potentially correct the underlying issue without watching a process spin/crash all day. If I can take the NFS server offline for a few seconds, and it just causes the mail server to “pause” while the mailboxes are missing (instead of generating errors), then I might even (as a user) appreciate this feature in unexpected ways.

                      NB that many libraries have a way to convert “codes” into human-visible strings. Ideally these would be used, but I felt that ERR_error_string_n was a bit too openssl-specific to be used in an example.

                      1. 0

                        This is why I simply love programming with golang. Everything is set up for you to handle errors somewhat properly, without much hassle.

                        1. 3

                          But when you look at actual real-world code in Golang, what you often see is:

                          err := fn()
                          if err != nil {
                              return err
                          }
                          

                          Even a sane system can be used in a silly way.
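
For contrast with the bare `return err` pattern, here is a small sketch of adding context at each layer using `fmt.Errorf` with the `%w` verb (available since Go 1.13). `loadProfile` and `isMissing` are hypothetical names, and this is only one common convention, not the only reasonable one.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// loadProfile is a hypothetical caller of os.ReadFile.
func loadProfile(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		// Say what this layer was trying to do; %w keeps the cause inspectable.
		return nil, fmt.Errorf("load profile %q: %w", path, err)
	}
	return data, nil
}

// isMissing reports whether the error chain contains "file does not exist".
func isMissing(err error) bool {
	return errors.Is(err, fs.ErrNotExist)
}

func main() {
	_, err := loadProfile("/no/such/profile.json")
	fmt.Println(err)
	fmt.Println("missing file:", isMissing(err))
}
```

Wrapping keeps the low-level cause reachable, which is convenient for debugging but is the opposite of hiding it behind a small documented set of errors; which one is appropriate depends on the API boundary.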

                          1. 2

                            At least the snippet checks for an err value!

                            1. 2

                              That’s ok, if you don’t want to handle it there.

                              Worst is:

                              output, _ := fun()

                            2. 3

                              That’s a joke, right?

                              Error handling is one of the few things I think Go has comprehensively messed up. In no particular order, the builtin errors lack:

                              • Internationalization of error messages
                              • Derived errors (high-level-failure caused by low-level-failure)
                              • Syntactic support for error propagation
                              • Distinction between programmer errors and runtime errors (e.g. Printf returns an error for a bad format string or stdout being closed)

                              1. 1

                                Different opinions, man. It’s not perfect but I think it’s pretty good, compared to all the languages I worked with.

                                Btw, internationalisation of error messages is dead simple to implement, even if it is not straight out of the box.