1. 30
    1. 31

      Rust is absolutely unprepared for that. There’s no good tooling for finding and eliminating panics. There are no (non-hacky) language features for forbidding certain code paths from panicking. The standard library is full of functions that can panic where you would not expect them to. There are many methods that don’t even have non-panicking equivalents.

      Rust is good for highly concurrent, CPU-heavy, high-availability services, but this is also the scenario where crashing the whole process is painful. “Just crash” is naive when the crash can interrupt hundreds of independent requests that were in progress. “Then just drain and shut down” is a gamble when the about-to-go-supernova paused thread may be holding locks or have abandoned tasks blocking job queues, turning the shutdown into a stall that aborts anyway, after making the situation worse by the extra timeout.

      Whole-server-killing events are problematic, because they amplify a bug in one request to all of them, visibly affecting many customers. Such crashes can’t always be hidden — not all requests are idempotent, and retries require keeping a copy of the inputs, which isn’t always feasible (e.g. uploads). Retries are also a recipe for DoSing yourself, because the request that hit the panic will be retried and kill the whole server again and again (even when I don’t retry, the client may).

      Adding isolation back by splitting a server into multiple processes is not free, in terms of both performance and complexity. Thread-per-core designs can have issues with unbalanced workloads.

      A complete restart is also difficult for high-traffic servers, because restarting requires reloading data, reconnecting to databases, warming up caches, and redoing the lost work, on top of all the new traffic that is piling up. I can do thousands of requests per second, but not thousands of restarts per second.

      If attackers find such a request, then instead of just wasting their own time looking at a 500 error, they will be able to cause a persistent crash loop keeping my servers down for however long it takes me to patch the problem and build a new release, which with Rust’s compile times will be long enough to cause an outage that I’ll have to apologize for.

      This is not the first time I’m frustrated with Rust’s leadership not caring about the reliability of Rust processes. I similarly had to fight a defeatist, narrowly Linux-centric view that Out of Memory can never happen, and even if it did, it would be impossible to handle — all while I was having to deal with actual OOMs happening, and Rust DoS-ing itself.

      Unwind works well enough that I don’t need to shut down servers after a panic. Catching the unwind is still the “just crash” strategy, but with the level of isolation required to make it work, so that the offending request crashes, not the whole service.
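
      As a minimal sketch of that isolation (Request, Response and handle are hypothetical stand-ins, not the actual service code):

      ```rust
      use std::panic::{catch_unwind, AssertUnwindSafe};

      struct Request;          // hypothetical stand-ins for the real types
      struct Response(u16);

      fn handle(_req: Request) -> Response {
          // ...real handler; a bug in here may panic...
          Response(200)
      }

      // A panic unwinds only this request's stack; the server answers 500 for
      // that one request and keeps serving everything else.
      fn handle_isolated(req: Request) -> Response {
          catch_unwind(AssertUnwindSafe(|| handle(req)))
              .unwrap_or_else(|_| Response(500))
      }
      ```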

      1. 6

        I agree that ignoring the fallibility of allocation is unfortunate – but I also know that sometimes you just have to prioritise stuff. I don’t really feel that forcing people to deal with every allocation at the point in the program where it occurs would have resulted in a sufficiently ergonomic experience for Rust to be as successful as it has been.

        On the subject of panics, I often build and deploy my services with panic set to abort and I can’t say that I have noticed a rash of unreliability. What parts of the standard library are routinely panicking for you?

        1. 13

          Handling of OOM could have been equally unobtrusive if it had been a panic from the start, instead of a hardcoded abort. It’s very rare for Drop to allocate, so unwinding mainly just frees memory. It’s unfortunate that Box<dyn Any> is in the unwind API, but even that is not a big deal, because OOM almost always happens on the largest allocations that ask for more pages, while small allocations can be satisfied from pools/freelists and fragmented space (not 100% of the time, but 99% of the time is better than always aborting).

          Unfortunately, because OOM has always been a hard abort, it’s not backward-compatible to add OOM handling to functions that return Result. So you get the worst of both: you pay the cost of a fallible interface, and still get errors that can’t be handled.

          We use Rust for various network proxies and services at Cloudflare. At our scale we can’t afford even a little unreliability. We have people actively trying to exploit us, so if someone found any code path leading to a failing assert!() or unsupported!() anywhere, they could use it against us, if it had the power to take whole services down.

          We use panic=abort too in various tools, sandboxes, background jobs, and short-lived processes. It’s just not sufficient to be the only option.

          As for what panics — check out cargo-show-asm. You’ll find that literally every Rust function has a panicking branch. Not all are dangerous, but many are surprising. std may panic when it can’t make a CStr, e.g. thread::spawn can panic if the thread name isn’t valid (even though it returns Result). split_at doesn’t have a fallible version. iter.chunks(n) must panic when n is 0. n.clamp(min, max) can panic. LLVM is not good at proving that usize+usize won’t overflow (e.g. in with_capacity), so lots of code changing lengths may panic on isize::MAX. That’s not a problem in itself (nothing will ever be that large on 64-bit), but it’s a pain when trying to optimize code or prove that a function will never panic.
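
          To make a few of those conditions concrete, here is a rough illustration of the documented panic cases (the values used here are safe; the comments describe when each call panics):

          ```rust
          fn examples() {
              let data = [1u8, 2, 3, 4];

              // split_at panics if the index is out of bounds, and there is
              // no non-panicking counterpart.
              let (_a, _b) = data.split_at(2);

              // chunks panics if the chunk size is 0.
              let _chunks = data.chunks(2);

              // clamp panics if min > max.
              let _x = 7_i32.clamp(0, 10);

              // with_capacity panics ("capacity overflow") if the requested
              // capacity exceeds isize::MAX bytes.
              let _v: Vec<u8> = Vec::with_capacity(16);
          }
          ```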

          1. 2

            split_at doesn’t have a fallible version.

            split_at_checked was just now proposed for stabilisation. I wonder if that’s a coincidence…

            1. 2

              At our scale we can’t afford even a little unreliability.

              To be honest, I would surmise that the larger your scale, the easier it is to do what, for example, AWS routinely does: treat limiting the blast radius of individual process faults as a core goal.

              Are you not doing fuzzing as part of software testing? If todo!() or unsupported!() are not allowed (and fair enough!) can you just lint them out of changes that are reviewed and tested and integrated into your main branch?

              1. 8

                We have users who want stable long-lived connections (websockets, video calls, ssh, tunnels, etc.). Even for short connections, without unwind we’d have to use more smaller processes to limit the blast radius, and that’s not always desirable due to shared state, caches, load balancing, IPC overhead, etc.

                Being able to use unreachable!("this should never happen") is valuable when it’s not destructive. It clearly communicates the intent, doesn’t waste time on delicately propagating and handling an error that should never happen, and will raise an alert if it ever happens. panic=abort turns that from a useful tool into a bomb, and then we need to eliminate these not only from our code, but also from all dependencies. This is not a testing issue, but a programming-style change for all Rust code.
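
                As a small illustration of that style (the State enum here is made up):

                ```rust
                enum State { Ready, Done, Draining }

                fn step(state: State) {
                    match state {
                        State::Ready => { /* start the work */ }
                        State::Done => { /* finish up */ }
                        // Communicates intent, skips hand-written error plumbing,
                        // and raises an alert if the "impossible" ever happens.
                        // With unwind this fails one request; with panic=abort it
                        // takes the whole process down.
                        State::Draining => unreachable!("this should never happen"),
                    }
                }
                ```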

          2. 4

            There’s no good tooling for finding and eliminating panics. There are no (non-hacky) language features for forbidding certain code paths from panicking. The standard library is full of functions that can panic where you would not expect them to. There are many methods that don’t even have non-panicking equivalents.

            This part at least sounds like a todo list? I mean, they’re perfectly fixable. Although a “non-hacky” method of forbidding panics is difficult, because you’d probably want to run it after all optimizations, otherwise the ergonomics would not be great.

            1. [Comment removed by author]

            2. 6

              The article touches on the one place where I use unwind fairly often:

              This was meant to be used in libraries like rayon that were simulating many logical threads with one OS thread

              Multithreading and parallel processing are a huge part of modern Rust. I think any article suggesting removing unwind (by default or otherwise) would need to spend at least a paragraph or two talking about what they suggest here.

              (The obvious other place for unwind is ffi.)

              1. 5

                Exceptions add so much complexity to a language implementation that I think getting rid of them would be a big benefit to Rust. I also think most examples of catch_unwind I’ve seen are code smells.

                1. 5

                  I generally agree. My uses of catch_unwind would go away if tools like Rustig and findpanics weren’t abandoned and I could verify at build time that my dependency tree is as averse to accidentally reachable panics as I am.

                  (To this day, I don’t feel comfortable depending on goblin for EXE parsing, because my first experiment in throwing my test corpus of “Hello, World! from a bunch of vintage compilers” at it produced a panic, and, to me, the fact that unexpected input can make it panic says the author’s philosophy doesn’t incorporate enough defensive design to be compatible with mine. I did report the bug and contribute the corpus, but that doesn’t change my sense of discomfort.)

                  As-is, I use it as an alternative to doing something silly like giving a “shell script, but with more compile-time correctness” or a “miniserve, but it renders as an image gallery” a multi-process architecture to constrain panics to one batch-job task/request-response cycle.

                  Hell, I’d prefer to give up catch_unwind in favour of tools that would allow me to have more compile-time correctness.

                  That said, I do want said tools before I lose the stop-gap that is catch_unwind.

                  1. [Comment removed by author]

                2. 4

                  One of my favourite ways to handle panics is to freeze the thread and request a graceful shutdown. For a web server intended to run in Kubernetes it would look something like this:

                  1. Set a flag that the server is crashing.
                  2. Trigger the graceful shutdown process.

                  The really important part is 1, because locks may be held by the crashing thread. This means that any shutdown process you have may not complete (unless you are very sure that it never takes locks, including during any memory allocation or freeing). For most of my servers this looks like writing the current time to an atomic variable; my health check then refuses to pass after some timeout past this time. In a more complex system this could set some metric that asks an external process to trigger the shutdown. (In a past job we had the “restart server” which would analyze metrics, identify instances that looked unhappy, and restart them. This “crashing” flag could be considered unhappy. The main benefit was that we could limit restarts to N% of servers at a time.)
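
                  A rough sketch of that setup, assuming the health check reads the same atomic (all names here are made up):

                  ```rust
                  use std::sync::atomic::{AtomicU64, Ordering};
                  use std::time::{SystemTime, UNIX_EPOCH};

                  static CRASHED_AT: AtomicU64 = AtomicU64::new(0); // 0 = not crashing

                  fn install_freeze_on_panic() {
                      std::panic::set_hook(Box::new(|info| {
                          eprintln!("panic, requesting graceful shutdown: {info}");
                          // 1. Flag that the server is crashing; the health check
                          //    starts failing once it sees this timestamp age out.
                          let now = SystemTime::now()
                              .duration_since(UNIX_EPOCH)
                              .unwrap()
                              .as_secs();
                          CRASHED_AT.store(now, Ordering::SeqCst);
                          // 2. Freeze this thread instead of unwinding or aborting,
                          //    so any locks it holds stay held and other threads
                          //    never see the suspect state.
                          loop {
                              std::thread::park();
                          }
                      }));
                  }
                  ```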

                  The advantages:

                  1. Assume that panic == bad task. Don’t let it run for too long after panicking.
                  2. Don’t cause too much service disruption in the process.
                  3. Make it very likely that things like logs and metrics will be flushed and exported.
                  4. If the panic is due to some corrupt global state, that state is very likely guarded by a lock held when the panic occurs; that lock will not be unlocked, preventing other threads from accessing the suspect data.

                  Of course to get this to work well you have to treat any panic seriously and track down the source. If you have semi-expected panics you may be best off just returning a 500 for that request (or whatever background task) and continuing.

                  1. 3

                    catch_unwind is pretty fraught and you should think thrice before using it, I agree; however, I will explode my code size if it gives users a better backtrace on crashes to report to me.

                    1. 2

                      What you have described isn’t a necessary trade-off. You can generate a stack trace in the abort handler and dump it somewhere (although showing it to the user in a nice way may be difficult). Or just let the crash generate a core dump.
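
                      For example, with panic=abort the panic hook still runs before the process aborts, so a sketch like this captures a backtrace without any unwinding:

                      ```rust
                      fn install_backtrace_hook() {
                          std::panic::set_hook(Box::new(|info| {
                              // Runs before the abort; force_capture works even
                              // without RUST_BACKTRACE being set.
                              let bt = std::backtrace::Backtrace::force_capture();
                              eprintln!("panic: {info}\n{bt}");
                              // ...or write it to a crash-report file instead...
                          }));
                      }
                      ```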

                    2. 2

                      Can I get an abort but still print a backtrace?

                      1. 5

                        Yes. Printing a backtrace happens before unwinding or aborting.

                        1. 1

                          How does it know the call frames without unwinding?

                          1. 4

                            “unwinding” refers specifically to control flow going back up the call stack and running destructors etc. It’s a more complex process built on top of the simpler task of figuring out what the call stack is.

                            1. 2

                              I’m fuzzy on the exact details, but I believe all (?) of the backtrace logic still runs before the actual panic runtime gets handed control, and it’s not until the panic runtime that the process either aborts or unwinds. Here are the rustc details on panicking. There is also rust-lang/backtrace, which supplements std’s handling of backtraces and contains code for resolving frames and such.

                          2. 4

                            Ideally you would abort; then the OS would automatically save a core file that includes the entire process state at termination time, unimpacted by any in-process unwinding (which can ruin the state you need), and you’d be able to use a debugger to get not just the stack trace but often quite a lot of other post-mortem debugging evidence, and fix the bug!