I find it slightly odd that an “epic treatise on error models” would fail to mention Common Lisp and Smalltalk, whose error models provide a facility that all others lack: resuming from an error.
Hi, author here, the title also does say “for systems programming languages” :)
For continuations to work in a systems programming language, you can probably only allow one-shot delimited continuations. It’s unclear to me how one-shot continuations can be integrated into a systems language where you want to ensure careful control over lifetimes. Perhaps you (or someone else here) know of some research integrating ownership/borrowing with continuations/algebraic effects that I’m unfamiliar with?
The closest exception to this that I know of is Haskell, which has support for both linear types and a primitive for continuations. However, I haven’t seen anyone integrate the two, and I’ve definitely seen some soundness-related issues in various effect systems libraries in Haskell (which doesn’t inspire confidence), but it’s also possible I missed some developments there as I haven’t written much Haskell in a while.
I’m sorry for the slightly snarky tone of my original reply, but even if you were to discount the Lisp machines, or all the stuff Xerox and others did with Smalltalk (including today’s Croquet), as somehow not being systems, I would have expected an epic treatise to at least mention that error resumption exists – especially since academia is now rediscovering this topic as effect handlers (typically without any mention of the prior art).
For continuations to work in a systems programming language, you can probably only allow one-shot delimited continuations.
This misconception is so common (and dear to my heart) that I have to use bold:
Resumable exceptions do not require first-class continuations, whether delimited or undelimited, whether one-shot or multi-shot. None at all. Nada. Zilch.
To take the example I posted earlier about writing to a full disk: https://lobste.rs/s/az2qlz/epic_treatise_on_error_models_for_systems#c_ss3n1k
Suppose write() discovers that the disk is full (e.g. from an underlying primitive). This causes it to call signal_disk_is_full(). Note that the call to signal_disk_is_full() happens inside the stack of write() (obviously).
Now signal_disk_is_full() looks for a handler and calls it: disk_is_full_handler(). Again, the call to the handler happens inside the stack of signal_disk_is_full() (and write()). The handler can return normally to write() once it has cleaned up space.
write() is never popped off the stack. It always stays on the stack. IOW, there is never a need to capture a continuation, and never a need to reinstate one. The disk_is_full_handler() runs inside the stack of the original call to write().
effect systems
A side note: most effect systems do use and even require first-class continuations, but IMO that’s completely overkill and only needed for rarely used effects like nondeterminism. For simple effects, like resumable exceptions, no continuations are needed whatsoever.
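To spell the mechanism out as code: here is a minimal Rust sketch, with a toy in-memory “disk” and made-up names (HANDLERS, signal, write_all). The only point is that the handler runs on top of the signaller’s frame and nothing is ever unwound:

    use std::cell::{Cell, RefCell};

    thread_local! {
        // Toy model of a disk with (too little) free space.
        static FREE_SPACE: Cell<usize> = Cell::new(4);
        // Dynamically scoped stack of condition handlers; returning
        // true means "condition handled, please resume".
        static HANDLERS: RefCell<Vec<fn(&str) -> bool>> = RefCell::new(Vec::new());
    }

    // Called by low-level code when it hits trouble. The handler runs
    // *inside* the stack of the signalling call; no frame is popped.
    fn signal(condition: &str) -> bool {
        HANDLERS.with(|h| h.borrow().last().map_or(false, |f| f(condition)))
    }

    fn write_all(data: &[u8]) -> Result<(), String> {
        loop {
            if FREE_SPACE.with(|s| s.get()) >= data.len() {
                FREE_SPACE.with(|s| s.set(s.get() - data.len()));
                return Ok(()); // the write finally succeeds
            }
            // Disk full: pause and notify the environment. write_all
            // stays on the stack the whole time.
            if !signal("disk-is-full") {
                return Err("disk is full and no handler could fix it".into());
            }
            // A handler says it freed space; simply retry.
        }
    }

    fn main() {
        // Install a handler that "empties /tmp" to make room.
        HANDLERS.with(|h| {
            h.borrow_mut().push(|_cond| {
                FREE_SPACE.with(|s| s.set(s.get() + 100)); // pretend cleanup
                true
            })
        });
        assert!(write_all(&[0u8; 10]).is_ok()); // succeeds after resumption
    }

Resuming is just the handler returning true and write_all going around its loop again; no continuation is captured anywhere.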
but even if you were to discount the Lisp machines, or all the stuff Xerox and others did with Smalltalk (including today’s Croquet), as somehow not being systems
I provided the working definition of “systems programming language” that I used in the blog post. It’s a narrow one for sure, but I have to put a limit somewhere. My point is not to exclude the work done by smart people; I just need a stopping point somewhere after 100~120 hours of research and writing.
Resumable exceptions do not require first-class continuations, whether delimited or undelimited, whether one-shot or multi-shot. None at all. Nada. Zilch.
Thank you for writing down a detailed explanation with a concrete example. I will update the post with some of the details you shared tomorrow.
You will notice that my comment does not use the phrase “first-class” anywhere; that was deliberate, but perhaps I should’ve been more explicit about it. 😅
As I see it, the notion of a continuation is that of a control operator, which allows one to “continue” a computation from a particular point. So in that sense, it’s a bit difficult for me to understand where exactly you disagree, perhaps you’re working with a different definition of “continuation”? Or perhaps the difference of opinion is because of the focus on first-class continuations specifically?
If I look at Chapter 3 in Advances in Exception Handling Techniques, titled “Condition Handling in the Lisp Language Family” by Kent M. Pitman, it states:
At the time of the Common Lisp design, Scheme did not have an error system, and so its contribution to the dialog on condition systems was not that of contributing an operator or behavior. However, it still did have something to contribute: the useful term continuation […] This metaphor was of tremendous value to me socially in my efforts to gain acceptance of the condition system, because it allowed a convenient, terse explanation of what “restarts” were about in Common Lisp. [..] And so I have often found myself thankful for the availability of a concept so that I could talk about the establishment of named restart points as “taking a continuation, labeling it with a tag, and storing it away on a shelf somewhere for possible later use.”
So it might be the case that the mismatch here is largely due to language usage, or perhaps my understanding of continuations is lacking.
I’m also a little bit confused as to why your current comment (and the linked blog post) focus on unwinding/stack representation. For implementing continuations, there are multiple possible implementation strategies, sure, and depending on the exact restrictions involved, one can potentially use more efficient strategies. If a continuation is second-class in the sense that it must be either immediately invoked or discarded, it makes sense that the existing call stack can be reused.
Regardless of the specifics of whether we can call Common Lisp style conditions and resumption a form of continuations or not, I believe the concern about non-local control flow interacting with type systems and notions of ownership/regions/lifetimes still applies.
As I see it, the notion of a continuation is that of a control operator, which allows one to “continue” a computation from a particular point. … Or perhaps the difference of opinion is because of the focus on first-class continuations specifically?
Typically, there are two notions of continuations:
Continuations as an explanatory or semantic concept. E.g. consider the expression f(x + y). To evaluate this, we first need to compute x + y. At this point our continuation is f(_), where _ is the place into which we will plug the result of x + y. This is the notion of a continuation as “what happens next” or “the rest of the program”.
Continuations as an actually reified value/object in a programming language, i.e. first-class continuations. You can get such a first-class continuation e.g. from Scheme’s call/cc or from delimited control operators. This typically involves copying or otherwise remembering some part of the stack on the part of the language implementation.
Resumable exceptions have no need for first-class continuations (2). Continuations as an explanatory concept (1) of course still apply, but only because they apply to every expression in a program.
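To make notion (1) concrete (a small Rust sketch, nothing language-specific): the continuation f(_) of x + y can always be written by hand as a closure, with no runtime support for capturing anything:

    // Direct style: f(x + y). The continuation of `x + y` is f(_).
    fn f(v: i32) -> i32 {
        v * 10
    }

    // Continuation-passing style: the "rest of the program" is just an
    // explicit closure parameter `k`; nothing is reified from the stack.
    fn add_cps(x: i32, y: i32, k: impl FnOnce(i32) -> i32) -> i32 {
        k(x + y)
    }

    fn main() {
        assert_eq!(f(1 + 2), add_cps(1, 2, f));
    }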
I believe the concern about non-local control flow interacting with type systems and notions of ownership/regions/lifetimes still applies.
The example I used has no non-local control flow at all. write() calls signal_disk_is_full() and that calls the disk_is_full_handler(), and that finally returns normally to write(). This is my point: resumption does not require any non-local control flow.
As well as what @manuel wrote, it’s worth noting that basically every language has second-class continuations: a return statement skips to the current function’s continuation.
Your comment talked about one-shot delimited continuations, which are a kind of first-class continuation in that (per Strachey’s definition of first vs second class) they can be assigned to variables and passed around like other values.
it’s worth noting that basically every language has second-class continuations: a return statement skips to the current function’s continuation.
In most languages, a return statement cannot be passed as an argument to a function call. So is it still reasonable to call it “support for a second-class continuation”?
Your comment talked about one-shot delimited continuations, which are a kind of first-class continuation in that (per Strachey’s definition of first vs second class) they can be assigned to variables and passed around like other values.
I understand your and @manuel’s points that the common usage may very well be that “one-shot delimited continuation” implies “first-class” (TIL, thank you).
We can make this same point about functions where generally functions are assumed to be first class. However, it’s not unheard of to have second-class functions (e.g. Osvald et al.’s Gentrification gone too far? and Brachthäuser et al.’s Effects, Capabilities, and Boxes describe such systems). I was speaking in this more general sense.
As I see it, the “one-shot delimited” aspect is disconnected from the “second class” aspect.
In most languages, a return statement cannot be passed as an argument to a function call. So is it still reasonable to call it “support for a second-class continuation”?
That you can’t pass it as an argument is exactly why it’s called second-class. Only a first-class continuation is reified into a value in the language, and therefore usable as an argument.
As I see it, the “one-shot delimited” aspect is disconnected from the “second class” aspect.
One-shot strongly implies a first-class continuation. Second-class continuations are always one-shot, since, again, you can’t refer to them as values, so how would you invoke one multiple times?
One-shot strongly implies a first-class continuation. Second-class continuations are always one-shot, since, again, you can’t refer to them as values, so how would you invoke one multiple times?
Here is the wording from Strachey’s paper, as linked by @fanf
they always have to appear in person and can never be represented by a variable or expression (except in the case of a formal parameter) [emphasis added]
Isn’t this “except in the case of a formal parameter” exactly what is used by Osvald et al. and Brachthäuser et al. in their papers? Here is the bit from Osvald et al.’s paper:
Our solution is a type system extension that lets us define file as a second-class value, and that ensures that such second-class values will not escape their defining scope. We introduce an annotation @local to mark second-class values, and change the signature of withFile as follows:
    def withFile[U](n: String)(@local fn: (@local File) => U): U
[..] Note that the callback function fn itself is also required to be second-class, so that it can close over other second-class values. This enables, for example, nesting calls to withFile
In the body of withFile, fn is guaranteed to obey several restrictions (it cannot escape, it cannot be assigned to a mutable variable, etc.). But the type system (as in the paper) cannot prevent the implementation of withFile from invoking fn multiple times. That would require an additional restriction – that fn can only be invoked 0-1 times in the body of withFile.
@manuel wrote most of what I was going to (thanks, @manuel!) but I think it’s worth quoting the relevant passage from Strachey’s Fundamental Concepts in Programming Languages
In ALGOL a real number may appear in an expression or be assigned to a variable, and either may appear as an actual parameter in a procedure call. A procedure, on the other hand, may only appear in another procedure call either as the operator (the most common case) or as one of the actual parameters. There are no other expressions involving procedures or whose results are procedures. Thus in a sense procedures in ALGOL are second class citizens—they always have to appear in person and can never be represented by a variable or expression (except in the case of a formal parameter), while we can write (in ALGOL still)
(if x > 1 then a else b) + 6
when a and b are reals, we cannot correctly write
(if x > 1 then sin else cos)(x)
nor can we write a type procedure (ALGOL’s nearest approach to a function) with a result which is itself a procedure.
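For contrast, the very expression Strachey says ALGOL cannot write is legal wherever functions are first-class; in Rust, for instance:

    fn main() {
        let x = 2.0_f64;
        // Functions as first-class values: select one with an expression,
        // then call the result (Strachey's ALGOL example, verbatim).
        let y = (if x > 1.0 { f64::sin } else { f64::cos })(x);
        println!("{y}");
    }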
Regardless of the specifics of whether we can call Common Lisp style conditions and resumption a form of continuations or not, I believe the concern about non-local control flow interacting with type systems and notions of ownership/regions/lifetimes still applies.
That’s a concern, sure, but most “systems” languages have non-local control flow, right? C++ has exceptions, and Rust panics can be caught and handled. It would be very easy to implement a Common Lisp-like condition system with nothing more than thread local storage, function pointers (or closures) and catch/throw.
(And I’m pretty sure you can model exceptions / anything else that unwinds the stack as essentially being a special form of “return”, and handle types, ownership, and lifetimes just the same as you do with the ? operator in Rust)
My point is not about ease of implementation, it’s about usability when considering type safety and memory safety. It’s not sufficient to integrate a type system with other features – the resulting thing needs to be usable…
I’ve added a section at the end, Appendix A8 describing the concrete concerns.
Early Rust did have conditions and resumptions (as Steve pointed out elsewhere in the thread), but they were removed because of usability issues.
If you dig into the code a bit, you discover that SEH on Windows has full support for Lisp-style restartable and resumable exceptions in the lower level, they just aren’t exposed in the C/C++ layer. The same component is used in the NT kernel and so there’s an existence proof that you can support both of these models in systems languages, I just don’t know of anyone who does.
The SEH model is designed to work in systems contexts. Unlike the Itanium model (used everywhere except Windows) it doesn’t require heap allocation. The throwing frame allocates the exception and metadata and then invokes the unwinder. The unwinder then walks the stack and invokes ‘funclets’ for each frame being unwound. A funclet is a function that runs on the top of the stack but with access to another frame’s stack pointer and so can handle all cleanup for that frame without actually doing the unwind. As with the Itanium model, this is a two-stage process, with the first determining what needs to happen on the unwind and the second running cleanup and catch logic.
This model is very flexible because (as with the Lisp and Smalltalk exception models) the stack isn’t destroyed until after the first phase. This means that you can build any kind of policy on top quite easily.
Oh yes, that reminds me, Microsoft’s Annex K broken C library extensions have a runtime constraint handler that is vaguely like a half-arsed Lisp condition.
Yes. However, even the Itanium model supports it: https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html
A two-phase exception-handling model is not strictly necessary to implement C++ language semantics, but it does provide some benefits. For example, the first phase allows an exception-handling mechanism to dismiss an exception before stack unwinding begins, which allows resumptive exception handling (correcting the exceptional condition and resuming execution at the point where it was raised). While C++ does not support resumptive exception handling, other languages do, and the two-phase model allows C++ to coexist with those languages on the stack.
Are you referring to some closed-source code here, or is the implementation source-available/open-source somewhere? I briefly looked at the microsoft/STL repo, and the exception handling machinery seems to be linked to vcruntime, which is closed-source AFAICT.
The SEH model is designed to work in systems contexts [..]
Thanks for the context, I haven’t seen a simple explanation of how SEH works elsewhere, so this is good to know. I have one follow-up question:
it doesn’t require heap allocation. The throwing frame allocates the exception and metadata
So the exception and metadata is statically sized (and hence space for it is already reserved on the throwing frame’s stack frame)? Or can it be dynamically sized (and hence there is a risk of triggering stack overflow when throwing)?
The same component is used in the NT kernel and so there’s an existence proof that you can support both of these models in systems languages, I just don’t know of anyone who does.
As Steve pointed out elsewhere in the thread, Rust pre-1.0 did support conditions and resumptions, but they removed it.
To be clear, I don’t doubt whether you can support it; the question in my mind is whether you can support it in a way that is usable.
Are you referring to some closed-source code here, or is the implementation source-available/open-source somewhere?
I thought I read it in a public repo, but possibly it was a MS internal one.
So the exception and metadata is statically sized (and hence space for it is already reserved on the throwing frame’s stack frame)? Or can it be dynamically sized (and hence there is a risk of triggering stack overflow when throwing)?
The throwing context allocates the exception on the stack. The funclet can then use it in place. If it needs to persist beyond the catch scope, the funclet can copy it elsewhere.
This can lead to stack overflow (which is fun, because stack overflow is, itself, handled as an SEH exception).
You don’t need continuations for resumable errors. https://lobste.rs/s/az2qlz/epic_treatise_on_error_models_for_systems#c_9efawr
Incidentally, Rust had conditions long ago. They were removed because users preferred Result.
Is there any documentation or code examples of how they worked?
https://github.com/rust-lang/rust/issues/9795 Here’s the bug about removing them. There was some documentation in those early releases, I don’t have the time to dig right now.
I’ve only dabbled slightly with both - how is resuming from an error different from catching it? Is it that execution restarts right after the line that threw the error?
Consider the following:
A program wants to write() something to a file, but – oops – the disk is full.
In ordinary languages, this means write() will simply fail, signal an error (via error code or exception or …), and unwind its stack.
In languages with resumable or restartable errors, something entirely different happens: write() doesn’t fail, it simply pauses and notifies its calling environment (i.e. outer, enclosing layers of the stack) that it has encountered a DiskIsFull situation.
In the environment, there may be programmed handlers that know how to deal with such a DiskIsFull situation. For example, a handler may try to empty the /tmp directory if this happens.
Or there may be no such handler, in which case an interactive debugger is invoked and presented to the human user. The user may know how to make space, such as by deleting some files that are no longer needed.
Once a handler or the user has addressed the DiskIsFull situation, it can tell write() to try writing again. Remember, write() hasn’t failed, it is still paused on the stack.
Well, now that space is available, write() succeeds, and the rest of the program continues as if nothing had happened.
Only if there is no handler that knows how to deal with DiskIsFull situations, or if the user is not available to handle the situation interactively, would write() fail conclusively.
Is it that execution restarts right after the line that threw the error?
Yes. Common Lisp and Smalltalk use condition systems, where the handler gets executed before unwinding.
So unwinding is just one possible option (one possible restart); other common ones are to start a debugger, to just resume, to resume with a value (useful to provide e.g. default values, or replacements for invalid values), etc… the signalling site can provide any number of restarts for the condition it signals.
It’s pretty cool in that it’s a lot more flexible, although because it’s adjacent to dynamic scoping it can make the program’s control flow much harder to grasp if you start using complex restarts or abusing conditions.
Exactly. For example “call with current continuation” or call-cc allows you to optionally continue progress immediately after the throw. It’s a generalization of the callback/continuation style used in async-await systems.
(There’s also hurl, which I think was intended as an esolang but stumbled upon something deep (yet already known): https://ntietz.com/blog/introducing-hurl/)
You don’t need continuations to implement resumable errors. The trick is simply to not unwind the stack when an error happens. I wrote an article about how it works a while ago: http://axisofeval.blogspot.com/2011/04/whats-condition-system-and-why-do-you.html
Even if you want to do stack unwinding, you don’t need continuations. Catch and throw are adequate operations to implement restarts that unwind the stack to some point first.
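As a rough Rust sketch of that point, with catch_unwind standing in for catch and panic_any for throw (the name with_abort_restart is made up):

    use std::panic::{self, AssertUnwindSafe};

    // Payload "thrown" to the enclosing catch point.
    struct Abort;

    // Establish a restart: if the body throws Abort, unwinding stops
    // here and the caller learns that the restart was taken.
    fn with_abort_restart<T>(body: impl FnOnce() -> T) -> Option<T> {
        match panic::catch_unwind(AssertUnwindSafe(body)) {
            Ok(v) => Some(v),
            Err(p) if p.is::<Abort>() => None, // our restart was taken
            Err(p) => panic::resume_unwind(p), // someone else's panic
        }
    }

    fn main() {
        // (The default panic hook still prints a message; a real
        // implementation would install a quieter hook.)
        let r = with_abort_restart(|| {
            // Deep inside a computation, a handler decides to bail out
            // to the enclosing restart point.
            panic::panic_any(Abort);
        });
        assert!(r.is_none());
    }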
Ah thanks - that’s very informative.
Long post, many thoughts, but I don’t feel like doing a lot of editing, so apologies in advance for unfiltered feedback! I don’t mean the tone I will use here :-)
The start of the article is 🤌, but it sort of doesn’t live up to my expectations. I feel this is mostly about extensible, forward-compatible enums, which is quite neat (I didn’t realize that “I want to add a field to all these enum variants” admits such an elegant solution), but I don’t think it solves my problems with error handling facilities in languages.
Basically, this post feels like it attacks the “make complicated things possible” part of the problem, and, sure, if you add across and along non-exhaustiveness, !__ for warnings, auto-delegation to turn an enum into a struct, and a capability system to track panics, you can solve everything.
But the problem I see with error handling is that we don’t know how to make simple things easy. That’s especially true in Rust, of course, but it seems that in every language a simple way to go about error handling leads to some pretty significant drawbacks, and the money question is not how can we add extra knobs to handle all of the requirements, but whether there’s some simple idea that kinda solves stuff?
Like, for example, the sled problem — we want every function to be precise about its specific error conditions, but, in practice, the stable equilibrium is one error type per library. Everr (fabulous name by the way, great job) suggests:
The Everr language server has a code action for defining error types for a given function based on the error types of the functions that are called. It also can intelligently suggest modifications to error cases as the function body evolves over time, taking contextual rules such as access control into consideration.
But this sucks! With a nominal type system, having to name every function, and every function’s error type, is very much not easy, and even if you add a bunch of tooling support, the result would still not be easy.
Another simple-things-are-hard problem in error handling is exposing details. If you write a library A, and it uses a library B, and B is an implementation detail, then a common API design pitfall is to leak B through your error types (either directly, by including B variants, or indirectly, by allowing downcasting to B). The problem here isn’t that it’s impossible to either expose or hide B properly. There’s a bunch of techniques available for that (and I believe that Everr makes them nicer and more powerful). The problem is that you need to decide what to do, and that is hard. You need a pretty high level of discipline and experience to even notice that this is a problem.
Or another common pitfall of type-based error types: aggregating distinct failure conditions under one type is often a bug, because the actual picture is more fine-grained. That is, the fact that you can aggregate errors based on types doesn’t mean that you should.
I have no idea how to handle errors in general! I just don’t have bulletproof recipes; every time it is “well, let’s look at your specific situation, shall we?”. Super annoying!
I don’t think that a lack of language mechanisms is my problem. What I lack is a gang-of-four book for patterns of error management (I mean, I have such a book in my head obviously, and I consult it often when writing code, but I can’t condense it to a single paragraph to put into a project’s style guide and call it a day).
Assorted smaller thoughts:
For an article about systems programming languages, it is surprising that no space is dedicated to the ABI. How exactly you raise and catch errors, in terms of which bytes go into which register, I feel is an unsolved problem. Returning values is allegedly slow. Unwinding is, counterintuitively, faster (see Duffy’s post & the recent talk on C++ exceptions in embedded (has anyone reproduced that result in particular?)). To avoid potential ambiguity: Rust-style error handling and Java-style exceptions differ on two orthogonal axes:
whether you syntactically annotate expressions that throw (type system & syntax stuff)
whether throwing happens by stack unwinding (by the way, what is the best one-page explainer of how unwinding actually works? I am embarrassed to admit that unwinding is magic pixie dust for me, and I have no idea how landing pads work), or by “normal” return.
I am strictly speaking about the second one.
And then there’s Zig, and then there’s that paper by Sutter from several years ago which says that “actually, you do want to return an integer” to be fast.
heap exhaustion
Heap exhaustion is not the central example of an OutOfMemory error. The central example is someone passing you a malformed gif image whose declared size is 67PiB. That’s the sort of thing that you need to be robust to: a single rogue allocation due to a bug/malformed input.
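This failure mode is also where fallible allocation helps even without threading allocators everywhere; a Rust sketch (the limit and names are made up):

    // Reject an untrusted declared size up front, and report allocation
    // failure as a value instead of aborting the process.
    const MAX_IMAGE_BYTES: usize = 1 << 30; // 1 GiB policy limit

    fn alloc_pixels(declared_size: usize) -> Result<Vec<u8>, String> {
        if declared_size > MAX_IMAGE_BYTES {
            return Err(format!("declared size {declared_size} is absurd"));
        }
        let mut buf = Vec::new();
        // Vec::try_reserve_exact (stable since Rust 1.57) returns an Err
        // on allocation failure instead of aborting.
        buf.try_reserve_exact(declared_size)
            .map_err(|e| format!("allocation failed: {e}"))?;
        buf.resize(declared_size, 0);
        Ok(buf)
    }

    fn main() {
        // The absurd declared size from above, rejected before any
        // allocation is attempted.
        assert!(alloc_pixels(usize::MAX / 2).is_err());
    }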
It would also be interesting to know what sub-fraction of that group has tests for the out-of-memory error handling code path, and how good that test coverage is.
No such data here, but, anecdotally, eyeballing Zig code that has both an allocator parameter, try, and defer/errdefer usually tends to reveal errors.
Zig and Odin are different from other languages here; allocators are passed down ~everywhere as parameters
Such discipline is possible in C++, Rust and other languages to varying extents, but is less common. Rust has an unstable allocator_api feature, where the discussion originally started years ago.
The Rust allocator API is very much not what Zig is doing. https://ziglang.org/download/0.14.0/release-notes.html#Embracing-Unmanaged-Style-Containers is not at all that.
A lint that prevents error values from being discarded using standard shorthands (e.g. _ = ), without an explicit annotation, such as a comment or a call to an earmarked function (to allow for ‘Find references’) etc.
I used to think that what Rust does, with must_use, is the right thing, and was hesitant about Swift’s approach of requiring everything to be used. After using Zig, I am sold though: no need to overthink this, a non-void function whose result is unused and not explicitly discarded with _ = should be a compilation error. The amount of false positives is vanishingly small.
“actually, you do want to return an integer” to be fast.
https://mcyoung.xyz/2024/04/17/calling-convention/ had an interesting idea. Change the ABI so that the error/success payloads of Result are passed as out parameters, and then just return the Ok/Err tag. That seems like it allows the best of both worlds - effectively automating the common pattern used in Zig and making it type-safe.
Or, as @david_chisnall suggested, use the carry flag for ok/err: https://lobste.rs/s/mzqv3t/why_i_prefer_exceptions_error_values#c_br9wzy
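A sketch of that lowering in Rust, with made-up names (parse, parse_lowered); only the tag travels in the return value (a register, or the carry flag), and each payload goes through an out-pointer:

    use std::mem::MaybeUninit;
    use std::ptr;

    pub struct Big {
        pub bytes: [u8; 64],
    }

    pub struct ParseError {
        pub code: u32,
    }

    // Source-level function (toy body).
    fn parse(input: &[u8]) -> Result<Big, ParseError> {
        if input.len() <= 64 {
            let mut bytes = [0u8; 64];
            bytes[..input.len()].copy_from_slice(input);
            Ok(Big { bytes })
        } else {
            Err(ParseError { code: 1 })
        }
    }

    // Hypothetical lowered form: the payloads move to out-parameters,
    // and only the Ok/Err tag is returned.
    //
    // Safety: exactly one of ok_out/err_out is written, matching the tag.
    unsafe fn parse_lowered(
        input: &[u8],
        ok_out: *mut Big,
        err_out: *mut ParseError,
    ) -> bool {
        match parse(input) {
            Ok(v) => {
                ptr::write(ok_out, v);
                true
            }
            Err(e) => {
                ptr::write(err_out, e);
                false
            }
        }
    }

    fn main() {
        let mut ok = MaybeUninit::<Big>::uninit();
        let mut err = MaybeUninit::<ParseError>::uninit();
        let is_ok = unsafe { parse_lowered(b"hello", ok.as_mut_ptr(), err.as_mut_ptr()) };
        assert!(is_ok);
    }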
Returning values is allegedly slow. Unwinding is, counterintuitively, faster
I think this is true in the common case where an error did not occur. Returning error information adds overhead to both the caller and callee, whereas catch/throw has the famous “zero overhead.” On the other hand, when an error does occur, unwinding the stack is significantly slower because a bunch of compiler-generated metadata has to be looked up and processed for each active stack frame.
the recent talk on C++ exceptions in embedded
The talk I watched (I don’t remember who gave it) was primarily about code size, not performance. The common wisdom being that using C++ exceptions bloats your code with all those compiler-generated tables and extra code for running destructors during unwinding.
Hey Alex, thanks for taking the time to read and share your thoughts. I always appreciate reading your blog posts, so thank you for the specific feedback on this post.
That’s especially true in Rust, of course, but it seems that in every language a simple way to go about error handling leads to some pretty significant drawbacks, and the money question is not how can we add extra knobs to handle all of the requirements, but whether there’s some simple idea that kinda solves stuff?
It would be helpful to have an operational definition of “simple” here with one or two examples before I attempt to answer this. 😅
For example, if there is a guideline that by default, an error should not expose structure, and just expose an interface like:
trait RichDebug: Debug {
    type Kind: Debug;
    fn kind(&self) -> Self::Kind;
    // similar to fmt::Formatter, but creates something like a JSON object instead of a string
    fn metadata(&self, meta: &mut debug::Metadata<'_>);
}
and the implementation for this were to be derived using a macro (or comptime machinery in Zig), would that be considered “simple”?
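For instance, the generated implementation for one concrete error type might look like the following compiling sketch (debug::Metadata is stubbed with a toy key/value collector, and the trait is repeated so the snippet stands alone):

    use std::fmt::Debug;

    mod debug {
        // Toy stand-in for the hypothetical debug::Metadata above: it
        // collects key/value pairs instead of building a JSON object.
        pub struct Metadata<'a> {
            pub fields: &'a mut Vec<(String, String)>,
        }

        impl Metadata<'_> {
            pub fn field(&mut self, name: &str, value: &dyn std::fmt::Debug) {
                self.fields.push((name.to_string(), format!("{value:?}")));
            }
        }
    }

    trait RichDebug: Debug {
        type Kind: Debug;
        fn kind(&self) -> Self::Kind;
        fn metadata(&self, meta: &mut debug::Metadata<'_>);
    }

    // What a derive macro (or Zig comptime) might generate for one
    // concrete error type:
    #[derive(Debug)]
    struct DiskFull {
        bytes_needed: u64,
    }

    impl RichDebug for DiskFull {
        type Kind = &'static str;
        fn kind(&self) -> Self::Kind {
            "DiskFull"
        }
        fn metadata(&self, meta: &mut debug::Metadata<'_>) {
            meta.field("bytes_needed", &self.bytes_needed);
        }
    }

    fn main() {
        let err = DiskFull { bytes_needed: 42 };
        let mut fields = Vec::new();
        err.metadata(&mut debug::Metadata { fields: &mut fields });
        assert_eq!(err.kind(), "DiskFull");
        assert_eq!(fields[0], ("bytes_needed".to_string(), "42".to_string()));
    }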
Like, for example, the sled problem
Thanks for linking that blog post, I hadn’t read it earlier. This point stands out to me in particular:
inside the sled codebase, internal systems were [..] relying on the same Error enum to signal success, expected failure, or fatal failure. It made the codebase a nightmare to work with. Dozens and dozens of bugs happened over years of development where the underlying issue boiled down to either accidentally using the try ? operator somewhere that a local error should have been handled, or by performing a partial pattern match that included an over-optimistic wildcard match.
This goes directly against the Rust conventions RFC, which recommends using panics for “catastrophic errors”. I’ve seen a similar tendency in Go codebases, where people will put every kind of error under error, even if it’s technically a serious invariant violation (like a bounds check failure, which does trigger a panic!).
Based on Duffy’s writing on Midori, it feels like a Midori programmer would probably be more likely to use “abandonment” (panic) than a Rust/Go programmer in this kind of situation, given the built-in Erlang-style fault-tolerant architecture.
we want every function to be precise about its specific error conditions, but, in practice, the stable equilibrium is one error type per library
Right, so with Everr’s type system, you could write your code as returning only MyLibraryError, and then the language server can refactor the functions which need specific error conditions to instead return MyLibraryError:union+[.Case1 | .Case2].
The central example is someone passing you a malformed gif image whose declared size is 67PiB. That’s the sort of thing that you need to be robust to: a single rogue allocation due to a bug/malformed input.
This is a fair criticism. In that section, I originally intended to describe a system for regions/different sub-heaps based on some of the research on Verona (and in that context, “heap exhaustion” would mean “this sub-heap is exhausted”, not “the process’s heap usage is too high for the running system”), but then I punted on that because I didn’t feel confident in Verona’s system, so I moved that description to the appendix.
I will update this.
and was hesitant about Swift’s approach of requiring everything to be used. After using Zig, I am sold though
I personally prefer Swift’s approach of warning instead of a hard error, given that iterating on code becomes more fiddly if you need to keep putting/removing _ (speaking from first-hand experience with Go).
However, the point you’ve quoted here is talking about something slightly different. It’s saying that using the same shorthand for discarding ordinary and discarding errors is itself error-prone. Discarding errors should require noisier syntax (in lint form), because an error being discarded is likely to carry higher risk than a success value being discarded.
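One cheap approximation of the “earmarked function” idea in today’s Rust, as a sketch (discard_error is a made-up name, not a std API):

    use std::fmt::Debug;

    // Deliberate, searchable error discarding: "Find references" on
    // discard_error lists every place an error is dropped on purpose,
    // unlike an anonymous `_ =`.
    fn discard_error<E: Debug>(err: E, reason: &str) {
        // A real version might log at debug level here.
        let _ = (err, reason);
    }

    fn main() {
        let res: Result<(), std::io::Error> = Ok(());
        if let Err(e) = res {
            discard_error(e, "best-effort cleanup; failure is acceptable");
        }
    }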
I have such a book in my head obviously [..]
Perhaps a good idea for an IronBeetle episode? I’m slowly working my way through the list; maybe you’ve already covered this in one of them. 😄
For an article about systems programming languages, it is surprising that no space is dedicated to the ABI. How exactly you raise and catch errors, in terms of which bytes go into which register, I feel is an unsolved problem
I omitted this because:
Since this point is primarily about performance, it doesn’t make sense for me to speculate about designs without having concrete measurements. Performing any kind of realistic measurement would likely be a fair amount of work.
I didn’t realize it was an “unsolved problem”; rather, my working assumption was that “for ABI, the small number of people working on it pretty much know what all their options are, so it doesn’t make sense for me to tell them that”. For example, if you only care about 64-bit machines, perhaps you’re fine with burning a register on errors specifically (like Swift). For larger errors, you could reuse the same register as an out-parameter (as described in mcyoung’s post linked by Jamie).
Good idea! I’ll do an error management episode this Thursday then!
And yet no mention of algebraic effects!
There is a ton of material here, and while I’m not finished, it seems very detailed and well thought out. It cites Joe Duffy’s post from a decade or so ago–if you’re a fan of that post, you’ll probably enjoy this one.
To highlight one point: this article very clearly lays out the case that error handling is contextual, and whether a particular error is recoverable or not is almost never based on the error itself, but on the full context the error happens in. For instance, it gives examples where (at least in principle) reading past the bounds of an array can be a recoverable error.
I don’t think the contextual nature of error handling is new to this piece (I’ve been thinking about it for years, and I suspect I didn’t come up with it on my own), but it’s very well stated.
this is stolen apenwarr valor!
(edit to say: I am absolutely joking.)