This is a curious misunderstanding of what “let it crash” means.
“Let it crash” is the observation that if something unexpected, maybe spurious, happens in a distributed, ever-changing system, it’s a good approach to crash the unit of work in a structured way and have the supervisor of that unit figure out which strategy to apply. E.g. if you are connecting to a socket, doing some work and then closing - and the connection doesn’t work - the task should crash. The supervisor could then retry 3 times and then propagate the error. It’s a philosophy for error handling and structuring errors in a concurrent system, at a far higher level than the example in the code.
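A minimal sketch of that retry-then-propagate strategy, in Rust for concreteness (the `supervise` helper and `TaskFailed` error are invented for illustration; this is not an OTP API):

```rust
// Hypothetical error type for a failed unit of work.
#[derive(Debug, PartialEq)]
struct TaskFailed(&'static str);

// A toy "supervisor": run the task, retry up to `max_retries` more times,
// then propagate the last error upward if it never succeeds.
fn supervise<F>(mut task: F, max_retries: u32) -> Result<(), TaskFailed>
where
    F: FnMut() -> Result<(), TaskFailed>,
{
    let mut last_err = TaskFailed("never ran");
    for _ in 0..=max_retries {
        match task() {
            Ok(()) => return Ok(()),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}

fn main() {
    // A task that fails twice (e.g. a spurious connection error), then succeeds.
    let mut attempts = 0;
    let result = supervise(
        || {
            attempts += 1;
            if attempts < 3 {
                Err(TaskFailed("connection refused"))
            } else {
                Ok(())
            }
        },
        3,
    );
    assert_eq!(result, Ok(()));
    println!("succeeded after {attempts} attempts");
}
```

The point is structural: the task itself just crashes (returns the error), and the decision to retry or give up lives one level up, in the supervisor.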
That approach is heavily inspired by how init systems do things. Say you run a cleanup task using your init system - it fails once because of a spurious connection error. The init system tries it 5 more times. If it fails 5 times, it marks the task faulty. Same philosophy: your init system is a supervision system.
“Let it crash” does not mean “your code can be faulty”; it encourages crashing as a strategy for handling spurious, hard-to-predict errors.
This is a curious misunderstanding of what “let it crash” means.
Their understanding of “let it crash” is not your understanding of “let it crash”, and–regardless of the degree to which I prefer your explanation–classifying it as a misunderstanding when I believe they have more expertise in the Erlang ecosystem (7 years at least, judging solely from this blog) and idioms than you do is somewhat presumptuous.
More pithily: please don’t rustsplain to the erlangutans. :P
There are things I disagree with in the article–for example, I think that tests are helpful for refactoring and for things like numerical routines…here the author and I part ways. That said, perhaps the misunderstanding here is that you and the author view bugs differently: the author says that “let it crash” covers all defects and failure conditions (including ones introduced by programmer negligence, bad luck, or lack of foresight) while you say it covers only spurious errors.
The additional cultural context that’s helpful here is that BEAM systems are (much to the annoyance of devops folks trained in the way of the times) handled as somewhat living critters. It is not uncommon (hell, I did this last month during peak traffic with my team) to shell into a running system, handling production traffic, and debug and tweak and patch running code. There is a different mindset about this sort of thing than I believe we have when working with bake-and-deploy-and-scrap languages and runtimes.
I agree with your interpretation of “let it crash”, but having also suffered through a lot of oddly-engineered Elixir (which is similar enough to Erlang in the relevant ways I feel comfortable saying so) that has forgotten or misunderstood “let it crash” I do have sympathy for and empathy with the author.
This article, and these comments, seem to be conflating multiple notions of crashing. I guess crashing is usually understood as something that happens at or above the level of an OS process. That is, a function call that doesn’t return successfully isn’t usually described as having crashed. Or, an autonomous task within a process that hits a terminal error isn’t typically said to have crashed. A process crashes. A system crashes.
Erlang defines a system model in which a supervisor (BEAM/OTP) manages a constellation of redundant entities (actors) which each serve isolated workloads (typically requests). When Erlang says “crash” it means something quite different, and far less impactful, than when Rust or Linux says “crash”. And Erlang’s definition is niche.
Maybe this is a distinction that everyone here fully understands and I’m harping on a pointless point, could be. But it’s definitely misunderstood as a general concept, and IMO shouldn’t be advocated-for without serious qualifications.
To summarise as fairly as I can: the author sees people trying to write correct code via unit testing. They are puzzled by this because Erlang isn’t supposed to be correct; it’s supposed to have bugs and then do some work to isolate those bugs - to be able to crash while the whole system keeps mostly running in the face of that crash.
But like… wouldn’t you like to do both? Not have the bugs, and also safely crash when you have the bugs? I can’t tell if this is meant to be ironic or not.
The goal for the ‘let it crash’ mindset isn’t to ship buggy code, it’s to acknowledge that even formally verified code is not bug free (your specification can never be complete) and your system should aim to be correct in the presence of failures. Testing is important in Erlang, but testing should be of the form of killing processes at surprising times and ensuring that the system recovers and does not lose data. This gives you a far more reliable system than a quest to eliminate bugs via unit tests. You have a limited budget for testing, you should spend it where it will deliver the most impact.
Randomly crashing processes isn’t going to help me if I’m testing my deflate implementation or whatever. Regular unit tests are still very useful in Erlang for checking that functions do what they ought to.
I’m familiar with Erlang and with let-it-crash, but that’s not the claim the article is making. Here it is blow-by-blow:
I see people writing a lot about unit testing in Erlang
Do we really need all that?
Erlang has some testing stuff
But there are cases that it can’t catch, and neither would a type system*
Erlang’s history includes environments that needed very high uptimes
To do that, Erlang has a way to continue in the face of crashes
If you write “clear code with modestly-sized routines” you don’t need tests
Some specific classes of bugs related to mutability aren’t possible
Erlang has supervisors that can restart and log buggy code
∴ “some Erlangutans [are] missing the point of using Erlang”
The only substantive thing there that could possibly mitigate the need for individual functions to be correct is the “clear code with modestly-sized routines”. Okay, yeah, if you never write wrong code then you’ll never have wrong code, that’s not unique to Erlang. But nothing here obviates the need for correctness. Let-it-crash allows your system that does 10 things to keep doing 9 of them while it’s broken. But it doesn’t make the other one not broken. It doesn’t make that function correct. I’m not even a TDD weenie myself but the idea that let-it-crash is in any way related to correctness, whatever form that might take, strikes me as absurd.
Let-it-crash is very situational. It’s good to allow a complex system to limp along in the face of temporary external problems or problems only involving parts of the system. But “the point” of Erlang isn’t that it absolves you from writing correct code with a magical “On Error Resume Next” that still builds your widgets even if you close the exit hatch before you eject them from the chute. Let-it-crash lets one of your two widget-builders keep running after one trips the breaker, which is great and is actually the “point of using Erlang”. But if you don’t fix that bug you’ll still never build a single widget.
*: that particular case can be caught by some type systems like affine or linear types. I don’t claim that Erlang would be improved by those, just some fun trivia
How can you be correct with unforeseen failures? Seems like the general opinion around this is if an unplanned error happens, almost all bets are off and you can’t trust anything later to be correct so you should just crash and get a person to review the problem.
How should we distinguish between foreseeable and unforeseeable failures? If I write a program that tries to read a file whose path is provided by user input, is it not foreseeable that this file may not exist? Under what conditions is it appropriate for the entire program to crash in response to this error? Under what conditions is it my responsibility as a programmer to explicitly deal with this error? What about errors during file reads caused by device errors? Or errors during file descriptor close or flush or sync syscalls? Or errors when connecting to a remote API? Or during network transfers? Or when communicating with a database?
If you want to define crashworthy errors as whatever the programmer says are crashworthy errors, okay! But then what does crashing mean? If you’re serving a request in an Erlang actor and you hit a syscall error then it’s pretty OK to “crash” that actor because actors are request-scoped and the crash just terminates that single request. But if you’re serving a request in a concurrent e.g. Rust program and that request hits a syscall error then you definitely shouldn’t “crash” the entire process, right? You can terminate the request, sure, but that’s not really what anybody understands by “crash” terminology.
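In Rust terms, “terminating the request” rather than crashing the process might look like this sketch (the handler and the error strings are invented for illustration):

```rust
// Hypothetical request handler: a syscall-ish failure terminates only
// this request; the serving loop (the "process") keeps running.
fn handle_request(input: &str) -> Result<String, String> {
    if input.is_empty() {
        // Stand-in for a syscall error hit while serving this request.
        return Err("read failed: connection reset".to_string());
    }
    Ok(format!("echo: {input}"))
}

fn main() {
    let requests = ["hello", "", "world"];
    let mut served = 0;
    for req in requests {
        match handle_request(req) {
            // The failed request is reported and dropped; we do not
            // take the whole process down over it.
            Err(e) => eprintln!("request failed: {e}"),
            Ok(resp) => {
                println!("{resp}");
                served += 1;
            }
        }
    }
    assert_eq!(served, 2); // the process survived the bad request
}
```

Which is exactly the point: nobody would call the `Err` branch a “crash” here, while the equivalent per-request termination in Erlang is routinely called one.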
At my (former) job, I used Lua to process incoming SIP messages. It uses Lua coroutines [1] to handle each transaction. If some unexpected thing happens (like a nil reference missed in testing due to unexpected input for instance) only that coroutine “crashes”—that is, ceases to run. This is caught by the framework I’m using and logged. The “crash” does not affect the rest of the program. In my experience, most of the crashes were due to mishandling of input, or rather, a miscommunication about what, exactly, we could expect from the Oligarchic Cell Phone Company that was our customer [2]. We don’t expect crashes, but sometimes we do overlook some conditions (as I’m fond of saying, “bugs are caused by an inattention to detail.”).
While Lua isn’t Erlang, I think the overall effect is the same—a crashed thread does not impinge on the rest of the program. Perhaps a better distinction is “let it soft crash [3].”
[1] Think threads, but not pthreads threads - Lua-specific ones, cooperatively multitasked.
[2] Our service is only for the North American Numbering Plan. We did not expect to receive international phone numbers at all (and weren’t for the non-SIP legacy interface).
[3] Where a “soft crash” just stops the processing and a diagnostic log is issued.
Lua coroutine . . . [crashes are] caught by the framework I’m using
Great! Good! My point here is not to assert an absolute, or to deny your specific experience in any way. My point here is to say that the general notion of “crashing” usually does not describe the act of terminating entities under the management of, e.g., an Erlang OTP or your Lua framework, but instead usually describes the act of terminating entire OS processes.
a crashed thread
I’m trying to communicate that most people do not generally understand threads as things that can “crash”. Threads die or terminate, processes crash. The difference in terminology is my entire point.
I have a Lua script. I try running it, Lua spits out “attempt to index a nil value (global ‘x’)” and ends the program. That is, if I understand your nomenclature, a “crash.” Yet if I wrap that same code in a coroutine, the coroutine fails to finish, yet the script continues. What did the coroutine do? Just terminate? I personally consider it a “crash.”
And what does it mean for a “process” to crash? On Unix, the system continues running. Yet on MS-DOS, such a “crash” will usually “crash” the computer. Yet both are crashes. Why can a “process” running under a protected domain (Unix) crash, and yet, threads cannot? I think the participants here are using a broader definition of “crash” than you personally like.
I have a Lua script. I try running it, Lua spits out “attempt to index a nil value (global ‘x’)” and ends the program. That is, if I understand your nomenclature, a “crash.” Yet if I wrap that same code in a coroutine, the coroutine fails to finish, yet the script continues.
If you can write a bit of code that runs (more or less) equivalently as its own OS process (scenario 1) or, without modification, as a coroutine among other coroutines in a single shared OS process (scenario 2), then whatever is orchestrating coroutines in the second scenario is necessarily enforcing isolation boundaries that are functionally equivalent to the OS process boundary in the first scenario.
What did the coroutine do? Just terminate? I personally consider it a “crash.”
If that code fails in scenario 1, is that a crash? IMO yes.
If that code fails in scenario 2, is that a crash? IMO no.
Why can a “process” running under a protected domain (Unix) crash, and yet, threads cannot? I think the participants here are using a broader definition of “crash” than you personally like.
My understanding of “crash”, or yours, isn’t important to me, what I care about is what the average developer understands when they see that term, absent any other qualifications. I don’t think most developers consider a thread termination to be a crash, and I think if you want to say “crash” to mean something like that you need to qualify it. That’s my only point. Could be wrong.
It’s worth pointing out that in the Erlang world, it’s not the entire program that crashes. Like, ever. Instead, whatever individual process that ran into the error is the piece that fails. In the context of a program loading a file, it’s likely that just the process responsible for loading the file would die. Which makes perfect sense, since it’s no longer a meaningful operation to continue loading a file that we can’t open.
The elegance of the “let it crash” mentality is that it lets you handle foreseeable, unforeseeable, and foreseeable-but-too-unlikely-to-care-about failures in the exact same way. Like sure, you could check to make sure a file exists, but what if it’s corrupted? Or someone takes a sledgehammer to the hard drive while you’re in the midst of reading the file? There’s an infinite number of failure states for a given operation, and for a lot of them there isn’t a meaningful resolution within the operation itself.
It’s well-understood that “crashing” in Erlang doesn’t mean crashing the entire program. The problem is that “crashing” in basically any context except Erlang does mean crashing the entire program.
There’s some truth to this: ‘let it crash’ does in general decrease the cost of a programming error, and it does shift the cost/benefit curve of writing tests.
As a practical example of this, when someone writes a new assist/fixit for rust-analyzer, the bar for testing is really low: just a couple of “happy path” tests. Assists are often contributed by non-core contributors and are a fiddly thing to write, so they often contain bugs and trigger panics at runtime. This isn’t a problem though, as, by design of the whole system, buggy assists can’t corrupt any data, and the faults in them are isolated (imperfectly, as Rust lacks true let-it-crash infrastructure). This is an appropriate approach, because it allows us to deliver an 80% feature to the user more quickly, and it also encourages contribution.
That being said, I don’t think that “decreasing programming errors” is the primary benefit of tests. My response here is the same as to some lispers’ claim that “the REPL is a substitute for tests”. The bulk of the value of tests in typical programs comes from preserving a program’s behavior over time, as the code of the program itself changes. It’s not “is my code correct”, it’s “does this change accidentally break something else”. For this, let it crash doesn’t really help.
“Let it crash” is recognized as being perfectly acceptable in any safe language, because the safe option to take in the face of something unexpected happening is to stop. This is not unique to Erlang. Rust does this. Haskell does this. Hell, this is the approach WebKit takes, and that’s C++. Sure, the environment Erlang operates in and its common usage mean that you idiomatically write code knowing that it is possible for errors to occur unexpectedly, and you make sure that the system as a whole isn’t taken out by a failure in one component, but “things can crash, deal with it” is far from unique.
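For instance, in Rust a panic in a spawned thread is contained to that thread, and `join()` surfaces it as an `Err` - a small illustration:

```rust
use std::thread;

fn main() {
    // A panic in a spawned thread "crashes" only that component;
    // join() reports the panic as an Err, and the rest of the
    // program carries on.
    let handle = thread::spawn(|| {
        panic!("unexpected state, bailing out");
    });
    let result = handle.join();
    assert!(result.is_err()); // the thread crashed...
    println!("main thread is still running"); // ...but we were not taken down
}
```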
No part of “let it crash” means “don’t worry about writing correct code, we’ll find it in production”.
But let’s be honest, the core concept of the post - “requiring tests is an attack on ‘let it crash’” - is nonsense. There is a world of difference between “the code is wrong” and “the code crashes”, and pretending that testing is solely about the latter is bizarre.
Finally, the example:
maybe_write() ->
    {ok, U} = file:open("/path/to/file.txt", [write]),
    file:close(U),
    ok = file:write(U, <<"foo">>),
    ok.
You are correct that no amount of runtime analysis would detect this error - which is why you use your type system to ensure that this code would not even compile. Though I am loath to do it, let’s imagine trying in Rust - rather than in any horrific C++ horrorscape using forced moves :D
use std::fs::File;
use std::io::Write;

fn main() -> Result<(), std::io::Error> {
    let mut file = File::create("/path/to/file.txt")?;
    drop(file); // the closest option to explicitly closing a file
    write!(file, "wat")?; // error: use of moved value `file`
    Ok(())
}
this does not compile, because closing/“dropping” the file consumes ownership and so any subsequent usage is an error. Obviously, idiomatic Rust that wanted to force earlier closing would probably just create a scope, which would make it more overtly obvious that what was being attempted isn’t possible.
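For completeness, the working scope-based version might look roughly like this (the temp-file path is just for illustration):

```rust
use std::fs::File;
use std::io::{Read, Write};

fn main() -> Result<(), std::io::Error> {
    let path = std::env::temp_dir().join("let_it_crash_demo.txt");

    // Idiomatic early close: let the end of the scope drop (and close)
    // the file. Any later write through `file` is impossible to express.
    {
        let mut file = File::create(&path)?;
        write!(file, "foo")?;
    } // `file` dropped here; the OS handle is closed.

    // write!(file, "bar")?; // would not compile: `file` is out of scope

    let mut contents = String::new();
    File::open(&path)?.read_to_string(&mut contents)?;
    assert_eq!(contents, "foo");
    Ok(())
}
```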
Let it crash means the BEAM system tolerates program failure and is unlikely to take the whole runtime down. It does not guarantee the program will be usable when it’s constantly crashing. I’m failing to understand the point they’re trying to make. Do they advocate fixing bugs on the go, watching crashes? That doesn’t work if the program has users or interfaces with programs that have users. My unsolicited advice is: stop looking for excuses and test your programs. You will gain confidence applying patches. Then crash whenever there’s really nothing left to do or you want to make a firm assertion preventing the execution from proceeding. And test that too.
But with all that, as Joe Armstrong once jeered, no amount of type checking would catch the following bogus code:
It is worth noting that the stated problem, of accessing a file after it is closed, was addressed in Joe Armstrong’s PhD thesis, in which he suggested a solution for exactly this problem by introducing a new testing & development methodology he named “protocol checkers” (9.1 Protocols, page 195). Relevant quote from the thesis:
Given a protocol which is specified in a manner similar to the above it is possible to write a simple “protocol checking” program which can be placed between any pair of processes.
The protocol is a state machine, which would detect the attempt to write to a file after it was closed.
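As a rough illustration (in Rust rather than Erlang, with invented names - this is in the spirit of Armstrong’s idea, not his actual checker), such a checker is a tiny state machine interposed between the caller and the file process:

```rust
// A toy runtime "protocol checker": it tracks the protocol state and
// rejects out-of-protocol messages such as write-after-close.
#[derive(Debug, PartialEq)]
enum FileState {
    Open,
    Closed,
}

struct ProtocolChecker {
    state: FileState,
}

impl ProtocolChecker {
    fn new() -> Self {
        ProtocolChecker { state: FileState::Open }
    }

    fn write(&mut self, _data: &[u8]) -> Result<(), &'static str> {
        match self.state {
            FileState::Open => Ok(()), // would forward to the real file process
            FileState::Closed => Err("protocol violation: write after close"),
        }
    }

    fn close(&mut self) -> Result<(), &'static str> {
        match self.state {
            FileState::Open => {
                self.state = FileState::Closed;
                Ok(())
            }
            FileState::Closed => Err("protocol violation: double close"),
        }
    }
}

fn main() {
    // Re-enacting the bogus maybe_write/0: close, then write.
    let mut checker = ProtocolChecker::new();
    assert!(checker.close().is_ok());
    assert!(checker.write(b"foo").is_err()); // the checker catches the bug
}
```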
Tangential, but “Erlangutan” is absolutely my favorite name for members of a programming language community, and I thought “Rustacean” would be impossible to beat.
Also, there should be a word for these words. Maybe “daemonym” as a pun on “demonym” and “daemon” as a CS term?
I always found the “Let it crash” very convincing. One of the most convincing talks I know is not even about Erlang. But I’ve never actually coded anything substantial in either Erlang or Elixir myself. I went the other way instead with Rust and static typing.
But I’m curious, for those of you who do work on such systems, does it deliver on its promise? Is it simpler? More robust? And is modern Elixir done in the same vein of “Let it crash” and very few tests or verification?
Rust follows the “let it crash” philosophy; its panic system is Erlang-inspired. It used to be even more strongly baked into the language, back when Rust still had language-level tasking with a runtime. The nomicon chapter on unwinding still calls this out.
You can see this in the tasking/threading APIs, where a panic crashes that component and another part of the system is responsible for handling it.
I’ve had to deal with more than one Rust service that takes this philosophy to heart and so will fully crash the entire program in the presence of, say, a network connection timeout to a non-business-critical API endpoint. Maybe this isn’t the intended effect of the panic approach to error management, but it does seem to be a common outcome in my experience.
The problem here is a mismatch of expectations. It’s nominally OK to crash an Erlang actor in response to many/most runtime faults, because Erlang actors always operate in a constellation of redundant peers, and failure is a first order concern of their supervisor. That crash impacts a single request.
But e.g. systemd is not the OTP, and OS processes don’t operate in a cluster. A service running as an OS process is expected to be resilient to basically all runtime errors, even if those errors mean the service can’t fulfill its user-facing requirements. If an OS process crashes, it doesn’t impact a single request, it impacts every request served by the process, every other process with active connections to that process for any other reason, assumptions made by systemd about the soundness of that binary, probably some assumptions about off-host e.g. load balancers shuttling traffic to that instance, everything downstream from them, etc. etc.
If “crash-only” means “terminate the request” and not “terminate the process” then all good! But then “crash” isn’t the right verb, I don’t think, as crashing is pretty widely understood to mean the OS level process of the program. Alas.
Yeah, I think this is an acute case of catchy but wildly misleading terminology. What is really (as in Erlang or Midori) understood as proper “let it crash” is two dual properties:
abandoning the current “thing”
containing abandonment to some well-defined boundary, such that:
abandonment doesn’t propagate outside of this boundary
tearing things down at this boundary doesn’t compromise the state
restarting at the boundary is a well-defined operation which can fix transient errors
the actual blast radius from abandonment is small
Everyone gets the first point, but it’s the second one which matters, which is hard, and which leads to simplicity and reliability.
To expand on this, Rust does only marginally better here, if at all, than your average $LANG:
the built-in boundary is the OS thread, which is often too coarse-grained; there’s catch_unwind for do-it-yourself boundaries. There’s nothing to protect from a thread monopolizing the CPU due to an infinite-loop bug. Some errors (stack overflow, OOM) abort the process, bypassing the recovery mechanism.
UnwindSafe machinery in theory helps somewhat with the tainted-state problem. In practice, it’s too cumbersome to use and people often silence it. I had one spectacular bug where UnwindSafe would’ve saved a couple of days of debugging, had it not been silenced due to it tripping a compiler bug.
nothing to make restart workable, do-it-yourself again.
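A do-it-yourself boundary with `catch_unwind` looks roughly like this (subject to the caveats above: it does nothing for infinite loops, and stack overflow or panic=abort bypass it entirely):

```rust
use std::panic;

fn main() {
    // An abandonment boundary: the panic inside the closure is
    // contained here instead of unwinding any further up the stack.
    let result = panic::catch_unwind(|| {
        panic!("abandoning this unit of work");
    });
    assert!(result.is_err());

    // Whether state shared across this boundary is still sound is
    // exactly the UnwindSafe question mentioned above.
    println!("recovered at the boundary");
}
```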
But e.g. systemd is not the OTP, and OS processes don’t operate in a cluster. A service running as an OS process is expected to be resilient to basically all runtime errors, even if those errors mean the service can’t fulfill its user-facing requirements.
I think this might be oversimplifying. Whether it’s reasonable to let a service continue to fulfill tasks despite encountering a serious fault is not clear-cut. Example: in a service that has a lot of shared state, say thread-bound caches of various sensitive user data, a crash might lead to failed cleanups and subsequent data leaks.
A service running as an OS process is generally expected to be resilient to runtime errors. If a runtime error puts the service in a state where it can no longer fulfill user requirements, and that state is transient and/or recoverable, it is usually preferable for the service to continue to respond to requests with errors, rather than crashing.
In my experience across several companies and codebases in Elixir, I’d say the following things.
“let it crash” can lead to clean code. It also originated out of a design space that I believe doesn’t map as directly onto modern webshit as folks want to believe. This is neither good nor bad, it’s just an occasional impedance mismatch between system design philosophies.
“let it crash” encourages some people, drunk on the power of OTP and the actor model, to grossly overcomplicate their code. They decide deep supervision trees and worker pools and things are needed when a simple function will do. This is the curse of the beginner Elixir or Erlang developer, and if properly mentored this goes away quickly. If not properly mentored it progresses to a case of conference talks and awkward libraries.
Testing and verification in the BEAM ecosystem is weird, and until recently was both best and worst in class depending on what languages you were up against. Dialyzer for example is a marvelous typechecker, but there is growing suspicion that it is severely stunted in the sorts of verification it is categorically capable of. On the other side, property-based testing is strictly old-hat over in at least the Erlang ecosystem and has been for quite some time iirc. Other folks are catching up.
(Testing is also–in my opinion–most often valuable to solve team coordination problems and guard against entropy caused by other humans. This is orthogonal to language concerns, but comes out when you have larger webshit-style teams using BEAM stuff when compared with the origin of Erlang.)
Robustness is quite possible. I’ve alluded elsewhere to how a running BEAM instance is more of a living thing (gasp, a pet!) than most contemporary app models (pour one out for Smalltalk)…this unlocks some very flexible things you can do in production that I haven’t really seen anywhere else and which make it possible to do things without downtime during an incident that most folks would just look at and go “wat.”. On the other hand, you have to design your systems to actually enable robustness–writing your standard webshit without structuring the application logic to have affordances for interactivity or process isolation or whatever means you’re basically using the BEAM like you would other more conventional systems.
(You can also, as I’ve done with Python on at least one occasion, build a BEAM-like actor model with fault tolerance. The idioms are baked into Erlang and Elixir, but you can with sufficient effort reproduce them elsewhere.)
(Also, “let it crash” doesn’t mean you won’t sometimes have to wrap your whole VM in a systemd unit to restart things when, say, an intern pushes to prod and blows all the way up the supervision tree.)
In a sense, yes—an example I’ve run into several times is that a service you depend on becomes intermittently unresponsive. In a “regular” software service, unless you clutter your code up with retry and “fail soft” logic (basically your own ad-hoc OTP), this usually means a hard error, e.g. an end-user received an error message or a service needed to be restarted by the OS process manager.
In Erlang the system can usually deal with these kinds of errors by automatically retrying; if the operation keeps failing, the error will propagate up to the next level of the “supervision tree”. Unless it makes it all the way up to the root the application will keep running; sometimes the only indication that something went wrong is some log output.
The nice thing about “Let it crash” is that you don’t have to consider every possible failure scenario (e.g. what happens if this service call returns malformed data? What if it times out?). Instead of trying to preempt every possible error, which is messy and basically intractable, you can focus on the happy path and tell OTP what to do in case of a crash.
That said, “Let it crash” is not a silver bullet that will solve all errors in your distributed system; you still have to be acutely aware about which parts of the system can be safely restarted and how. The nice thing is that the base assumption of OTP is that the system will fail at some point, and it gives you a very powerful set of tools to deal with it.
Another thing that makes Erlang more robust is the process scheduling model: Processes are lightweight and use preemptive multitasking with fair scheduling, which means you’re less susceptible to “brownouts” from runaway processes.
But I’m curious, for those of you who do work on such systems, does it deliver on its promise? Is it simpler? More robust? And is modern Elixir done in the same vein of “Let it crash” and very few tests or verification?
I can only speak to the first half, and it is completely dependent on the team culture. If the team is very OTP/Erlang native, it can work out incredibly well. These teams tend to be extremely pragmatic and focused on boring, obvious, naive ways to solve problems, using global state as needed even!
However, when OTP/Erlang collide with a team trying to treat it like anything else things can go badly…. quickly.
The author feels like something is off, but can’t quite put their finger on it. They try to tie test culture to the language, claiming Erlang is different. But I think the main difference is the mindset. The technical means to let it crash were perfected in Erlang, no doubt, but the principle could stand for any programmer.
Testing as a religion is the direct negation of a culture of striving to produce well-written code and well-architected software. Instead, fewer people look at the code or analyse it critically; we just throw test coverage at it. Management assesses software quality and engineering skill by directly measuring test coverage. “John Doe is such a great engineer, he always writes tests”.
Testing hysteria boils down to imposing the principle that all code is equally well written and equally error-prone, and that the only mitigation strategy is writing tests. This is a quite obvious mix of nonsense and mediocrity culture. Of course we should strive to write code with as few errors as possible, as in any other job. An accountant doesn’t open Excel every day thinking he is randomly going to mix up all the salaries of his fellow employees. He knows that he, like anyone else, can make mistakes, and he may even have a double-check routine, but of course he strives not to make those mistakes, naturally.
IMO, the sweet spot for testing in Elixir and Erlang is “Does the happy path work?” and “If this section fails, does it do so elegantly?” I’ve definitely seen over-tested Elixir code, but no tests just means you end up doing tedious by-hand testing. So the example the author gave should definitely be caught by testing, because that functionality is just strictly absent from the program now, but you don’t need to crawl down every combination of branches trying to find every failure condition. Just have a test that murders the process and make sure the failure state is acceptable.
My (admittedly outsider) understanding of ‘let it crash’ is:
You can never anticipate every failure mode of a complex system
Thus, for reliability, you must be robust in the face of unknown failures
Testing is largely orthogonal to this - it’s largely intended to interrogate correctness. You (ideally) write tests which make assertions about the behavior of the code under test, which allows you to make inferences about known properties of the code, so that things like refactors can be assessed for whether these properties are maintained, or to prevent undesirable outcomes regressing, etc.
Take the cited code snippet from the article:
maybe_write() ->
    {ok, U} = file:open("/path/to/file.txt", [write]),
    file:close(U),
    ok = file:write(U, <<"foo">>),
    ok.
As the author notes, unit testing does catch this failure (assuming correctly written tests are present), which is dismissed in favor of discussing the failure of static typing to catch this issue (itself a contentious point, but one for another time). However, “let it crash” does not enhance the correctness of the application here - it allows the failure to be isolated to whatever process made this problem, but the code does not work correctly, and usually customers are still annoyed if the application loses data, even if it didn’t crash catastrophically in the process.
This article seems a little dismissive - just because “let it crash” means Erlang applications can be robust against failure, it doesn’t mean you shouldn’t be taking steps to guard against knowable failures.
This is a curious misunderstanding of what “let it crash” means.
“Let it crash” is the observation that when something unexpected, maybe spurious, happens in a distributed, ever-changing system, a good approach is to crash the unit of work in a structured way and have the supervisor of that unit apply a strategy. E.g. if you are connecting to a socket, doing some work and then closing, and the connection doesn’t work, the task should crash. The supervisor could then retry 3 times and then propagate the error. It’s a philosophy for error handling and structuring errors in a concurrent system, at a far higher level than the example in the code.
That approach is heavily inspired by how init systems do things. Say you run a cleanup task using your init system, and it fails once because of a spurious connection error. The init system tries it 5 more times. If it fails 5 times, it then marks the task faulty. Same philosophy: your init system is a supervision system.
“Let it crash” does not mean “your code can be faulty”; it encourages crashing as a strategy for handling spurious, hard-to-predict errors.
Their understanding of “let it crash” is not your understanding of “let it crash”, and–regardless of the degree to which I prefer your explanation–classifying it as a misunderstanding when I believe they have more expertise in the Erlang ecosystem (7 years at least, judging solely from this blog) and idioms than you do is somewhat presumptuous.
More pithily: please don’t rustsplain to the erlangutans. :P
There are things I disagree with in the article–for example, I think that tests are helpful for refactoring and for things like numerical routines…here the author and I part ways. That said, perhaps the misunderstanding here is that you and the author view bugs differently: the author says that “let it crash” covers all defects and failure conditions (including ones introduced by programmer negligence, bad luck, or lack of foresight) while you say it covers only spurious errors.
The additional cultural context that’s helpful here is that BEAM systems are (much to the annoyance of devops folks trained in the way of the times) handled as somewhat living critters. It is not uncommon (hell, did this last month during peak traffic with my team) to shell into a running system, handling production traffic, and debug and tweak and patch running code. There is a different mindset about this sort of thing than I believe we have when working with bake-and-deploy-and-scrap languages and runtimes.
I agree with your interpretation of “let it crash”, but having also suffered through a lot of oddly-engineered Elixir (which is similar enough to Erlang in the relevant ways I feel comfortable saying so) that has forgotten or misunderstood “let it crash” I do have sympathy for and empathy with the author.
+1
This article, and these comments, seem to be conflating multiple notions of crashing. I guess crashing is usually understood as something that happens at or above the level of an OS process. That is, a function call that doesn’t return successfully isn’t usually described as having crashed. Or, an autonomous task within a process that hits a terminal error isn’t typically said to have crashed. A process crashes. A system crashes.
Erlang defines a system model in which a supervisor (BEAM/OTP) manages a constellation of redundant entities (actors) which each serve isolated workloads (typically requests). When Erlang says “crash” it means something quite different, and far less impactful, than when Rust or Linux says “crash”. And Erlang’s definition is niche.
Maybe this is a distinction that everyone here fully understands and I’m harping on a pointless point, could be. But it’s definitely misunderstood as a general concept, and IMO shouldn’t be advocated-for without serious qualifications.
To summarise as fairly as I can: the author sees people trying to write correct code via unit testing. They are puzzled by this because Erlang isn’t supposed to be correct; it’s supposed to have bugs, and then do some work to isolate those bugs — to be able to crash while the whole system keeps mostly running in the face of that crash.
But like… wouldn’t you like to do both? Not have the bugs, and also safely crash when you have the bugs? I can’t tell if this is meant to be ironic or not.
The goal for the ‘let it crash’ mindset isn’t to ship buggy code, it’s to acknowledge that even formally verified code is not bug free (your specification can never be complete) and your system should aim to be correct in the presence of failures. Testing is important in Erlang, but testing should be of the form of killing processes at surprising times and ensuring that the system recovers and does not lose data. This gives you a far more reliable system than a quest to eliminate bugs via unit tests. You have a limited budget for testing, you should spend it where it will deliver the most impact.
Randomly crashing processes isn’t going to help me if I’m testing my deflate implementation or whatever. Regular unit tests are still very useful in Erlang for checking that functions do what they ought to.
I’m familiar with Erlang and with let-it-crash, but that’s not the claim the article is making. Here it is blow-by-blow:
The only substantive thing there that could possibly mitigate the need for individual functions to be correct is the “clear code with modestly-sized routines”. Okay, yeah, if you never write wrong code then you’ll never have wrong code, that’s not unique to Erlang. But nothing here obviates the need for correctness. Let-it-crash allows your system that does 10 things to keep doing 9 of them while it’s broken. But it doesn’t make the other one not broken. It doesn’t make that function correct. I’m not even a TDD weenie myself but the idea that let-it-crash is in any way related to correctness, whatever form that might take, strikes me as absurd.
Let-it-crash is very situational. It’s good to allow a complex system to limp along in the face of temporary external problems or problems only involving parts of the system. But “the point” of Erlang isn’t that it absolves you from writing correct code with a magical “On Error Resume Next” that still builds your widgets even if you close the exit hatch before you eject them from the chute. Let-it-crash lets one of your two widget-builders keep running after one trips the breaker, which is great and is actually the “point of using Erlang”. But if you don’t fix that bug you’ll still never build a single widget.
*: that particular case can be caught by some type systems like affine or linear types. I don’t claim that Erlang would be improved by those, just some fun trivia
How can you be correct with unforeseen failures? Seems like the general opinion around this is if an unplanned error happens, almost all bets are off and you can’t trust anything later to be correct so you should just crash and get a person to review the problem.
How should we distinguish between foreseeable and unforeseeable failures? If I write a program that tries to read a file whose path is provided by user input, is it not foreseeable that this file may not exist? Under what conditions is it appropriate for the entire program to crash in response to this error? Under what conditions is it my responsibility as a programmer to explicitly deal with this error? What about errors during file reads caused by device errors? Or errors during file descriptor close or flush or sync syscalls? Or errors when connecting to a remote API? Or during network transfers? Or when communicating with a database?
If you want to define crashworthy errors as whatever the programmer says are crashworthy errors, okay! But then what does crashing mean? If you’re serving a request in an Erlang actor and you hit a syscall error then it’s pretty OK to “crash” that actor because actors are request-scoped and the crash just terminates that single request. But if you’re serving a request in a concurrent e.g. Rust program and that request hits a syscall error then you definitely shouldn’t “crash” the entire process, right? You can terminate the request, sure, but that’s not really what anybody understands by “crash” terminology.
At my (former) job, I used Lua to process incoming SIP messages. It uses Lua coroutines [1] to handle each transaction. If some unexpected thing happens (like a nil reference missed in testing due to unexpected input, for instance) only that coroutine “crashes”—that is, ceases to run. This is caught by the framework I’m using and logged. The “crash” does not affect the rest of the program. In my experience, most of the crashes were due to mishandling of input, or rather, a miscommunication about what, exactly, we could expect from the Oligarchic Cell Phone Company that was our customer [2]. We don’t expect crashes, but sometimes we do overlook some conditions (as I’m fond of saying, “bugs are caused by an inattention to detail”). While Lua isn’t Erlang, I think the overall effect is the same—a crashed thread does not impinge on the rest of the program. Perhaps a better distinction is “let it soft crash” [3].
[1] Think threads, but not pthreads threads: Lua-specific ones, cooperatively multitasked.
[2] Our service is only for the North American Numbering Plan. We did not expect to receive international phone numbers at all (and weren’t for the non-SIP legacy interface).
[3] Where a “soft crash” just stops the processing and a diagnostic log is issued.
Great! Good! My point here is not to assert an absolute, or to deny your specific experience in any way. My point here is to say that the general notion of “crashing” usually does not describe the act of terminating entities under the management of, e.g., Erlang/OTP or your Lua framework, but instead usually describes the act of terminating entire OS processes.
I’m trying to communicate that most people do not generally understand threads as things that can “crash”. Threads die or terminate, processes crash. The difference in terminology is my entire point.
Then what would you call it?
I have a Lua script. I try running it, Lua spits out “attempt to index a nil value (global ‘x’)” and ends the program. That is, if I understand your nomenclature, a “crash.” Yet if I wrap that same code in a coroutine, the coroutine fails to finish, yet the script continues. What did the coroutine do? Just terminate? I personally consider it a “crash.”
And what does it mean for a “process” to crash? On Unix, the system continues running. Yet on MS-DOS, such a “crash” will usually “crash” the computer. Yet both are crashes. Why can a “process” running under a protected domain (Unix) crash, and yet, threads cannot? I think the participants here are using a broader definition of “crash” than you personally like.
If you can write a bit of code that runs (more or less) equivalently as its own OS process (scenario 1) or as a coroutine among other coroutines in a single shared OS process without modification (scenario 2) then whatever is orchestrating coroutines in the second scenario is necessarily enforcing isolation boundaries that are functionally equivalent to the OS process boundaries in the first scenario.
If that code fails in scenario 1, is that a crash? IMO yes.
If that code fails in scenario 2, is that a crash? IMO no.
My understanding of “crash”, or yours, isn’t important to me, what I care about is what the average developer understands when they see that term, absent any other qualifications. I don’t think most developers consider a thread termination to be a crash, and I think if you want to say “crash” to mean something like that you need to qualify it. That’s my only point. Could be wrong.
It’s worth pointing out that in the Erlang world, it’s not the entire program that crashes. Like, ever. Instead, whatever individual process that ran into the error is the piece that fails. In the context of a program loading a file, it’s likely that just the process responsible for loading the file would die. Which makes perfect sense, since it’s no longer a meaningful operation to continue loading a file that we can’t open.
The elegance of the “let it crash” mentality is that it lets you handle foreseeable, unforeseeable, and foreseeable-but-too-unlikely-to-care-about failures in the exact same way. Like sure, you could check to make sure a file exists, but what if it’s corrupted? Or someone takes a sledgehammer to the hard drive while you’re in the midst of reading the file? There’s an infinite number of failure states for a given operation, and for a lot of them there isn’t a meaningful resolution within the operation itself.
It’s well-understood that “crashing” in Erlang doesn’t mean crashing the entire program. The problem is that “crashing” in basically any context except Erlang does mean crashing the entire program.
There’s some truth to this: ‘let it crash’ does in general decrease the cost of a programming error, and it does shift the cost/benefit curve of writing tests.
As a practical example of this, when someone writes a new assist/fixit for rust-analyzer, the bar for testing is really low: just a couple of “happy path” tests. Assists are often contributed by non-core contributors and are a fiddly thing to write, so they often contain bugs and trigger panics at runtime. This isn’t a problem though, as, by design of the whole system, buggy assists can’t corrupt any data, and the faults in them are isolated (imperfectly, as Rust lacks true let-it-crash infrastructure). This is an appropriate approach, because it allows us to deliver an 80% feature to the user more quickly, and it also encourages contribution.
That being said, I don’t think that “decreasing programming errors” is the primary benefit of the tests. My response here is the same as to some lispers’ claim that “REPL is a substitute for tests”. The bulk of the value of tests in typical programs comes from preserving program behavior over time, as the code of the program itself changes. It’s not “is my code correct”, it’s “does this change accidentally break something else”. For this, let it crash doesn’t really help.
“let it crash” is recognized as being perfectly acceptable in any safe language, because the safe option to take in the face of something unexpected happening is to stop. This is not unique to Erlang. Rust does this. Haskell does this. Hell this is the approach webkit takes and that’s C++. Sure, the environment Erlang operates and its common usage means that you idiomatically write code knowing that it is possible for errors to occur unexpectedly, and to make sure that the system as a whole isn’t taken out by a failure in one component, but “things can crash, deal with it” is far from unique.
No part of “let it crash” means “don’t worry about writing correct code, we’ll find it in production”.
But let’s be honest, the core concept of the post - “requiring tests is an attack on ‘let it crash’” - is nonsense. There is a world of difference between “the code is wrong” and “the code crashes”, and pretending that testing is solely about the latter is bizarre.
Finally, the example:
You are correct, no amount of analysis would detect this error. That’s why you use your type system to ensure that this code would not even compile. Though I am loath to do it, let’s imagine trying in Rust - rather than any horrific C++ horrorscape using forced moves :D
This does not compile because closing/“dropping” the file consumes ownership, so subsequent usage is an error. Obviously, idiomatic Rust that wanted to force earlier closing would probably just create a scope, which would make it more overtly obvious that what was being attempted wasn’t possible.
I was bored and made a File type in C++ that abuses clang to make the equivalent of the Erlang example not work. Behold the glory!
Let it crash means the BEAM system tolerates program failure and is unlikely to take the whole runtime down. It does not guarantee the program will be usable when it’s constantly crashing. I’m failing to understand the point they’re trying to make. Do they advocate fixing bugs on the go, watching crashes? That doesn’t work if the program has users or interfaces with programs that have users. My unsolicited advice is stop looking for excuses and test your programs. You will gain confidence applying patches. Then crash whenever there’s really nothing left to do or you want to make a firm assertion preventing the execution from proceeding. And test that too.
It is worth noting that the stated problem, accessing a file after it is closed, was addressed in the PhD thesis of Joe Armstrong, in which he suggested a solution for exactly this problem by introducing a new testing & development methodology he named “protocol checkers” (9.1 Protocols, page 195). Relevant quote from the thesis:
The protocol is a state machine, which would detect the attempt to write to a file after it was closed.
Tangential, but “Erlangutan” is absolutely my favorite name for members of a programming language community, and I thought “Rustacean” would be impossible to beat.
Also, there should be a word for these words. Maybe “daemonym” as a pun on “demonym” and “daemon” as a CS term?
I always found the “Let it crash” philosophy very convincing. One of the most convincing talks I know is not even about Erlang. But I’ve never actually coded anything substantial in either Erlang or Elixir myself. I went the other way instead with Rust and static typing.
But I’m curious, for those of you who do work on such systems, does it deliver on its promise? Is it simpler? More robust? And is modern Elixir done in the same vein of “Let it crash” and very few tests or verification?
Rust follows the “let it crash” philosophy; its panic system is Erlang-inspired. It used to be baked even more strongly into the language, back when Rust still had language-level tasking with a runtime. The nomicon chapter on unwinding still calls this out.
You can see that in the tasking/threading APIs, where a panic crashes that component and another part of the system is responsible for handling it.
I’ve had to deal with more than one Rust service that take this philosophy to heart and so will fully crash the entire program in the presence of, say, a network connection timeout to a non-business-critical API endpoint. Maybe this isn’t the intended effect of the panic approach to error management, but it does seem to be a common outcome in my experience.
The problem here is a mismatch of expectations. It’s nominally OK to crash an Erlang actor in response to many/most runtime faults, because Erlang actors always operate in a constellation of redundant peers, and failure is a first order concern of their supervisor. That crash impacts a single request.
But e.g. systemd is not the OTP, and OS processes don’t operate in a cluster. A service running as an OS process is expected to be resilient to basically all runtime errors, even if those errors mean the service can’t fulfill its user-facing requirements. If an OS process crashes, it doesn’t impact a single request, it impacts every request served by the process, every other process with active connections to that process for any other reason, assumptions made by systemd about the soundness of that binary, probably some assumptions about off-host e.g. load balancers shuttling traffic to that instance, everything downstream from them, etc. etc.
If “crash-only” means “terminate the request” and not “terminate the process” then all good! But then “crash” isn’t the right verb, I don’t think, as crashing is pretty widely understood to mean the OS level process of the program. Alas.
Yeah, I think this is an acute case of catchy, but wildly misleading terminology. What is really (as in Erlang or Midori) understood as proper “let it crash” is two dual properties:
Everyone gets the first point, but it’s the second one which matters, which is hard, and which leads to simplicity and reliability.
To expand on this, Rust does only marginally, if at all, better here than your average $LANG:
I think this might be oversimplifying. Whether it’s reasonable to let a service continue to fulfill tasks despite encountering a serious fault is not clear cut. Example: A service that has a lot of shared state, say thread bound caches of various sensitive user data, a crash might lead to failed cleanups and subsequent data leaks.
Let me rephrase my claim to be more precise.
A service running as an OS process is generally expected to be resilient to runtime errors. If a runtime error puts the service in a state where it can no longer fulfill user requirements, and that state is transient and/or recoverable, it is usually preferable for the service to continue to respond to requests with errors, rather than crashing.
In my experience across several companies and codebases in Elixir, I’d say the following things.
“let it crash” can lead to clean code. It also originated out of a design space that I believe doesn’t map as directly onto modern webshit as folks want to believe. This is neither good nor bad, it’s just an occasional impedance mismatch between system design philosophies.
“let it crash” encourages some people, drunk on the power of OTP and the actor model, to grossly overcomplicate their code. They decide deep supervision trees and worker pools and things are needed when a simple function will do. This is the curse of the beginner Elixir or Erlang developer, and if properly mentored this goes away quickly. If not properly mentored it progresses to a case of conference talks and awkward libraries.
Testing and verification in the BEAM ecosystem is weird, and until recently was both best and worst in class depending on what languages you were up against. Dialyzer for example is a marvelous typechecker, but there is growing suspicion that it is severely stunted in the sorts of verification it is categorically capable of. On the other side, property-based testing is strictly old-hat over in at least the Erlang ecosystem and has been for quite some time iirc. Other folks are catching up.
(Testing is also–in my opinion–most often valuable to solve team coordination problems and guard against entropy caused by other humans. This is orthogonal to language concerns, but comes out when you have larger webshit-style teams using BEAM stuff when compared with the origin of Erlang.)
Robustness is quite possible. I’ve alluded elsewhere to how a running BEAM instance is more of a living thing (gasp, a pet!) than most contemporary app models (pour one out for Smalltalk)…this unlocks some very flexible things you can do in production that I haven’t really seen anywhere else, and which make it possible to do things without downtime during an incident that most folks would just look at and go “wat”. On the other hand, you have to design your systems to actually enable robustness–writing your standard webshit without structuring the application logic to have affordances for interactivity or process isolation or whatever means you’re basically using the BEAM like you would other more conventional systems.
(You can also, as I’ve done with Python on at least one occasion, build a BEAM-like actor model with fault tolerance. The idioms are baked into Erlang and Elixir, but you can with sufficient effort reproduce them elsewhere.)
(Also, “let it crash” doesn’t mean you won’t sometimes have to wrap your whole VM in a systemd unit to restart things when, say, an intern pushes to prod and blows all the way up the supervision tree.)
In a sense, yes—an example I’ve run into several times is that a service you depend on becomes intermittently unresponsive. In a “regular” software service, unless you clutter your code up with retry and “fail soft” logic (basically your own ad-hoc OTP) this usually means a hard error, e.g. an end-user receives an error message or a service needs to be restarted by the OS process manager.
In Erlang the system can usually deal with these kinds of errors by automatically retrying; if the operation keeps failing, the error will propagate up to the next level of the “supervision tree”. Unless it makes it all the way up to the root the application will keep running; sometimes the only indication that something went wrong is some log output.
The nice thing about “Let it crash” is that you don’t have to consider every possible failure scenario (eg what happens if this service call returns malformed data? What if it times out?). Instead of trying to preempt every possible error, which is messy and basically intractable, you can focus on the happy path and tell OTP what to do in case of a crash.
That said, “Let it crash” is not a silver bullet that will solve all errors in your distributed system; you still have to be acutely aware about which parts of the system can be safely restarted and how. The nice thing is that the base assumption of OTP is that the system will fail at some point, and it gives you a very powerful set of tools to deal with it.
Another thing that makes Erlang more robust is the process scheduling model: Processes are lightweight and use preemptive multitasking with fair scheduling, which means you’re less susceptible to “brownouts” from runaway processes.
I can only speak to the first half, and it is completely dependent on the team culture. If the team is very OTP/Erlang native, it can work out incredibly well. These teams tend to be extremely pragmatic and focused on boring, obvious, naive ways to solve problems, using global state as needed even!
However, when OTP/Erlang collide with a team trying to treat it like anything else things can go badly…. quickly.
The author feels like something is off, but can’t quite put their finger on it. They try to tie test culture to the language, claiming Erlang is different. But I think the main difference is the mindset. The technical means to let it crash were perfected in Erlang, no doubt, but the principle could stand for any programmer.
Testing as a religion is the direct negation of a culture of striving to produce well-written code and well-architected software. Instead, fewer people look at the code or analyse it critically; we just throw test coverage at it. Management assesses software quality and engineering skill by directly measuring test coverage. “John Doe is such a great engineer, he always writes tests”.
Testing hysteria boils down to imposing the principle that any code is equally well written and equally error prone, and that the only mitigation strategy is writing tests. This is a quite obvious mix of nonsense and mediocrity culture. Of course we should strive to write code with as few errors as possible, as in any other job. An accountant doesn’t open Excel every day thinking he is randomly going to mix up all the salaries of his fellow employees. He knows he, like anyone else, can make mistakes, and can even have a double-check routine, but of course he strives not to make those mistakes, naturally.