1. 14
  1. 15

    There are a bunch of good reasons for this, in no particular order:

    When you’re writing a program, unless you have a 100% accurate specification and formally verify your code, you will have bugs. You also have a finite amount of cognitive load that your brain can devote to avoiding bugs, so it’s a good idea to prioritise certain categories. Generally, bugs impact one or more of three categories, in (for most use cases) descending order of importance:

    1. Integrity
    2. Confidentiality
    3. Availability

    A bug that affects integrity is the worst kind because its effects can be felt back in time (corrupting state that you thought was safe). This is why Raskin’s first law (a program may not harm a user’s data or, through inaction, allow a human’s data to come to harm) is his first law. Whatever you do, you should avoid things that can cause data loss. This is why memory safety bugs are so bad: they place the entire program in an undefined state where any subsequent instruction that the CPU executes may corrupt the user’s data. Things like SQL injection fall into a similar category: they allow malicious or buggy inputs to corrupt state.

    Confidentiality may be almost as important in a lot of cases, but often the data that a program is operating on is of value only to the user and so leaking it doesn’t matter nearly as much as damaging it. In some defence applications the converse is true and it’s better to destroy the data than allow it to be leaked.

    Availability generally comes in last. The only exceptions tend to be safety-critical systems (if your car’s brakes fail to respond for 5 seconds, that’s much worse than your engine management system corrupting the mileage logs or leaking your position via a mobile channel, for example). For most desktop software, it’s a very distant third. If a program crashes without losing any data, and restarts quickly, I lose a few seconds of time but nothing else. macOS is designed so that the entire OS can crash without annoying the user too much. Almost every application supports sudden termination: it persists data to disk in the background and so the kernel can kill it if it runs out of memory. If the kernel panics then it typically takes a minute or two to reboot and come back to the original state.

    All of this means that a bug from not properly handling out-of-memory conditions is likely to have very low impact on the user. In contrast, it requires a huge amount of effort to get right. Everything that transitively allocates an object must handle failure. This is a huge burden on the programmer and if you get it wrong in one path then you may still see crashes from memory exhaustion.

    Next, there’s the question of what you do if memory is exhausted. As programs become more complicated, the subset of their behaviour that doesn’t require allocation becomes proportionally smaller. C++, for example, can throw an exception if operator new fails[1], but what do you do in those catch blocks? Any subsequent memory allocation is likely to fail, and so even communicating with the user in a GUI application may not be possible. The best you can do is write unsaved data to disk, but if you’re respecting Raskin’s first law then you did that as soon as possible and so doing it on memory exhaustion is not a great idea. Most embedded / kernel code works around this by pre-allocating things at the start of some operation so that it has a clear failure point and can abort the operation if allocation fails. That’s much harder to do in general-purpose code.
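
    As a rough illustration of the pre-allocation pattern, here is a minimal C++ sketch. The function `append_copies` is hypothetical (it is not from any real code base): it acquires all the memory it needs before touching the output, so there is exactly one place where allocation can fail and the caller's data is unchanged if it does. On a system that overcommits, a successful `reserve` is still not a guarantee that the pages can be written later.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <new>
    #include <vector>

    // Hypothetical operation: append `count` copies of `value` to `out`.
    // All memory is acquired up front, so the only failure point is the
    // reserve() call, and `out` is left untouched if it fails.
    bool append_copies(std::vector<int>& out, std::size_t count, int value) {
        // Reject sizes the container can never hold (avoids std::length_error).
        if (count > out.max_size() - out.size()) {
            return false;
        }
        try {
            out.reserve(out.size() + count);  // the single failure point
        } catch (const std::bad_alloc&) {
            return false;  // operation aborted, nothing modified
        }
        for (std::size_t i = 0; i < count; ++i) {
            out.push_back(value);  // capacity is already reserved: no reallocation
        }
        return true;
    }
    ```

    A caller checks the return value once, at the start of the operation, rather than after every individual append.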

    Closely related, the failure is (on modern operating systems that are not Windows) not related to allocation. Overcommit is a very important tactic for maximising the use of memory (memory that you’ve paid for but are not using is wasted). This means that malloc / new / whatever is not the point where you receive the out-of-memory notification. You receive it when you try to write to the memory, the OS takes a copy-on-write fault, and cannot allocate physical memory. This means that any store instruction may be the thing to trigger memory exhaustion (it often isn’t that bad, but on systems that do deduplication, it is exactly that bad). If you thought getting exception handling right for anything that calls new was hard, imagine how much harder it is if any store to memory needs correct exception handling.

    Finally, and perhaps most importantly, there’s the question of where to build in reliability in a system. I think that the most important lesson from Erlang is that failure should be handled at the largest scale possible. Running out of memory is one possible cause of a program crashing. If you correctly handle it in every possible case, you probably still have other things that can cause the program to crash. In the best case, with formally verified code from a correct and complete specification, hardware failures can cause crashing. If you really want reliable systems then you should work on the assumption that the program can crash. Again, macOS does this well and provides very fast recovery paths. If a background app crashes on macOS, the window server keeps a copy of the window contents, the kernel restarts the app, which reconnects and reclaims the windows and draws back into them. The user probably doesn’t notice. In a server system, if you have multiple fault-tolerant replicas then you handle memory exhaustion (as long as it’s not triggered by allowing an attacker to allocate unbounded amounts of memory) in the same way that you handle any other failure: kill a replica and restart. The same mechanism protects you against large numbers of bug categories, including a blown fuse in the datacenter.
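
    To make the "handle failure at the largest scale" idea concrete, here is a small in-process C++ sketch. It is only an analogy: `supervise` is a hypothetical helper, and a real system would restart a whole process or replica rather than re-run a function. The point it shows is that the recovery path does not care why the worker failed.

    ```cpp
    #include <cassert>
    #include <functional>
    #include <new>

    // Hypothetical supervisor: runs `worker`, and if it fails for any reason,
    // abandons that attempt and starts a fresh one, up to `max_restarts` times.
    // Returns the number of restarts that were needed, or -1 if it gave up.
    int supervise(const std::function<void()>& worker, int max_restarts) {
        for (int restarts = 0; restarts <= max_restarts; ++restarts) {
            try {
                worker();
                return restarts;  // worker finished cleanly
            } catch (...) {
                // Deliberately no per-cause handling: std::bad_alloc, a logic
                // error, or anything else all take the same recovery path.
            }
        }
        return -1;
    }
    ```

    The worker itself contains no out-of-memory handling at all; it is assumed to persist its state often enough that a restart loses very little.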

    All other things being equal, I would like programs to handle out of memory conditions gracefully but all other things are not equal and I’d much rather that they provided me with strong data integrity, data confidentiality, and could recover quickly from crashes.

    [1] Which, on every non-Windows platform, requires heap allocation. The Itanium ABI spec requires that the C++ runtime maintain a small pool of buffers that can be used but this has two additional problems. First, on a system that does overcommit, there’s no guarantee that the first use of those buffers won’t cause CoW faults and a SIGSEGV anyway. Second, there’s a finite pool of them and so in a multithreaded program some of the threads may be blocked waiting for others to complete error handling, and this may cause deadlock.

    1. 2

      C++, for example, can throw an exception if operator new fails[1], but what do you do in those catch blocks? Any subsequent memory allocation is likely to fail, and so even communicating with the user in a GUI application may not be possible

      This may or may not be the case depending on what you were doing inside the try. In the example of a particularly large allocation for a single operation, it’d be pretty straightforward to inform the user and abort the operation. For the case of GUI needing (but not being able) to allocate, I’d suggest that good design would have all allocation needed for user interaction being done early (during application startup) so this doesn’t present as a problem, even if it’s only for critical interactions.
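
      For the single large allocation case, a minimal C++ sketch might look like the following. The names `BufferResult` and `allocate_work_buffer` are made up for illustration. The error message is a string literal, so reporting the failure does not itself need to allocate.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <limits>
      #include <memory>
      #include <new>

      // Result of the hypothetical operation: either a buffer, or a message
      // held in static storage so that reporting failure allocates nothing.
      struct BufferResult {
          std::unique_ptr<unsigned char[]> buffer;
          const char* error;  // nullptr on success
      };

      BufferResult allocate_work_buffer(std::size_t bytes) {
          BufferResult result{nullptr, nullptr};
          // The nothrow form returns a null pointer instead of throwing.
          result.buffer.reset(new (std::nothrow) unsigned char[bytes]);
          if (!result.buffer) {
              result.error = "Not enough memory for this operation; nothing was changed.";
          }
          return result;
      }
      ```

      The caller can show `error` to the user and abandon just this operation, since nothing else has been modified at that point.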

      All other things being equal, I would like programs to handle out of memory conditions gracefully but all other things are not equal and I’d much rather that they provided me with strong data integrity, data confidentiality, and could recover quickly from crashes.

      Agreed, but it bothers me that the OS itself (and certain libraries, and certain languages) put blocks in the way to ever handling the conditions gracefully.

      Thanks for your comments.

      1. 1

        This may or may not be the case depending on what you were doing inside the try. In the example of a particularly large allocation for a single operation, it’d be pretty straightforward to inform the user and abort the operation.

        That’s definitely true but most code outside of embedded systems has a lot of small allocations. If one of these fails then you need to backtrack a lot. This is really hard to do.

        Agreed, but it bothers me that the OS itself (and certain libraries, and certain languages) put blocks in the way to ever handling the conditions gracefully.

        Apparently there’s been a lot of discussion about this topic in WG21. In the embedded space (including kernels), gracefully handling allocation failure is critical, but these environments typically disable exceptions and so can’t use the C++ standard interfaces anyway. Outside of the embedded space, there are no non-trivial C++ applications that handle allocation failure correctly in all cases, in spite of the fact that the standard was explicitly designed to make it possible.

        Note that Windows was designed from the NT kernel on up to enable precisely this. NT has a policy of not making promises it can’t keep. When you ask the kernel for committed memory, it increments a count for your process representing ‘commit charge’. The total commit charge of all processes (and bits of the kernel) must add up to less than the available memory + swap. Requests to commit memory will fail if this limit is exceeded. Even stack allocations will probe and will throw exceptions on stack overrun. SEH doesn’t require any heap allocations and so can report out-of-memory conditions (it does require stack allocations, so I’m not quite sure what it does for those - I think there’s always one spare page for each stack) and all of the higher-level Windows APIs support graceful handling of allocation errors.
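
        The accounting policy described above can be sketched as a toy model in C++. This is not the NT implementation; `CommitLedger` is a hypothetical class, and it treats the limit as inclusive purely to keep the example short. What it shows is that a request is refused at the moment it is made, rather than failing later when the memory is first touched.

        ```cpp
        #include <cassert>
        #include <cstddef>

        // Toy model of strict commit accounting (not the NT implementation).
        class CommitLedger {
        public:
            explicit CommitLedger(std::size_t limit_bytes)
                : limit_(limit_bytes), charged_(0) {}

            // Grants the request only if the total charge stays within the limit.
            bool commit(std::size_t bytes) {
                if (bytes > limit_ - charged_) {
                    return false;  // refuse up front: no promise we can't keep
                }
                charged_ += bytes;
                return true;
            }

            // Returns previously committed bytes to the pool.
            void release(std::size_t bytes) {
                charged_ -= (bytes < charged_) ? bytes : charged_;
            }

            std::size_t charged() const { return charged_; }

        private:
            std::size_t limit_;    // stands in for available memory + swap
            std::size_t charged_;  // sum of all granted commitments
        };
        ```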

        With all of that in mind, have you seen evidence that Windows applications are more reliable or less likely to lose user data than their macOS counterparts?

        1. 1

          Outside of the embedded space, there are no non-trivial C++ applications that handle allocation failure correctly in all cases

          I’ve written at least one that is supposed to do so, though it depends on your definition of “trivial” I guess. But anyway, “applications don’t do it” was one of the laments.

          With all of that in mind, have you seen evidence that Windows applications are more reliable or less likely to lose user data than their macOS counterparts?

          That’s a bit of a straw-man, though, isn’t it? Nobody’s claimed that properly handling allocation failure at the OS level will by itself make applications more reliable.

          I understand that people don’t think the problem is worth solving (that was somewhat the point of the article) - I think it’s subjective though. Arguments that availability is less important than integrity for example aren’t news, and aren’t enough to change my mind (I’ll point out that the importance of availability doesn’t diminish to zero just because there are higher-priority concerns). Other things that are being brought up are just echoing things already expressed by the article itself - the growing complexity of applications, the difficulty of handling allocation failure correctly; I agree the problem is hard, but I lament that OS behaviour, library design choices and language design choices only serve to make it harder, and for instance that programming languages aren’t trying to tackle the problem better.

          But, if you disagree, I’m not trying to convince you.

          1. 1

            I’ve written at least one that is supposed to do so

            I took a very quick (less than one minute) skim of the code and I found this line, where you use a throwing variant of operator new, in a way that is not exception safe. On at least one of the call chains that reach it, you will not hit an exception-handling block that handles that failure, and so it will propagate outwards.

            It might be that you correctly handle allocation failure but a quick skim of the code suggests that you don’t. The only code that I’ve ever seen that does handle it correctly outside of the embedded space was written in Ada.

            With all of that in mind, have you seen evidence that Windows applications are more reliable or less likely to lose user data than their macOS counterparts?

            That’s a bit of a straw-man, though, isn’t it? Nobody’s claimed that properly handling allocation failure at the OS level will by itself make applications more reliable.

            No, I’m claiming the exact opposite: that making people think about and handle allocation failure increases cognitive load and makes them more likely to introduce other bugs.

            1. 1

              I took a very quick (less than one minute) skim of the code and I found this line,

              That’s in a utility that was just added to the code base, is still a work in progress, and the “new” happens on the setup path where termination on failure is appropriate (though, yes, it would be better to output an appropriate response rather than let it propagate right through and terminate via “unhandled exception”). The daemon itself - the main program in the repository - is, as I said, supposed to be resilient to allocation failure; if you want to skim anything to check what I’ve said, you should skim that.

              No, I’m claiming the exact opposite: that making people think about and handle allocation failure increases cognitive load and makes them more likely to introduce other bugs.

              Well, if you are making a claim, you should provide the evidence yourself, rather than asking whether I’ve seen any. I don’t think, though, that you can draw such a conclusion, even if there is evidence that Windows programs are generally more buggy than macOS equivalents (and that might be the case). There may be explanations other than “the windows developers are trying to handle allocation failure and introducing bugs as a result”. In any case, I still feel that this is missing the point.

              (Sorry, that’s more inflammatory than I intended: what I meant was, you’re missing the thrust of the article. I’m really not interested in an argument about whether handling allocation failures is harder than not doing so; that is undeniably true. Does it lead to more bugs? With all other things being equal, it quite possibly does, but “how much so” is unanswered, and I still think there is a potential benefit; I also believe that the cost could be reduced if language design tried to address the problem).

              1. 2

                No, I’m claiming the exact opposite: that making people think about and handle allocation failure increases cognitive load and makes them more likely to introduce other bugs.

                Well, if you are making a claim, you should provide the evidence yourself, rather than asking whether I’ve seen any.

                The evidence that I see is that every platform that has designed APIs to require handling of OOM conditions (Symbian, Windows, classic MacOS, Win16) has had a worse user experience than ones that have tried to handle this at a system level (macOS, iOS, Android) and systems such as Erlang that don’t try to handle it locally are the ones that have the best uptime for large-scale systems.

                You are making a claim that handling memory failures gracefully will improve something. Given that the experience of the last 30 years is that not doing so improves usability, system resilience, and data integrity, you need to provide some very strong evidence to back up that claim.

                1. 1

                  You are making a claim that handling memory failures gracefully will improve something

                  Of course it will improve something - it will improve the behaviour of applications that encounter memory allocation failures. I feel like that’s a worthwhile goal. That’s the extent of my “claim”. It doesn’t need proving because it’s not really a claim. It’s a subjective opinion.

                  If all you want to do is say “you’re wrong”, you’ve done that. In all politeness, I don’t care what you think. You made some good points (as well as some that I flat-out disagree with, and some that are anecdotal or at least subjective) but that’s not changing my opinion. If you don’t want to discuss the ideas I was actually trying to raise, let’s leave it.

      2. 2

        Yes, it is better that programs crash rather than continue to run in a degraded state, but a program crashing is still a bad thing. This reads like an argument that quality is low because of all the quality that is being delivered, or that memory leaks aren’t worth fixing.

        1. 2

          That’s an argument that you can’t have programs that don’t crash through correctness alone. I.e., you can’t just be really really careful and write code that won’t crash. It’s basically impossible. What you can do is handle what is doable and architect for redundancy, fast recovery, and minimization of damage.

          1. 2

            Data corruption, wrong results, and other Undefined Behavior are usually worse than crashing.

            And I’m sorry to go into Grandpa Mode, but it’s easy to complain about quality when you haven’t had to try to handle and test every conceivable allocation failure (see my very long comment here for details.)

        2. 8

          I suspect few people reading this have had to deliver software that runs in a memory-constrained environment, one where the code must handle allocation failures because they are likely to occur in real use cases.

          Nowadays the only environments like this (aside from niche retro stuff) are embedded systems. I get the impression that people often design embedded software by eliminating dynamic allocation, or restricting it enough that there aren’t too many failure cases to handle.

          Outside the embedded domain, you have to go back to “classic” pre-X MacOS or Windows 95. There is no virtual memory to speak of (MacOS 7.5+ had VM but it just let you raise the apparent RAM limit by 2x or so.) Computers ship with too little RAM because it’s expensive, especially when politicians add tariffs because “the Japanese are taking over.” Users try to do too much with their PCs. On MacOS every app has to predeclare how much RAM it needs, and that’s how big its heap is. There’s a limited way to request “temporary memory” outside that, but it’s problematic. (I hear Windows 95 was slightly better with memory but I have no knowledge of it.)

          This was absolutely hellish. You can get pretty far in development without running into memory problems because your dev machine has a honkin’ 16MB of RAM, you configure the app to request a healthy size heap, and you’re mostly just running it briefly with small-to-moderate data sizes. But around beta time Marketing reminds you that most users only have 4MB and the app needs to run in a 2MB heap. And the testers start creating some big documents for it to open. And the really good testers come up with more and more creative scenarios to trigger OOM. (The best ones involve OOM during a save operation. If they’re sadistic they’ll make it happen in a low-disk-space situation too, because your target hardware only has a 160MB hard drive.)

          So I remember the last six months or so of the development cycle involving tracking down so many crashes caused by OOM. Or worse, the bugs that aren’t crashes, but an ignored malloc error or faulty cleanup code corrupted memory and caused misbehavior or a later crash or data corruption.

          [I just remembered a great story from 1995-ish. I worked on OpenDoc, which had a clever, complicated file format called Bento. The schmuck who implemented it, from Jed Harris’s elegant design, punted on all the error handling by simply calling a “fail” callback passed in by the caller. The callback wasn’t expected to return. For some reason the implications weren’t realized until the 1.0 beta cycle. I think the guy who implemented the document-storage subsystem passed in a callback that called longjmp and did some cleanup and returned an error. Unfortunately this (a) leaked memory allocated by Bento, and (b) tended to leave the file — the user’s document — in a corrupt state. I guess nobody had heard of ACID. This had to be fixed ASAP before beta. I can’t remember whether the fix was to implement a safe-save (slow and requires more disk space) or to fix Bento by putting in real error handling.]

          Oh, and ignoring a NULL result from malloc doesn’t immediately crash. MacOS had this wonderful feature that there was actual memory mapped at location 00000000. You could read or write it without crashing. But if you wrote more than a few hundred bytes (IIRC) there you overwrote interrupt vectors and crashed the whole computer, hard. I have no idea why this was done, it probably let Bruce Horn or Bill Atkinson shave some cycles from something in 1983 and afterwards it could never be changed.

          Anyway, TMI, but my point is that trying to correctly handle all memory allocation errors in large programs is extremely difficult because there are so many new code paths involved. (Recovering from OOM without incurring more failures was a black art, too.) I firmly believe it isn’t worth it, in general. Design the OS so it happens rarely, and make the program crash immediately so at least it doesn’t have time to corrupt user data. Oh, and put a good auto-save feature in the GUI framework’s Document class, so even if this happens the user only loses a minute of work.
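
          The auto-save idea only protects the user if a crash in the middle of a save cannot corrupt the existing file, which is what the safe-save mentioned in the OpenDoc story is for. A minimal C++ sketch, assuming POSIX semantics where `rename` atomically replaces an existing file (it does not on Windows), and with a made-up function name `safe_save`:

          ```cpp
          #include <cassert>
          #include <cstdio>
          #include <string>

          // Hypothetical safe-save: write the new contents to a temporary file,
          // then rename it over the real one. A crash before the rename leaves
          // the old document intact. Flushing to stable storage (fsync) is
          // omitted here for brevity.
          bool safe_save(const std::string& path, const std::string& contents) {
              const std::string tmp = path + ".tmp";
              std::FILE* f = std::fopen(tmp.c_str(), "wb");
              if (!f) {
                  return false;
              }
              const bool wrote =
                  std::fwrite(contents.data(), 1, contents.size(), f) == contents.size();
              const bool closed = std::fclose(f) == 0;
              if (!wrote || !closed) {
                  std::remove(tmp.c_str());  // discard the partial temporary file
                  return false;
              }
              if (std::rename(tmp.c_str(), path.c_str()) != 0) {
                  std::remove(tmp.c_str());
                  return false;
              }
              return true;
          }
          ```

          As noted in the story, this is slower and needs more disk space than writing in place, which is the price of never leaving the user's document half-written.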

          1. 1

            MacOS had this wonderful feature that there was actual memory mapped at location 00000000

            This actually sounds interesting if it wasn’t implemented in a dumb way? Like, handle writes to 00000000, so that failed allocations don’t immediately fail, but track how many, and if it’s over a threshold, reboot the system safely.

            1. 1

              It stems from the MC68000, which didn’t have virtual memory. The CPU expects a table at location 0 with various pointers (the first two indicate the starting PC and SP values; others are for various exceptions and IRQ handlers). It was most likely in ROM (since it contains start up data) so writes wouldn’t affect it, but mileage may vary (some systems might map RAM into place at a certain point in time).

            2. 1

              my point is that trying to correctly handle all memory allocation errors in large programs is extremely difficult because there are so many new code paths involved. (Recovering from OOM without incurring more failures was a black art, too.) I firmly believe it isn’t worth it, in general.

              Here your sentiment closely echoes something that is alluded to in the article:

              Apart from the increased availability of memory, I assume that the other reason for ignoring the possibility of allocation failure is just because it is easier. Proper error handling has traditionally been tedious, and memory allocation operations tend to be prolific; handling allocation failure can mean having to incorporate error paths, and propagate errors, through parts of a program that could otherwise be much simpler. As software gets larger, and more complex, being able to ignore this particular type of failure becomes more attractive.

              While I agree that for a lot of software termination is the only suitable response to an out-of-memory condition, I also think there is a range of software where sudden termination is quite undesirable. Something I didn’t say is that we do in fact have reasonable tools in some languages for dealing with allocation failure in a reasonable way - I’m thinking of exceptions and RAII / “defer” idioms - without having to add thousands of “if (p == null) …” checks through the code. (I wonder if this last is going to be even more contentious than the article).
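
              As a sketch of what I mean by exceptions plus RAII, here is a small C++ example. `Document`, `append_lines` and `try_append` are invented names, and the `simulate_oom` flag exists only so the failure path can be exercised deterministically. The update is staged in a copy and committed with a non-throwing swap, so a `std::bad_alloc` anywhere in the middle leaves the document as it was, with no explicit null checks.

              ```cpp
              #include <cassert>
              #include <new>
              #include <string>
              #include <vector>

              struct Document {
                  std::vector<std::string> lines;
              };

              // Appends `extra` to `doc` with the strong exception guarantee:
              // all allocation happens on a staged copy, then one swap commits it.
              void append_lines(Document& doc, const std::vector<std::string>& extra,
                                bool simulate_oom = false) {
                  std::vector<std::string> staged = doc.lines;  // may throw std::bad_alloc
                  for (const std::string& line : extra) {
                      staged.push_back(line);                   // may throw std::bad_alloc
                  }
                  if (simulate_oom) {
                      throw std::bad_alloc();  // stands in for a real allocation failure
                  }
                  doc.lines.swap(staged);  // non-throwing commit
                  // `staged` (the old contents, or the abandoned copy when an
                  // exception unwinds) is released by its destructor.
              }

              // One catch site for the whole operation instead of a check per allocation.
              bool try_append(Document& doc, const std::vector<std::string>& extra,
                              bool simulate_oom = false) {
                  try {
                      append_lines(doc, extra, simulate_oom);
                      return true;
                  } catch (const std::bad_alloc&) {
                      return false;  // doc is unchanged
                  }
              }
              ```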

              And while I can see that for some applications in nearly all circumstances the sensible option really is to just terminate, what about those applications - or those circumstances - where it’s not? This was meant to be the main point: the frameworks that applications are built on - the OS, the libraries - are preventing an application from cleanly handling OOM even if they otherwise could. I think that’s unfortunate. We may disagree, and I acknowledge there are strong arguments on the other side.

              Thanks for your comments. I found the anecdotes about Win95 / MacOS really interesting.

            3. 2

              I’d blame the operating system. Genode or unikernel frameworks like Solo5 have an explicit memory limit and programs are designed with this taken into account. With memory constraints you have a dial available on every program to adjust things like caching behavior (unless it’s a port or a quick hack).

              1. 1

                This feels like misplaced anger from an over-perfectionist. It reminds me of how annoyed I get when I see code that is not very well formatted.

                Does ugly code have some impact? Sure, hard to read, hard to modify, takes longer to deliver new value on top of it. But it’s not the end of the world. It works, it solves problems now, it’s likely making money. How much more money will it make if I take an ungodly amount of time to make sure that every single line complies with my OCD obsessions? Probably not enough to cover the costs of my work.

                Same thing here. Is it inelegant how most languages deal with memory? Perhaps. But ain’t no one going around with their ram at 99% usage all the time, at least not in servers and desktops/laptops. In practice, things get handled WAY before allocation ever becomes a concern:

                • horizontal scaling will kick in if instances use too much memory
                • users will kill unused stuff if things get slow
                • and things will get slow because the OS is using swap before letting allocation fail
                • and if the OS starts to use too much swap, users will upgrade.

                When this matters, it’s handled, but most of the time it doesn’t, so it’s not. I don’t really see the problem with that.

                1. 1

                  With Linux in particular, vm.overcommit_memory=2 (i.e., don’t do it) seems to be an unhappy path that hasn’t had the developer attention it would need in order to not be terrible. I’m not a BSD user, but I’m told the BSDs don’t overcommit memory. What (if anything) do they do instead?

                  One of the ways not overcommitting memory on Linux is worse than dealing with the OOM killer is that since practically no programs check for allocation failures, running out of memory means that basically a random program will fail, just because it happened to be the next one to ask for memory, even if it’s not really contributing to the problem, and killing it won’t help the state of the system. At least the OOM killer has heuristics for deciding what to kill (and userspace OOM killers usually do better). It might be better just to write apps so they’re prepared to be OOM killed at short notice, like on Android.
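
                  For anyone who wants to check which policy a Linux machine is using, here is a small C++ sketch. The function names are made up; the procfs path and the meaning of the values 0, 1 and 2 are the documented Linux ones. On other systems the file is absent and the reader simply returns -1.

                  ```cpp
                  #include <cassert>
                  #include <cstring>
                  #include <fstream>

                  // Maps the value of vm.overcommit_memory to a short description.
                  const char* describe_overcommit_mode(int mode) {
                      switch (mode) {
                          case 0: return "heuristic overcommit";
                          case 1: return "always overcommit";
                          case 2: return "strict accounting (no overcommit)";
                          default: return "unknown";
                      }
                  }

                  // Reads the current mode from procfs.
                  // Returns -1 if the file is missing or unreadable (e.g. not Linux).
                  int read_overcommit_mode() {
                      std::ifstream in("/proc/sys/vm/overcommit_memory");
                      int mode = -1;
                      if (!(in >> mode)) {
                          return -1;
                      }
                      return mode;
                  }
                  ```

                  Calling `describe_overcommit_mode(read_overcommit_mode())` gives a readable answer, with "unknown" when the setting cannot be read.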