This isn’t a C problem, it’s an operating system problem.
In fact, I’d even argue that systems using this sort of allocation scheme are not in keeping with the C standard, which states that a successful return from malloc (or related functions) yields a pointer that can be used to access objects or arrays of any type up to the specified size. Crashing on a valid return is not a specified error condition.
This is a pedantic response, but the pedantic answer is important for why this behavior isn’t incompatible with the C standard: malloc doesn’t cause a program to crash when an overcommit is realized. It causes the kernel to abort the program. The latter is perfectly acceptable in the C/C++ memory model.
Put another way: from C’s perspective, there’s no real sense in which the malloc incorrectly fails. It succeeds, and the system decides not to play along.
Personally I regard malloc() as a fundamentally broken API, with the operating systems not providing a reliable alternative.
One of my pet hates is all the introductory text telling everybody to check the return value for NULL… resulting in literally gigabytes of broken useless untested code doing ever more arcane and intricate steps to try and handle malloc returning null.
Most of it is demonstrably broken… and easy to spot… if an out of memory handler uses printf…. guess what. it’s b0rked. Printf uses malloc. Doh!
I always wrap malloc() to check for null and abort(). Invoke the Great Garbage Collector in the Sky.
Thereafter I rely on it being non-null.
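A minimal sketch of such a wrapper (the name xmalloc and the message are illustrative, not the poster’s actual code):

    #include <stdio.h>
    #include <stdlib.h>

    /* Never returns NULL: on failure, print a fixed message and abort. */
    static void *xmalloc(size_t n)
    {
        void *p = malloc(n);
        if (p == NULL) {
            fputs("out of memory\n", stderr); /* keep the failure path dumb: no printf */
            abort();                          /* the Great Garbage Collector in the Sky */
        }
        return p;
    }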
These days I also allocate small or zero swap partitions…. by the time you’re swapping heavily… your program is not dead… just unusable, actually worse than that: your program has made the entire system unusable. So the sooner the OOM killer wakes and does its thing the better.
One of my pet hates is all the introductory text telling everybody to check the return value for NULL…
It’s an extremely important thing to do in embedded systems, many of which are incredibly RAM-constrained (I own at least one board with only 16KB of RAM) and in older “retro” OSs (including MacOS pre-X, and IIRC Windows pre-XP) that don’t have advanced virtual memory.
Most of it is demonstrably broken… and easy to spot… if an out of memory handler uses printf…. guess what. it’s b0rked. Printf uses malloc. Doh!
You young fellers didn’t cut your teeth on out-of-memory errors the way I did :) Here’s how you do this: On startup you allocate a block big enough to handle your error-recovery requirements, say 16KB. Sometimes it was called the “rainy day fund.” When allocation fails, the first thing you do is free that block. Now you have some RAM available while you unwind your call chain and report the error.
In your event loop (or some equivalent) if your emergency block is null you try to reallocate it. Until then you operate in emergency low-memory mode where you disable any command that might use a lot of RAM. (You can also check the heap free space for other clues you’re running low.)
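A rough sketch of the pattern, with illustrative names and sizes (not code from any particular framework):

    #include <stdlib.h>

    #define EMERGENCY_SIZE (16 * 1024)
    static void *emergency;                      /* NULL while in low-memory mode */

    /* Call once at startup: set aside the rainy day fund. */
    void reserve_init(void) { emergency = malloc(EMERGENCY_SIZE); }

    /* Call when an allocation fails: release the fund so that unwinding
       and error reporting have some RAM to work with. */
    void on_allocation_failure(void)
    {
        free(emergency);
        emergency = NULL;
    }

    /* Call from the event loop: try to refill the fund; until that succeeds,
       stay in low-memory mode and disable anything RAM-hungry. */
    void event_loop_tick(void)
    {
        if (emergency == NULL)
            emergency = malloc(EMERGENCY_SIZE);
    }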
This behavior was baked into old classic-Mac app frameworks like MacApp and PowerPlant. If you didn’t use those frameworks (most apps didn’t), then you damn well rolled your own equivalent. Otherwise your testers or end users would be reporting lots and lots of crashes when memory ran low.
I never coded for Windows, DOS or AmigaOS, but I bet they had very similar band-aids.
I always wrap malloc() to check for null and abort(). Invoke the Great Garbage Collector in the Sky.
That works fine in some use cases like a CLI tool, or a server that can get restarted if it crashes. It’s not acceptable in an interactive application; it makes the users quite upset.
Back in the late 80s I once came back from vacation to find myself hung in effigy on our team whiteboard, because the artists using my app kept losing their work when they ran into a particular memory crasher I’d introduced just before leaving.
(Oh, it’s not OK in a library either, because it defeats the calling code’s memory management. If I’m using a library and find that it aborts when something recoverable goes wrong, it’s a sign to stop using it.)
You young fellers didn’t cut your teeth on out-of-memory errors the way I did :) Here’s how you do this: On startup you allocate a block big enough to handle your error-recovery requirements, say 16KB. Sometimes it was called the “rainy day fund.” When allocation fails, the first thing you do is free that block. Now you have some RAM available while you unwind your call chain and report the error.
I have been bitten by doing exactly this on a modern OS. If the OS performs overcommit then your rainy-day fund may not actually be accessible when you exhaust memory. You need to make sure that you pre-fault it (writing random data over it should work, if the OS does memory deduplication then writing non-random data may still trigger CoW faults that can fail if memory is exhausted) or you’ll discover that there aren’t actually pages there. I actually hit this with a reservation in the BSS section of my binary: in out-of-memory conditions, my reservation was just full of CoW views of the canonical zero page, so accessing it triggered an abort.
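A sketch of that pre-faulting step, assuming 4 KB pages and a plain malloc’d reserve (sizes illustrative):

    #include <stdlib.h>

    #define RESERVE_SIZE (16 * 1024)
    #define PAGE_SIZE    4096                /* assumption: 4 KB pages */
    static unsigned char *reserve;

    /* Touch every page with hard-to-deduplicate data so the kernel has to
       commit real frames now, while memory is still available. */
    int reserve_prefault(void)
    {
        reserve = malloc(RESERVE_SIZE);
        if (reserve == NULL)
            return -1;
        for (size_t i = 0; i < RESERVE_SIZE; i += PAGE_SIZE)
            reserve[i] = (unsigned char)rand();  /* write, not just read */
        return 0;
    }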
Similarly, on modern platforms, just because malloc failed doesn’t mean that exactly the same malloc call won’t succeed immediately afterwards, because another process may have exited or returned memory. This is really bad on Linux in a VM because the memory balloon driver often doesn’t return memory fast enough and so you’ll get processes crashing because they ran out of memory, but if you rerun them immediately afterwards their allocations will succeed.
Back in the late 80s I once came back from vacation to find myself hung in effigy on our team whiteboard, because the artists using my app kept losing their work when they ran into a particular memory crasher I’d introduced just before leaving.
I think you learned the wrong lesson from this. A formally verified app will never crash from any situation within its reasoning framework, but anything short of that cannot guarantee that it will never crash, especially when running on a system with many other processes that are out of its control (or on hardware that may fail). The right thing to do is not to try to guarantee that you never crash but instead to try to guarantee that the user never loses data (or, at least, doesn’t lose very much data) in the case of a crash. Even if your app is 100% bug free, it’s running on a kernel that’s millions of lines of C code, using RAM provided by the lowest bidder, so the host system will crash sometimes no matter what you do.
Apple embraced this philosophy first with iOS and then with macOS. Well-behaved apps opt into a mechanism called ‘sudden termination’. They tell the kernel that they’ve checkpointed all state that the user cares about between run-loop iterations, and if the system is low on RAM it can just kill -9 some of them. The WindowServer process takes ownership of the crashed process’s windows and keeps presenting their old contents; when the process restarts it reclaims these windows and draws in them again. This has the advantage that when a Mac app crashes, you rarely lose more than a second or two of data. It doesn’t happen very often but it doesn’t really bother me now when an app crashes on my Mac: it’s a 2-3 second interruption and then I continue from where I was.
There’s a broader lesson to be learned here, which OTP made explicit in the Erlang ecosystem and Google wrote a lot about 20 years ago when they started getting higher reliability than IBM mainframes on much cheaper hardware: the higher the level at which you can handle failure, the more resilient your system will be overall. If every malloc caller has to check for failure, then one caller in one library getting it wrong crashes your program. If you compartmentalise libraries and expect them to crash, then your program can be resilient even if no one checks malloc. If your window system and kernel expect apps to crash and provide recovery paths that don’t lose user data, your platform is more resilient to data loss than if you required all apps to be written to a standard that they never crash. If you build your distributed system expecting individual components to fail then it will be much more reliable (and vastly cheaper) than if you try to ensure that they never fail.
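Ever since I first read about it, I have always thought “crash only software” is the only way to make things reliable!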
I generally sympathize with the “crash-only” philosophy, but an issue with that approach is that sometimes a graceful shutdown path can significantly speed up recovery. (Of course, a counterargument is that not having a graceful shutdown path forces you to optimize recovery for all cases, and that in an emergency where recovery time is critical your app likely already crashed anyway.)
One of the original papers benchmarked a graceful shutdown vs a crash and fsck for a journaled file system (ext3? ext4? can’t remember) and found crash and fsck was faster!
The actual use case for a graceful shut down is for things like de-registering from basestations.
But I would argue that such “shutdown activity” should be “business as usual”, with the only difference being that new requests for activity get rejected with “piss off, I’m shutting down”, and once that’s done: crash!
Since you brought up filesystems, there is a lesson to be learned from ZFS: “crash and no fsck” is fastest – try to use atomic/transactional/CoW magic to make sure that any crash basically is graceful, since there’s nothing to corrupt.
The idea with most ‘crash-only’ systems (assuming I’m understanding the term correctly - I don’t think I’ve heard it before) is that your shutdown path isn’t special. You have a small amount of uncommitted data at any given time but anything that actually needs to persist is always persisted. For example, you use an append-only file format that you periodically garbage collect by writing the compacted version to a new file and then doing an atomic rename. You may choose to do the compaction on a graceful shutdown, but you’re also doing it periodically. This has the added advantage that your shutdown code path isn’t anything special: everything that you’re doing on the shutdown path, you’re doing periodically. Your code is effectively doing a graceful shutdown every so often, so that it’s always in a recoverable state.
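A sketch of the append-then-atomically-replace idea (POSIX calls, error handling trimmed; a fully robust version would also fsync the containing directory):

    #include <stdio.h>
    #include <unistd.h>

    /* Write the compacted state to a temporary file, make it durable, then
       atomically swap it into place. A crash at any point leaves either the
       old file or the new one, never a half-written mix. */
    int checkpoint(const char *path, const char *tmp_path,
                   const void *data, size_t len)
    {
        FILE *f = fopen(tmp_path, "wb");
        if (f == NULL)
            return -1;
        if (fwrite(data, 1, len, f) != len) { fclose(f); return -1; }
        if (fflush(f) != 0 || fsync(fileno(f)) != 0) { fclose(f); return -1; }
        fclose(f);
        return rename(tmp_path, path);   /* atomic replacement on POSIX filesystems */
    }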
The core mindset is ‘assume that things can fail at any time’. This is vital for building a scalable distributed system because once you have a million computers the probability of one of them breaking is pretty high. Modern software increasingly looks like a distributed system and so ends up needing the same kind of mindset. Isolate whatever you can, assume it will fail at any given time.
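Some background:
https://www.usenix.org/conference/hotos-ix/crash-only-software
https://lwn.net/Articles/191059/
https://brooker.co.za/blog/2012/01/22/crash-only.html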
I think the full “crash-only” philosophy really requires infrastructure support in a runtime or VM, because sometimes it’s just not acceptable to bring the whole system down. There was some work on “micro-reboot” prototypes of the JVM (and I guess .NET AppDomains were supposed to implement a similar model), but so far AFAIK BEAM/Erlang is the only widely used runtime that implements the “micro-reboot” model.
You make a great point that the sort of recovery-optimizing cleanup one might do in a graceful shutdown path can instead be done periodically in the background. During Win7 there was an org-wide push to reduce system shutdown latency, and I remember doing some work to implement precisely this approach in the Windows Search Engine.
It’s an extremely important thing to do in embedded systems…
That’s my day job.
The way I set things up is this…. in order from best to worst…
If I can do allocation sizing at compile time… I will.
Statically allocate most stuff for worst case so blowing the ram budget will fail at link time.
A “prelink” allocation step (very much like C++’s collect2) that precisely allocates arrays based on what is going into the link and hence will fail at link time if budget is blown. (Useful for multiple products built from same codebase)
Where allocations are run time configuration dependent… Get the configuration validator to fail before you can even configure the device.
Where that is not possible, fail and die miserably at startup time… So at least you know that configuration doesn’t work before the device goes off to do its job somewhere.
Otherwise record error data (using preallocated resources) and reset.. aka. Big Garbage Collector in the Sky. (aka. Regain full service as rapidly as possible)
Soak test the hell out of it and record high water marks.
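As a tiny illustration of the static-allocation end of that list (names and budget numbers made up), C11 lets the build itself fail when the budget is blown:

    #include <stdint.h>

    /* Hypothetical pools, sized for the worst case at compile time. */
    #define MAX_CONNECTIONS 16
    #define RX_BUF_SIZE     512
    static uint8_t rx_pool[MAX_CONNECTIONS][RX_BUF_SIZE];

    /* Blow the (made-up) 12 KB budget and the build fails, not the device in the field. */
    _Static_assert(sizeof(rx_pool) <= 12 * 1024, "rx_pool exceeds RAM budget");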
I still find, despite all this, colleagues who occasionally write desperate, untested and untestable attempts at handling OOM conditions.
And every bloody time I have reviewed it…. the unwinding code is provably buggy as heck.
The key thing is nobody wants a device that is limping along in a degraded low memory mode. They want full service back again asap.
Sounds like “fun”! I’ve got a side project making alternate firmware for a MIDI controller (Novation’s LaunchPad Pro) where I’ve been making a lot of use of static allocation and C++ constexpr … it’s been interesting to see how much I can do at compile/link time. IIRC I’ve been able to avoid all calls to malloc so far.
and every bloody time I have reviewed it…. the unwinding code is provably buggy as heck
Yeah, this was always a miserable experience developing for the classic Mac OS. QE would keep finding new ways to run out of memory on different code paths, and filing new crashers. But crashing wasn’t an option in a GUI app.
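The trick is back pressure.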
Unwinding is nearly always a bad choice of architecture.
To massively oversimplify a typical device… it’s a pipeline from input events, to interacting with and perhaps modifying internal state to output.
If something on the output side of that pipeline runs out of resource…. attempting to unwind (especially in a multithreaded real time system) is a nightmare beyond belief.
The trick is to either spec the pipeline so downstream always has more capacity / bandwidth / priority than upstream OR have a mechanism to sniff if my output queue is getting near full and so throttle the flow of input events by some means. (Possibly recursively).
By throttle I mean things like ye olde xon/xoff flow control, blocking, dropping packets, etc…
The important principle is to do this as soon as you can before you have wasted cpu cycles or resources or … on an event that is going to be dropped or blocked anyway.
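Yeah, I’ve used backpressure in networking code, and I can see it would be important in a small constrained device processing data.
Unrelated to malloc….
I was watching this bloke https://www.youtube.com/watch?v=ihe9zV07Lgk
His Akai MPK mini was giving me the “buy me’s”….
Doing some research, other folk are recommending the LaunchPad….
What’s your opinion? I find it very interesting that you can reprogram the Novation… Does it come with docs and an SDK and the like?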
So, I have the previous-generation LP Pro. They partially open-sourced it in 2015 or so. (I don’t believe the current model is supported.) Instead of releasing the stock firmware, they have a GitHub repo with a minimal framework you can build on. Much of it is still a binary blob. The README describes their reasoning.
I found it easy to get started with — there’s even a sort of emulator that lets you run it on your computer for easier debugging.
But you really are starting from scratch. There are empty functions you fill in to handle pad presses and incoming MIDI, and some functions to call to light up pads and send MIDI. So even recreating what the stock firmware does takes significant work. But I’m having fun with it.
It’s probably still important to this day. Windows has never supported memory overcommit - you cannot allocate more memory than there is available swap to back. This is why the pagefile tends to be at least as large as the amount of physical memory installed.
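One of Butler Lampson’s design principles: “Leave yourself a place to stand.”
Have you read this? https://lobste.rs/s/rgv1sv/defence_swap_common_misconceptions_2018
It made me reconsider swap, anyway.
No I hadn’t…. it’s a pretty good description.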
Didn’t learn much I didn’t know beyond the definition of the swappiness tunable….
and that something in cgroups exists…. and that it does something with memory pressure.
I have been meaning to dig into the cgroup stuff for a while.
But yes, the crux of the matter is apps need to be able to sniff memory pressure and have mechanisms to react.
For some tasks eg. a big build, the reaction may be… “Hey OS, just park me until this is all over, desperately swapping to give me a cycle isn’t helping anybody!”
I literally woke up this morning thinking about this: 😩
guess what. it’s b0rked. Printf uses malloc. Doh!
I have not looked up the source code of [any implementation of] printf, but I can’t think of a reason printf would need to call malloc. It’s just scanning the format string, doing some numeric conversions that can use fixed size buffers, and writing to stdout. Given that printf-like functions can be a bottleneck (like when doing lots of logging) I’d think they’d try to avoid heap allocation.
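It’s an edge case, a bad idea, and a misfeature, but glibc allows registering callbacks for custom conversion specifiers for printf.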
For localisation, printf provides qualifiers that let a conversion specification name which argument it refers to, by its position in the argument list. C’s stdarg does not allow you to get variadic parameter n directly, so to support this printf may need to do a two-pass scan of the format string: first it collects the positional references, then it gathers the arguments into an indexed data structure. That indexed data structure needs to be dynamically allocated.
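For example (positional conversions are a POSIX extension to printf, not ISO C):

    #include <stdio.h>

    int main(void)
    {
        /* %2$s consumes the second argument and %1$s the first; the
           implementation cannot know which arguments it will need until it
           has scanned the whole format string. */
        printf("%2$s, %1$s!\n", "world", "hello");   /* prints "hello, world!" */
        return 0;
    }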
It’s quite common in printf implementations to use alloca or a variable-length array in an inner block of the printf to dynamically allocate the data structure on the stack, but it’s simpler to use malloc and free. It’s also common in more optimised printf implementations to implement this as a fall-back mode, where you just do va_arg until you encounter a positional qualifier, then punt to a slow-path implementation with the string, a copy of the va_list, and the current position (once you’ve seen a positional specifier you may discover that you need to collect some of the earlier entries in the va_list that you’ve discarded already. Fortunately, they’re still in the argframe).
To make this even more fun, printf is locale-aware. It will call a bunch of character-set conversion functions, localeconv, and so on. If the locale is lazily initialised then this may also trigger allocation. Oh, and as @dbremmer points out, GNU-compatible printf implementations can register extensions. FreeBSD libc actually has two different printf implementations, a fast one and one that’s used if you have ever called register_printf_function.
To put printf in perspective: GNUstep has a function called GSFormat, which is a copy of a printf implementation, extended to support %@ (which is a fairly trivial change, since %@ is just %s with a tiny bit of extra logic in front). I was able to compile all of GNUstep except for this function with clang, about a year before clang could compile all of GNUstep. The printf implementation stressed the compiler more than the whole of the rest of the codebase.
Java’s printf is even more exciting. The first time it’s called, it gives pretty much complete coverage of the JVM spec. It invokes the class loader to load locales (Solaris libc does this as well - each locale is a separate .so that is dlopened to get the locale, most other systems just store locales as tables), generates enough temporary objects that it triggers garbage collection, does dispatch via interfaces, interface-to-interface casts, and dispatches at least one of every Java bytecode.
Whatever language you’re using, there’s a good chance that printf or the local equivalent is one of the most complex functions that you will ever look at.
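Sigh … It’s always the most obscure 10% of the feature set that causes 90% of the complexity, isn’t it?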
Yeah, I thought the same thing. printf is a little interpreter and it doesn’t need to malloc. sprintf doesn’t either; that’s why it has the mode where it returns the number of bytes you need to allocate.
I’ve only just skimmed the printf code in glibc. All it does is call vfprintf with stdout as the file. It does appear that vfprintf allocates some work buffers to do its thing, but I did not dive too deeply because the code is hard to read (lots of C preprocessor abuse) and I just don’t have the time right now.
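So I thought too, until I set a break point….
I don’t think we were using standard gnu libc at that point, so your mileage may vary.
I have the vaguest of memories that it was printing 64-bit ints (on a 32-bitter) that triggered it.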
These days I also allocate small or zero swap partitions…. by the time you’re swapping heavily… your program is not dead… just unusable
Swap still has its use. I’ve got a server which can evict most of the application because ~50% of its memory is lower levels of jit which will never be hit after a minute of runtime. Without swap, I’d have to run two servers. With swap, 1gb of it is used, and I can run two copies of the server without oomkiller kicking in. 60% of swap is marked used, but almost never touched.
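Yes. Also, the more unused anonymous pages you can swap out, the more RAM you have for pagecache to accelerate I/O.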
While the author’s question is not without merit, it seems to me that this is very much an exercise in semantics, not programming, and that it is in fact entirely resolved with an exercise in semantics as well. This description:
Under both macOS/clang and Linux/GCC, I find that the program prints “Memory allocated” and then crashes.
is not quite accurate. The program does not, in fact, crash; it gets killed by the operating system, presumably because it doesn’t have 1 TB of swap ;-).
The question remains equally unanswerable (and equally facetious) in any language whose runtime winds up calling malloc underneath (or uses the same mechanism), at least if it’s running on an operating system that allows memory overcommitting and is configured to allow it.
I’m not sure what ‘Linux/GCC’ the author is running but this is actually configuration-dependent, observe:
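(A sketch of the kind of configuration involved, assuming the standard Linux vm.overcommit_memory knob and a 64-bit machine with far less than 1 TB of commit available — with strict accounting enabled, the allocation is refused up front instead of the process being killed later:)

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* After e.g. `sysctl vm.overcommit_memory=2`, the kernel refuses to
           promise memory it cannot back, so this request fails immediately. */
        size_t huge = (size_t)1 << 40;   /* ~1 TiB, assumes 64-bit size_t */
        void *p = malloc(huge);
        if (p == NULL) {
            puts("malloc failed");       /* what you see with overcommit disabled */
            return 1;
        }
        puts("Memory allocated");        /* the article's default-overcommit behaviour */
        free(p);
        return 0;
    }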
so I guess the answer would be “disable memory overcommitting and look at the return value”?
Edit: to put it another way, the question essentially stems from a difference in understanding about what “succeeded” means, but the good news is that you can configure most modern operating systems to have the same understanding of “succeeded” as the author’s, and a lot of older operating systems ascribe it the same meaning.
While I do not enjoy the occasional wrestle with the OOM killer either, to me, a lowly code monkey who is quite comfortable letting smart mr. OS figure out how to handle all this memory paging stuff for me, this is a design that makes sense (edit: even if it leaks through malloc), at least in programs where I want all that stuff handled for me. Granted, a good understanding of all that is required in order to make it work in your favour, and I would refer the curious reader to Poul-Henning Kamp’s extraordinary article for a tangential example of how well that works. (NB: it definitely works better for phk than it does for me because phk is a fscking god). If, for whatever reason, that’s not the desired behaviour, that’s mostly a matter of configuration…
I can’t stand that article by phk. It just reeks of arrogance and self-congratulatory ignorance. There is no evidence that he made the slightest effort to acquaint himself with any relevant research. In fact, of course, there is an entire academic research program dedicated to his professed goal: cache-efficient and cache-oblivious algorithms and data structures. But a Real Hacker doesn’t have time to read papers when they could be banging out code!
Yeah, it could be written in a more modest tone to say the least. The way it bubbled up somewhere near the top of my reading list is that it provides an accessible (for people outside academia, that is) example of how to reason about performance in addition to complexity, taking system implementation details into account, so I ended up sending it to some of our interns pretty much every summer. With the exact caveat you mention: that cache-efficient algorithms and data structures are a research field in their own right, which the author is only scratching the surface of (or, who knows, poorly reinventing the basics of?)
Derogatory tone aside, though, I think it is a good article. It seems to me that the snark is directed less at academic research and more at formal CS/CompEng undergrad education, which cargo cults performance analysis to the point of uselessness in an effort to teach undergrads how to ace interviews, rather than reason about performance.
That constants (and usage of the memory hierarchy) almost always matter more than asymptotics in practice is not emphasized nearly enough in undergrad CS AFAICT (but what do I know, I never did undergrad CS, although I used to advise CS undergrads).
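Well… yes and no – there’s accidentally quadratic stuff causing weird slowness everywhere. Both matter, I guess.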
Quadratic is fine (witness the ubiquity of insertion sort) up to 10^4 or so. But leading constants almost always swamp any difference between say O(n) and O(n lg n) except when n is really big (say 10^9 or so). And some asymptotic distinctions are completely meaningless in any practical context (say O(1) vs O(lg lg n)).
True, I agree that anything involving log n is algorithm nerd territory, but quadratic makes headlines (see the infamous GTA Online load-times thing). 10^4 is nothing if the n is, say, the number of bytes you’re parsing.
This isn’t necessarily a problem with C. It’s a problem with how memory allocations are represented. Since heap implementations are just fancy wrappers around mmap, they could change how they behave. After calling mmap, the heap implementation could call mlock, which would force the new memory allocation to fully reside in memory.
In cases like this, the notion of “success” is a little blurry. Your call to malloc did indeed succeed, just not in the way you perhaps expected.
So, if you want to test whether the allocation can fully reside in memory without being swapped out, call mlock. :-)
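A sketch of that idea, assuming Linux-style MAP_ANONYMOUS (note that mlock is subject to RLIMIT_MEMLOCK and may need privileges, so a stock allocator could not do this unconditionally):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64 * 1024 * 1024;   /* arbitrary 64 MB */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        if (mlock(p, len) != 0) {        /* fails with ENOMEM if it can't be backed */
            perror("mlock");
            munmap(p, len);
            return 1;
        }
        memset(p, 0, len);               /* the pages are resident; no fatal fault here */
        munlock(p, len);
        munmap(p, len);
        return 0;
    }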
A long time ago in a galaxy far far away… malloc was commonly implemented on top of the brk()/sbrk() system calls. Solaris in ’92 had alternative allocator libraries (malloc back-ends if you will), one of which was libmapmalloc. It did use mmap and IIRC it was super slow when I measured it.
FYI casting the return value of malloc in C is not necessary and not advised. It can easily hide bugs if you happen to be compiling with a compiler which allows implicit function declarations. Really this kind of issue only affects beginners so I would especially avoid writing such code in any instructive material.
My take: In “modern” OSs, the abstraction presented by malloc breaks down when you allocate huge amounts of memory. At those scales, you can’t keep pretending memory is free and just comes out of a tap like water. You have to take into account swap space, overcommit, your OS’s naughty-process killer, and such factors.
It’s nice that we have this abstraction — I speak as someone who spent decades coding on systems that didn’t have it — it’s just not perfect.
I’d much rather have malloc return NULL than overcommit memory, fear the OOM-Killer, and run something like getFreeMemory(&how_much_memory_can_my_app_waste); in a loop.
But isn’t this only an issue in a process that allocates “huge” amounts of memory? Where today on a desktop OS “huge” means “tens/hundreds of gigabytes”? If you’re doing that, you can take responsibility for your own backing store by creating a big-enough (and gapless) file, mmap’ing it, then running your own heap allocator in that address space.
(Pre-apologies if I’m being naive; I don’t usually write code that needs more than a few tens of MB.)
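A minimal sketch of that approach, assuming POSIX APIs and a system that provides posix_fallocate (the path, the size, and the allocator layered on top are left to the caller):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Create a large, hole-free backing file, map it, and let a custom
       allocator carve up [base, base + size). */
    void *map_backing_store(const char *path, size_t size)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0)
            return NULL;
        if (posix_fallocate(fd, 0, (off_t)size) != 0) {  /* reserve real blocks */
            close(fd);
            return NULL;
        }
        void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                        /* the mapping keeps the file alive */
        return base == MAP_FAILED ? NULL : base;
    }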
Basically creating your own swap file. It’s a fun concept, but here are some things you may have to consider in practice:
must find a place for the temp file that’s actually on a disk, not an in-memory tmpfs, and it has to be a fast disk with enough space
because mmap was designed for I/O, not this, it would slow you down by flushing your memory to disk unnecessarily.. but okay, you’ve found the non-standard MAP_NOSYNC flag to turn that off
now you think you have your region backed by enough disk space – you initialized that memory with something after all – but oh no, the user has filesystem compression! Your initial data fits into the available disk space, but as you’re replacing it with less compressible data (all when you’re out of RAM), it doesn’t fit anymore. It explodes! Do you want your possessions identified? [ynq]
now if the user has a copy-on-write filesystem like ZFS, and you’re running out of space there… your blocks are not rewritten in-place, so whoops you kinda needed even more free space than you assumed
Oh, and in something like a desktop app, there’s a good chance users will hate you for hogging the disk space :)
I don’t really write those big applications either. But Java (Tomcat), browsers and other proprietary business apps are memory hogs. And because they are used to malloc pretty much always returning success, they employ various techniques (ugly hacks) to find out how much RAM there really is, instead of backing off once they hit a malloc error.
Rolling your own allocator can sometimes be the answer, but most of the time it’s just dangerous to override your system’s malloc (debuggability, bug-proneness, security risks).
But Java (Tomcat), Browsers and other proprietary business apps are memory hogs.
The JVM preallocates heap memory, though direct byte buffers are allocated outside of this heap. Generally this means it’s rare for the JVM to continue allocating. You can also force the JVM to commit the memory so it doesn’t hit a copy-on-write fault. As such it shouldn’t have much of an issue if the system runs out of available memory.
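That’s exactly what I do in a production DB (single 32TB mmap) and it works very well. It does freak out customers when they run top, though.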
For running on machines that I control (e.g. my own servers) I wrap the system allocator with my own that tracks how much memory has been allocated, and refuses allocations after an arbitrary limit has been reached (equal to the amount I’m willing to give to the process, or close to the amount of RAM available in practice on this machine).
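A simplified, single-threaded sketch of such a wrapper (not the poster’s actual code; the budget is arbitrary and thread-safety is ignored):

    #include <stddef.h>
    #include <stdlib.h>

    typedef union { size_t n; max_align_t a; } header;   /* size header, kept aligned */

    static size_t used;                                   /* bytes currently handed out */
    static size_t limit = 512UL * 1024 * 1024;            /* self-imposed 512 MB budget */

    void *limited_malloc(size_t n)
    {
        if (n > limit - used)
            return NULL;                  /* over budget: fail early and visibly */
        header *h = malloc(sizeof *h + n);
        if (h == NULL)
            return NULL;
        h->n = n;
        used += n;
        return h + 1;
    }

    void limited_free(void *p)
    {
        if (p == NULL)
            return;
        header *h = (header *)p - 1;
        used -= h->n;
        free(h);
    }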
Another trick I’ve used is to have a global counter that runs independently of the allocator and coarsely “reserves” memory for future use. If I know running some application-level task is going to need 100MB, then I can tell it “reserve me 100MB if possible”, and if all memory has already been “reserved”, then I can abort the operation even before it started. This isn’t a guarantee that memory will be available, but it works very well for server processes, preventing them from starting more work than they could handle.
Is this the same as fwrite not syncing all the way to disk, or send returning before the peer has read the message?