Why does Windows require the programmer to manually commit pages?
It makes a lot of sense if you think about it and I think Linux should support that.
Sometimes you may need to allocate a huge amount of address space (with mmap()) while only ever using a tiny fraction of it. A typical reason is a sparse data structure (say, a sparse array): compared to the alternatives, it makes the code very simple and very fast, yet still very memory efficient, since you only “pay” for the memory you use, not for the address space you allocate.
AddressSanitizer, for example, uses such a huge, very sparse bitmap for its shadow memory. At any point in time, only a very tiny fraction of the address space corresponding to this bitmap is actually being used.
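To make this concrete, here is a minimal sketch of the trick on Linux (the sizes and offsets are arbitrary; MAP_NORESERVE asks the kernel not to reserve swap for the mapping, and is ignored when strict overcommit is enabled):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = (size_t)1 << 40;  /* 1 TiB of address space (64-bit build) */

    /* Anonymous private mapping: no physical memory is used until a page
       is actually touched. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch a few widely scattered bytes: only these three pages (~12 KiB)
       ever consume real memory, despite the 1 TiB mapping. */
    p[0] = 1;
    p[(size_t)1 << 35] = 2;
    p[len - 1] = 3;

    munmap(p, len);
    return 0;
}
```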
However, here’s the problem. What happens if this program starts using a lot of this address space and the system starts running out of memory?
Linux handles this problem by doing one of two things:
It either returns an error from mmap(), if it decides the program has requested too much address space.
By default, Linux uses heuristics to decide whether a huge mmap() is allowed. A sufficiently large mmap() can therefore fail unnecessarily, even for a well-behaved program that would only use a tiny fraction of the mapping and so would cause no problems for the system if the mmap() were allowed.
Or, it starts killing processes on your system to try and get the memory usage down.
Again, Linux uses some heuristics to determine in which order processes are killed, so that the memory usage of the system goes back to a sustainable level. This can potentially cause innocent processes to get killed unnecessarily, especially when the misbehaving process is the one you least expected.
This is the “memory overcommit” problem and its companion, the OOM killer. Linux lets you tune the heuristics somewhat (vm.overcommit_memory, vm.overcommit_ratio, per-process oom_score_adj) to make your system behave better. Arguably, though, these knobs don’t fix the underlying problem in a principled way, and they can still cause unnecessary failures in what should be perfectly reasonable scenarios.
Windows, however, allows the programmer to manually commit pages, which means that the programmer can specify which parts of the address space they are actually using (or may use soon).
This enables a third alternative, which works as follows:
When you reserve address space (VirtualAlloc() with MEM_RESERVE, Windows’s analogue of a large mmap()), Windows always allows it (as long as sufficient address space is available, of course).
Either separately or as part of the initial call, the programmer specifies which parts of the address space the program is actually going to use, i.e. which memory is to be committed (MEM_COMMIT).
At this point, Windows tries to reserve swap space for the committed memory. If there is enough, the commit succeeds. Otherwise, an error is returned to the program, which can handle it in a program-specific way; this makes sense, since the program knows better than the kernel how to handle its own out-of-memory situation.
This means that on Windows, there is always enough swap space to handle even the worst case scenario, which is that all programs start using all the memory that they have committed. This means that the Windows kernel never has to kill processes without the user’s consent nor has to refuse mmap() calls unnecessarily.
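In Windows API terms, the reserve and commit steps look roughly like this (a minimal sketch; the sizes are arbitrary and a 64-bit build is assumed):

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    size_t reserve_size = (size_t)1 << 40;  /* reserve 1 TiB of address space */

    /* Reserve address space only: no commit charge, no swap reservation. */
    void *base = VirtualAlloc(NULL, reserve_size, MEM_RESERVE, PAGE_NOACCESS);
    if (!base) { fprintf(stderr, "reserve failed\n"); return 1; }

    /* Commit one page somewhere inside the reservation. This is where the
       kernel charges against the commit limit (RAM + pagefile) and where an
       out-of-memory condition surfaces as a clean error. */
    void *page = VirtualAlloc((char *)base + ((size_t)1 << 30), 4096,
                              MEM_COMMIT, PAGE_READWRITE);
    if (!page) { fprintf(stderr, "commit failed: commit limit reached\n"); return 1; }

    ((char *)page)[0] = 42;  /* safe: this page is committed */

    VirtualFree(base, 0, MEM_RELEASE);
    return 0;
}
```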
Notably, Linux can be configured to work this way (by disabling memory overcommit with vm.overcommit_memory=2), but this mode has a huge flaw: Linux then requires all of the program’s allocated (writable) address space to be backed by a swap reservation. That is far too pessimistic and makes mmap() calls fail unnecessarily for lack of swap space, which makes little sense once you realize that most programs that allocate huge amounts of address space only ever touch a tiny portion of it (the difference can be several orders of magnitude).
If Linux had a way for programs to communicate to the kernel which chunks of the address space they actually intend to use, it could handle swap reservations far more efficiently, and memory overcommit (along with its associated problems) would largely no longer be needed.
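Something close can in fact be emulated on Linux today, though commit is not a first-class concept there: reserve with a PROT_NONE mapping (which is never charged against commit) and “commit” with mprotect(). Under vm.overcommit_memory=2 the accounting happens at the mprotect() step and can fail cleanly with ENOMEM; in the default heuristic mode that step performs no real reservation. A sketch:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t reserve = (size_t)1 << 40;  /* "reserve" 1 TiB of address space */

    /* Reserve: a PROT_NONE private mapping is not charged against commit. */
    char *base = mmap(NULL, reserve, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) { perror("mmap (reserve)"); return 1; }

    /* Commit: making a chunk writable is the point where, under strict
       overcommit, the kernel charges it and can refuse with ENOMEM. */
    if (mprotect(base, (size_t)1 << 20, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect (commit)");
        return 1;
    }

    base[0] = 42;  /* safe: this chunk is "committed" */
    munmap(base, reserve);
    return 0;
}
```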
By the way, on Linux the exact same problem exists for fork(), because fork() creates a child process by duplicating the parent’s entire address space. Linux must either overcommit memory (allow the fork() to proceed even though enough memory may not actually be available for the child) or refuse the fork(), which again causes unnecessary failures when the child is not actually going to consume much additional memory; the common fork()-then-exec() pattern, which touches almost nothing before replacing the child’s image, is the classic example.
If Linux had a way to commit memory, fork() could reserve swap space for exactly the amount of memory the parent has committed, which can be orders of magnitude less than the address space it has allocated. fork() could then always succeed without ever driving the system out of memory, and it would succeed more often than it does today (unless Linux is configured for full overcommit, which, again, can run the system out of memory and force the kernel to start killing processes).
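This is also why the fork()-then-exec() pattern is often replaced with posix_spawn(): glibc implements it with vfork()-style semantics (CLONE_VM | CLONE_VFORK), so the child shares the parent’s address space until exec() and no duplicate commit charge is needed. A minimal sketch:

```c
#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

int main(void) {
    /* Unlike fork(), posix_spawn() does not duplicate the parent's address
       space, so launching a child from a huge process does not require the
       parent's full memory footprint to be committed a second time. */
    pid_t pid;
    char *argv[] = { "true", NULL };
    int err = posix_spawnp(&pid, "true", NULL, NULL, argv, environ);
    if (err != 0) { fprintf(stderr, "posix_spawnp: error %d\n", err); return 1; }

    waitpid(pid, NULL, 0);
    return 0;
}
```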
Note that Linux also requires swap space to be allocated up front, which needlessly wastes disk space: it forces the user to pre-allocate the largest swap area they will ever need. This ignores the fact that swap reservations don’t require actual disk space to be allocated during normal usage, nor do they make the system swap more; the only requirement is that disk space be reservable for swap. So reservations could be implemented extremely cheaply and with very few drawbacks, as long as the swap file is allowed to be resized / to grow into the reserved space in the worst case (which Windows also does, although in a flawed way, according to the article!).
Another issue is that Linux only allows memory overcommit to be configured system-wide. I think it would also make sense to configure it per process: a well-designed program could then be guaranteed to work reliably, with fully reserved swap space (and thus a complete guarantee that the kernel never kills it because of memory usage), while other, more “loosely implemented” programs could still be killed by the kernel if they start misbehaving.
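The one narrow per-process knob Linux does offer today is OOM-killer exemption via /proc/<pid>/oom_score_adj (lowering the score requires CAP_SYS_RESOURCE). It shields a process from being killed, but it gives no per-process strict accounting. A sketch:

```c
#include <stdio.h>

int main(void) {
    /* -1000 tells the OOM killer never to pick this process. Lowering the
       score below its current value requires CAP_SYS_RESOURCE. */
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (!f) { perror("fopen"); return 1; }
    if (fputs("-1000", f) == EOF) { perror("fputs"); }
    fclose(f);
    return 0;
}
```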
Thank you, this is an excellent explanation.
I’ve encountered and dealt with all the problems on Linux that you describe: I agree that the Windows way of doing things does seem a lot simpler. It just didn’t occur to me before that you could solve this problem nicely at the OS level.
For sparse data structures, I allocate the amount of storage required and then implement my own mapping, for example from row/column to address. Being able to just use virtual memory for this makes so much sense: I suppose I never really considered it before because I’ve never programmed on a system which separates allocation and committing in this way.
Managing your own representation for sparse data structures isn’t too bad - and you’ll have to do it anyway if you’re working on a GPU or something similar - but, as you say, this applies to mmap in general. If we want to map a large file into memory, then the alternative of maintaining your own mapping is going to be much more painful, with a lot of fseek or lseek.
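For comparison, here is a minimal sketch of the mmap() approach to a large file (big.dat is just a stand-in name): every access becomes a pointer dereference instead of an lseek()/read() pair.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("big.dat", O_RDONLY);  /* hypothetical large file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only; pages are faulted in on demand. */
    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    if (st.st_size > 12345)
        printf("byte at offset 12345: %d\n", data[12345]);

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}
```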
I’ve also often been annoyed by the fact that Linux’s overcommit configuration is system-wide. On small systems, i.e. ones where I have a lot of control over all the software that’s running, I always turn off overcommit. But that’s no good on a desktop or most servers: it would be nice to be able to turn it off for just my own software.
Anyway, thank you - you’ve enlightened me!