Given the page table overhead of allocating large amounts of virtual memory, does anyone know if people actually use virtual memory to implement sparse arrays?
Virtual memory is fascinating. We take it for granted, but it had to be invented at some point. And, there were several precursors to page-based virtual memory, too. Perhaps we’ll move away from virtual memory?
That’s the premise of this article: “The Cost of Software-Based Memory Management Without Virtual Memory”
“While exact area and power consumption are difficult to quantify, we expect that removing (or simplifying) support for address translation can have a net positive impact since current translation infrastructure uses as much space as an L1 cache and up to 15% of a chip’s energy”
“Modern, performance-critical software is considered non-functional when swapping, so it is avoided at all cost.” We learn about swapping as one of the primary reasons for using virtual memory in college, but then in the real world it’s essentially not used at all.
I don’t think swap being unused is necessarily true - it depends on your use case. Swap is fantastic when, for example, you have a bunch of daemons running in the background, and you want them to be running, but you rarely need them. In that case those programs can be paged out to swap and that memory can be used for something better, like the disk cache. The boost you get from the disk cache far outstrips the hit you take by swapping the daemon back in, because you don’t have to perform the latter operation very often.
I think really the issue is predictability. On my laptop I have a big swap partition to enable the above effect, but in production I don’t because it makes the system easier to understand. IIRC I even go as far as to disable overcommit in production because, again, having it on makes the system less predictable and therefore less reliable. If something gets OOM killed on my laptop it’s annoying; if something gets OOM killed in prod, something just went down.
This fundamental trade-off between predictability/complexity and optimization comes up in other places too. For example: free space under ext4 is trivial to understand. But under ZFS or btrfs? Incredibly complicated, especially when it comes to deletion (i.e. “if I delete this, how much space will I actually get back”). You can delete a snapshot with 1TB of data in it and end up freeing <10 MB because the snapshot was taken only a minute ago. How much space a 5 MB file will take depends on how well its contents compress. Under btrfs this is even affected by where in the filesystem you write the file, because different parts of the filesystem can have different levels of redundancy. And blocks might be deduplicated, too. There is a separation and disconnection between the logical filesystem that userspace sees and the physical disk media that the filesystem driver sees that simply didn’t exist in e.g. ext4. And this can potentially cause big problems because tools like df(1) examine the logical filesystem and expect that that’s equivalent to the physical filesystem.
This is awesome. Thanks for writing all this!
Sure thing :-) I’m glad you enjoyed it. I also found your original article fascinating. Sometimes I wonder about posting long things like this because a) I wonder if I’m getting way too detailed/if I just like talking too much and b) I have a lot of experience as a technologist, but almost exclusively as a hobbyist (as opposed to in industry) - “production” for me is mostly a single server running in my house that I administer by hand. I always wonder if I have some glaring blind spot that’s going to make me say something silly. So it’s super nice to see that at least some other people thought it made sense :D
As a side note to just tack onto the end of my original comment: all of the ZFS/btrfs examples I listed above are about things that you actually can calculate/understand if you know where to look, but I just thought of an example where (AFAIK) that’s not the case: under ZFS at least, how do you answer the question, “if I delete both these snapshots, how much space will I free?” If both snapshots are identical, but you deleted a 1GB file since they were both taken, zfs(8) will report that deleting either snapshot will free 0 bytes. And yet, deleting both will free 1GB. This is fundamentally harder to present in UI because instead of “if you perform x, y will happen”, it is “if you perform x, y will happen; if you perform a, b will happen, but if you perform x and a, then y, b, and some third effect will happen”. Unix CLIs really like outputting tabular data, where rows are things (disks, datasets, etc.) and columns are properties about that thing. But given that kind of tabular format, it is virtually impossible to usefully express the kind of combinatorial effect I’m describing here, especially because given x operations that could be performed (e.g. destroying a particular snapshot), listing the combinatorial effects would require outputting a row for every subset of those operations: 2^x rows (with just 10 snapshots, that is already 1,024 rows).
Again, this is all AFAIK. (If someone knows how to answer this question, please correct me because I actually have this problem myself and would like to know which snapshots to delete! Maybe it can be done with channel programs? I don’t know much about them.)
It’s a fascinating UI problem you bring up wrt combinatorial options. Perhaps an explorable UI with a tree of options would make sense. Maybe there needs to be a tool just for the purpose of freeing up space that calculates the best options for you.
A tree UI is actually a phenomenal idea. Whenever I tried to think of a solution to this problem before, the best I could come up with was usually a program that would let you simulate different filesystem operations (perhaps it would actually run the real operations in-kernel, but in a “fake” ZFS txg that was marked to never be committed to disk?) and then interrogate the results. You’d be guessing and checking at the different combinations, but at least you’d actually get a solid answer.
The problem with having a tool to calculate the “best” way is that what’s “best” is incredibly subjective. Maybe I have a few snapshots taking up a large amount of space, but those backups are incredibly valuable to me so I’d rather free up smaller amounts of space by destroying many more smaller but less important snapshots. I really do think a tree UI would work well though, especially if it had a good filtering system - that would help alleviate the power set explosion problem.
You’re absolutely correct that swapping has a role to play. I just don’t think it’s used much in high-performance, low-latency systems - like web servers. Also: the disk cache is my hero <3.
The tradeoff between optimization and predictability is a good point. Another example I see in my work is how much more complexity caching adds to the system. Now, you have to deal with staleness and another point of failure. There’s even a more subtle issue with positive and negative caching. If someone sends too many uncached requests, you can overload your database, no matter your caching setup.
Yeah, this is a great point. I think the predictability problem in both our examples is closely related to capacity planning. Ideally you’re capacity planning for the worst case scenario - loads of uncacheable requests, or lots of undeduplicatable/incompressible/etc. data to store. But if you can handle that worst case scenario, why bother optimizing at all?
I think all this optimization isn’t really buying you additional performance or additional storage space, which is how we often think about it. It’s buying you the ability to gamble on not hitting that worst case scenario. The reward for gambling is “free” perf/storage wins… but you’re still gambling. So it’s not actually free because you’re paying for it in risk.
Side note: we love the disk cache! <3
Great article!
Thank you!
I just searched “virtual memory sparse array” and found this real life example of this being useful with numpy: https://stackoverflow.com/a/51763775/1790085
Sounds like a really interesting paper!
“We learn about swapping as one of the primary reasons for using virtual memory in college, but then in the real world it’s essentially not used at all.”
I’ll point out that the other, possibly more important feature of virtual memory is memory permissions. The efficiency benefits of physical memory sharing are also significant (the canonical example is sharing libc between every process.) Virtual memory is also important for implementing copy-on-write semantics; sometimes the kernel needs to write to userspace memory and relies on a page fault to tell if this is a CoW page.
I’ll have to see if the paper talks about replacing these.
They propose hardware support for memory permissions - extra metadata for each memory location, with the hardware doing checks before allowing access. They argue that this can still be a net win given how much chip space is given to virtual memory infrastructure.
It should be possible to share physical memory without virtual memory, no? A process simply gets permission to read shared code segments, like libc code. Of course, it would need its own copy of libc global state. There might need to be an extra bit of indirection from libc code to the global state if the physical addresses of the state may vary across programs (they could be in the same place under virtual memory); see the sketch below.
Since they argue swapping could be implemented at the application layer, perhaps copy-on-write could be too?
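On the sharing point above, here is a minimal sketch (my own illustration, not from the paper) of what that extra indirection could look like: the shared library code never hard-codes the address of its mutable globals, and instead goes through one per-process pointer that a hypothetical loader fills in at startup, loosely in the spirit of a GOT or a TLS base register. The names (my_errno_location, state_base) are made up.

```c
/* Hypothetical sketch: library code that is shared by physical address must
 * not bake in the address of its mutable globals, because every process has
 * its own copy of that state. Instead, each access goes through one
 * per-process pointer that the (hypothetical) loader sets at startup. */
#include <stdio.h>
#include <stdlib.h>

struct libc_state {
    int errno_value;
    /* ... other per-process globals ... */
};

/* Per-process data; only the "shared" code below is common to all processes. */
static struct libc_state *state_base;

/* "Shared" library code: one extra indirection per access to global state. */
int *my_errno_location(void) {
    return &state_base->errno_value;
}

int main(void) {
    /* Stand-in for the loader: allocate and register this process's state. */
    state_base = calloc(1, sizeof(*state_base));
    if (!state_base) return 1;

    *my_errno_location() = 42;
    printf("errno_value = %d\n", state_base->errno_value);

    free(state_base);
    return 0;
}
```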
“people actually use virtual memory to implement sparse arrays”
I think sometimes but not often. If you only use virtual memory for your sparse array, you don’t know which entries are set and which aren’t. Without knowing that, you can’t skip reading and multiplying the unset entries.
IIRC Linux has a flag for turning off the checks in mmap() that normally prevent you from allocating silly numbers of pages in one go. The description of it says it’s there for some scientific programs that wanted unlimited overcommit.
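For what it’s worth, here is a rough sketch of that approach (my own example; MAP_NORESERVE may or may not be the exact flag the parent comment has in mind): on 64-bit Linux, an anonymous MAP_NORESERVE mapping reserves a huge range of virtual address space up front, and physical pages are only allocated for the 4 KiB pages you actually touch. Whether the reservation succeeds still depends on the system’s overcommit settings.

```c
/* Sketch of a "sparse array" backed purely by overcommitted virtual memory.
 * Assumes 64-bit Linux; MAP_NORESERVE asks the kernel not to reserve swap
 * for the mapping, so only pages we actually write get physical backing. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t nelems = (size_t)1 << 37;            /* 2^37 doubles = 1 TiB of address space */
    size_t nbytes = nelems * sizeof(double);

    double *a = mmap(NULL, nbytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (a == MAP_FAILED) {
        perror("mmap");                         /* e.g. if overcommit is disabled */
        return 1;
    }

    /* Each of these writes faults in a single ~4 KiB page, not the whole array. */
    a[0] = 1.0;
    a[123456789] = 2.0;
    a[nelems - 1] = 3.0;

    printf("%.1f %.1f %.1f\n", a[0], a[123456789], a[nelems - 1]);

    munmap(a, nbytes);
    return 0;
}
```

The trade-off the next comment describes still applies: a stored value that sits far from any other still costs a whole page.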
The memory overhead of potentially having an entire page allocated for a single entry, when the entries are spread out, may be unwelcome. On the other hand sometimes people work with sparse matrices that have large contiguous dense chunks embedded in a colossal sea of zeroes.
e.g. scipy has several completely separate representations available for sparse matrices https://docs.scipy.org/doc/scipy/reference/sparse.html
Dense chunks in an ocean of zeros would be a good case for using virtual memory for sparse arrays. I hadn’t thought of that.
FWIW I don’t mean to imply that virtual memory is necessarily a great way to implement that, just that the “entire page for one number” thing doesn’t bite you so hard when your matrices look like that.
I think you’re still likely to benefit from a representation where you write down where each dense block is + a dense array holding all the entries.
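A minimal sketch of that kind of representation (my own illustration; the names and layout are made up): each dense block records where it starts in the logical vector, and all stored entries live back to back in one packed array, so operations only ever touch stored values.

```c
/* Sketch of a block-sparse vector: a list of dense blocks (start index and
 * length) plus one packed array holding every stored entry. Anything not
 * covered by a block is an implicit zero and is never read or multiplied. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    size_t start;   /* index in the logical vector where this block begins */
    size_t len;     /* number of consecutive stored entries */
    size_t offset;  /* where this block's entries start in `values` */
} block_t;

typedef struct {
    size_t   logical_len;  /* length of the full (mostly zero) vector */
    size_t   nblocks;
    block_t *blocks;       /* sorted by start, non-overlapping */
    double  *values;       /* all stored entries, packed back to back */
} block_sparse_vec;

/* Dot product with a dense vector: skips the sea of zeroes entirely. */
static double dot(const block_sparse_vec *x, const double *y) {
    double sum = 0.0;
    for (size_t b = 0; b < x->nblocks; b++) {
        const block_t *blk = &x->blocks[b];
        for (size_t i = 0; i < blk->len; i++)
            sum += x->values[blk->offset + i] * y[blk->start + i];
    }
    return sum;
}

int main(void) {
    /* Logical length 1000, with a single dense block of 4 entries at index 100. */
    double vals[] = {1.0, 2.0, 3.0, 4.0};
    block_t blk = { .start = 100, .len = 4, .offset = 0 };
    block_sparse_vec x = { .logical_len = 1000, .nblocks = 1,
                           .blocks = &blk, .values = vals };

    double *y = malloc(1000 * sizeof(double));
    if (!y) return 1;
    for (size_t i = 0; i < 1000; i++) y[i] = 1.0;

    printf("dot = %.1f\n", dot(&x, y));   /* 1+2+3+4 = 10.0 */
    free(y);
    return 0;
}
```

This is loosely the same idea as the block-sparse (BSR) format in the scipy docs linked above, which uses fixed-size blocks.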
This is a fascinating topic, and my mind can’t get past the illustration you made at the top. It’s really poignant: Is userspace a complex arrangement of creditors, with the kernel sometimes so deep in debt that it has to default (aka, invoke the OOMkiller)? Is swap space the equivalent of an actual credit default swap?
Thanks for noticing the picture! :D No one has commented on it much.
Yes, I think there are so many fascinating parallels from this to finance.