Over the holidays I got a bee in my bonnet to write a memory manager for interpreters, with a garbage collector. (I’ve somehow never implemented a GC before.) Partly I was remembering a Forth-like interpreter called Tails I began a few years ago, whose memory management was kind of an awkward afterthought.
I also wanted to explore some ideas I’d been having for a while, like relative pointers. I used these in a structured binary data format called Fleece that’s used by Couchbase, but I thought they’d be useful as a complete replacement for “real” absolute pointers in a live memory heap. Besides being half the size, they also have the cool property that the heap is location-independent, so you can persist everything just by writing the whole heap to disk and reading it back. You can even page it in on demand with mmap. This is convenient for “image” based environments like Smalltalk and Lisp.
(One could also use a smol_world heap as a data format. In the one test I’ve done, it’s about 20% larger than JSON, but it’s a lot more flexible in that it allows full object graphs, you can use it without parsing, and even mutate it in place. But to make that practical, there needs to be a fast way to pre-scan the data and detect any invalid objects or pointers; I haven’t figured that out yet.)
I’ve been having a lot of fun with this little project, and I’d love to hear any feedback or ideas.
Interesting! It makes me think of http://iu-parfunc.github.io/gibbon/
Very cool! I’d be interested in more details on how the relative pointers went. I remember the Jai language talked about that a long time ago, but not sure if they still have it.
I added this project to my wiki page, which includes the V8 compressed pointer page, the paper that describes the Gibbon compiler, etc:
https://github.com/oilshell/oil/wiki/Compact-AST-Representation (which is also about “compressed pointers”)
I’ve also been thinking about this for C++ programs, not necessarily interpreters, which I gradually realized is a significantly different problem.
One issue is whether you have to root all references to an object, or just one. Normally with Cheney you have to root all references, and I found that to be troublesome in practice.
But if all your references are 32-bit indices, then only the real address has to be updated when you GC, not the index.
That is, I think that “double indirection” could make it more natural to use in C++ programs. And I think there could be a type safe way of wrapping existing APIs, e.g. for every type Foo*, you could have a Foo32 value type of 4 bytes with the same methods, which makes the code look nicer.
I guess the way to test it is to make a programming language from your heap :)
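Such a wrapper could be quite thin. Here’s a minimal sketch of the double-indirection idea, assuming a global object table; Foo32 and g_fooTable are illustrative names, not anything from smol_world:

```cpp
#include <cstdint>
#include <vector>

struct Foo { int x = 0; void poke() { ++x; } };

// Hypothetical global object table: slot i holds the current real
// address of object i. A moving GC rewrites only this table; the
// 32-bit indices held by client code never change.
inline std::vector<Foo*> g_fooTable;

// A 4-byte value type standing in for Foo*. Every dereference costs
// one extra load through the table (the "double indirection").
class Foo32 {
public:
    explicit Foo32(uint32_t index) : _index(index) {}
    Foo* operator->() const { return g_fooTable[_index]; }
    Foo& operator*() const  { return *g_fooTable[_index]; }
private:
    uint32_t _index;
};
static_assert(sizeof(Foo32) == 4, "half the size of a native pointer");
```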
The details of smol pointers are in Val.hh :) But the key part is, as I said, deleting the copy constructor and making sure to pass them around as references. They don’t actually get passed around much; usually they get implicitly converted to Value, which is a 64-bit equivalent that holds a real pointer.
You also have to watch out for using memcpy to copy/move any region containing Vals; it breaks them. Vals have to be moved explicitly, using operator=. This came up in the GC implementation. The worst part is the way the Cheney algorithm leaves a forwarding address behind after an object is moved to the new heap: a relative pointer doesn’t work there because the two heaps might be more than 2GB apart…
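For anyone who hasn’t opened Val.hh: a self-relative pointer stores the distance from its own address to the target, so copying its raw bytes to a new address silently retargets it. A minimal sketch of the idea, with the real Val’s tag bits and null handling left out:

```cpp
#include <cstdint>

// A self-relative ("smol") pointer: stores (target address - own address).
// Because every offset is internal to the heap, the heap stays
// location-independent: it can be written to disk or mmapped as-is.
class RelPtr {
public:
    RelPtr(const void* target)      { *this = target; }
    RelPtr(const RelPtr&) = delete; // a byte-for-byte copy would corrupt the offset

    // Assignment recomputes the offset relative to *this object's* address.
    // This is why Vals must be assigned, never memcpy'd: memcpy preserves
    // the old offset even though the pointer now lives at a new address.
    RelPtr& operator=(const void* target) {
        _offset = int32_t(reinterpret_cast<std::intptr_t>(target)
                          - reinterpret_cast<std::intptr_t>(this));
        return *this;
    }
    RelPtr& operator=(const RelPtr& other) { return *this = other.get(); }

    void* get() const {
        return reinterpret_cast<void*>(
            reinterpret_cast<std::intptr_t>(this) + _offset);
    }

private:
    int32_t _offset;  // +/-2GB reach, hence the forwarding-address trouble
                      // between two heaps that may be far apart
};
```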
The memcpy() issue is interesting … So then regarding serialization, you can only serialize an entire Heap?
I was thinking it would be nice to reuse the GC graph walking to just copy part of the graph, not necessarily the whole thing.
I think there is a difference between the temporary data a program builds up in its own memory, and the interface/data that it exports to another process or thread. I guess to really tell, there needs to be a language built around it.
(Emacs apparently does something similar and hacky with its allocator, to save images – https://lwn.net/Articles/707615/ )
The design I was experimenting with involved the global base address, and indices rather than relative pointers.
I don’t think the global is bad, because a garbage-collected heap is really a global anyway … the Oil interpreter has just a few globals: the GC heap, signal handlers, and stdout/stderr objects.
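In code the base-plus-index scheme is about as simple as it gets; unlike a self-relative pointer, decoding needs the global base but not the reference’s own address (names here are illustrative):

```cpp
#include <cstdint>

// One global base address; 32-bit indices are byte offsets from it.
// Unlike a self-relative pointer, a reference can be memcpy'd freely,
// since decoding never looks at the reference's own address.
extern char* g_heapBase;  // illustrative name

inline void* deref(uint32_t index) {
    return g_heapBase + index;  // or (index << shift) with aligned allocations
}
```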
This reminds me a lot of the Microvium GC, which is great for tiny systems. The fact that it doesn’t reuse the backing allocation (when it copies, it copies to new allocations from the system allocator) has some very nice properties with CHERI temporal safety too.
I love this. I’m currently working on something similar and this came at the perfect time. Do you have any more sources on alignment? I’m not super informed but I’m curious since most other sources seem to make a big deal out of alignment (e.g. here).
Another reason to align the heap would be to support >1GB sizes with the same relative pointer, e.g. you could do 4GB with 4-byte alignment. But then I guess it wouldn’t be small anymore.
I do not have any other references on alignment; if you find any, please post them to lobste.rs!
Adding alignment does expand the reach of pointers. It would make even-smaller pointers more practical: for instance you could access 16MB instead of 4MB with 24-bit values (assuming 1 bit is still used for a tag.)
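Concretely, that works by storing offsets in units of the alignment instead of bytes, so an n-bit field spans (2^n × alignment) bytes. A sketch, with an example shift value:

```cpp
#include <cassert>
#include <cstdint>

// 23 usable bits (24 minus a tag bit) at 2-byte granularity reach 16MB;
// 30 usable bits at 4-byte granularity would reach 4GB.
constexpr unsigned kAlignShift = 1;  // 2-byte alignment (example value)

uint32_t encodeOffset(uint32_t byteOffset) {
    assert((byteOffset & ((1u << kAlignShift) - 1)) == 0);  // must be aligned
    return byteOffset >> kAlignShift;
}

uint32_t decodeOffset(uint32_t stored) {
    return stored << kAlignShift;
}
```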
Recent versions of my qp-trie data structure use 32-bit pointers. (I’m calling them references, which is less confusing because I’m writing in C).
Instead of a contiguous arena, I allocate memory in a collection of chunks, which I somewhat arbitrarily made 12KB in size (enough for 1024 nodes of 12 bytes each). A reference is a combination of a chunk number and an offset within the chunk.
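Decoding such a reference would look roughly like this; the field split and names are my guesses, not taken from the actual qp-trie code:

```cpp
#include <cstdint>

// Sketch of a chunk-based 32-bit reference. With 12KB chunks of 12-byte
// nodes there are 1024 slots per chunk, so 10 bits of offset leave 22
// bits of chunk number: about 4M chunks, or ~48GB of nodes.
constexpr uint32_t kChunkNodes = 1024;

struct Node;                 // 12-byte trie node, defined elsewhere
extern Node* gChunkBase[];   // chunk number -> chunk's current address

Node* deref(uint32_t ref) {
    uint32_t chunk  = ref / kChunkNodes;  // high bits: which chunk
    uint32_t offset = ref % kChunkNodes;  // low bits: node slot within it
    return gChunkBase[chunk] + offset;
}
```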
I have a simple GC which scans the whole trie and evacuates nodes from chunks that are underused, which is generally able to avoid copying everything, though the trie walk is still more expensive than I would like.
Changing from native pointers to 32-bit references made things measurably faster, presumably due to the lower cache footprint; possibly also because of better TLB locality. (When it used the native allocator, there were many small allocations of lots of different sizes, and allocators tend to scatter different sized allocations across different parts of the address space.)
The chunk-based memory layout also made it easier to support copy-on-write mutation, because I don’t need to keep track of when many little allocations are no longer in use: I just keep count of how much space is used in each chunk.
And copy-on-write with a single writer makes it possible to have many concurrent readers with negligible overhead.
Cool! I will check it out. I love tries.
Bagwell’s original paper on the HAMT (Hash-Array Mapped Trie) also goes into detail on using a custom allocator for the trie nodes. Makes sense, since these are being allocated and resized on every insertion. But IIRC he was still using regular pointers. (I think he did say he used a 32-bit CPU.)
Wow, I want this in Rust, for AccessKit. Having all accessibility nodes in their own heap with compressed pointers could further reduce memory usage.
I seem to recall there are CPUs where these unaligned memory accesses will cause a fault. But apparently this code works on x86_64 and aarch64? I would want more information about what architectures and CPUs this is portable to before using it in a project.
My source (it’s in the README) is this blog post by Lemire wherein he benchmarked unaligned access on a lot of circa-2012 CPUs; there are some later comments too.
I personally have only run this code on an M1 MacBook Pro, so far. The last time I can recall unaligned memory access actually crashing was back on 32-bit ARM CPUs, i.e. early iOS devices.
Last time I checked, unaligned accesses on x86 that spanned a page boundary hit a very slow microcode path, but ones that were in the same page were slow only if one of the lines missed in the cache.
This is more complicated for OS or embedded code. I think Arm traps on unaligned accesses to uncached memory. On a system without caches, an unaligned access needs to be lowered to two word-sized read-modify-writes, which is very painful, but given that you are calling 32-bit pointers ‘smol’, I’m assuming systems with a few tens of KiBs of tightly coupled SRAM are probably out of scope.
Definitely! This is aimed at 64-bit CPUs with pipelines and multilevel caches and all that.
On a 32-bit microcontroller I don’t know what the best options are for smolifying pointers. 24-bit implies unaligned access, plus awkward multiplication/division by 3 for arrays. 16-bit is verging on teensy, though if you also align blocks you can stretch to 128K or 256K, which could be good enough. And there’s always the good-old 16-bit indirect object table a la Smalltalk-80.
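For the 16-bit-with-aligned-blocks option the arithmetic is simple; a sketch, where the base pointer and shift are assumptions:

```cpp
#include <cstdint>

// A 16-bit "pointer" on a 32-bit MCU, scaled by the block alignment:
// 2-byte blocks reach 128K, 4-byte blocks reach 256K.
extern uint8_t* gHeapBase;
constexpr unsigned kBlockShift = 2;  // 4-byte blocks: 64K slots * 4 = 256K

void* deref16(uint16_t ref) {
    return gHeapBase + (uint32_t(ref) << kBlockShift);
}
```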
Microvium uses 16-bit pointers and restricts you to a 64 KiB heap per JavaScript VM. We’re using it and can fit quite a few JS VMs on a prototype with 256 KiB of SRAM. Their semi-space compacting model works well with our hardware temporal safety approach because they reallocate the backing store when they compact, so every pointer to the JS heap is either to a live (possibly unreachable, but not collected) JavaScript object, or to a deallocated allocation. We catch the second case in hardware, so any dangling C -> JavaScript pointers trap.