1. 44
    1. 19

      Author here! The title says 2.3x faster, but most runs showed much higher multiples (usually 10-14x faster than RC). I decided to use the least impressive result (2.3x faster than RC) until we can figure out how to benchmark more consistently.

      We found this by wondering what really makes Rust code fast. It wasn’t the borrow checker, since the borrow checker is just a restriction, not a particular code pattern. But Rust’s idioms and the borrow checker influence us into patterns that happen to make code fast, such as:

      • Using better allocation strategies (most often Vec)
      • Isolating memory between threads (except for message passing and mutexes)
      • Using generational indices instead of RC (and instead of GC, of course)

      So we made a language that does these things directly, instead of forcing the user to do them via a borrow checker. Through some leaps of madness, we found a way to integrate generational indices directly into malloc (for now, a layer above malloc), as mentioned in the article. The result was quite surprising.
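
      Here’s a minimal sketch of the shape of the technique in C++ (illustrative only, not Vale’s actual allocator): each allocation is prefixed with a generation number, a reference remembers the generation it was created at, and every dereference checks that they still match. Freeing is just an increment, and the check is a load and a compare.

      #include <cassert>
      #include <cstddef>
      #include <cstdint>
      #include <cstdlib>

      // Every allocation is prefixed with a generation number.
      struct Header { uint64_t generation; };

      void* genAlloc(size_t size) {
        // A real allocator would reuse a slot and keep its old generation;
        // this sketch just allocates fresh.
        Header* h = (Header*)malloc(sizeof(Header) + size);
        h->generation = 0;
        return h + 1;
      }

      void genFree(void* obj) {
        // Bump the generation so every outstanding reference becomes
        // detectably stale; the slot is recycled, not unmapped.
        ((Header*)obj - 1)->generation++;
      }

      // A generational reference: a pointer plus the generation it saw.
      struct GenRef { void* ptr; uint64_t rememberedGen; };

      GenRef makeRef(void* obj) {
        return GenRef{obj, ((Header*)obj - 1)->generation};
      }

      void* deref(GenRef ref) {
        // The liveness check that happens on dereference.
        assert(((Header*)ref.ptr - 1)->generation == ref.rememberedGen);
        return ref.ptr;
      }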

      We’re pretty excited about what doors this opens. The next design (hybrid-generational-memory) adds static analysis like an “automatic borrow checker”, and uses regions to enable better allocation strategies. If all goes well, maybe it’ll be as fast as Rust itself, without the borrow checker’s complexity cost. We’ll see!

      1. 4

        Great results! Do you see any limitations with this approach?

        1. 4

          Hey soc! No user-visible limitations per se, but off the top of my head, there are two things that could cause some overhead in the final design:

          • Releasing memory back to the OS is doable with mmap, but it could be tricky and have unforeseen consequences; I’m sure there are mmap caveats that I don’t know of yet.
          • Some objects that would have been on the normal stack are now pushed onto “side stacks” (which have generation numbers) that grow and shrink like the normal stack. The minor problem here is that the main stack then needs pointers to the locals on the side stack, instead of just being able to subtract a constant from %rbp. That load will probably be fast, but those pointers on the main stack use up some space, and therefore some cache.

          Other than those potential concerns, the approach seems pretty solid so far!

          1. 6

            I can see lots of potential issues, though it’s difficult to say how they’d impact real applications.

            • you say “releasing memory back to the OS is doable with mmap”, but I don’t see how that could be right. The whole scheme (of checking the generation counter to ensure it’s correct, on each dereference) seems to rely on the allocation (and generation count) persisting indefinitely.
            • while it may be faster than reference counting, one obvious drawback is that it doesn’t actually keep alive an object to which a reference is still held. That is, reference counting keeps the object alive while it is referenced; this scheme instead allows detecting when a reference is no longer valid. They are slightly different use cases, and the difference is important. Real reference counting allows objects to be kept alive solely by reference-counting links, which allows for dynamically allocated circular data structures, for example.
            • the cost of checking on dereference is unlikely to always be irrelevant.
            • the inability to ever truly free an allocated object block (i.e. to re-use it for an allocation of a different size) could lead to memory fragmentation, which will be problematic for some applications. Breaking allocations into fixed-size buckets will also lead to internal fragmentation, which is not without cost either. And further, how does this work for very large allocations?

            I’d be wary of writing off those concerns when comparing to reference counting.

            1. 3

              Thanks for the reply! Great questions.

              • Let’s say we’ve deallocated all the objects on 5 particular pages. We would re-map all of these pages’ virtual address space to a central page filled with 0xFF, which will (correctly) fail all generation checks, and we would never use that address space again (well actually, we can, with a certain other trick). There’s a rough sketch of this after the list.
              • If one requires shared ownership (keeping the object alive), then this approach would be a drawback. However, we rarely actually require shared ownership. I believe most C++ developers would agree, it’s very rare that unique_ptr is not sufficient for something. In my experience using C++ in industry and Vale at home, I’ve never come across a need for shared ownership (though we can contrive a rare case where it would be convenient, if we try hard). You can read more about this phenomenon at https://vale.dev/blog/raii-next-steps#emerging-patterns if you’re curious.
              • You’re right, though it’s only a minor concern of mine because the stack is very likely to be hot in cache. I could be wrong about this impact, so I’m keeping an eye on it. EDIT: I think I actually misunderstood your remark, did I imply checking on dereference is irrelevant somewhere?
              • Mimalloc and jemalloc segregate by size class already, so we’re not too worried about this. There are also two mitigations:
                • Vale makes very good use of regions, which are wonderful for avoiding fragmentation; you can read more about this at https://vale.dev/blog/zero-cost-refs-regions.
                • We could merge two partially-filled pages, using mmap, effectively compacting memory. We’ll wait until fragmentation proves to be a problem before we implement this though.
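
              For the first bullet, here’s a rough Linux-only sketch of the remapping trick (hypothetical, not our actual allocator; error handling omitted): one shared physical page full of 0xFF gets mapped over every freed page, so the address space stays valid but every generation check there fails.

              #include <sys/mman.h>
              #include <unistd.h>
              #include <cstring>

              static const size_t kPage = 4096;

              // Create the single shared "poison" page, filled with 0xFF.
              static int makePoisonPage() {
                int fd = memfd_create("gen_poison", 0);
                ftruncate(fd, kPage);
                void* p = mmap(nullptr, kPage, PROT_WRITE, MAP_SHARED, fd, 0);
                memset(p, 0xFF, kPage);
                munmap(p, kPage);
                return fd;
              }

              // Map the poison page over nPages freed pages starting at addr.
              // MAP_FIXED atomically replaces the old mappings (releasing their
              // physical frames), and any generation read there now returns
              // all-ones, failing the check.
              static void poisonRange(void* addr, size_t nPages, int poisonFd) {
                for (size_t i = 0; i < nPages; i++)
                  mmap((char*)addr + i * kPage, kPage, PROT_READ,
                       MAP_SHARED | MAP_FIXED, poisonFd, 0);
              }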
              1. 4

                Thanks for responding.

                We could merge two partially-filled pages, using mmap, effectively compacting memory. We’ll wait until fragmentation proves to be a problem before we implement this though.

                I am dubious about this. As well as adding complexity in the allocator, you could only safely merge two pages if the generation count of all live objects in each page was larger than the generation count of the corresponding dead object in the other page. And even if that weren’t the case, finding merge-able pages would already be non-trivial to do efficiently (plus, messing with page tables has a performance impact especially on multi-threaded applications, and you can’t atomically copy-and-remap, so you’d need some sort of stop-the-world strategy or a write barrier which again is not free).

                I believe most C++ developers would agree, it’s very rare that unique_ptr is not sufficient for something

                I’m primarily a C++ developer and I do agree with that. But if you are comparing against reference-counting, it’s a different story, because the ability to manage cyclic data structures is one of the key benefits of RC.

                I.e. I agree that it’s an uncommon use case, but when you’re directly comparing against reference counting I think you should be clear that reference counting can do something that generational references can’t.

                Mimalloc and jemalloc segregate by size class already, so we’re not too worried about this

                Ok, but I think you’ve overlooked what was probably the major concern I raised: how do you deal with very large objects, e.g. large arrays? Using buckets for such allocations will cause major internal fragmentation. If I remember correctly, jemalloc (at least) stops using segregated-by-size buckets beyond a certain size.

                I guess, again, it’s something of a corner case; but it’s something to consider regardless.

                But - hey, it’s a neat enough idea, and maybe it will prove to be a pretty good strategy overall. I’ll be interested to see how it pans out in the long run!

                1. 3

                  Thanks for the great comment! What do you mean by “manage cyclic data structures”? Also, if one wants to implement reference counting in Vale, they actually can; the programmer would give an object a counter and an owning reference to itself, and make an Rc-like handle. (Yes, this possibility means that cyclical references are technically possible in Vale, but they’re basically impossible to create accidentally, which is a nice balance, I think.) It’s more expensive, but I’m fine with a very niche need (IMO) taking a bit more memory.

                  Thanks for pointing out the large arrays concern, I forgot to address that.

                  Arrays made of non-inline objects or inline immutables (chars, ints, or inline immutable structs) are represented as a fat pointer (sketched below), which contains:

                  • A pointer into an allocation that came from regular malloc, with no generations.
                  • A pointer to an 8-byte allocation that holds only the generation. This generation is separate, but serves as the sole generation for the entire array.
                  • (of course) The length.
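
                  In C++ terms, the layout would be roughly this (a hypothetical rendering, not Vale’s actual definitions):

                  #include <cstddef>
                  #include <cstdint>

                  // Fat pointer for an array of non-inline elements. The buffer itself
                  // has no generations; one separately-allocated 8-byte generation
                  // covers the whole array.
                  struct GenArrayRef {
                    void*     elements;    // from regular malloc, no generations
                    uint64_t* generation;  // sole generation for the entire array
                    size_t    length;
                  };

                  // The usual check, done once for the whole array:
                  bool stillAlive(const GenArrayRef& ref, uint64_t rememberedGen) {
                    return *ref.generation == rememberedGen;
                  }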

                  The other case, arrays that are inline mutables, could either:

                  • Look for a contiguous chunk of elements of the desired length, in the gaps of the existing allocations, similar to some malloc implementations today.
                  • Allocate a new set of pages for this array.
                    • This would seem to explode our use of virtual memory (which is a finite resource here), but we can use a trick I hinted at before (but didn’t explain, apologies). When we release a page to the OS, we can look for the “maximum generation” in all the entries for the page, and remember that on the side. Later, we can re-use that address space by going through and initializing all the generations to that previous “maximum generation” number.

                  Regarding your first point, about being able to merge the entries only if all of one page’s generations are greater than the other’s, you’re totally right. This conundrum felt familiar, so I had to look back at my notes to figure out what mitigation I had for this, and I have written down some insane, feverish ramblings:

                  • We can designate every page as only using either “even” or “odd” or “any” generations. Pages start as even or odd, and we can only merge an even- and an odd-generationed page together, to form an “any generation” page. We would also set a “minimum” generation (which is the maximum of both pages’ generations at merge-time), and any new generations would be at least that.
                    • We could extend this logic to more than just two (even and odd) partitions. Maybe 10? Maybe with some cleverness with prime numbers, any amount?

                  Regarding finding mergeable pages, yeah, that’s pretty difficult. We were exploring this the other day in the discord, and came up with an interesting heuristic: if A is the number of allocations in the first page, and B in the second, and there are 256 slots in each, they will most likely be able to merge if A * B < 44. It seems pretty low, but maybe it’s not. In a given page, most objects are probably short-lived. But even if it doesn’t pan out, that’s okay, since merging partial pages was just a nice-to-have; most allocators don’t do that today anyway.

                  Hope that helps, and let me know if there’s any other shortcomings you can find! (and let me know if you see any improvements, too)

                  1. 1

                    You’ve obviously thought about this very thoroughly, which is great. I’ve gotten a lot of interesting details out of reading your responses, thanks.

                    What do you mean by “manage cyclic data structures”?

                    I meant, maintain a heap-allocated cyclic data structure, a structure whose nodes (potentially) contain reference cycles, such as a graph. Each node would be part of the graph and wouldn’t necessarily have an owner outside the graph, but it’s not possible (or at least, it’s difficult) to structure it so that each node is owned by another node. Generational references alone won’t keep graph nodes alive, and you don’t have owning references for all nodes, so what do you do? (there are solutions, but I don’t think there’s any as nice as just having reference counting references, though even those come with the problem that you risk memory leaks if you do have reference cycles and don’t take steps to break them).
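
                    For illustration, one such solution in C++: a single container owns every node outright, and edges are plain indices rather than owning references, so cycles among the edges are harmless (though no individual node can outlive the graph):

                    #include <cstddef>
                    #include <vector>

                    struct Node {
                      std::vector<size_t> edges;  // indices into Graph::nodes; cycles are fine
                      int payload = 0;
                    };

                    struct Graph {
                      std::vector<Node> nodes;  // the graph singly owns every node
                    };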

                    Re multithreading: IIRC, each page is managed by one thread (if an allocation travels to another thread, dropping the owning ref will send it back to its original thread for freeing)

                    My main point about multithreading was that, if you have references (of any kind) to an object from multiple threads, and at least one of those references allows mutation of the object, and you want to relocate that object to another physical page (eg as part of the page merging you mentioned), you have to make sure that all threads with mutation-capable references don’t write to the object just after you copied it to the new page but just before you updated the page tables (via mmap). So you’d need a write barrier of some kind, or you need to suspend all threads (that might have a mutation-capable reference).

                    Now, if only the owning reference can do mutation, maybe that’s not such a problem. Apologies, I’m not really familiar with the Vale language, so I don’t know if this is the case.

                    And of course if multiple threads can mutate an object you may well have a mutex built into the object anyway, in which case the solution could be just to lock it before copying it to the new page and unlocking after (at the risk of allowing another thread to stall your memory management by holding the mutex).

                    Regarding your first point, about being able to merge the entries only if all of one page’s generations are greater than the other’s, you’re totally right

                    Actually I may be missing something, but that’s not quite what I meant. As I see it, it’s only necessary that the live object in a particular “slot” in one page has a greater generation number than the dead object in the same slot in the other page (this must be true for every slot). If both slots contained dead objects you would preserve the highest generation number of the two, but I don’t see why that would need to be from the same page every time.

                    Eg merging two pages like:

                    | page #1    | page #2    |
                    |------------+------------|
                    | live (3)   | dead (2)   |
                    | dead (4)   | live (7)   |
                    | dead (4)   | dead (7)   |
                    | dead (7)   | dead (4)   |
                    

                    Would produce:

                    | Merged     |
                    |------------|
                    | live (3)   |
                    | live (7)   |
                    | dead (7)   |
                    | dead (7)   |
                    

                    In each case, live references still refer to the same live object with the same generation count, and dead references now refer to a dead slot with a same-or-higher generation count or to a live slot with a higher generation count (and so they can be identified as dead references).

                    But even if it doesn’t pan out, that’s okay, since merging partial pages was just a nice-to-have;

                    Yes, fair enough. I think it would probably be do-able but at a cost to performance. I guess you could delay at least some of that cost until merge-time, but overall I suspect you’d incur a massive increase in allocator complexity for a fairly minimal benefit.

                    Anyway, thanks, this has been enlightening! You should consider writing up the whole thing in detail once the implementation is complete (I’m assuming it’s not quite there yet). A lot of the details in the discussion we’ve had are really interesting.

                2. 3

                  Seems I missed some parts of your post!

                  Re multithreading: IIRC, each page is managed by one thread (if an allocation travels to another thread, dropping the owning ref will send it back to its original thread for freeing), which lets us do a lot more operations without locks or atomics, and we can batch them to reduce write barrier waits. I haven’t done much more thinking in this area, so any input you have here is welcome!

                  Re mmap expense: I think we do the same number of page manipulations as existing strategies, so it should be the same. Unless a remapping is a more expensive call than plain releasing to the OS, at which point we’ll have to adjust that optimization.

                  Re large objects: for example if an object is larger than the largest size class (128B) then the struct will use multiple entries in the 128B* bucket, leaving bytes 128-136, 256-264, etc untouched.

                  *We can actually put it in whichever bucket will minimize wasted space.

                  I appreciate your comments, you’re exploring this much further than anyone else has!

          2. 3

            I’ve seen this behaviour in our own workloads: at some point there is a very rare spike in usage (GBs), and then the work is finished and all the memory is freed. However, the process continues to own that memory, since “free” does not hand it back to the OS. IMO, this is a kind of leak.

          3. 1

            Nevermind, we just solved that second drawback =) we can group all a block’s locals into a single allocation, as if they were members of a struct. Now we only have pointers to blocks. (Also, this may accidentally enable really easy green threads.)

            1. 1

              Isn’t this how some languages implement closures and nested functions?

              1. 2

                Yes! That’s right. That’s a much better way of thinking about it.

                (Though Vale is a bit weird: it implements closures by individually “boxing” fields, and then the closure structs point at the boxes. This lets us compile to JS and JVM losslessly, and with some hints to the LLVM backend, it can be zero cost.)

    2. 9

      an object is freed when its owning reference goes out of scope. An object always has exactly one owning reference pointing to it.

      This is very, very restrictive compared to most GC (and ref-counting) implementations, so I’m not surprised it can be implemented so cheaply. (I’m not sure it’s even correct to call it garbage-collection, but that’s quibbling.)

      Rust already has single-ownership without any GC, because in this scheme it’s obvious when an object should be freed. The new stuff in Vale, if I’m reading correctly, is the weak references. These have to be nulled out when an object is freed. You’re taking a lazy approach, only detecting an obsolete weak ref the next time it’s derefed. Other weak-ref systems I know of do it with a secondary intermediate heap block to which all the weak refs point, which is itself ref-counted; or by chaining the weak refs together in a linked list so they can be traversed and nulled out at the same time the target object is freed. I don’t know how this new scheme compares.
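
      (For comparison, the intermediate-block approach is essentially what C++’s shared_ptr/weak_ptr pair does: the weak refs point at a ref-counted control block that outlives the object.)

      #include <memory>

      struct Ship { int fuel = 10; };

      void example() {
        std::weak_ptr<Ship> w;
        {
          auto s = std::make_shared<Ship>();
          w = s;  // w points at the control block, not the Ship directly
        }         // Ship destroyed here; the control block lives on for w
        if (auto s = w.lock()) { /* still alive */ }
        else { /* detected as stale, on demand */ }
      }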

      One thing that worries me about this scheme is that it can never return any memory / address space back to the OS. The process’s memory usage can only grow. This is usually seen as a problem, and I’ve got some Go programs on my dev machine I have to periodically kill because they start eating up tons of memory due to Go GC issues like this.

      1. 3

        After a few more minutes thought: chaining weak refs seems like a better approach to me, provided references are never shared between threads. That way there’s no need to store a tag/counter in the heap, and you don’t have to use a custom malloc. The weak refs do take twice as much space as a pointer, but the same is true in the generational scheme, because they have to hold the gen count.

        The downside, I suppose, is you need a write barrier that does some linked-list operations. Which might be expensive unless you use a doubly-linked list, which then makes the weak ref even bigger… Hm.
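
        Roughly like this, I suppose (a sketch of the doubly-linked variant, ignoring thread-safety):

        struct Object;

        // Each object heads an intrusive doubly-linked list of its weak refs:
        // freeing the object nulls them all out, and destroying a weak ref
        // unlinks itself in O(1).
        struct WeakRef {
          Object*  target = nullptr;
          WeakRef* prev = nullptr;
          WeakRef* next = nullptr;
        };

        struct Object {
          WeakRef* weakHead = nullptr;
          // ...payload...
        };

        void attach(WeakRef* w, Object* obj) {  // the write barrier
          w->target = obj;
          w->prev = nullptr;
          w->next = obj->weakHead;
          if (obj->weakHead) obj->weakHead->prev = w;
          obj->weakHead = w;
        }

        void detach(WeakRef* w) {  // on weak-ref destruction
          if (!w->target) return;  // target already freed
          if (w->prev) w->prev->next = w->next;
          else w->target->weakHead = w->next;
          if (w->next) w->next->prev = w->prev;
          w->target = nullptr;
        }

        void freeObject(Object* obj) {
          for (WeakRef* w = obj->weakHead; w; w = w->next) w->target = nullptr;
          // ...then release obj's memory normally...
        }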

        TL;DR: Here I am claiming I can do better than you, after spending five minutes thinking and writing no code. Whereas I’m sure you’ve spent much longer thinking and coding. In other words I’m being a typical engineer. 🤓

        1. 2

          Good thoughts! And I’m glad you’re enjoying engineering weak ref designs as much as I do =) It’s a fascinating puzzle. You might also enjoy this draft about various weak reference implementations I’m working on, at https://vale.dev/blog/surprising-weak-refs.

          The approach you mention actually works pretty well, because we can make a given object only visible to one thread at any given time (like Rust, Lobster, Pony… Vale uses the region borrow checker to guarantee this with zero cost, or we can do a recursive scan on all inter-thread messages like Pony does). The reason we chose generational references over the linked list approach is actually that the latter requires two cache misses every time we make or destroy a weak reference, whereas generational references incur 0-1. There’s also another approach we played with, where each object had a vector of “back pointers” which pointed at all the extant weak references to that object. However, it had the same problem: two cache misses (plus branching and vector-growth costs).

          I share your feelings about a custom malloc, that will be a headache to implement.

      2. 3

        Single ownership is often regarded as restrictive or difficult, but (IMO) that’s just because the other two languages that use it (Rust and C++) are difficult, for unrelated reasons (borrow checking, aliasability xor mutability, UB, unsafety, etc).

        In both Vale and Rust, it’s obvious when an object should be freed; both of them have single ownership.

        I like your description, it is a “lazy” way of doing weak refs. They don’t have to be nulled out eagerly, their targets’ nonexistence is detected on dereference. It’s not unprecedented, we do it all the time with IDs into central hash maps (if you consider these IDs as “references”, which is kind of a stretch, but I like the notion).

        When a page is empty, we actually can return it to the OS, by mapping its virtual address range to a central underlying physical page which contains all 0xFF, which will (correctly) fail all generation checks. There are even some techniques we can use for merging multiple partially filled pages, to effectively have a “compacting malloc”, which we could do if memory usage becomes problematic. There are a few more tricks up our sleeves too, but they probably won’t be necessary.

        The only environment we’re targeting that can’t use virtual memory is WASM, but they don’t have the ability to return memory to the OS anyway. I’m not entirely sure why!

        I just looked it up, and I’m shocked to find that Go doesn’t do any sort of compaction! I didn’t know there were GCs without it.

        1. 5

          I’ve used C++ for a long time, so that’s not the difficulty. And I don’t think you can really compare C++ to Rust, since C++ doesn’t require single ownership (you can just use shared_ptr or make up your own ref-counting system), nor does it enforce safety with it — if you’re the unique owner of an object, you can still hand out raw pointers to it, and simply hope that no one hangs onto such a pointer longer than its owner. Rust is pretty unique, I think, at coercing you to use single ownership.

          The problem I have is that I just don’t think in terms of single ownership. When I try to apply it in Rust, it’s kind of like when I try to use functional languages with immutable data structures — it feels constrained and artificial to me. (I’m not disparaging those styles! Just saying I’m not used to them.)

          1. 2

            I suspect what you ran into was Rust’s own unrelated problems, specifically the borrow checker’s aliasability xor mutability rule. It makes single ownership needlessly more complex and difficult than C++’s (and C++’s is already kind of weird, because we can’t have weak references to unique_ptr).

            Many years ago, a teammate challenged me to try and use shared_ptr less, and try to remake my particular design using only unique_ptr. The notion seemed super foreign to me, because I always reached for shared_ptr first. I was quite surprised to find that in every situation where I thought shared_ptr was the best choice, there was a simple solution with only unique_ptr. In the first month, I quickly discovered the clasp, and its next step (the weak reference), and once I had those, I found that they covered all situations where I previously thought I needed shared_ptr.

            (Funny story, six months later I learned that I completely misunderstood what my teammate was asking! They’d accidentally opened my eyes to an entirely new approach!)

        2. 4

          When a page is empty, we actually can return it to the OS, by mapping its virtual address range to a central underlying physical page which contains all 0xFF, which will (correctly) fail all generation checks

          Why don’t you start on generation 1 for any valid object? That way, you can use the canonical zero page, which has a bunch of fast paths in most VM subsystems. You can just MADV_DONTNEED on Linux. This also lets you take advantage of MADV_FREE if you record the maximum generation used for a chunk: you MADV_FREE the chunk, and the next time you allocate, either the OS has replaced it with 0, in which case you reinitialise it with the largest generation number recorded for that chunk, or the OS hasn’t reclaimed it and you just increment the non-zero value that you find. You’ll burn through generations a bit quicker if you do this, but in exchange you get to be more friendly to memory-constrained systems.
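
          Something like this sketch, I think (Linux-only; it treats a page as a flat array of generations for simplicity, and the maxGen bookkeeping is hypothetical):

          #include <sys/mman.h>
          #include <cstddef>
          #include <cstdint>

          const size_t kPage = 4096;

          struct PageInfo {
            uint64_t* addr;
            uint64_t  maxGen;  // highest generation ever used on this page
          };

          void releasePage(PageInfo& page) {
            // The OS may now reclaim the frame lazily; the mapping stays valid,
            // and if reclaimed, reads return zeros, which fail every generation
            // check since valid generations start at 1.
            madvise(page.addr, kPage, MADV_FREE);
          }

          void reusePage(PageInfo& page) {
            if (page.addr[0] == 0) {
              // The OS did reclaim it: reinitialize every generation to the
              // recorded maximum, so pre-release references still (correctly) fail.
              for (size_t i = 0; i < kPage / sizeof(uint64_t); i++)
                page.addr[i] = page.maxGen;
            }
            // Otherwise the old contents survived and the old generations stand.
          }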

          1. 2

            Great thought! I really wanted to do this, because it would save us 4kb, and it feels more elegant since OSs provide zero pages.

            Recall how a program’s normal stack will lazily acquire physical RAM when we hit the next page of virtual address space. We wanted to also do this with the “generational stacks” which sit next to the regular stack; we want the OS to automatically allocate those new pages, so we wouldn’t need a signal handler to lazily initialize every contained generation to 1. Because of that, we need 0 to be a valid generation, which means 0xFF has to be the never-used generation.

            Though, it occurs to me now, we could do something clever with mapping those generational stacks’ “not yet used” virtual pages to copy-on-write pages which contain 0x0000000000000001 repeated, instead of 0x00 repeated… that would let us do what you describe! How interesting, I’ll have to think on this more!

            Remembering the maximum generation to later resurrect already-freed pages is a great technique! davmac and I also talked about this a bit, and we can do it with either the 0 scheme or the 0xFF scheme.

            I’d love to learn more about these fast paths in VM subsystems, where might I find some information on those?

    3. 3

      programs dereference less than they alias and dealias

      is this true? this seems unintuitive to me, but maybe i’ve just been writing a lot of graph search algorithms recently

      1. 2

        This was true of our benchmark program, which we didn’t write with any particular pattern in mind, so I think this is typical of an average program. We were surprised at first when we ran the measurements, but it kind of makes sense. Imagine a HashMap implementation: it will be aliasing its elements all over the place, but it never dereferences them (it doesn’t know how, it’s a generic container).

        Also consider loading a reference from a local as an argument to a function call. Unless that’s the last usage of the local, that would be an alias. The function itself might never dereference it, but instead hand it on.
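
        For instance, in C++ terms, with shared_ptr standing in for naive RC (a contrived illustration):

        #include <memory>
        #include <vector>

        struct Spaceship { int fuel; };

        // Calling store(fleet, myShip) copies the handle (one RC increment,
        // i.e. an alias), yet s->fuel is never touched: no dereference at all.
        void store(std::vector<std::shared_ptr<Spaceship>>& fleet,
                   std::shared_ptr<Spaceship> s) {
          fleet.push_back(std::move(s));
        }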

    4. 3

      This isn’t truly comparable to reference counting. Reference counting “just works”, while here the programmer has to correctly identify where the object should be freed, and errors in doing that lead to crashes. Of course you might also leave figuring out where to place the free() to the compiler, but then you’ll have to fight the compiler for complex cases. This does have the minor benefit of not needing a garbage collector that would catch cyclic references, but it also puts the requirement of single-ownership on the language.

      1. 2

        I think it’s a common misconception that shared ownership (such as in GC and RC) is so important and helpful. As mentioned elsewhere, we rarely actually require shared ownership. I believe most C++ developers would agree, it’s very rare that unique_ptr is not sufficient for something. IME using C++ in industry and Vale at home, I’ve never come across a need for shared ownership (though we can contrive a rare case where it would be convenient, if we try hard). You can read more about this phenomenon at https://vale.dev/blog/raii-next-steps#emerging-patterns if you’re curious.

        I think you’re also understating the benefit of not worrying about cyclic reference leaks and not having a garbage collector, but to each his own =)

        1. 4

          If we map ownership semantics back to standard fragments of logic, then single ownership corresponds to linear logic without exponentials. But our computing hardware implements Cartesian closed logic, able to copy and delete references. This is the heart of the mismatch; it has little to do with language design and quite a bit to do with hardware design.

          Incidentally, for memory, linear logic makes sense; this is specifically because we cannot duplicate physical memory banks, and so a single-ownership model does not artificially hamper the programmer.

          I think that you haven’t seen languages where cyclic references are standard and boring. For example, in E, to create a pair which references itself infinitely:

          def x := [1, x]
          

          Because cycle-handling is built into the language, we can do comparisons with cyclic objects:

          def y := [1, y]
          x == y # true!
          

          Making functions and other higher-order objects which reference themselves is trivial:

          def f() { return f }
          

          But all of this requires a GC which does not flinch when the user decides to copy an object reference.

          It is fair to have the reaction that perhaps this is all needless dross which complicates a language, but E was intended to allow for copying of object refs between mutually-untrusting machines across a network, and in that situation, cycles are not just possible but endemic. If the language’s runtime has trouble handling local cycles in a single heap where it can examine every reference, then it becomes somewhat hopeless to try to prevent cycles in networked objects.

          1. 2

            Vale doesn’t allow untrusted deserialization of objects; I believe that scenario would (correctly) be impossible in Vale.

            I’d love to learn more about these cycles. Did E consciously set out to allow these cycles? If so, what use cases did they have in mind?

            1. 2

              E doesn’t have untrusted serialization. Rather, E was designed to allow for objects to be copied by construction. But copying by construction inherently must handle cycles both when deconstructing and reconstructing objects.

              There is a section of the old E website which explains safe serialization under mutual suspicion. The main use case is distributed garbage collection between many machines on a network.

              Shadows of this appear in other languages. In OCaml and Nix, the rec keyword is needed to create cyclic objects. In Python, the venerable and insecure pickle protocol supports pickling and unpickling cyclic objects.

        2. 3

          That’s not my experience. This morning, thinking about this thread, I instrumented the C++ RefCounted base class in the project I work on, and checked to see which subclasses never get a ref-count higher than 1 when running our test suite, i.e. are good candidates for single ownership. Only about 5% of them did, and it looked like that was just due to some limitations of the testing.

          Some of that is because of sloppiness, like accessors that return strong (+1) references when they don’t need to; or because of limitations of our smart-pointer system where a ref-count gets bumped when it doesn’t optimally need to be. But for the most part the design does use multiple ownership. I could probably refactor code to reduce that, but it would be a lot of work.

          1. 2

            This is likely also due to missing std::moves; Vale (like Rust) is move by default, and avoids an extra increment here. RVO sometimes allows us to skip a std::move, but otherwise, we have to remember them. There’s also a possibility that shared ownership in one area will force you to use shared ownership in another area; inertia, so to speak. Vale wouldn’t encounter that particular problem.

            I’d be very curious to hear an example where it seems like you need shared ownership. It’s very rare to encounter such a case in my experience, but I could be surprised.

        3. 3

          Yes, in theory most situations encountered in practice do not require shared ownership. But it doesn’t mean shared ownership isn’t useful there to make code simpler or faster to write. Like it or not, nowadays probably one of the more important things in programming languages is programming speed. Shared ownership often lets you go with a shotgun approach - giving everything references to things they need, which is very simple to write and understand. On the other hand, it could be done with a single ownership, but then the programmer has to spend extra time figuring out what should own what, when things should be freed, etc. There definitely is a place for single ownership, but I don’t think that making it the default for everything is a correct decision.

          1. 3

            A very reasonable perspective, for those that believe single ownership is that difficult. As I mentioned to snej, single ownership is often regarded as restrictive or difficult, but (IMO) that’s just because the other two languages that use it (Rust and C++) are difficult, for unrelated reasons (borrow checking, aliasability xor mutability, UB, unsafety, etc). Vale is the first language to offer single ownership without all the unrelated complexity, and shows us how easy it can be.

            That said, even though it’s not difficult, I can’t deny that it does take a little extra time and keystrokes to designate one reference as owning. It’s true that garbage collected languages like Java and Python will be slightly easier than Vale because of that. I think the tiny amount of added thinking is well worth the (hopefully) massive speedup. Time will tell!

            Re defaults: almost all situations encountered in practice do not require shared ownership, so I’d say it’s a reasonable default, especially since it’s more efficient than shared ownership. Also, as a last resort, users can implement a shared_ptr-like class themselves, Vale doesn’t have to offer it as a first-class citizen. I recommend one exhaust all other options though, because it’s rarely helpful.

            1. 4

              I think it’s a good idea to experiment with “no shared ownership by default” and see what the benefits and costs are; it reminds me a bit of the restrictions of pure functional programming languages, where mutation is forbidden and usually this is just fine. You end up using certain patterns a lot instead of mutation, and it’d be interesting to see what kind of patterns you use in a language with no shared ownership.

              Semi-common situations I run into where shared ownership is helpful:

              • Sharing immutable data with no internal pointers, such as interned strings or cached images/audio data – these usually live for the length of the program, when I use them, so one owning reference and many weak references seems feasible
              • Managing one-per-thread or one-per-program resources, such as database connections, handles to hardware resources – Not sure how to do this well, especially in situations where one part of the program initializes the resource and others try to use it, weak references might work out well in practice, I would have to think about it more
              • Building immutable data structures, especially ones shared across threads in an actor-like system – This one is tricky now, since there are very explicitly no lifetime dependencies between the various threads that might use the data
              1. 2

                Great points! Your three uses are indeed accurate, and I should explain Vale’s capabilities a little more thoroughly: for the first and third cases, we actually do offer built-in shared ownership for deeply immutable objects (https://vale.dev/guide/structs#mutability), because we noticed that specifically for deeply immutable objects, shared ownership has its usual benefits without the usual downsides (eg cycles).

                The second use case is actually the same “rare case” that I was alluding to in my original reply to ignaloidas (I usually think of it in terms of file handles, but you’re correct in saying that it’s any one-per-XYZ resource). This is an interesting case which I think about a lot, and I often wonder if there’s a single-ownership pattern to be found here. But even if there isn’t, there’s always the escape hatch (letting the user implement RC themselves, using Vale’s RAII). So, your point is a good one!

            2. 2

              But it’s yet another thing for programmers to worry about, increasing the cognitive load needed to solve a problem. Ideally the cognitive load is only as large as the “business problem” you are trying to solve is. Obviously, that isn’t really possible to reach, but adding additional things doesn’t help either. The less thought you have to put in, the faster you can write the code. Adding additional things to worry about has to bring benefits that are worth it. And I doubt that the few percent difference this makes is worth it.

              1. 2

                I hear you, sometimes that few percent difference isn’t worth the few extra keystrokes. There are situations where Python would be best (quick small scripts), Java/Kotlin/Scala/Go would be best (large programs that don’t care about speed), or Rust would be best (programs that need super low level access and unsafe code, despite the complexity). In those cases, those languages’ tradeoffs could fit better.

                However, I believe Vale will be a powerful and compelling choice for everything in-between those cases. For a minor amount of extra thinking and keystrokes, you can halve your overhead and eliminate garbage collection pauses, while remaining 100% safe. I don’t know about your use cases and experience, but every team I’ve worked on would love if that was possible in the languages they were using.

    5. 3

      How does that compare (performance-wise, ignoring determinism) to tracing garbage collectors?

      1. 2

        We’ll soon find out! Vale is (semi-intentionally) designed to be able to compile to JVM and CLR bytecode and JS, as well as our current native backend. At that point, we’ll be able to do some reliable benchmarking.

    6. 3

      Many languages are able to skip a lot of the adjustments, using static analysis. For example, Lobster can remove up to 95%. Our experiment doesn’t have those optimizations; it compares naive RC to naive generational references.

      I wonder if you’re actually getting some of these optimizations for free. Even if your frontend compiler naively inserts a generation check on every field-access, LLVM could still eliminate the redundant loads and branches: https://godbolt.org/z/c9j8xs Do you know if this is happening?

      1. 2

        I think it’s happening to both the RC and gen-ref programs compared by the benchmark, especially because when we bumped from LLVM 7 to LLVM 11 we saw a noticeable speedup in both, which hints that the optimizer may have improved.

        Thanks for this godbolt, this is a beautiful illustration of how it can be optimized!

    7. 3

      Interesting idea!

      A similar approach is quite common in some Entity Component Systems (ECS) in game engines, though the approach is not generalised in the same way. Each entity handle is a uint64 with the lower 32 bits representing an index into the component tables, and the high 32 bits representing the generation. When an entity is destroyed its generation is incremented and the ID added to a free list.
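
      A minimal version of that pattern looks something like this (illustrative C++, not any particular engine’s API):

      #include <cstdint>
      #include <vector>

      // 64-bit entity handle: low 32 bits are the slot index, high 32 bits are
      // the generation that was current when the handle was created.
      struct Entity { uint64_t id; };

      struct Registry {
        std::vector<uint32_t> generations;  // one per slot
        std::vector<uint32_t> freeList;

        Entity create() {
          uint32_t index;
          if (!freeList.empty()) { index = freeList.back(); freeList.pop_back(); }
          else { index = (uint32_t)generations.size(); generations.push_back(0); }
          return Entity{(uint64_t)generations[index] << 32 | index};
        }

        void destroy(Entity e) {
          uint32_t index = (uint32_t)e.id;
          generations[index]++;  // invalidates all outstanding handles
          freeList.push_back(index);
        }

        bool alive(Entity e) const {
          uint32_t index = (uint32_t)e.id;
          return generations[index] == (uint32_t)(e.id >> 32);
        }
      };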

      1. 2

        Definitely! I very much wanted to use ECS as an example in the article too, but I stumbled when I realized we never actually dereference the entity, only its components. It was unfortunate, because I usually love working ECS mentions into my articles.

    8. 2

      This is interesting. Seems similar to the version-numbers you’ll find used in various game engines as part of their entity system, only using pointers instead of entity ids, and built-in to the runtime of the language rather than being a custom-built thing that’s kind of alien to the host language.

      1. 2

        Yeah, there’s an interesting article that was linked here recently, something like “handles are the better pointers”, that described a similar generational technique.

    9. 2

      In the example launchShip, the variable ship goes out of scope at the end of the function, which means armada will pretty much immediately have an empty reference. If the program were refcounted, the armada would still have a valid ship. The latter certainly seems more like expected behavior.

      1. 2

        You’re right, to a certain extent. We often have “deinitialize” functions (whether they be destructors like in C++, or regular Java methods like dispose, close, destroy, drop, etc), which perform some operation when an object reaches the end of its expected “logical” lifetime. If the object is kept alive, and accidentally dereferenced, it’s memory safe but still results in bugs. (See also side-note at https://vale.dev/blog/raii-next-steps#simplification) So it’s not quite as clear-cut as “keeping an object alive is the correct thing to do”, but sometimes it is.

        In the end, this approach is faster than ref-counting, and one should consciously decide whether they want the behavior you describe, or more speed.

    10. 2

      Not specific to memory management, but did you consider having explicit ownership/capability denotation, like in Pony? That, to me, makes it easier to reason about (because explicit is better than implicit) than, say, Rust.

      1. 3

        Vale explicitly annotates ownership:

        • fn launch(s &&Spaceship) { ... } receives a weak reference, which the user must check before using.
        • fn launch(s &Spaceship) { ... } receives a “constraint” reference, which is automatically checked on dereference.
        • fn launch(s Spaceship) { ... } receives an owning reference.

        And we have region annotations, which can achieve something similar to their iso, I believe. We’re tossing around the idea of adding uni which would help with optimization. We’re trying to keep the language simple by default, which is why we aren’t going to the same extent as Pony here.

        1. 1

          Well that is very interesting! I guess I misunderstood your docs.

          What kind of application do you think Vale shines at now, and do you have plans to make it good in other domains than whatever you consider your current strength?

          1. 2

            Good question! I’ve had my head in the code for so long that I haven’t thought about this in a while.

            I’d say that Vale is too young to really shine yet, it’s still in the prototype stage. I was able to make the vale.dev markdown++ site generator and the BenchmarkRL program in Vale and it worked well, but there were some rough edges; it’s really only suited for experimenting right now. That said, I think it already shines at showing people that single ownership can be easy, and that’s the main reason we’re making the language.

            More to your question, I think Vale will really shine for servers. Right now, most servers pay for garbage collection with massive amounts of memory (5-6x what’s needed), nondeterministic pauses, and slow performance, because they need the 100% safety. These folks can’t use C++ or Rust because both are very unsafe compared to GC’d languages. Vale has both the 100% safety of GC and (hopefully) the speed of Rust, something no other language can claim.

            I also think Vale could shine in games. AAA games will likely always prefer C++ over Rust because Rust’s borrow checker prevents some of the more creative and necessary patterns, because it can’t conclude at compile time that they’re safe. It also influences you into certain architectures which fit only certain situations, something really showcased by this adventure: https://www.reddit.com/r/roguelikedev/comments/i3xekn/ec_vs_ecs_for_roguelikes/. Vale’s “assist mode” is more powerful and finds more than asan/valgrind, and should have minimal overhead, making it suitable for development builds; Vale’s “unsafe mode” has zero overhead, suitable for AAA games. Non-AAA games (which don’t need to run as fast as C++) trade some performance for safety by using Rust; Vale’s hybrid generational memory would be a better choice there, because it’s safer while being (theoretically) about as fast as Rust.

            For other domains… I’ve actually been subtly designing to keep a certain option open: cross-compilation to JVM, CLR, JS, and Swift. This is something unheard of and downright insane for a native-speed language like Vale to even attempt, but it’s possible, because of a very clever language design approach of “simple semantics with optimization hints”, such as the inl keyword (which is simply ignored by JVM, CLR, and JS, but heeded by LLVM/wasm). This could revolutionize app development if we pull it off. I’ve drafted some thoughts on this at https://vale.dev/blog/cross-platform-core-vision, but we’d need a lot of manpower to attempt something that ambitious.

    11. 1

      This reminds me more or less of what was explored for Nim recently (some more details show up in issues and forum threads IIRC; I believe the Internet search keywords are “nim newruntime”). I’m not super sure why, but it was eventually, apparently, discarded for a reference counting approach. I’m curious if you’re aware of that experiment, and whether it informed yours in some ways?

      edit: Hm, ok, I guess IIUC this approach was Not Discarded But Postponed™, and refcounting was chosen in meantime as easier to introduce without breaking changes.

      1. 1

        There’s nothing mentioning generations in any of your links; what are you referring to?

        If you’re thinking about Nim’s halting references, Gel (Dingle, Bacon) introduced those to the world in 2007, and game engines have been using them for a while (including some of mine, ~2012). Vale has had those in its Assist mode as a debug tool since the beginning, almost eight years ago. But that’s a different technique, based on reference counting, not generations.

        1. 1

          Thanks and sorry! I’m not well versed in memory handling schemes, so it just sounded similar to me in some aspects, esp. the “memory safety” of invalid references. But you’re right, I now seem to recall there was also some refcounting involved. Awesome to hear you know your papers! I’m curious whether Nim’s author would be interested in learning of your approach, or whether he similarly already considered it. He seems to like exploring various memory schemes too.

          edit: BTW, in the “not to be confused with” footer you could also add a mention of GNOME’s Vala language; I only now realized they’re something different X-P

          1. 1

            No worries =) not a bad idea, maybe I’ll drop in and message him.

            I’ll add a line about Vala later tonight, good call.

    12. 1

      so, Box with weak references

      thoughts on using a PRNG instead of a counter?

      free could zero the counter, and call the allocator’s free, rather than keeping it around (with tiny chance of catastrophic false positive)
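
      roughly this, as a sketch (32-bit tags, names hypothetical):

      #include <cstdint>
      #include <random>

      // Random tags instead of a counter: free() can zero the tag and hand the
      // memory back immediately; a stale reference only validates if a later
      // occupant happens to draw the same tag (~2^-32 chance per check).
      static std::mt19937 rng{std::random_device{}()};

      uint32_t freshTag() {
        uint32_t tag;
        do { tag = rng(); } while (tag == 0);  // reserve 0 for "freed"
        return tag;
      }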

      1. 2

        Haven’t considered that! That’s a fascinating idea. It’s probably not as fast as a simple increment, but it allows us to free memory back to the OS more eagerly, and to do better than crashing if we ever max out the generations.

        This could be an option a lot of users like. It could even use less space (32b gens instead of 48b).

        I’m going to explore this a bit more and get back to you. If it pans out, what would you like to call it and how would you like to be credited?

        (Also, Box implies heap; these needn’t be heap-only)

        1. 1

          You are doing an incredible job implementing, documenting and community building. I’ve read everything now and I’ve learned a lot. Looking forward to using Vale :)