1. 31
  1.  

  2. 3

    This is something that you cannot do both a) this safely and b) this efficiently in any other language

    I’m not sure I agree, while it is cool, in garbage collected languages with immutable strings it seems like its quite simple to do this.

    1. 8

      I agree. Though the Rust implementation has another benefit. In a GC language with immutable strings, when a slice stays reachable (in a GC) sense, you hold on to the complete (potentially large) string, unless you use something like ropes. In Rust, you would notice accidentally doing this, since the slice reference cannot outlive the (owned) string.

      1. 3

        I think most GC languages with immutable strings do not do this. I’ve been going back and forth about this issue in Oil a bunch. Examples:

        • Python copies the bytes of a slice.
        • Ditto for OCaml – it copies bytes. It doesn’t have any kind of slice in its GC.
        • The main Java JVM used to share the data between the string and slice, but now they copy it, because of the issue of holding onto big strings mentioned, and maybe others.
        • (reviewing my notes) Go is one of the few languages that shares. It’s also true that Go’s GC is insanely complex for multiple reasons, not just this one … (it used to be conservative too, but now it’s precise I believe).
        • I’d be interested to hear what other languages/VMs do (JavaScript, Ruby, etc.) Pretty sure Ruby copies them too.

        Right now oil-native shares the bytes, but GC is pushing it toward copying. The implementation issues I see are:

        (1) Most strings are short, which means that a slice is actually bigger than most strings

        For example, if you have a pointer to a string, a beginning index, and an end index, that’s 16 bytes on a 64-bit machine. Add in some other data about whether it’s interned (which Python has), or the string’s hash value (ditto), the you easily have 24 bytes, or 32 bytes when rounded up to an allocator size.

        You may as well just use that 24-32 bytes to store the string itself in most cases!

        (2) Sharing slices can complicate the GC implementation.

        If you don’t copy, then a slice is a form of pointer. So if you want to represent it efficiently, you naturally have to account for it in GC reachability analysis (for any form of tracing GC).

        The other option is to make a slice a first class GC object, which increases the number of objects and thus increases GC pauses.

        And you need the (pointer, begin index) implementation, or else you have interior pointers, which greatly complicates a GC implementations.


        Probably the real answer is to specialize for small strings vs. large strings, or provide a separate string and slice type, but both of those make things more complicated.

        So although I was slightly skeptical too that there was anything special about Rust here, I’d have to concede it after working on my own GC.

        I think people have said that Rust basically has a compile-time GC, and while that incurs tremendous complexity, the benefit seems to be borne out in this case, which is very common in parsing. I would say that saving the GC header is a big deal in terms of efficiency.

        Most strings are small, and the GC header really bloats every object. When every single token holds a string, this is a big deal. (AFAIK is basically the issue with Eclipse running out of memory – lots of tiny bloated GC’d objects, many of which are strings. IntelliJ has a bunch of tricks to mitigate this, which I think boil down to using Java like C – arrays rather than objects, etc.)

        1. 2

          Additionally, Go code has a capital-P Pattern to deal with the substringing problem, where you write string([]byte(s)), in order to take a copy of a slice of a large input in order to not retain the input buffer.

          1. 1

            Ah interesting, makes sense!

          2. 1

            You can trivially represent a token as a pointer to a string and a start/end index:

            type Token struct {
              Input string
              Start,End int
            }
            

            This would work fine even in languages without COW strings or sub slices of strings.

            1. 1

              Of course, but I’m saying that this is not efficient. It’s bigger than most strings, especially when you include the GC header, and especially when you’re tokenizing, which is what the original article is about.

              The “compile-time GC” of Rust is doing something interesting.

          3. 2

            Assertions about “any other language” are bound to hit some exception :) Strings are perhaps an especially easy case, because many languages make them immutable/CoW even if they don’t have any other immutability.

            For example, in Golang I fear accepting and returning slices without copying. You can’t be sure whether your caller meant to give you ownership of the slice, or will mutate “your” slice later. Similarly you need to be vigilant about append reusing slice’s capacity, which may unexpectedly mutate contents of a different slice.

            1. 1

              I think the distinction is “this efficiently”, but you’re right, GC handles this sort of thing very well.

            2. 2

              Is there any way to get closer to this safety, within C++? For example,

              • Is there a function annotation that says “the result lives only as long as the arguments”?
              • Can you give the function a templated type that somehow encodes the lifetime? (Maybe inspired by Haskell’s runST?)
              • Is there a smart pointer that documents the intended usage? (shared_ptr is not right, because it keeps the outer object alive. weak_ptr seems closer, but still means you can’t stack-allocate the outer object.)

              I’ve recently been adding comments like “caller must ensure the [argument thing] outlives the [result thing]”, which is unsatisfying.

              1. 1

                What do you know! Kakoune has something like this, in safe_ptr.hh:

                // *** SafePtr: objects that assert nobody references them when they die ***
                
                • You make your class Foo extend SafeCountable.
                • You allocate a Foo however you want.
                • Creating/destroying safe_ptr<Foo> updates a refcount, which Foo inherited from SafeCountable.
                • Destroying the Foo runs ~SafeCountable which asserts m_count == 0.

                This seems like a pretty good compromise: the type documents the intended usage, and it can catch mistakes at runtime.

                EDIT: Thinking about this some more… what happens if ~Foo runs in thread A, while thread B accesses the Foo through a safe_ptr? Thread A will crash when it hits the assert, but will it crash before thread B can invoke undefined behavior? Maybe it’s not UB because technically ~Foo never returned?