1. 22
    1. 16

      You need to be very careful with the implicit sharing. I helped debug a huge memory overhead problem in one program that was fetching large XML documents and extracting a few fields from them. Each document was tens to hundreds of KiBs. The captured strings (which were added to a dictionary) added up to a couple of hundred bytes. The program did a new fetch every few seconds and so grew its memory usage by about 10 MiB per minute. Explicitly copying the strings got this down to about 1 KiB per minute (and the old entries were aged out long before this became a problem, so it eventually hit a steady state). Implicit sharing with large objects can cause weird performance issues.

      It’s worth looking at the string representation that we used for GNUstep with the v2 ABI on 64-bit systems, which Apple extended a lot. Our strings are either stored in a pointer, or are pointers to string objects. When stored in a pointer, we store 7-bit ASCII characters (so expansion to UTF-8 is cheap, UTF-16 is only slightly more expensive). This lets us store 8 characters, a length, and a discriminator in the 8 bytes. If it’s a pointer, it refers to an object that contains an isa pointer, a pointer to the buffer, a length, and the hash. We don’t store the hash for small strings because it’s very cheap to compute. The isa pointer (vtable in C++) means that the not-inline version can be one of a variety of different implementations depending on the string data that’s being stored.
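      A minimal sketch of that in-pointer encoding, in C++. The exact bit placement below is my assumption for illustration (GNUstep’s actual layout differs); the idea is 8 seven-bit characters, a length, and a discriminator in one 64-bit word:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Tagged-pointer small string: up to 8 seven-bit ASCII characters (56 bits),
// a 4-bit length, and a 1-bit discriminator packed into one 64-bit word. The
// low bit is set so the value can never be mistaken for an aligned object
// pointer. Field placement here is made up for illustration.
std::optional<std::uint64_t> pack_small(const std::string& s) {
    if (s.size() > 8) return std::nullopt;
    for (unsigned char c : s)
        if (c > 0x7f) return std::nullopt;            // must be 7-bit ASCII
    std::uint64_t w = 1;                              // discriminator: not a pointer
    w |= static_cast<std::uint64_t>(s.size()) << 1;   // 4-bit length
    for (std::size_t i = 0; i < s.size(); ++i)        // 7 bits per character
        w |= static_cast<std::uint64_t>(s[i] & 0x7f) << (5 + 7 * i);
    return w;
}

std::string unpack_small(std::uint64_t w) {
    std::size_t len = (w >> 1) & 0xf;
    std::string s;
    for (std::size_t i = 0; i < len; ++i)
        s.push_back(static_cast<char>((w >> (5 + 7 * i)) & 0x7f));
    return s;
}
```

      Expansion to UTF-8 is then just this unpack, since 7-bit ASCII is a subset of UTF-8.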

      Apple’s version extends the in-pointer version with 5- and 6-bit encodings. These are tiny character sets that are generated by profiling a load of different applications.

      Part of the reason that this works well is that Objective-C’s string interfaces don’t encourage accessing the underlying data. The fast accessor is a copy operation that extracts a range of characters. You can build fast iteration with a nested loop using this primitive. Fast enumeration in Objective-C was added later and never back-ported to strings, but a more modern design would allow the copy to alternatively provide access to the internal storage if desired (if it happened to be the same encoding that the caller wants). For strings embedded in pointers, if the caller always provides a 16-byte on-stack buffer then they can get the whole thing in the outer loop and only hit the inner one once.

      ICU’s UText abstraction is very similar to this. It provides a buffer and a pointer and a callback that fetches ranges and fills in the length that it has fetched. If you have a rope or similar, it will iterate over each contiguous block very quickly. If you have some other encoding, it will operate on that.
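      Both patterns can be sketched with a single interface. The names (chunk, text_source, fetch) are invented for illustration; UText’s real API differs in detail but has the same shape: the caller asks for a range, and the source either copies into the caller’s buffer or, when its storage is already contiguous, points at it directly. A small string that fits the scratch buffer is consumed in one outer iteration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

struct chunk { const char* data; std::size_t len; };

struct text_source {
    std::string storage;   // stand-in for whatever the string really holds
    // Fetch up to cap bytes starting at pos. A contiguous source can point
    // into its own storage and ignore the caller's scratch buffer; a rope
    // would copy, or point at one contiguous block per call.
    chunk fetch(std::size_t pos, char* /*scratch*/, std::size_t cap) const {
        return {storage.data() + pos, std::min(cap, storage.size() - pos)};
    }
    std::size_t size() const { return storage.size(); }
};

// Count occurrences of a byte using the two-level loop: the outer loop
// fetches a chunk via a small on-stack buffer, the inner loop walks it.
std::size_t count_byte(const text_source& s, char c) {
    char buf[16];                              // small on-stack scratch buffer
    std::size_t count = 0, pos = 0;
    while (pos < s.size()) {                   // outer loop: one chunk at a time
        chunk ch = s.fetch(pos, buf, sizeof buf);
        for (std::size_t i = 0; i < ch.len; ++i)   // inner loop: bytes in chunk
            if (ch.data[i] == c) ++count;
        pos += ch.len;
    }
    return count;
}
```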

      C++ has absolutely terrible string designs from an API perspective, so it’s not a good source of inspiration. It leaves a load of performance on the table from high-level optimisations by doing early micro optimisation.

      Oh, and I really liked your choice of quote in the examples!

      1. 7

        You need to be very careful with the implicit sharing. I helped debug a huge memory overhead problem in one program that was fetching large XML documents and extracting a few fields from them. Each document was tens to hundreds of KiBs. The captured strings (which were added to a dictionary) added up to a couple of hundred bytes. The program did a new fetch every few seconds and so grew its memory usage by about 10 MiB per minute. Explicitly copying the strings got this down to about 1 KiB per minute (and the old entries were aged out long before this became a problem, so it eventually hit a steady state). Implicit sharing with large objects can cause weird performance issues.

        IIRC that is a major reason why it was removed in OpenJDK 1.7u6 and the JDK moved to copying instead. This also reduced the size of the structure (as it no longer needed an offset and a length), and sped up most operations as it simplified data access.

        Plus substrings tend to be pretty small, so if you have a GC with a good allocator, allocating a small buffer is pretty cheap, and you can optimise small string copy for that situation. Substring copy is also much more compatible with escape analysis, allowing small substrings to be non-escaping, a further gain you don’t get from sharing.

        Opt-in data sharing is nice (e.g. C++’s string_view, Rust’s str), but unless the data structure is persistent in the traditional sense (a tree with a possibly large but limited branching factor), implicit sharing tends to be very footgunny, and a pretty large liability.

        1. 7

          This also reduced the size of the structure (as it didn’t need an offset and length)

          This is an interesting piece of language design. It only needs an offset and a length because the Java object model doesn’t permit interior pointers. Java pointers (which the language calls references, just to confuse everyone) must point to the canonical address for an object (which, in OpenJDK, I believe, is actually somewhere in the middle, rather than the start). This simplifies the GC a lot, as does single inheritance (all object pointers in fields can be stored contiguously and so you can just store the count of them in the class structure). This then means that anything wanting to be a slice in an array must carry an offset, whereas in .NET (which does permit interior pointers) it would not need to at the language level (though, I believe, at least some .NET implementations actually store interior pointers as a pair of an offset and an object pointer).

          This composes with the Java decision to make java.lang.String a final class. This precludes having specialised versions and was necessary for two reasons. First, the early implementations could statically dispatch to string methods, which was a big performance win, but which hasn’t been relevant for at least 15 years (modern JVMs don’t use final at all because they have full view of the class hierarchy and profiling information and so get far better fine-grained feedback). Second, and most importantly, the design of the SecurityManager requires strings to be immutable. If you can mutate a subtype of java.lang.String then you can bypass the security manager.

          So all of this comes from a combination of a GC design choice and a security model design choice.

          Plus substrings tend to be pretty small so if you have a GC with a good allocator allocating a small buffer is pretty cheap, and you can optimise small string copy for that situation

          I believe OpenJDK’s default GCs also have some special cases for objects like strings, arrays of primitives, and NIO buffers. They can’t contain pointers and so can be allocated from a heap region that is never scanned.

          Substring copy is also much more compatible with escape analysis, allowing small substrings to be non-escaping, a further gain you don’t get from sharing.

          Yup, that’s also a huge win. This, in turn, leads to moving them to stack allocations if you can prove that they’re small.

          1. 2

            These are some great points. I had a similar discussion over on r/Clojure: https://www.reddit.com/r/Clojure/comments/18urw8f/janks_new_persistent_string_is_fast/

            In short, I didn’t think of this case and I love that you folks jump in to share the context and previous designs. :D However, a couple of options present themselves:

            1. When making a substring, check the size difference between the substring and the original; if it’s greater than some X, don’t share. This would mean we’re saying we’re ok with X amount of potential waste to save on a deep copy
            2. Make substring sharing, and the value of X used, a compile-time option; we just need to figure out sane defaults
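            Option 1 can be sketched like this (a hypothetical reconstruction, not jank’s actual code; std::shared_ptr stands in for GC-managed memory, and the field names and value of X are invented):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Share the parent's buffer only when the waste (parent length minus
// substring length) stays under a threshold X; otherwise copy out.
struct str {
    std::shared_ptr<const std::string> owner;  // backing buffer
    const char* data;
    std::size_t len;

    static constexpr std::size_t X = 256;      // made-up waste tolerance

    static str from(std::string s) {
        auto o = std::make_shared<const std::string>(std::move(s));
        return {o, o->data(), o->size()};
    }

    str substr(std::size_t pos, std::size_t n) const {
        n = std::min(n, len - pos);
        if (len - n <= X)                      // small waste: share the buffer
            return {owner, data + pos, n};
        auto o = std::make_shared<const std::string>(data + pos, n); // copy out
        return {o, o->data(), n};
    }

    bool shares_with(const str& other) const { return owner == other.owner; }
};
```

            A large tail substring stays shared, while a small extracted field pays for its own (cheap, small) allocation.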

            For copy construction, sharing is still caring, since the sizes remain the same. So we’ll still save millions of deep copies in a typical application from that alone.

            jank’s GC (Boehm) is currently checking interior pointers, since object addresses can vary, depending on base classes, and because jank/Clojure rely on a great deal of type erasure which generally offsets the address we use. The upside here is that shared large strings are still just a pointer, size, and category flags, same as an owned large string.

            1. 1

              When making a substring, check the size difference between the substring and the original; if it’s greater than some X, don’t share. This would mean we’re saying we’re ok with X amount of potential waste to save on a deep copy

              That’s a good potential approach. We didn’t find a value that worked well, but we didn’t try very hard when we explored this. Most substrings in very large things are well under 50% of the total size, so in hindsight the correct thing may be an upper limit on the size of the referenced string. That said, things like extracting path components are better to just copy, because then the start is always strongly aligned and you can use SIMD operations when processing. The heuristic approach had weird performance cliffs that confused programmers.

              Make substring sharing, and the value of X used, a compile-time option; we just need to figure out sane defaults

              I prefer the option of exposing this to the programmer with explicit string view operations. If the string view implements the same structural type / interface (not sure what your type system looks like) then programmers can make this choice at construction point but still use the same APIs either way. This avoids weird performance artefacts that are outside of programmer control.

              For copy construction, sharing is still caring, since the sizes remain the same. So we’ll still save millions of deep copies in a typical application from that alone.

              For what it’s worth, the profiling I did on desktop applications showed that the string-in-pointers trick saved around 10% of total allocations. Path components and dictionary keys were a huge proportion of the total number of allocations (not just strings, I hooked into the allocate-object function in ObjC and counted the number of calls and the number that would not be needed if you did the tiny string optimisation). This significantly reduced the load on the allocator.

              1. 1

                Firstly, thanks for all of the very thoughtful responses. This is a helpful discussion for me.

                Most substrings in very large things are well under 50% of the total size, so in hindsight the correct thing may be an upper limit on the size of the referenced string.

                I’ve considered this, but I think the size difference is more helpful. For example, let’s say I’m traversing the class path. That’s a very large, colon-separated string. I search for the first colon, take a substring of that path, and work on it. Maybe I even store it. If the difference between that single path’s size and the whole class path string’s size is large, I definitely want to copy out. However, after I’m done processing that path, I move to the next one. Let’s say I simplify the code by then saying class_paths = class_paths.substr(last_colon + 1); and then I loop, checking for the next colon. In this case, I’m taking a very large substring, since it’s the whole class path string minus the first path. I don’t want a copy there, exactly because it’s a large string. The size-difference heuristic would work well here, since the difference between the overall string size and the substring will only be the size of the first path. As we work our way through the paths, we may end up copying later on, but at least we’ll save ourselves some big copies.
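                The traversal described above, written out so the two kinds of substring are visible: the small leading component (large size difference, so the heuristic would copy it out) and the big tail remainder (small size difference, so the heuristic would share the buffer). Plain std::string is used here, so both substrings actually copy; the point is the access pattern.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a colon-separated class path into its components by repeatedly
// taking the leading component and then re-pointing at the remainder.
std::vector<std::string> split_class_path(std::string class_paths) {
    std::vector<std::string> parts;
    for (;;) {
        auto colon = class_paths.find(':');
        if (colon == std::string::npos) {
            if (!class_paths.empty()) parts.push_back(class_paths);
            return parts;
        }
        parts.push_back(class_paths.substr(0, colon));  // small: copy out
        class_paths = class_paths.substr(colon + 1);    // large tail: share
    }
}
```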

                That said, things like extracting path components are better to just copy, because then the start is always strongly aligned and you can use SIMD operations when processing.

                This is a strong argument. Ultimately, I’m going to need more data once jank can be used for large programs. At that point, toying more with these things will be fun.

                I prefer the option of exposing this to the programmer with explicit string view operations.

                From strictly a C++ perspective, I do too. I may ultimately extract this substring logic out into a view object, but that has big implications. I combined them here for three main reasons:

                1. The majority of jank users are just writing Clojure code. It’s completely opaque whether clojure.core/subs returns a string or a string view, for example.
                2. When people are embedding jank into an existing C++ application, having the C++ string interface behave consistently with how it does from Clojure will help keep people sane.
                3. Adding a string view is more than just implementing a native view type. It would also need to be a first class jank object, which means registering it in the object model, setting up a run-time boxed object type for it, re-implementing all string-related mechanisms, handling both of them in clojure.core/string?, etc. C++ is not like Java where everything is already an Object and jank uses something akin to tagged unions for objects, rather than dynamic dispatching and dynamic casting everywhere. More details on that here: https://jank-lang.org/blog/2023-07-08-object-model/

                For what it’s worth, the profiling I did on desktop applications showed that the string-in-pointers trick saved around 10% of total allocations. Path components and dictionary keys were a huge proportion of the total number of allocations (not just strings, I hooked into the allocate-object function in ObjC and counted the number of calls and the number that would not be needed if you did the tiny string optimisation). This significantly reduced the load on the allocator.

                Yep, that makes sense. I think the vast majority of the wins here are going to come simply from jank’s string’s copy constructor sharing the data, rather than from substring optimizations, since string copies happen every time you put a key into a map, take it out, intern a keyword, var, or symbol, and so on.

      2. 1

        I don’t understand why copying large strings is so much faster in jank than with the other strings. Is it just that storing the length means we don’t have to find the null terminator first?

        1. 2

          Hey! Thanks for reading; I’m glad you asked for clarification.

          jank’s string is outperforming the other two for large string copies and substrings because it’s not deep copying. Instead, the new string just refers to the original string’s char array, at whatever offset position. For example:

          • String 1 is: “Happy new year! Everybody dance like it’s 2024!” (47 chars, large string)
          • String 2 is constructed directly from String 1: “Happy new year! Everybody dance like it’s 2024!” (47 chars, large string)
          • String 3 is a substring of String 2: “Everybody dance like it’s 2024!” (31 chars, large string)

          Note, none of them employ the short string optimization. Starting with String 1: it’s large, so we allocate a char array for it and mark String 1 as “large_owned”, since it owns its array. When we construct String 2, rather than allocating a brand new array and copying all 47 chars, we have String 2 point at the same data and then we update both String 1 and String 2 to mark them as “large_shared”, meaning neither owns the char array anymore. We leave cleanup of the char array to the GC. When String 3 is constructed, as a substring of String 2, we also don’t allocate a new char array. We just point at the start of the substring within String 2 and mark String 3 as also being “large_shared”.

          So now all three strings are pointing at the same char array, though at varying offsets. We constructed three strings, but only needed one dynamic allocation. String copies happen a lot in C++ and in Clojure, as we construct strings for maps, take them out to use them, etc. When I counted how many string copies (not substrings, just copies) alone happen when compiling clojure.core within jank, it was over 3k. With the two other string implementations (std::string and folly::fbstring), that would mean 3k new allocations and deep copies. With jank’s string, all of that memory is shared and we save on 3k copies just when loading clojure.core.
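          The three-string walkthrough above can be condensed into a sketch (a hypothetical reconstruction, not jank’s actual implementation; a leaked heap allocation stands in for GC-managed memory so the example stays small):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Ownership-category transitions: a freshly built large string owns its
// array; copies and substrings demote everyone involved to shared.
enum class category { large_owned, large_shared };

struct shared_string {
    const char* data;
    std::size_t len;
    category cat;

    static shared_string make(const char* s) {
        std::size_t n = std::strlen(s);
        char* buf = new char[n];               // stand-in for a GC allocation
        std::memcpy(buf, s, n);
        return {buf, n, category::large_owned};
    }

    // Copy construction: point at the same array; both become large_shared.
    shared_string copy() {
        cat = category::large_shared;          // original no longer sole owner
        return {data, len, category::large_shared};
    }

    // Substring: point into the same array at an offset; no allocation.
    shared_string substr(std::size_t pos, std::size_t n) {
        cat = category::large_shared;
        return {data + pos, n, category::large_shared};
    }
};
```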

          Hopefully that makes more sense.