1. 14

    Pretty good article overall, but a few things to fix:

    • Please spell UTF-8 and UTF-16 consistently with a hyphen.
    • Each plane has 65536 code points, not 65535.
    • Rust strings do not support rune indexing; they only support byte indexing. However, the str.chars() method returns an iterator that yields each rune sequentially. Moreover, this iterator can be collected into a Vec<char> which is in UTF-32.
    • CJK characters are 3 bytes characters long in UTF-8, not 2 bytes as you stated. This is because the 2-byte limit is U+07FF, and all CJK characters are above U+3000.
    • You mention that UTF-8 is a variable-length encoding by design, but fail to acknowledge that UTF-16 is variable-length too.
    • Python 2 is a can of worms (which depends on compile-time flags), but Python 3 is not. In Python 3, all strings behave as if they are in UTF-32 (i.e. direct access to every code point). But internally, it can store strings as UTF-32, UTF-16, or UTF-8, depending on the actual content of each string (e.g. a pure ASCII string can be stored as UTF-8 and still allow fast O(1) indexing).
    1. 6

      Rust strings do not support rune indexing; they only support byte indexing. However, the str.chars() method returns an iterator that yields each rune sequentially. Moreover, this iterator can be collected into a Vec which is in UTF-32.

      Stringactually doesn’t support byte indexing either, it just has range indexing to get a slice of the string, but you can use the as_bytes method to get a slice of bytes instead.

      1. 3

        This is correct and this is a really important point I think for Rust string indexing, I’ll modify the post to reflect this.

      2. 3

        You mention that UTF-8 is a variable-length encoding by design, but fail to acknowledge that UTF-16 is variable-length too.

        I don’t understand this criticism. The entire article is about the special case where UTF-16 is variable length - the case where you need to use surrogate pairs to represent a single unicode code point.

        1. 2

          Thanks for the feedback! I’m really glad the article initiated some interesting comments here. There are a few points which need clarifying and correcting which I’ll do in an update.

          You’re right about the Rust string indexing requiring .chars() iterator, I think this is a very important point so I’ll edit the post to explain this.

          I tried to point out how UTF-16 isn’t fixed length but I definitely think it could be made more clear.

        1. 2

          The themes of these questions remind me of Java Puzzlers (book, video).

          1. 2

            Excellent article; it explains the underlying concepts well. Note that using a B-tree to store a long string is like using many little gap buffers, because each tree node has some slack space.

            I wrote about the same topic (a list structure with fast insertion and reading) with a different focus. My description is shorter and shallower, but code is included. https://www.nayuki.io/page/avl-tree-list

            It’s true that the modern solution is to use ropes in a balanced tree.

            1. 1

              The default implimentation of “Set” in F# is an immutable AVL Tree. Is there a requirement for uniqueness of elements for AVL trees? Also are AVL’s unordered? Or is this just a coincidence that F# decided to use AVL Trees for unique unordered collections.

              1. 1

                In an imperative programming language like Java (C++, Python, etc.), it is customary to use a self-balancing binary tree data structure to implement ordered sets and ordered maps/dictionaries. If I recall correctly, the AVL balanced tree was invented first, but the later red-black tree became the more popular choice in practice. I’m not sure why though, because the RB code is longer, there are more cases to handle, and it’s not clear to me that it’s faster than AVL (though I haven’t implemented or benchmarked it). Anyways, these mutable binary trees can be adapted to be immutable - just throw away a node and replace it with a new one if you need to insert/update/delete a tree.

                Onto your questions. Any tree (plain binary, AVL, RB, B-tree) doesn’t require unique elements or ordering of the values. But you need certain rules about where to store and retrieve an element. For ordered sets, the ordering forces the position within the tree. For the hex editor article and my “AVL tree list”, the subtree weight forces the position within the tree.

                Sorry, because I’m unfamiliar with F#, I can’t answer your specific question about why they decided to use AVL trees for unique unordered collections. It seems that “unique” and “unordered” are features that are mutually incompatible for a tree structure. I expect you’ll have better luck on a larger community forum like Stack Overflow.

              2. 1

                That’s a great article and site. I’ll probably dig into those write-ups at some point.

                1. 3

                  Be careful when XOR-swapping two integers that are addressed by pointers/references. If they both point to the same memory location, then the value gets erased to zero. In C it would be good to add the restrict keyword to the two pointers, to let the human know about this pitfall (it doesn’t stop the compiler).

                  1. 6

                    Be careful when using coding tricks in production code.

                    I doubt whether there’s still a compiler used “for real” which doesn’t optimize the normal swap algorithm, using a temporary variable, into whatever’s most efficient for the system, which usually works out to being a single opcode. This is decades-old peephole optimization stuff. The intricacies of restrict and pointer aliasing are useful to keep in mind in general, but if you’re using this trick in code where correctness matters, just use a temporary variable and have some faith in the people who wrote the optimizer.

                  1. 4

                    Rust is the first popular language to propose an alternative — automatic memory management and safety without garbage collection 

                    This statement is inaccurate. C++ has automatic memory management without garbage collection, thanks to RAII. But C++ comes with a lot of baggage from the past, unsafe default behaviors, and subtle pitfalls. Whereas Rust is a clean start that takes the good parts of C++ and none of the bad parts.

                    But overall, a well-written, information-packed article.

                    1. 1

                      This statement is inaccurate. C++ has automatic memory management without garbage collection, thanks to RAII.

                      I knew about RAII in C++, but (please correct me if I’m mistaken) it is not enforced by the compiler. By the same metric, C with Boehm’s GC has automatic memory collection. Sure, in Rust a developer can use unsafe and escape the compiler, but it is an explicit act and, when using an open-source 3rd party lib, pretty easy to grep.

                      1. 1

                        RAII in C++ is basically an add-on.

                        After fighting with random SIGSEGVs on some large C++ codebase I was asked to maintain, I disciplined myself to either use RAII and pass-by-reference, or having a mandatory public ok() function returning false if something went wrong in the constructor. Years later, I found Rust compiler enforced what to me was the best practice.

                    1. 12

                      This is a bit of a wild card, I am not experienced at all with haskell, but my impression is it may not be well suited to imperitive operations like read buffers from a stream and write buffers to a stateful database. I could be totally wrong or ignorant and would like to one day address my ignorance.

                      These are things that I use Haskell for!! As Tackling the Awkward Squad says:

                      In short, Haskell is the world’s finest imperative programming language.

                      1. 3

                        exactly the type of thing I love to read, thanks :)

                        1. 1

                          This PDF crashed my browser.

                          1. 0

                            Why was my comment about marking PDFs downvoted?

                            1. 0

                              The article is an HTML page. There is not a PDF in sight. Why do you want to mark a PDF?

                              1. 1

                                It’s quite obviously a PDF. I honestly don’t know what you’re talking about.

                            2. -2

                              Can you mark your PDF? Thanks.