Excellent article; it explains the underlying concepts well. Note that using a B-tree to store a long string is like using many little gap buffers, because each tree node has some slack space.
I wrote about the same topic (a list structure with fast insertion and reading) with a different focus. My description is shorter and shallower, but code is included. https://www.nayuki.io/page/avl-tree-list
It’s true that the modern solution is to use ropes in a balanced tree.
The default implimentation of “Set” in F# is an immutable AVL Tree. Is there a requirement for uniqueness of elements for AVL trees? Also are AVL’s unordered? Or is this just a coincidence that F# decided to use AVL Trees for unique unordered collections.
In an imperative programming language like Java (C++, Python, etc.), it is customary to use a self-balancing binary tree data structure to implement ordered sets and ordered maps/dictionaries. If I recall correctly, the AVL balanced tree was invented first, but the later red-black tree became the more popular choice in practice. I’m not sure why though, because the RB code is longer, there are more cases to handle, and it’s not clear to me that it’s faster than AVL (though I haven’t implemented or benchmarked it). Anyways, these mutable binary trees can be adapted to be immutable - just throw away a node and replace it with a new one if you need to insert/update/delete a tree.
Onto your questions. Any tree (plain binary, AVL, RB, B-tree) doesn’t require unique elements or ordering of the values. But you need certain rules about where to store and retrieve an element. For ordered sets, the ordering forces the position within the tree. For the hex editor article and my “AVL tree list”, the subtree weight forces the position within the tree.
Sorry, because I’m unfamiliar with F#, I can’t answer your specific question about why they decided to use AVL trees for unique unordered collections. It seems that “unique” and “unordered” are features that are mutually incompatible for a tree structure. I expect you’ll have better luck on a larger community forum like Stack Overflow.
Be careful when XOR-swapping two integers that are addressed by pointers/references. If they both point to the same memory location, then the value gets erased to zero. In C it would be good to add the restrict
keyword to the two pointers, to let the human know about this pitfall (it doesn’t stop the compiler).
Be careful when using coding tricks in production code.
I doubt whether there’s still a compiler used “for real” which doesn’t optimize the normal swap algorithm, using a temporary variable, into whatever’s most efficient for the system, which usually works out to being a single opcode. This is decades-old peephole optimization stuff. The intricacies of restrict
and pointer aliasing are useful to keep in mind in general, but if you’re using this trick in code where correctness matters, just use a temporary variable and have some faith in the people who wrote the optimizer.
Rust is the first popular language to propose an alternative — automatic memory management and safety without garbage collection
This statement is inaccurate. C++ has automatic memory management without garbage collection, thanks to RAII. But C++ comes with a lot of baggage from the past, unsafe default behaviors, and subtle pitfalls. Whereas Rust is a clean start that takes the good parts of C++ and none of the bad parts.
But overall, a well-written, information-packed article.
This statement is inaccurate. C++ has automatic memory management without garbage collection, thanks to RAII.
I knew about RAII in C++, but (please correct me if I’m mistaken) it is not enforced by the compiler. By the same metric, C with Boehm’s GC has automatic memory collection. Sure, in Rust a developer can use unsafe
and escape the compiler, but it is an explicit act and, when using an open-source 3rd party lib, pretty easy to grep.
RAII in C++ is basically an add-on.
After fighting with random SIGSEGVs on some large C++ codebase I was asked to maintain, I disciplined myself to either use RAII and pass-by-reference, or having a mandatory public ok() function returning false if something went wrong in the constructor. Years later, I found Rust compiler enforced what to me was the best practice.
This is a bit of a wild card, I am not experienced at all with haskell, but my impression is it may not be well suited to imperitive operations like read buffers from a stream and write buffers to a stateful database. I could be totally wrong or ignorant and would like to one day address my ignorance.
These are things that I use Haskell for!! As Tackling the Awkward Squad says:
In short, Haskell is the world’s finest imperative programming language.
Pretty good article overall, but a few things to fix:
str.chars()
method returns an iterator that yields each rune sequentially. Moreover, this iterator can be collected into aVec<char>
which is in UTF-32.String
actually doesn’t support byte indexing either, it just has range indexing to get a slice of the string, but you can use theas_bytes
method to get a slice of bytes instead.This is correct and this is a really important point I think for Rust string indexing, I’ll modify the post to reflect this.
I don’t understand this criticism. The entire article is about the special case where UTF-16 is variable length - the case where you need to use surrogate pairs to represent a single unicode code point.
Thanks for the feedback! I’m really glad the article initiated some interesting comments here. There are a few points which need clarifying and correcting which I’ll do in an update.
You’re right about the Rust string indexing requiring .chars() iterator, I think this is a very important point so I’ll edit the post to explain this.
I tried to point out how UTF-16 isn’t fixed length but I definitely think it could be made more clear.