This data structure looks like the same one used in constructing the Merkle tree used by the Dat protocol.
I hadn’t heard of Segment Trees as a general data structure before, but I’ve seen them used in CouchDB’s implementation: the b-tree backing a map/reduce view stores the intermediate reduced (aggregated) values inside its interior nodes, so the reduction of a subrange can be computed quickly. I always thought this was super clever. (I think Couchbase’s CouchStore does this too, but it was only a TBD feature in the early days of 2012 when I was still working on it.j
Wow that Dat protocol document has amazingly good diagrams. It looks like they’re also using a range aggregation tree / segment tree structure, although it doesn’t look like the document specifies if they store it with pointers or some implicit layout.
You mentioned B-trees as being more efficient but harder to implement; I wonder if you could improve efficiency by simply increasing the fanout of your tree? For example, by making the branching factor 4 instead of 2. The algorithms would stay pretty similar, you just need to consider two bits at a time instead of one.
I believe the merkle tree in-memory layout with parent hashes alternating with child hashes is called a binmap. I have seen these in a couple of places, https://scattered-thoughts.net/blog/2012/01/03/binmaps-compressed-bitmaps/ sadly link-rotted away, and https://github.com/gritzko/swift/blob/master/doc/binmaps-alenex.pdf which suggests the provenance was via TU-Delft.
Such a great article!
We also use this sort of implicit in-order forests in Mina protocol – this is a great data-structure for efficiently computing an online periodic parallel scan over an infinite stream given some associative operation https://minaprotocol.com/blog/scanning-for-scans . It looks like some of the image links and formatting has bitrot in that article across stylesheet rewrites, sorry!
Do you have a link to the code implementing this or something that describes using the in-order layout? If so I’ll add an edit note linking to it. The document you link shows using the typical breadth-first layout but I don’t see anything about the in-order layout.
Ah yes good point, we use the typical heap-like breadth-first layout (though over the forest of complete trees of increasing depth) – the current version doesn’t use the array anymore because it was a bit hard to work with and we didn’t end up needing the extra performance there (yet anyway) https://github.com/MinaProtocol/mina/blob/develop/src/lib/parallel_scan/parallel_scan.mli . Here’s a link to the older version https://github.com/MinaProtocol/mina/blob/9e1ce494a7335817d7dd4c314cd38b293b10fbaa/src/lib/parallel_scan/parallel_scan.mli
The article reminded me of this excellent paper Array Layouts for Comparison-Based Searching that goes through different array layouts as implicit data structures for binary search. They go through sorted order, heap order, Van Emde Boas order, and a blocked B-Tree order, analyzing the microarchitectural implications on performance.