Threads for carsonfarmer

    1. 1

      After some thought … this seems vulnerable to a similar problem as many hash-tables, wherein if an adversary can choose keys to add to the data structure, they can come up with keys that mess it up enough to ruin the performance characteristics.

      In this case, they can generate keys that have a low rank, or I guess, sufficient keys of the same rank. That will bias the expected distribution of ranks among the keys and make the structure top-heavy (or middle-heavy or whatever.) This would be quite bad for a zip-tree since it creates longer and longer linked lists; not as bad if the “G” data structure is more efficient, but possibly still a problem.

      Prolly trees aren’t vulnerable to this, AFAIK, because the tree structure is determined not by individual keys but by a rolling hash function over concatenated series of keys in a node.

      1. 2

        Unfortunately, prolly-trees are also vulnerable to adversarial keys, in pretty much the same way as any probabilistic tree structure is vulnerable. If you are aware of the rolling hash procedure, you can come up with keys that will trigger a boundary (or not), just the same as you can with g-trees or other structures that rely on local patterns. In here, the dolt folks explain how they try to adjust things to mitigate this (and some other properties) to a degree: https://www.dolthub.com/blog/2024-03-03-prolly-trees/, but this introduces new problems (like, your boundary selection is no longer independent, as detailed here: https://interjectedfuture.com/lab-notes/lab-note-033-cascading-boundary-changes-in-prolly-trees/). Ideally, you want to completely avoid having external information to help select your boundaries (having said that, dolt’s modification to adjust for node size is quite nice).

        In g-trees, we opt to abstract away this part with the internal set structure. So you could, for instance, parameterize the internal set structure to be another g-tree with a different pseudorandom function to determine ranks (use the leading zero count of the binary representation on the main tree, and the trailing zero count on the internal set tree for example). Note also that, while it is totally true that someone with control over the keys can come up with pretty horrific patterns, we can also fight that to a degree with a good hash function. While obviously not an actual fix, it can at least make attacks like this more annoying for our adversary.

      2. 2

        Slightly surprised there’s no mention of the seemingly-similar Prolly Tree. (At least, not in the first half of the paper…)

        1. 4

          Oops, that’s a total mistake on our part! I’ve (re)added a reference (it was lost along the way somehow) and included it back in our latter discussion. Thanks for pointing out that omission!

          1. 1

            That’s another very cute B-tree-like data structure. Thanks for the pointer!