1. 7

This is one of the more impressive things in TokuMX v2.0 (which has since shipped; this blog post is from just before that). It’s something I’ve been excited about doing for years, and I’m really happy that we finally got around to shipping it.

  1.  

  2. 1

    This is pretty sick! I’m curious how exactly it’s implemented. =)

    1. 3

      This is probably the most comprehensive talk about how fractal trees work: http://www3.cs.stonybrook.edu/~bender/talks/2013-BenderKuszmaul-xldb-tutorial.pdf

      I’m going to give a few talks in SF in November, you should come if you’re curious and in the area.

      Basically, the fractal tree implements inserts and deletes as messages that travel down the tree. To perform, say, db.coll.update({_id: 10}, {'$inc': {a: 4}}), we basically just encode {'$inc': {a: 4}} as an “update” message, send it down toward the key {_id: 10}, and whenever it gets applied (due to a flush or a query), we just increment whatever value is there. You lose out on the information you’d get from a query, like “how many docs were affected” or “was there a type error (e.g. incrementing a string)?”, because we don’t read before returning to the client, but if your app can tolerate that, it’s nice.
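
      To make that concrete, here’s a minimal Python sketch of the message-buffer idea. It’s a toy, not TokuMX’s actual implementation: it collapses the whole tree into a single node with a dict standing in for the leaves, whereas a real fractal tree cascades these buffers down through the internal nodes.

      ```python
      # Toy sketch: an update is encoded as a message and buffered; it is
      # only applied to the leaf value when a flush or a query forces it.

      class Node:
          def __init__(self):
              self.buffer = []   # pending messages: ("inc", key, delta)
              self.leaves = {}   # key -> document (stand-in for leaf nodes)

          def update(self, key, delta):
              # db.coll.update({_id: key}, {'$inc': delta}) just enqueues
              # a message; nothing is read or written at the leaf yet.
              self.buffer.append(("inc", key, delta))

          def flush(self):
              # Pending messages are applied, in order, to whatever value
              # is there when the flush happens.
              for op, key, delta in self.buffer:
                  doc = self.leaves.setdefault(key, {})
                  for field, amount in delta.items():
                      doc[field] = doc.get(field, 0) + amount
              self.buffer.clear()

          def find(self, key):
              self.flush()       # a query forces message application
              return self.leaves.get(key)

      tree = Node()
      tree.update(10, {"a": 4})  # returns immediately; no read happened
      print(tree.find(10))       # {'a': 4} -- applied lazily, on the read
      ```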

      1. 1

        Interesting! I’ve only seen a little bit about fractal trees before, I’ll check it out.

        And I might be in SF, no definitive plans though. When / where are the talks? :D

        1. 2

          The November 12 talk is a redux of a talk I gave at Big Data DC in August; it’s more in-depth and theory-oriented than the November 11 talk, which is more product-focused.

          Hope to see you there!

      2. 2

        I don’t know; however, I do know that TokuMX’s oplog does not require operations to be idempotent, which means you could just toss the operation in the oplog and return to the user before actually applying it. If it is implemented this way, it also means it might play funny in a distributed setting.

        1. 2

          Requiring either the oplog or the tree messages to be idempotent would really mess this up. This feature, fast updates, is one of the main reasons we made sure not to require the oplog to be idempotent from the beginning. This requires e.g. rollback’s algorithm to change (you can’t just blindly replay everything from some point before the partition; you actually have to go back and find the right spot carefully), but it’s not too bad.
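
          A tiny Python example of why that is (my illustration, not TokuMX code): a relative change like $inc is not idempotent, so naively replaying it from “some point before” corrupts the data, which is exactly why rollback has to locate the precise spot.

          ```python
          doc = {"a": 0}

          def apply_inc(doc, delta):
              for field, amount in delta.items():
                  doc[field] = doc.get(field, 0) + amount

          entry = {"a": 4}       # relative, non-idempotent oplog entry
          apply_inc(doc, entry)  # first (correct) application
          apply_inc(doc, entry)  # naive re-application after a crash
          print(doc)             # {'a': 8}, not the intended {'a': 4}

          # An idempotent "set a = 4" entry, by contrast, is safe to
          # replay blindly, which is why traditional replication can
          # get away with it.
          ```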

          1. 1

            I’m pretty sure that the oplog not being idempotent makes proper oplog application impossible. How do you deal with applying the change but dying before updating the position in the oplog?

            1. 2

              We do so with transactions. Everything is transactional.

              1. 2

                Not sure if this is a troll response or not, but transaction logs are idempotent in every database I’m aware of (which isn’t many), so you can replay transactions twice and it’s ok.

                1. 2

                  We made it safe for the oplog to not be idempotent specifically to support fast updates. This is not a troll response. The storage engine is transactional, so we can apply the effect of the update and modify the applied bit together in an atomic transaction.
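
                  Here’s a rough sketch of that shape in Python, using sqlite3 as a stand-in transactional storage engine (TokuMX’s engine is its own fractal-tree implementation; the table and column names here are made up for illustration):

                  ```python
                  import sqlite3

                  conn = sqlite3.connect(":memory:")
                  conn.execute("CREATE TABLE docs"
                               " (id INTEGER PRIMARY KEY, a INTEGER)")
                  conn.execute("CREATE TABLE oplog"
                               " (pos INTEGER PRIMARY KEY,"
                               "  delta INTEGER, applied INTEGER)")
                  conn.execute("INSERT INTO docs VALUES (10, 0)")
                  conn.execute("INSERT INTO oplog VALUES (1, 4, 0)")
                  conn.commit()

                  def apply_entry(pos):
                      # One transaction: both writes commit or neither
                      # does. If the process dies in here, recovery
                      # rolls both back, so the entry is still marked
                      # unapplied and can be retried safely.
                      with conn:
                          (delta,) = conn.execute(
                              "SELECT delta FROM oplog"
                              " WHERE pos = ? AND applied = 0",
                              (pos,)).fetchone()
                          conn.execute(
                              "UPDATE docs SET a = a + ? WHERE id = 10",
                              (delta,))
                          conn.execute(
                              "UPDATE oplog SET applied = 1 WHERE pos = ?",
                              (pos,))

                  apply_entry(1)
                  print(conn.execute(
                      "SELECT a FROM docs WHERE id = 10").fetchone())  # (4,)
                  ```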

                  1. 1

                    How do you guarantee that applying an operation and storing that it was applied happen atomically?

                    1. 1

                      Both are writes to the storage engine that happen in a transaction. Making transactions atomic with respect to crashes is an old and well-studied problem. The most well known paper is probably the ARIES paper, which is how our transaction log works, pretty much. Most durable, transactional storage engines (of which there are many, including BDB and InnoDB) use an ARIES-style log and provide multi-element transactions that are atomic with respect to crash.
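
                      As a toy illustration of the ARIES redo rule (single page, my own simplification): each page records the LSN of the last log record applied to it, and redo skips records the page has already seen, so even relative, non-idempotent changes are applied exactly once during recovery.

                      ```python
                      log = [
                          {"lsn": 1, "page": "p10", "delta": 4},
                          {"lsn": 2, "page": "p10", "delta": 1},
                      ]

                      # Crash happened after the write for LSN 1
                      # already reached disk.
                      page = {"value": 4, "page_lsn": 1}

                      def redo(page, log):
                          for rec in log:
                              # The pageLSN comparison makes redo of
                              # relative changes safe: a record the page
                              # has already seen is never applied again.
                              if (rec["page"] == "p10"
                                      and rec["lsn"] > page["page_lsn"]):
                                  page["value"] += rec["delta"]
                                  page["page_lsn"] = rec["lsn"]

                      redo(page, log)
                      print(page)  # {'value': 5, 'page_lsn': 2}
                      ```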

                      1. 1

                        So in the ARIES log you materialize the actual changes from the relative ones in the oplog? Thanks for the information.