1. 10

  2. 5

    Hehe, nice!

    Some comments:

    Read repair - Riak handles ensuring that all nodes contain the same values for a key on read.

    Roshi does this, too.

    Hinted handoff - If a node goes down, the lost data is shuffled to other nodes, ensuring the replication factor is maintained.

    That’s fair, but just some food for thought: one motivating condition of Roshi is that its operation should be as dumb (read: predictable) as possible. In that model, a dead node can definitely fire alarms about elevated error- and repair-rates. But automatically triggering data transfer might make a network problem worse. Similar considerations for Roshi’s explicit lack of elasticity.

    Roshiak only supports getting the entire dataset on read, while Roshi supports more sophisticated ‘select’ calls. There are a few possible solutions for this in the Roshiak model, depending on requirements.

    This is super critical, definitely a necessary feature :) If it’s useful to you, I’m happy to elucidate Roshi’s requirements for this method.

    1. 3

      Hey, author here. This was posted slightly prematurely, I haven’t gotten it to the point of just being able to install with one command.

      Re read repair: whoops, sorry, I meant to phrase that differently. I was trying to say that these are the features Riak has, and some of them are lacking in Roshi.

      That’s fair, but just some food for thought: one motivating condition of Roshi is that its operation should be as dumb (read: predictable) as possible.

      The current Riak model of transfers is very robust and desirable, and many people run Riak clusters that do hinted handoff. Riak supports throttling the transfers as well, and it is smart enough to talk directly to the nodes that have the desired data, distributing the network load across the cluster. As long as the MTBF is greater than the time to transfer, you can keep losing nodes, without replacing them, and not lose data. Adding this ability to Roshi itself will, IMO, be a fair amount of work.

      This is super critical, definitely a necessary feature :) If it’s useful to you, I’m happy to elucidate Roshi’s requirements for this method.

      How do you use partial selects now? One issue I ran into considering options is that you need access to the entire Roshi set in order to calculate what should actually be visible to a user. For example, if you take a bucket by time, you need to see all the values after that time to determine if anything was deleted, if I understand correctly.

      1. 2

        As long as the MTBF is greater than the time to transfer,

        Yeah—I challenge this assumption :) Failures aren’t pretty; it’s more likely several machines die concurrently than an isolated node loses a HDD, or whatever. But, it’s all fair points. Other people have other operational contexts.

        How do you use partial selects now? One issue I ran into considering options is that you need access to the entire Roshi set in order to calculate what should actually be visible to a user. For example, if you take a bucket by time, you need to see all the values after that time to determine if anything was deleted, if I understand correctly.

        A logical set is represented by two physical sets (buckets): the add and remove sets. The write operation ensures a given key is present in only one or the other. The read operation, then, is free to read from the add set only. You don’t need to see all values.
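
        This two-set scheme can be sketched as a toy in-memory model (illustrative only, not Roshi’s actual implementation; all the names here are made up):

        ```go
        package main

        import "fmt"

        // lwwSet models a logical set backed by two physical sets, each mapping
        // an element to the timestamp of its winning add or remove. An element
        // lives in at most one of the two maps at any time.
        type lwwSet struct {
        	adds    map[string]int64
        	removes map[string]int64
        }

        func newLWWSet() *lwwSet {
        	return &lwwSet{adds: map[string]int64{}, removes: map[string]int64{}}
        }

        // apply merges one write: the operation with the highest timestamp wins,
        // and the element ends up in exactly one of the two sets.
        func (s *lwwSet) apply(elem string, ts int64, isAdd bool) {
        	if s.adds[elem] >= ts || s.removes[elem] >= ts {
        		return // an equal-or-newer write already won
        	}
        	delete(s.adds, elem)
        	delete(s.removes, elem)
        	if isAdd {
        		s.adds[elem] = ts
        	} else {
        		s.removes[elem] = ts
        	}
        }

        // members reads from the add set only; the remove set never needs scanning.
        func (s *lwwSet) members() []string {
        	out := []string{}
        	for e := range s.adds {
        		out = append(out, e)
        	}
        	return out
        }

        func main() {
        	s := newLWWSet()
        	s.apply("a", 1, true)
        	s.apply("b", 2, true)
        	s.apply("a", 3, false) // a later remove wins over the earlier add
        	fmt.Println(s.members()) // prints [b]
        }
        ```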

        The important bit is we need a method of pagination. At first, that was with offset/limit parameters, but now the preferred API is with start and stop cursors. In either case, it means you need O(1) access to arbitrary points in the set, with order being defined by timestamp. So, the underlying data store needs to support something that resembles read(key string, offset timestamp, limit int), or read(key string, from timestamp, to timestamp), returning a list of values.
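
        That read shape could be sketched like this, against a hypothetical in-memory store (the sort here is just for the toy; a real backend would keep a sorted index so arbitrary seeks are cheap):

        ```go
        package main

        import (
        	"fmt"
        	"sort"
        )

        // scoredValue is a set member with its timestamp score, which defines order.
        type scoredValue struct {
        	Value string
        	Score int64
        }

        // readRange is the shape of read(key, from, to): values for key whose
        // scores fall in [from, to), newest first. The map stands in for the
        // real data store.
        func readRange(store map[string][]scoredValue, key string, from, to int64) []scoredValue {
        	vals := append([]scoredValue(nil), store[key]...)
        	sort.Slice(vals, func(i, j int) bool { return vals[i].Score > vals[j].Score })
        	out := []scoredValue{}
        	for _, v := range vals {
        		if v.Score >= from && v.Score < to {
        			out = append(out, v)
        		}
        	}
        	return out
        }

        func main() {
        	store := map[string][]scoredValue{
        		"feed:123": {{"x", 10}, {"y", 20}, {"z", 30}},
        	}
        	fmt.Println(readRange(store, "feed:123", 15, 35)) // prints [{z 30} {y 20}]
        }
        ```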

        Also, the overall system needs to support single-digit millisecond response times, with an order-of-1M key read requests per second :)

        Does it make sense?

        1. 2

          Yeah—I challenge this assumption :) Failures aren’t pretty; it’s more likely several machines die concurrently than an isolated node loses a HDD, or whatever. But, it’s all fair points. Other people have other operational contexts.

          In this case, I think you are making the argument for hinted handoff. If you have many machines down at the same time, I think one wants some effort made to restore the replication factor in case another set of machines is lost.

          I have to think about how partial reads might be done in Riak, but while I think, a few other questions:

          • How do you do read repair without reading and writing the whole set?
          • How large are the buckets? Median, mean, and 99.9th percentile?

          Also, the overall system needs to support single-digit millisecond response times, with an order-of-1M key read requests per second

          IME this is pretty well within the bounds of Riak. It handles parallel queries very well; it mostly comes down to pushing network traffic around. And because the model is eventually consistent, you can always run multiple clusters with replication between them.

          1. 1

            How do you do read repair without reading and writing the whole set?

            The read detects inconsistencies across your replicas among the values selected and returned. Any inconsistent values are sent for read-repair. See this description.
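
            A rough sketch of that value-level repair, assuming a merged read computes per-replica deltas (toy code, not Riak’s or Roshi’s actual implementation):

            ```go
            package main

            import "fmt"

            // replica maps element -> last-write timestamp for one node's copy of a set.
            type replica map[string]int64

            // readRepair merges the replicas' answers element by element (highest
            // timestamp wins), then computes, per replica, the entries that were
            // stale or missing: the values a read would send back for repair.
            func readRepair(replicas []replica) (merged replica, repairs []replica) {
            	merged = replica{}
            	for _, r := range replicas {
            		for elem, ts := range r {
            			if ts > merged[elem] {
            				merged[elem] = ts
            			}
            		}
            	}
            	for _, r := range replicas {
            		fix := replica{}
            		for elem, ts := range merged {
            			if r[elem] < ts {
            				fix[elem] = ts
            			}
            		}
            		repairs = append(repairs, fix)
            	}
            	return merged, repairs
            }

            func main() {
            	a := replica{"x": 1, "y": 5}
            	b := replica{"x": 2} // missing y, newer x
            	merged, repairs := readRepair([]replica{a, b})
            	fmt.Println(merged)  // prints map[x:2 y:5]
            	fmt.Println(repairs) // repairs[0] updates a's x; repairs[1] gives b the missing y
            }
            ```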

            How large are the buckets? median, mean, and 99.9 percentile?

            Bucket (set) size is capped at write time, and our median/mean/99.9th percentile are probably on the order of 100/1,000/10,000 elements, respectively.

            If you have many machines down at the same time, I think one wants some effort made to restore the replication factor in case another set of machines is lost.

            That’s one perspective. Another is that I don’t want my data system automatically increasing the load on my already-degraded network. That can lead to cascading failures, turning a minor operational blip into a total meltdown. (I’ve seen it happen, more times than I care to relate.)

    2. 3

      Weird question: why not use the data types now built into Riak as of 2.0?

      The only major difference I see between the Roshi set and the built-in Riak set is that the Roshi set is LWW, which means it cannot accurately track causality, whereas the Riak set is Observed-Removed, so you can supply a causal token to make sure your operation happens with the right causal history (rather than an assumed causal history). While this may sound minor, it impacts other things: LWW-sets always have tombstones that cannot be garbage collected (no matter what the Roshi README says about GC; if tombstones remain, they’ve not been GC’d), whereas our OR-sets don’t have tombstones. How do tombstones make a difference? With tombstones, storage requirements are O(number of elements inserted over the lifetime of the set); without them, storage requirements are O(number of elements currently in the set). With high churn, these two numbers can be very different.
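
      The storage difference is easy to see with a toy sketch of an LWW map under churn (illustrative only): every removed element leaves a tombstone behind.

      ```go
      package main

      import "fmt"

      // op records the last operation seen for an element: its timestamp and
      // whether it was an add. Removed elements stay behind as tombstones;
      // dropping them would let a delayed, older add resurrect the element.
      type op struct {
      	ts    int64
      	added bool
      }

      // churn adds and then removes n distinct elements, returning the live
      // membership and the number of entries actually stored.
      func churn(n int) (live, stored int) {
      	s := map[string]op{}
      	for i := 0; i < n; i++ {
      		k := fmt.Sprintf("e%d", i)
      		s[k] = op{int64(2 * i), true}    // add at time 2i (superseded below)
      		s[k] = op{int64(2*i + 1), false} // remove at time 2i+1: leaves a tombstone
      	}
      	for _, v := range s {
      		if v.added {
      			live++
      		}
      	}
      	return live, len(s)
      }

      func main() {
      	live, stored := churn(1000)
      	// Live set is empty, but storage is O(elements inserted over the lifetime).
      	fmt.Println(live, stored) // prints 0 1000
      }
      ```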

      I worked on the Riak sets, and it’s depressing to see a functionally limited implementation built on top of Riak when Russell, Sean and I worked so hard to build a complete, correct implementation into Riak itself. That’s not to say Riak’s data types shouldn’t have competitors; it just seems that no one is competing on quality, they’re just using the “CRDT” buzzword because it’s hot right now.

      I guess I can at least be glad that people are embracing Eventual Consistency models, which is progress.

      1. 2

        Cheap answer: my client doesn’t support Riak 2.0 yet and this was a Sunday project. I would use Riak 2.0 CRDTs in a true production solution over my implementation.

        Roshiak was really my knee-jerk reaction to Roshi being written in the first place, which I believe should have just been done on top of Riak.

        1. 1

          Ah, completely reasonable explanation. I guess you were just on the receiving end of me having a grump.

          I definitely suggest giving Riak 2.0 a try.