Threads for buch

    Hi! In what has apparently become a Notion party, I work with Jake :)

    I come from the mobile side (Android) but have spent my career being sync adjacent. It’s wonderful to see someone treating the mobile <-> server relationship as Just Another Distributed System. Replicache incorporates all of the best things I’ve seen from a half dozen attempts at sync: git-style branched state, pending transactions cleared as part of sync, optimistic updates. “transactions are guaranteed to be applied atomically, in the same order, across all clients” is new to me for client sync and very cool!

    I’m actually in the middle of doing the per-document coarse diffing you were talking about! Some super rough thoughts:

    • Feels like there’s a spectrum of offline with Replicache’s all-data-on-client/all-data-on-sync at one end, all-data-on-client/delta-sync in the middle, and select-data-on-client/delta-sync at the other.
    • It’s so hard to make anything invariant. And what invariants I can set up are kinda subtle. I’m currently relying pretty heavily on a globally (well per-DB-shard) ordered but non-sequential ID to do diffing.
    • I’m also leaning very heavily on the specifics of Notion’s data model, I can only imagine how tricky it is to build a general solution.
    • It’s looking like the client is going to have to basically pester the server with sync calls until nothing is left, which feels inelegant.

    Two questions:

    • If you had a magic wand, is there an invariant you’d impose or a constraint you’d relax? I’m kinda thinking of Spanner/TrueTime and being able to trust time to a greater degree than usual.
    • Would you find any use in having versioned data on the server in addition to on the client?

      Replicache incorporates all of the best things I’ve seen from a half dozen attempts at sync

      Thanks. I was involved in or nearby many attempts at this at Google over the years, so have been thinking about it on and off for a long time.

      Feels like there’s a spectrum of offline with Replicache’s all-data-on-client/all-data-on-sync at one end, all-data-on-client/delta-sync in the middle, and select-data-on-client/delta-sync at the other.

      I don’t quite follow the spectrum you are trying to set up, but it seems interesting. Can you expand? Note that Replicache does not actually send all data to clients in every sync. Only deltas are sent to the clients. The “diff server” does fetch a full snapshot from the customer server, but (a) that is server-to-server communication, and (b) we hope to expand this in the future to allow the customer server to return coarse-grained diffs. At the limit, the server can just return diffs itself and you don’t need the diff server.
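      To make the snapshot-versus-delta distinction concrete, here is a rough sketch of what a diff step of this kind could look like. It is not Replicache's actual diff server or wire format: `ClientView`, `PatchOp`, `diffViews`, and `applyPatch` are names invented for illustration, and the client view is assumed to be a flat key/value map.

```typescript
// Hypothetical model: a client view is a flat map from key to string value.
type ClientView = Record<string, string>;

// Hypothetical patch format: either set a key or delete it.
type PatchOp =
  | { op: "put"; key: string; value: string }
  | { op: "del"; key: string };

// Compute the operations that turn `prev` (what the client last received)
// into `next` (the full snapshot just fetched from the customer server).
function diffViews(prev: ClientView, next: ClientView): PatchOp[] {
  const patch: PatchOp[] = [];
  for (const key of Object.keys(next)) {
    // New keys and changed values both become a "put".
    if (!(key in prev) || prev[key] !== next[key]) {
      patch.push({ op: "put", key, value: next[key] });
    }
  }
  for (const key of Object.keys(prev)) {
    // Keys that disappeared from the snapshot become a "del".
    if (!(key in next)) {
      patch.push({ op: "del", key });
    }
  }
  return patch;
}

// Apply a patch to a copy of `view`, as a client would on receiving a delta.
function applyPatch(view: ClientView, patch: PatchOp[]): ClientView {
  const result: ClientView = { ...view };
  for (const p of patch) {
    if (p.op === "put") {
      result[p.key] = p.value;
    } else {
      delete result[p.key];
    }
  }
  return result;
}

const lastSent: ClientView = { "todo/1": "buy milk", "todo/2": "write docs" };
const fresh: ClientView = { "todo/1": "buy oat milk", "todo/3": "ship it" };

// Only the delta crosses the network to the client, not the full snapshot.
const delta = diffViews(lastSent, fresh);
const clientAfter = applyPatch(lastSent, delta);
```

      The full snapshot only travels between the two servers; the client receives `delta` and ends up with the same view the server computed.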

      It’s so hard to make anything invariant. And what invariants I can set up are kinda subtle. I’m currently relying pretty heavily on a globally (well per-DB-shard) ordered but non-sequential ID to do diffing.

      Here are some of the invariants that Replicache maintains, perhaps they will be useful for you:

      • After all changes are pushed from a client, the client view on the client will exactly match the client view on the server. We verify this internally with a checksum, though if the client is correct, the checksum would never mismatch. This seems minor, but I think it is actually the most important invariant Replicache maintains – the system is so much easier to reason about if you can know with assurance exactly what state clients have.
      • Replicache guarantees that when a transaction is unwound, no matter what it did, it is completely undone. Replicache supports client-side indexes (https://replicache-sdk-js.now.sh/classes/replicache.html#createindex), and we guarantee that those indexes are also always exactly up to date with primary data.
      • Replicache will never remove a speculative change until it is specifically acknowledged by the server in the client view response. There is no assumption that if a mutation is accepted by the server, it is immediately visible in the next request.
      • Replicache transactions on the client (either read or write) are completely isolated. To the transaction code it appears as if it is alone in the universe with exclusive access to the cache. This dramatically simplifies thinking about conflicts.
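      The speculative-change invariant in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not Replicache's implementation: `SpeculativeCache`, `Mutator`, `onServerResponse`, and the `lastMutationId` parameter are hypothetical names, and mutators are assumed to be pure functions so they can be replayed safely.

```typescript
type State = Record<string, number>;

// Hypothetical: a mutator is a pure function from old state to new state.
type Mutator = (state: State) => State;

interface PendingMutation {
  id: number;
  apply: Mutator;
}

class SpeculativeCache {
  private confirmed: State = {};
  private pending: PendingMutation[] = [];
  private nextId = 1;

  // Record a local change; it stays pending until the server acknowledges it.
  mutate(apply: Mutator): number {
    const id = this.nextId++;
    this.pending.push({ id, apply });
    return id;
  }

  // What the UI sees: the confirmed state with pending mutations replayed on top.
  view(): State {
    return this.pending.reduce((s, m) => m.apply(s), this.confirmed);
  }

  // Assumed response shape: the server's state plus the highest mutation id
  // from this client that the state already reflects.
  onServerResponse(serverState: State, lastMutationId: number): void {
    this.confirmed = serverState;
    // Drop only mutations the server has explicitly acknowledged.
    this.pending = this.pending.filter(m => m.id > lastMutationId);
  }

  pendingCount(): number {
    return this.pending.length;
  }
}

const increment: Mutator = s => ({ ...s, count: (s.count || 0) + 1 });

const cache = new SpeculativeCache();
cache.mutate(increment); // id 1
cache.mutate(increment); // id 2

// The server responds with someone else's change, but has seen neither mutation.
cache.onServerResponse({ other: 5 }, 0);
const beforeAck = cache.view();

// The server acknowledges mutation 1 only; mutation 2 is replayed on top.
cache.onServerResponse({ other: 5, count: 1 }, 1);
const afterFirstAck = cache.view();
const pendingAfterFirstAck = cache.pendingCount();

// The server acknowledges mutation 2; nothing is left to replay.
cache.onServerResponse({ other: 5, count: 2 }, 2);
const afterSecondAck = cache.view();
```

      In this sketch the local view shows `count` as 2 at every step, even while the server has only seen some of the mutations, because unacknowledged changes are never discarded.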

      I’m also leaning very heavily on the specifics of Notion’s data model, I can only imagine how tricky it is to build a general solution.

      I actually think in some ways it is easier to make the more general thing that allows transactions to be arbitrary. I’m not sure about the details of Notion, but in many attempts at this people try to restrict the data model to either things that always merge, or operations that know how to undo themselves. In both these approaches there’s an ongoing tax to working with the system – you have to learn how to use these specialized data structures, and/or ensure that you keep the undo operation working properly.

      Replicache instead takes on a big up-front cost (a versioned transactional cache that can be cleanly unwound) in exchange for less on-going complexity: any mutation works and can be guaranteed to revert perfectly. You don’t have to think so much about merge and sync on a day-to-day basis.
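      One way to picture "any mutation works and can be reverted perfectly" is a cache that keeps a snapshot per committed version, so undoing never depends on what a transaction did. The sketch below is a toy under that assumption: `VersionedCache`, `write`, and `unwindTo` are invented names, and it copies the whole map on every commit, where a real implementation would presumably share structure between versions.

```typescript
type Data = Record<string, string>;

class VersionedCache {
  // versions[0] is the empty starting state; each commit appends a snapshot.
  private versions: Data[] = [{}];

  // Index of the most recent committed version.
  head(): number {
    return this.versions.length - 1;
  }

  // Return a copy of the latest committed state.
  read(): Data {
    return { ...this.versions[this.head()] };
  }

  // Run an arbitrary write transaction against a private draft copy.
  // If it throws, the draft is discarded and no version is created.
  write(tx: (draft: Data) => void): number {
    const draft: Data = { ...this.versions[this.head()] };
    tx(draft);
    this.versions.push(draft);
    return this.head();
  }

  // Undo everything committed after `version`, whatever those transactions did.
  unwindTo(version: number): void {
    if (version < 0 || version > this.head()) {
      throw new Error("unknown version " + version);
    }
    this.versions.length = version + 1;
  }
}

const docs = new VersionedCache();
const v1 = docs.write(d => {
  d["doc/title"] = "Draft";
});
docs.write(d => {
  d["doc/title"] = "Final";
  d["doc/body"] = "Hello";
});

// A transaction that fails halfway leaves no trace.
let failed = false;
try {
  docs.write(d => {
    d["doc/title"] = "Oops";
    throw new Error("validation failed");
  });
} catch (e) {
  failed = true;
}
const afterFailure = docs.read();

// Unwinding needs no per-operation undo logic.
docs.unwindTo(v1);
const afterUnwind = docs.read();
```

      The mutation code never has to know how to reverse itself; the cost is paid once, in the cache.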

      It’s looking like the client is going to have to basically pester the server with sync calls until nothing is left, which feels inelegant.

      I don’t follow this. Are you talking about in your system or in Replicache? I think in any model there is always the case that the user adds more data while sync is happening so you’ll have to go again. It sounds like you’re talking about something else though.

      If you had a magic wand, is there an invariant you’d impose or a constraint you’d relax? I’m kinda thinking of Spanner/TrueTime and being able to trust time to a greater degree than usual.

      I find the dag-y / git-like model of time very easy to reason about, and the performance of massive distributed databases is not needed in this application. So no, I can’t think of any constraint that would be useful to relax, on the client side.

      Would you find any use in having versioned data on the server in addition to on the client?

      This would greatly help, but it’s more difficult than it seems. Versioned databases already exist. But a sync system doesn’t need databases to be versioned, it needs queries to be versioned. Clients don’t have a copy of the full db, they have a projection of it. Typically an extremely reduced projection. (This is part of what hampered Couchbase adoption.)

      What we need to send to clients is a diff over that projection. To make matters worse most applications do work at the application layer “on top” of queries. The thing that needs to be versioned is not the db, or even a query result, it’s the data that gets sent to the client.

      I think the work that materialize.io is doing is very relevant here, but they are targeting huge backend datastores.

      This is why I like the diff server approach. Yeah it’s inefficient, but it’s simple and guaranteed to work. If you do it on a document-by-document level for, e.g., Notion, you’re basically sending the complete content (modulo blobs) of a single document between servers on the same network every time you sync that document, which doesn’t seem terrible to me.

      But yes, for absolutely optimal performance, you’d want to be able to subscribe to a query on the server, and get deltas directly from it, without computing a diff.