Hi! In what has apparently become a Notion party, I work with Jake :)
I come from the mobile side (Android) but have spent my career being sync-adjacent. It’s wonderful to see someone treating the mobile <-> server relationship as Just Another Distributed System. Replicache incorporates all of the best things I’ve seen from a half dozen attempts at sync: git-style branched state, pending transactions cleared as part of sync, optimistic updates. “transactions are guaranteed to be applied atomically, in the same order, across all clients” is new to me for client sync and very cool!
I’m actually in the middle of doing the per-document coarse diffing you were talking about! Some super rough thoughts:
Feels like there’s a spectrum of offline with Replicache’s all-data-on-client/all-data-on-sync at one end, all-data-on-client/delta-sync in the middle, and select-data-on-client/delta-sync at the other.
It’s so hard to make anything invariant. And what invariants I can set up are kinda subtle. I’m currently relying pretty heavily on a globally (well per-DB-shard) ordered but non-sequential ID to do diffing.
I’m also leaning very heavily on the specifics of Notion’s data model, I can only imagine how tricky it is to build a general solution.
It’s looking like the client is going to have to basically pester the server with sync calls until nothing is left, which feels inelegant.
Two questions:
If you had a magic wand is there an invariant you’d impose or a constraint you’d relax? I’m kinda thinking of Spanner/TrueTime and being able to trust time to a greater degree than usual.
Would you find any use in having versioned data on the server in addition to on the client?
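To make the "ordered but non-sequential ID" diffing and the "pester the server until nothing is left" loop concrete, here is a rough sketch. Everything in it (`Change`, `Server`, `changes_since`, `sync_until_empty`, the page size) is made up for illustration and is not Notion's or Replicache's actual API. It only assumes that IDs are ordered within a shard and may have gaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    id: int      # ordered within a shard, but with gaps (non-sequential)
    key: str
    value: str


class Server:
    """Hypothetical server holding an ordered change log for one shard."""

    def __init__(self, changes):
        self.changes = sorted(changes, key=lambda c: c.id)

    def changes_since(self, cursor, limit):
        # Return up to `limit` changes with an id strictly greater than the cursor.
        newer = [c for c in self.changes if c.id > cursor]
        return newer[:limit]


def sync_until_empty(server, cache, cursor, page_size=2):
    """Call the server repeatedly until a page comes back empty."""
    calls = 0
    while True:
        page = server.changes_since(cursor, page_size)
        calls += 1
        if not page:
            return cursor, calls
        for change in page:
            cache[change.key] = change.value
        cursor = page[-1].id  # gaps are fine: only the ordering matters


server = Server([Change(10, "a", "1"), Change(17, "b", "2"), Change(42, "a", "3")])
cache = {}
cursor, calls = sync_until_empty(server, cache, cursor=0)
```

The final empty page is the "nothing is left" signal, so the client always makes one more call than there are pages of data.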
Replicache incorporates all of the best things I’ve seen from a half dozen attempts at sync
Thanks. I was involved in or nearby many attempts at this at Google over the years, so have been thinking about it on and off for a long time.
Feels like there’s a spectrum of offline with Replicache’s all-data-on-client/all-data-on-sync at one end, all-data-on-client/delta-sync in the middle, and select-data-on-client/delta-sync at the other.
I don’t quite follow the spectrum you are trying to set up, but it seems interesting. Can you expand? Note that Replicache does not actually send all data in every sync to clients. Only deltas are sent to the clients. The “diff server” does fetch a full snapshot from the customer server, but (a) that is server-to-server communication, and (b) we hope to expand this in the future to allow the customer server to return coarse-grain diffs. At the limit, the server can just return diffs itself and you don’t need the diff server.
It’s so hard to make anything invariant. And what invariants I can set up are kinda subtle. I’m currently relying pretty heavily on a globally (well per-DB-shard) ordered but non-sequential ID to do diffing.
Here are some of the invariants that Replicache maintains, perhaps they will be useful for you:
After all changes are pushed from a client, the client view on the client will exactly match the client view on the server. We verify this internally with a checksum, though if the client is correct, the checksum would never mismatch. This seems minor, but I think it is actually the most important invariant Replicache maintains – it makes life so much easier to reason about if you can know with assurance exactly what state clients have.
Replicache guarantees that when a transaction is unwound, no matter what it did, it is completely undone. Replicache supports client-side indexes (https://replicache-sdk-js.now.sh/classes/replicache.html#createindex), and we guarantee that those indexes are also always exactly up to date with primary data.
Replicache will never remove a speculative change until it is specifically acknowledged by the server in the client view response. There is no assumption that if a mutation is accepted by the server it is immediately visible in the next request.
Replicache transactions on the client (either read or write) are completely isolated. To the transaction code it appears as if it is alone in the universe with exclusive access to the cache. This dramatically simplifies thinking about conflicts.
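As a rough illustration of the checksum invariant above: the sketch below hashes a canonical serialization of the client view on both sides and compares the results. The hashing scheme (`sha256` over sorted-key JSON) and the `view_checksum` name are assumptions for the example. The thread does not say how Replicache computes its checksum.

```python
import hashlib
import json


def view_checksum(view):
    # Serialize with sorted keys so the same contents always hash the same,
    # regardless of the order in which keys were inserted.
    canonical = json.dumps(view, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


server_view = {
    "todo/1": {"text": "buy milk", "done": False},
    "todo/2": {"text": "ship", "done": True},
}
# Same contents, different insertion order.
client_view = {
    "todo/2": {"done": True, "text": "ship"},
    "todo/1": {"done": False, "text": "buy milk"},
}
# A client that has drifted from the server.
stale_view = {**client_view, "todo/1": {"text": "buy milk", "done": True}}

in_sync = view_checksum(client_view) == view_checksum(server_view)
drifted = view_checksum(stale_view) != view_checksum(server_view)
```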
I’m also leaning very heavily on the specifics of Notion’s data model, I can only imagine how tricky it is to build a general solution.
I actually think in some ways it is easier to make the more general thing that allows transactions to be arbitrary. I’m not sure about the details of Notion, but in many attempts at this people try to restrict the data model to either things that always merge, or operations that know how to undo themselves. In both these approaches there’s an ongoing tax to working with the system – you have to learn how to use these specialized data structures, and/or ensure that you keep the undo operation working properly.
Replicache instead takes on a big up-front cost (a versioned transactional cache that can be cleanly unwound) in exchange for less on-going complexity: any mutation works and can be guaranteed to revert perfectly. You don’t have to think so much about merge and sync on a day-to-day basis.
It’s looking like the client is going to have to basically pester the server with sync calls until nothing is left, which feels inelegant.
I don’t follow this. Are you talking about in your system or in Replicache? I think in any model there is always the case that the user adds more data while sync is happening so you’ll have to go again. It sounds like you’re talking about something else though.
If you had a magic wand is there an invariant you’d impose or a constraint you’d relax? I’m kinda thinking of Spanner/TrueTime and being able to trust time to a greater degree than usual.
I find the dag-y / git-like model of time very easy to reason about, and the performance of massive distributed databases is not needed in this application. So no, I can’t think of any constraint that would be useful to relax, on the client side.
Would you find any use in having versioned data on the server in addition to on the client?
This would greatly help, but it’s more difficult than it seems. Versioned databases already exist. But a sync system doesn’t need databases to be versioned, it needs queries to be versioned. Clients don’t have a copy of the full db, they have a projection of it, typically an extremely reduced one. (This is part of what hampered Couchbase adoption.)
What we need to send to clients is a diff over that projection. To make matters worse, most applications do work at the application layer “on top” of queries. The thing that needs to be versioned is not the db, or even a query result; it’s the data that gets sent to the client.
I think the work that materialize.io is doing is very relevant here, but they are targeting huge backend datastores.
This is why I like the diff server approach. Yeah it’s inefficient, but it’s simple and guaranteed to work. If you do it on a document-by-document level for e.g., Notion, you’re basically sending the complete content (modulo blobs) of a single document between servers on the same network every time you sync that document, which doesn’t seem terrible to me.
But yes, for absolutely optimal performance, you’d want to be able to subscribe to a query on the server, and get deltas directly from it, without computing a diff.
Hi, I work at Notion, a collaborative note-taking and project management company. We use many similar ideas.
we use “operations” for the majority of the changes in our app; same concept as “mutations”;
we use IndexedDB k/v store or SQLite store to cache records on the client device in a similar way as described.
we apply changes locally first on the cache and then store the operation queue for the next time we’re online, but we don’t replay changes on the client when the server sends new versions yet - because the client sends the mutations, and the server pushes those updates back down again. As you might imagine, consistency leaves something to be desired.
We’re in the research phase of improving our caching system. Your approach is interesting, but we have far more data than 30MB of JSON to sync per client (easily upwards of 10x more, depending on the user data) that we want to maintain on the client; it’d be prohibitively expensive to use your fetchall-then-diff for every one of our users. We may need to take the “subscribe-to-query” architecture all the way to the backend source of truth data stores.
It’s a very interesting space to look at because offline-first and collaboration are now critical table-stakes features, and many of the vendors out there right now are quite rudimentary. For example, Firebase seems like a joke due to scale limits, and Firestore doubly so because of the lack of serious tooling. It’s heartening to see something with a similar overall model to our system; hopefully it means we’re both on the right track.
I spoke with Chet Corcos and a few others (maybe you? if so, apologies for forgetting) at Notion early on in the development of Replicache about the system used there. I am not sure if it has changed since then. It was encouraging then as now to find that Replicache is basically a generalization of what you are doing.
Your approach is interesting, but we have far more data than 30MB of JSON to sync per client (easily upwards of 10x more, depending on the user data) that we want to maintain on the client; it’d be prohibitively expensive to use your fetchall-then-diff for every one of our users. We may need to take the “subscribe-to-query” architecture all the way to the backend source of truth data stores.
A few thoughts:
It is possible to have the client view return a “coarse diff” to the diff server, rather than a full snapshot. The application can progressively increase the granularity of diff it returns to the diff server as it wants to trade complexity for performance. In an application like Notion, a nice place to draw boundaries would be the document level: when a document is updated, return a diff that contains the entire state of that document, but no other documents.
I agree that the subscribe-to-query at the backend is the dream that really makes a design like this complete. We think of the diff server as a sort of sledgehammer that makes the overall design of Replicache possible for many customers today, without dramatic server-side rearchitecture. However, if you have a more principled way to get those diffs, that is better. I’m hopeful that databases like FaunaDB (or maybe Materialize) will enable this functionality over time, and Replicache can become purely client-side technology.
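A document-level coarse diff along the lines described above could be sketched like this. The per-document `version` field, the function names, and the data shapes are assumptions for the example and not the Replicache client view format. Deleted documents are not handled here and would need something like tombstones.

```python
def coarse_diff(documents, since_version):
    """Return the full state of every document changed after since_version."""
    return {
        doc_id: doc
        for doc_id, doc in documents.items()
        if doc["version"] > since_version
    }


def apply_coarse_diff(known_docs, diff):
    # Replace changed documents wholesale; untouched documents stay as they are.
    merged = dict(known_docs)
    merged.update(diff)
    return merged


server_docs = {
    "doc1": {"version": 3, "blocks": ["intro"]},
    "doc2": {"version": 7, "blocks": ["plan", "budget"]},
}
# State as of version 5: doc2 is stale, doc1 is current.
known_docs = {
    "doc1": {"version": 3, "blocks": ["intro"]},
    "doc2": {"version": 5, "blocks": ["plan"]},
}

diff = coarse_diff(server_docs, since_version=5)
updated = apply_coarse_diff(known_docs, diff)
```

Only `doc2` travels over the wire here, in full. The granularity could later be tightened from whole documents to individual blocks without changing the shape of the exchange.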
Thank you for the substantive reply Jake.
You’re totally right about coarse-grained sync; we’re discussing it as our next step.
I think Cloudflare Durable Objects are also very interesting here, although it remains to be seen what the practical limits are.
TL;DR: The conflict resolution algorithm seems to be app-specific code that runs in the database. I didn’t get more details; the documentation is unclear.
Pricing is ridiculous.
What would you price this at? It looks high for my company’s current scale [and at this point we want to own the whole stack anyways], but an earlier Notion might have found this offering attractive.
I realize you’re being derisive, but in a sense, yeah:
The fact that conflict resolution is handled by running normal, arbitrary functions serially against the database on client and server is the point. Other systems either restrict you to specialized data structures that can always merge (e.g., Realm), or force you to write out-of-band conflict resolution code that is difficult to reason about (e.g., Couchbase). In Replicache you use a normal transactional database on the server, and a plain old key/value store on the client. You modify these stores by writing code that feels basically the same as what you’d write if you weren’t in an offline-first system. It’s a feature.
===
TL;DR: Replicache is a versioned cache you embed on the client side. Conflict resolution happens by forking the cache and replaying transactions against newer versions.
When a transaction commits on the client, Replicache adds a new entry to the history, and the entry is annotated with the name of the mutation and its arguments (as JSON).
During sync, Replicache forks the cache and sends pending requests to your server, where they are handled basically like normal REST requests by your backend. You have to defensively handle mutations server-side (but you were probably already doing that!). Replicache then fetches the latest canonical state of the data from your server, computes a delta from the fork point, applies the delta, and replays any still pending mutations atop the new canonical state. Then the fork is atomically revealed, the UI re-renders, and the no-longer needed data is collected.
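Here is a toy sketch of that fork-and-replay loop. It is not Replicache’s implementation: the `Client` class, the `MUTATORS` table, and the `last_mutation_id` acknowledgment field are assumed names, and the sketch swaps in a whole snapshot instead of computing and applying a delta.

```python
import copy


# Named mutators: plain functions that modify a key/value dict in place.
def add_item(db, name):
    db.setdefault("items", []).append(name)


def clear_items(db):
    db["items"] = []


MUTATORS = {"add_item": add_item, "clear_items": clear_items}


class Client:
    def __init__(self):
        self.base = {}      # last canonical state received from the server
        self.pending = []   # (mutation_id, name, args) not yet acknowledged
        self.view = {}      # what the UI reads: base plus replayed pending
        self.next_id = 1

    def mutate(self, name, *args):
        # Apply optimistically and record the mutation by name and arguments.
        MUTATORS[name](self.view, *args)
        self.pending.append((self.next_id, name, list(args)))
        self.next_id += 1

    def pull(self, canonical, last_mutation_id):
        # Fork: build the new state off to the side, then swap it in at the end.
        fork = copy.deepcopy(canonical)
        # Keep only mutations the server has not acknowledged yet.
        still_pending = [m for m in self.pending if m[0] > last_mutation_id]
        for _, name, args in still_pending:
            MUTATORS[name](fork, *args)  # replay atop the new canonical state
        self.base, self.pending, self.view = canonical, still_pending, fork


client = Client()
client.mutate("add_item", "milk")   # mutation 1
client.mutate("add_item", "eggs")   # mutation 2

# The server has applied mutation 1, plus another client's "bread".
client.pull({"items": ["bread", "milk"]}, last_mutation_id=1)
```

After the pull, the view contains the other client’s change, the acknowledged mutation, and the replayed pending one, and only mutation 2 is still pending.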
It is not rocket science, but it is a very practical approach to this problem informed by years of experience.
As for the price, it’s weird. Teams that have struggled with this have basically no problem at all with the price; if anything, they seem to think it’s too low.
Will it be open source?
I’m not sure yet. See https://twitter.com/aboodman/status/1221807959212118017?s=21.
Currently thinking of doing something like what MariaDB does where it starts out BSL then after a year or so becomes Apache.
This reads a lot like an advertisement for something you built yourself. That’s not necessarily bad, but it’s one of your only two contributions in 5 years. That also isn’t necessarily bad in and of itself, and the claim sounds interesting, but I can’t try it out for myself and have to provide personal data in order to get early access. Sorry sir, I respect the work you put in, but that’s a hard pass for me.
Hi! It is something I am building myself (with a small team). It’s not complete yet - we are looking for teams who need something like this to partner with.
Perhaps I should have commented to that effect when I posted it?
So advertising it is.
Yes! (Sorry if that isn’t allowed or if I violated some norm)
Wow! They don’t do branching. That blew my mind. But the way that they avoid having to branch is by an enormous amount of testing, about 14K tests in total. This is interesting, and I’m curious if it’d work in practice elsewhere. I wonder how they run policy on development, specifically about repositories and keeping things up-to-date. It doesn’t take a few hours to write some big features for Chromium, so what do people do when they have to actually merge their changes in?
The runtime switches / feature flags / etc. are the main reason you can get away without branches, even for large changes. Firefox development also follows a similar flow to what’s described here.
In many cases, you can land small pieces so you’re always integrating with trunk / master / the main thing. But for more complex things, you add a feature flag so your feature is disabled for the general user population, but you can still land your code, add unit / integration tests, etc. Then just keep building on that in successive patches until it’s complete enough to enable for users by flipping the flag on.
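For anyone who hasn’t seen the pattern, a feature flag can be as small as the sketch below. The flag name, the `FLAGS` dict, and the `overrides` argument are invented for the example. Real projects such as Firefox and Chromium have their own preference and flag systems, which this does not try to model.

```python
FLAGS = {"new_search": False}  # hypothetical flag store; off by default


def is_enabled(flag, overrides=None):
    # Overrides let tests and developers turn a flag on without shipping it.
    if overrides and flag in overrides:
        return overrides[flag]
    return FLAGS.get(flag, False)


def search(query, items, overrides=None):
    if is_enabled("new_search", overrides):
        # In-progress code path: landed on trunk, hidden from users.
        return [i for i in items if query.lower() in i.lower()]
    # Existing behavior that users keep seeing until the flag flips.
    return [i for i in items if query in i]


items = ["Tabs", "tab groups", "History"]
old_results = search("tab", items)
new_results = search("tab", items, overrides={"new_search": True})
```

Both code paths live on trunk and both can be tested on every commit, which is what makes landing a half-finished feature safe.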
Wow, a reply from a Firefox developer! Awesome. This is all very interesting, especially from my perspective of being a university student where I’m told that people do primarily branching in industry and that alternatives won’t work.
You can read more about this approach to software development by searching for “Continuous Delivery”: https://en.wikipedia.org/wiki/Continuous_delivery. It’s quite common with web services, and less common in client-side (particularly mobile) software - mainly because it’s harder to do continuous autoupdate with client-side software.
From long experience, I’d even say that branching doesn’t work, or at least works very poorly. Case in point, at work we have dozens (maybe even 100) branches going at any one point, where each commit on each branch needs to run through many tens of thousands of tests, which therefore requires hundreds of machines to run the tests sufficiently in parallel to get a reasonable response time (and by “reasonable”, I mean 60-90 minutes). It’s all incredibly wasteful, and can take weeks for changes from one team to be absorbed by another team. I often feel that we’ve enabled this by spending lots of money on the test machines, but it’s so easy to just become adapted to “that’s just the way things are”.
I don’t do a lot of product work anymore, but I have a side project I’m hoping to get into production soon. Each time I push a commit to GitHub, Travis picks it up, runs my tests, and if they pass, deploys straight to Heroku. 100% automated.
People work like this to varying extremes. Some people only deploy once a month or once a week, but I really like to deploy all the time.