This is a good description of a necessary sequence that I learned on-the-job and have never seen written down before (maybe I just haven’t been looking in the right places?)
I actually had a blog post bookmarked, but it’s been 404’ing for a long time now, and I think the list was actually missing some details that I’ve been running into recently as well.
That said, there are a lot of details involved in running a highly-available system (my recommendation would be to just not, if you can get away with downtime), and it can be a big pain. The things we do to match our users’ expectations.
Great to see this written down somewhere. I remember having to independently figure this exact sequence out the first time I was lead developer on a system that required zero-downtime code deployment.
What the article doesn’t convey is how much of a hassle this process is. If you did a code diff between the initial and final versions of the code, they might differ by only a handful of lines, but you’ll probably end up writing dozens or hundreds of lines of code to walk through all the bidirectionally-compatible transition steps, and you’ll end up babysitting all the migrations and the sequence of production releases. (Though one nice thing is that if you have zero-downtime deployment, you can generally do it during regular business hours without disrupting users.)
The transition to the new data model almost always has the same basic shape with the exact same sequence of steps the article describes. For a while, every time I had to go through this, I tried to figure out if it was possible to abstract away or automate some or all of the transformation, but I didn’t see a way to do it that would be a net savings in time or effort. The outline is the same each time but the actual transition logic is always highly application-specific and doesn’t seem well-suited to a generic abstraction layer.
Lol, I have exactly this article in the content plan of my substack. I’m not going to read your piece to prevent contamination (but I’ve upvoted it after skimming).
I do think this could be a bit overly pedantic. I work at , and version 3: read from the new representation (while still writing to the old) is something we would typically skip - we would go from 2 to 4 using a knob/killswitch/rollout config/whatever you want to call it. They deploy (just about) instantly to all machines, so there is not really any downtime. But without the infrastructure to deploy configs that fast, I see how it could be necessary.
When would you do your data migration for older records? It does depend on the velocity of your changes (this flow is really important when working on your primary data model, for example, where you have near-constant reads and writes, and migrations take time to run), but I feel like just a knob seems a bit dangerous. Of course, if you’re only looking at a lower-velocity table, or stale reads aren’t a big deal, then you can just move stuff, then like… run the backfill twice. Bit data dependent.
There are some alternatives though! For example you can track the “version” of your data on a row-basis. For example your JSON structure embeds a version parameter. That way your read code can on-the-fly transform to the latest version, and backfilling switches things over on the fly as well.
I’ve found that strategy to be fiddly if you aren’t just using a loosely-validated JSON though.
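For anyone who wants to see the row-level version tag concretely, here is a minimal sketch in Python. Everything in it is made up for illustration: the `version` key, the assumption that untagged rows are version 1, and the example change (splitting a `name` field into `first_name` and `last_name`). It is not taken from the article or from any particular codebase.

```python
import json

CURRENT_VERSION = 2


def _v1_to_v2(doc):
    # Hypothetical schema change: split "name" into first/last name.
    first, _, last = doc["name"].partition(" ")
    out = {k: v for k, v in doc.items() if k != "name"}
    out.update(first_name=first, last_name=last, version=2)
    return out


# One upgrade step per source version; each step returns the next version.
UPGRADES = {1: _v1_to_v2}


def load(raw):
    """Parse a stored row and upgrade it in memory to CURRENT_VERSION."""
    doc = json.loads(raw)
    doc.setdefault("version", 1)  # assumption: untagged rows are version 1
    while doc["version"] < CURRENT_VERSION:
        doc = UPGRADES[doc["version"]](doc)
    return doc


def dump(doc):
    """Serialize a row, always tagging it with the current version."""
    return json.dumps({**doc, "version": CURRENT_VERSION})


def backfill_row(raw):
    """Rewrite one stored row in the latest format (a no-op if already current)."""
    return dump(load(raw))
```

Readers always go through `load`, so old and new rows can coexist, and the backfill is just `backfill_row` applied to every row at whatever pace the table tolerates. The fiddly part is that every upgrade step has to stay around until no row with that version can exist anymore.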
Basically I’m saying version 2 would read old, write old & new, and also have another codepath (behind a killswitch to enable it) that would read new, write new. No need for an intermediate version that writes to both old and new. You would backfill the “new” data for existing records while v2 is running in production, then flip the switch as soon as you are done with the backfill. You can then reap the knob (another version, which would correspond to v4 in the article) and then drop the old column(s).
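A rough sketch of that killswitch flow, in Python, with an in-memory dict standing in for the table and another dict standing in for the fast-deploying config. The `UserStore` class, the `use_new_columns` flag name, and the name-splitting change are all hypothetical, chosen only to make the two codepaths visible.

```python
class UserStore:
    def __init__(self, flags):
        self.rows = {}      # stand-in for the database table
        self.flags = flags  # stand-in for the rollout config / killswitch

    def _use_new(self):
        return self.flags.get("use_new_columns", False)

    def write(self, user_id, full_name):
        row = self.rows.setdefault(user_id, {})
        first, _, last = full_name.partition(" ")
        # The new representation is written on both codepaths.
        row["first_name"], row["last_name"] = first, last
        if self._use_new():
            # Switch on: write new only, so drop the stale old value.
            row.pop("name", None)
        else:
            # Switch off: keep the old representation up to date too.
            row["name"] = full_name

    def read(self, user_id):
        row = self.rows[user_id]
        if self._use_new():
            return f"{row['first_name']} {row['last_name']}".strip()
        return row["name"]

    def backfill(self):
        # Run while the switch is still off: fill in the new columns
        # for rows that were written before the dual-write code shipped.
        for row in self.rows.values():
            if "first_name" not in row:
                first, _, last = row["name"].partition(" ")
                row["first_name"], row["last_name"] = first, last
```

The order would be: deploy this with the flag off, run `backfill()`, flip `use_new_columns`, then ship the version that deletes the flag and the old codepath. One thing the sketch makes obvious is that once the flag is on, the old column stops being maintained, so flipping it back off afterwards would read stale or missing data.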
The version tag is interesting. We use something similar to gRPC internally, and we use the same approach for our databases: basically guessing the version based on which fields are present/missing, and relying on every engineer to clean up after any migration they start. The version tag might be a better alternative.