1.

  1.

    Having a shadow prod seems like a great idea.

    1.

      I spent some time thinking about this when I had a database with a high rate of writes, but unfortunately I never came up with a solution. Copying the production database is certainly useful, but it comes with caveats.

      For example, if the database is sharded then copying a single shard has limited utility, and copying all of them is going to get expensive quickly.

      The other issue I had with using production data for testing is that sometimes it shouldn’t be made available to all and sundry due to privacy concerns, so there either needs to be a process for scrubbing it (error-prone) or we’re back to generating synthetic data.
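
      As a rough illustration of the scrubbing option, here’s a minimal sketch (Python) of deterministic pseudonymization; the column list and key are invented for the example, and the error-prone part in practice is being sure the PII column list is actually complete.

      ```python
      # Hypothetical scrubbing pass: replace known PII columns with stable
      # pseudonyms (HMAC), so values are unreadable but joins still line up.
      import hashlib
      import hmac

      PII_COLUMNS = {"email", "full_name", "phone"}  # assumed schema knowledge
      SECRET_KEY = b"rotate-me"                      # illustrative key only

      def scrub(row: dict) -> dict:
          """Return a copy of `row` with PII values pseudonymized."""
          return {
              col: (hmac.new(SECRET_KEY, str(val).encode(), hashlib.sha256)
                        .hexdigest()[:16]
                    if col in PII_COLUMNS else val)
              for col, val in row.items()
          }

      print(scrub({"id": 7, "email": "a@example.com", "plan": "pro"}))
      ```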

      Finally, in a high write rate scenario there are additional load issues for the application server(s) if you’re trying to write both to the production DB and a copy. It’s another layer of complexity. You can avoid that by restoring a backup and running some sort of synthetic load simulation rather than reproducing actual real-time writes, but then it’s hard to make the simulated load realistic and keep it that way over time.

      1.

        Author here.

        > For example, if the database is sharded then copying a single shard has limited utility, and copying all of them is going to get expensive quickly.

        Our database is sharded. We have tens of thousands of logical shards spread across a much smaller number of physical machines, and almost all the queries we care about run across every physical machine. For testing optimizations, copying a single physical machine is the 80/20: it lets us confirm that the queries we expect to get faster actually do get faster, and that no other queries degrade. For this to miss a regression, a large number of shards on other machines would have to behave differently under the optimization.
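
        As a hedged sketch of that 80/20 check (the connection objects, query shapes, and 10% threshold below are all illustrative, not the author’s actual tooling): restore two copies of one physical machine, apply the optimization to one, run the representative query set against both, and flag anything that got slower.

        ```python
        # Hypothetical regression check between two restored copies of one
        # physical machine: `baseline` (as-is) and `candidate` (optimized).
        import time

        REPRESENTATIVE_QUERIES = [
            # placeholder query shapes, one entry per pattern you care about
            "SELECT count(*) FROM events WHERE tenant_id = 42",
        ]

        def best_of(conn, sql, runs=5):
            """Best-of-N wall-clock time for one query, to reduce noise."""
            best = float("inf")
            for _ in range(runs):
                start = time.perf_counter()
                conn.execute(sql)
                best = min(best, time.perf_counter() - start)
            return best

        def compare(baseline, candidate, tolerance=1.10):
            """Print a REGRESSION line for any query >10% slower when optimized."""
            for sql in REPRESENTATIVE_QUERIES:
                before = best_of(baseline, sql)
                after = best_of(candidate, sql)
                status = "REGRESSION" if after > before * tolerance else "ok"
                print(f"{status:10} {before:.4f}s -> {after:.4f}s  {sql[:60]}")
        ```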

        > The other issue I had with using production data for testing is that sometimes it shouldn’t be made available to all and sundry due to privacy concerns, so there either needs to be a process for scrubbing it (error-prone) or we’re back to generating synthetic data.

        Why would data in shadow prod “be made available to all”? We have the same restrictions on accessing the data in shadow prod as we do for production.

        > Finally, in a high write rate scenario there are additional load issues for the application server(s) if you’re trying to write both to the production DB and a copy.

        For us, writes do not need to show up instantaneously. Ideally they appear within a few seconds, but it’s okay if they periodically take longer. All our writes go through Kafka before they reach our database, and the service that reads from Kafka writes to both databases. If shadow prod slows things down, writes simply queue up in Kafka, so we don’t drop any data. There will be extra latency before data shows up in production, which, while less than ideal, is acceptable.

        We’ve discussed having separate Kafka consumers, one writing to production and one writing to shadow prod, which would prevent shadow prod outages from causing latency in production. There hasn’t been a need for it yet, so we haven’t implemented it.
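
        A minimal sketch of that pipeline, assuming Python with confluent-kafka (the broker address, topic, group, and DB stand-ins are all invented for the example). The property that matters is that the offset is committed only after both writes succeed, so a slow shadow prod makes the consumer lag and Kafka absorb the backlog instead of dropping writes:

        ```python
        # Hedged sketch of a dual-writing Kafka consumer; all names are
        # illustrative placeholders, not the author's actual services.
        import json

        from confluent_kafka import Consumer

        class FakeDB:
            """Placeholder so the sketch is self-contained; swap in a real client."""
            def execute(self, sql, params):
                print("would write:", sql, params)

        prod_db, shadow_db = FakeDB(), FakeDB()

        consumer = Consumer({
            "bootstrap.servers": "kafka:9092",  # assumed broker address
            "group.id": "db-dual-writer",
            "enable.auto.commit": False,        # commit manually, after both writes
            "auto.offset.reset": "earliest",
        })
        consumer.subscribe(["writes"])          # assumed topic name

        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None or msg.error():
                continue
            row = json.loads(msg.value())
            prod_db.execute(row["sql"], row["params"])    # production write
            shadow_db.execute(row["sql"], row["params"])  # shadow write; if slow,
                                                          # the consumer lags and
                                                          # Kafka buffers upstream
            consumer.commit(message=msg)                  # at-least-once delivery
        ```

        Splitting this into two consumer groups, one per database, is exactly the isolation the author describes: a shadow prod outage would then only stall the shadow consumer’s offsets, not production writes.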

        1.

          Thanks for the extra information, it makes sense.

          > Why would data in shadow prod “be made available to all”? We have the same restrictions on accessing the data in shadow prod as we do for production.

          I think it makes sense for performance testing and general error/no-error tests on migrations. In my case, we occasionally had fairly complex data transformations happening during migrations, and if they went wrong (or even just while writing the migration) we’d need to see what happened to the data. It would have been hugely useful to run this against prod data, but privacy issues became a stumbling block.