1. 24

At our company, we’ve neglected our “staging environment” for quite some time, and are feeling the pain. It is hurting our iteration speed, and even though we do rigorous unit and integration testing, undefined (or badly-defined) business logic still ends up causing problems in production. It has resulted in a culture where (client-)developers and our support team create “fake” accounts in our production environment to validate customer complaints, or to test newly built customer flows.

It has to be said that our systems are pretty stable, and the number of bugs we ship is minimal, but the real pain is the time it takes to validate whatever you are currently working on in an end-to-end environment, outside of your unit-tested world.

We are designing a plan to make real-world testing easier for our different teams, and asked them what their biggest pain points were:

  • Missing real-world data to mimic production-usage
  • Missing (or broken) key systems in staging, causing missing functionality
  • Difficult to start testing from a particular customer state onwards
  • Sometimes you really want to test the actual payment flow of customers, without using the sandbox environment of a PSP

We’ve been thinking on possible solutions, but wanted to survey the land first, and see how other companies (of different sizes) tackle these problems.

I’m interested in your thoughts on this topic: how do you handle testing functionality outside of unit and integration tests? Do you maintain two (or more) environments? Do you allow testing on your production environment, and if so, how do you model such a system to keep garbage data from impacting other systems?

  1. 20

    My advice, which is worth every penny you pay for it:

    Don’t maintain a test environment. Rather, write, maintain and use code to build an all-new copy of production, making specific changes. If you use Puppet, Chef, Ansible or the like to set up production, use the same scripts, and if you have a database, restore the latest backup and perhaps delete records.

    The specific changes may include deleting 99% of the users or other records, using the smallest possible VM instances if you’re on a public cloud, and should include removing the ability to send mail, but it ought to be a faithful copy by default. Including all the data has drawbacks; including only 1% has drawbacks. I’ve suffered both, pick your poison.

    Don’t let them diverge. Recreate the copy every week, or maybe even every night, automatically.
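    For concreteness, the weekly/nightly rebuild described above can be thought of as an explicit, ordered plan. This is only a sketch: the step names, the 1% retention figure, and the mail sink are illustrative, not any particular tool's API.

```python
def rebuild_plan(keep_users_fraction=0.01, disable_mail=True):
    """Return the ordered steps for recreating the staging copy from
    production. Each step would shell out to your real provisioning
    and DB tooling (Terraform, Ansible, pg_restore, ...)."""
    steps = [
        "tear down yesterday's staging stack",
        "provision fresh staging from the same scripts as production",
        "restore the latest production DB backup",
        f"delete all but {keep_users_fraction:.0%} of user records",
    ]
    if disable_mail:
        # A faithful copy that can still email real customers is a hazard.
        steps.append("point outbound mail at a null/sink SMTP server")
    steps.append("run smoke tests against the new copy")
    return steps

for step in rebuild_plan():
    print("-", step)
```

    The point of writing it as a plan is that every deliberate divergence from production (smaller VMs, fewer users, no mail) is visible in one place.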

    1. 9

      Seconding this.

      One of the nice things about having “staging” be basically a hot standby of production is that, in a pinch, you can cut over to serve from it if you need to. Additionally, the act of getting things organized to provision that system will usually help you spot issues with your existing production deployment, and if you can’t rebuild prod from a script automatically, you have a ticking time bomb on your hands.

      As far as database stuff goes, use the database backups from prod (hopefully taken every night) and perhaps run them through an anonymizing ETL to do things like scramble sensitive customer data and names. You can’t beat the shape (and issues) of real data for testing purposes.
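      One way to sketch that anonymizing step is hash-based pseudonymization, which keeps the shape of the data (equal inputs map to equal outputs, so joins and uniqueness constraints still hold) while making the original values unrecoverable without the salt. The function and column names here are invented for illustration:

```python
import hashlib

def pseudonymize(value: str, salt: str = "rotate-me-per-restore") -> str:
    """Deterministically scramble a sensitive field: the same input
    always yields the same token, but the original is not recoverable
    without the salt."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()[:12]
    return f"user_{digest}"

def scrub_row(row: dict, sensitive=("name", "email")) -> dict:
    """Return a copy of the row with sensitive columns pseudonymized."""
    return {k: pseudonymize(v) if k in sensitive else v
            for k, v in row.items()}

row = {"id": 42, "name": "Ada Lovelace", "email": "ada@example.com"}
print(scrub_row(row))
```

      In a real ETL you would also rotate the salt per restore, so tokens can’t be correlated across copies.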

      1. 2

        Pardon a soapbox digression: Friendlysock is a big improvement over your previous persona. Merci.

        1. 1

          It’s not a bad idea to make use of a secondary by having it be available to tests. Though I would argue instead for multiple availability zones and auto-scaling groups if you want production to be highly available. Having staging as a secondary makes it difficult for certain databases like Couchbase to do automatic failover, since the data is not in sync, and in both cases you’re going to have to spin up new servers anyway.

        2. 8

          We basically do this. Our production DB (and other production datastores) is restored every hour, so when a developer/tester runs our code they can specify --db=hourly and it will talk to the hourly copy (we actually do this through ENV variables, but you can override that with a CLI option). We do the same for daily. We don’t have a weekly.
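          That precedence (CLI flag wins, environment variable is the default, then a hard-coded fallback) might look like the following; the flag and variable names are made up for illustration, only the resolution order matters:

```python
import argparse
import os

def pick_db(argv=None, env=None):
    """Resolve which DB copy to talk to: a --db CLI flag wins,
    otherwise fall back to the APP_DB environment variable,
    otherwise 'daily'."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--db", choices=["hourly", "daily", "dev"],
                        default=env.get("APP_DB", "daily"))
    return parser.parse_args(argv).db

print(pick_db([], env={}))                               # hard-coded fallback
print(pick_db([], env={"APP_DB": "hourly"}))             # env var default
print(pick_db(["--db=dev"], env={"APP_DB": "hourly"}))   # flag beats env var
```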

          Most of our development happens in daily. Our development rarely needs to live past a day, as our changes tend to be pretty small these days. If we have some long-lived branch that needs its own DB to play in (like a huge long-lasting DB change or something) we spin out a copy of daily just for that purpose; we limit it to one, and it’s called dev.

          All of our debugging and user-issue fixing happens in hourly. It’s very rare that a user bug reaches us in under an hour and can’t be reproduced easily. When that happens we usually just wait for the next hourly tick, to make sure it’s still not reproducible before closing.

          It makes life very nice to do this. We get to debug and troubleshoot in what is essentially a live environment, with real data, without caring if we break it badly (since it’s just an at most 1 hour old copy of production, and will automatically get rebuilt every hour of every day).

          Plus this means all of our dev and test systems have the same security and access controls as production; if we are re-building them EVERY HOUR, they need to be identical to production.

          Also this is all automated, and is restored from our near-term backup(s). So we know our backups work every single hour of every day. This does mean keeping your near-term backups very close to production, since they’re tied so tightly to our development workflow. We do of course also do longer-term backups that are just bit-for-bit copies of the near-term ones frozen at a particular time (i.e. daily, weekly, monthly).

          Overall, definitely do this and make your development life lazy.

          1. 1

            I’m sorry, what is the distinction you’re making that makes this not a test environment? The syncing databases?

            1. 2

              If I understand correctly, the point is that this entire environment, infrastructure included, is effectively ephemeral. It is not a persistent set of servers with a managed set of data; instead, it’s a standby copy of production recreated every week, or day. Thus, it’s less of a classic environment and more like a temporary copy. (That is always available.)

              1. 4

                Yes, precisely.

                OP wants the test environment to be usable for testing, etc., all of which implies that for the unknown case that comes up next week, the test and production environments should be equivalent.

                One could say “well, we could just maintain both environments, and when we change one we’ll do the same change on the other”. I say that’s rubbish, doesn’t happen, sooner or later the test environment has unrealistic data and significant but unknown divergences. The way to get equivalence is to force the two to be the same, so that

                • quick hacks done during testing get wiped and replaced by a faithful copy of production every night or Sunday
                • mistakes don’t live forever and slowly increase divergence
                • data is realistic by default and every difference is a conscious decision
                • people trust that the test environment is usable for testing

                Put differently, the distinction is not the noun (“environment”) but the verb (“maintain” vs “regenerate”).

                1. 2

                  Ah, okay. That’s an interesting distinction you make – I take it for granted that the entire infrastructure is generated with automation and hence can be created / destroyed at will.

                  1. 2

                    LOLWTFsomething. Even clueful teams fail a little now and then.

                    Getting the big important database right seems particularly difficult. Nowhere I’ve worked and nowhere I’ve heard details about was really able to tear down and set up the database without significant downtime.

          2. 7

            “Missing real-world data to mimic production-usage”

            It’s tempting to make an ETL pipeline to copy prod data to staging/dev servers (of course redacting/replacing sensitive data along the way). My favorite strategy is to add a moving part, though: the process for copying data to the staging/dev servers should work by restoring from the prod backup. It drives you to have a quick, clean, reliable process of recovering from backup, and it ensures you’re constantly checking that prod backups are happening.
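            A small sketch of the “restore is the copy mechanism” idea: gate the staging refresh on backup freshness, so a silently broken backup job fails the refresh loudly within a day instead of surfacing during a disaster. The 26-hour threshold and function name are arbitrary examples:

```python
from datetime import datetime, timedelta, timezone

def check_backup_fresh(backup_taken_at, max_age=timedelta(hours=26)):
    """Refuse to refresh staging from a stale backup. Because the dev
    environment is rebuilt from backups constantly, this check runs
    constantly too, which is the point."""
    age = datetime.now(timezone.utc) - backup_taken_at
    if age > max_age:
        raise RuntimeError(f"newest prod backup is {age} old; "
                           "check the backup job")
    return True

# A 2-hour-old backup passes; a 3-day-old one would raise.
check_backup_fresh(datetime.now(timezone.utc) - timedelta(hours=2))
```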

            1. 4

              We’ve been discussing potential solutions, and came up with three avenues, listing their pros and cons. We’re leaning towards a “Sandbox accounts in production” approach at the moment, but recognize that none of the solutions are free, and whatever we decide on will have a big impact going forward.

              Production-like environment

              + Clear, separate environment
              + No impact on production data / users (events, emails)
              + Not a problem if we mess up data without cleaning up afterwards
              + Runs the same code as on production
              + Requires very little changes to existing code
              - non-production code can (and will!) be deployed, making it no longer "production-like"
              - data needs to be kept in sync, to be usable by clients
              - every service needs to run both a production instance, and a production-like instance
              - all external services need to work with this production-like setup (payments, emails, etc...)
              - higher ongoing maintenance overhead
              - requires "manual agreements" on how to manage/manipulate this environment

              Individual temporary environments

              + all the pros of the "Production-like environment"
              + even more isolated, higher guarantee of your expected state of the environment
              - requires a lot of CI/operational changes
              - requires ongoing maintenance to keep working
              - requires "operational" knowledge to make new changes work with this setup
              - requires more syncing of data
              - higher costs of running

              Sandbox accounts with special “capability” flags, in production

              + All test-data is scoped to a (sandbox) user account, a single "source of truth" on whether some piece of data is test-data
              + One single environment to maintain, no divergence
              + Data is always the same as production
              + Whatever code runs on production, is what you test
              + Because you want to test your changes, you automatically make sure your new code works with sandbox accounts / capability flags
              + You deploy your pre-production changes behind a capability flag, for testing
              + A preference page allows you to enable/disable certain capabilities to enable real/fake PSP environments, pre-release functionality, etc...
              - Much more complex to realise
              - Only works (well) for user-scoped test-data
              - Might result in higher learning curve to ship a feature that works with sandbox+capability flags
              - Production data can be changed accidentally, there's nothing but our own code between test and production data
              - Other systems need to be able to handle (and/or ignore) test data
              - Only suitable for the specific use-case of testing user flows, not for testing e.g. whether a library upgrade broke anything
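              A minimal sketch of how the sandbox-account model above might look in code. The `Account` fields, the capability names, and the PSP URLs are all invented for illustration; the point is that "is this test data" has exactly one source of truth:

```python
from dataclasses import dataclass, field

@dataclass
class Account:
    id: int
    sandbox: bool = False
    capabilities: set = field(default_factory=set)

def psp_endpoint(account: Account) -> str:
    """Sandbox accounts hit the PSP sandbox unless explicitly flagged
    to exercise the real payment flow end to end."""
    if account.sandbox and "real-psp" not in account.capabilities:
        return "https://sandbox.psp.example"
    return "https://live.psp.example"

def exclude_from_analytics(account: Account) -> bool:
    """Downstream systems branch on the one flag, never on heuristics."""
    return account.sandbox
```

              Every other system (analytics, billing, emails) makes its decision off the same `sandbox` flag, which is what keeps test data from leaking sideways.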
              1. 3

                We test with canaries. They act like customers doing precisely what customers do.

                1. 2

                  This lady has some very interesting ideas about it, look her (and her company) up: https://twitter.com/mipsytipsy

                  1. 1

                    One thing you can do is use verified fakes; it doesn’t solve all your problems, but it does let you make sure your unit tests are more realistic and therefore reduce your need for production testing: https://pythonspeed.com/articles/verified-fakes/

                    1. 1

                      Dark canaries. Putting a hardware or software router in front of your endpoints lets you tee traffic from one machine (or cluster) to another machine (or cluster) while discarding the responses from the tee’d service. You must ensure these are exact copies of the production environment but with no side-effects (no database writes, email sends, etc). This gives the benefit of using real production traffic and data, triggering bugs that only occur with a production workload, but with much less impact from triggering those bugs.
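                      In-process, the tee amounts to: answer from the primary, fire-and-forget a copy to the shadow, and swallow the shadow's response and errors. A toy sketch where `primary` and `shadow` are plain callables standing in for HTTP upstreams (real setups would do this in a router or proxy, not application code):

```python
from concurrent.futures import ThreadPoolExecutor

def make_tee(primary, shadow, pool=None):
    """Dark-canary tee: every request is served by `primary`; a copy
    goes to `shadow` asynchronously and its result is discarded."""
    pool = pool or ThreadPoolExecutor(max_workers=4)

    def handle(request):
        pool.submit(_swallow, shadow, request)  # fire and forget
        return primary(request)                 # only this answers
    return handle

def _swallow(fn, request):
    try:
        fn(request)
    except Exception:
        pass  # shadow failures must never affect the caller
```

                      The "no side-effects" requirement from the comment above is on the shadow service itself; this tee only guarantees the shadow can never slow down or break the real response path.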

                      1. 1

                        How big is your production environment? Is it realistic and affordable to run a regularly updated “exact copy”?

                        1. 1

                          We did something like this at a previous employer. Our application was a .NET desktop application, using a central SQL Server database for each project. The application was only used by internal employees, and they used it to set up and perform complex calculations.

                          When we started developing this application, we used a “Staging” environment, and would have employees test calculations in it (on top of a decent automated test suite, of course). This never caught most of the bugs, which we ended up discovering later, upon deployment to production. The problem was, nobody really wanted to spend enough time on the Staging environment to set up calculations of enough variety and complexity to thoroughly test the application. The whole Staging step seemed like more trouble than it was worth.

                          What we ended up doing that worked pretty well was to deploy releases straight to production, but only on the smallest project/database first. After a certain amount of time finding no bugs there, we would then deploy it to the other projects. This worked pretty well in practice.

                          This is similar to the process that I understand Facebook and some of the other majors use - deploy new releases immediately to prod, but keep features behind feature flags, and allow only small numbers of users to use them at first, before opening them up for global usage.

                          Exactly what’s practical depends a lot on your domain, but it might be worth trying to make something like this work.

                          1. 1

                            The code I work on is nearly impossible to test fully in anything other than production. My case is unusual in that the code I work on is part of the Monopolistic Phone Company’s network (although we are not part of the Monopolistic Phone Company) and it costs huge amounts of money to get a real phone lab going. Because of that, we do as much testing as possible using data learned from previous deployments (and issues that come up in production). We can “fake” SIP calls (even in production), but testing CDMA is … interesting, and involves a real phone, properly provisioned (not that easy when you are testing a new feature that needs to be provisioned).

                            In theory, we could shunt specific production traffic (to the component that handles the business logic, not actual phone network traffic) to staging, but netops was leery of poking holes in the firewall between the two. So it is what it is.