1. 26

  2. 37

    To err is human; to propagate error automatically to all systems is devops.

    1. 13

      As someone who works on a large system that sees its fair and regular share of outages, I see about 95% of outages caused by bad deploys, with the mitigation being to roll the deploy back. The remaining 5% are largely unexpected faults like network cables being physically cut, power being lost across a whole data center, or other hard-to-test-for cases. Note that the bad deploy can often be at a third party we rely on, such as a payments processor or cloud provider.

      This is also the reason that at critical times, when we want to minimise downtime, we put a “code freeze” no-deployment policy in place. Most large companies do this around e.g. the winter holidays (Christmas and New Year’s), when many people are away and/or large traffic spikes are expected for commerce. Same with Thanksgiving or Black Friday in the US.

      And the crazy thing is, code freezes work like a charm! Having been through multiple weeks-long code freezes, I can say these are the times when there are virtually no outages or pages. On the flip side, I’ve observed a lot more outages happening once the code freeze is lifted and many times more changes than before hit production.

      1. 9

        I used to work for a large company where we had services which we deployed many times each day. One demo season we decided not to deploy anything for a while so that the televised presentations would work without a hitch.

        Unfortunately we hadn’t noticed a memory leak in one of the services which meant that it would fall over after a few days. This is one of the few times I’ve found that not releasing a change has caused a problem.

        1. 1

          This is also the reason that at critical times, when we want to minimise downtime, we put a “code freeze” no-deployment policy in-place.

          I think I read the same idea in the “Site Reliability Engineering” book from Google. They advocate the idea there that you define a target uptime you want to achieve (e.g. 99.9%) and keep pushing changes as long as you’re above that target. If you fall below it, you’re not allowed to push changes anymore.
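
          The SRE book calls that headroom an “error budget”. A minimal Python sketch of the rule, with purely illustrative numbers and names:

            def error_budget_remaining(target_uptime, window_seconds, downtime_seconds):
                # Budget = total downtime the target still permits within the window.
                allowed_downtime = (1.0 - target_uptime) * window_seconds
                return allowed_downtime - downtime_seconds

            # 99.9% over a 30-day window allows about 43 minutes of downtime.
            remaining = error_budget_remaining(0.999, 30 * 24 * 3600, downtime_seconds=20 * 60)
            print("deploys allowed" if remaining > 0 else "freeze: error budget exhausted")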

        2. 9

          I feel it has to be said that Cloudflare deliberately and unnecessarily put themselves in between users and web servers, as a gigantic single point of failure. That they will mess up now and then is inevitable, and it’s a fundamental flaw in the essence of their product that every website goes down when they mess up.

          I don’t like how they’re deliberately turning the web’s robust many-to-many relationship into a fragile many-to-one relationship.

          1. 2

            As always, nobody is forcing anybody to use them.

            Each time they go down, more and more people will realize the issue and maybe switch.

          2. 4

            Reminds me of this chestnut from Jamie Zawinski:

            Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

            1. 1

              I desperately want to know what the specific regexp in question was.

              1. 2

                According to this article, the regex was

                (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
                

                with the critical component (.*(?:.*=.*)) at the end.
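
                That trailing .*(?:.*=.*) is the classic catastrophic-backtracking shape: two greedy .* in front of a literal that may never appear, so a backtracking engine retries every way of splitting the input between them, at every starting position. A rough Python sketch of the blow-up (my own illustration, not code from the article):

                  import re
                  import time

                  # Simplified core of the problematic pattern: two greedy .* before a
                  # literal '=' that the input never contains, so the search fails only
                  # after every split at every start position has been tried.
                  pattern = re.compile(r".*(?:.*=.*)")

                  for n in (100, 200, 400, 800):
                      s = "x" * n  # no "=" anywhere
                      start = time.perf_counter()
                      pattern.search(s)
                      print(n, round(time.perf_counter() - start, 4), "seconds")

                  # The runtime grows much faster than linearly in the input length.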

                1. 1

                  So many problems.

            2. 5

              We were seeing an unprecedented CPU exhaustion event, which was novel for us as we had not experienced global CPU exhaustion before.

              I’ve seen job ads for places that require five years’ experience of unprecedented events before you even get to interview.

              1. 2

                “How many more unforeseen problems are we going to meet?!”

                “…all of them”