1. 29
  1.  

  2. 6

    In the initial moments of the outage, there was speculation that it was an attack of a type we’d never seen before.

    I am as guilty as anyone, but it’s really interesting that we reach for these far-fetched explanations when something goes wrong instead of assuming “the last deploy broke something,” which is statistically far more likely.

    1. 7

      Well, I don’t know. Cloudflare sees so many attacks of considerable scale every day that it’s probably more common for their monitoring to go crazy because of an attack than because of someone pushing code.

      That said, I do agree that not considering this option right away means they’re biased in some way towards blaming external actors.

      1. 4

        This is a great point! Their statistics for incident causes are likely not the same as most companies’, but I’d still expect to see “bad deploys”/human error heavily represented.

        I guess the other important point is that they are probably used to bad deploys being caught earlier with gradual rollouts. The size of the impact for this incident is atypical for a bad deploy…
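
        To make that concrete, here is a rough sketch of why a staged rollout usually limits the blast radius of a bad deploy. The stage sizes, health check, and error numbers below are all made up for illustration - this is not Cloudflare’s actual release tooling.

        ```python
        # Toy staged rollout: deploy to a growing fraction of the fleet and gate
        # each stage on a health check. All numbers here are illustrative.
        STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of the fleet on the new version
        ERROR_THRESHOLD = 0.02              # abort if the global error rate exceeds 2%

        def observed_error_rate(fraction_deployed: float) -> float:
            # Stand-in for real monitoring: pretend the new version fails half its
            # requests, so the global error rate grows with the deployed fraction.
            return 0.5 * fraction_deployed

        def rollout() -> bool:
            for fraction in STAGES:
                print(f"deploying to {fraction:.0%} of the fleet")
                if observed_error_rate(fraction) > ERROR_THRESHOLD:
                    print(f"error rate too high at {fraction:.0%} - rolling back")
                    return False
            print("rollout complete")
            return True

        rollout()
        ```

        A change that is replicated everywhere at once, as described in the reply below, effectively skips straight to the last stage of that loop.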

        1. 10

          I’m an SRE at Cloudflare. Most deploys to our edge are handled by us, so we generally know what’s been deployed and when. As the post mentions, we use the Quicksilver key-value store to replicate configs near-instantly, globally, and these config changes are either user data or changes made by us as part of a release. The WAF is unique here in that it’s effectively a code deploy performed through the key-value store, and it’s done by the WAF engineers directly via CI/CD, not by us.

          So yeah, when this outage started we weren’t immediately aware that a WAF release had recently gone out, but generally we wouldn’t want to be - they deploy frequently, so notifications for every release would just be noise to us. This is one of the things that led to a few minutes’ delay in identifying the WAF as the cause of the issues, but we had excellent people on hand who used tools like flamegraph generators to identify the WAF in only a few minutes.

          It’ll be interesting to see how we change the deployments.
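
          For readers who haven’t seen the pattern, here is a toy sketch of “code deployed as configuration” through a replicated key-value store: a rule write is visible to every consumer almost at once, so there is no gradual rollout to catch a bad rule. The dict, keys, and function names are made up for illustration - none of this is Quicksilver’s actual interface.

          ```python
          import re

          # Toy model of rules shipped as data through a globally replicated KV store.
          # A plain dict stands in for the store; names and keys are illustrative only.
          kv_store = {"waf/rules": [r"select\s+.+\s+from"]}

          def push_rule(new_rule: str) -> None:
              """A 'deploy' is just a write - every consumer sees it almost immediately."""
              kv_store["waf/rules"] = kv_store["waf/rules"] + [new_rule]

          def inspect_request(body: str) -> bool:
              """Each edge node evaluates whatever rules the store currently holds."""
              return any(re.search(rule, body, re.IGNORECASE) for rule in kv_store["waf/rules"])

          print(inspect_request("q=select name from users"))  # True - matched by the original rule
          push_rule(r"union\s+all")                            # the "deploy": one write, instantly global
          print(inspect_request("q=1 union all select ..."))   # True - the new rule is already live
          ```

          The first write already behaves like a 100% deploy, which is why a bad rule shows up everywhere at the same time.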

    2. 4

      The author says that “In the last few years we have seen a dramatic increase in vulnerabilities in common applications.” Curious as to why, I looked it up. RCE and XSS spiked especially, with a few other types too.

      The answer I found? A change in the way they assigned CVEs; they made the process easier. Quoting https://security.stackexchange.com/questions/203578/what-is-the-reason-for-the-increase-of-cves-since-2017: “So the spike you are seeing doesn’t necessarily mean that more vulnerabilities have been discovered, but just that more researchers apply for and successfully get CVEs.”

      So yeah, there wasn’t really a spike in real problems; it just reflects a change in accounting.

      1. 13

        CVEs have been gamed for some time. It helps to think of them as invoice numbers. No invoice, no payday. More companies benefit from the number going up than going down.

      2. 1

        The regexp backtracking issue sent me down memory lane.

        I remember reading about this in the Perl documentation over a decade ago.

        Back when Google Code Search was a thing, I recall that they went with their own NFA-based regex engine to avoid these complexity problems. That’s the RE2 engine mentioned in this Cloudflare post as one of the engines being considered for adoption. Code Search had a different source of inputs, though: users could submit regular expressions directly, so you definitely needed an engine without pathological corner cases.
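
        For anyone who hasn’t run into it, catastrophic backtracking is easy to reproduce with Python’s built-in backtracking regex engine. The pattern below is the textbook nested-quantifier example, not the rule from the Cloudflare incident; an automata-based engine like RE2 handles the same input in linear time by construction.

        ```python
        import re
        import time

        # Classic catastrophic backtracking: nested quantifiers plus an input that can
        # never match, so the engine tries exponentially many ways to split the 'a's
        # between the inner and outer '+'. (Textbook example, not Cloudflare's rule.)
        pattern = re.compile(r"(a+)+$")

        for n in (18, 20, 22, 24):
            subject = "a" * n + "!"      # the trailing '!' guarantees the match fails
            start = time.perf_counter()
            pattern.match(subject)
            print(f"n={n}: {time.perf_counter() - start:.2f}s")  # roughly doubles per extra 'a'
        ```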

        Since this is the first global outage Cloudflare has had in over 6 years, and not a long one at that, I can’t say I can nitpick their architectural choices. Still, running all of their services (which keep growing in count and complexity) on the same edge nodes was bound to trip them up sooner or later.