This incident started at Cloudflare, probably by just one individual, when they made a one line programming error in their use of a HTML parser library.
I disagree wholeheartedly. This is an organizational problem. Now, I’ll give the author the benefit of the doubt and assume he actually agrees with what I’m about to argue, based on the rest of the article, but I think that stating it like this upfront is actively harmful.
It should not have been possible for one person to get this kind of bug into production. It wasn’t one person’s decision to not do fuzz testing. It wasn’t one person’s decision to not test with known bad inputs. It wasn’t one person’s decision to allow a custom parser into a production system. It wasn’t one person’s decision to allow C code in a system of this type.
And if any of those things were one person’s decision that’s an even bigger problem.
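To make “fuzz testing” and “known bad inputs” concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `strip_tags` is a hypothetical toy parser, not Cloudflare’s code, and the invariants in `check` are just examples of properties a harness could assert. In C, the interesting failure is reading past the end of the buffer on an unterminated tag; Python would raise instead of leaking memory, so treat this only as the shape of the test.

```python
import random


def strip_tags(data: bytes) -> bytes:
    """Hypothetical toy parser: drop everything between '<' and '>'."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        if data[i] == ord("<"):
            # Skip to the closing '>'. The `i < n` bound must hold even
            # when the tag is never closed before the input ends.
            while i < n and data[i] != ord(">"):
                i += 1
            i += 1  # step past '>' (or past the end; the outer loop stops)
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


# Hand-picked nasty cases, including tags cut off at the end of the buffer.
KNOWN_BAD = [b"", b"<", b"<script", b"a<b", b"<<<<", b">", b"<>" * 1000]


def check(data: bytes) -> None:
    """Example invariants: never grow the input, never emit a '<'."""
    result = strip_tags(data)
    assert len(result) <= len(data)
    assert b"<" not in result


def fuzz(iterations: int = 10_000, seed: int = 0) -> int:
    """Run the known-bad corpus, then random inputs; return cases checked."""
    rng = random.Random(seed)
    alphabet = b"<>ab/ \"="
    for case in KNOWN_BAD:
        check(case)
    for _ in range(iterations):
        size = rng.randrange(0, 64)
        check(bytes(rng.choice(alphabet) for _ in range(size)))
    return len(KNOWN_BAD) + iterations


if __name__ == "__main__":
    print(f"checked {fuzz()} inputs without a violation")
```

A real harness for C code would more likely use a coverage-guided fuzzer together with a memory-error detector, but the principle is the same: a fixed corpus of known bad inputs plus a large volume of generated ones, run automatically.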
I’ve been wondering this myself. The Cloudflare incident report referred to fuzz testing they conducted after being notified of Cloudbleed, but didn’t get into details about precisely what testing had been performed prior. Have they published any details about this?
I think under a lot of ordinary circumstances it’d be unfair to demand a detailed test plan from an arbitrary piece of software. But: Cloudflare’s reverse proxy nginx extensions are some of the most sensitive C modules on the planet. In the same way that we know a good bit about how aggressively Google tests the codecs behind Youtube and the image parsers in Chromium, I feel like it would be reasonable to want to know what Cloudflare did with this code.
It’s not like nobody knew writing this kind of code in C would be dangerous.
I agree. I think the scope of the problem is large enough to demand transparency about why the problem happened, and not just how. I imagine that will be problematic, considering the legal liability issues, and that is precisely the problem with letting the private sector handle this itself.
Author here: I definitely do not blame that individual, and agree wholeheartedly that it is an organisational problem.
I agree organizational priorities were the main problem. I will add that wise use of defect-reduction techniques might have knocked it out. One person can do that. It happens regularly in safety-critical industries and in the few embedded shops that care about quality. A specific example is what I used to do: I’d autogenerate the parser code from a grammar, then analyze and test that code. This got common with DO-178B, since they had to buy static analysis & automated testing tools anyway to meet certification requirements with high confidence. The combo of generation followed by static analysis was common enough that Eschersoft made it the standard practice for their product: spec language w/ formal analysis & code generation in C/Ada; static analyzer for C to spot any errors in that (or other code).
There are also many open-source tools that can catch errors in C with either low or no false positives. I’d have a repo running any C code through all of them as part of the build/testing process. Automatically. So, I blame Cloudflare mostly, but one developer could have prevented this without much time. We have two memes here, where a company and an individual don’t care enough about security/quality to use cost-effective means to achieve it. The baseline for IT.
That’s a good list. Also: service isolation and containment. No need for the email protector parser to be running in the same process as all other traffic.
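As a rough sketch of what that isolation could look like, here is a Python example that runs a risky parser in its own process. The worker program, the name `parse_isolated`, and the simulated crash are all assumptions made for illustration; none of it reflects how Cloudflare’s proxy is built. The point is only that the parser gets a separate address space, so a crash or an over-read in the child cannot reach data held by the parent.

```python
import subprocess
import sys

# Hypothetical stand-in for a risky parser, run as its own program.
WORKER_SOURCE = r"""
import sys
data = sys.stdin.buffer.read()
if b"\x00" in data:
    raise SystemExit(3)  # simulate the parser crashing on bad input
sys.stdout.buffer.write(data.upper())
"""


def parse_isolated(data: bytes, timeout: float = 10.0):
    """Run the parser in a child process; return None if it fails or hangs."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE],
            input=data,
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout


if __name__ == "__main__":
    print(parse_isolated(b"hello"))    # parsed output from the child
    print(parse_isolated(b"a\x00b"))   # child failed; parent is unaffected
```

A separate process on its own is not a sandbox. A production setup would presumably add OS-level restrictions on top, and spawning a process per request is far too slow for a proxy, so a long-lived worker pool would be the more realistic design.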
Do you think it’s possible to push a lot more of this work into the OS kernel? The example I give is: imagine if, when I logged into gmail.com, I got my own process, my own file system, my own slice of the database, and even my user credentials were standard OS credentials. The user credentials in particular seem like something it would be nice to push up into the OS. So many people are just ignorant of how to store user credentials properly, but I don’t even have to think about it when I add users to machines I operate.
I think this is the direction projects like Mirage are going. You could start a server for every session.
Yeah, for sure. Personally, I am not so sold on the unikernel idea in practice. So integrating those sorts of things into the OS is what I’d like to see. In practice, as well, I’m guessing the virtual machine overhead is just too much to handle 1 billion users on such a setup. In microservice land we all seem a bit too free with making requests to other systems, though, so maybe that would move us back towards monoliths. Who knows.
I can’t imagine any scenario in which a regulatory response by NIST or any other government agency would have led to this incident being handled better.
From my experience, when gov agencies are involved in such incidents there’s much more secrecy and less transparency than in how Google and Cloudflare handled this.
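For readers unsure what “storing user credentials properly” involves at the application level, here is a minimal sketch using only the Python standard library. The function names and the iteration count are assumptions for illustration, not a vetted policy; the idea is a per-user random salt, a deliberately slow key-derivation function, and a constant-time comparison.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumed work factor; tune it for your own hardware


def hash_password(password: str, *, iterations: int = ITERATIONS):
    """Return (salt, iterations, digest) to store instead of the password."""
    salt = os.urandom(16)  # fresh random salt for every user
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, iterations, digest


def verify_password(password: str, salt: bytes, iterations: int,
                    expected: bytes) -> bool:
    """Recompute the digest and compare it in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return hmac.compare_digest(candidate, expected)


if __name__ == "__main__":
    record = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", *record))
    print(verify_password("wrong guess", *record))
```

This is the kind of detail an OS account database already handles for you, which is the appeal of pushing credentials down a layer.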
Don’t get me wrong: I think there are areas in Infosec where government response would be welcomed by me (think: IoT). I don’t think this is one of them.
Their recommendations would’ve prevented it. Prior regulations and recommendations included using memory-safe languages and tools that generate code [hopefully] free of security vulnerabilities. The code submitted for review under DO-178C certification typically has to pass static analysis + testing of every path. A whole ecosystem of tools and safe libraries emerged to support such things, with the idea that re-certification after screw-ups would be more costly than investing in solid development. The same thing happened under TCSEC, the original certification for security, where high-assurance systems had to have all states precisely modeled, with evidence that none broke a precise security policy. They also usually used things like Pascal or Ada for better safety.
Long story short: existing and prior regulations pushing the minimal best practices for secure coding could’ve prevented it if it was just a pointer or parsing error. The safety regulations continue to do their job in aerospace. The security regulations were canned, despite their effectiveness during pentests by NSA, due to bad management at NSA (MISSI initiative) and bribery of politicians by software companies (esp IBM and Microsoft) wanting DOD to use their stuff instead of secure stuff. Policy changes killed assurance requirements & demand for them. A bullshit standard called Common Criteria followed, with many insecure systems certified and adopted per DOD’s COTS policy. A subset of it (EAL6/7) worked, but it wasn’t required & evaluators allowed more hand-waving. The common mantra that we don’t know how to prevent many security problems with regulation comes from people who have never studied regulation of safety/security in IT to begin with & probably couldn’t name a single evaluated system under TCSEC unless they read one of my comments naming them.
Gov regs that require disclosure I’m ok with. Every company wants to downplay breaches. Big hoary notices that must be snail mailed to every single person whose data may have been compromised? That’s a fitting punishment. And yes, that punishes CF’s customers somewhat unjustly, but that makes all of them very angry at CF. So maybe CF will try a little harder not to screw up again.
I can certainly understand your viewpoint, and it is most likely true given the current state of the world.
But what we have here is Google, a large private company, essentially performing some of the investigative and reporting role that an incident such as this requires, but not being able to do much in the way of regulation or punishment. It certainly has no power to do root cause analysis at Cloudflare, change any laws or hand out fines.
My hope is certainly that the industry can sort itself out, but if it doesn’t, and the scope of these leaks gets worse, then what is the answer?