Lmao bad defaults should not be kept. No defaults are frequently just another form of bad defaults.
If it were only sysadmins who paid the price, I’d be a lot more sympathetic to this argument.
We can and should learn from decades of FAA practices and recognize that all sysadmins are human, all will make a mistake somewhere, and the correct response to that is to alter our tools and procedures to mitigate those mistakes, because we have not as yet succeeded in breeding the infallible human.
Is there a pithy phrase that encapsulates the whole "people make mistakes; proper tools and processes prevent them" idea?
That’d be great if it were actually the sysadmins being punished, not the users.
If I’m remembered for nothing else, could it be this quote?
Springs to mind.
At first I was rolling my eyes but then realized I was being an elitist. Infrastructure must be sympathetic to its users. Part of that IS forward-compatibility, but security issues like this trump those concerns IMO.
I just recently set up an etcd cluster at work. Literally everything about etcd is a footgun, so this comes as no surprise.
It’s a nice DB once it’s actually up and working though.
How do you mean? I’ve very limited experience, but it beat the shit out of Zookeeper last I used it on almost every metric.
Maybe I was doing it wrong, but I spent a considerable amount of time searching for documentation, so it wasn’t for lack of trying.
They made breaking API changes between v2 and v3, and tons of Google hits for etcd give you wrong information.
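For anyone else landing on stale docs: etcdctl speaks whichever API the ETCDCTL_API environment variable selects, and the v2 and v3 command sets differ (keys and values here are just examples):

```shell
# v2 API -- what most of the old Google hits assume
ETCDCTL_API=2 etcdctl set /config/db_host "10.0.0.5"
ETCDCTL_API=2 etcdctl get /config/db_host

# v3 API -- the current one; uses put/get and a flat key model
ETCDCTL_API=3 etcdctl put config/db_host "10.0.0.5"
ETCDCTL_API=3 etcdctl get config/db_host
```

If a v2-era command silently does nothing or errors out, chances are you're talking to a v3 endpoint.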
The bootstrap process just sucks. SRV records didn't propagate fast enough, so cluster bootstrap failed with DNS discovery. Connections to the discovery service frequently time out, which caused constant bootstrap failures. Doing it by hand was the only thing that worked, and since I had everything else perfectly scripted with Terraform, that felt icky.
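If anyone else runs into this, here's roughly what the two bootstrap modes look like (hostnames and the domain are placeholders):

```shell
# DNS SRV discovery: each node looks up _etcd-server._tcp.<domain>
# (or _etcd-server-ssl._tcp.<domain> for TLS) at startup, so the SRV
# records must already resolve on every node -- slow propagation = failure
etcd --name node1 \
  --discovery-srv example.internal \
  --initial-advertise-peer-urls http://node1.example.internal:2380 \
  --advertise-client-urls http://node1.example.internal:2379

# Static bootstrap: no discovery dependency at all;
# every member is listed up front on every node
etcd --name node1 \
  --initial-cluster 'node1=http://node1.example.internal:2380,node2=http://node2.example.internal:2380,node3=http://node3.example.internal:2380' \
  --initial-cluster-state new \
  --initial-advertise-peer-urls http://node1.example.internal:2380
```

Static bootstrap is the mode that sidesteps the propagation and timeout problems, at the cost of knowing all member addresses ahead of time.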
Managing cluster membership also seems to be entirely manual. I really wish more of this thing could be automated.
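For the record, the manual membership dance looks something like this (member IDs and URLs are examples):

```shell
# Adding a node: first register it with the running cluster;
# the command prints the ETCD_INITIAL_CLUSTER value to use next
etcdctl member add node4 --peer-urls=http://node4.example.internal:2380

# Then start the new etcd with --initial-cluster-state=existing
# (listing all current members plus itself), or it will try to
# bootstrap a brand-new cluster instead of joining this one
etcd --name node4 \
  --initial-cluster-state existing \
  --initial-cluster 'node1=http://node1.example.internal:2380,node2=http://node2.example.internal:2380,node3=http://node3.example.internal:2380,node4=http://node4.example.internal:2380' \
  --initial-advertise-peer-urls http://node4.example.internal:2380

# Removing a dead member is also manual: look up its ID, then remove it
etcdctl member list
etcdctl member remove 8e9e05c52164694d
```

None of this reacts to nodes appearing or disappearing on its own, which is why it's hard to drive purely from Terraform.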
Perhaps if I were willing to spend another couple months on it I could figure out how to bootstrap a cluster with no manual input, but it wasn’t worth the time.
But now that the thing is actually online and working it’s pretty nice.
What the article doesn't spell out is that, at least on etcd 2.x, activating user/password authentication is a true performance killer. So if you want ACLs any more complex than "client has a valid TLS certificate to connect with", you have to either take the huge performance hit or get creative and proxy etcd via nginx to support a better auth model[1].
Even then, etcd was never designed to be exposed directly to the internet. I wouldn't blame the developers for not foreseeing people ignoring the most basic security principles (like: don't expose any of your datastores directly to the internet). But I would blame them for bolting an auth model onto etcd after the fact, and for the fact that, throughout the 2.x series, auth was basically unusable (see e.g. https://github.com/coreos/etcd/issues/5840 or https://github.com/coreos/etcd/issues/3223) and was "fixed" in 3.1 and later releases by allowing RBAC based on the common name of a client TLS certificate…
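For anyone curious, the 3.x flow described above looks roughly like this (v3 API; the user and role names are made up):

```shell
export ETCDCTL_API=3

# Auth can only be enabled once a root user exists
etcdctl user add root          # prompts for a password
etcdctl auth enable

# Role-based access: grant a role read-only access to a key prefix,
# then attach it to a user
etcdctl role add readonly
etcdctl role grant-permission readonly read config/ --prefix=true
etcdctl user add app
etcdctl user grant-role app readonly

# Or skip passwords: with client cert verification enabled on the
# server, the CN of the client certificate is mapped to an etcd user
etcd --client-cert-auth \
  --trusted-ca-file=ca.crt \
  --cert-file=server.crt --key-file=server.key
```

The CN-based mapping is exactly the "fixed in 3.1" behavior: authentication happens at the TLS layer, so the password-auth performance hit never applies.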
FYI: Your [1] is missing also
Then it shouldn’t by default bind to 0.0.0.0
What reasons do owners have to keep their etcd servers open to the internet? Is this mainly used for cross-datacentre communication?
Unfortunately, I think it's more common for this to happen accidentally than deliberately, e.g. someone spinning up AWS servers and installing etcd in its default configuration without really thinking about the fact that they're exposed to the internet.
Zero sympathy for those pwned.
Why? Admittedly they should check and change the default config. But having no auth by default is a terrible design decision just to keep backward compatibility.
Why? Because using etcd in the first place is probably a dumb decision, and if you (for some bizarre reason) actually need it, then anyone who calls themselves a “sys admin” should be prudent enough to not install software that exposes all of your most private and important credentials to the Internet. And if you do install such software, you should most definitely be aware of it and configure it properly — before it takes down everything.
If you’re not competent enough to do that, you have no business being a system administrator, you are a security threat to anyone who hires you.