Hey folks,
Our host prgmr had a networking outage starting around 6:45 Central time (GMT-5) this morning. A switch rebooted and, after 30m, entered a reboot loop, taking their hosting entirely offline. They dispatched an employee to the data center to configure and install a backup switch.
When that was completed we came back online with everyone else. Unfortunately, our Let’s Encrypt cert was due to renew during the networking outage. So that didn’t happen and we came back up with an expired security certificate. @alynpost reissued the cert and we bounced nginx to get it working.
Between networking and the certificate we were offline for about three hours. No data was lost and our VPS did not go down. I don’t believe any part of this outage was malicious or raises any security concerns.
Our hosting is donated by @alynpost, owner of prgmr and longtime lobster. During the networking outage I asked not to receive any updates unless all their customers were back up and something extra was wrong with Lobsters. I didn’t want to distract from their paying customers on what must be a very busy morning. We have no plans to change anything about our hosting.
And as long as I’m writing an announce post many people will see: yesterday I fixed the bug that broke replies via mailing list mode.
Welcome back!
[Comment from banned user removed]
First NixOS on prgmr, then Lobste.rs on NixOS! :)
I haven’t received any reports of users running NixOS, but typically folks would only reach out to me if they were having a problem. You can certainly boot up a live rescue and run an install over the serial console. Depending on the distribution this either ‘just works’ or requires that it be told the console is on the serial port.
NixOS ISOs from their website do not enable the serial console by default, but building a custom ISO which does is easy enough. I did so a few days ago on Debian using nix to create a NixOS installer for my APU2.
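Roughly, the build looks like this (a sketch from memory rather than my exact commands; the file name and kernel parameters here are just examples):

    # iso.nix -- a minimal installer config with the serial console enabled
    cat > iso.nix <<'EOF'
    { config, pkgs, ... }:
    {
      imports = [ <nixpkgs/nixos/modules/installer/cd-dvd/installation-cd-minimal.nix> ];
      # Tell the kernel (and therefore the installer) to talk to the serial port.
      boot.kernelParams = [ "console=ttyS0,115200" ];
    }
    EOF

    # Build the ISO; the image ends up under ./result/iso/
    nix-build '<nixpkgs/nixos>' -A config.system.build.isoImage -I nixos-config=iso.nix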
I actually tried a few months ago, but gave up because I thought I’d figured out it was impossible. Although, seeing the link that @alynpost just posted, I might give it another go when I have some free time.
Some years ago, a friend taught me a simple trick which I used twice to install OpenBSD at providers where neither OpenBSD nor custom ISOs were directly supported: we would build or download a statically linked build of Qemu, boot the VPS into its rescue image, and start Qemu with the actual hard disk of the VPS as its disk and an ISO to boot from. That’s not too hard and works for pretty much everything where you’ve got a rescue system with internet access. I guess it should work for NixOS too, and maybe nix could even be used for the qemu build ;)
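For the curious, the whole trick boils down to something like this from the provider’s rescue system (a sketch; the disk device, memory size, and ISO name are placeholders for whatever your environment gives you):

    # Boot the installer ISO inside qemu with the VPS's real disk attached,
    # then do the install over the emulated serial console.
    # Depending on the installer you may still need to point its console at the serial port.
    qemu-system-x86_64 \
      -m 1024 \
      -nographic \
      -drive file=/dev/vda,format=raw,if=virtio \
      -cdrom install.iso \
      -boot d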
If you want to give it a go and get stuck write support@prgmr.com and we’ll help you debug.
Thanks! I really should have been less lazy and just asked for help last time.
My pager went off this morning telling me it had one, then twelve, then forty alerts. That many problems all at once is a sure sign of either network, power, or monitoring system failure. On investigating we discovered one of our switches had spontaneously rebooted. By the time we’d determined we had a switch reboot it was back up, so we decided we’d schedule a maintenance window for the weekend and replace it Saturday.
Alas, the switch didn’t wait. It rebooted again later in the morning and then kept doing it. We keep spare equipment at the data center and brought a new switch online.
For network failures like this we have a so-called out-of-band console. We’re able to log in to prgmr.com and debug network failures remotely. From this out-of-band connection I was able to look at the logs on our serial console and determine which equipment had rebooted. Further, I was able to see that all of our equipment was still powered on, meaning a switch failure rather than a power failure.
As @pushcx said, welcome back!
I don’t want to kick a man while he’s down, but should we assume this means your network isn’t redundant then?
We do have some single points of failure in our network. We’ve been eliminating them as time permits but it is a work in progress.
We peer with HE and use NTT as a backup. In the event of an outage with HE we can get to the Internet with NTT instead. The switch that failed also had a ‘backup’, but one not quite as warm as our peering connection. As a result of this failure we’ll be experimenting with Multi-Chassis Link Aggregation (MLAG).
Do you have any experience with MLAG or other switch-level redundancy?
Hey,
Nice to hear a host (owner) admit where they have areas for improvement!
I’m afraid it’s a long time since I had my hands on an IOS console, and I took a different direction after graduating, so I’m definitely not the guy to help you here!
I have a bit of experience with HPE’s IRF (on Comware), but only in a lab environment. As far as I can tell once you set it up (not terribly difficult) it works as advertised. The downstream devices obviously should be connected to at least two physical switches in an aggregated switch domain, and preferably use some link aggregation, like LACP or various Linux link bonding techniques.
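On the Linux host side, the bonding half of that looks roughly like this (a sketch with placeholder interface names; the switch ports need a matching LACP configuration):

    # Create an LACP (802.3ad) bond and enslave two NICs to it.
    ip link add bond0 type bond mode 802.3ad xmit_hash_policy layer3+4
    ip link set eth0 down && ip link set eth0 master bond0
    ip link set eth1 down && ip link set eth1 master bond0
    ip link set bond0 up
    # For MLAG-style redundancy, eth0 and eth1 would go to two different physical switches.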
Succinct write-up! Cheers :)
I’m curious why the LE certs are only set to renew within 3 hours of them expiring?
Edit: more specifically, is it a deliberate choice (and if so, please explain why?) or is it the ‘default’ for some esoteric ACME certificate renewal client that nobody else has ever used in production (and thus this weird choice)?
Looked into this. LE was not, in fact, set to renew via any cron job or other automation. @alynpost did it manually in January and the expiration happened to fall during the outage. I’ve added it to the open LE issue.
I see.. Presumably not installed via a package (e.g. deb, rpm, etc) then?
It was installed by the package in Ubuntu 16.04 LTS, but by hand rather than by ansible.
That’s quite weird that it didn’t install a cron/systemd timer then…
Oh. Ubuntu strikes again. There isn’t a cron/timer file in the package because it’s just v0.4 in Xenial.
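For anyone else stuck on Xenial, the workaround is to add the entry by hand; something along these lines (a sketch, not necessarily our exact config):

    # /etc/cron.d/certbot -- newer packages ship something similar.
    # 'certbot renew' only renews certs within 30 days of expiry, so running it
    # twice a day is cheap; this sketch reloads nginx unconditionally afterward.
    0 */12 * * *  root  certbot renew --quiet && systemctl reload nginx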
It sounds to me that certbot was trying to renew during the network outage and, upon failing the ACME challenge, got into a stuck state or gave up renewing - at least until NGINX was kicked and a new cert was manually issued.
By default certbot will try to renew certificates that expire < 30 days from “now”.
But the other reply solves the mystery (sort of): it was never called to renew.
Thanks for the write-up and thanks to @alynpost for graciously hosting us.
You’re welcome. I’m glad to help.
I have Let’s Encrypt set up to renew two weeks prior to it expiring just in case of outages like this. This gives me two weeks to resolve issues.
[Comment from banned user removed]
What toolchain do you use? We had a wrinkle along the way where certbot was reporting the cert was not yet up for renewal. @nanny guessed that nginx needed to be restarted; I did that at the same time @alynpost force-reissued the cert, and one or both of those fixed the issue for us.
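For anyone hitting the same symptom, the generic nudge looks something like this (a sketch, not a record of the exact commands we ran):

    # Force a renewal even though certbot thinks the cert isn't due yet,
    # then restart nginx so it picks up the new files.
    certbot renew --force-renewal
    systemctl restart nginx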
We’ve been wanting to move to acme-client for a while.
I use acme-client since certbot causes Python to create memory mappings that are both writable and executable, which is a huge no-no in my book.
If you have the spare time, I’d love a PR to our ansible repo to move to acme-client. We haven’t had the expertise to change over with confidence, and it sounds like you’d avoid at least one failure mode we hadn’t thought of. :)
I wish I had the time, but unfortunately I don’t.
No worries, I’ll get to it eventually. (Or maybe some other lobster reading this will volunteer.)
I didn’t bother with timing for mine, I just set up a cron job to call renew once a week, since the client doesn’t do anything if the cert isn’t due for renewal yet.
Great transparency. Thanks to both of you for your hard work. Thanks to alynpost again for donating hosting to the site.
You’re welcome. It’s my pleasure.
Thanks for the update. #HugOps to everyone involved <3
Network (switch) failures are particularly frustrating to deal with because we all work over IRC and self-host both a server and our bouncer. Since we can’t log in it’s difficult to chat with each other: I was able to connect to my IRC client over our out-of-band network, but the connection between it and our bouncer was down.
Everyone scrambles for a bit to get on Freenode without using the bouncer. Once that’s done and we’ve figured out what everyone’s temporary nick is, we set about restoring service.
Oof. That sounds awful. Glad you got it sorted.
On the bright side, this post caused me to take a look at prgmr and I’m going to try it out this weekend! Looks dope and being able to run NixOS on it is super :)
Hmmm… “Hostname is incorrectly formatted. It must be between 2 and 32 characters. It must begin with a letter and contain only lowercase letters, numbers, and dashes.”
That’s not how hostnames work. :(
Internally we refer to that as a “label.” You’re correct, it doesn’t match the limitations in DNS. That label is used not only to set a hostname in .xen.prgmr.com, but it also shows up in other systems. Most of our customers ignore this label in favor of setting up their own domain. In DNS you can set any valid A record for the IPv4 address associated with your VPS. If you set an rDNS record for us you’ve got identical flexibility.
All that said, that limitation may also exist for legacy reasons that are no longer valid. Did you try to enter a hostname it would not accept or is it that you noticed it was more restrictive than it might need to be? I’m happy to look at changing the constraints here.
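Concretely, the constraint as stated amounts to something like this (my paraphrase of the rule, not our actual validation code):

    # 2-32 characters: a lowercase letter, then lowercase letters, digits, or dashes.
    echo "myserver" | grep -Eq '^[a-z][a-z0-9-]{1,31}$' && echo valid || echo invalid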
Thanks for the explanation! I tried entering “myserver.home.monokro.me” but it seems to only allow short names instead of full hostnames?
You’re correct. The “hostname” is expected to be the leftmost subdivision of a DNS name. You’d add myserver.home.monokro.me to your own DNS server.
Ah. Yeah. That’s a departure from what Linux itself expects, afaik - that’s why there’s a hostname -s to get the first segment of the hostname. Not sure if that’s helpful or not, but figured I’d say it. Thanks for your responses :)
Thanks for taking care of lobste.rs!
Respect for the quick write-up. Also, seeing the loyalty to prgmr is cool too!
Thanks for working hard.
(note- I have nothing to do with prgmr)
Hadn’t heard of prgmr before, thanks for hosting!
Great and simple postmortem. Didn’t blame the problem on some ‘root cause’.
When people tell you there is a root cause of system failure, fire them immediately. This situation is a perfect and simple example where a cascade of failures led to an outage. Root causes are for ants.
There are some more notes in the comments here - that statement you quote was inaccurate; we’d have gone down anyway because of the cert expiring. It was just bad luck that the two happened at the same time.
I found the comment about doing the renew manually. I think if anything, this makes your example even stronger. Meaning there is no single failure here, and it took a combination of all of them to create the long downtime you saw. In other words, there is no root cause.
I think you guys did a great job and I even learned about a new (to me) old service, prgmr!
FWIW, for Lets Encrypt I use acme.sh in a daily cronjob that renews the cert every 30 days (i.e. if it’s within 60 days of expiration). Then I have alerts that fire if the time runs down. It’s kind of aggressive but it protects against non-outage risks like subtle API changes or bugs.
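Roughly, that setup looks like this (a sketch with placeholder paths and domain; the cron line is the kind acme.sh normally installs for itself):

    # Issue once with a 30-day renewal interval (the default is 60 days):
    acme.sh --issue -d example.com -w /var/www/example.com --days 30

    # Daily cron entry that runs the renewal check:
    # 0 3 * * * "/root/.acme.sh"/acme.sh --cron --home "/root/.acme.sh" > /dev/null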
Does your acme.sh automatically restart your webserver processes? If so, does it run as root or do you use sudo or something else to restart them?
In my case it feeds the new cert into the configuration management tool and it then pushes out the new config like normal. acme.sh itself is decoupled from the web server.
Another LE thing: did you register without an email address? I don’t have a working cron job on some servers (lol), but LE emails me several times before the cert expires!