Hey folks,
We got a note from our friends at prgmr (who generously donate our hosting - thanks, @alynpost) that we have some upcoming maintenance:
As you may be aware, the mitigations we deployed earlier this year for Meltdown and Spectre were the best available at the time, but were incomplete. In addition to other changes, we would like to deploy more complete updates on the host server for [lobste.rs] covering Spectre v2/SP2 now that all the dependencies are available. We are still not aware of any update for Spectre v1.
These updates require system reboots. We have allocated up to two hours after 2018-05-05 02:00:00 UTC for this maintenance window.
That’s 9 PM US Central time on Friday.
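As an aside for anyone who wants to check the result from inside their own VM afterward: reasonably recent Linux kernels (roughly 4.15 and later) report mitigation status under sysfs. Here's a minimal sketch that just prints whatever the kernel says; the path is standard Linux and nothing here is specific to prgmr or Xen:

```python
#!/usr/bin/env python3
"""Print the kernel's reported CPU vulnerability mitigation status.

Assumes a Linux kernel new enough (~4.15+) to expose
/sys/devices/system/cpu/vulnerabilities; purely illustrative.
"""
from pathlib import Path

VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")

def main():
    if not VULN_DIR.is_dir():
        print("Kernel doesn't expose mitigation status here (too old?)")
        return
    for entry in sorted(VULN_DIR.iterdir()):
        # Each file (e.g. spectre_v1, spectre_v2, meltdown) contains a
        # one-line status such as "Mitigation: Full generic retpoline".
        print(f"{entry.name}: {entry.read_text().strip()}")

if __name__ == "__main__":
    main()
```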
Conveniently, @jstoja opened a PR to start Unicorn when the server comes up. I plan on reviewing and merging that Wednesday morning (my standard time for Lobsters code).
However: this will be the first restart of this new Lobsters server. There’s a real chance that some fiddly little thing we haven’t thought of will break, so our downtime might run longer. I’ll be online to troubleshoot as the server comes back up, and if you want updates it’s best to hang out in the chat room.
Catch you on the flip side.
9PM Friday night for a freely available community service. If you don’t hear it enough from us, your humble users, we once again thank you for maintaining Lobsters!
I appreciate your kind words. We schedule as many of our maintenance windows as we can in the evenings and over the weekend. Compared to weekdays and daylight hours, more folks have signed off or are otherwise away from their keyboards, which lowers the impact for most of our users.
We do have customers all over the world, plus some who keep odd hours for reasons other than being in a different timezone, but working late or on the weekend when we need to bring machines down is still the best option for the largest number of our users.
I’ll go ahead and talk a little bit about what’s been going on on the operations side at prgmr.com for the past several months. Nothing here directly impacted lobste.rs: this was all happening on physical hosts running other VPSes.
In October 2017 we had a network-related kernel panic that crashed one of our dom0s. The backtrace didn't leave us many clues, and that was the first of what turned out to be five crashes on three separate hosts over the next four months.
Oddly enough, one of the crucial clues for debugging those crashes came during a separate operational incident. We host gpodder.net on a dedicated RAID array, and we were moving that array to another physical host that already had an array of its own. While we were inserting the drives for the gpodder array, three drives in the existing array were simultaneously kicked. We typically run RAID 10, and the machine stayed up after losing three drives, but we immediately scheduled downtime to move those customers to a spare. While doing so, we hit one of our five network-related kernel crashes. The panic this time was distinct enough that it became one of a few crucial clues to discovering a use-after-free bug in the driver for one of our 10 Gbps network cards. The bug is not in Xen itself, but under normal circumstances it will only show up under Xen, due to the way Xen uses the software input-output translation lookaside buffer (SWIOTLB).
As to the drive kicks, we believe those were caused by the 1st and 2nd revisions of the 2.5” to 3.5” drive adapters we use. These earlier revisions of that part lacked a locking clip on one side. That left too much play in the adapter, and over time it deformed the backplane, leading to loss of physical contact between the backplane and the drives on one side of the chassis.
More than any other consideration, our users have emphatically stated that uptime is their #1 priority. While maintenance windows are necessary to keep our network running, we do everything we can to minimize the number and duration of those windows. The crashes I'm describing here are not maintenance; maintenance is one of the things that helps prevent them. When we have crashes or other unexpected downtime, it gets our attention and we get it fixed. We've been working with upstream to patch the use-after-free we discovered and have applied a mitigation on our own systems. We've also removed all but the latest revision of the MCP-220-00043-0N drive adapter from our inventory, and will replace those in production as we're able.
While I’m the one here talking about it, the work I’ve described here was done in large part by srn and cmb. I’d like to thank them both for helping keep prgmr.com running. Thank you all for reading lobste.rs and hanging in there while we reboot things at the end of the week here.
And we’re back up. A recent tweak to the mariadb config kept mariadb from coming up automatically, so unicorn tried but failed to start (production Rails exits if it can't connect to the db). I'll sort out the config in lobsters-ansible, probably next week; I think it's just a matter of removing a my.cnf with a bunch of settings that don't match prod, which looks like some repo default.
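For anyone curious what that ordering failure looks like in the abstract: the real fix is in the lobsters-ansible config, but the shape of the problem is just "the app server dies if it starts before the database is accepting connections." Here's a toy sketch of one common workaround, waiting on the database port before starting the app; the host, port, and timeout are hypothetical and this is not what the actual deploy does:

```python
#!/usr/bin/env python3
"""Toy illustration of the boot-ordering problem: wait for the database
to accept connections before declaring it safe to start the app server.

Host/port/timeout are assumptions for the example, not the real config.
"""
import socket
import sys
import time

DB_HOST, DB_PORT = "127.0.0.1", 3306   # assumed MariaDB address
DEADLINE = 120                          # give up after two minutes

def wait_for_db() -> bool:
    start = time.monotonic()
    while time.monotonic() - start < DEADLINE:
        try:
            # Something is listening on the DB port; good enough for a demo.
            with socket.create_connection((DB_HOST, DB_PORT), timeout=2):
                return True
        except OSError:
            time.sleep(2)  # DB not up yet; retry
    return False

if __name__ == "__main__":
    if wait_for_db():
        print("database is accepting connections; safe to start unicorn")
    else:
        sys.exit("gave up waiting for the database")
```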
Site’s pretty slow, but that’s probably disk and database caches warming up.