1. 76

We have 3 VPSs at DigitalOcean: web01, db01, and mockturtle. I started the upgrade to Ubuntu 24.04 LTS at about 10:00 and it’s currently 15:38. The upgrade itself went fine, but we’ve been down or in read-only mode since because of a DigitalOcean misconfiguration.

I took an offsite database backup before starting the upgrade and this issue seems unlikely to affect storage, so I’m not worried about data loss.

When the hypervisor boots our host, it should provide a virtual device or API that cloud-init needs to create an appropriate netplan. Basically: what’s the networking for this VPS? IP address, routes, DNS, etc.

DigitalOcean changed from an API to ConfigDrive sometime since our VPSs were created. web01 apparently migrated OK. db01 and mockturtle (chatbot) didn’t. They lack the virtual device /dev/vdb that should have the network info, so on every boot cloud-init writes an empty netplan and they come up without networking.
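For anyone checking their own droplet, here is a minimal sketch of that test in Python. It assumes the config drive shows up as a block device whose filesystem label is `config-2` (the label cloud-init's ConfigDrive datasource searches for) and that udev has populated `/dev/disk/by-label`. The function name and its parameter are made up for illustration.

```python
from pathlib import Path

# Filesystem labels that cloud-init's ConfigDrive datasource searches for.
CONFIG_DRIVE_LABELS = ("config-2", "CONFIG-2")


def find_config_drive(by_label_dir="/dev/disk/by-label"):
    """Return the resolved device path of a config drive, or None if absent.

    by_label_dir is a parameter only so the check can be pointed at a
    different directory; the default is where udev publishes label symlinks.
    """
    base = Path(by_label_dir)
    for label in CONFIG_DRIVE_LABELS:
        link = base / label
        if link.exists():
            # Follow the symlink to the real device node, e.g. /dev/vdb.
            return str(link.resolve())
    return None


if __name__ == "__main__":
    device = find_config_drive()
    if device is None:
        print("no config drive found; cloud-init has no network data to read")
    else:
        print(f"config drive present at {device}")
```

On a host in the broken state described above, this would report that no config drive was found.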

I’ve manually recreated the netplan (and saved a copy should cloud-init overwrite it again) so we’re back online for now, but if the VPS restarts it will lose networking until I can manually intervene again.
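As a rough illustration of what a hand-written netplan contains, this Python sketch renders a minimal static configuration. It is not the file from our servers: the interface name, addresses, gateway, and DNS server below are placeholders from the documentation range, and `render_netplan` is a hypothetical helper.

```python
import ipaddress


def render_netplan(iface, address_cidr, gateway, nameservers):
    """Render a minimal static netplan (version 2) document as a YAML string."""
    # Validate inputs up front; ipaddress raises ValueError on bad values.
    ipaddress.ip_interface(address_cidr)
    ipaddress.ip_address(gateway)
    for ns in nameservers:
        ipaddress.ip_address(ns)

    lines = [
        "network:",
        "  version: 2",
        "  ethernets:",
        f"    {iface}:",
        "      addresses:",
        f"        - {address_cidr}",
        "      routes:",
        "        - to: default",
        f"          via: {gateway}",
        "      nameservers:",
        "        addresses:",
    ]
    lines += [f"          - {ns}" for ns in nameservers]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values only (192.0.2.0/24 is reserved for documentation).
    print(render_netplan("eth0", "192.0.2.10/24", "192.0.2.1", ["192.0.2.53"]))
```

The output would go in a file under `/etc/netplan/` and take effect with `netplan apply`. A real droplet will likely need more than this (for example, a private network interface), so treat it as a starting shape rather than a complete config.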

We can’t fix DigitalOcean’s hypervisor config, so I’ve filed a support ticket. Support says it will take 24h to acknowledge the ticket. I can’t predict whether Tier 1 support will understand the problem or how long it will take to escalate. I’ll update this post when I have more info. I’m sorry for this outage and the current unclear status.

    1. 37

      Also, big thanks to @355e3b for helping. I had figured out part of what was wrong and got us back to read-only mode, but he zeroed in on the underlying cause and recognized that it was safe to operate in this slightly degraded state rather than us needing to take a longer downtime to immediately rebuild db01. His analysis and judgment got us back up significantly quicker.

      1. 33

        Thanks for all your hard work keeping this site going!

        1. 9

          100%, @pushcx appreciate you working on a holiday to bring this back up!

          1. 11

            I’m not sure which holiday you mean, it didn’t really interfere with me going barefoot.

        2. 6

          Update 2024-09-23 08:53: I got a response from DO support overnight. They mostly understood the issue but think it’s a system-level misconfig rather than their hypervisor. I sent some more info based on the debugging that followed the initial ticket.

          1. 4

            I’m very late updating this thread, but: DigitalOcean support investigated, diagnosed, and solved this issue. It wasn’t a hypervisor misconfig, it was this bug in cloud-init. DO suggested a patch (basically the same as the PR fixing the bug). That got cloud-init working correctly, so we’re out of the degraded state where a reboot would’ve caused the box to lose networking. Over in another thread I discussed potential longer-term strategy and got some very helpful experience reports from @strugee, @sunng, and @Exagone313. My current plan is to move the servers to Ubuntu Interim sometime in the next few weeks, when I can schedule a full day in case there are issues. I’ll spend at least a couple of minutes going over the bug in the regularly scheduled office hours stream in a few hours.

          2. 4

            Yup, right out of the DO playbook and I assure you support will get back to you when it’s morning in Pakistan and not a minute before.

            Great job with the recovery process so far and thank you for all the work you do.

            1. 2

              Oof! Been there, done that. I had a FreeBSD VM fail after an upgrade for a similar reason in DO. It’s not something I’d wish on anyone.

              1. 2

                Could an ‘immutable upgrade’, i.e. a rolling upgrade strategy, have prevented this? That is, spin up new upgraded instances while keeping the old ones serving live traffic, verify that the new instances work, then switch traffic over to them?

                1. 1

                  I assume the SQL cluster doesn’t have active-standby replication and swap set up.

                  1. 1

                    Possibly, but it would be a significant amount of added complexity for a situation that only happens every 2 years with new Ubuntu LTS releases. (The 24.04 upgrade was delayed.) My trust in Ansible is also pretty low for the amount of work required to enable this.

                    1. 3

                      What about rset (on lobsters)? I’ve been using it for simple stuff and wow it’s really useful to go from an installed system to a working system.

                      1. 1

                        I hadn’t heard of it. It looks like it knows exactly what it’s going for and provides a pleasantly minimal layer of abstraction over those features.

                    2. 1

                      Another easier approach would be to snapshot the VM before doing large upgrades. You’ll get very easy rollback.

                      The database must be read-only for the entire upgrade, so that a rollback doesn’t discard newer production data.