1. 12

  2. 5

    This is such a clear writeup, I would benefit from more like this.

    Ceph tried to parse an IPv6 address into a struct sockaddr, which only has space for an IPv4 address!
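
    To make the size mismatch concrete, here is a rough sketch (not Ceph's actual code). It assumes the common Linux layout, where `struct sockaddr` is a 2-byte address family followed by 14 bytes of `sa_data`. The helper `fits_in_sockaddr` and both constants are made up for illustration.

    ```python
    import ipaddress

    # Assumption: the common Linux layout, where struct sockaddr is a 2-byte
    # address family followed by 14 bytes of sa_data (16 bytes in total).
    SA_DATA_BYTES = 14
    PORT_BYTES = 2  # both sockaddr_in and sockaddr_in6 store a 16-bit port


    def fits_in_sockaddr(address: str) -> bool:
        """Return True if the port plus the raw address bytes fit in sa_data."""
        raw = ipaddress.ip_address(address).packed  # 4 bytes for IPv4, 16 for IPv6
        return PORT_BYTES + len(raw) <= SA_DATA_BYTES


    print(fits_in_sockaddr("192.0.2.1"))    # IPv4: 2 + 4 bytes
    print(fits_in_sockaddr("2001:db8::1"))  # IPv6: 2 + 16 bytes
    ```

    The IPv4 case fits and the IPv6 case does not, even before counting the flow info and scope ID fields that `sockaddr_in6` also carries. That is why C code is normally expected to use `struct sockaddr_storage` when the address family is not known in advance.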

    I’m excited about Rust replacing C. Though anything with a better type system would be fine by me.

    Related, I hope IPv6 gains traction soon and shakes out all these sorts of bugs.

    1. 3

      Excellent writeup. Ceph itself aside, I see a few operational practices that could have prevented/reduced this, so I come away with very different lessons learned:

      • This fuels my personal bias against compiling software from source for infrastructure stuff
      • Reboot (or at least restart services as thoroughly as possible) immediately after software upgrades so everything is always in a known state
      • Upgrading caused problems, which led to a further upgrade, which caused yet more problems, etc. Seems to me like the Easy Path would have been to downgrade to the last known-good version (though there was none, due to #2), which should have gotten the system operational again more quickly.
      1. 2

        That is a pretty good summary!

        Would like to add that compiling from source can be a good alternative only if it has been done way ahead of time, and there is an independent process compiling all the subsequent releases. That’s how/why we use Nix, niv, and niv-updater-action.

      2. 2

        Really great writeup. It aligns with what I have seen too: it’s hard to run Ceph, as there are too many failure modes, and staggered upgrades are lifesavers.

          1. 2

            We came to the conclusion that we needed to rebuild world to get into a consistent state.

            I’m curious as to why one would consider doing this on a production cluster. Sure, relying on prebuilt binaries might be a tad slower (since they are, as the wisdom goes, not built with your CPU’s exact compile flags), but you would likely make up for that marginal loss with stability (again, as the wisdom goes), or at least with a certain level of confidence that you’re running known-good binaries.

            What drove the decision here?

            I know the topic is very troll-bait-like, I apologise. But I am genuinely curious, not passing judgement one way or another.