1. 4
    1. 3

      If the data structure is append-only, why is step 4 necessary?

      1. 2

        Good question. Here is a sequence of events that makes step 4 necessary. Let’s say there are two servers that want to coordinate and share their IP addresses. The first server comes online, unconditionally initializes/resets the data structure, and adds its own IP address. Some time later the second server comes online and does the same thing. The second server's reset wipes whatever the first server wrote, so the first server needs to re-add its IP address. Without the loop we lose information.

        The issue is that there is no nice way to initialize the data structure without jumping through some extra hoops. Avoiding the loop requires additional checks for making initialization idempotent. It’s doable but slightly more complicated than just looping.

        1. 1

          What prevents the data structure from being reset by another server during step 5?

          1. 1

            The assumption is that each registrant knows there will be exactly N entries and this information is communicated out of band or ahead of time by whatever process is kickstarting the startup/bootstrapping process. So as soon as there are N entries we can be assured no resets will happen.

            If this is not the case then I don’t know how to guarantee progress. You would need timeouts or some other mechanism to decide when it is safe to proceed.

    2. 2

      Is it assuming registrants will not fail?

      Otherwise the hardcoded number of registrants could be a problem – say one of them fails and never starts, then the algorithm will not finish (this could be a feature rather than an issue of course, depending on use).

      Also, if registrants can fail, imagine this: one registrant re-adds its id as the final entry and then crashes before it continues. Half of the other registrants see the full size and move on to the rest of their work. The crashed registrant then restarts and resets the data structure, and the other half are left looping on a half-full data structure.

      1. 2

        I use this in cloud-init scripts, so restarts aren’t an issue: cloud-init scripts aren’t re-executed on restart. What happens in practice is that cluster formation fails.
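
For readers following the thread, here is a minimal sketch of the reset, add, and loop behaviour discussed above. It is not the original poster's code: the `Registry` class is a hypothetical in-memory stand-in for the shared data structure, and `register`, `run_cluster`, and the `poll` parameter are names invented for illustration. It assumes, as stated in the thread, that every registrant knows the expected count N ahead of time and that no registrant crashes or restarts.

```python
import threading
import time


class Registry:
    """Hypothetical in-memory stand-in for the shared data structure."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = []

    def reset(self):
        with self._lock:
            self._entries = []

    def append(self, ip):
        with self._lock:
            self._entries.append(ip)

    def read(self):
        with self._lock:
            return list(self._entries)


def register(registry, my_ip, expected, poll=0.01):
    """Reset unconditionally, add our IP, then loop until N entries exist."""
    registry.reset()          # may wipe entries written by earlier registrants
    registry.append(my_ip)
    while True:
        entries = registry.read()
        if my_ip not in entries:
            # A later registrant's reset wiped us, so re-add (the "step 4" loop).
            registry.append(my_ip)
        elif len(set(entries)) >= expected:
            # N distinct entries means all N registrants have already reset,
            # so no further resets can happen.
            return sorted(set(entries))
        time.sleep(poll)


def run_cluster(ips):
    """Run one thread per registrant and collect what each one observed."""
    registry = Registry()
    results = {}

    def worker(ip):
        results[ip] = register(registry, ip, expected=len(ips))

    threads = [threading.Thread(target=worker, args=(ip,)) for ip in ips]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for ip, view in run_cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"]).items():
        print(ip, "sees", view)
```

Each entry is only written after its owner's reset, which is why seeing N distinct entries is enough to conclude that no reset is still pending. The sketch does not handle the crash-and-restart scenario raised in the thread; a registrant that never starts leaves the others looping forever.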