
    In newer OTP releases there is handle_continue/2 for that, so you do not need to resort to such dirty hacks.
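
    For reference, a minimal sketch of what that looks like (the module and function names here are invented): init/1 returns a {continue, Term} tuple so it can come back straight away, and the deferred work runs in handle_continue/2 before any other message is handled.

        -module(deferred_init_server).
        -behaviour(gen_server).

        -export([start_link/0]).
        -export([init/1, handle_continue/2, handle_call/3, handle_cast/2]).

        start_link() ->
            gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

        init(_Args) ->
            %% Return immediately so the supervisor is not blocked; the
            %% expensive part is deferred to handle_continue/2.
            {ok, #{data => undefined}, {continue, load}}.

        handle_continue(load, State) ->
            %% Runs straight after init/1, before any queued message.
            {noreply, State#{data => expensive_load()}}.

        handle_call(_Request, _From, State) ->
            {reply, ok, State}.

        handle_cast(_Msg, State) ->
            {noreply, State}.

        %% Placeholder for whatever slow setup work is actually needed.
        expensive_load() ->
            ready.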


      Thank you, @hauleth. I added a note about this to the post.


        Depending of course on what I am doing, I tend to just have my handle_continue pass directly to handle_cast, which lines things up for me to use handle_cast(hup, State) as a generic point to handle reloads throughout the lifetime of my processes.

        Before continue was an option, I used to just call gen_server:cast(?MODULE, hup) before my init function returned.
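
        Roughly, the relevant clauses look like the sketch below (initial_state/1 and reload/1 are stand-ins for whatever the server actually does, and the remaining callbacks are the usual gen_server boilerplate); the older cast-to-self variant is shown commented out.

            %% Newer OTP: init/1 hands off to handle_continue/2, which simply
            %% forwards to the same `hup' clause that handles reloads later on.
            init(Args) ->
                {ok, initial_state(Args), {continue, hup}}.

            handle_continue(hup, State) ->
                handle_cast(hup, State).

            %% Generic reload point; callers use gen_server:cast(?MODULE, hup).
            handle_cast(hup, State) ->
                {noreply, reload(State)}.

            %% Pre-continue equivalent: queue the cast before init/1 returns
            %% (the variant with the race discussed further down the thread).
            %%
            %% init(Args) ->
            %%     gen_server:cast(?MODULE, hup),
            %%     {ok, initial_state(Args)}.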


          The problem with a message to self is that it causes a race condition: if anything sends a message after the process is spawned but before that self-message arrives, the server can receive messages while in the wrong state. handle_continue avoids that, as it is run right after the current message is processed but before the next loop iteration.


            Why would you want init to return immediately but then not actually be ready to service requests and care about this race condition?


              Because a long-running init will block the launch of other processes in the supervision tree. So sometimes you want the rest of the tree to continue launching while the process is still prepping itself, queuing all messages that arrive in the meantime. This is where handle_continue becomes handy.
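
              For instance, a supervisor starts its children strictly in order, and each start_link call blocks until that child's init/1 returns; in a sketch like the following (names invented), nothing below slow_server is even spawned until its init finishes.

                  -module(app_sup).
                  -behaviour(supervisor).

                  -export([start_link/0]).
                  -export([init/1]).

                  start_link() ->
                      supervisor:start_link({local, ?MODULE}, ?MODULE, []).

                  init([]) ->
                      SupFlags = #{strategy => one_for_one},
                      Children =
                          [#{id => slow_server,
                             start => {slow_server, start_link, []}},
                           %% cache_server is not spawned until slow_server's
                           %% init/1 has returned: a long init here delays
                           %% everything below it in the tree.
                           #{id => cache_server,
                             start => {cache_server, start_link, []}}],
                      {ok, {SupFlags, Children}}.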


                I think you need to go back and read my comments, and bear in mind that they are about what the article is setting out to solve, not about telling you what you should be doing.

                The whole point of supervision is that it expects your process to be in a stable state once your init returns. If you are using tricks (continue, gen_server:cast() from init, …) to speed up your startup time, you have officially declared “I do not care about supervision”, as your now out-of-band long init task could still lead to a crash of that process.

                Your point about the race condition is correct, but is replacing it with an unstable supervision tree just to get faster startup times something that is correct for everyone?

                Either you think everyone should do this, or (more likely) you have not read or taken on board:

                • “Before continue was an option…”
                • “Depending of course on what I am doing…”
                • continue is a recent OTP feature (stated by yourself)
                • continue is not available in all generic servers (stated by the author)

                So beating up on me to use continue when it may not be an option is not really fair, right?

                Throwing out a stable supervision tree may be okay and manageable in your environment. The race condition you describe is correct but has zero impact on my life, as either I arrange it so that dropping those messages is okay (two lines in a gen_server), or alternatively I expose ‘readiness’ to a load balancer (where startup times are often irrelevant), making the race a non-issue.

                I suspect others may have their own thoughts here, and I am interested in hearing from them. What I do not appreciate is being told to swap a race condition for an unstable supervision tree whilst being told to suck eggs.


                  The obvious case, to me, is starting up a process/service that is inherently flaky, e.g. a database connection. The “stable state” of the process is disconnected, and whenever it is disconnected it is attempting to establish a connection/reconnect. While in that disconnected/trying to connect state, it can still service requests from clients, by simply returning an error. The rest of the system can proceed with startup, even if the database service was temporarily unavailable, or otherwise flaky during boot.

                  This is especially important for resilience in the supervision tree. Using the example from above, if your process tries to establish a connection during init and it fails, it will likely fail again immediately, probably triggering the max restarts of the supervisor and bringing down large swaths of the application (or, in the worst case, the entire node). The same applies later on in the node’s life: should the database connection fail and trigger a restart, and the connection cannot be immediately re-established, it is very likely that the process will restart in a loop, trigger the max restarts, and take down a bunch of stuff it shouldn’t have impacted.

                  The idea of doing that fragile work post-init is not to boot faster, it is to boot stable, and to ensure that restarts during the application lifecycle aren’t prone to the crash loop problem I described above.
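
                  A sketch of that shape under assumed names (connect_to_db/0 and run_query/2 stand in for a real driver): ‘disconnected’ is treated as a perfectly serviceable state, connecting is retried in the background, and callers simply get an error until a connection exists.

                      -module(db_conn).
                      -behaviour(gen_server).

                      -export([start_link/0, query/1]).
                      -export([init/1, handle_continue/2, handle_call/3,
                               handle_cast/2, handle_info/2]).

                      -define(RETRY_AFTER, 5000).  %% ms between connect attempts

                      start_link() ->
                          gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

                      query(Q) ->
                          gen_server:call(?MODULE, {query, Q}).

                      init(_Args) ->
                          %% The stable state is "disconnected"; return at once so
                          %% the rest of the supervision tree keeps starting.
                          {ok, #{conn => undefined}, {continue, connect}}.

                      handle_continue(connect, State) ->
                          {noreply, try_connect(State)}.

                      %% Disconnected is serviceable: reply with an error rather
                      %% than crashing or blocking.
                      handle_call({query, _Q}, _From, #{conn := undefined} = State) ->
                          {reply, {error, disconnected}, State};
                      handle_call({query, Q}, _From, #{conn := Conn} = State) ->
                          {reply, run_query(Conn, Q), State}.

                      handle_cast(_Msg, State) ->
                          {noreply, State}.

                      handle_info(retry, State) ->
                          {noreply, try_connect(State)};
                      handle_info(_Info, State) ->
                          {noreply, State}.

                      try_connect(State) ->
                          case connect_to_db() of
                              {ok, Conn} ->
                                  State#{conn => Conn};
                              {error, _Reason} ->
                                  erlang:send_after(?RETRY_AFTER, self(), retry),
                                  State#{conn => undefined}
                          end.

                      %% Placeholders for a real database client.
                      connect_to_db() -> {error, not_implemented}.
                      run_query(Conn, Q) -> {ok, {Conn, Q}}.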


                    As database disconnects are ‘normal’ and expected (not exceptional), though, this sounds like the stable process here describes a behaviour similar to a simple_one_for_one supervisor (where you do not care, but if you did you would instead use one_for_one)?

                    I still need to fix my Erlang, as I am still ‘broken’ in the manner that I prefer a SIGHUP-via-cast approach over process restarting, since I have yet to see a simple way of handling multiple reload requests without a global lock; though I guess you could drain your message queue with a pattern matching only reload requests and debounce the whole system that way before exiting?
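
                    For what it’s worth, a rough sketch of that drain idea (reload/1 is a stand-in, and it leans on gen_server casts arriving as {'$gen_cast', Msg} tuples, which is an internal detail, so treat it as a hack): after handling one reload, flush any hup casts already sitting in the mailbox so a burst collapses into a single reload.

                        handle_cast(hup, State) ->
                            NewState = reload(State),
                            ok = flush_pending_hups(),
                            {noreply, NewState}.

                        %% Selectively receive (and drop) any queued hup casts.
                        flush_pending_hups() ->
                            receive
                                {'$gen_cast', hup} -> flush_pending_hups()
                            after 0 ->
                                ok
                            end.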


                      “As database disconnects are ‘normal’ and expected (not exceptional), though, this sounds like the stable process here describes a behaviour similar to a simple_one_for_one supervisor …”

                      The example of a database connection was illustrative, not meant as the best example, as typically they are created on demand and so would not be restarted, or would run under a simple_one_for_one supervisor. However, the point it was meant to demonstrate is that some processes/services that have external dependencies are by definition unreliable; as you point out, this is part of their “normal” behavioral profile, so it is critical that the process handles the flakiness gracefully. An alternative example might be an S3/RabbitMQ/etc. consumer/producer - the same rules apply: if you require init to block on successfully connecting to the service, it will bite you eventually, and it adds fragility where it isn’t necessary.

                      “I still need to fix my Erlang, as I am still ‘broken’ in the manner that I prefer a SIGHUP-via-cast approach over process restarting, since I have yet to see a simple way of handling multiple reload requests without a global lock …”

                      I’m not sure I follow; they are essentially orthogonal concerns, right? You can have supervision/process restarts and provide the ability to do some kind of “soft restart” via a SIGHUP-like message sent to the process. The part I think I’m missing here is what you mean by handling multiple reload requests - for a single process the requests are all serialized anyway, so if they trigger an actual process restart, then only the first message matters. If you are doing some kind of soft restart, then you could “debounce” them by dropping duplicate restart requests that come within some window of time (no need to drain the mailbox with a selective receive). In general, though, I would assume the SIGHUP-like signal you are sending is something triggered by an operator, not by the system, in which case handling that edge case seems like worrying about something that is unlikely to ever happen in practice. Maybe an example of where you are using this pattern would help illustrate the problem it solves, though.
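
                      By way of illustration, a time-window debounce might look something like this (a sketch only; the one-second window, the reload/1 helper, and the assumption that init/1 seeded last_reload => 0 in the state are all made up):

                          %% Assumes init/1 put last_reload => 0 into the state map.
                          handle_cast(hup, State = #{last_reload := Last}) ->
                              Now = erlang:monotonic_time(millisecond),
                              case Now - Last < 1000 of
                                  true ->
                                      %% A reload ran less than a second ago: drop it.
                                      {noreply, State};
                                  false ->
                                      NewState = reload(State),
                                      {noreply, NewState#{last_reload => Now}}
                              end.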