Supervision problem?

josevalim · August 27, 2023, 4:39pm

This is more an issue of using Process.exit(pid, :kill) on a supervisor. :kill is a last-resort exit signal that cannot be trapped, so the supervisor terminates immediately, without telling its children to terminate properly. The supervisor exits, and the children then have to notice that their parent is dead, clean up, and terminate on their own.

Meanwhile, the application is restarting part of the tree at the same time, which ends up conflicting with the old one still shutting down, causing another failure. This eventually triggers max_restarts and the application shuts down.
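To make the first half of this concrete, here is a minimal sketch of what :kill does to a supervisor. It assumes a throwaway supervisor with a single Agent child (both are illustrative, not taken from the original application) and only shows the exit reasons involved, not the restart race itself.

```elixir
# Illustrative setup only: a throwaway supervisor with one Agent child.
# Trap exits so this (linked) caller survives when the supervisor is killed.
Process.flag(:trap_exit, true)

{:ok, sup} = Supervisor.start_link([{Agent, fn -> 0 end}], strategy: :one_for_one)
[{_id, child, _type, _modules}] = Supervisor.which_children(sup)

sup_ref = Process.monitor(sup)
child_ref = Process.monitor(child)

# :kill cannot be trapped, so the supervisor never runs its shutdown logic.
Process.exit(sup, :kill)

sup_reason =
  receive do
    {:DOWN, ^sup_ref, :process, ^sup, reason} -> reason
  after
    1_000 -> :timeout
  end

# The child was never asked to shut down. It only goes away because it is
# linked to the supervisor and receives the exit signal from the dead parent.
child_reason =
  receive do
    {:DOWN, ^child_ref, :process, ^child, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect({sup_reason, child_reason}, label: "exit reasons")
```

The Agent here does not trap exits, so it dies right away. A child that traps exits or does real cleanup can take longer to go away, which is where the overlap with the restarted tree comes from.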

Overall:

Only use :kill as a last resort, when the process did not respond to any other exit signal. This especially applies to supervisors, as their only job is to ensure processes start and terminate accordingly, and sending a :kill voids that.
If you want to simulate stopping a supervisor, Supervisor.stop will at least go through the usual flow.
Run iex --logger-sasl-reports true -S mix phx.server to get precise logging from supervisors.
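As a rough sketch of the second point, stopping a supervisor with Supervisor.stop lets it shut its children down before it exits. The supervisor and its Agent child below are again illustrative stand-ins, not part of the original application.

```elixir
# Illustrative setup only: a throwaway supervisor with one Agent child.
{:ok, sup} = Supervisor.start_link([{Agent, fn -> 0 end}], strategy: :one_for_one)
[{_id, child, _type, _modules}] = Supervisor.which_children(sup)

child_ref = Process.monitor(child)

# Supervisor.stop/1 stops the supervisor with reason :normal. Before exiting,
# the supervisor asks each child to shut down and waits for it to do so.
:ok = Supervisor.stop(sup)

child_reason =
  receive do
    {:DOWN, ^child_ref, :process, ^child, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect(child_reason, label: "child exit reason")
```

By the time Supervisor.stop returns, the children have already been terminated, so nothing is left behind to conflict with a fresh copy of the tree.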

One question that may arise from this is: could my supervisor fail in a way that triggers the same behaviour as the :kill signal? Supervisors trap exits, which means no other process (linked or not) could cause them to crash. Therefore, this can only happen if there is a bug in the supervisor. And if there is a bug in the supervisor, then it can't guarantee its fault-tolerant properties anyway. That's why supervisors rarely change: they have been thoroughly tested for decades and are an essential piece of your application.
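The "supervisors trap exits" part can be checked directly. The sketch below assumes a throwaway supervisor with no children and sends it an exit signal with a made-up reason (:boom) from an unrelated process. The signal is sent from a separate process on purpose: an exit coming from the supervisor's own parent makes it shut down in the usual orderly way rather than being ignored.

```elixir
# Illustrative setup only: a throwaway supervisor with no children.
{:ok, sup} = Supervisor.start_link([], strategy: :one_for_one)

# Supervisors set the trap_exit flag when they start.
trap_info = Process.info(sup, :trap_exit)

# Send a non-:kill exit signal from an unrelated (non-parent) process.
sender = spawn(fn -> Process.exit(sup, :boom) end)
sender_ref = Process.monitor(sender)

receive do
  {:DOWN, ^sender_ref, :process, ^sender, _reason} -> :ok
after
  1_000 -> :timeout
end

# A synchronous call that only succeeds if the supervisor is still running.
children = Supervisor.which_children(sup)
still_alive = Process.alive?(sup)

IO.inspect({trap_info, children, still_alive}, label: "after :boom")

:ok = Supervisor.stop(sup)
```

Only :kill bypasses this, because it is the one exit reason that cannot be trapped.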