Publicly summarizing:

The existing 'reconfigure' mechanism is rather fragile:
(a) it cannot change which plugins are loaded
(b) derived configuration data is cached in a large number of locations that do not check for a changed last_update timestamp, so they miss configuration changes that could invalidate those cached values
(c) any of the numerous fatal() calls reachable during plugin reconfigure calls (or configuration parsing) will cause the slurmctld and/or slurmd processes to die

Architecturally, my plan to address these is to move away from an in-memory reconfiguration, and instead have the reconfigure mechanism start new slurmctld/slurmd processes and hand off control to them.

This fixes (a) by allowing the new processes to load different plugins, and (b) by making these cached values irrelevant, since the new process derives them from scratch. (Existing locations checking and invalidating configuration based on last_update could be removed as well.)

For (c), the hand-off mechanism between the old and new process would need to wait until some amount of initial bootstrapping had occurred. Only once an "okay" signal (likely sent over a pipe) has been received from the new process would the old process finish shutting down. If the child dies, the old process should resume execution with the existing configuration, and would be able to respond to the reconfigure RPC with some indication that there is a problem with the change and that it will not be applied.
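As a rough illustration of the hand-off described for (c), here is a minimal sketch in Python. It is not Slurm code (Slurm is written in C), and the names try_handoff and bootstrap are hypothetical. It also assumes a plain fork() rather than starting a fresh slurmctld/slurmd binary, and the child exits right after signalling where a real daemon would continue into its main loop.

```python
import os
import select
import signal


def try_handoff(bootstrap, timeout=5.0):
    """Fork a replacement process and wait for its "okay" byte on a pipe.

    Returns the child's pid if it bootstrapped successfully (the old process
    may then finish shutting down), or None if the child died or timed out
    (the old process keeps running with its existing configuration).
    `bootstrap` is a hypothetical callable standing in for configuration
    parsing and plugin loading.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: stands in for the new slurmctld/slurmd process.
        os.close(read_fd)
        status = 1
        try:
            bootstrap()
            os.write(write_fd, b"1")  # the "okay" signal
            status = 0
            # A real daemon would continue into its main loop here.
        except BaseException:
            pass  # models a fatal() during parsing or plugin init
        os._exit(status)

    # Parent: the old process. Close our copy of the write end so that a
    # dead child shows up as end-of-file on the read end.
    os.close(write_fd)
    try:
        ready, _, _ = select.select([read_fd], [], [], timeout)
        ok = bool(ready) and os.read(read_fd, 1) == b"1"
    finally:
        os.close(read_fd)

    if ok:
        return pid

    # No "okay" arrived: make sure the child is gone, reap it, and resume.
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    os.waitpid(pid, 0)
    return None
```

When try_handoff returns None, the old process would answer the reconfigure RPC with an error and carry on unchanged. When it returns a pid, the old process would complete its shutdown and leave the new process in control.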
*** Ticket 11602 has been marked as a duplicate of this ticket. ***
*** Ticket 10321 has been marked as a duplicate of this ticket. ***
*** Ticket 13054 has been marked as a duplicate of this ticket. ***