Ticket 10597 - reconfigure mechanism overhaul
Summary: reconfigure mechanism overhaul
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.x
Hardware: Linux Linux
Importance: 5 - Enhancement
Assignee: Tim Wickberg
QA Contact:
URL:
Duplicates: 10321 11274 11602 13054
Depends on:
Blocks: 9313
Reported: 2021-01-08 16:50 MST by Brian Christiansen
Modified: 2024-10-23 23:17 MDT
CC List: 10 users

See Also:
Site: SchedMD


Comment 6 Tim Wickberg 2021-06-17 14:16:55 MDT
Publicly summarizing:

The existing 'reconfigure' mechanism is rather fragile:

(a) it cannot make changes to which plugins are loaded
(b) there are a large number of locations where derived configuration data is cached without checking for a changed last_update timestamp, so a configuration change that should invalidate the cached values goes unnoticed
(c) any of a number of fatal() calls reachable during plugin reconfigure calls (or configuration parsing) will kill the slurmctld and/or slurmd processes

Architecturally, my plan to address these is to move away from an in-memory reconfiguration, and instead have the reconfigure mechanism start new slurmctld/slurmd processes and hand off control to them.

This fixes (a) by allowing the new processes to load different plugins and (b) by making it so that these cached values are no longer relevant. (Existing locations checking and invalidating configuration based on last_update could be removed as well.)
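
For readers unfamiliar with the pattern in (b), here is a minimal sketch of what a last_update check on a cached, derived value looks like. It is illustrative Python, not Slurm code: Config, CachedValue, and reconfigure() are hypothetical names, and an integer counter stands in for the real timestamp so the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical stand-in for the daemon's in-memory configuration."""
    options: dict
    last_update: int = 0  # counter standing in for a timestamp

    def reconfigure(self, **changes):
        # An in-memory reconfigure mutates the options and bumps last_update.
        self.options.update(changes)
        self.last_update += 1


class CachedValue:
    """Caches a value derived from the config, re-deriving when it changes."""

    def __init__(self, config, derive):
        self._config = config
        self._derive = derive
        self._seen = None   # last_update observed when the value was derived
        self._value = None

    def get(self):
        # This is the check that every caching location has to remember.
        # Omitting it silently serves stale data after a reconfigure.
        if self._seen != self._config.last_update:
            self._value = self._derive(self._config.options)
            self._seen = self._config.last_update
        return self._value


conf = Config({"nodes": 2})
doubled = CachedValue(conf, lambda opts: opts["nodes"] * 2)
before = doubled.get()
conf.reconfigure(nodes=3)
after = doubled.get()
```

With a fresh process per reconfigure, every derived value is computed once from the new configuration, so checks of this kind are no longer needed.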

For (c), the hand-off between the old and new process would need to wait until some amount of initial bootstrapping had occurred; only once an "okay" signal (likely sent over a pipe) has been received from the new process would the old process finish shutting down. If the child dies, the old process should resume execution with the existing configuration, and would be able to respond to the reconfigure RPC with some indication that there is a problem with the change and that it will not be applied.
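
A rough sketch of the hand-off described for (c), in Python rather than Slurm's C and only to illustrate the control flow. All names here (try_handoff, READY, the bootstrap callable, the timeout) are assumptions for the example and not part of any Slurm API. The child stands in for the new slurmctld/slurmd: it runs its bootstrapping and writes an "okay" token to a pipe, and the parent shuts down only if it reads that token.

```python
import os
import select
import signal

READY = b"OK"  # hypothetical token; the ticket only specifies an "okay" signal


def try_handoff(bootstrap, timeout=5.0):
    """Fork a replacement process and wait for its readiness signal.

    Returns True if the child reported a successful bootstrap (the caller
    may then finish shutting down). Returns False if the child died or
    timed out (the caller keeps running with its existing configuration).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: stands in for the newly started daemon.
        status = 1
        try:
            os.close(read_fd)
            bootstrap()                # e.g. parse config, load plugins
            os.write(write_fd, READY)  # tell the old process we are up
            status = 0
        finally:
            # A real daemon would keep serving here; the sketch just exits.
            # A failure in bootstrap() plays the role of a fatal() call.
            os._exit(status)

    # Parent: the old process, still holding the existing configuration.
    os.close(write_fd)
    try:
        ready, _, _ = select.select([read_fd], [], [], timeout)
        # EOF (b"") means the child exited without signalling.
        msg = os.read(read_fd, len(READY)) if ready else b""
    finally:
        os.close(read_fd)

    if msg == READY:
        os.waitpid(pid, 0)  # sketch only: the child exits right away
        return True

    if not ready:
        os.kill(pid, signal.SIGKILL)  # child hung during bootstrap
    os.waitpid(pid, 0)
    return False
```

A caller would use the return value to decide how to answer the reconfigure RPC: success if try_handoff() returned True, otherwise an error saying the change was not applied while the old process carries on.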
Comment 7 Marcin Stolarek 2021-06-22 02:53:01 MDT
*** Ticket 11602 has been marked as a duplicate of this ticket. ***
Comment 8 Albert Gil 2021-08-27 09:32:32 MDT
*** Ticket 10321 has been marked as a duplicate of this ticket. ***
Comment 10 Jason Booth 2021-12-16 13:03:12 MST
*** Ticket 13054 has been marked as a duplicate of this ticket. ***