Ticket 5644

Summary: pending reboot state doesn't survive across a slurmctld restart
Product: Slurm Reporter: Phil Schwan <phils>
Component: slurmctld Assignee: Marshall Garey <marshall>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.7   
Hardware: Linux   
OS: Linux   
Site: DownUnder GeoSolutions Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Phil Schwan 2018-08-29 23:06:54 MDT
It appears that if you:

- "scontrol reboot" a bunch of nodes, one or more of which can't reboot immediately (because they're allocated)

- Restart slurmctld (most often because logrotate runs, but could also be a slurmctld crash, or changing a parameter that requires a restart, or whatever)

...then it loses the state of which nodes were pending reboot.
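One way to confirm which nodes were dropped is to snapshot the pending-reboot set before and after the slurmctld restart and compare the two. The following is only a rough sketch, not something taken from this ticket: it assumes that `sinfo -N -h -o "%N %t"` prints one "node state" pair per line and that this sinfo version marks pending-reboot nodes with a trailing "@" on the compact state (that marker may not exist in every release). The function names are hypothetical.

```python
import subprocess


def pending_reboot_nodes(sinfo_text):
    """Return the set of node names whose compact state contains '@'.

    Assumption: '@' is how this sinfo version flags a pending reboot.
    """
    nodes = set()
    for line in sinfo_text.splitlines():
        parts = line.split()
        # Expect exactly "<node> <state>"; skip anything else.
        if len(parts) == 2 and "@" in parts[1]:
            nodes.add(parts[0])
    return nodes


def snapshot():
    """Query sinfo (node-oriented, no header) and return pending-reboot nodes."""
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %t"],
        check=True,
        capture_output=True,
        text=True,
    )
    return pending_reboot_nodes(result.stdout)


def lost_after_restart(before, after):
    """Nodes that were pending reboot before the restart but not after."""
    return sorted(before - after)
```

Call snapshot() once after the "scontrol reboot" and again after slurmctld has come back, then pass both sets to lost_after_restart(). Nodes that actually rebooted in between will also appear in the result, so the list still needs to be checked against node boot times.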

This may not sound all that important, but it's a big deal when you're trying to do a controlled reboot of 2,000 nodes, any number of which might be in the middle of tasks that will run for hours or even days.

Cheers,

-Phil
Comment 1 Marshall Garey 2018-08-31 10:57:08 MDT
Hi Phil,

I can't reproduce it, following what you've described.

- Does this happen every time, even if you just scontrol reboot one or a few nodes?
- Do you have a test system you can easily reproduce this on? If so, can you detail exact steps?
- Can you upload the slurmctld log file from a day when this issue happened? Can you also upload a slurmd log file from one of the allocated nodes whose pending reboot the slurmctld forgot?

Thanks.

- Marshall
Comment 2 Marshall Garey 2018-09-06 16:59:51 MDT
- In addition to requests from comment 1, can you also upload your current slurm.conf file? That would be helpful for me in trying to reproduce this.
Comment 3 Marshall Garey 2018-09-18 10:16:46 MDT
Closing as resolved/timedout. Feel free to reopen this whenever you have time to get the requested materials.