It appears that if you:

- "scontrol reboot" a bunch of nodes, one or more of which can't reboot immediately (because they're allocated)
- Restart slurmctld (most often because logrotate runs, but could also be a slurmctld crash, a change to a parameter that requires a restart, or whatever)

...then it loses the state of which nodes were pending reboot.

This may not sound all that important, but it's a big deal when you're trying to do a controlled reboot of 2,000 nodes, any number of which might be in the middle of tasks that will run for hours or even days.

Cheers,
-Phil
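For reference, a minimal sketch of the sequence described above, run against a live cluster. The node names and the systemctl unit name are assumptions; adapt them to your site.

```shell
# Request a reboot of nodes, some of which are allocated and so
# cannot reboot immediately (node names are examples).
scontrol reboot node[001-004]

# Confirm the pending-reboot flag is set on an allocated node
# (the State line should include the REBOOT flag).
scontrol show node node001 | grep -i state

# Restart the controller, e.g. as logrotate would.
systemctl restart slurmctld

# Check again: per the report, the pending-reboot state is now lost.
scontrol show node node001 | grep -i state
```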
Hi Phil,

I can't reproduce it following what you've described.

- Does this happen every time, even if you just scontrol reboot one or a few nodes?
- Do you have a test system where you can easily reproduce this? If so, can you detail the exact steps?
- Can you upload the slurmctld log file from a day when this issue happened? Can you also upload a slurmd log file from one of the allocated nodes that slurmctld forgot was pending reboot?

Thanks,
- Marshall
In addition to the requests in comment 1, can you also upload your current slurm.conf file? That would help me try to reproduce this.
Closing as resolved/timedout. Feel free to reopen this whenever you have time to get the requested materials.