| Summary: | pending reboot state doesn't survive across a slurmctld restart | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Phil Schwan <phils> |
| Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 17.11.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | Slinky Site: | --- |
Description
Phil Schwan
2018-08-29 23:06:54 MDT
Hi Phil,

I can't reproduce it, following what you've described.

- Does this happen every time, even if you just scontrol reboot one or a few nodes?
- Do you have a test system you can easily reproduce this on? If so, can you detail exact steps?
- Can you upload the slurmctld log file from a day where this issue happened? Can you also upload a slurmd log file from one of the allocated nodes whose pending reboot the slurmctld forgot?

Thanks.
- Marshall

In addition to requests from comment 1, can you also upload your current slurm.conf file? That would be helpful for me in trying to reproduce this.

Closing as resolved/timedout. Feel free to reopen this whenever you have time to get the requested materials.