5644 – pending reboot state doesn't survive across a slurmctld restart

Status:	RESOLVED TIMEDOUT

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	slurmctld
Version:	17.11.7
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Marshall Garey
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2018-08-29 23:06 MDT by Phil Schwan
Modified:	2018-09-18 10:16 MDT
CC List:	0 users

See Also:
Site:	DownUnder GeoSolutions

Description Phil Schwan 2018-08-29 23:06:54 MDT

It appears that if you:

- "scontrol reboot" a bunch of nodes, one or more of which can't reboot immediately (because they're allocated)

- Restart slurmctld (most often because logrotate runs, but could also be a slurmctld crash, or changing a parameter that requires a restart, or whatever)

...then it loses the state of which nodes were pending reboot.

This may not sound all that important, but it's a big deal when you're trying to do a controlled reboot of 2,000 nodes, any number of which might be in the middle of tasks that will run for hours or even days.

Cheers,

-Phil
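
The sequence described above could be checked with a small script that compares node states before and after the slurmctld restart. This is only a sketch under stated assumptions, not something from the ticket: it assumes that sinfo marks a pending reboot with an "@" in the node state (as current sinfo documentation describes; whether 17.11.7 reports it identically has not been verified here), and the helper names parse_states, pending_reboot, lost_pending_reboot and snapshot are hypothetical.

```python
import subprocess


def parse_states(sinfo_output):
    """Parse lines of the form '<node> <state>' into a dict of node -> state."""
    states = {}
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def pending_reboot(states):
    """Return the set of nodes whose state carries the '@' marker.

    Assumption: '@' in the state string means the node is pending reboot.
    """
    return {node for node, state in states.items() if "@" in state}


def lost_pending_reboot(before, after):
    """List nodes that were pending reboot in 'before' but not in 'after'."""
    lost = pending_reboot(parse_states(before)) - pending_reboot(parse_states(after))
    return sorted(lost)


def snapshot():
    """Capture one '<node> <state>' line per node from sinfo.

    -N: node-oriented output, -h: no header, -o: output format
    (%N = node name, %T = state in extended form).
    """
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %T"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative, made-up snapshots (not data from this ticket):
example_before = "n001 mixed@\nn002 idle\nn003 allocated@\n"
example_after = "n001 mixed\nn002 idle\nn003 allocated@\n"
print(lost_pending_reboot(example_before, example_after))
```

On a real cluster, one would call snapshot() after "scontrol reboot", restart slurmctld, call snapshot() again, and pass both results to lost_pending_reboot(). A non-empty list would name the nodes whose pending reboot was forgotten.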

Comment 1 Marshall Garey 2018-08-31 10:57:08 MDT

Hi Phil,

I can't reproduce it, following what you've described.

- Does this happen every time, even if you just scontrol reboot one or a few nodes?
- Do you have a test system you can easily reproduce this on? If so, can you detail exact steps?
- Can you upload the slurmctld log file from a day when this issue happened? Can you also upload a slurmd log file from one of the allocated nodes whose pending reboot the slurmctld forgot?

Thanks.

- Marshall

Comment 2 Marshall Garey 2018-09-06 16:59:51 MDT

- In addition to requests from comment 1, can you also upload your current slurm.conf file? That would be helpful for me in trying to reproduce this.

Comment 3 Marshall Garey 2018-09-18 10:16:46 MDT

Closing as resolved/timedout. Feel free to reopen this whenever you have time to get the requested materials.