| Summary: | scontrol reboot_nodes nodelist repeatedly reboots nodes | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Amit Kumar <ahkumar> |
| Component: | Configuration | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.02.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9468 | ||
| Site: | SMU | ||
|
Description

**Amit Kumar** — 2020-08-25 14:09:53 MDT
**Jason Booth** (comment 2):

Amit, this looks like it may be a duplicate of bug #9513, which Marcin is handling. He will have a look and let you know.

**Amit Kumar:**

(In reply to Jason Booth from comment #2)
> Amit this looks like it may be a duplicate of bug #9513 which Marcin is handling. He will have a look and let you know.

Hi Jason,

Thank you for the reply. Reading through that other bug, their main issue seems to be that the node does not return to service and stays in the drain state. For me, the issue is that as soon as slurmd starts on the compute node, slurmctld issues a shutdown command. In the slurmctld log I am constantly seeing this line for every node that was rebooted:

```
debug2: _queue_reboot_msg: Still waiting for boot of node v001
```

This has repeated through more than 10 reboots, which is not normal. If you can give me a way to force slurmctld not to initiate further reboots, I will be happy. I have tried to issue `cancel_reboot`, but it is not accepted; scontrol says the node has already begun rebooting. It has been 4 hours and the reboots continue. There should be a way to force slurmctld to stop doing this; it is quite annoying and I am not able to bring the nodes back into service. Any help here is appreciated.

Thank you,
Amit

**Jason Booth:**

Can you confirm whether the node really booted? You could try adding `-b` to the slurmd options to see if this stops the reboot:

```
-b    Report node rebooted when daemon restarted. Used for testing purposes.
```

**Jason Booth** (comment 5):

You could also try increasing ResumeTimeout to 600 seconds and using:

```
scontrol reboot ASAP nextstate=resume <nodelist>
```

**Amit Kumar:**

(In reply to Jason Booth from comment #5)
> You could also try:
> Increasing the ResumeTimeout to 600 seconds and using scontrol reboot ASAP nextstate=resume <nodelist>

Hi Jason,

Thank you. I am not sure at what time late yesterday the nodes were set to the down state by the controller. I started slurmd on these already-up nodes, and they have stayed up and returned to service. I am happy to provide additional logs if it helps you with debugging the related bug.
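For anyone landing on this thread, the two scontrol operations discussed above can be sketched as below. This is a minimal, illustrative script: the node list `v[001-010]` is a placeholder for your own nodes, and the commands are assembled and printed rather than executed, so the sketch can be reviewed before running it on a live cluster.

```shell
#!/bin/sh
# Hypothetical node range -- substitute your own nodelist.
NODELIST="v[001-010]"

# Queue a one-shot reboot that returns the nodes to service afterwards
# (the workaround suggested in comment 5).
REBOOT_CMD="scontrol reboot asap nextstate=resume $NODELIST"

# Cancel a pending reboot. Note: per the thread, this is rejected once
# slurmctld considers the reboot already in progress.
CANCEL_CMD="scontrol cancel_reboot $NODELIST"

echo "To reboot and resume: $REBOOT_CMD"
echo "To cancel a pending reboot: $CANCEL_CMD"
```

Printing the commands first is a deliberate choice here: on a cluster stuck in a reboot loop, it is worth double-checking the nodelist before handing slurmctld another reboot request.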
Thank you,
Amit

**Marcin Stolarek** (comment 7):

Amit, please share your current configuration and the slurmctld/slurmd logs from the time when you noticed the issue.

cheers, Marcin

**Marcin Stolarek:**

Amit, could you please reply to comment 7? I'm not able to reproduce the issue, but logs from the time it happened and your configuration may be helpful.

cheers, Marcin

**Marcin Stolarek:**

Amit, did you have a chance to gather the logs requested in comment 7? In case of no reply, I'll close the case as timed out.

cheers, Marcin

**Marcin Stolarek:**

Amit, as mentioned last week, I cannot reproduce the case in the lab. There may be certain conditions I didn't find that cause the reported buggy behavior, and I'm not able to reproduce it without the logs from your side.

cheers, Marcin
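The ResumeTimeout change suggested in comment 5 is a `slurm.conf` setting. A fragment like the following shows where it goes; the 600-second value comes from the thread, and the comments reflect my understanding that this timeout also bounds how long slurmctld waits for a node rebooted via `scontrol reboot` before acting on it.

```
# slurm.conf fragment (illustrative; value from comment 5)
# Maximum time, in seconds, between a node resume/reboot request and the
# node becoming usable, after which slurmctld gives up on the node.
ResumeTimeout=600
```

After editing slurm.conf, the change must be propagated to the controller (e.g. with `scontrol reconfigure`) for it to take effect.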