Dear SchedMD, For some reason slurmctld does not want to give up on rebooting, over and over again, a set of nodes that I kicked off using the command scontrol reboot_nodes nodelist. I took the nodes out of the queue, yet I notice that the moment a node comes up and starts slurmd, slurmctld just kills the node again. My ResumeTimeout is 60 sec; no matter what I change this value to, it has no effect. I literally had to disable the slurmd systemctl service to get the nodes to stay up, although out of the queue/use. Any help in resolving this for us would be greatly appreciated. The only change on our end is that two weeks back we upgraded Slurm from 19.05.x to 20.02.4; I am sure I have used scontrol reboot before but never ran into this issue. Please advise. Thank you, Amit
Amit, this looks like it may be a duplicate of bug #9513, which Marcin is handling. He will have a look and let you know.
(In reply to Jason Booth from comment #2) > Amit this looks like it may be a duplicate of bug #9513 which Marcin is > handling. He will have a look and let you know. Hi Jason, Thank you for the reply. Reading through that other bug, their main issue seems to be that the node does not return to service and stays in the drain state. For me, the issue is that as soon as slurmd starts on a compute node, slurmctld issues a shutdown command. In the slurmctld log I am constantly seeing this line for every node that was rebooted: debug2: _queue_reboot_msg: Still waiting for boot of node v001 The above has repeated over more than 10 reboots; this is not normal. If you can give me a way to force slurmctld not to initiate further reboots, I will be happy. I have tried to issue cancel_reboot, but it is not accepted and says the node has already begun rebooting. It has been 4 hours and the reboots continue. There should be a way to force slurmctld to stop doing this; it is quite annoying and I am not able to bring the nodes back into service. Any help here is appreciated. Thank you, Amit
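For reference, the cancel attempt described above would have been along these lines (the node name v001 is taken from the quoted log line; substitute your actual nodelist). This is a sketch of the attempted command, not a fix:

```shell
# Try to cancel the pending reboot for a node; in this case slurmctld
# rejects it because it considers the reboot already in progress.
scontrol cancel_reboot v001
```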
Can you confirm whether the node really booted? You could try adding the -b flag to slurmd to see if this stops the reboot cycle: -b Report node rebooted when daemon restarted. Used for testing purposes.
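As a minimal sketch of that test (assuming slurmd is managed by systemd on the compute node), you could stop the service and run the daemon in the foreground with -b so it immediately reports the node as rebooted to slurmctld:

```shell
# On the affected compute node:
systemctl stop slurmd   # stop the managed instance first
slurmd -D -b            # -D: run in foreground; -b: report node rebooted
```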
You could also try: Increasing the ResumeTimeout to 600 seconds and using scontrol reboot ASAP nextstate=resume <nodelist>
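Concretely, the two suggestions above would look like the following (the nodelist v[001-004] is only an example; adjust to your nodes):

```shell
# In slurm.conf, raise the resume timeout, e.g.:
#   ResumeTimeout=600
# then push the change to the controller:
scontrol reconfigure

# Reboot the nodes as soon as they are idle and return them to
# service automatically once they come back up:
scontrol reboot asap nextstate=resume v[001-004]
```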
(In reply to Jason Booth from comment #5) > You could also try: > > Increasing the ResumeTimeout to 600 seconds and using scontrol reboot ASAP > nextstate=resume <nodelist> Hi Jason, Thank you. I am not sure at what time late yesterday the nodes were set to the down state by the controller. I started slurmd on these already-up nodes, and they have stayed up and returned to service. I am happy to provide additional logs if it helps you with debugging the related bug. Thank you, Amit
Amit, Please share your current configuration and slurmctld/slurmd logs from the time when you noticed the issue. cheers, Marcin
Amit, Could you please reply to comment 7? I'm not able to reproduce the issue, but logs from the time when it happened and your configuration may be helpful. cheers, Marcin
Amit, Did you have a chance to gather the logs requested in comment 7? If there is no reply, I'll close the case as timed out. cheers, Marcin
Amit, As mentioned last week, I cannot reproduce the case in the lab. There may be certain conditions I haven't found that cause the reported buggy behavior, and without logs from your side I am not able to reproduce it. cheers, Marcin