| Summary: | scontrol reboot_nodes nodelist repeatedly reboots nodes | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Amit Kumar <ahkumar> |
| Component: | Configuration | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.02.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9468 | ||
| Site: | SMU | ||
|
Description

**Amit Kumar** — 2020-08-25 14:09:53 MDT
**Jason Booth** (comment 2):

Amit, this looks like it may be a duplicate of bug #9513, which Marcin is handling. He will have a look and let you know.

**Amit Kumar:**

(In reply to Jason Booth from comment #2)
> Amit this looks like it may be a duplicate of bug #9513 which Marcin is handling. He will have a look and let you know.

Hi Jason,

Thank you for the reply. Reading through that other bug, their main issue seems to be that the node does not return to service and stays in the drain state. For me, the issue is that as soon as slurmd starts on the compute node, slurmctld issues a shutdown command. In the slurmctld log I am constantly seeing this line for every node that was rebooted:

```
debug2: _queue_reboot_msg: Still waiting for boot of node v001
```

This has repeated through more than 10 reboots, which is not normal. If you can give me a way to force slurmctld not to initiate further reboots, I will be happy. I have tried to issue `cancel_reboot`, but it is not accepted; scontrol says the node has already begun rebooting. It has been 4 hours and the reboots continue. There should be a way to force slurmctld to stop doing this; it is quite annoying and I am not able to bring the nodes back into service. Any help here is appreciated.

Thank you,
Amit

**Jason Booth:**

Can you confirm whether the node really booted? You could try adding `-b` to the slurmd options to see if this stops the reboot:

```
-b    Report node rebooted when daemon restarted. Used for testing purposes.
```

**Jason Booth** (comment 5):

You could also try increasing ResumeTimeout to 600 seconds and using:

```
scontrol reboot ASAP nextstate=resume <nodelist>
```

**Amit Kumar:**

(In reply to Jason Booth from comment #5)
> You could also try:
> Increasing the ResumeTimeout to 600 seconds and using scontrol reboot ASAP nextstate=resume <nodelist>

Hi Jason,

Thank you. I am not sure at what time late yesterday the nodes were set to the down state by the controller. I started slurmd on these already-up nodes, and they have stayed up and returned to service. I am happy to provide additional logs if it helps you with debugging the related bug.
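For anyone landing on this thread, the two scontrol operations discussed above can be sketched as below. This is a minimal, illustrative script: the node list `v[001-010]` is a placeholder for your own nodes, and the commands are assembled and printed rather than executed, so the sketch can be reviewed before running it on a live cluster.

```shell
#!/bin/sh
# Hypothetical node range -- substitute your own nodelist.
NODELIST="v[001-010]"

# Queue a one-shot reboot that returns the nodes to service afterwards
# (the workaround suggested in comment 5).
REBOOT_CMD="scontrol reboot asap nextstate=resume $NODELIST"

# Cancel a pending reboot. Note: per the thread, this is rejected once
# slurmctld considers the reboot already in progress.
CANCEL_CMD="scontrol cancel_reboot $NODELIST"

echo "To reboot and resume: $REBOOT_CMD"
echo "To cancel a pending reboot: $CANCEL_CMD"
```

Printing the commands first is a deliberate choice here: on a cluster stuck in a reboot loop, it is worth double-checking the nodelist before handing slurmctld another reboot request.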
Thank you,
Amit

**Marcin Stolarek** (comment 7):

Amit, please share your current configuration and the slurmctld/slurmd logs from the time when you noticed the issue.

cheers, Marcin

**Marcin Stolarek:**

Amit, could you please reply to comment 7? I'm not able to reproduce the issue, but logs from the time it happened and your configuration may be helpful.

cheers, Marcin

**Marcin Stolarek:**

Amit, did you have a chance to gather the logs requested in comment 7? In case of no reply, I'll close the case as timed out.

cheers, Marcin

**Marcin Stolarek:**

Amit, as mentioned last week, I cannot reproduce the case in the lab. There may be certain conditions I didn't find that cause the reported buggy behavior, and I'm not able to reproduce it without the logs from your side.

cheers, Marcin
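The ResumeTimeout change suggested in comment 5 is a `slurm.conf` setting. A fragment like the following shows where it goes; the 600-second value comes from the thread, and the comments reflect my understanding that this timeout also bounds how long slurmctld waits for a node rebooted via `scontrol reboot` before acting on it.

```
# slurm.conf fragment (illustrative; value from comment 5)
# Maximum time, in seconds, between a node resume/reboot request and the
# node becoming usable, after which slurmctld gives up on the node.
ResumeTimeout=600
```

After editing slurm.conf, the change must be propagated to the controller (e.g. with `scontrol reconfigure`) for it to take effect.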