I am seeing a few nodes with a status of ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP even though slurmd is running fine on the node. From the node, I can even run sinfo, squeue, etc. successfully. If I restart slurmd, nothing changes and I see: Last slurmctld msg time = NONE. Eventually the nodes time out and the ResumeFailProgram is run, which deletes them so that they are recreated next time. There are no firewalls or other blockers between the nodes and the head node. Is there a way to force a ping from slurmctld to the node?
Check the uptime of the not responding compute node. slurmd reports this in the log at debug3, and I think that comes from the same place as the program `uptime`. I've seen a similar problem before on Azure where an apparently freshly booted instance has a reported uptime of over 30 minutes. Due to some race conditions around rebooting nodes, slurmctld will ignore a node registration if the reported uptime puts the boot before ResumeProgram was called. A way around this, though it would be good to find out why this uptime happens, is to pass `-b` to the slurmd daemon. This forces it to ignore the reported uptime, instead reporting the slurmd start time as the boot time. If the reported uptime/boot time is not before ResumeProgram was called, we'll have to look elsewhere. I'll need to see the slurmctld.log and slurmd.log (from the not responding node). Thanks
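As a rough illustration of the behaviour described above, the sketch below models which boot time a node would report with and without `-b`. It is not Slurm's code: the function names, the `ignore_uptime` parameter, and the sample timestamps are hypothetical, and reading the first field of /proc/uptime is an assumption about where the uptime comes from on a Linux node.

```python
from datetime import datetime, timedelta


def read_uptime_seconds(path="/proc/uptime"):
    """Return seconds since boot from the first field of /proc/uptime (Linux)."""
    with open(path) as f:
        return float(f.read().split()[0])


def reported_boot_time(now, uptime_seconds, slurmd_start, ignore_uptime=False):
    """Model of the boot time a node reports (hypothetical helper).

    ignore_uptime=False: boot time is derived from the reported uptime.
    ignore_uptime=True:  mimics the described effect of `slurmd -b`, where
                         the slurmd start time stands in for the boot time.
    """
    if ignore_uptime:
        return slurmd_start
    return now - timedelta(seconds=uptime_seconds)


if __name__ == "__main__":
    # Illustrative values only: a "fresh" instance claiming over 30 minutes of uptime.
    now = datetime(2022, 8, 5, 10, 40, 0)
    slurmd_start = datetime(2022, 8, 5, 10, 38, 0)
    print("from uptime:", reported_boot_time(now, 1900.0, slurmd_start))
    print("with -b:    ", reported_boot_time(now, 1900.0, slurmd_start, ignore_uptime=True))
```

With an inflated uptime, the first value lands well before the instance was actually created, while the second is simply the slurmd start time.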
(In reply to comment #2 from Broderick Gardner, sent Friday, August 5, 2022) Hmm. I'm not sure how to compare the two. I can look at the boot time/uptime on the node with `uptime`, and I can see what it is reporting when I run `scontrol show slurmd`. Is that what you mean? I did try the following: `echo 'SLURMD_OPTIONS="-b"' > /etc/default/slurmd` followed by `systemctl restart slurmd`. That does seem to be working on the two nodes that were in that state and that I was able to run it on. I'll continue to use that method for the time being, if it is the workaround/fix. Brian Andrus - HPC Systems
The times to compare are the time the ResumeProgram was run and the node's boot time. The ResumeProgram is run when a job needing the node is scheduled. You can find that time in slurmctld.log or with `scontrol show job`.
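A minimal sketch of that comparison, assuming both times have been noted by hand as ISO 8601 strings in the same timezone. The function names and the sample timestamps are hypothetical; nothing here parses real Slurm logs or command output.

```python
from datetime import datetime


def boot_precedes_resume(boot_iso, resume_iso):
    """True if the reported boot time is earlier than the ResumeProgram time."""
    # Both strings must use the same timezone convention (both naive or both aware).
    return datetime.fromisoformat(boot_iso) < datetime.fromisoformat(resume_iso)


def describe(boot_iso, resume_iso):
    """Summarise the gap between the reported boot time and the ResumeProgram time."""
    boot = datetime.fromisoformat(boot_iso)
    resume = datetime.fromisoformat(resume_iso)
    delta = (boot - resume).total_seconds()
    if delta < 0:
        return f"boot is {-delta:.0f}s before ResumeProgram: registration is likely being ignored"
    return f"boot is {delta:.0f}s after ResumeProgram: the cause is probably elsewhere"


if __name__ == "__main__":
    # Illustrative values only.
    print(describe("2022-08-05T10:08:20", "2022-08-05T10:35:00"))
```

If the boot time comes out earlier than the ResumeProgram time, that matches the situation in which slurmctld is said to ignore the registration.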
Would you like to pursue finding the underlying reason for the incorrect uptime? If so, then we will need some more information to help find the root cause. Otherwise I can close the ticket. Another avenue to try is to open a ticket with Microsoft Azure support. Without knowing more details from your environment (e.g. slurmctld.log, slurmd.log) it is difficult to determine if there is a bug in Slurm or one with Azure. Please let me know what you would like to do. Thanks!
(In reply to comment #6 from Skyler Malinowski, sent Thursday, October 13, 2022) We can close this, I guess. I haven't seen it since. My best guess is that there were NTP issues or something similar. Of course MS doesn't say anything about anything, so I am likely stuck with no new info. Close it as a 'cold case'. Brian Andrus - HPC Systems
Thank you for the update. Closing ticket. If you find out anything, please comment and re-open this ticket; we can continue where we left off. We are interested in what is causing this too! All the best, Skyler