14690 – nodes show NOT_RESPONDING when slurmd is running


Status:	RESOLVED INFOGIVEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	slurmd
Version:	22.05.2
Hardware:	Linux Linux

Severity:	3 - Medium Impact
Assignee:	Skyler Malinowski

Reported:	2022-08-05 11:09 MDT by Brian Andrus
Modified:	2022-10-17 07:48 MDT
CC List:	0 users

Site:	Lam

Description Brian Andrus 2022-08-05 11:09:17 MDT

I am seeing a few nodes where I get a status of ALLOCATED+CLOUD+NOT_RESPONDING+POWERING_UP but slurmd is running fine on the node.
From the node, I can even run sinfo, squeue, etc. successfully.
If I restart slurmd, nothing changes and I see:
Last slurmctld msg time  = NONE

Eventually the nodes time out and ResumeFailProgram runs, which deletes them so that they are recreated next time.

There are no firewalls or other blockers between the nodes and the head node.
Is there a way to force a ping from slurmctld to the node?

Comment 2 Broderick Gardner 2022-08-05 11:35:15 MDT

Check the uptime of the non-responding compute node. slurmd reports this in its log at debug3, and I think it comes from the same source as the `uptime` program.

I've seen a similar problem before on Azure where an apparently freshly booted instance has a reported uptime of over 30 minutes. Due to some race conditions around rebooting nodes, slurmctld will ignore a node registration if the reported uptime puts the boot before ResumeProgram was called. A way around this, though it would be good to find out why this uptime happens, is to pass `-b` to the slurmd daemon. This forces it to ignore the reported uptime, instead reporting the slurmd start time as the boot time. 

If the reported uptime/boot time is not before ResumeProgram was called, we'll have to look elsewhere. I'll need to see the slurmctld.log and slurmd.log (from the not responding node).
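
As a rough illustration of the value being discussed, the boot time implied by a node's uptime can be computed by subtracting the uptime from the current time. The sketch below assumes a Linux node where /proc/uptime is readable; the function name boot_epoch is made up for this example and is not part of Slurm, and slurmd may obtain its value differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): derive the boot time, in seconds
# since the epoch, from the current time and the uptime in whole seconds.
boot_epoch() {
    now=$1
    uptime_seconds=$2
    echo $((now - uptime_seconds))
}

# On Linux, the first field of /proc/uptime is the uptime in seconds with a
# fractional part; the fraction is stripped before doing integer arithmetic.
if [ -r /proc/uptime ]; then
    read -r up rest < /proc/uptime
    boot_epoch "$(date +%s)" "${up%.*}"
fi
```

The printed value can be turned into a readable timestamp with GNU date, for example `date -d @<value>`.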

Thanks

Comment 3 Brian Andrus 2022-08-05 11:47:13 MDT

Hmm. Not sure how to compare the two. I can look at the boot time/uptime on the node with 'uptime', and I can see what it is reporting when I do 'scontrol show slurmd'.
Is that what you mean?

I did try the following:
echo 'SLURMD_OPTIONS="-b"' > /etc/default/slurmd
systemctl restart slurmd

That does seem to be working on the two nodes that were in that state and on which I was able to run it.
I'll continue to use that method for the time being, if it is the workaround/fix.
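
Note that the `echo ... >` form above overwrites anything already in the file. A more cautious variant might append the setting only when none exists. This is a sketch, assuming the slurmd service reads its options from the environment file used above (the path can differ by distribution or packaging); the function name set_slurmd_boot_flag is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: make sure the given environment file sets
# SLURMD_OPTIONS="-b" without discarding other settings already in it.
set_slurmd_boot_flag() {
    env_file=$1
    touch "$env_file"
    if grep -q '^SLURMD_OPTIONS=' "$env_file"; then
        # A value already exists; leave it for the administrator to merge.
        echo "SLURMD_OPTIONS already set in $env_file" >&2
        return 1
    fi
    echo 'SLURMD_OPTIONS="-b"' >> "$env_file"
}

# Example (run as root on the affected node):
#   set_slurmd_boot_flag /etc/default/slurmd && systemctl restart slurmd
```

The function deliberately refuses to touch an existing SLURMD_OPTIONS line, since merging `-b` into other options is safer done by hand.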

Comment 4 Broderick Gardner 2022-08-05 14:24:23 MDT

The times to compare are when the ResumeProgram was run and the boot time. The ResumeProgram is run when a job needing the node is scheduled. You could find that time in the slurmctld.log or just with scontrol show job.
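
The comparison described above can be sketched as a small check. It assumes both times have already been converted to seconds since the epoch (for example with GNU `date -d "<timestamp>" +%s`); the function name and the sample values are hypothetical and do not come from this ticket's logs.

```shell
#!/bin/sh
# Hypothetical check: succeed when the node's boot time falls before the
# time ResumeProgram was run. Both arguments are seconds since the epoch.
booted_before_resume() {
    boot_ts=$1
    resume_ts=$2
    [ "$boot_ts" -lt "$resume_ts" ]
}

# Illustrative values only: a boot time 30 minutes earlier than the resume.
if booted_before_resume 1659718800 1659720600; then
    echo "boot time precedes ResumeProgram; the registration may be ignored"
else
    echo "boot time is after ResumeProgram; look elsewhere"
fi
```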

Comment 6 Skyler Malinowski 2022-10-13 15:14:20 MDT

Would you like to pursue finding the underlying reason for the incorrect uptime? If so, then we will need some more information to help find the root cause. Otherwise I can close the ticket.

Another avenue to try is to open a ticket with Microsoft Azure support. Without knowing more details from your environment (e.g. slurmctld.log, slurmd.log) it is difficult to determine if there is a bug in Slurm or one with Azure.

Please let me know what you would like to do. Thanks!

Comment 7 Brian Andrus 2022-10-13 15:29:58 MDT

We can close this, I guess. I haven't seen it since. My best guess is that there were NTP issues or something. Of course MS doesn't say anything about anything, so I am likely stuck with no new info.
Close it as a 'cold case'.

Comment 8 Skyler Malinowski 2022-10-17 07:48:11 MDT

Thank you for the update.

Closing ticket. If you find out anything, please comment and re-open this ticket; we can continue where we left off. We are interested in what is causing this too!

All the best,
Skyler