Ticket 15400 - Many nodes marked State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING
Summary: Many nodes marked State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld (show other tickets)
Version: 22.05.5
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Skyler Malinowski
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-11-10 18:09 MST by Brian Andrus
Modified: 2022-12-14 09:24 MST (History)
1 user (show)

See Also:
Site: Lam
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Brian Andrus 2022-11-10 18:09:15 MST
I am experiencing something similar to ticket #14690, but the nodes are powered down, so Slurm will never try to power them up.

I did recently add the "-b" to slurmd on the nodes, which may be a factor.

Generally, many (almost 75%) of our nodes stay in the idle~ state and slurmctld does not try to power them up.
When I look, they show:
  State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING 
I am guessing it is the "NOT_RESPONDING" that is the issue.
If I manually start them, the state gets reset and things are ok. 

Is there a way to update the state to remove NOT_RESPONDING without booting a node, so Slurm will just try to resume them when needed?
Comment 1 Brian Andrus 2022-11-10 18:12:44 MST
I also just noticed that a whole set of nodes that we have not even turned up yet are in a 'NOT_RESPONDING' state, so they will never be able to be used until they are turned on at least once.
Comment 2 Skyler Malinowski 2022-11-11 13:08:49 MST
(In reply to Brian Andrus from comment #0)
> Is there a way to update the state to remove NOT_RESPONDING without booting
> a node so slurm will just try to resume them when needed?

Yes, NOT_RESPONDING is preventing the node from being scheduled to.

The good news is that you can run `scontrol update nodename=$HOSTLIST state=power_down_force` to re-suspend the nodes, which clears NOT_RESPONDING. You will have to wait SuspendTimeout seconds before the nodes can be allocated work.
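For reference, a minimal sketch of that workaround as a reviewable dry run. The hostlist `cloud[001-010]` is a placeholder, not from this ticket; substitute your site's affected node range.

```shell
# Placeholder hostlist; replace with the actual affected nodes.
HOSTLIST="cloud[001-010]"

# Build the command and print it first so it can be reviewed
# before being run against slurmctld.
CMD="scontrol update nodename=${HOSTLIST} state=power_down_force"
echo "$CMD"

# Uncomment to apply. Remember: the nodes cannot be allocated work
# until SuspendTimeout seconds have elapsed.
# $CMD
```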


The bad news is that I need more information to trace the root cause of NOT_RESPONDING being set. Can you provide a slurmctld.log capturing the events?

Are you using Azure Cycle Cloud?
Comment 3 Brian Andrus 2022-11-11 14:04:37 MST
Sweet! Looks like that did the trick. Thanks!

Comment 4 Jason Booth 2022-11-11 14:49:21 MST
I am lowering the severity given that a workaround is present. We will look into how nodes are getting into POWERED_DOWN+NOT_RESPONDING.
Comment 5 Skyler Malinowski 2022-12-14 08:41:24 MST
Echoing comment #2
> The bad news is that I need more information to trace the root cause of this
> NOT_RESPONDING being present. Can you provide a slurmctld.log capturing the
> events?

Moreover, it is unclear whether your ResumeProgram, SuspendProgram, or another script/actor is putting node state into this situation (e.g. scontrol update nodename=$NODENAME state=noresp), or what series of events is causing it. Azure's boot-time bug (see bug #14690) could potentially play a role, but I would expect 'slurmd -b' to alleviate that.

Does your cluster still experience the issue described in comment #0? Or was it a one-time issue?
Comment 6 Brian Andrus 2022-12-14 08:59:55 MST
It seems to have been a one-time issue and has not returned.
We do not set the state to 'noresp' in any scripts.
I do set nodes to down when they are unable to join the domain; each such node is then cleaned (all resources deleted) so it will be recreated the next time it is assigned. That has worked without issue for about two years.

Comment 7 Skyler Malinowski 2022-12-14 09:24:39 MST
I'm glad it has not returned, although I am still unsure how your cluster got into that state initially. Regardless, this looks to be resolved. Closing ticket.

Best,
Skyler