| Summary: | Many nodes marked State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Brian Andrus <brian.andrus> |
| Component: | slurmctld | Assignee: | Skyler Malinowski <skyler> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | skyler |
| Version: | 22.05.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Lam | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Brian Andrus
2022-11-10 18:09:15 MST
I also just noticed that a whole set of nodes that we have not even turned up yet are in a 'NOT_RESPONDING' state, so they will never be able to be used until they are turned on at least once.

(In reply to Brian Andrus from comment #0)
> Is there a way to update the state to remove NOT_RESPONDING without booting
> a node so slurm will just try to resume them when needed?

Yes, NOT_RESPONDING is preventing the node from being scheduled.

The good news is that you can do `scontrol update nodename=$HOSTLIST state=power_down_force` to re-suspend the node and NOT_RESPONDING will be cleared. You will have to wait for SuspendTimeout seconds before the node can be allocated work.

The bad news is that I need more information to trace the root cause of this NOT_RESPONDING being present. Can you provide a slurmctld.log capturing the events? Are you using Azure Cycle Cloud?

Sweet! Looks like that did the trick. Thanks!

Brian Andrus - HPC Systems
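As an illustration of the workaround discussed above, the following sketch filters one-line node records for nodes whose State carries both POWERED_DOWN and NOT_RESPONDING, then prints the `scontrol update ... state=power_down_force` command from this thread for each one. The function names `stuck_nodes` and `print_resuspend_cmds` are hypothetical helpers, not part of Slurm. The sketch assumes the input is the output of `scontrol -o show node`, with `NodeName=` and `State=` fields on each line. It only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical helper: read one-line node records on stdin (assumed to be
# the output of `scontrol -o show node`) and print the names of nodes
# whose State includes both POWERED_DOWN and NOT_RESPONDING.
stuck_nodes() {
    awk '{
        name = ""; state = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^NodeName=/) name = substr($i, 10)
            if ($i ~ /^State=/)    state = substr($i, 7)
        }
        if (name != "" && state ~ /POWERED_DOWN/ && state ~ /NOT_RESPONDING/)
            print name
    }'
}

# Hypothetical helper: turn node names on stdin into the workaround
# command from this ticket. Commands are printed, not executed.
print_resuspend_cmds() {
    while read -r node; do
        if [ -n "$node" ]; then
            echo "scontrol update nodename=$node state=power_down_force"
        fi
    done
}

# Example (on a host with the Slurm client tools installed):
#   scontrol -o show node | stuck_nodes | print_resuspend_cmds
```

As noted in the thread, a re-suspended node still has to wait for SuspendTimeout seconds before it can be allocated work.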
I am lowering the severity given that a workaround is present. We will look into how nodes are getting into POWERED_DOWN+NOT_RESPONDING.

Echoing comment #2
> The bad news is that I need more information to trace the root cause of this
> NOT_RESPONDING being present.
> Can you provide a slurmctld.log capturing the events?

Moreover, it is unclear if your ResumeProgram, SuspendProgram, or another script/actor is modifying node state into this situation (e.g. scontrol update nodename=$NODENAME state=noresp) or what series of events is causing this. Azure's boot-time bug (see bug #14690) could potentially play a role in this, but I would expect 'slurmd -b' to alleviate this.

Does your cluster still experience the issue described in comment #0? Or was it a one-time issue?

It seems to have been a one-time issue and has not returned.
We do not set the state to 'noresp' in any scripts. I do set nodes that were unable to join the domain to down; each is then cleaned (all resources deleted) so it will be recreated the next time it is assigned. That has had no issue for about 2 years.

Brian Andrus - HPC Systems
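The question about scripts setting `state=noresp` can also be checked mechanically. The sketch below searches a set of script files for lines that look like an `scontrol` call setting that state. The helper name `find_noresp_calls` and the example paths are hypothetical, and the pattern is only an assumption about how such a call would be written; it will miss calls that build the command from variables.

```shell
#!/bin/sh
# Hypothetical helper: print file:line:text for every line in the given
# scripts that appears to call `scontrol ... state=noresp`
# (case-insensitive). Prints nothing and returns 0 if there are no matches.
find_noresp_calls() {
    # /dev/null forces grep to prefix matches with the file name even when
    # only one script is given.
    grep -inE 'scontrol.*state=noresp' "$@" /dev/null || true
}

# Example; the paths are placeholders, not taken from the ticket:
#   find_noresp_calls /path/to/resume_program.sh /path/to/suspend_program.sh
```

An empty result only shows that no literal match exists in the files searched; a slurmctld.log covering the event would still be needed to trace the root cause.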
I'm glad it has not returned, although I am still confused how your cluster got into that state initially. Regardless, this looks to be resolved. Closing ticket.

Best,
Skyler |