| Summary: | We sometimes see slurmstepd hang and keep the job in CG indefinitely | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | devops <richard> |
| Component: | slurmd | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | nate |
| Version: | 23.02.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Stability AI | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
devops@stability.ai
2023-08-18 02:56:40 MDT
We are using a SPANK plugin (NVIDIA Pyxis).

Nate Rini:
(In reply to devops@stability.ai from comment #0)
> #2 _read_slurmd_conf_lite (fd=0) at slurmstepd.c:315
> #3 0x00005636ff935e4b in _handle_spank_mode (argv=<optimized out>, argc=3)
>    at slurmstepd.c:445
> #4 _process_cmdline (argv=<optimized out>, argc=<optimized out>) at
>    slurmstepd.c:485
> #5 main (argc=<optimized out>, argv=<optimized out>) at slurmstepd.c:128

This backtrace suggests that slurmd is not sending the (cached) slurm.conf, which should not be delayed by an epilog or a prolog. How often does this happen? Please provide slurmd logs, at least at debug2, from when this happens. Please ask if you need instructions for setting up or collecting the logs.

devops@stability.ai:
It is not often, but it blocks the job in CG and I need to manually SSH into the nodes to reboot them. I saw this happen on one node out of 8: 7 nodes close fine, but the 8th hangs. Please let me know how to generate debug2 logs on the workers. I will probably need to enable it cluster-wide and hope it happens again soon.

Nate Rini:
(In reply to devops@stability.ai from comment #3)
> It is not often, but it blocks the job in CG and I need to manually SSH
> into the nodes to reboot them.
>
> I saw this happen on one node out of 8: 7 nodes close fine, but the 8th
> hangs.
>
> Please let me know how to generate debug2 logs on the workers. I will
> probably need to enable it cluster-wide and hope it happens again soon.

Set the following in slurm.conf on the node:
> SlurmdDebug=debug2
> SlurmdLogFile=/path/to/place/logs

then restart slurmd and start a test job. If slurm.conf is on a shared filesystem, make a new slurm.conf locally:
> include /path/to/slurm.conf
> SlurmdDebug=debug2
> SlurmdLogFile=/path/to/place/logs

and then start slurmd manually using "-f". Here is an example:
> slurmd -D -f /path/to/local/slurm.conf

Once testing is complete, send SIGINT (Ctrl-C in the terminal) and start slurmd normally under systemd.

devops@stability.ai:
Hm, the problem is I cannot reproduce the issue on demand. I need a way to catch it on any node of the cluster whenever it happens next; that could take days or weeks.

Nate Rini:
(In reply to devops@stability.ai from comment #5)
> Hm, the problem is I cannot reproduce the issue on demand. I need a way to
> catch it on any node of the cluster whenever it happens next.

In comment #0 there is a backtrace. Was a core taken, and do you still have it?

> That could take days or weeks.

Reducing ticket severity due to the rarity of this issue. We can always increase it later if it starts happening more often.

devops@stability.ai:
I do not have it anymore, but when it happens again I will try to move the node to FAIL so we can inspect it properly.

Nate Rini:
(In reply to devops@stability.ai from comment #7)
> I do not have it anymore, but when it happens again I will try to move the
> node to FAIL so we can inspect it properly.

I'm going to mark this ticket as timed out while we wait for the issue to happen again. Once it does, please just reply to the ticket and it will re-open automatically.
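The local-override workflow discussed in the thread can be sketched as a shell fragment. All paths here are placeholders, and the gdb/gcore commands for capturing the state of a hung slurmstepd before rebooting are an added suggestion, not something prescribed in the ticket:

```shell
# Hypothetical local override config, e.g. /etc/slurm/slurm-local.conf,
# pulling in the shared config and raising slurmd verbosity:
#   include /path/to/shared/slurm.conf
#   SlurmdDebug=debug2
#   SlurmdLogFile=/var/log/slurm/slurmd-debug.log

# Run slurmd in the foreground against the override config
# (Ctrl-C / SIGINT to stop, then restart normally under systemd):
slurmd -D -f /etc/slurm/slurm-local.conf

# If a slurmstepd hangs in CG again, capture its state before rebooting:
gdb -p "$(pgrep -of slurmstepd)" -batch -ex 'thread apply all bt'  # live backtrace
gcore "$(pgrep -of slurmstepd)"                                    # write a core file for later analysis
```

Keeping either the backtrace or the core file around would let the earlier comment's question ("was a core taken?") be answered the next time the hang occurs.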