| Summary: | Job stuck in pending with reason "Resources", but resources are available. | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jeff Haferman <jlhaferm> |
| Component: | Scheduling | Assignee: | Chad Vizino <chad> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.02.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NPS HPC | ||
| Attachments: | slurm.conf, nodes.conf, cgroup.conf, sdiag output, squeue output, slurmctld log | ||
Description (Jeff Haferman, 2021-04-09 12:35:10 MDT)
Would you please attach your slurm.conf, cgroup.conf, and your slurmctld.log to the bug? Please also run sdiag and squeue and attach that output as well.

Created attachment 18882 [details]: slurm.conf
Created attachment 18883 [details]: nodes.conf
Created attachment 18884 [details]: cgroup.conf
Created attachment 18885 [details]: sdiag output
Created attachment 18886 [details]: squeue output
Created attachment 18887 [details]: slurmctld log
It appears that the job actually went into a Running ("R") state (see the squeue output I attached). So there is no longer an issue with his job running; I'm just wondering why it was "stuck" for a few hours.

Chad Vizino
Hi Jeff, I'll look more in depth at the info you attached. For now we noticed:

>[2021-04-08T14:00:36.776] _slurm_rpc_submit_batch_job: JobId=41034874 InitPrio=1 usec=670
> ...
>[2021-04-08T14:18:02.649] _slurm_rpc_update_job: complete JobId=41034874 uid=50846 usec=444
>[2021-04-08T14:18:10.799] _slurm_rpc_update_job: complete JobId=41034874 uid=50846 usec=287
> ...
>[2021-04-09T10:43:01.528] Terminate signal (SIGINT or SIGTERM) received
>[2021-04-09T10:43:01.554] Saving all slurm state
>[2021-04-09T10:43:02.165] slurmctld version 20.02.5 started on cluster hamming
> ...
>[2021-04-09T10:43:08.877] Recovered JobId=41034874 Assoc=967
>[2021-04-09T10:47:09.413] _slurm_rpc_update_job: complete JobId=41034874 uid=0 usec=603
>[2021-04-09T11:14:10.519] _slurm_rpc_requeue: 41034874: Job is pending execution

So it looks like the job completed but then was requeued. Do you know why? There was also a slurmctld restart in there; do you know why that was performed? From the squeue output:

> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>41034874 primary codeRate jacob.ga R 1:02 1 compute-3-17

Could you supply the slurmd log from compute-3-17 covering this whole period?

-Chad

Jeff Haferman
Hi Chad - The restart of slurmctld was due to me updating the configuration of a node that had been out of the mix for a while. I started slurmd on that node and then restarted slurmctld. Is that the correct way to add a new node? (The node is called compute-2-39, so it is not related to the node this user's job ended up on.)

Regarding the "requeue": I do see in the "scontrol show job" output I posted that it says "Requeue=1". Is this what you are referring to? I have no idea why the job would have failed and then been requeued.
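As background on the node-addition question above: in Slurm of this vintage, adding a node means describing it in slurm.conf on every host and then restarting slurmctld and the new node's slurmd, which matches what was done here. The fragment below is a hypothetical sketch; the CPU and memory values are illustrative, not taken from this site's actual configuration.

```
# Hypothetical slurm.conf fragment (hardware values are illustrative):
NodeName=compute-2-39 CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=primary Nodes=compute-2-39 Default=YES State=UP

# Then restart the daemons so both pick up the new node definition:
#   (on compute-2-39)        systemctl start slurmd
#   (on the controller host) systemctl restart slurmctld
```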
Finally, it looks like our node configuration needs to be cleaned up (I inherited this system). I see from slurm.conf that we have:

SlurmdLogFile=/var/log/slurm/slurm.log

But on our nodes there are old logfiles at /var/log/slurm (i.e. /var/log/slurm is a file, not a directory, so I'm guessing that when slurmd starts it has no logfile to write to, and the slurmd logs are not being created).

I would say at this point, if you have an idea, great; if not, no worries, since the job does seem to be running fine now. And I have some cleaning up to do.

Chad Vizino
(In reply to Jeff Haferman from comment #11)
> The restart of slurmctld was due to me updating a configuration on a node
> that had been out of the mix for a while. I started slurmd on that node, and
> then restarted slurmctld. Is that the correct way to add a new node? (it is
> called compute-2-39, so not related to the node this user ended up on).
That should be ok.

> Regarding the "requeue", I do see in the "scontrol show job" that I posted
> that it says "Requeue=1". Is this what you are referring to? I have no idea
> why it would have failed and then been requeued.
Yes, Requeue=1 is one clue, but the slurmctld.log also shows that slurmctld processed an RPC to requeue the job:

>[2021-04-09T11:14:10.519] _slurm_rpc_requeue: 41034874: Job is pending execution

About the logging:

> Finally, it looks like our node configuration needs to be cleaned up (I
> inherited this system). I see from slurm.conf that we have
> SlurmdLogFile=/var/log/slurm/slurm.log
>
> BUT, on our nodes it looks like there are old logfiles at
> /var/log/slurm (i.e. /var/log/slurm is a file, and not a directory, so I'm
> guessing when slurmd starts, it doesn't have a logfile to write to, so the
> slurmd logs are not being created).
SlurmdLogFile is the name of the log file that will be used, so I would expect /var/log/slurm on each node to contain the slurmd log info.
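For reference on the Requeue=1 point above: whether a batch job may be requeued is controlled cluster-wide by the JobRequeue parameter in slurm.conf (default 1, i.e. requeueable) and can be overridden per job at submission time. A minimal sketch, not this site's actual settings:

```
# slurm.conf: JobRequeue=1 is the default and marks batch jobs requeueable.
JobRequeue=0          # make jobs non-requeueable by default

# Per-job override at submission time:
#   sbatch --requeue    job.sh   # allow this job to be requeued
#   sbatch --no-requeue job.sh   # forbid requeueing this job
```

This only governs whether a requeue is permitted; it does not explain what issued the requeue RPC seen in the log, which would require the slurmd log from the compute node.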
You mentioned that the files on the nodes are old, so there may be something else preventing log lines from being written. I do see a failure in your slurmctld log file at startup; maybe the nodes' slurmd processes also have a permission problem writing to /var/log?

>[2021-04-09T10:43:02.165] error: chdir(/var/log): Permission denied
>[2021-04-09T10:43:02.165] slurmctld version 20.02.5 started on cluster hamming

Finally:

> I would say at this point, if you have an idea, great, if not, no worries
> since the job does seem to be running fine now. AND, I have some cleaning up
> to do.
Without the slurmd log from the node the job was running on, there's not a lot to go on here, and I'm not sure what else I can do to help.

Jeff Haferman
Right, we're ok to close. But I want to clarify what I'm saying about the logfile. slurm.conf has:

SlurmdLogFile=/var/log/slurm/slurm.log

So that implies /var/log/slurm is an existing directory. But on our compute nodes /var/log/slurm is a file (it looks like it is actually a slurmd logfile from about a year ago). So when slurmd starts, it wants to open a file called /var/log/slurm/slurm.log, but it cannot, because there is no directory /var/log/slurm. Like I said, I inherited a bit of a mess here that I need to clean up.

Chad Vizino
(In reply to Jeff Haferman from comment #13)
> Right, we're ok to close.
Ok. I'll do that.

> But I want to clarify what I'm saying about the logfile. slurm.conf has:
> SlurmdLogFile=/var/log/slurm/slurm.log
>
> So, that IMPLIES that /var/log/slurm is an existing DIRECTORY. But on our
> compute nodes /var/log/slurm is a FILE (it looks like it is actually a
> slurmd logfile from about a year ago).
>
> So, when slurmd starts, it wants to open up a FILE called
> /var/log/slurm/slurm.log
>
> but it cannot, because there is no DIRECTORY /var/log/slurm.
Ah, ok. That makes sense. I saw "SlurmdLogFile=/var/log/slurm" in your supplied slurm.conf and was confused, so your clarification helped.
> Like I said, I inherited a bit of a mess here that I need to clean up.
No problem, and sorry about that. If there's anything else we can help you with, let us know. I'll go ahead and close this ticket.

Closing.
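A closing note on the logfile cleanup described in this ticket: a stale file sitting where SlurmdLogFile's parent directory should be can be fixed with a short script. The sketch below is a hedged demonstration run against a scratch path so it is safe to execute anywhere; on a real node the path would be /var/log/slurm and slurmd would need a restart afterwards.

```shell
#!/bin/sh
# Demonstrate fixing a stale file that occupies the path where the
# slurmd log *directory* should be. Uses a scratch location so this
# sketch is safe to run; substitute /var/log/slurm on a real node.
set -eu

root=$(mktemp -d)
logdir="$root/var/log/slurm"          # stand-in for /var/log/slurm
mkdir -p "$root/var/log"
echo "old slurmd output" > "$logdir"  # simulate the stale year-old logfile

if [ -f "$logdir" ]; then
    # Preserve the old file out of the way, then create the
    # directory that SlurmdLogFile=/var/log/slurm/slurm.log expects.
    mv "$logdir" "$logdir.old"
    mkdir -p "$logdir"
fi

# A restarted slurmd would now be able to create its log file here:
touch "$logdir/slurm.log"
```

After the real-path equivalent of this fix, restarting slurmd on each node should let the daemons create /var/log/slurm/slurm.log as configured.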