Ticket 11338

Summary: Job stuck in pending, reason resources, but resources are available.
Product: Slurm    Reporter: Jeff Haferman <jlhaferm>
Component: Scheduling    Assignee: Chad Vizino <chad>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.02.5   
Hardware: Linux   
OS: Linux   
Site: NPS HPC
Attachments: slurm.conf
nodes.conf
cgroup.conf
sdiag output
squeue output
slurmctld log

Description Jeff Haferman 2021-04-09 12:35:10 MDT
One job is stuck in the queue with state PD, reason "resources". It is looking for just 1 core on 1 node, and those resources are available. It really looks like a "vanilla" request, nothing special. 

squeue for it:
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
41034874   primary codeRate jacob.ga PD       0:00      1 (Resources) 

scontrol show job 41034874:
JobId=41034874 JobName=codeRate.sbatch
   UserId=jacob.garrett(50846) GroupId=hamming(20) MCS_label=N/A
   Priority=103 Nice=-100 Account=hamusers QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2021-04-08T14:00:36 EligibleTime=2021-04-08T14:00:36
   AccrueTime=2021-04-08T14:00:36
   StartTime=2021-04-10T03:00:01 EndTime=2021-04-10T07:00:01 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-04-09T11:31:11
   Partition=primary AllocNode:Sid=localhost:8803
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=compute-3-27
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/jacob.garrett/codeRate.sbatch
   WorkDir=/home/jacob.garrett
   StdErr=/home/jacob.garrett/slurm-41034874.out
   StdIn=/dev/null
   StdOut=/home/jacob.garrett/slurm-41034874.out
   Power=
   MailUser=(null) MailType=NONE

sudo cat /home/jacob.garrett/codeRate.sbatch:
#!/bin/bash
#SBATCH --mem=0
#SBATCH  --time=4:00:00
source /etc/profile
module load app/matlab/R2019b
matlab -nodisplay < CodeRateVfalsePositives.m


Not much to it. Any ideas on what might be going on? The scheduled node is busy, but there are 4 other idle nodes available.
Comment 1 Jason Booth 2021-04-09 13:12:51 MDT
Would you please attach your slurm.conf, cgroup.conf, and slurmctld.log to the bug?

Please also run sdiag and squeue and attach that output as well.
Comment 2 Jeff Haferman 2021-04-09 13:55:33 MDT
Created attachment 18882 [details]
slurm.conf
Comment 3 Jeff Haferman 2021-04-09 13:55:53 MDT
Created attachment 18883 [details]
nodes.conf
Comment 4 Jeff Haferman 2021-04-09 13:56:08 MDT
Created attachment 18884 [details]
cgroup.conf
Comment 5 Jeff Haferman 2021-04-09 13:56:29 MDT
Created attachment 18885 [details]
sdiag output
Comment 6 Jeff Haferman 2021-04-09 13:56:51 MDT
Created attachment 18886 [details]
squeue output
Comment 7 Jeff Haferman 2021-04-09 13:57:17 MDT
Created attachment 18887 [details]
slurmctld log
Comment 8 Jeff Haferman 2021-04-09 14:04:23 MDT
It appears that the job actually went into a "R"unning state (see the squeue output I attached).

So there is no longer an issue with his job running; I'm just wondering why it was "stuck" for a few hours.
Comment 10 Chad Vizino 2021-04-09 16:59:15 MDT
Hi Jeff,

I'll look in more depth at the info you attached. For now we noticed:

>[2021-04-08T14:00:36.776] _slurm_rpc_submit_batch_job: JobId=41034874 InitPrio=1 usec=670
> ...
>[2021-04-08T14:18:02.649] _slurm_rpc_update_job: complete JobId=41034874 uid=50846 usec=444
>[2021-04-08T14:18:10.799] _slurm_rpc_update_job: complete JobId=41034874 uid=50846 usec=287
> ...
>[2021-04-09T10:43:01.528] Terminate signal (SIGINT or SIGTERM) received
>[2021-04-09T10:43:01.554] Saving all slurm state
>[2021-04-09T10:43:02.165] slurmctld version 20.02.5 started on cluster hamming
> ...
>[2021-04-09T10:43:08.877] Recovered JobId=41034874 Assoc=967
>[2021-04-09T10:47:09.413] _slurm_rpc_update_job: complete JobId=41034874 uid=0 usec=603
>[2021-04-09T11:14:10.519] _slurm_rpc_requeue: 41034874: Job is pending execution
So it looks like the job completed but then was requeued. Do you know why? There was also a slurmctld restart in there--do you know why that was performed?

Looks like from the squeue:

>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>41034874   primary codeRate jacob.ga  R       1:02      1 compute-3-17
Could you supply the slurmd log from compute-3-17 covering this whole period?

 -Chad
Comment 11 Jeff Haferman 2021-04-09 17:13:55 MDT
Hi Chad -
The restart of slurmctld was due to me updating a configuration on a node that had been out of the mix for a while. I started slurmd on that node, and then restarted slurmctld. Is that the correct way to add a new node? (it is called compute-2-39, so not related to the node this user ended up on).

Regarding the "requeue", I do see in the "scontrol show job" that I posted that it says "Requeue=1". Is this what you are referring to? I have no idea why it would have failed and then been requeued.

Finally, it looks like our node configuration needs to be cleaned up (I inherited this system). I see from "slurm.conf" that we have 
SlurmdLogFile=/var/log/slurm/slurm.log

BUT, on our nodes there are old logfiles at /var/log/slurm (i.e. /var/log/slurm is a file, not a directory), so I'm guessing that when slurmd starts it doesn't have a logfile to write to, and the slurmd logs are not being created.

I would say at this point, if you have an idea, great, if not, no worries since the job does seem to be running fine now. AND, I have some cleaning up to do.
Comment 12 Chad Vizino 2021-04-09 17:50:35 MDT
(In reply to Jeff Haferman from comment #11)
> The restart of slurmctld was due to me updating a configuration on a node
> that had been out of the mix for a while. I started slurmd on that node, and
> then restarted slurmctld. Is that the correct way to add a new node? (it is
> called compute-2-39, so not related to the node this user ended up on).
That should be ok.

> Regarding the "requeue", I do see in the "scontrol show job" that I posted
> that it says "Requeue=1". Is this what you are referring to? I have no idea
> why it would have failed and then been requeued.
Yes, Requeue=1 is one clue, but there is also this in the slurmctld.log (slurmctld processed an RPC to requeue the job):

>[2021-04-09T11:14:10.519] _slurm_rpc_requeue: 41034874: Job is pending execution
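(If this recurs, the source of a requeue can often be pinned down from accounting and the controller log. A sketch, to be run on the live cluster--the slurmctld log path below is site-specific and hypothetical:)

```shell
# List every record for the job, including requeued incarnations
# (-D/--duplicates keeps the pre-requeue entries that sacct hides by default)
sacct -D -j 41034874 --format=JobID,State,ExitCode,Start,End,NodeList

# Search the controller log for the requeue RPC; the uid in the matching
# _slurm_rpc_requeue/_slurm_rpc_update_job lines identifies who issued it
grep 41034874 /var/log/slurmctld.log | grep -i requeue
```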
About the logging:

> Finally, it looks like our node configuration needs to be cleaned up (I
> inherited this system). I see from "slurm.conf" that we have 
> SlurmdLogFile=/var/log/slurm/slurm.log
> 
> BUT, on our nodes it looks like there are old logfiles at
> /var/log/slurm (i.e. /var/log/slurm is a file, and not a directory, so I'm
> guessing when slurmd starts, it doesn't have a logfile to write to, so the
> slurmd logs are not being created).
SlurmdLogFile is the name of the log file that will be used, so I would expect /var/log/slurm on each node to contain the slurmd log info. You mentioned that the files on the nodes are old, so there may be something else preventing log lines from being written. I do see a failure in your slurmctld log file that refers to slurmctld at startup--maybe the nodes' slurmd procs also have a permission problem writing to /var/log?

>[2021-04-09T10:43:02.165] error: chdir(/var/log): Permission denied
>[2021-04-09T10:43:02.165] slurmctld version 20.02.5 started on cluster hamming
Finally:

> I would say at this point, if you have an idea, great, if not, no worries
> since the job does seem to be running fine now. AND, I have some cleaning up
> to do.
Without the slurmd log from the node the job had been running on, there's not a lot to go on here, and I'm not sure what else I can do to help.
Comment 13 Jeff Haferman 2021-04-09 18:16:43 MDT
Right, we're ok to close.

But I want to clarify what I'm saying about the logfile. slurm.conf has:
SlurmdLogFile=/var/log/slurm/slurm.log

So, that IMPLIES that /var/log/slurm is an existing DIRECTORY. But on our compute nodes /var/log/slurm is a FILE (it looks like it is actually a slurmd logfile from about a year ago). 

So, when slurmd starts, it wants to open up a FILE called
/var/log/slurm/slurm.log

but it cannot, because there is no DIRECTORY /var/log/slurm.

Like I said, I inherited a bit of a mess here that I need to cleanup.
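(For reference, the cleanup described above can be sketched as follows. It is demonstrated here on a scratch path, since touching /var/log requires root; on a real node LOGDIR would be /var/log/slurm, and slurmd would be restarted afterward.)

```shell
# Sketch of the fix, assuming /var/log/slurm is a stale regular file
# blocking the path that SlurmdLogFile=/var/log/slurm/slurm.log expects.
LOGDIR="$(mktemp -d)/slurm"        # stand-in for /var/log/slurm
touch "$LOGDIR"                    # simulate the year-old logfile at that path

if [ -f "$LOGDIR" ]; then
    mv "$LOGDIR" "${LOGDIR}.old"   # preserve the stale slurmd log
fi
mkdir -p "$LOGDIR"                 # the directory SlurmdLogFile expects
touch "$LOGDIR/slurm.log"          # slurmd can now create its logfile here
```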
Comment 14 Chad Vizino 2021-04-12 08:59:04 MDT
(In reply to Jeff Haferman from comment #13)
> Right, we're ok to close.
Ok. I'll do that.

> But I want to clarify what I'm saying about the logfile. slurm.conf has:
> SlurmdLogFile=/var/log/slurm/slurm.log
> 
> So, that IMPLIES that /var/log/slurm is an existing DIRECTORY. But on our
> compute nodes /var/log/slurm is a FILE (it looks like it is actually a
> slurmd logfile from about a year ago). 
> 
> So, when slurmd starts, it wants to open up a FILE called
> /var/log/slurm/slurm.log
> 
> but it cannot, because there is no DIRECTORY /var/log/slurm.
Ah ok. That makes sense--I saw "SlurmdLogFile=/var/log/slurm" in your supplied slurm.conf and was confused so your clarification helped.

> Like I said, I inherited a bit of a mess here that I need to cleanup.
No problem--sorry about that. If there's anything else we can help you with, let us know. I'll go ahead and close this ticket.
Comment 15 Chad Vizino 2021-04-12 08:59:20 MDT
Closing.