Ticket 3368

Summary: Jobs staying in PD with reason BeginTime
Product: Slurm Reporter: Susan Chacko <susanc>
Component: Scheduling Assignee: Alejandro Sanchez <alex>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 16.05.6   
Hardware: Linux   
OS: Linux   
Site: NIH Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 16.05.9 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: ctld.log
slurmd.log.cn2040
slurm.conf
job script for 30122355_173

Description Susan Chacko 2016-12-28 07:36:46 MST
Hi,


We have a bunch of jobs in PD state with reason: BeginTime. An example is 
% scontrol show job 30122355_173
JobId=30157573 ArrayJobId=30122355 ArrayTaskId=173 JobName=swarm
   UserId=sampsonjn(33882) GroupId=sampsonjn(33882) MCS_label=N/A
   Priority=24943 Nice=0 Account=sampsonjn QOS=global
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=16:00:00 TimeMin=N/A
   SubmitTime=2016-12-26T20:24:47 EligibleTime=2016-12-26T20:26:48
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=norm AllocNode:Sid=biowulf:1524
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=cn2040
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=1.50G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=1.50G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/spin1/swarm/sampsonjn/tmpv7dsB1Qn/swarm.batch
   WorkDir=/home/sampsonjn
   StdErr=/home/sampsonjn/swarm_30122355_173.e
   StdIn=/dev/null
   StdOut=/home/sampsonjn/swarm_30122355_173.o
   Power=

The jobs apparently went into this state due to a filesystem problem on the allocated node. From ctld.log:

[2016-12-26T15:07:55.703] sched: Allocate JobID=30122355_173(30157573) NodeList=cn2040 #CPUs=2 Partition=norm
[2016-12-26T20:24:47.205] requeue job 30157573 due to failure of node cn2040
[2016-12-26T20:24:47.207] Requeuing JobID=30122355_173(30157573) State=0x0 NodeCnt=0

We've taken the node offline. The question is: how do we get these jobs out of this pending state? 'scontrol release <jobid>' does not have any effect, and there are plenty of resources available on the cluster right now.

All suggestions appreciated, thanks,
Susan.
Comment 1 Tim Wickberg 2016-12-28 07:48:11 MST
They should be eligible to restart shortly after the requeue - the job's StartTime is intentionally moved forward by roughly a minute to make sure all parts of the previous run have cleaned up, and so that the different attempts to run the job can be distinguished in the accounting records.

The 'BeginTime' reason is meant to indicate that the StartTime is what prevents the job from launching, but that doesn't seem to be the case here. I'm going to see what I can do to reproduce this here.

In the meantime, can you try manually setting the StartTime to a minute or so in the future and see if the job launches then? If it doesn't, can you attach the slurmctld logs from that point through the next few minutes?

- Tim
Comment 2 Susan Chacko 2016-12-28 08:17:48 MST
Hi Tim,

I set the starttime to 2 mins in the future: 

% scontrol update jobid=30122355_173 starttime=2016-12-28T10:00:00

and verified....

ctld.log for this job -> 
[2016-12-28T09:59:03.778] _slurm_rpc_update_job complete JobId=30122355_173 uid=0 usec=25010

% scontrol show job 30122355_173
JobId=30157573 ArrayJobId=30122355 ArrayTaskId=173 JobName=swarm
   UserId=sampsonjn(33882) GroupId=sampsonjn(33882) MCS_label=N/A
   Priority=22687 Nice=0 Account=sampsonjn QOS=global
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=16:00:00 TimeMin=N/A
   SubmitTime=2016-12-26T20:24:47 EligibleTime=2016-12-28T10:00:00
   StartTime=2016-12-28T10:00:00 EndTime=2016-12-29T02:00:00 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=norm AllocNode:Sid=biowulf:1524
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=cn2040
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=1.50G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=1.50G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/spin1/swarm/sampsonjn/tmpv7dsB1Qn/swarm.batch
   WorkDir=/home/sampsonjn
   Comment=***job_modified***
   StdErr=/home/sampsonjn/swarm_30122355_173.e
   StdIn=/dev/null
   StdOut=/home/sampsonjn/swarm_30122355_173.o
   Power=

10 minutes later, the start time was reset to 'Unknown', but the reason is still BeginTime: 

% scontrol show job 30122355_173
JobId=30157573 ArrayJobId=30122355 ArrayTaskId=173 JobName=swarm
   UserId=sampsonjn(33882) GroupId=sampsonjn(33882) MCS_label=N/A
   Priority=22678 Nice=0 Account=sampsonjn QOS=global
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=16:00:00 TimeMin=N/A
   SubmitTime=2016-12-26T20:24:47 EligibleTime=2016-12-28T10:00:00
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=norm AllocNode:Sid=biowulf:1524
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=cn2040
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=1.50G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=1.50G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/spin1/swarm/sampsonjn/tmpv7dsB1Qn/swarm.batch
   WorkDir=/home/sampsonjn
   Comment=***job_modified***
   StdErr=/home/sampsonjn/swarm_30122355_173.e
   StdIn=/dev/null
   StdOut=/home/sampsonjn/swarm_30122355_173.o
   Power=

Nothing new for this job in ctld.log after the starttime update.
Comment 3 Susan Chacko 2016-12-28 08:37:42 MST
Putting the job into Hold and then Releasing it caused it to start up right away. 
We've done this for all the jobs stuck in BeginTime.
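For anyone hitting the same state, the hold/release workaround can be scripted. A minimal sketch (the `hold_release` function name and the DRYRUN guard are mine, not part of any Slurm tooling; `%i`/`%r` are the standard squeue format specifiers for job ID and pending reason):

```shell
# hold_release: hold, then immediately release, each job ID given,
# which knocks jobs out of the stuck PD/BeginTime state as in comment 3.
# Set DRYRUN=1 to print the scontrol commands instead of running them.
hold_release() {
    for jobid in "$@"; do
        if [ "${DRYRUN:-0}" = "1" ]; then
            echo "scontrol hold $jobid; scontrol release $jobid"
        else
            scontrol hold "$jobid" && scontrol release "$jobid"
        fi
    done
}

# Example sweep over every job pending with reason BeginTime:
#   hold_release $(squeue --noheader --states=PENDING --format='%i %r' |
#                  awk '$2 == "BeginTime" {print $1}')
```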
Comment 4 Tim Wickberg 2016-12-28 08:45:22 MST
Glad you found that fix. I'm going to ask Alex if he can reproduce this and hopefully find a solution to the underlying issue; you shouldn't need to hold/release to get these to schedule again.

- Tim
Comment 5 Susan Chacko 2016-12-28 08:55:55 MST
Thanks, Tim, that would be very helpful.
Comment 6 Alejandro Sanchez 2016-12-29 03:45:57 MST
Hi Susan - could you please upload slurmctld.log messages since 2016-12-26, cn2040 node slurmd.log and slurm.conf as well if it changed since Oct 10th, which is the date of the last copy we have here? Thanks.
Comment 7 Alejandro Sanchez 2016-12-29 04:28:05 MST
I've done this experiment:

1. Start with 2 idle nodes, compute[1-2], and no jobs in the queue.
2. Submit a simple batch job: 'sbatch --wrap "sleep 99999"'
3. The job starts running on node compute1.
4. Update compute1's state to fail.
5. Update compute1's state to down.
6. Immediately after the node goes down, the job is requeued due to the failure of compute1:
slurmctld: requeue job 13 due to failure of node compute1
7. Job 13 could start on node compute2, but it remains PD with reason BeginTime.
8. Eventually (after 1m41s), the job starts running on node compute2.

But jobs don't get stuck in PD (BeginTime) forever here.
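The sequence above can be sketched as commands. The node names and requeue message are from the experiment; the `run`/DRYRUN wrapper and the `reason=test` strings are mine, added so the sequence can be reviewed before pointing it at a real cluster:

```shell
# Reproduction sketch for the experiment above. run() prints each
# command when DRYRUN=1 (the default here) instead of executing it.
run() { if [ "${DRYRUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi; }

repro() {
    run sbatch --wrap "sleep 99999"                   # job starts on compute1
    run scontrol update nodename=compute1 state=fail reason=test
    run scontrol update nodename=compute1 state=down reason=test
    # slurmctld then logs: requeue job <id> due to failure of node compute1
    run squeue --states=PENDING --format='%i %r'      # Reason=BeginTime for a while
}
```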
Comment 8 Alejandro Sanchez 2016-12-29 06:10:38 MST
Susan, the job script with the SBATCH options used might be useful too, along with the info requested two comments above. I tried with a job array instead and got the same result: tasks eventually start after the requeue.
Comment 9 Susan Chacko 2016-12-29 07:55:09 MST
Created attachment 3886 [details]
ctld.log

slurmctld.log for 2016-12-26 to 2016-12-27 inclusive
Comment 10 Susan Chacko 2016-12-29 07:56:29 MST
Created attachment 3887 [details]
slurmd.log.cn2040

slurmd.log from node cn2040. This node had a hardware problem and was taken offline. The slurmd.log ends before this job was started.
Comment 11 Susan Chacko 2016-12-29 07:56:50 MST
Created attachment 3888 [details]
slurm.conf

slurm.conf
Comment 12 Susan Chacko 2016-12-29 08:01:33 MST
Created attachment 3889 [details]
job script for 30122355_173
Comment 13 Susan Chacko 2016-12-29 08:08:56 MST

I've uploaded
- slurmctld.log from Dec 26 and 27
- slurmd.log for cn2040
- slurm.conf
- the batch script and submission command.

The job described above is one element of a 999-element job array. We had 4 nodes that randomly rebooted sometime before these jobs ran, and rejected all jobs thereafter.
15 of the subjobs from this array went into PD-BeginTime after attempting to start up on one of these 4 nodes. (Those nodes are now offline, and we don't have any jobs in PD-BeginTime state now.)
cn2040 rebooted at Dec 26 19:09:36, after which all jobs were rejected.
Comment 14 Alejandro Sanchez 2017-01-05 11:06:49 MST
Thanks Susan. I've been trying to reproduce by lowering bf_max_job_array_resv and
bf_max_job_user, submitting 1000 tasks arrays from different users and making setting the nodes to fail/down through scontrol, which makes the jobs running jobs on that node to be requeued. I thought lowering the value of these parameters would make your issue reproducible, and it is but just for a while I see the StartTime=Unknown and Reason=BeginTime, eventually Reason becomes Priority and it doesn't get stuck on BeginTime. StartTime remains Unknown for some tasks, though. We'll continue playing with this and come back to you. Not sure how to reproduce a filesystem failure on the compute node other than marking the node to fail/down.
Comment 15 Susan Chacko 2017-01-05 12:03:37 MST
Thanks, Alejandro. We haven't seen any jobs in that state since I reported this problem, so unfortunately I can't provide any more information either. Appreciate your looking into it. If you aren't able to reproduce the problem, we can close this and I'll reopen if we see it happening again.
Comment 19 Moe Jette 2017-01-30 10:38:42 MST
This would be difficult to reproduce without a fairly heavily loaded system, but I was able to identify and fix the underlying problem. The patch identified below has the fix:

https://github.com/SchedMD/slurm/commit/0abbf72732a94d9bceb11cd4435ba440e58c91a1.patch

We will probably release version 16.08.9 this week including this patch if you prefer to wait for that.

Holding and releasing the job will probably be your best option to clear this problem until the patch is applied or the new version is installed.
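For reference, applying the referenced commit to a source tree might look like the sketch below; the patch URL is from the comment above, but the checkout path and the use of git am are assumptions, so adjust to your local setup:

```shell
# Sketch: fetch the fix referenced in this comment and apply it to a
# local Slurm source checkout. SLURM_SRC is a hypothetical variable
# naming the checkout; it is not set by Slurm itself.
patch_url='https://github.com/SchedMD/slurm/commit/0abbf72732a94d9bceb11cd4435ba440e58c91a1.patch'

apply_fix() {
    cd "${SLURM_SRC:-$HOME/slurm}" || return 1
    curl -LO "$patch_url" || return 1
    git am "$(basename "$patch_url")"
}
```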
Comment 21 Susan Chacko 2017-01-30 14:45:43 MST
Thanks very much, Moe. 
As you suggest, we'll probably wait until 16.08.9 comes out including this patch.