Ticket 3368

Summary: Jobs staying in PD with reason BeginTime
Product: Slurm Reporter: Susan Chacko <susanc>
Component: Scheduling Assignee: Alejandro Sanchez <alex>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 16.05.6   
Hardware: Linux   
OS: Linux   
Site: NIH Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 16.05.9 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: ctld.log
slurmd.log.cn2040
slurm.conf
job script for 30122355_173

Description Susan Chacko 2016-12-28 07:36:46 MST
Hi,


We have a bunch of jobs in PD state with reason: BeginTime. An example is 
% scontrol show job 30122355_173
JobId=30157573 ArrayJobId=30122355 ArrayTaskId=173 JobName=swarm
   UserId=sampsonjn(33882) GroupId=sampsonjn(33882) MCS_label=N/A
   Priority=24943 Nice=0 Account=sampsonjn QOS=global
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=16:00:00 TimeMin=N/A
   SubmitTime=2016-12-26T20:24:47 EligibleTime=2016-12-26T20:26:48
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=norm AllocNode:Sid=biowulf:1524
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=cn2040
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=1.50G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=1.50G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/spin1/swarm/sampsonjn/tmpv7dsB1Qn/swarm.batch
   WorkDir=/home/sampsonjn
   StdErr=/home/sampsonjn/swarm_30122355_173.e
   StdIn=/dev/null
   StdOut=/home/sampsonjn/swarm_30122355_173.o
   Power=

The jobs apparently went into this state due to a filesystem problem on the allocated node. From ctld.log:

[2016-12-26T15:07:55.703] sched: Allocate JobID=30122355_173(30157573) NodeList=cn2040 #CPUs=2 Partition=norm
[2016-12-26T20:24:47.205] requeue job 30157573 due to failure of node cn2040
[2016-12-26T20:24:47.207] Requeuing JobID=30122355_173(30157573) State=0x0 NodeCnt=0

We've taken the node offline. The question is: how do we get these jobs out of this pending state? 'scontrol release <jobid>' does not have any effect, and there are plenty of resources available on the cluster right now.

All suggestions appreciated, thanks,
Susan.
Comment 1 Tim Wickberg 2016-12-28 07:48:11 MST
They should be eligible to restart shortly after the requeue - the job's StartTime is intentionally moved forward by roughly a minute to make sure all parts of the previous run have cleaned up, and so that the different attempts to run the job can be distinguished in the accounting records.

The 'BeginTime' reason is meant to indicate that the StartTime is what prevents the job from launching, but that doesn't seem to be the case here. I'm going to see what I can do to reproduce this here.

In the meantime, can you try manually setting the StartTime to a minute or so in the future and see if the job launches then? If it doesn't, can you attach the slurmctld logs from that point through the next few minutes?

- Tim
Comment 2 Susan Chacko 2016-12-28 08:17:48 MST
Hi Tim,

I set the starttime to 2 mins in the future: 

% scontrol update jobid=30122355_173 starttime=2016-12-28T10:00:00

and verified....

ctld.log for this job -> 
[2016-12-28T09:59:03.778] _slurm_rpc_update_job complete JobId=30122355_173 uid=0 usec=25010

% scontrol show job 30122355_173
JobId=30157573 ArrayJobId=30122355 ArrayTaskId=173 JobName=swarm
   UserId=sampsonjn(33882) GroupId=sampsonjn(33882) MCS_label=N/A
   Priority=22687 Nice=0 Account=sampsonjn QOS=global
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=16:00:00 TimeMin=N/A
   SubmitTime=2016-12-26T20:24:47 EligibleTime=2016-12-28T10:00:00
   StartTime=2016-12-28T10:00:00 EndTime=2016-12-29T02:00:00 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=norm AllocNode:Sid=biowulf:1524
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=cn2040
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=1.50G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=1.50G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/spin1/swarm/sampsonjn/tmpv7dsB1Qn/swarm.batch
   WorkDir=/home/sampsonjn
   Comment=***job_modified***
   StdErr=/home/sampsonjn/swarm_30122355_173.e
   StdIn=/dev/null
   StdOut=/home/sampsonjn/swarm_30122355_173.o
   Power=

10 minutes later, the start time was reset to 'Unknown', but the reason is still BeginTime: 

% scontrol show job 30122355_173
JobId=30157573 ArrayJobId=30122355 ArrayTaskId=173 JobName=swarm
   UserId=sampsonjn(33882) GroupId=sampsonjn(33882) MCS_label=N/A
   Priority=22678 Nice=0 Account=sampsonjn QOS=global
   JobState=PENDING Reason=BeginTime Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=16:00:00 TimeMin=N/A
   SubmitTime=2016-12-26T20:24:47 EligibleTime=2016-12-28T10:00:00
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=norm AllocNode:Sid=biowulf:1524
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   BatchHost=cn2040
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=1.50G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=1.50G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/spin1/swarm/sampsonjn/tmpv7dsB1Qn/swarm.batch
   WorkDir=/home/sampsonjn
   Comment=***job_modified***
   StdErr=/home/sampsonjn/swarm_30122355_173.e
   StdIn=/dev/null
   StdOut=/home/sampsonjn/swarm_30122355_173.o
   Power=

Nothing new for this job in ctld.log after the starttime update.
Comment 3 Susan Chacko 2016-12-28 08:37:42 MST
Putting the job into Hold and then Releasing it caused it to start up right away. 
We've done this for all the jobs stuck in BeginTime.
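For anyone hitting the same state, the hold/release workaround can be scripted. A minimal sketch (the `hold_release` function name and the DRYRUN guard are mine, not part of any Slurm tooling; `%i`/`%r` are the standard squeue format specifiers for job ID and pending reason):

```shell
# hold_release: hold, then immediately release, each job ID given,
# which knocks jobs out of the stuck PD/BeginTime state as in comment 3.
# Set DRYRUN=1 to print the scontrol commands instead of running them.
hold_release() {
    for jobid in "$@"; do
        if [ "${DRYRUN:-0}" = "1" ]; then
            echo "scontrol hold $jobid; scontrol release $jobid"
        else
            scontrol hold "$jobid" && scontrol release "$jobid"
        fi
    done
}

# Example sweep over every job pending with reason BeginTime:
#   hold_release $(squeue --noheader --states=PENDING --format='%i %r' |
#                  awk '$2 == "BeginTime" {print $1}')
```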
Comment 4 Tim Wickberg 2016-12-28 08:45:22 MST
Glad you found that fix. I'm going to ask Alex if he can reproduce this and hopefully find a solution to the underlying issue; you shouldn't need to hold/release to get these to schedule again.

- Tim
Comment 5 Susan Chacko 2016-12-28 08:55:55 MST
Thanks, Tim, that would be very helpful.
Comment 6 Alejandro Sanchez 2016-12-29 03:45:57 MST
Hi Susan - could you please upload slurmctld.log messages since 2016-12-26, cn2040 node slurmd.log and slurm.conf as well if it changed since Oct 10th, which is the date of the last copy we have here? Thanks.
Comment 7 Alejandro Sanchez 2016-12-29 04:28:05 MST
I've done this experiment:

1. Start with 2 idle nodes, compute[1-2], and no jobs in the queue.
2. Submit a simple batch job: 'sbatch --wrap "sleep 99999"'
3. The job starts running on node compute1.
4. Update compute1's state to fail.
5. Update compute1's state to down.
6. Immediately after the node goes down, the job is requeued due to the failure of compute1:
slurmctld: requeue job 13 due to failure of node compute1
7. Job 13 could start on node compute2, but it remains PD with reason BeginTime.
8. Eventually (after 1m41s), the job starts running on node compute2.

But jobs don't get stuck in PD (BeginTime) forever here.
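The sequence above can be sketched as commands. The node names and requeue message are from the experiment; the `run`/DRYRUN wrapper and the `reason=test` strings are mine, added so the sequence can be reviewed before pointing it at a real cluster:

```shell
# Reproduction sketch for the experiment above. run() prints each
# command when DRYRUN=1 (the default here) instead of executing it.
run() { if [ "${DRYRUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi; }

repro() {
    run sbatch --wrap "sleep 99999"                   # job starts on compute1
    run scontrol update nodename=compute1 state=fail reason=test
    run scontrol update nodename=compute1 state=down reason=test
    # slurmctld then logs: requeue job <id> due to failure of node compute1
    run squeue --states=PENDING --format='%i %r'      # Reason=BeginTime for a while
}
```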
Comment 8 Alejandro Sanchez 2016-12-29 06:10:38 MST
Susan, the job script with the SBATCH options used might be useful too, along with the info requested two comments above. I tried with a job array instead and got the same result: tasks eventually start after the requeue.
Comment 9 Susan Chacko 2016-12-29 07:55:09 MST
Created attachment 3886 [details]
ctld.log

slurmctld.log for 2016-12-26 to 2016-12-27 inclusive
Comment 10 Susan Chacko 2016-12-29 07:56:29 MST
Created attachment 3887 [details]
slurmd.log.cn2040

slurmd.log from node cn2040. This node had a hardware problem and was taken offline. The slurmd.log ends before this job was started.
Comment 11 Susan Chacko 2016-12-29 07:56:50 MST
Created attachment 3888 [details]
slurm.conf

slurm.conf
Comment 12 Susan Chacko 2016-12-29 08:01:33 MST
Created attachment 3889 [details]
job script for 30122355_173
Comment 13 Susan Chacko 2016-12-29 08:08:56 MST

I've uploaded
- slurmctld.log from Dec 26 and 27
- slurmd.log for cn2040
- slurm.conf
- the batch script and submission command.

The job described above is one element of a 999-element job array. We had 4 nodes that randomly rebooted sometime before these jobs ran, and rejected all jobs thereafter.
15 of the subjobs from this array went into PD-BeginTime after attempting to start up on one of these 4 nodes. (Those nodes are now offline, and we don't have any jobs in PD-BeginTime state now.)
cn2040 rebooted at Dec 26 19:09:36, after which all jobs were rejected.
Comment 14 Alejandro Sanchez 2017-01-05 11:06:49 MST
Thanks Susan. I've been trying to reproduce by lowering bf_max_job_array_resv and
bf_max_job_user, submitting 1000 tasks arrays from different users and making setting the nodes to fail/down through scontrol, which makes the jobs running jobs on that node to be requeued. I thought lowering the value of these parameters would make your issue reproducible, and it is but just for a while I see the StartTime=Unknown and Reason=BeginTime, eventually Reason becomes Priority and it doesn't get stuck on BeginTime. StartTime remains Unknown for some tasks, though. We'll continue playing with this and come back to you. Not sure how to reproduce a filesystem failure on the compute node other than marking the node to fail/down.
Comment 15 Susan Chacko 2017-01-05 12:03:37 MST
Thanks, Alejandro. We haven't seen any jobs in that state since I reported this problem, so unfortunately I can't provide any more information either. Appreciate your looking into it. If you aren't able to reproduce the problem, we can close this and I'll reopen if we see it happening again.
Comment 19 Moe Jette 2017-01-30 10:38:42 MST
This would be difficult to reproduce without a fairly heavily loaded system, but I was able to identify and fix the underlying problem. The patch identified below has the fix:

https://github.com/SchedMD/slurm/commit/0abbf72732a94d9bceb11cd4435ba440e58c91a1.patch

We will probably release version 16.08.9 this week including this patch if you prefer to wait for that.

Holding and releasing the job will probably be your best option to clear this problem until the patch is applied or the new version is installed.
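For reference, applying the referenced commit to a source tree might look like the sketch below; the patch URL is from the comment above, but the checkout path and the use of git am are assumptions, so adjust to your local setup:

```shell
# Sketch: fetch the fix referenced in this comment and apply it to a
# local Slurm source checkout. SLURM_SRC is a hypothetical variable
# naming the checkout; it is not set by Slurm itself.
patch_url='https://github.com/SchedMD/slurm/commit/0abbf72732a94d9bceb11cd4435ba440e58c91a1.patch'

apply_fix() {
    cd "${SLURM_SRC:-$HOME/slurm}" || return 1
    curl -LO "$patch_url" || return 1
    git am "$(basename "$patch_url")"
}
```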
Comment 21 Susan Chacko 2017-01-30 14:45:43 MST
Thanks very much, Moe. 
As you suggest, we'll probably wait until 16.08.9 comes out including this patch.