Ticket 2965 - Jobs being delayed once allocated to a node
Summary: Jobs being delayed once allocated to a node
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 14.11.10
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-08-03 08:29 MDT by paull
Modified: 2016-12-12 12:44 MST

See Also:
Site: DownUnder GeoSolutions
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
File showing job and server log. (6.64 KB, text/x-log)
2016-08-03 08:29 MDT, paull
Details
slurmctld.log as requested. (17.44 MB, text/x-log)
2016-08-03 12:19 MDT, paull
Details
slurm.conf as requested. (17.93 KB, text/plain)
2016-08-03 12:25 MDT, paull
Details
Additional logs from slurmctld.log (17.44 MB, text/x-log)
2016-08-03 12:26 MDT, paull
Details
Log from slurmd of node in reference to job delay (40.21 KB, text/plain)
2016-08-15 15:35 MDT, paull
Details

Description paull 2016-08-03 08:29:06 MDT
Created attachment 3377 [details]
File showing job and server log.

We have been alerted by multiple users that jobs were taking a long time to start after submission. I have attached a log file stitched together from slurmctld.log and the log of a single reported job.

What happened:

The job was allocated to a node and the queue showed it as Running. Then, 20 minutes later, the log reported that the job had been delayed 1200 seconds, at which point it ran as usual. Prior to the delay report, nothing was happening job-wise. I'm not sure what the node was doing either; at first I believed it was waking from powersave, but its uptime proved that theory incorrect.

Why would Slurm delay the job 1200 seconds? What could cause some jobs on various nodes to take a while to begin running once allocated to a node and shown as Running in the queue?
Comment 1 Tim Wickberg 2016-08-03 11:32:42 MDT
Can you provide additional logs from slurmctld? I'm curious what was happening there during that delay, especially if there are any other events related to node1.

Is there anything in a prolog script that could account for the delay? Or could there have been a network or fileserver issue during that period? And is this an ongoing issue, or did that happen once and not recur since?

And can you attach the slurm.conf from this system?
Comment 2 paull 2016-08-03 12:19:16 MDT
Created attachment 3378 [details]
slurmctld.log as requested.
Comment 3 paull 2016-08-03 12:25:06 MDT
Created attachment 3379 [details]
slurm.conf as requested.
Comment 4 paull 2016-08-03 12:26:00 MDT
Created attachment 3380 [details]
Additional logs from slurmctld.log
Comment 5 paull 2016-08-03 12:27:36 MDT
Is there anything in a prolog script that could account for the delay? 

--- Yes, but it's the same in all offices and we are not experiencing issues elsewhere.

Or could there have been a network or fileserver issue during that period? And is this an ongoing issue, or did that happen once and not recur since? 

--- This could be the case, but I want to rule out Slurm while checking the others. This is ongoing, and intermittent.

Also, apologies for attaching slurmctldsection.log twice.
Comment 6 Tim Wickberg 2016-08-03 12:35:51 MDT
Right around the time the communication problems start, there was a hiccup:

[2016-08-02T10:26:10.307] error: Node node0024 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-08-02T10:26:10.307] debug:  Node node8 has low real_memory size (64467 < 64483)
[2016-08-02T10:26:10.307] error: Node node0047 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-08-02T10:26:10.307] debug:  Node node10 has low real_memory size (64482 < 64483)
[2016-08-02T10:26:10.310] error: Node node0043 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-08-02T10:26:10.319] error: Node node18 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-08-02T10:26:10.319] debug:  Node node2 has low real_memory size (257702 < 258323)
[2016-08-02T10:26:10.330] error: Node node15 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-08-02T10:26:10.526] error: Node node13 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.


Do you know what caused that? My best current guess is that a handful of nodes went offline and came back with an inconsistent config file. That config file can affect how some of the internal message routing happens, which may explain some of the behavior.

Can you verify if the config files on those nodes are in sync with the rest of the cluster, and that slurmd has been restarted since those were updated?
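
For illustration only (paths and file contents below are made up, not taken from this site): the NO_CONF_HASH errors above mean slurmctld computed a different hash for a node's slurm.conf than for its own copy, and the consistency check boils down to comparing checksums of the two files. A minimal local sketch:

```shell
# Sketch of the config-hash consistency check. In practice the node-side
# copy would be read over ssh; here both files are local placeholders.
ctl_conf=$(mktemp)    # stands in for the controller's slurm.conf
node_conf=$(mktemp)   # stands in for a compute node's slurm.conf
printf 'ControlMachine=ctl\nResumeTimeout=1200\n' > "$ctl_conf"
printf 'ControlMachine=ctl\nResumeTimeout=1200\n' > "$node_conf"
if [ "$(md5sum < "$ctl_conf")" = "$(md5sum < "$node_conf")" ]; then
  echo "configs match"
else
  echo "configs differ: re-sync slurm.conf and restart slurmd"
fi
```

On a real cluster the node-side checksum would be gathered over ssh or pdsh, and any node that differs would get the controller's slurm.conf copied over, followed by a slurmd restart.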
Comment 7 paull 2016-08-09 08:43:57 MDT
Tim,

If I have it correctly, the best way to get a node's config file up to date is by restarting slurmd on the node; is that correct?

Thanks,

Paul
Comment 8 paull 2016-08-09 09:15:12 MDT
Tim,

This is still happening even after restarting slurmd and slurmctld on the nodes and controller. What conditions in Slurm's setup would tell it to delay a launch by 1200+ seconds? I notice jobs being delayed 1-5 seconds, which is fine, but then there are the 1200+ second delays that are causing the issue. Understanding this further may help me pinpoint what exactly is going on.

Thanks,

Paul
Comment 9 Tim Wickberg 2016-08-12 11:03:45 MDT
Is there something in /d/sw/slurm/etc/slurm_task_prolog.sh that could account for the delay?

The prolog scripts are the only thing I can think of that would introduce such unpredictability into job launch. The logs you've given look okay and aren't pointing to an obvious communication issue.


(In reply to paull from comment #8)
> Tim,
> 
> This is still happening even after restarting slurmd and slurmctld on nodes
> and controller. What stipulations is Slurm setup for that would tell it
> delay launch by 1200+ secs? I notice jobs being delayed 1-5sec which is
> fine, but then there are the 1200+ sec delays that are causing the issue. If
> I understand further this information, it may help me pinpoint whats exactly
> going on.
Comment 10 paull 2016-08-12 14:07:20 MDT
All our slurm_task_prolog.sh script does is set up the job environment. I will do some more digging. I see that slurm_prolog.sh runs when a job is allocated to a node.

Do any exit codes 101-110 cause Slurm to hold off starting a job for 20 minutes (1200 seconds)?

I noticed this happened right as the epilog completed and as the node was allocated the job that was pushed off 1200 sec:

2016-08-02T14:55:33+01:00 node1 slurmd[3922]: debug:  [job 5744648] attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
2016-08-02T14:55:33+01:00 node1 slurmd[3922]: debug:  completed epilog for jobid 5744648
2016-08-02T14:55:33+01:00 node1 slurmd[3922]: debug:  Job 5744648: sent epilog complete msg: rc = 0
2016-08-02T14:55:34+01:00 node1 slurmd[3922]: debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
2016-08-02T14:55:34+01:00 node1 slurmd[3922]: error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
2016-08-02T14:55:34+01:00 node1 slurmd[3922]: error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received

Just to clarify, the node has lost communication with the controller, right?
Comment 11 Tim Wickberg 2016-08-12 15:37:50 MDT
(In reply to paull from comment #10)
> All our slurm_task_prolog.sh script does is setup the job environment. I
> will do some more digging. I see that slurm_prolog.sh runs when a job is
> allocated to it. 

> Do any exit codes 101-110 cause slurm to hold off starting a job for 20min
> (1200sec)?

Nope. There's nothing that should delay jobs like this. 20 minutes is an odd value, but it could potentially be explained by the node registration interval re-acquiring a network connection to the node.

Is the slurm_common.conf file for this cluster the same as the kl cluster? I just noticed the ResumeTimeout is set to 20 minutes; I'm wondering if that is involved somehow.
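
For reference, the power-save settings involved live in slurm.conf; a minimal illustrative fragment (values and paths are examples, not this cluster's actual configuration):

```
SuspendTime=600          # seconds a node may sit idle before power-down
ResumeTimeout=1200       # how long slurmctld waits for a resumed node to register
SuspendProgram=/usr/local/sbin/node_suspend.sh   # hypothetical paths
ResumeProgram=/usr/local/sbin/node_resume.sh
```

A job allocated to a node that slurmctld believes is powering up can have its launch deferred for up to ResumeTimeout, which would line up with the 1200-second delays seen here.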

And are you really running with AuthType=auth/none?

> I noticed this happened right as the epilog completed and as the node was
> allocated the job that was pushed off 1200 sec:
> 
> 2016-08-02T14:55:33+01:00 node1 slurmd[3922]: debug:  [job 5744648]
> attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
> 2016-08-02T14:55:33+01:00 node1 slurmd[3922]: debug:  completed epilog for
> jobid 5744648
> 2016-08-02T14:55:33+01:00 node1 slurmd[3922]: debug:  Job 5744648: sent
> epilog complete msg: rc = 0
> 2016-08-02T14:55:34+01:00 node1 slurmd[3922]: debug:  _slurm_recv_timeout at
> 0 of 4, recv zero bytes
> 2016-08-02T14:55:34+01:00 node1 slurmd[3922]: error:
> slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
> 2016-08-02T14:55:34+01:00 node1 slurmd[3922]: error: service_connection:
> slurm_receive_msg: Zero Bytes were transmitted or received
> 
> Just to clarify, the node has lost communication with the controller right?

Yes. That's unusual, and I'm trying to piece together if that could stall communication for other jobs.

Is there any firewall or other network complication between the compute nodes and the master that could be blocking traffic?
Comment 12 Tim Wickberg 2016-08-12 16:05:57 MDT
If you could run the slurmd nodes with SlurmdDebug=debug3 for a while, I'd love to see more details on the communication interactions. The current debug level just isn't giving me much to work from, unfortunately.
Comment 13 paull 2016-08-15 10:02:45 MDT
Hi Tim,

I have changed the ResumeTimeout from 1200 to 900 to see how this affects the jobs. If nothing comes of that after some time, I will change the debug level to debug3 as requested.

Thanks,

Paul
Comment 14 paull 2016-08-15 10:38:27 MDT
Tim,

After changing the ResumeTimeout to 900, the launch delay decreased to 900 seconds. Also, the node whose job was delayed was in powersave...but wasn't: it was marked by Slurm as in powersave, but the node itself never went offline. As a matter of fact, once the node finished its previous job it experienced a communication error.

Would you still like me to change the debug level? Is this done in the slurm.conf file?

[2016-08-15T17:15:06.238] Job 5775850 launch delayed by 901 secs, updating end_time

...

[root@server ~]# scontrol show job 5775850
JobId=5775850 ArrayJobId=5774749 ArrayTaskId=2862 JobName=961_PRE-MIG_CMPS_2015-ONLY
   UserId=user(1234) GroupId=teambond(2110)
   Priority=100 Nice=0 Account=user QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:16:06 TimeLimit=00:06:00 TimeMin=N/A
   SubmitTime=2016-08-15T10:52:38 EligibleTime=2016-08-15T10:52:39
   StartTime=2016-08-15T17:00:05 EndTime=2016-08-15T17:21:06
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=team AllocNode:Sid=lud43:22030
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node44
   BatchHost=node44
   NumNodes=1 NumCPUs=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=110G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   (...Removed...)

[root@server ~]# scontrol show node=node44
NodeName=node44 CoresPerSocket=10
   CPUAlloc=40 CPUErr=0 CPUTot=40 CPULoad=N/A Features=intel,localdisk,nogpu
   Gres=(null)
   NodeAddr=node44 NodeHostName=node44 Version=(null)
   RealMemory=129163 AllocMem=112640 Sockets=2 Boards=1
   MemSpecLimit=123017
   State=ALLOCATED+POWER ThreadsPerCore=2 TmpDisk=730899 Weight=1
   BootTime=None SlurmdStartTime=None
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[root@server ~]# ssh node44 uptime
 17:17:54 up 21 days,  1:09,  0 users,  load average: 0.42, 0.33, 0.16

...

NODE44 LOGS:

2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: [job 5775782] attempting to run slurm task_epilog [/d/sw/slurm/etc/slurm_task_epilog.sh]
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: task_p_post_term: 5775782.4294967294, task 0
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: cpu_freq_reset: #cpus reset = 0
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: job 5775782 completed with slurm_rc = 0, job_rc = 0
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: Message thread exited
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: done with job
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug:  _rpc_terminate_job, uid = 601
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug:  task_p_slurmd_release_resources: 5775782
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug:  credential for job 5775782 revoked
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug:  Waiting for job 5775782's prolog to complete
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug:  Finished wait for job 5775782's prolog to complete
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug:  Calling /d/sw/slurm/20160620_1411/sbin/slurmstepd spank epilog
2016-08-15T17:00:04+01:00 node44 spank-epilog[10371]: Reading slurm.conf file: /d/sw/slurm/20160620_1411/etc/slurm.conf
2016-08-15T17:00:04+01:00 node44 spank-epilog[10371]: Running spank/epilog for jobid [5775782] uid [1487]
2016-08-15T17:00:04+01:00 node44 spank-epilog[10371]: spank: opening plugin stack /d/sw/slurm/20160620_1411/etc/plugstack.conf
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug:  [job 5775782] attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug:  completed epilog for jobid 5775782
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug:  Job 5775782: sent epilog complete msg: rc = 0
2016-08-15T17:00:05+01:00 node44 slurmd[35970]: debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
2016-08-15T17:00:05+01:00 node44 slurmd[35970]: error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
2016-08-15T17:00:05+01:00 node44 slurmd[35970]: error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received


Thanks,

Paul
Comment 15 Tim Wickberg 2016-08-15 10:42:08 MDT
(In reply to paull from comment #14)
> Tim,
> 
> After changing the ResumeTimeout to 900 the launch delay decreased to 900
> secs. Also, for the node whose job was delayed, it was in powersave...but
> wasn't. It was marked by slurm as powersave, but the node itself never went
> offline. As a matter of face, once the node finished its previous job it
> experienced a communication error. 

I will look into what could trigger that case; it sounds like the communication issue somehow correlates, although I haven't seen an issue like this before.

> Would you still like me to change the debug level? This is done in the
> slurm.conf file?

Yes, please. If you can capture logs with SlurmdDebug=debug3 while this happens and attach them that should help isolate the issue.
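
Raising the slurmd verbosity is a slurm.conf change; an illustrative fragment (the log path is an assumption, use the site's existing location):

```
SlurmdDebug=debug3
SlurmdLogFile=/var/log/slurm/slurmd.log   # assumed path
```

After updating the file on the nodes, running `scontrol reconfigure` (or restarting slurmd) should pick up the new level. Remember to drop it back afterwards, since debug3 is very chatty.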
Comment 16 paull 2016-08-15 15:35:20 MDT
Created attachment 3409 [details]
Log from slurmd of node in reference to job delay
Comment 17 paull 2016-08-15 15:37:14 MDT
Hi Tim,

I have uploaded a log file (Log from slurmd of node in reference to job delay) with an excerpt from the node's log from when the job below was delayed. Below is the job information and the cutout of the delay from slurmctld.log.

[2016-08-15T22:11:49.447] Job 5776562 launch delayed by 120 secs, updating end_time

...

[root@server ~]# scontrol show job 5776562
JobId=5776562 ArrayJobId=5774707 ArrayTaskId=14 JobName=pt2_3DPreSTM_4_vels
   UserId=user(1234) GroupId=team(2113)
   Priority=150 Nice=0 Account=user QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:09:26 TimeLimit=01:48:00 TimeMin=N/A
   SubmitTime=2016-08-15T10:24:55 EligibleTime=2016-08-15T10:29:28
   StartTime=2016-08-15T22:09:49 EndTime=2016-08-15T23:59:49
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=idle AllocNode:Sid=lud55:27870
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node19
   BatchHost=node19
   NumNodes=1 NumCPUs=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=110G MinTmpDiskNode=0
   Features=gpu Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   (...REMOVED...)

[root@lugy ~]# scontrol show node=node19
NodeName=node19 CoresPerSocket=6
   CPUAlloc=24 CPUErr=0 CPUTot=24 CPULoad=N/A Features=intel,nogpu,gpu,8gpu,tengig
   Gres=(null)
   NodeAddr=node19 NodeHostName=node19 Version=(null)
   RealMemory=129083 AllocMem=112640 Sockets=2 Boards=1
   MemSpecLimit=122939
   State=ALLOCATED+POWER ThreadsPerCore=2 TmpDisk=500 Weight=1
   BootTime=None SlurmdStartTime=None
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment 18 Tim Wickberg 2016-08-16 15:59:39 MDT
The logs help, although I haven't yet been able to sort out the exact cause. There's a communication failure right as the job should be scheduled, and then slurmctld waits until the ResumeTimeout expires to reconnect, which is unusual.

One thing I'm noticing in the logs is some errors related to the SPANK plugin you're using. Would you be able to disable the plugin on a few nodes and see if that has any impact?
Comment 19 paull 2016-08-24 13:11:38 MDT
Tim,

Quick update, I have disabled the spank plugin and am waiting on some logs. Will update soon.

Thanks,

Paul
Comment 20 Tim Wickberg 2016-12-12 12:44:20 MST
Closing this as I'd expected a response back on the last message if there were further issues. Please reopen if there's still an active problem here.

- Tim