Created attachment 3377 [details] File showing job and server log.

We have been alerted by multiple users that jobs are taking a long time to start after submission. I have attached a log file patched together from slurmctld.log and the log file of a single job that was reported. What happened: the job was allocated to a node and the queue showed it as Running. Then, 20 minutes later, slurmctld reported that the job launch was delayed by 1200 sec, at which point it ran as usual. During the time before the delay report, nothing was happening job-wise. I'm not sure what the node was doing either; at first I believed the node was waking from powersave, but its uptime proved that theory incorrect.

Why would Slurm delay the job by 1200 sec? What could cause some jobs on various nodes to take a while to begin running once allocated to a node and shown as Running in the queue?
Can you provide additional logs from slurmctld? I'm curious what was happening there during that delay, especially if there are any other events related to node1. Is there anything in a prolog script that could account for the delay? Or could there have been a network or fileserver issue during that period? And is this an ongoing issue, or did that happen once and not recur since? And can you attach the slurm.conf from this system?
Created attachment 3378 [details] slurmctld.log as requested.
Created attachment 3379 [details] slurm.conf as requested.
Created attachment 3380 [details] Additional logs from slurmctld.log
> Is there anything in a prolog script that could account for the delay?

Yes, but it's the same in all offices and we are not experiencing issues elsewhere.

> Or could there have been a network or fileserver issue during that period? And is this an ongoing issue, or did that happen once and not recur since?

A network or fileserver issue could be the case, but I want to rule out Slurm while checking the others. The issue is ongoing and intermittent.

Also, apologies for attaching slurmctldsection.log twice.
Right around the time the communication problems start there was some hiccup:

[2016-08-02T10:26:10.307] error: Node node0024 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-08-02T10:26:10.307] debug: Node node8 has low real_memory size (64467 < 64483)
[2016-08-02T10:26:10.307] error: Node node0047 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-08-02T10:26:10.307] debug: Node node10 has low real_memory size (64482 < 64483)
[2016-08-02T10:26:10.310] error: Node node0043 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-08-02T10:26:10.319] error: Node node18 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-08-02T10:26:10.319] debug: Node node2 has low real_memory size (257702 < 258323)
[2016-08-02T10:26:10.330] error: Node node15 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-08-02T10:26:10.526] error: Node node13 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

Do you know what caused that? My best current guess is that a handful of nodes went offline and came back with an inconsistent config file. That config file can affect how some of the internal message routing happens, which may explain some of the behavior. Can you verify that the config files on those nodes are in sync with the rest of the cluster, and that slurmd has been restarted since they were updated?
Tim,

If I have it correctly, the best way to get a node's config file up to date is by restarting slurmd on the node; is that correct?

Thanks,
Paul
Tim,

This is still happening even after restarting slurmd and slurmctld on the nodes and the controller. Under what conditions would Slurm delay a job launch by 1200+ seconds? I notice jobs being delayed 1-5 sec, which is fine, but it's the 1200+ sec delays that are causing the issue. If I understand this better, it may help me pinpoint what exactly is going on.

Thanks,
Paul
Is there something in /d/sw/slurm/etc/slurm_task_prolog.sh that could account for the delay? The prolog scripts are the only thing I can think of that would introduce such unpredictability in job launch. The logs you've given look okay, and aren't pointing to an obvious communication issue.

(In reply to paull from comment #8)
> Tim,
>
> This is still happening even after restarting slurmd and slurmctld on nodes
> and controller. What stipulations is Slurm setup for that would tell it
> delay launch by 1200+ secs? I notice jobs being delayed 1-5sec which is
> fine, but then there are the 1200+ sec delays that are causing the issue. If
> I understand further this information, it may help me pinpoint whats exactly
> going on.
All our slurm_task_prolog.sh script does is set up the job environment. I will do some more digging. I see that slurm_prolog.sh runs when a job is allocated to the node.

Do any of the exit codes 101-110 cause Slurm to hold off starting a job for 20 min (1200 sec)?

I noticed this happened right as the epilog completed and as the node was allocated the job that was pushed off 1200 sec:

2016-08-02T14:55:33+01:00 node1 slurmd[3922]: debug: [job 5744648] attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
2016-08-02T14:55:33+01:00 node1 slurmd[3922]: debug: completed epilog for jobid 5744648
2016-08-02T14:55:33+01:00 node1 slurmd[3922]: debug: Job 5744648: sent epilog complete msg: rc = 0
2016-08-02T14:55:34+01:00 node1 slurmd[3922]: debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
2016-08-02T14:55:34+01:00 node1 slurmd[3922]: error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
2016-08-02T14:55:34+01:00 node1 slurmd[3922]: error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received

Just to clarify, these errors mean the node has lost communication with the controller, right?
(In reply to paull from comment #10)
> All our slurm_task_prolog.sh script does is setup the job environment. I
> will do some more digging. I see that slurm_prolog.sh runs when a job is
> allocated to it.
> Do any exit codes 101-110 cause slurm to hold off starting a job for 20min
> (1200sec)?

Nope. There's nothing that should delay jobs like this. 20 minutes is an odd value, but could potentially be explained by the node registration interval re-acquiring a network connection to the node. Is the slurm_common.conf file for this cluster the same as the kl cluster? I just noticed the ResumeTimeout is set to 20 minutes; I'm wondering if that is involved somehow. And are you really running with AuthType=auth/none?

> I noticed this happened right as the epilog completed and as the node was
> allocated the job that was pushed off 1200 sec:
>
> 2016-08-02T14:55:33+01:00 node1 slurmd[3922]: debug: [job 5744648]
> attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
> 2016-08-02T14:55:33+01:00 node1 slurmd[3922]: debug: completed epilog for
> jobid 5744648
> 2016-08-02T14:55:33+01:00 node1 slurmd[3922]: debug: Job 5744648: sent
> epilog complete msg: rc = 0
> 2016-08-02T14:55:34+01:00 node1 slurmd[3922]: debug: _slurm_recv_timeout at
> 0 of 4, recv zero bytes
> 2016-08-02T14:55:34+01:00 node1 slurmd[3922]: error:
> slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
> 2016-08-02T14:55:34+01:00 node1 slurmd[3922]: error: service_connection:
> slurm_receive_msg: Zero Bytes were transmitted or received
>
> Just to clarify, the node has lost communication with the controller right?

Yes. That's unusual, and I'm trying to piece together if that could stall communication for other jobs. Is there any firewall or other network complication between the compute nodes and the master that could be blocking traffic?
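For reference, the power-save timing in question is controlled by a small block of slurm.conf parameters. A minimal sketch only: the parameter names are the standard ones from slurm.conf, but the values and script paths here are illustrative, not this site's actual settings:

    SuspendProgram=/etc/slurm/node_suspend.sh   # illustrative path
    ResumeProgram=/etc/slurm/node_resume.sh     # illustrative path
    SuspendTime=600      # seconds a node sits idle before being powered down
    ResumeTimeout=1200   # max seconds slurmctld waits for a resuming node to
                         # register; a job on a node Slurm believes is in the
                         # POWER state can be held up to this long

That last parameter is why the 1200-second delays caught my eye: a job assigned to a node that slurmctld thinks is powered down (even incorrectly) can sit until ResumeTimeout expires.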
If you could run the slurmd nodes with SlurmdDebug=debug3 for a while, I'd love to see more details on the communication interactions. The current debug level just isn't giving me much to work from, unfortunately.
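For what it's worth, the relevant slurm.conf lines would look roughly like the following (SlurmdDebug and SlurmdLogFile are standard slurm.conf parameters; the log path is illustrative), with slurmd restarted on the nodes afterwards so the new level takes effect:

    SlurmdDebug=debug3
    SlurmdLogFile=/var/log/slurm/slurmd.log   # illustrative path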
Hi Tim,

I have changed the ResumeTimeout from 1200 to 900 to see how this affects the jobs. If nothing comes of that after some time, I will change the debug level to debug3 as requested.

Thanks,
Paul
Tim,

After changing the ResumeTimeout to 900, the launch delay decreased to 900 secs. Also, the node whose job was delayed was in powersave...but wasn't: it was marked by Slurm as being in powersave, but the node itself never went offline. As a matter of fact, once the node finished its previous job it experienced a communication error. Would you still like me to change the debug level? Is this done in the slurm.conf file?

[2016-08-15T17:15:06.238] Job 5775850 launch delayed by 901 secs, updating end_time
...
[root@server ~]# scontrol show job 5775850
JobId=5775850 ArrayJobId=5774749 ArrayTaskId=2862 JobName=961_PRE-MIG_CMPS_2015-ONLY
   UserId=user(1234) GroupId=teambond(2110)
   Priority=100 Nice=0 Account=user QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:16:06 TimeLimit=00:06:00 TimeMin=N/A
   SubmitTime=2016-08-15T10:52:38 EligibleTime=2016-08-15T10:52:39
   StartTime=2016-08-15T17:00:05 EndTime=2016-08-15T17:21:06
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=team AllocNode:Sid=lud43:22030
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node44 BatchHost=node44
   NumNodes=1 NumCPUs=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=110G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   (...Removed...)
[root@server ~]# scontrol show node=node44
NodeName=node44 CoresPerSocket=10
   CPUAlloc=40 CPUErr=0 CPUTot=40 CPULoad=N/A
   Features=intel,localdisk,nogpu
   Gres=(null)
   NodeAddr=node44 NodeHostName=node44 Version=(null)
   RealMemory=129163 AllocMem=112640 Sockets=2 Boards=1
   MemSpecLimit=123017 State=ALLOCATED+POWER ThreadsPerCore=2 TmpDisk=730899 Weight=1
   BootTime=None SlurmdStartTime=None
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[root@server ~]# ssh node44 uptime
17:17:54 up 21 days, 1:09, 0 users, load average: 0.42, 0.33, 0.16
...

NODE44 LOGS:
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: [job 5775782] attempting to run slurm task_epilog [/d/sw/slurm/etc/slurm_task_epilog.sh]
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: task_p_post_term: 5775782.4294967294, task 0
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: cpu_freq_reset: #cpus reset = 0
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: job 5775782 completed with slurm_rc = 0, job_rc = 0
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: Message thread exited
2016-08-15T17:00:04+01:00 node44 slurmstepd[8615]: done with job
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug: _rpc_terminate_job, uid = 601
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug: task_p_slurmd_release_resources: 5775782
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug: credential for job 5775782 revoked
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug: Waiting for job 5775782's prolog to complete
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug: Finished wait for job 5775782's prolog to complete
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug: Calling /d/sw/slurm/20160620_1411/sbin/slurmstepd spank epilog
2016-08-15T17:00:04+01:00 node44 spank-epilog[10371]: Reading slurm.conf file: /d/sw/slurm/20160620_1411/etc/slurm.conf
2016-08-15T17:00:04+01:00 node44 spank-epilog[10371]: Running spank/epilog for jobid [5775782] uid [1487]
2016-08-15T17:00:04+01:00 node44 spank-epilog[10371]: spank: opening plugin stack /d/sw/slurm/20160620_1411/etc/plugstack.conf
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug: [job 5775782] attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug: completed epilog for jobid 5775782
2016-08-15T17:00:04+01:00 node44 slurmd[35970]: debug: Job 5775782: sent epilog complete msg: rc = 0
2016-08-15T17:00:05+01:00 node44 slurmd[35970]: debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
2016-08-15T17:00:05+01:00 node44 slurmd[35970]: error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
2016-08-15T17:00:05+01:00 node44 slurmd[35970]: error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received

Thanks,
Paul
(In reply to paull from comment #14)
> Tim,
>
> After changing the ResumeTimeout to 900 the launch delay decreased to 900
> secs. Also, for the node whose job was delayed, it was in powersave...but
> wasn't. It was marked by slurm as powersave, but the node itself never went
> offline. As a matter of fact, once the node finished its previous job it
> experienced a communication error.

I will look into what could trigger that case; it sounds like the communication issue somehow correlates, although I haven't seen an issue like this before.

> Would you still like me to change the debug level? This is done in the
> slurm.conf file?

Yes, please. If you can capture logs with SlurmdDebug=debug3 while this happens and attach them, that should help isolate the issue.
Created attachment 3409 [details] Log from slurmd of node in reference to job delay
Hi Tim,

I have uploaded a log file (Log from slurmd of node in reference to job delay) with an excerpt from the node's log from when the job below was delayed. Below is the job information and the cutout of the delay from slurmctld.log.

[2016-08-15T22:11:49.447] Job 5776562 launch delayed by 120 secs, updating end_time
...
[root@server ~]# scontrol show job 5776562
JobId=5776562 ArrayJobId=5774707 ArrayTaskId=14 JobName=pt2_3DPreSTM_4_vels
   UserId=user(1234) GroupId=team(2113)
   Priority=150 Nice=0 Account=user QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:09:26 TimeLimit=01:48:00 TimeMin=N/A
   SubmitTime=2016-08-15T10:24:55 EligibleTime=2016-08-15T10:29:28
   StartTime=2016-08-15T22:09:49 EndTime=2016-08-15T23:59:49
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=idle AllocNode:Sid=lud55:27870
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node19 BatchHost=node19
   NumNodes=1 NumCPUs=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=110G MinTmpDiskNode=0
   Features=gpu Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   (...REMOVED...)

[root@lugy ~]# scontrol show node=node19
NodeName=node19 CoresPerSocket=6
   CPUAlloc=24 CPUErr=0 CPUTot=24 CPULoad=N/A
   Features=intel,nogpu,gpu,8gpu,tengig
   Gres=(null)
   NodeAddr=node19 NodeHostName=node19 Version=(null)
   RealMemory=129083 AllocMem=112640 Sockets=2 Boards=1
   MemSpecLimit=122939 State=ALLOCATED+POWER ThreadsPerCore=2 TmpDisk=500 Weight=1
   BootTime=None SlurmdStartTime=None
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
The logs help, although I haven't yet been able to sort out the exact cause. There's a communication failure right as the job should be scheduled, and then slurmctld waits until the ResumeTimeout expires to reconnect, which is unusual. One thing I'm noticing in the logs is some errors related to the SPANK plugin you're using. Would you be able to disable the plugin on a few nodes and see if that has any impact?
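For reference, SPANK plugins are listed one per line in plugstack.conf, so disabling one on a test node should just be a matter of commenting out its line there and restarting slurmd on that node. A sketch only; the plugin path below is a placeholder, not your actual plugin:

    # /d/sw/slurm/20160620_1411/etc/plugstack.conf
    # optional /path/to/site_plugin.so    <- commented out to disable (placeholder name)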
Tim, Quick update, I have disabled the spank plugin and am waiting on some logs. Will update soon. Thanks, Paul
Closing this as I'd expected a response back on the last message if there were further issues. Please reopen if there's still an active problem here. - Tim