Hello,

This issue is somewhere between slurm and Cray ugni networking. However, we did not see it before upgrading to slurm 15.08.8.

The issue is that a few times per minute all traffic on the ipogif0 interface (the one that talks to all the slurmds) will pause for about 7-14 seconds. The LNet kernel module is reporting on this, though I don't think it is to blame.

I moved slurmctld from one network node (ctl1) to a new one (ctl2) on edison. The issue was not observed on ctl2 until I moved the slurmctld service. It is now happening.

I also have Cray case 132438 on this issue. I'm asking Cray for any information on tuning the ipogif0 interface to decrease these network pauses.

Interactively this shows up as a hang on ssh traffic (can't. type. anything.). From the job perspective this might mean a job needs to resend a message if it times out trying to talk to the slurmctld. scontrol/sinfo/sbatch/salloc commands, etc., hang waiting to hear back from slurmctld.

Is there anything that changed in 15.08.8 that might be affecting this?

slurmctld logs have these kinds of messages during these times:

[2016-03-04T03:17:10.450] error: slurm_receive_msg: Socket timed out on send/recv operation
[2016-03-04T03:18:47.582] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-04T03:18:47.592] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-04T03:19:23.676] error: slurm_receive_msg: Socket timed out on send/recv operation
[2016-03-04T03:19:23.686] error: slurm_receive_msg: Socket timed out on send/recv operation
[2016-03-04T03:28:47.680] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-04T03:28:47.691] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-04T03:38:47.559] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-04T03:38:47.569] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-04T03:48:47.512] error: slurm_receive_msg: Zero Bytes were transmitted or received

Console logs have:

2016-03-04T03:18:14.965022-08:00 c0-1c1s1n1 LNet: hardware quiesce flag: Waiting for 7 threads to pause
2016-03-04T03:18:16.994319-08:00 c0-1c1s1n1 LNet: Waiting for hardware quiesce flag to clear
2016-03-04T03:18:18.997424-08:00 c0-1c1s1n1 LNet: Waiting for hardware quiesce flag to clear
2016-03-04T03:18:21.982473-08:00 c0-1c1s1n1 LNet: Waiting for hardware quiesce flag to clear
2016-03-04T03:18:23.683994-08:00 c0-1c1s1n1 HWERR[16373]:0x4c22:SSID LREQ A_STATUS_BTE_SEQUENCE_TERMINATE Error:Info1=0xd5c00060c062:Info2=0x0:Info3=0x4bb
2016-03-04T03:18:23.709347-08:00 c0-1c1s1n1 HWERR[16374]:0x4e24:SSID RSP A_STATUS_MSG_PROTOCOL_ERR Error:Info1=0xd5cc00054064:Info2=0x0:Info3=0xd60a
2016-03-04T03:18:25.987181-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T03:19:26.964111-08:00 c0-1c1s1n1 LNet: hardware quiesce flag: Waiting for 7 threads to pause
2016-03-04T03:19:28.994435-08:00 c0-1c1s1n1 LNet: Waiting for hardware quiesce flag to clear
2016-03-04T03:19:30.998838-08:00 c0-1c1s1n1 LNet: Waiting for hardware quiesce flag to clear
2016-03-04T03:19:33.989763-08:00 c0-1c1s1n1 LNet: Waiting for hardware quiesce flag to clear
2016-03-04T03:19:34.812063-08:00 c0-1c1s1n1 HWERR[16377]:0x4c22:SSID LREQ A_STATUS_BTE_SEQUENCE_TERMINATE Error:Info1=0x32f800060c062:Info2=0x0:Info3=0x4bc
2016-03-04T03:19:34.837498-08:00 c0-1c1s1n1 HWERR[16378]:0x4e24:SSID RSP A_STATUS_MSG_PROTOCOL_ERR Error:Info1=0x32f8c00054064:Info2=0x0:Info3=0xd614

The LNet errors (and HWERRS) started on ctl2 (c4-1c0s0n1) after I moved slurmctld there, though they were not observed there before.
Thus I suspect that we are somehow overflowing the capability of the network to deal with these connections.

I have tuned the sysctls on these nodes to have these properties:

# dmj, tune slurmctld network parameters
net.core.somaxconn = 8192
net.ipv4.tcp_max_syn_backlog = 8192

Is there any further tuning I could do to lighten the communication burden on this node, or other diagnostics that would be helpful to you?

This is on the edison system with 5586 compute nodes.

Thanks,
Doug

Current configuration:

nid02305:~ # scontrol show config
Configuration data as of 2016-03-04T13:08:16
AccountingStorageBackupHost = edique02
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost = edique01
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = node,cpu,mem,energy,node
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/cray
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 1
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 60 sec
BOOT_TIME = 2016-03-04T10:39:23
BurstBufferType = (null)
CacheGroups = 0
CheckpointType = checkpoint/none
ChosLoc = (null)
ClusterName = edison
CompleteWait = 0 sec
ControlAddr = nid02305
ControlMachine = nid02305
CoreSpecPlugin = core_spec/cray
CpuFreqDef = OnDemand
CpuFreqGovernors = OnDemand
CryptoType = crypto/munge
DebugFlags = Backfill
DefMemPerNode = UNLIMITED
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = Yes
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FastSchedule = 1
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = craynetwork
GroupUpdateForce = 0
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 600 sec
JobAcctGatherFrequency = 0
JobAcctGatherType = jobacct_gather/cgroup
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/cncu
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 0
JobSubmitPlugins = cray,lua
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 1
KillWait = 30 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = scratch1:1000000,scratch2:1000000,scratch3:1000000,dna:1000000,project:1000000,projecta:1000000,projectb:1000000
LicensesUsed = projectb:0/1000000,projecta:0/1000000,project:0/1000000,dna:0/1000000,scratch3:0/1000000,scratch2:0/1000000,scratch1:0/1000000
MailProg = /bin/mail
MaxArraySize = 65000
MaxJobCount = 200000
MaxJobId = 2147418112
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 128
MemLimitEnforce = Yes
MessageTimeout = 60 sec
MinJobAge = 300 sec
MpiDefault = openmpi
MpiParams = ports=63001-64000
MsgAggregationParams = (null)
NEXT_JOB_ID = 214527
OverTimeLimit = 0 min
PluginDir = /opt/slurm/default/lib/slurm
PlugStackConfig = /opt/slurm/etc/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = REQUEUE
PreemptType = preempt/qos
PriorityParameters = (null)
PriorityDecayHalfLife = 8-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 21-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 30240
PriorityWeightFairShare = 1440
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 38880
PriorityWeightTRES = (null)
PrivateData = none
ProctrackType = proctrack/cray
Prolog = /opt/slurm/etc/node_prolog.sh
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = Alloc
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
RoutePlugin = (null)
SallocDefaultCommand = srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --gres=craynetwork:0 --cpu_bind=none --mpi=none $SHELL
SchedulerParameters = bf_window=2880,bf_resolution=120,bf_max_job_array_resv=20,default_queue_depth=400,bf_max_job_test=6000,bf_continue,bf_interval=150,bf_max_job_part=1000,bf_max_job_user=12,nohold_on_prolog_fail,kill_invalid_depend
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cray
SelectTypeParameters = CR_SOCKET_MEMORY,OTHER_CONS_RES
SlurmUser = root(0)
SlurmctldDebug = debug
SlurmctldLogFile = /var/tmp/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = info
SlurmdLogFile = /var/spool/slurmd/%h.log
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPlugstack = (null)
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /opt/slurm/etc/slurm.conf
SLURM_VERSION = 15.08.8
SrunEpilog = (null)
SrunPortRange = 60001-63000
SrunProlog = (null)
StateSaveLocation = /global/syscom/sc/nsg/var/edison-slurm-state
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/cray
TaskEpilog = (null)
TaskPlugin = task/cgroup,task/cray
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 18
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec

Slurmctld(primary/backup) at nid02305/(NULL) are UP/DOWN
nid02305:~ #
(In reply to Doug Jacobsen from comment #0)
> Hello,
>
> This issue is somewhere between slurm and Cray ugni networking. However, we
> did not see if before upgrading to slurm 15.08.8.
>
> The issue is that a few times per minute all traffic on the ipogif0
> interface (the one that talks to all the slurmds) will pause for about 7-14
> seconds. The LNet kernel module is reporting on this, though I don't think
> is to blame.
>
> I moved slurmctld from one network node (ctl1) to a new one (ctl2) on
> edison. The issue was not observed on ctl2 until I moved the slurmctld
> service. It is now happening.
>
> I also have Cray case 132438 on this issue. I'm asking Cray for any
> information on tuning the ipogif0 interface to decrease these network pauses.
>
> Interactively this shows up as a hang on ssh traffic (cant. type.
> anything.). From the job perspective this might mean a job needs to resend
> a message if it times out trying to talk to the slurmctld.
> scontrol/sinfo/sbatch/salloc commands etc hang waiting to hear back from
> slurmctld.
>
> Is there anything that changed in 15.08.8 that might be affecting this?

Can you elaborate on the timing on the slurm upgrade vs. the switch from ctl1 to ctl2?

Was 15.08.8 working okay on ctl1, and now that it's on ctl2 things are acting up?

Or was the upgrade to 15.08.8 done on ctl2 after the migration from ctl1?

I'm not seeing any changes that should have substantially changed the network traffic pattern for 15.08.8.

Did any of this correlate with a change to the Cray environment?

> I have tuned the sysctls on these nodes to have these properties:
>
> # dmj, tune slurmctld network parameters
> net.core.somaxconn = 8192
> net.ipv4.tcp_max_syn_backlog = 8192
>
> Is there any further tuning I could do to lighten the communication burden
> on this node? or other diagnostics that would be helpful to you?
>
> This is on the edison system with 5586 compute nodes.

A larger chunk of the log from around there would be nice, the connection timeouts aren't that useful aside from confirming that the network stopped working for us. I'd like to get a feel for what slurmctld may have been doing right before then and if that could have caused any issues.
Hello,

We upgraded to 15.08.8 last Saturday night following a power outage (planned). I don't recall seeing this issue prior to 15.08.8. We ran for several days on ctl1 using 15.08.8 in this context and I saw the quiesces through that entire time. I suspected a hardware error on ctl1, so today I reconfigured slurm to use ctl2 as the primary, and after moving the slurmctld service I started seeing the errors on the new server.

Checking the Cray logs for the wakeup message, I do see that there seems to be a strong correlation that this is happening about every 30s:

2016-03-04T10:06:41.197125-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:07:38.163990-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:08:09.143595-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:09:21.155770-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:09:51.139347-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:11:03.186205-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:11:33.186136-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:12:45.165819-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:13:15.168049-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:14:27.159889-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:14:57.164990-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:16:09.169230-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:16:39.177083-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:17:51.164952-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:18:21.159589-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:18:51.160109-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:19:32.168868-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:20:02.164627-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:21:14.199460-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:21:44.163792-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:22:56.175142-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:23:26.195071-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:24:38.193241-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:25:08.172062-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:25:38.172016-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:26:12.177397-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:30:00.175082-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:30:35.193673-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:31:32.217260-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:32:02.205825-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:34:41.227884-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up

During the last boot (prior to this we were using 15.08.7 and possibly earlier), these error messages never occurred:

crayadm@edismw:/var/opt/cray/log/p0-20160203t195000> for i in $(ls console-20160*); do echo $i; grep c0-1c1s1n1 $i | grep "threads to wake up" | wc -l; done
console-20160203
0
console-20160204
0
console-20160205
0
console-20160206
0
console-20160207
0
console-20160208
0
console-20160209
0
console-20160210
0
console-20160211
0
console-20160212
0
console-20160213
0
console-20160214
0
console-20160215
0
console-20160216
0
console-20160217
0
console-20160218
0
console-20160219
0
console-20160220
0
console-20160221
0
console-20160222
0
console-20160223
0
console-20160224
0
console-20160225
0
console-20160226
0
console-20160227
0

In the most recent boot (which came up with 15.08.8), it occurs at a pretty constant rate:

crayadm@edismw:/var/opt/cray/log/p0-20160227t222722> for i in $(ls console-20160*); do echo $i; grep c0-1c1s1n1 $i | grep "threads to wake up" | wc -l; done
console-20160227
49
console-20160228
1740
console-20160229
1734
console-20160301
1730
console-20160302
1733
console-20160303
1734
console-20160304
758

I'll attach the slurmctld logs from ctl1 (before the move). You'll notice at the end some difficulties failing over from ctl1 to ctl2 (after a reconfigure). I'm concerned that there may be some problems with 15.08.8 failing over with the recent changes to the shared state dir protections, but I haven't investigated that fully yet, and it isn't my primary goal with this ticket =)

-Doug
Created attachment 2827 [details] slurmctld logs from ctl1
(In reply to Doug Jacobsen from comment #2)
> Hello,
>
> We upgraded to 15.08.8 last Saturday night following a power outage
> (planned). I don't recall seeing this issue prior to 15.08.8. We ran for
> several days on ctl1 using 15.08.8 in this context and I saw the quiesces
> through that entire time. I suspected a hardware error on clt1 and so
> today reconfigured slurm to use ctl2 case the primary and after moving the
> slurmctld service started seeing the errors on the new server.

Can you attach the current slurm.conf for edison? I only have cori on hand at the moment.

> Checking the Cray logs for the wakeup message I do see that there seems to
> be a strong correlation that this is happening about every 30s:
> 2016-03-04T10:34:41.227884-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting
> for 7 threads to wake up.

You don't happen to have a line like:

JobAcctGatherFrequency=

or have it commented out entirely? That's a 30-second default timer that I could see generating a large amount of traffic. I think messages for that would be debug2 or lower as well, which may explain the lack of evidence in the log.

> I'll attached the slurmctld logs from ctl1 (before the move). You'll
> notice at the end some difficulties failing over from ctl1 to ctl2 (after a
> reconfigure). I'm concerned that there may be some problems with 15.08.8
> failing over with the recent changes to the shared state dir protections,
> but I haven't investigated that fully yet, and it isn't my primary goal
> with this ticket =)

The safety check is a single file in the state directory with the ClusterName in it (in a bout of creativity, I set the file name as 'clustername'), so failing over isn't impacted at all. If you're concerned about this it's always safe to remove that file - if it doesn't exist at slurmctld start time it'll just recreate it to protect the next launch. And older slurm versions will ignore it. There aren't any messages related to the check in the log.

This is suspicious, though; I believe the other slurmctld should drop the connection when you fail over:

[2016-03-04T10:35:58.012] error: _start_session: aeld connection restart exceed threshold, find and remove other program using the aeld socket, likely another slurmctld instance
dmj@edison06:~> cat /opt/slurm/etc/slurm.conf | grep JobAcc
JobAcctGatherFrequency=0
JobAcctGatherType=jobacct_gather/cgroup
dmj@edison06:~>

I'll attach the full slurm.conf
I did check to validate -- this is only happening with any frequency on the nodes where I'm running slurmctld; the other nodes have a basal rate of this message matching (more or less) the count of global network quiesces we've had since boot (ca. 10), as opposed to the 10,000 instances on the ctl nodes.
Created attachment 2828 [details] edison slurm.conf
Diff'ing them has been interesting. I'm still looking for other potential issues, but wanted to write down a few notes as I go through:

I don't think these are directly related to the current problem, but is there a reason that edison has the topology plugin disabled, while cori has tree? We'd expect dragonfly on a cray, but admittedly you're the one with the machine. ;)

Cori doesn't have --cpu_bind=none in SallocDefaultCommand, not sure if you wanted those to stay in sync. And the day difference in PriorityDecayHalfLife seems like it may be unintentional.

Also unrelated - we heavily recommend against using jobacct_gather/cgroup. jobacct_gather/linux is much faster and gets almost identical data; the cgroup interface here is slow and doesn't add anything. If Cray's set that by default please let me know and I'll ask them to change it in the future.

Can you confirm that there aren't any rogue slurmctlds in the system somehow? There is a 30-second retry counter in one specific spot, although I doubt that's the issue here. Although the large number of "error: _start_session: aeld connection restart exceed threshold, find and remove other program using the aeld socket, likely another slurmctld instance" messages at the end of the log has me concerned.

If those have cleared up, can you attach a newer chunk of the log file?
Hi Tim,

There is a major functionality difference between jobacct_gather/cgroup and jobacct_gather/linux: with cgroup, resident memory is enforced, rather than the summation of the amounts of memory recorded in /proc/<pid>/status or whatever (which is what jobacct_gather/linux does). We've had issues with some of our workload getting killed off because of improper summations performed during (presumably) fork/execs of large processes.

The --cpu_bind=none is on both cori and edison now, you'll just need an updated file =)

Those issues at the end of the slurmctld log were from when I was trying to fail over. At the time nid02030 or whatever was the backup (and was configured as such), but neither was happy.

-Doug
If a message is received from an unknown IP address then the network interface shuts down for a period of time. We ran into this behavior on the first XC30 running Native-Slurm. The description looks similar. I will add more information when I find the bug.
Here is the description of the behavior:

"If TCP or UDP attempts to send a packet to an IP address that is within an HSN subnet, but for which there is no ARP entry, no error is generated immediately. Instead, the destination layer 2 address is set to the address of the originating node (see the handling of IFF_NOARP in arp_constructor()) and the packet is transmitted. I don't believe that Aries will actually loop back such a packet, but even if it did, the packet would be discarded at the IP layer because its destination IP address doesn't belong to the node. The net effect is that packets sent to addresses for which there is no ARP entry go into a black hole. In particular, a TCP connection attempt to such an address will not fail immediately, but only after many retransmissions of the SYN, taking approximately 3 minutes. CLE has behaved this way since Gemini. What makes it more relevant now is that Slurm attempts to open TCP connections (to send ping RPCs) to nodes that are configured but may be down."

This is not a completely clear explanation, since IPoGNI does not use ARP in the conventional way: the ARP tables are updated out of band. When a node is added to or removed from the HSN, the network is stopped and a new ARP table is distributed. What happened was that as the system was brought up, the ARP table was constantly being updated. The software was updated but this behavior was not fixed.
We are seeing ORB Timeouts every 30 seconds. That means that an IP message is going to a down node. Is there anything that polls at that interval?
Hello,

I have an update here. I removed the slurm.conf references to nid04317 (c6-2c1s7n1) and nid03544 (c2-2c1s6n0). Since I did this, slurm stopped trying to communicate with those nodes and the issue went away.

Typically a node being down hasn't been an issue for us. I'm guessing that slurm is just pinging these nodes periodically to see if they are available again. Once I removed the references to the compute nodes, the problem cleared up. I'm guessing there is something very bad going on with the network interface on one of them; I did not try to narrow it down using slurm. I'll continue following this up in my Cray case.

Thanks for your help on this,
Doug
Doug,

That is known behavior; it has been around since Gemini. Aries is a point-to-point network. IPoGNI was built on the assumption that ALPS would confine communication to only nodes that were available, so there was no expectation of any application trying to communicate with a node that was not available. The ORB Timeout initiates a scrub of the ARP tables (not what you think), which requires the network to stop while this is happening. The assumption was that this would happen rarely, only when a node went off-line.

Since Slurm assumes a real TCP/IP network, it happily sends messages to any node that is configured in slurm.conf. The other confounding bit is that, since Aries is point-to-point, there are no routers, so the sender never gets network error messages like "network cannot be reached". That leaves the application to rely on timeouts.

Brian
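For illustration only: with no router to return a "network cannot be reached" error, the usual client-side defense is a non-blocking connect() bounded by an explicit timeout, so a SYN black hole costs seconds rather than the roughly 3 minutes of kernel SYN retransmissions described above. A minimal POSIX sketch (not Slurm code; Slurm's own message layer already bounds this with MessageTimeout):

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>

/* Returns 0 on success, -1 on error or timeout. 'addr' is a prepared
 * sockaddr for the target node; timeout_ms caps how long we wait. */
int connect_with_timeout(int fd, const struct sockaddr *addr,
			 socklen_t addrlen, int timeout_ms)
{
	int flags = fcntl(fd, F_GETFL, 0);
	fcntl(fd, F_SETFL, flags | O_NONBLOCK);

	if (connect(fd, addr, addrlen) == 0)
		return 0;		/* connected immediately */
	if (errno != EINPROGRESS)
		return -1;		/* hard failure */

	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	if (poll(&pfd, 1, timeout_ms) <= 0)
		return -1;		/* timed out: likely a SYN black hole */

	int err = 0;
	socklen_t len = sizeof(err);
	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err)
		return -1;		/* connection failed */
	return 0;
}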
Here's what Slurm does today:

if (slurmctld_conf.slurmd_timeout) {
	/* We ping nodes that haven't responded in SlurmdTimeout/3,
	 * but need to do the test at a higher frequency or we might
	 * DOWN nodes with times that fall in the gap. */
	ping_interval = slurmctld_conf.slurmd_timeout / 3;
} else {
	/* This will just ping non-responding nodes
	 * and restore them to service */
	ping_interval = 100;	/* 100 seconds */
}
...
if (((difftime(now, last_ping_node_time) >= ping_interval) ||
     ping_nodes_now) && is_ping_done()) {
	now = time(NULL);
	last_ping_node_time = now;
	ping_nodes_now = false;
	lock_slurmctld(node_write_lock);
	ping_nodes();
	unlock_slurmctld(node_write_lock);
}

Every 100 seconds Slurm is attempting to reconnect to any missing nodes to restore them to service. (The 'scontrol ping' command can also force it to start a ping cycle.)

It sounds like the node ping function is counter-productive on native cray, since the missing nodes will cause IPoGNI to flush out tables. I believe we could disable node_ping() without any significant side effects - nodes returning to service should register themselves with slurmctld when slurmd starts on the node - although there may be some other subsystems that assume that communicating with down nodes won't stall the network that may also need to be addressed.

I'll discuss this further internally.
Doug - Is this still an active concern? Our internal discussion never really progressed anywhere; we don't really want to disable the node registration by default but could still explore hiding the behavior behind an #ifndef HAVE_CRAY if necessary.
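To make that concrete, here is a minimal sketch of what such a guard could look like, reusing the ping excerpt quoted earlier in this ticket. Whether the correct build symbol is HAVE_CRAY or HAVE_NATIVE_CRAY is an assumption here, and this is an illustration rather than a committed change:

#ifndef HAVE_NATIVE_CRAY	/* assumed symbol; see note above */
	if (((difftime(now, last_ping_node_time) >= ping_interval) ||
	     ping_nodes_now) && is_ping_done()) {
		now = time(NULL);
		last_ping_node_time = now;
		ping_nodes_now = false;
		lock_slurmctld(node_write_lock);
		ping_nodes();
		unlock_slurmctld(node_write_lock);
	}
#endif	/* !HAVE_NATIVE_CRAY */

This would compile the periodic ping out entirely on native Cray builds, relying on slurmd registration to return nodes to service as described above.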
Well, actually there is something we should probably do. I found out more about what the issue is: apparently, when a node goes down in a Cray system, a signal goes out to all the nodes to remove (or at least invalidate) that node in the ARP table of every single node on the system. This is done because if a node attempts to communicate with a down/failed Aries interface, it gets put in the penalty box for that infraction in an attempt to keep the HSN stable. There are apparently rare cases, like the one documented here, where that ARP update didn't happen.

Apparently the solution is to make some librca call to get the list of currently up nodes prior to performing any wide-scale ping. I'll try to track down my source again to find out what that call is. But I think, based on this, that the correct course of action for slurm would be to periodically check to make sure the rca or hss system is still happy with the node prior to the ping.

-Doug
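As a rough illustration of that idea only: the actual librca call is not identified in this ticket, so rca_node_is_up() and send_ping_rpc() below are hypothetical placeholders rather than real librca or Slurm functions. The shape of the check could be:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper around whatever librca/HSS call reports node health. */
extern bool rca_node_is_up(uint32_t nid);

/* Hypothetical stand-in for Slurm's per-node ping/registration request. */
extern void send_ping_rpc(uint32_t nid);

/* Ping only the nodes the HSS/RCA layer reports as up, so slurmctld never
 * opens a TCP connection toward a dead Aries endpoint and triggers the
 * ARP-scrub quiesce described earlier in this ticket. */
void ping_up_nodes(const uint32_t *nids, size_t nid_cnt)
{
	for (size_t i = 0; i < nid_cnt; i++) {
		if (!rca_node_is_up(nids[i]))
			continue;	/* skip: contacting it would stall the HSN */
		send_ping_rpc(nids[i]);
	}
}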
Doug - I'm marking this closed. If you'd like, I can break out a separate enhancement for ways to avoid communication issues within the Cray environment by using librca or something similar.