Ticket 2516

Summary: slurmctld master network quiescing a few times per minute owing to slurm traffic
Product: Slurm Reporter: Doug Jacobsen <dmjacobsen>
Component: slurmctld Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: brian.gilmer, jsrinivasan, tlbutler
Version: 15.08.8   
Hardware: Cray XC   
OS: Linux   
Site: NERSC Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: edison CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurmctld logs from ctl1
edison slurm.conf

Description Doug Jacobsen 2016-03-04 06:09:51 MST
Hello,

This issue is somewhere between slurm and Cray ugni networking.  However, we did not see it before upgrading to slurm 15.08.8.

The issue is that a few times per minute all traffic on the ipogif0 interface (the one that talks to all the slurmds) will pause for about 7-14 seconds.  The LNet kernel module is reporting on this, though I don't think it is to blame.

I moved slurmctld from one network node (ctl1) to a new one (ctl2) on edison.  The issue was not observed on ctl2 until I moved the slurmctld service.  It is now happening.

I also have Cray case 132438 on this issue. I'm asking Cray for any information on tuning the ipogif0 interface to decrease these network pauses.

Interactively this shows up as a hang on ssh traffic (can't. type. anything.).  From the job perspective this might mean a job needs to resend a message if it times out trying to talk to the slurmctld.  scontrol/sinfo/sbatch/salloc commands etc. hang waiting to hear back from slurmctld.

Is there anything that changed in 15.08.8 that might be affecting this?



slurmctld logs have these kinds of messages during these times:
[2016-03-04T03:17:10.450] error: slurm_receive_msg: Socket timed out on send/recv operation
[2016-03-04T03:18:47.582] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-04T03:18:47.592] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-04T03:19:23.676] error: slurm_receive_msg: Socket timed out on send/recv operation
[2016-03-04T03:19:23.686] error: slurm_receive_msg: Socket timed out on send/recv operation
[2016-03-04T03:28:47.680] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-04T03:28:47.691] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-04T03:38:47.559] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-04T03:38:47.569] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-04T03:48:47.512] error: slurm_receive_msg: Zero Bytes were transmitted or received
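As an aside, the "Zero Bytes" errors above recur with a strikingly regular ten-minute spacing, which hints at some periodic slurmctld activity (which one is speculation). A quick sketch, feeding in the log timestamps above, makes the spacing explicit:

```python
# Parse the bracketed timestamps of the "Zero Bytes" bursts above and
# print the gaps between consecutive bursts (first error of each pair).
from datetime import datetime

lines = [
    "[2016-03-04T03:18:47.582] error: slurm_receive_msg: Zero Bytes were transmitted or received",
    "[2016-03-04T03:28:47.680] error: slurm_receive_msg: Zero Bytes were transmitted or received",
    "[2016-03-04T03:38:47.559] error: slurm_receive_msg: Zero Bytes were transmitted or received",
    "[2016-03-04T03:48:47.512] error: slurm_receive_msg: Zero Bytes were transmitted or received",
]
times = [datetime.fromisoformat(l[1:l.index("]")]) for l in lines]
gaps = [round((b - a).total_seconds()) for a, b in zip(times, times[1:])]
print(gaps)  # [600, 600, 600] -- one burst every ten minutes
```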


Console logs have:
2016-03-04T03:18:14.965022-08:00 c0-1c1s1n1 LNet: hardware quiesce flag: Waiting for 7 threads to pause
2016-03-04T03:18:16.994319-08:00 c0-1c1s1n1 LNet: Waiting for hardware quiesce flag to clear
2016-03-04T03:18:18.997424-08:00 c0-1c1s1n1 LNet: Waiting for hardware quiesce flag to clear
2016-03-04T03:18:21.982473-08:00 c0-1c1s1n1 LNet: Waiting for hardware quiesce flag to clear
2016-03-04T03:18:23.683994-08:00 c0-1c1s1n1 HWERR[16373]:0x4c22:SSID LREQ A_STATUS_BTE_SEQUENCE_TERMINATE Error:Info1=0xd5c00060c062:Info2=0x0:Info3=0x4bb
2016-03-04T03:18:23.709347-08:00 c0-1c1s1n1 HWERR[16374]:0x4e24:SSID RSP A_STATUS_MSG_PROTOCOL_ERR Error:Info1=0xd5cc00054064:Info2=0x0:Info3=0xd60a
2016-03-04T03:18:25.987181-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T03:19:26.964111-08:00 c0-1c1s1n1 LNet: hardware quiesce flag: Waiting for 7 threads to pause
2016-03-04T03:19:28.994435-08:00 c0-1c1s1n1 LNet: Waiting for hardware quiesce flag to clear
2016-03-04T03:19:30.998838-08:00 c0-1c1s1n1 LNet: Waiting for hardware quiesce flag to clear
2016-03-04T03:19:33.989763-08:00 c0-1c1s1n1 LNet: Waiting for hardware quiesce flag to clear
2016-03-04T03:19:34.812063-08:00 c0-1c1s1n1 HWERR[16377]:0x4c22:SSID LREQ A_STATUS_BTE_SEQUENCE_TERMINATE Error:Info1=0x32f800060c062:Info2=0x0:Info3=0x4bc
2016-03-04T03:19:34.837498-08:00 c0-1c1s1n1 HWERR[16378]:0x4e24:SSID RSP A_STATUS_MSG_PROTOCOL_ERR Error:Info1=0x32f8c00054064:Info2=0x0:Info3=0xd614


The LNet errors (and HWERRs) started on ctl2 (c4-1c0s0n1) after I moved slurmctld there, though they were not observed before.  Thus I suspect that we are somehow overflowing the capability of the network to deal with these connections.

I have tuned the sysctls on these nodes to have these properties:

# dmj, tune slurmctld network parameters
net.core.somaxconn = 8192
net.ipv4.tcp_max_syn_backlog = 8192



Is there any further tuning I could do to lighten the communication burden on this node?  Or other diagnostics that would be helpful to you?

This is on the edison system with 5586 compute nodes.

Thanks,
Doug


Current configuration:
nid02305:~ # scontrol show config
Configuration data as of 2016-03-04T13:08:16
AccountingStorageBackupHost = edique02
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost   = edique01
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = node,cpu,mem,energy,node
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/cray
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 1
AuthInfo                = (null)
AuthType                = auth/munge
BackupAddr              = (null)
BackupController        = (null)
BatchStartTimeout       = 60 sec
BOOT_TIME               = 2016-03-04T10:39:23
BurstBufferType         = (null)
CacheGroups             = 0
CheckpointType          = checkpoint/none
ChosLoc                 = (null)
ClusterName             = edison
CompleteWait            = 0 sec
ControlAddr             = nid02305
ControlMachine          = nid02305
CoreSpecPlugin          = core_spec/cray
CpuFreqDef              = OnDemand
CpuFreqGovernors        = OnDemand
CryptoType              = crypto/munge
DebugFlags              = Backfill
DefMemPerNode           = UNLIMITED
DisableRootJobs         = Yes
EioTimeout              = 60
EnforcePartLimits       = Yes
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FastSchedule            = 1
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = craynetwork
GroupUpdateForce        = 0
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 600 sec
JobAcctGatherFrequency  = 0
JobAcctGatherType       = jobacct_gather/cgroup
JobAcctGatherParams     = (null)
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/cncu
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend           = 0
JobRequeue              = 0
JobSubmitPlugins        = cray,lua
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 1
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 =
Licenses                = scratch1:1000000,scratch2:1000000,scratch3:1000000,dna:1000000,project:1000000,projecta:1000000,projectb:1000000
LicensesUsed            = projectb:0/1000000,projecta:0/1000000,project:0/1000000,dna:0/1000000,scratch3:0/1000000,scratch2:0/1000000,scratch1:0/1000000
MailProg                = /bin/mail
MaxArraySize            = 65000
MaxJobCount             = 200000
MaxJobId                = 2147418112
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 128
MemLimitEnforce         = Yes
MessageTimeout          = 60 sec
MinJobAge               = 300 sec
MpiDefault              = openmpi
MpiParams               = ports=63001-64000
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 214527
OverTimeLimit           = 0 min
PluginDir               = /opt/slurm/default/lib/slurm
PlugStackConfig         = /opt/slurm/etc/plugstack.conf
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = REQUEUE
PreemptType             = preempt/qos
PriorityParameters      = (null)
PriorityDecayHalfLife   = 8-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           =
PriorityMaxAge          = 21-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 30240
PriorityWeightFairShare = 1440
PriorityWeightJobSize   = 0
PriorityWeightPartition = 0
PriorityWeightQOS       = 38880
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cray
Prolog                  = /opt/slurm/etc/node_prolog.sh
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = (null)
SallocDefaultCommand    = srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --gres=craynetwork:0 --cpu_bind=none --mpi=none $SHELL
SchedulerParameters     = bf_window=2880,bf_resolution=120,bf_max_job_array_resv=20,default_queue_depth=400,bf_max_job_test=6000,bf_continue,bf_interval=150,bf_max_job_part=1000,bf_max_job_user=12,nohold_on_prolog_fail,kill_invalid_depend
SchedulerPort           = 7321
SchedulerRootFilter     = 1
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cray
SelectTypeParameters    = CR_SOCKET_MEMORY,OTHER_CONS_RES
SlurmUser               = root(0)
SlurmctldDebug          = debug
SlurmctldLogFile        = /var/tmp/slurm/slurmctld.log
SlurmctldPort           = 6817
SlurmctldTimeout        = 120 sec
SlurmdDebug             = info
SlurmdLogFile           = /var/spool/slurmd/%h.log
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPlugstack         = (null)
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /opt/slurm/etc/slurm.conf
SLURM_VERSION           = 15.08.8
SrunEpilog              = (null)
SrunPortRange           = 60001-63000
SrunProlog              = (null)
StateSaveLocation       = /global/syscom/sc/nsg/var/edison-slurm-state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/cray
TaskEpilog              = (null)
TaskPlugin              = task/cgroup,task/cray
TaskPluginParam         = (null type)
TaskProlog              = (null)
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 18
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec

Slurmctld(primary/backup) at nid02305/(NULL) are UP/DOWN
nid02305:~ #
Comment 1 Tim Wickberg 2016-03-04 06:28:49 MST
(In reply to Doug Jacobsen from comment #0)
> Hello,
> 
> This issue is somewhere between slurm and Cray ugni networking.  However, we
> did not see if before upgrading to slurm 15.08.8.
> 
> The issue is that a few times per minute all traffic on the ipogif0
> interface (the one that talks to all the slurmds) will pause for about 7-14
> seconds.  The LNet kernel module is reporting on this, though I don't think
> is to blame.
> 
> I moved slurmctld from one network node (ctl1) to a new one (ctl2) on
> edison.  The issue was not observed on ctl2 until I moved the slurmctld
> service.  It is now happening.
> 
> I also have Cray case 132438 on this issue. I'm asking Cray for any
> information on tuning the ipogif0 interface to decrease these network pauses.
> 
> Interactively this shows up as a hang on ssh traffic (cant. type.
> anything.).  From the job perspective this might mean a job needs to resend
> a message if it times out trying to talk to the slurmctld. 
> scontrol/sinfo/sbatch/salloc commands etc hang waiting to hear back from
> slurmctld.
> 
> Is there anything that changed in 15.08.8 that might be affecting this?

Can you elaborate on the timing on the slurm upgrade vs. the switch from ctl1 to ctl2?

Was 15.08.8 working okay on ctl1, and now that it's on ctl2 things are acting up?

Or was the upgrade to 15.08.8 done on ctl2 after the migration from ctl1?

I'm not seeing any changes that should have substantially changed the network traffic pattern for 15.08.8.

Did any of this correlate with a change to the Cray environment?

> I have tuned the sysctls on these nodes to have these properties:
> 
> # dmj, tune slurmctld network parameters
> net.core.somaxconn = 8192
> net.ipv4.tcp_max_syn_backlog = 8192
> 
> 
> 
> Is there any further tuning I could do to lighten the communication burden
> on this node?  or other diagnostics that would be helpful to you?
> 
> This is on the edison system with 5586 compute nodes.


A larger chunk of the log from around there would be nice; the connection timeouts aren't that useful aside from confirming that the network stopped working for us. I'd like to get a feel for what slurmctld may have been doing right before then, and whether that could have caused any issues.
Comment 2 Doug Jacobsen 2016-03-04 06:48:56 MST
Hello,

We upgraded to 15.08.8 last Saturday night following a planned power outage.  I don't recall seeing this issue prior to 15.08.8.  We ran for several days on ctl1 using 15.08.8 in this context, and I saw the quiesces through that entire time.  I suspected a hardware error on ctl1, so today I reconfigured slurm to use ctl2 as the primary; after moving the slurmctld service I started seeing the errors on the new server.

Checking the Cray logs for the wakeup message, I do see what seems to be a strong correlation: this is happening about every 30s:
2016-03-04T10:06:41.197125-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:07:38.163990-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:08:09.143595-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:09:21.155770-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:09:51.139347-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:11:03.186205-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:11:33.186136-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:12:45.165819-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:13:15.168049-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:14:27.159889-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:14:57.164990-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:16:09.169230-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:16:39.177083-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:17:51.164952-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:18:21.159589-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:18:51.160109-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:19:32.168868-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:20:02.164627-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:21:14.199460-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:21:44.163792-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:22:56.175142-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:23:26.195071-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:24:38.193241-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:25:08.172062-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:25:38.172016-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:26:12.177397-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:30:00.175082-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:30:35.193673-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:31:32.217260-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:32:02.205825-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
2016-03-04T10:34:41.227884-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting for 7 threads to wake up
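To quantify that spacing, here is a quick sketch (an illustration, not a tool we ran) that hand-copies a few of the timestamps above and computes the gaps between consecutive wakeups:

```python
# Gaps between consecutive "threads to wake up" events, from the log above.
from datetime import datetime

stamps = [
    "2016-03-04T10:08:09.143595",
    "2016-03-04T10:09:21.155770",
    "2016-03-04T10:09:51.139347",
    "2016-03-04T10:11:03.186205",
    "2016-03-04T10:11:33.186136",
]
times = [datetime.fromisoformat(s) for s in stamps]
gaps = [round((b - a).total_seconds()) for a, b in zip(times, times[1:])]
print(gaps)  # [72, 30, 72, 30]
```

Most gaps sit at ~30 s or ~72 s; the recurring ~30 s spacing is what suggests a 30-second periodic burst of traffic rather than random load.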




During the last boot (prior to this we were using 15.08.7 and possibly earlier), these error messages never occurred:
crayadm@edismw:/var/opt/cray/log/p0-20160203t195000> for i in $(ls console-20160*); do echo $i; grep c0-1c1s1n1 $i | grep "threads to wake up" | wc -l; done #"
console-20160203
0
console-20160204
0
console-20160205
0
console-20160206
0
console-20160207
0
console-20160208
0
console-20160209
0
console-20160210
0
console-20160211
0
console-20160212
0
console-20160213
0
console-20160214
0
console-20160215
0
console-20160216
0
console-20160217
0
console-20160218
0
console-20160219
0
console-20160220
0
console-20160221
0
console-20160222
0
console-20160223
0
console-20160224
0
console-20160225
0
console-20160226
0
console-20160227
0




In the most recent boot (which came up with 15.08.8), it occurs at a pretty constant rate:
crayadm@edismw:/var/opt/cray/log/p0-20160227t222722> for i in $(ls console-20160*); do echo $i; grep c0-1c1s1n1 $i | grep "threads to wake up" | wc -l; done #"
console-20160227
49
console-20160228
1740
console-20160229
1734
console-20160301
1730
console-20160302
1733
console-20160303
1734
console-20160304
758

I'll attach the slurmctld logs from ctl1 (before the move).  You'll notice at the end some difficulties failing over from ctl1 to ctl2 (after a reconfigure).  I'm concerned that there may be some problems with 15.08.8 failing over given the recent changes to the shared state dir protections, but I haven't investigated that fully yet, and it isn't my primary goal with this ticket =)

-Doug
Comment 3 Doug Jacobsen 2016-03-04 06:49:28 MST
Created attachment 2827 [details]
slurmctld logs from ctl1
Comment 4 Tim Wickberg 2016-03-04 07:32:29 MST
(In reply to Doug Jacobsen from comment #2)
> Hello,
> 
> We upgraded to 15.08.8 last Saturday night following a power outage
> (planned).  I don't recall seeing this issue prior to 15.08.8.  We ran for
> several days on ctl1 using 15.08.8 in this context and I saw the quiesces
> through that entire time.  I suspected a hardware error on clt1 and so
> today reconfigured slurm to use ctl2 case the primary and after moving the
> slurmctld service started seeing the errors on the new server.

Can you attach the current slurm.conf for edison? I only have cori's on hand at the moment.

> Checking the Cray logs for the wakeup message I do see that there seems to
> be a strong correlation that this is happening about every 30s:
> 2016-03-04T10:34:41.227884-08:00 c0-1c1s1n1 LNet: hardware quiesce: Waiting
> for 7 threads to wake up.

You don't happen to have a line like:

JobAcctGatherFrequency=

or have it commented out entirely? That's a 30-second default timer that I could see generating a large amount of traffic. I think messages for that would be debug2 or lower as well, which may explain the lack of evidence in the log.

> I'll attached the slurmctld logs from ctl1 (before the move).  You'll
> notice at the end some difficulties failing over from ctl1 to ctl2 (after a
> reconfigure).  I'm concerned that there may be some problems with 15.08.8
> failing over with the recent changes to the shared state dir protections,
> but I haven't investigated that fully yet, and it isn't my primary goal
> with this ticket =)

The safety check is a single file in the state directory with the ClusterName in it (in a bout of creativity, I set the file name as 'clustername'), failing over isn't impacted at all. If you're concerned about this it's always safe to remove that file - if it doesn't exist at slurmctld start time it'll just recreate it to protect the next launch. And older slurm versions will ignore it. There aren't any messages related to the check in the log.
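To make the check concrete, a minimal sketch of what that safety file looks like, using a throwaway directory (the real path would be the StateSaveLocation from the config dump, /global/syscom/sc/nsg/var/edison-slurm-state):

```python
# Sketch of the 'clustername' safety file in a scratch state directory.
import tempfile
from pathlib import Path

statedir = Path(tempfile.mkdtemp())
safety = statedir / "clustername"
safety.write_text("edison\n")                # what slurmctld records at startup
# At startup, slurmctld compares this against its own ClusterName:
assert safety.read_text().strip() == "edison"
safety.unlink()                              # safe to remove: recreated on next start
```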

This is suspicious; I believe the other slurmctld should drop the connection when you fail over:

[2016-03-04T10:35:58.012] error: _start_session: aeld connection restart exceed threshold, find and remove other program using the aeld socket, likely another slurmctld instance
Comment 5 Doug Jacobsen 2016-03-04 07:35:32 MST
dmj@edison06:~> cat /opt/slurm/etc/slurm.conf | grep JobAcc
JobAcctGatherFrequency=0
JobAcctGatherType=jobacct_gather/cgroup
dmj@edison06:~>


I'll attach the full slurm.conf
Comment 6 Doug Jacobsen 2016-03-04 07:41:41 MST
I did check to validate -- this is only happening with any frequency on the nodes where I'm running slurmctld; the other nodes show a basal rate matching (more or less) the count of global network quiesces we've had since boot, roughly 10, as opposed to the ~10,000 instances on the ctl nodes.
Comment 7 Doug Jacobsen 2016-03-04 07:42:34 MST
Created attachment 2828 [details]
edison slurm.conf
Comment 8 Tim Wickberg 2016-03-04 08:07:16 MST
Diff'ing them has been interesting. I'm still looking for other potential issues, but wanted to write down a few notes as I go through:

I don't think these are directly related to the current problem, but is there a reason that edison has the topology plugin disabled, while cori has tree? We'd expect dragonfly on a Cray, but admittedly you're the one with the machine. ;)

Cori doesn't have --cpu_bind=none in SallocDefaultCommand, not sure if you wanted those to stay in sync. And the day difference in PriorityDecayHalfLife seems like it may be unintentional.

Also unrelated - we heavily recommend against using jobacct_gather/cgroup. jobacct_gather/linux is much faster and gets almost identical data; the cgroup interface here is slow and doesn't add anything. If Cray set that by default, please let me know and I'll ask them to change it in the future.

Can you confirm that there aren't any rogue slurmctlds in the system somehow? There is a 30-second retry counter in one specific spot, although I doubt that's the issue here. Still, the large number of "error: _start_session: aeld connection restart exceed threshold, find and remove other program using the aeld socket, likely another slurmctld instance" messages at the end of the log has me concerned.

If those have cleared up, can you attach a newer chunk of the log file?
Comment 9 Doug Jacobsen 2016-03-04 08:40:35 MST
Hi Tim,

There is a major functionality difference between jobacct_gather/cgroup and jobacct_gather/linux, which is that with cgroup, resident memory is enforced, rather than the summation of the amounts of memory recorded in /proc/<pid>/status or wherever (which is what jobacct_gather/linux does).  We've had issues with some of our workload getting killed off because of improper summations performed during (presumably) fork/execs of large processes.

The --cpu_bind=none is on both cori and edison now, you'll just need an
updated file =)

Those issues at the end of the slurmctld log were from when I was trying to fail over.  At the time nid02030 or whatever was the backup (and was configured so), but neither were happy.

-Doug

On Fri, Mar 4, 2016 at 3:07 PM <bugs@schedmd.com> wrote:

> *Comment # 8 <https://bugs.schedmd.com/show_bug.cgi?id=2516#c8> on bug
> 2516 <https://bugs.schedmd.com/show_bug.cgi?id=2516> from Tim Wickberg
> <tim@schedmd.com> *
>
> Diff'ing them has been interesting, I'm still looking for other potential
> issues but wanted to write down a few notes as I go through:
>
> I don't think these are directly related to the current problem, but s there a
> reason that edison has the topology plugin disabled, while cori has tree? We'd
> expect dragonfly on a cray, but admittedly you're the one with the machine. ;)
>
> Cori doesn't have --cpu_bind=none in SallocDefaultCommand, not sure if you
> wanted those to stay in sync. And the day difference in PriorityDecayHalfLife
> seems like it may be unintentional.
>
> Also unrelated - we heavily recommend against using the jobacct_gather/cgroup.
> jobacct_gather/linux is much faster and gets almost identical data, the cgroup
> interface here is slow and doesn't add anything. If Cray's set that by default
> please let me know and I'll ask them to change it in the future.
>
> Can you confirm that there aren't any rogue slurmctlds in the system
> somehow? There is a 30-second retry counter in one specific spot, although
> I doubt that's the issue here. That said, the large number of "error:
> _start_session: aeld connection restart exceed threshold, find and remove
> other program using the aeld socket, likely another slurmctld instance"
> messages at the end of the log has me concerned.
>
> If those have cleared up, can you attach a newer chunk of the log file?
Comment 10 Brian F Gilmer 2016-03-07 00:12:19 MST
If a message is received from an unknown IP address then the network interface shuts down for a period of time.  We ran into this behavior on the first XC30 running Native-Slurm.  The description looks similar.  I will add more information when I find the bug.
Comment 11 Brian F Gilmer 2016-03-07 04:03:12 MST
Here is the description of the behavior.

"If TCP or UDP attempts to send a packet to an IP address that is within an HSN subnet, but for which there is no ARP entry, no error is generated immediately.  Instead, the destination layer 2 address is set to the address of the originating node (see the handling of IFF_NOARP in arp_constructor()) and the packet is transmitted.

I don't believe that Aries will actually loop back such a packet, but even if it did, the packet would be discarded at the IP layer because its destination IP address doesn't belong to the node.

The net effect is that packets sent to addresses for which there is no ARP entry go into a black hole.  In particular, a TCP connection attempt to such an address will not fail immediately, but only after many retransmissions of the SYN taking approximately 3 minutes.

CLE has behaved this way since Gemini.  What makes it more relevant now is that Slurm attempts to open TCP connections (to send ping RPCs) to nodes that are configured but may be down."

This is not a complete explanation, since IPoGNI does not use ARP in the usual way: the ARP tables are updated out of band.  When a node is added to or removed from the HSN, the network is stopped and a new ARP table is distributed.  What happened was that, as the system was brought up, the ARP table was constantly being updated.

The software was updated but this behavior was not fixed.
Comment 12 Brian F Gilmer 2016-03-07 07:52:32 MST
We are seeing ORB Timeouts every 30 seconds.  That means that an IP message is going to a down node.  Is there anything that polls at that interval?
Comment 13 Doug Jacobsen 2016-03-07 10:37:38 MST
Hello,

I have an update here.  I removed the slurm.conf references to nid04317 (c6-2c1s7n1) and nid03544 (c2-2c1s6n0).  Since I did this slurm stopped trying to communicate with those nodes and the issue went away.

Typically a node being down hasn't been an issue for us.  I'm guessing that slurm is just pinging these nodes periodically to see if they are available again.

Once I removed the references to those compute nodes, the problem cleared up.  I'm guessing there is something very wrong with the network interface on one of them.  I did not try to narrow it down using slurm.

I'll continue following this up in my Cray case.

Thanks for your help on this,
Doug
Comment 14 Brian F Gilmer 2016-03-07 22:49:54 MST
Doug,

That is known behavior; it has been around since Gemini.  Aries is a point-to-point network, and IPoGNI was built on the assumption that ALPS would confine communication to nodes that were available, so there was no expectation of any application trying to communicate with a node that was not available.  The ORB Timeout initiates a scrub of the ARP tables (not what you think), which requires the network to stop while this is happening; the assumption was that this would happen rarely, only when a node went off-line.  Since Slurm assumes a real TCP/IP network, it happily sends messages to any node that is configured in slurm.conf.  The other confounding bit is that, since Aries is point to point, there are no routers, so the sender never gets network error messages like "network cannot be reached".  That leaves the application to rely on timeouts.

Brian
Comment 15 Tim Wickberg 2016-03-08 00:22:26 MST
Here's what Slurm does today:

        if (slurmctld_conf.slurmd_timeout) {
                /* We ping nodes that haven't responded in SlurmdTimeout/3,
                 * but need to do the test at a higher frequency or we might
                 * DOWN nodes with times that fall in the gap. */
                ping_interval = slurmctld_conf.slurmd_timeout / 3;
        } else {
                /* This will just ping non-responding nodes
                 * and restore them to service */
                ping_interval = 100;    /* 100 seconds */
        }


...

                if (((difftime(now, last_ping_node_time) >= ping_interval) ||
                     ping_nodes_now) && is_ping_done()) {
                        now = time(NULL);
                        last_ping_node_time = now;
                        ping_nodes_now = false;
                        lock_slurmctld(node_write_lock);
                        ping_nodes();
                        unlock_slurmctld(node_write_lock);
                }


Every 100 seconds Slurm is attempting to reconnect to any missing nodes to restore them to service. (The 'scontrol ping' command can also force it to start a ping cycle.)

It sounds like the node ping function is counter-productive on native cray, since the missing nodes will cause IPoGNI to flush out tables.

I believe we could disable node_ping() without any significant side effects - nodes returning to service should register themselves with slurmctld when slurmd starts on the node - although there may be other subsystems that assume communicating with down nodes won't stall the network, and those may also need to be addressed. I'll discuss this further internally.
Comment 16 Tim Wickberg 2016-07-08 12:40:56 MDT
Doug -

Is this still an active concern? Our internal discussion never really progressed anywhere; we don't really want to disable the node registration by default but could still explore hiding the behavior behind an #ifndef HAVE_CRAY if necessary.
Comment 17 Doug Jacobsen 2016-07-08 13:03:47 MDT
Well, actually there is something we should probably do.  I found out more
about what the issue is.

Apparently when a node goes down in a Cray system, a signal goes out to all
the nodes to remove (or at least invalidate) that node in the ARP table of
every single node on the system.  This is done because if a node attempts
to communicate with a down/failed Aries interface, it gets put in the
penalty box for that infraction, in an attempt to keep the HSN stable.

There are apparently rare cases, like the one documented here, where that
ARP update didn't happen.  Apparently the solution is to make some librca
call to get the list of currently up nodes prior to performing any
wide-scale ping.

I'll try to track down my source again to find out what that call is.  But
I think, based on this, that the correct course of action for slurm would
be to periodically check that the RCA or HSS system is still happy with the
node prior to the ping.

-Doug



Comment 18 Tim Wickberg 2016-11-09 14:49:27 MST
Doug - 

I'm marking this closed. If you'd like, I can break out a separate enhancement for ways to avoid communication issues within the Cray environment by using librca or something similar.