| Summary: | slurmctld unresponsive, sched thread(s) stuck | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Andrew Elwell <Andrew.Elwell> |
| Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | CC: | cinek, marshall, wdavey |
| Version: | 20.02.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9248 | | |
| Site: | Pawsey | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | CentOS | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | yesterday's logfile, today's logfile (many restarts) | | |
Created attachment 15162 [details]
today's logfile (many restarts)

Looking at the logs, there are a number of messages like:

```
[2020-07-23T23:42:59.854] error: slurm_unpack_received_msg: Message receive failure
[2020-07-23T23:42:59.856] error: unpack_header: protocol_version 7680 not supported
[2020-07-23T23:42:59.856] error: unpacking header
```

Protocol version 7680 corresponds to Slurm 16.05. Based on the logs, the old version may be running on:
146.118.80.201
146.118.80.202
146.118.80.203
146.118.80.204
146.118.80.205
146.118.80.206
146.118.80.207
146.118.80.208
146.118.80.209
146.118.80.210
146.118.80.211
146.118.80.212
146.118.80.213
146.118.80.214
146.118.80.215
Could you please check that?
cheers,
Marcin
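As an aside, the unsupported `protocol_version 7680` in the error above can be decoded mechanically: the high byte is the release's protocol major number. A minimal sketch of that decoding; the release mapping is an assumption based on the `(N << 8) | 0` scheme used in the Slurm headers (e.g. `SLURM_16_05_PROTOCOL_VERSION` is `(30 << 8) | 0`):

```python
# Decode a Slurm wire protocol_version into major/minor and, where
# known, the corresponding release. The RELEASES table is an
# assumption based on the (N << 8) | 0 versioning scheme in the
# Slurm source headers.
RELEASES = {
    30: "16.05",
    31: "17.02",
    32: "17.11",
    33: "18.08",
    34: "19.05",
    35: "20.02",
}

def decode_protocol_version(version: int) -> str:
    major, minor = version >> 8, version & 0xFF  # high byte = protocol major
    release = RELEASES.get(major, "unknown")
    return f"protocol {major}.{minor} -> Slurm {release}"

print(decode_protocol_version(7680))
```

So 7680 is `30 << 8`, i.e. the 16.05 protocol, which is how the stray old slurmds were identified.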
Yes, they were old slurmds that were assumed to be turned off.

Did the issue disappear when you turned those off?

cheers,
Marcin

No. They were stopped (and the corresponding node and partition info removed from the config) without any improvement. The only way we could start slurmctld and keep it responsive was by moving the state files out of the way. (We still have these if they are any help.)

If it happens again, please take a core dump of the running slurmctld with a command like:

```
# gcore $(pidof slurmctld)
```

then load the result into gdb and share a backtrace with us:

```
# gdb /path/to/slurmctld /path/to/core.XXX
(gdb) t a a bt f
```

I'm looking into the logs and code to find a potential root cause.

cheers,
Marcin

Andrew, I'm taking over for Marcin since it looks like something I've worked on. Just to be clear, is the cluster running smoothly again? From your comments it sounds like you cleared the state files (and sadly lost the job queue) to get the cluster going. Is that correct?

I strongly suspect that this is a duplicate of bug 9248. There was a regression that caused an infinite loop when someone requests --distribution=plane, but someone can accidentally do that with syntax like this:

```
-m=8gb
```

So someone trying to request memory might use this bad syntax, actually request --distribution=plane, and cause an infinite loop in slurmctld. This infinite loop is in the sched task, which also makes me think that's what happened.

Commit eb79918f is the fix for the regression causing the infinite loop. Commit 6beb50bc fixes the parsing of --distribution so this bad syntax is rejected instead of accepted.

You should be able to view the job scripts in the state save location to see if anyone used the -m option in their batch job script. If they used it on the command line and not in their batch job script, you sadly won't be able to see whether they used it.
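To illustrate the failure mode described above, here is a purely hypothetical sketch (this is not Slurm's actual parser): getopt-style short options accept `-m=8gb`, handing the option handler the string `=8gb`, and a parser that keys off the `=` without validating the token before it reads that as a plane distribution:

```python
# Hypothetical sketch of the misparse: "-m" is short for
# --distribution, so "-m=8gb" delivers the argument "=8gb". A lenient
# parser that treats any "key=value" form as a plane request (the
# valid form is "plane=<size>") accepts the garbage.
def lenient_parse_distribution(optarg: str):
    if "=" in optarg:
        _token, _, size = optarg.partition("=")
        # Mimic C's atoi(): consume leading digits, ignore the rest,
        # so "8gb" becomes 8.
        digits = ""
        for ch in size:
            if not ch.isdigit():
                break
            digits += ch
        return ("plane", int(digits) if digits else 0)
    return (optarg, None)

print(lenient_parse_distribution("=8gb"))     # misread as a plane request
print(lenient_parse_distribution("plane=1"))  # the legitimate form
```

The fix referenced above (commit 6beb50bc) is described as rejecting this bad syntax instead of accepting it.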
The job scripts are located in:

```
/path/to/state/save/location/hash.<0-9>/script
```

Can you do a search for -m or --distribution in those job scripts?

If it does happen again, like Marcin said in comment 6, we'll need you to use gcore and get a backtrace.

No sign of any '#SBATCH -m' anywhere, but this looks like a contender for the distribution bug:

```
[root@citra state_screwed_20200724]# cat hash.2/job.4750462/script
#!/bin/bash
#SBATCH --partition=debugq
#SBATCH --mem=100GB
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:00:10
#SBATCH --job-name=Stokes_cube
#SBATCH --distribution=plane=1
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

module load astropy

srun -n 1 /home/jcollier/Scripts/Stokes_cube.py
[root@citra state_screwed_20200724]#
```

And to confirm: yes, the cluster is running again (hence why I didn't log it as a sev1).

That confirms it. I'm going to mark this as a duplicate of bug 9248. Until 20.02.4 is released, you'll need to apply commit eb79918f810 to prevent this infinite loop from happening again. You can find the patch file here:

https://github.com/SchedMD/slurm/commit/eb79918f810.patch

If this problem does happen again (slurmctld has one core running at 100% and is stuck in "Sched"), please re-open this bug and upload a backtrace as Marcin described in comment 6. If other problems happen but not that one, let's keep those in separate tickets.

Also make sure that slurmctld and all slurmd's get restarted after you've removed the old 16.05 nodes from slurm.conf, to clear up communication issues.

Thanks for the bug report, and sorry about that regression.

*** This ticket has been marked as a duplicate of ticket 9248 ***
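The suggested search over the state save location can be scripted. A minimal sketch, assuming the `hash.<N>/job.<id>/script` layout seen in the `cat` output above (the state-save path itself is a placeholder you would point at the real directory, e.g. the configured StateSaveLocation):

```python
import re
from pathlib import Path

# Flag batch scripts that use -m or --distribution in their #SBATCH
# directives. The directory layout hash.<N>/job.<id>/script matches
# the example shown in this ticket; adjust the glob if yours differs.
SUSPECT = re.compile(r"^#SBATCH\s+(-m\b|--distribution)", re.M)

def find_suspect_scripts(state_save: Path) -> list:
    hits = []
    for script in state_save.glob("hash.*/job.*/script"):
        if SUSPECT.search(script.read_text(errors="replace")):
            hits.append(script)
    return hits
```

As noted above, this only catches directives embedded in the batch script; options passed on the sbatch command line are not recoverable from the state files.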
Created attachment 15161 [details]
yesterday's logfile

Hi,

Overnight we had an issue with slurmctld apparently locking up on one of our clusters (zeus). Symptoms were 'scontrol ping' showing no response, and the slurmctld process itself using one core at 100%, stuck in the 'sched' task. Increasing the core count (from 2 to 4) on the VM it was running on made no difference, as did changing any systemd limits in case it was being stalled that way. Starting in the foreground (-Dvvvvv) made no difference either. Eventually we had to stop it, move the StateSaveLocation dir out of the way, and restart with it empty, losing running and queued jobs (in order to restore service).

We have straces from attaching to the running slurmctld (not much obvious to us), mostly:

```
restart_syscall(<... resuming interrupted futex ...>) = -1 ETIMEDOUT (Connection timed out)
futex(0x6f31e0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x6f31a4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3861, {1595555567, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x6f31e0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x6f31a4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3863, {1595555569, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x6f31e0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x6f31a4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3865, {1595555571, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x6f31e0, FUTEX_WAKE_PRIVATE, 1) = 0
...
```

or:

```
rt_sigprocmask(SIG_BLOCK, [CHLD], [HUP INT QUIT ABRT USR2 PIPE ALRM TERM CHLD TSTP XCPU], 8) = 0
nanosleep({2, 0}, 0x2ac45a583ce0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [HUP INT QUIT ABRT USR2 PIPE ALRM TERM CHLD TSTP XCPU], 8) = 0
nanosleep({2, 0}, 0x2ac45a583ce0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [HUP INT QUIT ABRT USR2 PIPE ALRM TERM CHLD TSTP XCPU], 8) = 0
nanosleep({2, 0}, 0x2ac45a583ce0) = 0
...
```
(which we can tar and send if relevant)

After each restart (we did quite a bit of this), we'd see sdiag look OK initially:

```
[root@citra ~]# sdiag
*******************************************************
sdiag output at Fri Jul 24 08:05:16 2020 (1595549116)
Data since      Fri Jul 24 08:04:52 2020 (1595549092)
*******************************************************
Server thread count:  11
Agent queue size:     0
Agent count:          1
Agent thread count:   12
DBD Agent queue size: 0

Jobs submitted: 0
Jobs started:   0
Jobs completed: 0
Jobs canceled:  0
Jobs failed:    0

Job states ts:  Fri Jul 24 08:04:52 2020 (1595549092)
Jobs pending:   329
Jobs running:   91

Main schedule statistics (microseconds):
    Last cycle:   0
    Max cycle:    0
    Total cycles: 0
    Last queue length: 264

Backfilling stats
    Total backfilled jobs (since last slurm start): 0
    Total backfilled jobs (since last stats cycle start): 0
    Total backfilled heterogeneous job components: 0
    Total cycles: 0
    Last cycle when: N/A
    Last cycle: 0
    Max cycle: 0
    Last depth cycle: 0
    Last depth cycle (try sched): 0
    Last queue length: 0
    Last table size: 0

Latency for 1000 calls to gettimeofday(): 41 microseconds

Remote Procedure Call statistics by message type
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:182 ave_time:3012 total_time:548336
    REQUEST_PARTITION_INFO           ( 2009) count:2   ave_time:231  total_time:463
    REQUEST_NODE_INFO                ( 2007) count:1   ave_time:761  total_time:761
    REQUEST_JOB_INFO                 ( 2003) count:1   ave_time:3828 total_time:3828
    REQUEST_FED_INFO                 ( 2049) count:1   ave_time:127  total_time:127
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:0   ave_time:0    total_time:0
    REQUEST_STATS_INFO               ( 2035) count:0   ave_time:0    total_time:0

Remote Procedure Call statistics by user
    root         (     0) count:182 ave_time:3012 total_time:548336
    487          (   487) count:4   ave_time:1263 total_time:5052
    asevernellis ( 23212) count:1   ave_time:127  total_time:127

Pending RPC statistics
    No pending RPCs
[root@citra ~]#
```

but then it would time out in the same way that sinfo did.
I did notice the following thread count warning:

```
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-07-24T07:42:41.320] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: server_thread_count over limit (256), waiting
[2020-07-24T07:42:41.324] server_thread_count over limit (256), waiting
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-07-24T07:42:41.324] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-07-24T07:42:41.324] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
```

but I see from the source that increasing this isn't recommended.

I attach the logs from yesterday (a cron job started throwing errors around 23:35) and since the restart. Yes, I know we had some old (very old) slurmd's trying to connect. They've since been killed off.

Current config and sdiag:

```
[root@citra ~]# scontrol show config
Configuration data as of 2020-07-24T12:38:20
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost   = comet.pawsey.org.au
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2020-07-24T10:22:36
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = zeus
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand
CredType                = cred/munge
DebugFlags              = NO_CONF_HASH
DefMemPerNode           = UNLIMITED
DependencyParameters    = kill_invalid_depend=1
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = ANY
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 8
FederationParameters    = (null)
FirstJobId              = 4750001
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = send_gids
LaunchType              = launch/slurm
Layouts                 =
Licenses                = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1000
MaxDBDMsgs              = 20796
MaxJobCount             = 10000
MaxJobId                = 67108863
MaxMemPerCPU            = 0
MaxStepCount            = 40000
MaxTasksPerNode         = 51200
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = pmi2
MpiParams               = ports=12000-12999
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 4750738
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = plugstack.conf
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = OFF
PreemptType             = preempt/none
PreemptExemptTime       = 00:00:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = MAX_TRES
PriorityMaxAge          = 100-00:00:00
PriorityUsageResetPeriod = QUARTERLY
PriorityType            = priority/multifactor
PriorityWeightAge       = 9600
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 600
PriorityWeightJobSize   = 0
PriorityWeightPartition = 30000
PriorityWeightQOS       = 150000
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc,Contain,X11
PropagatePrioProcess    = 0
PropagateResourceLimits = NONE
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = route/default
SallocDefaultCommand    = srun --mem-per-cpu=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL
SbcastParameters        = (null)
SchedulerParameters     = bf_continue,bf_max_job_user=10
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY,OTHER_CONS_RES
SlurmUser               = slurm(20381)
SlurmctldAddr           = (null)
SlurmctldDebug          = debug
SlurmctldHost[0]        = citra
SlurmctldLogFile        = /var/log/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = quiet
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 1600 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/spool/slurmd/%h.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurm
SlurmdSyslogDebug       = quiet
SlurmdTimeout           = 1600 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 20.02.3
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurm/state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity,task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = No
UnkillableStepProgram   = /pawsey/pawsey/admin/getProcList.sh
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
ConstrainCores          = yes
ConstrainDevices        = yes
ConstrainKmemSpace      = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at citra is UP
```

```
[root@citra ~]# sdiag
*******************************************************
sdiag output at Fri Jul 24 12:38:25 2020 (1595565505)
Data since      Fri Jul 24 10:22:43 2020 (1595557363)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 749
Jobs started:   222
Jobs completed: 118
Jobs canceled:  7
Jobs failed:    0

Job states ts:  Fri Jul 24 12:38:15 2020 (1595565495)
Jobs pending:   526
Jobs running:   98

Main schedule statistics (microseconds):
    Last cycle:   1173
    Max cycle:    479505
    Total cycles: 855
    Mean cycle:   4069
    Mean depth cycle:  144
    Cycles per minute: 6
    Last queue length: 505

Backfilling stats
    Total backfilled jobs (since last slurm start): 57
    Total backfilled jobs (since last stats cycle start): 57
    Total backfilled heterogeneous job components: 0
    Total cycles: 259
    Last cycle when: Fri Jul 24 12:38:14 2020 (1595565494)
    Last cycle: 1161
    Max cycle: 2491
    Mean cycle: 975
    Last depth cycle: 1
    Last depth cycle (try sched): 1
    Depth Mean: 0
    Depth Mean (try depth): 0
    Last queue length: 505
    Queue length mean: 422
    Last table size: 1
    Mean table size: 1

Latency for 1000 calls to gettimeofday(): 41 microseconds

Remote Procedure Call statistics by message type
    REQUEST_JOB_STEP_CREATE          ( 5001) count:9681 ave_time:61401  total_time:594426792
    REQUEST_PARTITION_INFO           ( 2009) count:7783 ave_time:175    total_time:1362565
    REQUEST_FED_INFO                 ( 2049) count:7048 ave_time:276    total_time:1952174
    REQUEST_JOB_USER_INFO            ( 2039) count:6586 ave_time:16130  total_time:106238553
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:5801 ave_time:916    total_time:5319203
    REQUEST_HET_JOB_ALLOC_INFO       ( 4027) count:953  ave_time:307    total_time:292624
    REQUEST_STEP_COMPLETE            ( 5016) count:934  ave_time:5050   total_time:4717473
    ACCOUNTING_UPDATE_MSG            (10001) count:915  ave_time:419855 total_time:384167815
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:751  ave_time:63281  total_time:47524460
    REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:722  ave_time:2966   total_time:2142001
    REQUEST_COMPLETE_PROLOG          ( 6018) count:260  ave_time:1870   total_time:486293
    REQUEST_NODE_INFO                ( 2007) count:236  ave_time:756    total_time:178514
    REQUEST_JOB_INFO                 ( 2003) count:198  ave_time:19729  total_time:3906416
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:140  ave_time:16311  total_time:2283580
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:119  ave_time:24387  total_time:2902053
    REQUEST_PING                     ( 1008) count:14   ave_time:164    total_time:2309
    REQUEST_BUILD_INFO               ( 2001) count:13   ave_time:289    total_time:3769
    REQUEST_JOB_READY                ( 4019) count:12   ave_time:20446  total_time:245354
    REQUEST_KILL_JOB                 ( 5032) count:9    ave_time:456    total_time:4105
    REQUEST_RESOURCE_ALLOCATION      ( 4001) count:8    ave_time:21999  total_time:175994
    REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:6    ave_time:67719  total_time:406317
    REQUEST_CREATE_RESERVATION       ( 3006) count:5    ave_time:474    total_time:2371
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:4    ave_time:29627  total_time:118508
    REQUEST_RESERVATION_INFO         ( 2024) count:3    ave_time:213    total_time:640
    REQUEST_SHARE_INFO               ( 2022) count:2    ave_time:185362 total_time:370725
    REQUEST_STATS_INFO               ( 2035) count:0    ave_time:0      total_time:0

Remote Procedure Call statistics by user
    awilliamson1    ( 20863) count:12117 ave_time:4936   total_time:59814493
    sbeecroft       ( 24501) count:11329 ave_time:52462  total_time:594344628
    root            (     0) count:7308  ave_time:2151   total_time:15724863
    mbravo          ( 21105) count:4812  ave_time:9539   total_time:45902204
    hhu             ( 23879) count:2315  ave_time:20670  total_time:47852935
    slurm           ( 20381) count:915   ave_time:419855 total_time:384167815
    mdelapierre     ( 22669) count:658   ave_time:1292   total_time:850176
    nswainston      ( 23667) count:536   ave_time:256    total_time:137501
    487             (   487) count:512   ave_time:7773   total_time:3979936
    mtaylor         ( 22120) count:459   ave_time:1343   total_time:616824
    aelwell         ( 20701) count:241   ave_time:795    total_time:191629
    amanuwal        ( 21510) count:234   ave_time:4944   total_time:1156909
    cfernandez      ( 20780) count:191   ave_time:256    total_time:49075
    apomeroy        ( 20377) count:111   ave_time:905    total_time:100509
    isolde          ( 21486) count:96    ave_time:951    total_time:91360
    cplowman        ( 25049) count:36    ave_time:230    total_time:8314
    shaodianme      ( 24341) count:34    ave_time:568    total_time:19327
    sbellstedt      ( 21116) count:29    ave_time:3331   total_time:96623
    ilkhom          ( 20484) count:27    ave_time:80728  total_time:2179662
    cdipietrantonio ( 22158) count:23    ave_time:2191   total_time:50393
    awestin         ( 20420) count:22    ave_time:948    total_time:20870
    sumithae        ( 21430) count:21    ave_time:205    total_time:4318
    abalaji         ( 21158) count:21    ave_time:16674  total_time:350169
    kavianipour     ( 21045) count:20    ave_time:13841  total_time:276827
    amb021          ( 22803) count:18    ave_time:1009   total_time:18172
    kwamahiu        ( 22512) count:16    ave_time:33777  total_time:540436
    cormac          ( 20635) count:15    ave_time:24414  total_time:366212
    randerson       ( 20831) count:15    ave_time:12502  total_time:187544
    moshea          ( 20002) count:12    ave_time:3215   total_time:38588
    alagain         ( 21292) count:11    ave_time:6267   total_time:68938
    aschuitemaker   ( 23738) count:9     ave_time:381    total_time:3429
    wdavey          ( 23811) count:9     ave_time:468    total_time:4213
    adusting        ( 22218) count:6     ave_time:447    total_time:2684
    mmariano        ( 22025) count:5     ave_time:764    total_time:3824
    apfaff          ( 21334) count:5     ave_time:565    total_time:2827
    alechislop      ( 25096) count:4     ave_time:1034   total_time:4138
    bcocking        ( 22156) count:3     ave_time:250    total_time:752
    ale             ( 24336) count:3     ave_time:186    total_time:558
    mgill           ( 22184) count:3     ave_time:171    total_time:515
    jpetereit       ( 22149) count:2     ave_time:209    total_time:418

Pending RPC statistics
    No pending RPCs
```

Any idea what else we can look for to prevent this re-appearing? Logged at sev2 as it actually caused a cluster outage, but we're back up despite losing the queue.

Many thanks,
Andrew