| Summary: | slurmctld unresponsive, sched thread(s) stuck | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Andrew Elwell <Andrew.Elwell> |
| Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | CC: | cinek, marshall, wdavey |
| Version: | 20.02.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9248 | | |
| Site: | Pawsey | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | CentOS | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | yesterday's logfile, today's logfile (many restarts) | | |
Created attachment 15162 [details]
today's logfile (many restarts)

Looking at the logs, there are a number of messages like:

```
[2020-07-23T23:42:59.854] error: slurm_unpack_received_msg: Message receive failure
[2020-07-23T23:42:59.856] error: unpack_header: protocol_version 7680 not supported
[2020-07-23T23:42:59.856] error: unpacking header
```

Protocol version 7680 corresponds to Slurm 16.05. Based on the logs, the old version may be running on:
146.118.80.201
146.118.80.202
146.118.80.203
146.118.80.204
146.118.80.205
146.118.80.206
146.118.80.207
146.118.80.208
146.118.80.209
146.118.80.210
146.118.80.211
146.118.80.212
146.118.80.213
146.118.80.214
146.118.80.215
Could you please check that?
cheers,
Marcin
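As an aside, the unsupported `protocol_version 7680` in the error above can be decoded mechanically: the high byte is the release's protocol major number. A minimal sketch of that decoding; the release mapping is an assumption based on the `(N << 8) | 0` scheme used in the Slurm headers (e.g. `SLURM_16_05_PROTOCOL_VERSION` is `(30 << 8) | 0`):

```python
# Decode a Slurm wire protocol_version into major/minor and, where
# known, the corresponding release. The RELEASES table is an
# assumption based on the (N << 8) | 0 versioning scheme in the
# Slurm source headers.
RELEASES = {
    30: "16.05",
    31: "17.02",
    32: "17.11",
    33: "18.08",
    34: "19.05",
    35: "20.02",
}

def decode_protocol_version(version: int) -> str:
    major, minor = version >> 8, version & 0xFF  # high byte = protocol major
    release = RELEASES.get(major, "unknown")
    return f"protocol {major}.{minor} -> Slurm {release}"

print(decode_protocol_version(7680))
```

So 7680 is `30 << 8`, i.e. the 16.05 protocol, which is how the stray old slurmds were identified.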
Yes, they were old slurmds that were assumed to be turned off.

Did the issue disappear when you turned those off?

cheers,
Marcin

No. They were stopped (and the corresponding node and partition info removed from the config) without any improvement. The only way we could start slurmctld and keep it responsive was by moving the state files out of the way. (We still have these if they are any help.)

If it happens again, please take a core dump of the running slurmctld with a command like:

```
# gcore $(pidof slurmctld)
```

then load the result into gdb and share a backtrace with us:

```
# gdb /path/to/slurmctld /path/to/core.XXX
(gdb) t a a bt f
```

I'm looking into the logs and code to find a potential root cause.

cheers,
Marcin

Andrew, I'm taking over for Marcin since it looks like something I've worked on. Just to be clear, is the cluster running smoothly again? From your comments it sounds like you cleared the state files (and sadly lost the job queue) to get the cluster going. Is that correct?

I strongly suspect that this is a duplicate of bug 9248. There was a regression that caused an infinite loop when someone requests --distribution=plane, but someone can accidentally do that with syntax like this:

```
-m=8gb
```

So someone trying to request memory might use this bad syntax, actually request --distribution=plane, and cause an infinite loop in slurmctld. This infinite loop is in the sched task, which also makes me think that's what happened.

Commit eb79918f is the fix for the regression causing the infinite loop. Commit 6beb50bc fixes the parsing of --distribution so this bad syntax is rejected instead of accepted.

You should be able to view the job scripts in the state save location to see if anyone used the -m option in their batch job script. If they used it on the command line and not in their batch job script, you sadly won't be able to see whether they used it.
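To illustrate the failure mode described above, here is a purely hypothetical sketch (this is not Slurm's actual parser): getopt-style short options accept `-m=8gb`, handing the option handler the string `=8gb`, and a parser that keys off the `=` without validating the token before it reads that as a plane distribution:

```python
# Hypothetical sketch of the misparse: "-m" is short for
# --distribution, so "-m=8gb" delivers the argument "=8gb". A lenient
# parser that treats any "key=value" form as a plane request (the
# valid form is "plane=<size>") accepts the garbage.
def lenient_parse_distribution(optarg: str):
    if "=" in optarg:
        _token, _, size = optarg.partition("=")
        # Mimic C's atoi(): consume leading digits, ignore the rest,
        # so "8gb" becomes 8.
        digits = ""
        for ch in size:
            if not ch.isdigit():
                break
            digits += ch
        return ("plane", int(digits) if digits else 0)
    return (optarg, None)

print(lenient_parse_distribution("=8gb"))     # misread as a plane request
print(lenient_parse_distribution("plane=1"))  # the legitimate form
```

The fix referenced above (commit 6beb50bc) is described as rejecting this bad syntax instead of accepting it.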
The job scripts are located in:

```
/path/to/state/save/location/hash.<0-9>/script
```

Can you do a search for -m or --distribution in those job scripts?

If it does happen again, like Marcin said in comment 6, we'll need you to use gcore and get a backtrace.

No sign of any '#SBATCH -m' anywhere, but this looks like a contender for the distribution bug:

```
[root@citra state_screwed_20200724]# cat hash.2/job.4750462/script
#!/bin/bash
#SBATCH --partition=debugq
#SBATCH --mem=100GB
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:00:10
#SBATCH --job-name=Stokes_cube
#SBATCH --distribution=plane=1
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

module load astropy

srun -n 1 /home/jcollier/Scripts/Stokes_cube.py
[root@citra state_screwed_20200724]#
```

And to confirm: yes, the cluster is running again (hence why I didn't log it as a sev1).

That confirms it. I'm going to mark this as a duplicate of bug 9248. Until 20.02.4 is released, you'll need to apply commit eb79918f810 to prevent this infinite loop from happening again. You can find the patch file here:

https://github.com/SchedMD/slurm/commit/eb79918f810.patch

If this problem does happen again (slurmctld has one core running at 100% and is stuck in "Sched"), please re-open this bug and upload a backtrace as Marcin described in comment 6. If other problems happen but not that one, let's keep those in separate tickets.

Also make sure that slurmctld and all slurmd's get restarted after you've removed the old 16.05 nodes from slurm.conf, to clear up communication issues.

Thanks for the bug report, and sorry about that regression.

*** This ticket has been marked as a duplicate of ticket 9248 ***
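The suggested search over the state save location can be scripted. A minimal sketch, assuming the `hash.<N>/job.<id>/script` layout seen in the `cat` output above (the state-save path itself is a placeholder you would point at the real directory, e.g. the configured StateSaveLocation):

```python
import re
from pathlib import Path

# Flag batch scripts that use -m or --distribution in their #SBATCH
# directives. The directory layout hash.<N>/job.<id>/script matches
# the example shown in this ticket; adjust the glob if yours differs.
SUSPECT = re.compile(r"^#SBATCH\s+(-m\b|--distribution)", re.M)

def find_suspect_scripts(state_save: Path) -> list:
    hits = []
    for script in state_save.glob("hash.*/job.*/script"):
        if SUSPECT.search(script.read_text(errors="replace")):
            hits.append(script)
    return hits
```

As noted above, this only catches directives embedded in the batch script; options passed on the sbatch command line are not recoverable from the state files.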
Created attachment 15161 [details]
yesterday's logfile

Hi,

Overnight we had an issue with slurmctld apparently locking up on one of our clusters (zeus). Symptoms were 'scontrol ping' showing no response, and the slurmctld process itself using one core at 100%, stuck in the 'sched' task. Increasing the core count (from 2 to 4) on the VM it was running on made no difference, as did changing any systemd limits in case it was being stalled that way. Starting in the foreground (-Dvvvvv) made no difference either. Eventually we had to stop it, move the StateSaveLocation dir out of the way, and restart with it empty, losing running and queued jobs (in order to restore service).

We have straces from attaching to the running slurmctld (not much obvious to us), mostly:

```
restart_syscall(<... resuming interrupted futex ...>) = -1 ETIMEDOUT (Connection timed out)
futex(0x6f31e0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x6f31a4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3861, {1595555567, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x6f31e0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x6f31a4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3863, {1595555569, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x6f31e0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x6f31a4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3865, {1595555571, 0}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x6f31e0, FUTEX_WAKE_PRIVATE, 1) = 0
...
```

or:

```
rt_sigprocmask(SIG_BLOCK, [CHLD], [HUP INT QUIT ABRT USR2 PIPE ALRM TERM CHLD TSTP XCPU], 8) = 0
nanosleep({2, 0}, 0x2ac45a583ce0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [HUP INT QUIT ABRT USR2 PIPE ALRM TERM CHLD TSTP XCPU], 8) = 0
nanosleep({2, 0}, 0x2ac45a583ce0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [HUP INT QUIT ABRT USR2 PIPE ALRM TERM CHLD TSTP XCPU], 8) = 0
nanosleep({2, 0}, 0x2ac45a583ce0) = 0
...
```
(which we can tar and send if relevant)

After each restart (we did quite a bit of this), we'd see sdiag look OK initially:

```
[root@citra ~]# sdiag
*******************************************************
sdiag output at Fri Jul 24 08:05:16 2020 (1595549116)
Data since      Fri Jul 24 08:04:52 2020 (1595549092)
*******************************************************
Server thread count:  11
Agent queue size:     0
Agent count:          1
Agent thread count:   12
DBD Agent queue size: 0

Jobs submitted: 0
Jobs started:   0
Jobs completed: 0
Jobs canceled:  0
Jobs failed:    0

Job states ts:  Fri Jul 24 08:04:52 2020 (1595549092)
Jobs pending:   329
Jobs running:   91

Main schedule statistics (microseconds):
    Last cycle:   0
    Max cycle:    0
    Total cycles: 0
    Last queue length: 264

Backfilling stats
    Total backfilled jobs (since last slurm start): 0
    Total backfilled jobs (since last stats cycle start): 0
    Total backfilled heterogeneous job components: 0
    Total cycles: 0
    Last cycle when: N/A
    Last cycle: 0
    Max cycle: 0
    Last depth cycle: 0
    Last depth cycle (try sched): 0
    Last queue length: 0
    Last table size: 0

Latency for 1000 calls to gettimeofday(): 41 microseconds

Remote Procedure Call statistics by message type
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:182 ave_time:3012 total_time:548336
    REQUEST_PARTITION_INFO           ( 2009) count:2   ave_time:231  total_time:463
    REQUEST_NODE_INFO                ( 2007) count:1   ave_time:761  total_time:761
    REQUEST_JOB_INFO                 ( 2003) count:1   ave_time:3828 total_time:3828
    REQUEST_FED_INFO                 ( 2049) count:1   ave_time:127  total_time:127
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:0   ave_time:0    total_time:0
    REQUEST_STATS_INFO               ( 2035) count:0   ave_time:0    total_time:0

Remote Procedure Call statistics by user
    root         (     0) count:182 ave_time:3012 total_time:548336
    487          (   487) count:4   ave_time:1263 total_time:5052
    asevernellis ( 23212) count:1   ave_time:127  total_time:127

Pending RPC statistics
    No pending RPCs
[root@citra ~]#
```

but then it would time out in the same way that sinfo did.
I did notice the following thread count warning:

```
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-07-24T07:42:41.320] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: server_thread_count over limit (256), waiting
[2020-07-24T07:42:41.324] server_thread_count over limit (256), waiting
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-07-24T07:42:41.324] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2020-07-24T07:42:41.324] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
```

but I see from the source that increasing this isn't recommended.

I attach the logs from yesterday (a cron job started throwing errors around 23:35) and since the restart. Yes, I know we had some old (very old) slurmd's trying to connect. They've since been killed off.

Current config and sdiag:

```
[root@citra ~]# scontrol show config
Configuration data as of 2020-07-24T12:38:20
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost   = comet.pawsey.org.au
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2020-07-24T10:22:36
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = zeus
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand
CredType                = cred/munge
DebugFlags              = NO_CONF_HASH
DefMemPerNode           = UNLIMITED
DependencyParameters    = kill_invalid_depend=1
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = ANY
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 8
FederationParameters    = (null)
FirstJobId              = 4750001
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = send_gids
LaunchType              = launch/slurm
Layouts                 =
Licenses                = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1000
MaxDBDMsgs              = 20796
MaxJobCount             = 10000
MaxJobId                = 67108863
MaxMemPerCPU            = 0
MaxStepCount            = 40000
MaxTasksPerNode         = 51200
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = pmi2
MpiParams               = ports=12000-12999
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 4750738
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = plugstack.conf
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = OFF
PreemptType             = preempt/none
PreemptExemptTime       = 00:00:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = MAX_TRES
PriorityMaxAge          = 100-00:00:00
PriorityUsageResetPeriod = QUARTERLY
PriorityType            = priority/multifactor
PriorityWeightAge       = 9600
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 600
PriorityWeightJobSize   = 0
PriorityWeightPartition = 30000
PriorityWeightQOS       = 150000
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc,Contain,X11
PropagatePrioProcess    = 0
PropagateResourceLimits = NONE
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = route/default
SallocDefaultCommand    = srun --mem-per-cpu=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL
SbcastParameters        = (null)
SchedulerParameters     = bf_continue,bf_max_job_user=10
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY,OTHER_CONS_RES
SlurmUser               = slurm(20381)
SlurmctldAddr           = (null)
SlurmctldDebug          = debug
SlurmctldHost[0]        = citra
SlurmctldLogFile        = /var/log/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = quiet
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 1600 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/spool/slurmd/%h.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurm
SlurmdSyslogDebug       = quiet
SlurmdTimeout           = 1600 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 20.02.3
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurm/state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity,task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = No
UnkillableStepProgram   = /pawsey/pawsey/admin/getProcList.sh
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
ConstrainCores          = yes
ConstrainDevices        = yes
ConstrainKmemSpace      = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at citra is UP
```

```
[root@citra ~]# sdiag
*******************************************************
sdiag output at Fri Jul 24 12:38:25 2020 (1595565505)
Data since      Fri Jul 24 10:22:43 2020 (1595557363)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 749
Jobs started:   222
Jobs completed: 118
Jobs canceled:  7
Jobs failed:    0

Job states ts:  Fri Jul 24 12:38:15 2020 (1595565495)
Jobs pending:   526
Jobs running:   98

Main schedule statistics (microseconds):
    Last cycle:   1173
    Max cycle:    479505
    Total cycles: 855
    Mean cycle:   4069
    Mean depth cycle:  144
    Cycles per minute: 6
    Last queue length: 505

Backfilling stats
    Total backfilled jobs (since last slurm start): 57
    Total backfilled jobs (since last stats cycle start): 57
    Total backfilled heterogeneous job components: 0
    Total cycles: 259
    Last cycle when: Fri Jul 24 12:38:14 2020 (1595565494)
    Last cycle: 1161
    Max cycle: 2491
    Mean cycle: 975
    Last depth cycle: 1
    Last depth cycle (try sched): 1
    Depth Mean: 0
    Depth Mean (try depth): 0
    Last queue length: 505
    Queue length mean: 422
    Last table size: 1
    Mean table size: 1

Latency for 1000 calls to gettimeofday(): 41 microseconds

Remote Procedure Call statistics by message type
    REQUEST_JOB_STEP_CREATE          ( 5001) count:9681 ave_time:61401  total_time:594426792
    REQUEST_PARTITION_INFO           ( 2009) count:7783 ave_time:175    total_time:1362565
    REQUEST_FED_INFO                 ( 2049) count:7048 ave_time:276    total_time:1952174
    REQUEST_JOB_USER_INFO            ( 2039) count:6586 ave_time:16130  total_time:106238553
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:5801 ave_time:916    total_time:5319203
    REQUEST_HET_JOB_ALLOC_INFO       ( 4027) count:953  ave_time:307    total_time:292624
    REQUEST_STEP_COMPLETE            ( 5016) count:934  ave_time:5050   total_time:4717473
    ACCOUNTING_UPDATE_MSG            (10001) count:915  ave_time:419855 total_time:384167815
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:751  ave_time:63281  total_time:47524460
    REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:722  ave_time:2966   total_time:2142001
    REQUEST_COMPLETE_PROLOG          ( 6018) count:260  ave_time:1870   total_time:486293
    REQUEST_NODE_INFO                ( 2007) count:236  ave_time:756    total_time:178514
    REQUEST_JOB_INFO                 ( 2003) count:198  ave_time:19729  total_time:3906416
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:140  ave_time:16311  total_time:2283580
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:119  ave_time:24387  total_time:2902053
    REQUEST_PING                     ( 1008) count:14   ave_time:164    total_time:2309
    REQUEST_BUILD_INFO               ( 2001) count:13   ave_time:289    total_time:3769
    REQUEST_JOB_READY                ( 4019) count:12   ave_time:20446  total_time:245354
    REQUEST_KILL_JOB                 ( 5032) count:9    ave_time:456    total_time:4105
    REQUEST_RESOURCE_ALLOCATION      ( 4001) count:8    ave_time:21999  total_time:175994
    REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:6    ave_time:67719  total_time:406317
    REQUEST_CREATE_RESERVATION       ( 3006) count:5    ave_time:474    total_time:2371
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:4    ave_time:29627  total_time:118508
    REQUEST_RESERVATION_INFO         ( 2024) count:3    ave_time:213    total_time:640
    REQUEST_SHARE_INFO               ( 2022) count:2    ave_time:185362 total_time:370725
    REQUEST_STATS_INFO               ( 2035) count:0    ave_time:0      total_time:0

Remote Procedure Call statistics by user
    awilliamson1    ( 20863) count:12117 ave_time:4936   total_time:59814493
    sbeecroft       ( 24501) count:11329 ave_time:52462  total_time:594344628
    root            (     0) count:7308  ave_time:2151   total_time:15724863
    mbravo          ( 21105) count:4812  ave_time:9539   total_time:45902204
    hhu             ( 23879) count:2315  ave_time:20670  total_time:47852935
    slurm           ( 20381) count:915   ave_time:419855 total_time:384167815
    mdelapierre     ( 22669) count:658   ave_time:1292   total_time:850176
    nswainston      ( 23667) count:536   ave_time:256    total_time:137501
    487             (   487) count:512   ave_time:7773   total_time:3979936
    mtaylor         ( 22120) count:459   ave_time:1343   total_time:616824
    aelwell         ( 20701) count:241   ave_time:795    total_time:191629
    amanuwal        ( 21510) count:234   ave_time:4944   total_time:1156909
    cfernandez      ( 20780) count:191   ave_time:256    total_time:49075
    apomeroy        ( 20377) count:111   ave_time:905    total_time:100509
    isolde          ( 21486) count:96    ave_time:951    total_time:91360
    cplowman        ( 25049) count:36    ave_time:230    total_time:8314
    shaodianme      ( 24341) count:34    ave_time:568    total_time:19327
    sbellstedt      ( 21116) count:29    ave_time:3331   total_time:96623
    ilkhom          ( 20484) count:27    ave_time:80728  total_time:2179662
    cdipietrantonio ( 22158) count:23    ave_time:2191   total_time:50393
    awestin         ( 20420) count:22    ave_time:948    total_time:20870
    sumithae        ( 21430) count:21    ave_time:205    total_time:4318
    abalaji         ( 21158) count:21    ave_time:16674  total_time:350169
    kavianipour     ( 21045) count:20    ave_time:13841  total_time:276827
    amb021          ( 22803) count:18    ave_time:1009   total_time:18172
    kwamahiu        ( 22512) count:16    ave_time:33777  total_time:540436
    cormac          ( 20635) count:15    ave_time:24414  total_time:366212
    randerson       ( 20831) count:15    ave_time:12502  total_time:187544
    moshea          ( 20002) count:12    ave_time:3215   total_time:38588
    alagain         ( 21292) count:11    ave_time:6267   total_time:68938
    aschuitemaker   ( 23738) count:9     ave_time:381    total_time:3429
    wdavey          ( 23811) count:9     ave_time:468    total_time:4213
    adusting        ( 22218) count:6     ave_time:447    total_time:2684
    mmariano        ( 22025) count:5     ave_time:764    total_time:3824
    apfaff          ( 21334) count:5     ave_time:565    total_time:2827
    alechislop      ( 25096) count:4     ave_time:1034   total_time:4138
    bcocking        ( 22156) count:3     ave_time:250    total_time:752
    ale             ( 24336) count:3     ave_time:186    total_time:558
    mgill           ( 22184) count:3     ave_time:171    total_time:515
    jpetereit       ( 22149) count:2     ave_time:209    total_time:418

Pending RPC statistics
    No pending RPCs
```

Any idea what else we can look for to prevent this re-appearing? Logged at sev2 as it actually caused a cluster outage, but we're back up despite losing the queue.

Many thanks,
Andrew