Ticket 6462

Summary: slurmctld gets stuck and doesn't respond anymore
Product: Slurm Reporter: Cineca HPC Systems <hpc-sysmgt-info>
Component: slurmctld    Assignee: Alejandro Sanchez <alex>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: alex, nate
Version: 18.08.3   
Hardware: Linux   
OS: Linux   
Site: Cineca Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: gdb trace
netstat during hang
lsof during hang
strace during hang
slurmctld log
slurm conf file
sdiag of the controller when not responding
gdb trace for all threads
logs at 20190319
galileo sdiag output
marconi sdiag output
sdiag output during hang
gdb trace for all threads during hang

Description Cineca HPC Systems 2019-02-07 09:35:09 MST
Created attachment 9101 [details]
gdb trace

Hello,

since the upgrade to 18.08.3 we have been observing strange behaviour from the slurmctld daemon: slurmctld freezes, and not only does it stop scheduling/dispatching jobs, it also stops accepting commands, so all query commands (squeue, sinfo, etc.) hang.
In the slurmctld logs we can see:

[2019-02-07T16:47:21.897] _slurmctld_rpc_mgr: accept() connection from 10.23.16.26:45076
[2019-02-07T16:47:21.898] slurmctld_req: received opcode REQUEST_COMPLETE_BATCH_SCRIPT from 10.23.16.26:45076
[2019-02-07T16:47:22.447] _slurmctld_rpc_mgr: accept() connection from 10.23.16.32:51172
[2019-02-07T16:47:22.447] slurmctld_req: received opcode REQUEST_COMPLETE_BATCH_SCRIPT from 10.23.16.32:51172
[2019-02-07T16:47:36.579] _slurmctld_rpc_mgr: accept() connection from 10.23.16.5:44908
[2019-02-07T16:47:36.580] slurmctld_req: received opcode REQUEST_COMPLETE_BATCH_SCRIPT from 10.23.16.5:44908
[2019-02-07T16:47:39.027] _slurmctld_rpc_mgr: accept() connection from 10.23.16.21:39334
[2019-02-07T16:47:39.027] server_thread_count over limit (256), waiting
[2019-02-07T16:47:39.027] slurmctld_req: received opcode REQUEST_STEP_COMPLETE from 10.23.16.21:39334

as the last messages logged.
Strangely, this is the only cluster (we have 3) where this happens.

Best Regards,
Marcello
Comment 1 Cineca HPC Systems 2019-02-07 09:35:43 MST
Created attachment 9102 [details]
netstat during hang
Comment 2 Cineca HPC Systems 2019-02-07 09:36:10 MST
Created attachment 9103 [details]
lsof during hang
Comment 3 Cineca HPC Systems 2019-02-07 09:36:38 MST
Created attachment 9104 [details]
strace during hang
Comment 4 Cineca HPC Systems 2019-02-07 09:37:14 MST
Created attachment 9105 [details]
slurmctld log
Comment 5 Cineca HPC Systems 2019-02-07 09:55:50 MST
Additionally, when slurmctld is restarted from this state to restore functionality, some partitions come back with the wrong configuration:

PartitionName=gll_usr_gpuprod
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=LINCO_INVENEW,LINCO_tph1,LINCO_galinew,fer_prod1,eli_prod1 Default=NO QoS=gll_usr_gpuprod
   DefaultTime=00:30:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node[153-156,499-507]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=208 TotalNodes=13 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=7000 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=0.25,Mem=0.035G,GRES/gpu=1

PartitionName=gll_usr_prod
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=LINCO_galinew,fer_prod1,eli_prod1 Default=NO QoS=gll_usr_prod
   DefaultTime=00:30:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node[125-134,173-280,389-483]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=7668 TotalNodes=213 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3000 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=0.25,Mem=0.072G


The AllocNodes list is not made of nodes, but accounts...

Regards,
Marcello
Comment 6 Jason Booth 2019-02-07 10:22:45 MST
Hi Marcello,

 When did you finish the upgrade to 18.08.3? And does the problem occur only sometimes, or after a certain duration, e.g. fifteen minutes?
Comment 9 Cineca HPC Systems 2019-02-07 11:03:09 MST
(In reply to Jason Booth from comment #6)
> Hi Marcello,
> 
>  When did you finish the upgrade to 18.08.3 and does the problem only occur
> some times or after a duration of time e.g. fifteen minuets?

The upgrade was done in November 2018; after that the problem appeared more and more frequently. There doesn't seem to be a pattern, however. These are the occurrences:

/var/log/messages-20181125.gz:Nov 22 17:23:26 io07 slurmctld[29712]: server_thread_count over limit (256), waiting
/var/log/messages-20181125.gz:Nov 22 17:23:31 io07 slurmctld[29712]: server_thread_count over limit (256), waiting
/var/log/messages-20181231.gz:Dec 26 11:41:47 io07 slurmctld[9354]: server_thread_count over limit (256), waiting
/var/log/messages-20181231.gz:Dec 27 10:49:28 io07 slurmctld[16862]: server_thread_count over limit (256), waiting
/var/log/messages-20181231.gz:Dec 29 15:36:18 io07 slurmctld[17366]: server_thread_count over limit (256), waiting
/var/log/messages-20190113.gz:Jan  8 15:25:08 io07 slurmctld[19868]: server_thread_count over limit (256), waiting
/var/log/messages-20190113.gz:Jan  9 10:51:59 io07 slurmctld[26980]: server_thread_count over limit (256), waiting
/var/log/messages-20190203.gz:Jan 29 12:02:00 io07 slurmctld[2484]: server_thread_count over limit (256), waiting
/var/log/messages-20190203.gz:Jan 29 17:29:27 io07 slurmctld[26128]: server_thread_count over limit (256), waiting
/var/log/messages:Feb  6 15:08:29 io08 slurmctld[18699]: server_thread_count over limit (256), waiting
/var/log/messages:Feb  7 16:47:39 io08 slurmctld[4123]: server_thread_count over limit (256), waiting


Regards,
Marcello
Comment 12 Jason Booth 2019-02-07 11:15:46 MST
Marcello, 

 The service may be hitting a limit. Can you send the limits in effect for the slurmctld process:

 cat /proc/`pgrep slurmctld`/limits

Please also send me the contents of the service file if you have one in use, e.g.:
/usr/lib/systemd/system/slurmctld.service

Also, please attach your slurm.conf.

Can you also let us know if you have any other processes running on a consistent basis, such as sacctmgr running updates often?
Comment 14 Alejandro Sanchez 2019-02-08 02:41:15 MST
Hi Marcello,

besides Jason's questions/requests, can you try temporarily disabling the Priority DebugFlags? I'm wondering if the volume of logging enabled by this flag is affecting ctld performance.
Comment 15 Cineca HPC Systems 2019-02-08 04:42:10 MST
(In reply to Jason Booth from comment #12)
> Marcello, 
> 
>  The service may be hitting a limit. Can you send the limit from the
> slurmctld.
> 
> 
>  cat /proc/`pgrep slurmctld`/limits
> 
> Please also send me the the output of the service file if you have one in
> use.
> e.g
> /usr/lib/systemd/system/slurmctld.service 
> 
> Also, please attach your slurm.conf.
> 
> 
> Can you also let us know if you have any other processes running on a
> consistency basis such as sacctmgr running updates often?

Hi Jason,

[root@io08 ~]#  cat /proc/`pgrep slurmctld`/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes                                                                                                                                                                   
Max data size             unlimited            unlimited            bytes                                                                                                                                                                   
Max stack size            unlimited            unlimited            bytes                                                                                                                                                                   
Max core file size        unlimited            unlimited            bytes                                                                                                                                                                   
Max resident set          unlimited            unlimited            bytes                                                                                                                                                                   
Max processes             514023               514023               processes                                                                                                                                                               
Max open files            65536                65536                files                                                                                                                                                                   
Max locked memory         65536                65536                bytes                                                                                                                                                                   
Max address space         unlimited            unlimited            bytes                                                                                                                                                                   
Max file locks            unlimited            unlimited            locks                                                                                                                                                                   
Max pending signals       514023               514023               signals                                                                                                                                                                 
Max msgqueue size         819200               819200               bytes                                                                                                                                                                   
Max nice priority         0                    0                                                                                                                                                                                            
Max realtime priority     0                    0                                                                                                                                                                                            
Max realtime timeout      unlimited            unlimited            us    

[root@io08 ~]# cat /usr/lib/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmctld.pid


[Install]
WantedBy=multi-user.target


We have 2 scripts running on a daily basis:

- one script updating accounts' budget twice a day (8:30 a.m. and 2:30 p.m.)
- one script updating accounts' fairshare twice a day (9:00 a.m. and 3:00 p.m.).

The same scripts also run on the other 2 Slurm HPC clusters we have.

Best Regards,
Marcello
Comment 16 Cineca HPC Systems 2019-02-08 04:44:34 MST
Created attachment 9112 [details]
slurm conf file
Comment 17 Cineca HPC Systems 2019-02-08 05:14:11 MST
(In reply to Alejandro Sanchez from comment #14)
> Hi Marcello,
> 
> besides Jason questions/requests, can you try temporarily disabling Priority
> DebugFlags? I'm wondering if the amount of logs enabled by this flag is
> affecting ctld performance.


Priority DebugFlags removed.

Regards,
Marcello
Comment 18 Alejandro Sanchez 2019-02-11 02:13:52 MST
Hi Marcello. Did the change have any positive effect? If not, will you show the output of 'sdiag'? thanks.
Comment 19 Cineca HPC Systems 2019-02-12 06:23:46 MST
(In reply to Alejandro Sanchez from comment #18)
> Hi Marcello. Did the change have any positive effect? If not, will you show
> the output of 'sdiag'? thanks.

There have been no more disruptions so far, but I'm not fully sure this was the problem, since this is a relatively small cluster (~400 nodes) and we have the same debug flags active on a ~7000-node cluster with no issues.
I guess we'll just have to wait and see.

Regards,
Marcello
Comment 20 Cineca HPC Systems 2019-02-13 09:14:53 MST
(In reply to Cineca HPC Systems from comment #19)
> (In reply to Alejandro Sanchez from comment #18)
> > Hi Marcello. Did the change have any positive effect? If not, will you show
> > the output of 'sdiag'? thanks.
> 
> There were no more disruption so far, but I'm not fully sure this was the
> problem since this is a relatively small cluster (~400 nodes) and we have
> the same debug flags active on a ~7000 node cluster with no issues.
> I guess we'll just have to wait and see.
> 
> Regards,
> Marcello


As expected, the problem appeared again; this time we had a little time to examine it.
At first slurmctld becomes unresponsive to clients on other hosts (s* commands start to hang), BUT on the slurmctld host itself the commands still respond.
On the slurmd side some nodes log this:

[2019-02-13T16:44:42.215] [780733.batch] debug:  step_terminate_monitor_stop signaling condition
[2019-02-13T16:44:42.216] [780733.batch] job 780733 completed with slurm_rc = 0, job_rc = 0
[2019-02-13T16:44:42.216] [780733.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2019-02-13T16:45:18.956] [780733.extern] debug:  jag_common_poll_data: Task 0 pid 33260 ave_freq = 1200042 mem size/max 16384/16384 vmem size/max 110481408/110481408, disk read size/max (2012/2012), disk write size/max (0/0), time 0.000000(0+0)
[2019-02-13T16:46:18.956] [780733.extern] debug:  jag_common_poll_data: Task 0 pid 33260 ave_freq = 1199953 mem size/max 16384/16384 vmem size/max 110481408/110481408, disk read size/max (2012/2012), disk write size/max (0/0), time 0.000000(0+0)
[2019-02-13T16:46:22.167] [780733.0] debug:  slurm_recv_timeout at 0 of 4, timeout
[2019-02-13T16:46:22.167] [780733.0] error: Rank 0 failed sending step completion message directly to slurmctld (0.0.0.0:0), retrying
[2019-02-13T16:46:22.320] [780733.batch] debug:  slurm_recv_timeout at 0 of 4, timeout
[2019-02-13T16:46:22.320] [780733.batch] Retrying job complete RPC for 780733.4294967294
[2019-02-13T16:47:18.956] [780733.extern] debug:  jag_common_poll_data: Task 0 pid 33260 ave_freq = 1199953 mem size/max 16384/16384 vmem size/max 110481408/110481408, disk read size/max (2012/2012), disk write size/max (0/0), time 0.000000(0+0)
[2019-02-13T16:48:17.424] [780733.batch] debug:  slurm_recv_timeout at 0 of 4, timeout


16:45 being the time when the problem started.
As soon as the controller was restarted, the log of the above-mentioned slurmd node reported:

[2019-02-13T17:07:25.715] [780733.batch] debug:  Failed to contact primary controller: Connection refused
[2019-02-13T17:07:25.716] [780733.batch] debug:  Contacted backup controller 0
[2019-02-13T17:07:25.718] [780733.batch] debug:  Primary not responding, backup not in control. sleep and retry
[2019-02-13T17:08:08.396] [780733.0] debug:  Handling REQUEST_STATE
[2019-02-13T17:08:08.396] debug:  _fill_registration_msg: found apparently running step 780733.0
[2019-02-13T17:08:08.397] [780733.extern] debug:  Handling REQUEST_STATE
[2019-02-13T17:08:08.397] debug:  _fill_registration_msg: found apparently running step 780733.4294967295
[2019-02-13T17:08:08.398] [780733.batch] debug:  Handling REQUEST_STATE
[2019-02-13T17:08:08.398] debug:  _fill_registration_msg: found apparently running job 780733
[2019-02-13T17:08:08.445] debug:  _handle_node_reg_resp: slurmctld sent back 9 TRES.
[2019-02-13T17:08:10.717] [780733.0] Rank 0 sent step completion message directly to slurmctld (0.0.0.0:0)
[2019-02-13T17:08:10.717] [780733.0] debug:  Message thread exited
[2019-02-13T17:08:10.717] [780733.0] debug:  mpi/pmi2: agent thread exit
[2019-02-13T17:08:10.718] [780733.0] done with job
[2019-02-13T17:08:18.956] [780733.extern] debug:  jag_common_poll_data: Task 0 pid 33260 ave_freq = 1231578 mem size/max 16384/16384 vmem size/max 110481408/110481408, disk read size/max (2012/2012), disk write size/max (0/0), time 0.000000(0+0)
[2019-02-13T17:08:25.734] [780733.batch] debug:  Message thread exited
[2019-02-13T17:08:25.734] debug:  _rpc_terminate_job, uid = 6003
[2019-02-13T17:08:25.734] debug:  task_p_slurmd_release_resources: affinity jobid 780733
[2019-02-13T17:08:25.734] [780733.batch] done with job
[2019-02-13T17:08:25.734] debug:  credential for job 780733 revoked

Any clue?
The controller nodes are reachable when the problem occurs, and we can successfully telnet to port 6817 of the controller host from any node, so it doesn't seem to be a network problem.

Regards,
Marcello
Comment 21 Cineca HPC Systems 2019-02-13 10:56:06 MST
Created attachment 9163 [details]
sdiag of the controller when not responding
Comment 22 Alejandro Sanchez 2019-02-14 04:43:20 MST
The netstat attachment shows 255 tcp ctld connections in CLOSE_WAIT state and 1 in ESTABLISHED. The lsof attachment shows the CLOSE_WAIT states mostly occur on tcp connections with compute nodes. This state is defined in both the RFC 793 specification [1] and the Linux implementation [2].

Slurm defines MAX_SERVER_THREADS as 256.

The ctld logs indicate this is galileo. The last Cineca slurm.conf copies we have for marconi, power8 and galileo all have a TCPTimeout of 60. I'm curious what the differentiating factor is such that galileo with ~400 nodes is hit by this problem while marconi with ~7000 isn't.

I guess the timing config values haven't changed and all clusters continue with TCPTimeout=60 and MessageTimeout=100?

I'd also be interested if you have differences in the kernel net tcp params like /proc/sys/net/ipv4/tcp_fin_timeout between clusters. 

Which Slurm version is marconi running?

The gdb backtrace doesn't show information about all threads. I'd also be curious to see 'thread apply all bt full' output to discard any potential lock problems.

An i/o fast path to StateSaveLocation is also critical for ctld performance. Are there differences in this aspect between marconi and galileo?

sdiag shows a non-negligible amount of RPCs by users root, edanovar and slurm000, and a big amount of REQUEST_PARTITION_INFO and REQUEST_NODE_INFO_SINGLE. Is any user/cron/script executing sinfo or 'scontrol show' in a repetitive/heavy fashion?

For whatever reason these tcp connections aren't being closed, and the ctld doesn't accept more after hitting the limit. The compute nodes then try to send more messages, like step completions and similar, but fail to do so.

In bug 6189 NERSC is experiencing something similar on 18.08.3. I attached a patch there to fix the locking around server_thread_count, but they haven't provided feedback yet. We can consider applying it to galileo as well.

[1] https://tools.ietf.org/html/rfc793#page-21
[2] https://github.com/torvalds/linux/blob/master/net/ipv4/tcp.c#L237
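(Editor's note: the state tally above can be reproduced from any netstat capture. The sketch below uses a small hypothetical sample in place of real output; on a live controller you would feed it from `netstat -tan | grep ':6817'` instead.)

```shell
# Tally TCP connection states for the slurmctld port (6817).
# The sample below (hypothetical addresses) stands in for real netstat output;
# on a live controller, pipe in:  netstat -tan | grep ':6817'
sample='tcp        0      0 10.0.0.1:6817   10.23.16.26:45076  CLOSE_WAIT
tcp        0      0 10.0.0.1:6817   10.23.16.32:51172  CLOSE_WAIT
tcp        0      0 10.0.0.1:6817   10.23.16.5:44908   ESTABLISHED'

# Field 6 of each netstat line is the connection state; count and rank them.
summary=$(printf '%s\n' "$sample" |
    awk '{count[$6]++} END {for (s in count) print count[s], s}' |
    sort -rn)
echo "$summary"
```

A controller approaching the hang would show the CLOSE_WAIT count climbing toward the 256-connection ceiling while ESTABLISHED stays near 1.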
Comment 23 Cineca HPC Systems 2019-02-14 07:09:48 MST
(In reply to Alejandro Sanchez from comment #22)
> The netstat attachment shows 255 tcp ctld connections in CLOSE_WAIT state
> and 1 in ESTABLISHED. The lsof attachment shows the CLOSE_WAIT states mostly
> happen in tcp connections with compute nodes. There are definitions of this
> state in either the rfc793 specification[1] and the Linux implementation[2]:
> 
> Slurm defines MAX_SERVER_THREADS as 256.
> 
> The ctld logs indicate this is galileo. Last cineca slurm.conf copies we
> have for either marconi, power8 and galileo all have a TCPTimeout of 60. I'm
> curious as to what's the differentiating factor so that galileo with ~400
> nodes being hit by this problem while marconi with ~7000 isn't.
> 
> I guess the timing config values haven't changed and all clusters continue
> with TCPTimeout=60 and MessageTimeout=100?
> 
> I'd also be interested if you have differences in the kernel net tcp params
> like /proc/sys/net/ipv4/tcp_fin_timeout between clusters. 
> 
> Which Slurm version is marconi running?
> 
> The gdb backtrace doesn't show information about all threads. I'd also be
> curious to see 'thread apply all bt full' output to discard any potential
> lock problems.
> 
> An i/o fast path to StateSaveLocation is also critical for ctld performance.
> Are there differences in this aspect between marconi and galileo?
> 
> sdiag shows a not negligible amount of RPC's by users root, edanovar,
> slurm000 and a big amount of REQUEST_PARTITION_INFO and
> REQUEST_NODE_INFO_SINGKLE. Is any user/cron/script executing sinfo's,
> scontrol show's in a repetitive/heavy fashion?
> 
> For whatever reason these tcp connections aren't being closed and ctld
> doesn't accept more after hitting the limit. Then the compute nodes try to
> send more msgs like step completion and similar but fail to do so.
> 
> In bug 6189 NERSC is experiencing something similar in 18.08.3. I attached a
> patch to fix lockings around server_thread_count but they didn't provide any
> feedback yet. We can consider applying it to galileo as well.
> 
> [1] https://tools.ietf.org/html/rfc793#page-21
> [2] https://github.com/torvalds/linux/blob/master/net/ipv4/tcp.c#L237

Alejandro,

our Slurm versions are as follows:

MARCONI: 18.08.4
GALILEO: 18.08.3
DAVIDE (Power8): 18.08.3

TCPTimeout and MessageTimeout are 60 and 100 on every cluster.
/proc/sys/net/ipv4/tcp_fin_timeout is 60s on every cluster.
OSes for the slurmctld hosts are:

MARCONI and GALILEO: CentOS 7.2
DAVIDE: CentOS 7.5

I'll try gdb on all threads next time the problem happens, have a look at the high number of client requests, and let you know whether I find something.

Regards,
Marcello
Comment 24 Alejandro Sanchez 2019-02-14 08:09:12 MST
With regard to comment 5, that's a separate issue. Will you please file a new sev-4 bug, copy the problem description there, and assign it to me?

It looks like AllocNodes=LINCO_INVENEW,LINCO_tph1,LINCO_galinew,fer_prod1,eli_prod1 contains the values of the configured DenyAccounts=LINCO_INVENEW,LINCO_tph1,LINCO_galinew,fer_prod1,eli_prod1, and DenyAccounts isn't shown in the output of 'scontrol show part'.

Coincidentally, the two mentioned partitions are the only ones with DenyAccounts configured (at least in the last slurm.conf copy we have).

Do you remember doing any 'scontrol update partition' to change fields dynamically?

I'm looking into the logic in here to see how the output is built:

https://github.com/SchedMD/slurm/blob/slurm-18-08-3-1/src/api/partition_info.c#L169
Comment 25 Cineca HPC Systems 2019-03-14 11:14:52 MDT
Created attachment 9573 [details]
gdb trace for all threads
Comment 26 Cineca HPC Systems 2019-03-14 11:16:38 MDT
We have finally been able to get a gdb backtrace of all threads while the problem was happening.

Best Regards,
Marcello
Comment 27 Alejandro Sanchez 2019-03-18 04:31:09 MDT
Hi Marcello,

Can you attach new output of sdiag and slurmctld.log?

1. Is StateSaveLocation=/slurmstate/spool backed by a fast I/O filesystem? As a side note, overall ctld responsiveness under heavy load can be governed by read/write latency on StateSaveLocation.

2. Did you identify anything related to this question?

sdiag shows a not negligible amount of RPC's by users root, edanovar, slurm000 and a big amount of REQUEST_PARTITION_INFO and REQUEST_NODE_INFO_SINGLE. Is any user/cron/script executing sinfo's, scontrol show's in a repetitive/heavy fashion?

3. Can you try adding 'defer' to SchedulerParameters and see if that improves responsiveness?
Comment 28 Cineca HPC Systems 2019-03-18 06:36:09 MDT
Created attachment 9593 [details]
logs at 20190319
Comment 29 Cineca HPC Systems 2019-03-18 08:17:10 MDT
(In reply to Alejandro Sanchez from comment #27)
> Hi Marcello,
> 
> Can you attach new output of sdiag and slurmctld.log?

I've attached logs and netstat output; unfortunately sdiag is not reporting anything:

[root@master ~]# sdiag 
slurm_get_statistics: Socket timed out on send/recv operation


> 
> 1. Is the StateSaveLocation=/slurmstate/spool backed up by a fast I/O
> filesystem? Overall ctld responsiveness under heavy load might be governed
> by latency reading/writing from StateSaveLocation as a side note.

The filesystem is a replicated glusterfs, which shows no problems on MARCONI, which has a much higher job throughput and therefore load.

> 
> 2. Did you identify anything related to this question?

We gave it some thought. The only differences between MARCONI and GALILEO, apart from the Slurm version (18.08.4 on MARCONI, 18.08.3 on GALILEO) and the hardware (GALILEO nodes are slightly older, but still powerful enough), are that the slurmctld hosts on GALILEO are behind a NAT (with appropriate firewall rules) and not directly connected to the slurmdbd. We have also noticed that GALILEO sometimes closes the connection to the slurmdbd and is not able to recover it, although slurmctld keeps working:

Mar 17 19:02:51 r000u17l01 slurmdbd[53521]: debug:  cluster galileo has disconnected

Moreover, in the last netstat output you can see many packets waiting in the Recv-Q, but this may simply be because slurmctld is hanging and so are its connections...


> 
> sdiag shows a not negligible amount of RPC's by users root, edanovar,
> slurm000 and a big amount of REQUEST_PARTITION_INFO and
> REQUEST_NODE_INFO_SINGLE. Is any user/cron/script executing sinfo's,
> scontrol show's in a repetitive/heavy fashion?

We haven't found anything unusual; MARCONI is handling hundreds of times more requests, to say the least, with no apparent problem.


> 
> 3. Can you try adding 'defer' to SchedulerParameters and see if that
> improves responsiveness?

Added to the configuration.

Regards,
Comment 30 Alejandro Sanchez 2019-03-18 08:49:16 MDT
These two comments:

(In reply to Cineca HPC Systems from comment #20)
> As expected the problem appeared again, this time we had a little time to
> examine it.
> At first slurmctld become unresponsive from other host's client (s* commands
> start to hang), BUT on the slurmctld the commands still responds.

(In reply to Cineca HPC Systems from comment #29)
> We gave a thought about it, the only difference between MARCONI and GALILEO
> apart from the slurm version (18.08.4 on MARCONI, 18.08.3 on GALILEO) and HW
> (GALILEO nodes are slighly older, but still powerful enough) is that
> slurmctld hosts on GALILEO are behind a NAT (with appropriate firewall
> rules) and not directly connected to the slurmdbd.

make me wonder whether that is the differentiating factor impacting ctld performance. If you run 'sdiag' on the host where the ctld runs, do you get output?
Comment 31 Alejandro Sanchez 2019-03-19 03:13:57 MDT
Good morning Marcello,

did you have the chance to run s* client commands from the SlurmctldHost?
Comment 32 Cineca HPC Systems 2019-03-19 07:41:15 MDT
Created attachment 9615 [details]
galileo sdiag output
Comment 33 Cineca HPC Systems 2019-03-19 07:41:41 MDT
Created attachment 9616 [details]
marconi sdiag output
Comment 34 Cineca HPC Systems 2019-03-19 07:43:04 MDT
(In reply to Alejandro Sanchez from comment #30)
> These two comments:
> 
> (In reply to Cineca HPC Systems from comment #20)
> > As expected the problem appeared again, this time we had a little time to
> > examine it.
> > At first slurmctld become unresponsive from other host's client (s* commands
> > start to hang), BUT on the slurmctld the commands still responds.
> 
> (In reply to Cineca HPC Systems from comment #29)
> > We gave a thought about it, the only difference between MARCONI and GALILEO
> > apart from the slurm version (18.08.4 on MARCONI, 18.08.3 on GALILEO) and HW
> > (GALILEO nodes are slighly older, but still powerful enough) is that
> > slurmctld hosts on GALILEO are behind a NAT (with appropriate firewall
> > rules) and not directly connected to the slurmdbd.
> 
> make me think if that is making the difference and impacting performance of
> ctld. If you run 'sdiag' on the host where ctld runs do you get the output?

When it's stuck, all the s* commands time out, no matter the host.
Otherwise it's working fine. Attached you can find the sdiag output.
For comparison I've also attached the MARCONI sdiag output, which shows that activity on GALILEO is quite normal.

Regards,
Marcello
Comment 35 Alejandro Sanchez 2019-03-19 10:26:42 MDT
I have a couple more questions:

1. Do any of your Prolog, PrologSlurmctld or TaskProlog scripts execute scontrol, sinfo, or anything else generating RPCs?

2. Are any of your users submitting jobs in a tight loop instead of using job arrays?
Comment 36 Cineca HPC Systems 2019-03-21 11:03:42 MDT
(In reply to Alejandro Sanchez from comment #35)
> I have a couple more questions:
> 
> 1. Do any of your Prolog, PrologSlurmctld, TaskProlog execute scontrol,
> sinfo or other generating RPC's?

We have just one PrologSlurmctld script, which runs at submit time; it performs an "squeue" command to obtain the partition the job was submitted to, and also 2 "scontrol" commands to hold the job if the budget has expired. We don't know if this is the right way to do it or if there is a more efficient way, but on MARCONI it has worked just fine until now.

> 
> 2. Are any of your users submitting jobs in a tight loop instead of using
> job arrays?

No, this doesn't seem to be the case.


However, today the problem happened 3 times, and luckily we were able to observe it while it was going on. What happens is that at first only some commands (squeue and sinfo, for example) stop working, while sdiag keeps working. This gave us the opportunity to dump sdiag information as the thread count was growing (while no pending RPCs were present!!!) and also to trace all processes before hitting the 256-thread limit.
What we've seen is that during this time slurmctld wasn't scheduling any jobs and kept accepting all requests coming from the clients.
The problem seems to originate from the inability of the slurmctld threads to close their connections, which remain in CLOSE_WAIT state indefinitely, preventing the threads from exiting and increasing the thread count until everything gets stuck.
Attached you will find all the information we've gathered.

Regards,
Marcello
Comment 37 Cineca HPC Systems 2019-03-21 11:04:13 MDT
Created attachment 9655 [details]
sdiag output during hang
Comment 38 Cineca HPC Systems 2019-03-21 11:09:43 MDT
Created attachment 9656 [details]
gdb trace for all threads during hang
Comment 39 Alejandro Sanchez 2019-03-21 11:19:37 MDT
Ah, after inspecting your new backtraces I think I know what's going on.

Thread 256 (Thread 0x7f9e566cf700 (LWP 16165)):
#0  0x00007f9e9e5f1a39 in __strchr_sse42 () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f9e9ed1b60e in _get_next_job_gres (in_val=0x0, cnt=cnt@entry=0x7f9e566ce498, gres_list=0x161d450, save_ptr=save_ptr@entry=0x7f9e566ce4a0, rc=rc@entry=0x7f9e566ce494) at gres.c:3226
        prev_save_ptr = 0x7f9d540022f0 "gpu:kepler:N"

I think a malformed --gres request is triggering the ctld hang... which happens to have been fixed in Slurm 18.08.4 (and thus Marconi isn't affected, because it runs with the fix):

https://github.com/SchedMD/slurm/commit/80e2cc41f86fc.patch

Will you try upgrading Galileo to the latest 18.08 release, Slurm 18.08.6-2, or at least apply this patch standalone, and see if that fixes the issue?

Can you also ask user sbonechi if he is requesting --gres=gpu:kepler:N ?

Thanks.
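(Editor's note: the hang pattern in the backtrace, `_get_next_job_gres` stuck with `prev_save_ptr = "gpu:kepler:N"`, is consistent with a tokenizer that never consumes a token whose count field is malformed. The sketch below is not Slurm's code; `parse_gres` and its field handling are hypothetical. It only illustrates why validating and always consuming each token avoids the spin.)

```python
def parse_gres(spec: str):
    """Parse a comma-separated gres request like 'gpu:2,gpu:kepler:2'.

    Each token is name[:type][:count]. A loop that only advances its
    save pointer after a *successful* parse would spin forever on a
    malformed count such as 'gpu:kepler:N' (the hang in the backtrace);
    validating and always consuming each token avoids that.
    """
    results = []
    for token in filter(None, spec.split(",")):
        fields = token.split(":")
        name, gtype, count = fields[0], None, 1
        if len(fields) == 2:
            # the middle field is ambiguous: a number means a count,
            # anything else is a type name
            if fields[1].isdigit():
                count = int(fields[1])
            else:
                gtype = fields[1]
        elif len(fields) == 3:
            gtype = fields[1]
            if not fields[2].isdigit():
                # reject the token instead of retrying it forever
                raise ValueError(f"invalid gres count in {token!r}")
            count = int(fields[2])
        results.append((name, gtype, count))
    return results
```

So 'gpu:kepler:2' parses cleanly while 'gpu:kepler:N' raises instead of looping, mirroring the post-18.08.4 behaviour where the malformed request is rejected at submission (as comment 43 later confirms).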
Comment 40 Alejandro Sanchez 2019-03-22 09:25:56 MDT
Hi. Any feedback from this? could you apply the patch or upgrade? thanks.
Comment 41 Cineca HPC Systems 2019-03-22 10:01:11 MDT
(In reply to Alejandro Sanchez from comment #40)
> Hi. Any feedback from this? could you apply the patch or upgrade? thanks.

We're discussing the possibility of performing a rolling upgrade without interrupting production.
We'll let you know.

Best Regards,
Marcello
Comment 42 Alejandro Sanchez 2019-03-27 03:50:33 MDT
(In reply to Cineca HPC Systems from comment #41)
> (In reply to Alejandro Sanchez from comment #40)
> > Hi. Any feedback from this? could you apply the patch or upgrade? thanks.
> 
> We're discussing the opportunity to perform a rolling upgrade without
> interrupting the production.
> We'll let you know,
> 
> Best Regards,
> Marcello

All right. I'm lowering the severity until you can confirm this is no longer an issue once you run Slurm with that patch in place.
Comment 43 Cineca HPC Systems 2019-04-04 05:36:02 MDT
(In reply to Alejandro Sanchez from comment #42)
> (In reply to Cineca HPC Systems from comment #41)
> > (In reply to Alejandro Sanchez from comment #40)
> > > Hi. Any feedback from this? could you apply the patch or upgrade? thanks.
> > 
> > We're discussing the opportunity to perform a rolling upgrade without
> > interrupting the production.
> > We'll let you know,
> > 
> > Best Regards,
> > Marcello
> 
> All right. I'm lowering the severity until you can confirm this is no longer
> an issue once you run Slurm with that patch in place.

We have upgraded to 18.08.6-2; now, when a job is submitted with the wrong --gres specification, the controller just rejects the job and the thread count doesn't increase:

[mmorgott@node165 ~]$ srun --nodes=1 --ntasks-per-node=4 --account=cin_staff --partition=gll_usr_gpuprod --gres=gpu:kepler:N --pty /bin/bash
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification


We'll keep monitoring, but I think we can consider this problem solved.

Regards,
Marcello
Comment 44 Alejandro Sanchez 2019-04-10 08:23:16 MDT
Hi Marcello. Can we close this? Thanks.
Comment 45 Alejandro Sanchez 2019-05-01 05:48:28 MDT
I'm going to close this as a duplicate of bug 6077, which contains the referenced fix.

Please, reopen if you need anything else from here.

Thanks.

*** This ticket has been marked as a duplicate of ticket 6077 ***