| Summary: | slurmctlds stuck after restart | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Cineca HPC Systems <hpc-sysmgt-info> |
| Component: | slurmctld | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 1 - System not usable | ||
| Priority: | --- | CC: | nate |
| Version: | 20.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Cineca | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurmctld logs for primary controller; slurmctld logs for secondary controller; slurm configuration files | | |
Please attach the slurmctld log and slurm.conf.
Please call gcore against slurmctld and provide a backtrace:
> pgrep slurmctld | xargs gcore
> gdb -ex 't a a bt full' -c $PATH_TO_CORE $(which slurmctld)
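For reference, a minimal sketch of running the above end to end, assuming gdb and slurmctld debug symbols are installed (the loop and the output file names are illustrative, not part of the original instructions):
> # write a core.<pid> file for every running slurmctld process (in the current directory)
> pgrep slurmctld | xargs gcore
> # dump full backtraces of all threads from each core into a text file that can be attached here
> for c in core.*; do gdb -batch -ex 'thread apply all bt full' $(which slurmctld) "$c" > "$c.bt.txt"; done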
Created attachment 19195 [details]
slurmctld logs for primary controller
Created attachment 19196 [details]
slurmctld logs for secondary controller
Any luck with the backtrace? Please also attach slurm.conf (& friends).
Please open a new bug for this:
> slurmctld[46191]: error: High latency for 1000 calls to gettimeofday(): 707 microseconds
Please open a new bug for this too:
> slurmctld[47725]: error: Can't find parent id 21773 for assoc 21774, this should never happen.
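For context, the association ids in that message come from the accounting database; one hedged way to inspect the association hierarchy they refer to is via sacctmgr (assuming accounting is enabled; the exact format fields may vary between Slurm versions):
> # show the association tree with ids, to spot entries whose parent is missing
> sacctmgr show associations tree format=Cluster,Account,User,ID,ParentID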
(In reply to Nate Rini from comment #1)
> Please attach the slurmctld log and slurm.conf.
>
> Please call gcore against slurmctld and provide a backtrace:
> > pgrep slurmctld | xargs gcore
> > gdb -ex 't a a bt full' -c $PATH_TO_CORE $(which slurmctld)

Nate,

thanks a lot for the quick response. Luckily we seem to have solved the problem on our own.

Unfortunately we were not able to get the necessary debugging info with gcore because, after yet another restart, the slurmctld was able to complete startup.

Our best guess is that something was going wrong on the shared filesystem (glusterfs) which hosts the /slurmstate directory. Unfortunately the log files are not helpful in this case either, as no problem is reported (or was reported in the last months).

About the extremely long latency for gettimeofday() calls: it was caused by the clocksource being set to "hpet". Setting it to "tsc" reduced the latency by a factor of 10 and also seems, for some reason or by pure coincidence, to have 'unlocked' the filesystem problem.

I have uploaded the logs for analysis. Feel free to lower the priority of the issue or even close it, since the problem is solved for us. We are available if you want to perform further analysis.

Best Regards,
Marcello

(In reply to Cineca HPC Systems from comment #8)
> thanks a lot for the quick response. Luckily we seem to have solved the
> problem on our own.
>
> Unfortunately we were not able to get the necessary debugging info with
> gcore because, after yet another restart, the slurmctld was able to
> complete startup.

That is unfortunate.

> Our best guess is that something was going wrong on the shared filesystem
> (glusterfs) which hosts the /slurmstate directory. Unfortunately the log
> files are not helpful in this case either, as no problem is reported (or
> was reported in the last months).

In events like this, having slurmctld report to syslog instead of directly to a file may be beneficial.

> About the extremely long latency for gettimeofday() calls: it was caused by
> the clocksource being set to "hpet". Setting it to "tsc" reduced the latency
> by a factor of 10 and also seems, for some reason or by pure coincidence, to
> have 'unlocked' the filesystem problem.

That sounds like a fun bug for the gluster developers.

> I have uploaded the logs for analysis. Feel free to lower the priority of
> the issue or even close it, since the problem is solved for us. We are
> available if you want to perform further analysis.

slurmctld requires the filesystem hosting StateSaveLocation to be healthy and will generally fatal out if it is not. If this happens again, please update the ticket and we can continue debugging. The logs were essentially clean except for the other issues I pointed out, and those are unlikely to cause a crash.

I did also notice that the logs were at a high debug level and had multiple DebugFlags active. I would suggest returning to "verbose" and turning off those DebugFlags during normal production; a high level of logging will needlessly slow down Slurm.

Created attachment 19206 [details]
slurm configuration files
These are the slurm configuration files.
We are running a configless setup.
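As a side note on the hpet vs. tsc clocksource change described above, a minimal sketch for checking and switching the kernel clocksource at runtime (standard Linux sysfs paths; the echo is not persistent, and a kernel boot parameter such as clocksource=tsc would be needed to make it permanent):
> # show the clocksource currently in use and the ones the kernel offers
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> cat /sys/devices/system/clocksource/clocksource0/available_clocksource
> # switch to tsc at runtime (as root)
> echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource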
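On the logging suggestion above, a minimal illustrative slurm.conf excerpt for normal production; the parameter names are standard slurm.conf options, but the values are examples rather than this site's actual configuration:

# slurm.conf (excerpt)
SlurmctldDebug=verbose        # normal production verbosity
SlurmctldSyslogDebug=info     # also report to syslog, useful when the log filesystem misbehaves
# DebugFlags=...              # leave DebugFlags unset unless actively debugging a subsystem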
Dear Support,

we are experiencing a major problem after an ordinary configuration change (made to disable the slurmctld Prolog/Epilog) and the consequent slurmctld restart. What happens is the following:

- we start one of the slurmctld services
- the slurmctld starts up apparently correctly
- some of the s* commands respond properly: sdiag, scontrol ping, scontrol show conf
- most of the s* commands hang and contribute to reaching the maximum thread number: squeue, sinfo (which however seem to work in the very early stages... let's say the very first 30 seconds)

After a while the thread count reaches 256 and the slurmctld cannot be queried anymore. We see some strange messages such as:

slurmctld[43346]: error: High latency for 1000 calls to gettimeofday(): 737 microseconds

The sdiag output also seems to suggest there are no pending jobs (even though there should be hundreds if not thousands) and that the data are from the epoch:

[root@r000u17l01 ~]# sdiag
*******************************************************
sdiag output at Thu Apr 29 16:26:26 2021 (1619706386)
Data since      Thu Jan 01 01:00:00 1970 (0)
*******************************************************
Server thread count:  50
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 0
Jobs started:   0
Jobs completed: 0
Jobs canceled:  0
Jobs failed:    0

Job states ts:  Thu Jan 01 01:00:00 1970 (0)
Jobs pending:   0
Jobs running:   0

Main schedule statistics (microseconds):
        Last cycle:   0
        Max cycle:    0
        Total cycles: 0
        Cycles per minute: 0
        Last queue length: 0

Backfilling stats
        Total backfilled jobs (since last slurm start): 0
        Total backfilled jobs (since last stats cycle start): 0
        Total backfilled heterogeneous job components: 0
        Total cycles: 0
        Last cycle when: N/A
        Last cycle: 0
        Max cycle: 0
        Last depth cycle: 0
        Last depth cycle (try sched): 0
        Last queue length: 0
        Last table size: 0
        Latency for 1000 calls to gettimeofday(): 737 microseconds

Remote Procedure Call statistics by message type
        REQUEST_FED_INFO ( 2049) count:11 ave_time:352 total_time:3877

Remote Procedure Call statistics by user
        a07cmc00 ( 12522) count:7 ave_time:345 total_time:2419
        jalguaci ( 29828) count:2 ave_time:355 total_time:711
        akolsek0 ( 29428) count:1 ave_time:362 total_time:362
        dbaratel ( 31670) count:1 ave_time:385 total_time:385
        root ( 0) count:0 ave_time:0 total_time:0

Pending RPC statistics
        No pending RPCs

Can you help us debug and solve the issue?

Thanks in advance,
Marcello
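Not part of the original report, but a small sketch of how the symptoms above could be watched while they occur; the StateSaveLocation path (/slurmstate) is taken from the comments earlier in this ticket and the probe file name is illustrative:
> # watch the slurmctld server thread count reported by sdiag (the 256 limit mentioned above)
> watch -n 10 'sdiag | grep "Server thread count"'
> # confirm the state save directory, then do a crude synchronous write-latency probe on it
> scontrol show config | grep -i StateSaveLocation
> time dd if=/dev/zero of=/slurmstate/.latency_probe bs=4k count=1 oflag=sync
> rm -f /slurmstate/.latency_probe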