Our SLURM controller is currently hung and unresponsive. Killing and restarting the daemon yields the same state. The log has several messages like the following:

[2020-06-10T16:26:59.950] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=63512707_27(63515356), this should never happen
[2020-06-10T16:26:59.951] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=63515913, this should never happen
[2020-06-10T16:26:59.951] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=63512859, this should never happen

I'll attach the full slurmctld log shortly.
Created attachment 14625 [details] Slurmctld log
That message is a known bug in 19.05, but (to my knowledge) it's harmless unless you shrink a batch job. See bug 7499 for more details. In the meantime, I'll study the slurmctld log. Can you also run gcore on the running slurmctld to get a core dump? Then run these gdb commands against it:

thread apply all backtrace
thread apply all backtrace full
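In case it's useful, here's a rough sketch of capturing that non-interactively (the binary and output paths are assumptions; adjust for your install):

# Dump a core of the running slurmctld without killing it
gcore -o /tmp/slurmctld.core $(pidof slurmctld)

# Collect both backtrace listings in one batch run
gdb -batch -ex 'set pagination off' \
    -ex 'thread apply all backtrace' \
    -ex 'thread apply all backtrace full' \
    /usr/sbin/slurmctld /tmp/slurmctld.core.$(pidof slurmctld) > slurmctld-gdb.txt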
Also, I would encourage you to upgrade to the latest 19.05 release. We always encourage running the latest micro version, since it has the most bug fixes. But before you do that, I'll analyze the logs and the backtrace once you have them.
I'm tailing the logs and it seems like it's not completely hung, but is cycling between doing a little bit of work and hanging. It's not currently responsive to user commands.
Created attachment 14626 [details] gdb
Thanks, I'll take a look at the gdb output. Can you also upload your current slurm.conf? From my first brief look at the logs, there are tons of these two types of messages:

[2020-06-10T16:40:30.717] cleanup_completing: JobId=63512030 completion process took 1922 seconds
[2020-06-10T16:42:59.513] Resending TERMINATE_JOB request JobId=63178784 Nodelist=css-126

My first idea is to add SchedulerParameters=defer to slurm.conf, and also to set the state of all partitions to DRAIN. defer forces slurmctld not to try to schedule jobs on job completion or submission, and draining the partitions won't cancel any currently running jobs but will keep slurmctld from scheduling new ones. Of course this isn't desirable long term, but it might give slurmctld some time to catch up with the traffic. If slurmctld ever does respond to client commands, can you run sdiag and upload the output? When you do, don't run client commands in a loop without any sleeps in between, so as not to effectively DDoS the slurmctld.
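For reference, a minimal sketch of those slurm.conf changes (the partition names and node lists below are placeholders; keep your existing partition options and only change State):

# Don't attempt to schedule on every job submission/completion
SchedulerParameters=defer

# Running jobs continue, but nothing new gets scheduled
PartitionName=general State=DRAIN Nodes=...
PartitionName=gpu State=DRAIN Nodes=...

Once the controller catches up, the partitions can go back to State=UP.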
Created attachment 14627 [details] slurm.conf
Here's the output of sdiag:

*******************************************************
sdiag output at Wed Jun 10 17:16:54 2020 (1591823814)
Data since      Wed Dec 31 19:00:00 1969 (0)
*******************************************************
Server thread count:  123
Agent queue size:     0
Agent count:          0
DBD Agent queue size: 0
Jobs submitted: 0
Jobs started:   0
Jobs completed: 0
Jobs canceled:  0
Jobs failed:    0
Job states ts:  Wed Dec 31 19:00:00 1969 (0)
Jobs pending:   0
Jobs running:   0

Main schedule statistics (microseconds):
	Last cycle:   0
	Max cycle:    0
	Total cycles: 0
	Cycles per minute: 0
	Last queue length: 0

Backfilling stats
	Total backfilled jobs (since last slurm start): 0
	Total backfilled jobs (since last stats cycle start): 0
	Total backfilled heterogeneous job components: 0
	Total cycles: 0
	Last cycle when: Wed Dec 31 19:00:00 1969 (0)
	Last cycle: 0
	Max cycle: 0
	Last depth cycle: 0
	Last depth cycle (try sched): 0
	Last queue length: 0

Latency for 1000 calls to gettimeofday(): 24 microseconds

Remote Procedure Call statistics by message type
	REQUEST_PARTITION_INFO        ( 2009) count:115 ave_time:6550    total_time:753316
	REQUEST_NODE_INFO_SINGLE      ( 2040) count:86  ave_time:991     total_time:85281
	REQUEST_FED_INFO              ( 2049) count:23  ave_time:2667    total_time:61343
	REQUEST_JOB_USER_INFO         ( 2039) count:19  ave_time:8075    total_time:153428
	REQUEST_NODE_INFO             ( 2007) count:7   ave_time:1145801 total_time:8020611
	REQUEST_STATS_INFO            ( 2035) count:3   ave_time:857     total_time:2573
	REQUEST_BUILD_INFO            ( 2001) count:3   ave_time:5127    total_time:15382
	REQUEST_PING                  ( 1008) count:3   ave_time:15771   total_time:47313
	REQUEST_JOB_INFO_SINGLE       ( 2021) count:3   ave_time:184     total_time:554
	REQUEST_SHARE_INFO            ( 2022) count:2   ave_time:520024  total_time:1040048
	MESSAGE_EPILOG_COMPLETE       ( 6012) count:2   ave_time:3410    total_time:6820
	REQUEST_PRIORITY_FACTORS      ( 2026) count:1   ave_time:5896735 total_time:5896735
	REQUEST_JOB_INFO              ( 2003) count:1   ave_time:3954408 total_time:3954408
	REQUEST_ASSOC_MGR_INFO        ( 2043) count:1   ave_time:6153969 total_time:6153969
	REQUEST_SUBMIT_BATCH_JOB      ( 4003) count:0   ave_time:0       total_time:0
	REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:0   ave_time:0       total_time:0

Remote Procedure Call statistics by user
	nobody   (     997) count:172 ave_time:1351    total_time:232435
	rhtoczyd ( 1035122) count:30  ave_time:17306   total_time:519191
	root     (       0) count:21  ave_time:854235  total_time:17938944
	lotzsamu (  758827) count:16  ave_time:14536   total_time:232586
	keenandr (  256005) count:8   ave_time:495750  total_time:3966002
	jal      (  175025) count:6   ave_time:580     total_time:3483
	vanant25 (  991427) count:3   ave_time:1082378 total_time:3247135
	benucci  (  833945) count:3   ave_time:2210    total_time:6632
	mamooz   (  961043) count:3   ave_time:3546    total_time:10640
	wuyunqiq (  918863) count:3   ave_time:1966    total_time:5900
	jansenn1 ( 1027710) count:3   ave_time:6926    total_time:20778
	bazavov  (  874719) count:1   ave_time:8055    total_time:8055
	sanch373 ( 1002301) count:0   ave_time:0       total_time:0
	rysewykd (  830611) count:0   ave_time:0       total_time:0
	nisamehr ( 1025924) count:0   ave_time:0       total_time:0
	pelicion ( 1028178) count:0   ave_time:0       total_time:0
	rahnamou ( 1036377) count:0   ave_time:0       total_time:0
	koppercl (  988288) count:0   ave_time:0       total_time:0
	sinzpaul (  921266) count:0   ave_time:0       total_time:0
	tdeyoung (  799192) count:0   ave_time:0       total_time:0
	baclark  ( 1035984) count:0   ave_time:0       total_time:0
	greyjosh ( 1037610) count:0   ave_time:0       total_time:0
	peiskera (  876366) count:0   ave_time:0       total_time:0
	nowick70 (  996635) count:0   ave_time:0       total_time:0
	twagiray (  961133) count:0   ave_time:0       total_time:0
	nguye639 (  949996) count:0   ave_time:0       total_time:0
	kowalc40 (  936676) count:0   ave_time:0       total_time:0
	priesbr1 (  929898) count:0   ave_time:0       total_time:0
	yarranto (  767367) count:0   ave_time:0       total_time:0
	fordste5 (  885046) count:0   ave_time:0       total_time:0

Pending RPC statistics
	No pending RPCs
From the gdb output, there are 268 threads right now:

  231  _slurm_rpc_epilog_complete
   15  _slurm_rpc_complete_batch_script

All the epilog complete RPCs are waiting in _throttle_start() on this line:

slurm_cond_wait(&throttle_cond, &throttle_mutex);

It looks like a storm of job completions is slowing down slurmctld right now, so I definitely would like to try setting defer and setting all partitions to DRAIN in slurm.conf. Restart slurmctld to make the changes take effect, since scontrol reconfigure probably won't get through while it's this unresponsive.
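For reference, that per-RPC thread tally came from the backtrace dump; something like this reproduces it (the file name matches the earlier capture sketch and is an assumption):

# Rough count of threads per RPC handler in the saved backtraces
grep -oE '_slurm_rpc_[a-z_]+' slurmctld-gdb.txt | sort | uniq -c | sort -rn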
Looking at sdiag:

*******************************************************
sdiag output at Wed Jun 10 17:16:54 2020 (1591823814)
Data since      Wed Dec 31 19:00:00 1969 (0)
*******************************************************

Data since the Unix epoch? I'm guessing you got the sdiag output before that value in slurmctld was updated, which probably means not much time had passed since a restart. With this in mind, the data covers only what happened since then.

I'm concerned about the top RPCs and top user:

Remote Procedure Call statistics by message type
	REQUEST_PARTITION_INFO   ( 2009) count:115 ave_time:6550 total_time:753316
	REQUEST_NODE_INFO_SINGLE ( 2040) count:86  ave_time:991  total_time:85281

Remote Procedure Call statistics by user
	nobody   (     997) count:172 ave_time:1351   total_time:232435
	rhtoczyd ( 1035122) count:30  ave_time:17306  total_time:519191
	root     (       0) count:21  ave_time:854235 total_time:17938944

Since you probably don't have a user named "nobody", this means that getpwuid_r returned an error for that UID. I'm also concerned that the top RPCs are requests for partition info and node info. Is somebody running a script that calls squeue or scontrol in a loop?

It's also interesting that very few new MESSAGE_EPILOG_COMPLETE messages have come in since the last restart, even though that was by far the most common RPC in the gdb backtrace. Obviously some threads could have completed and different RPCs arrived in the meantime.

Are you able to get a new gcore and run those gdb commands again, maybe waiting a few minutes after restarting slurmctld with the changes from my last comment? Let me know if slurmctld starts responding better or worse.
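One low-impact way to look for a client loop is to sample sdiag twice with a generous sleep and compare the per-user counts (the interval here is arbitrary):

sdiag > sdiag.before
sleep 120
sdiag > sdiag.after
diff sdiag.before sdiag.after | grep count:

Whichever user's count climbs fastest between samples is the likely source of the traffic.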
Do the job prolog(s)/epilog(s) in /etc/slurm/prolog.d/* and /etc/slurm/epilog.d/* call squeue/sinfo/scontrol at all? That could aggravate the problem when too many jobs complete at once, and likewise for such calls in the job prolog when too many jobs start at once. We've seen other sites hit this issue.
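If it's quicker, a one-line audit of those directories for client calls (paths are the ones named above):

grep -rnE 'squeue|sinfo|scontrol' /etc/slurm/prolog.d /etc/slurm/epilog.d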
Created attachment 14628 [details] gdb
Steve, is it possible to apply this patch and restart slurmctld?

https://github.com/SchedMD/slurm/commit/f17cf91ccc56ccf87
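A sketch of one way to apply it, assuming your existing rpmbuild tree (appending .patch to a GitHub commit URL downloads the commit as a patch file):

cd /root/rpmbuild/BUILD/slurm-19.05.1-4
curl -L https://github.com/SchedMD/slurm/commit/f17cf91ccc56ccf87.patch | patch -p1
# then rebuild and restart slurmctld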
I think the xstrfmtcatat function was added in a newer version of SLURM:

/root/rpmbuild/BUILD/slurm-19.05.1-4/src/slurmctld/job_mgr.c:10612: undefined reference to `xstrfmtcatat'
We've had all partitions set to DRAIN and defer set in SchedulerParameters for some time now. I'm not sure that many jobs completing is what's causing this. Judging by "_job_complete" entries in the logs, job completion has been lower today than during past spikes that caused no issue. Our prolog scripts do call slurm commands, but if the prologs were the issue, I would expect the system to recover when drained. The epilog doesn't call any slurm commands.
(In reply to Steve Ford from comment #16)
> I think the xstrfmtcatat function was added in a newer version of SLURM:
>
> /root/rpmbuild/BUILD/slurm-19.05.1-4/src/slurmctld/job_mgr.c:10612:
> undefined reference to `xstrfmtcatat'

This looks like a great reason to upgrade to slurm-19-05-7-1, as xstrfmtcatat() was added in slurm-19-05-2-1. You should be able to safely upgrade slurmdbd and slurmctld for now and then push the upgrade out to the nodes eventually.

It looks like the issue may be caused by user kowal165 (or whoever owns "/mnt/ufs18/home-049/kowal165/test_sb_files/test1000.sb"). From the core in comment #12, can you please provide this gdb output:

> set pagination off
> set print pretty on
> t 263
> f 2
> p *job_ptr
> p *job_ptr->details
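For the upgrade itself, a rough sketch of the RPM route (the tarball name and paths are assumptions; the documented order is slurmdbd first, then slurmctld, then the slurmd nodes):

# Rebuild RPMs from the new release tarball
rpmbuild -ta slurm-19.05.7.tar.bz2

# Upgrade the relevant packages on the dbd host, then the controller host
rpm -Uvh /root/rpmbuild/RPMS/x86_64/slurm-*19.05.7*.rpm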
(gdb) t 263
[Switching to thread 263 (Thread 0x7f662a09d700 (LWP 53162))]
#0  0x00007f6637750995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) f 2
#2  0x000000000048badf in _slurm_rpc_step_complete (msg=0x7f662a09ce50, running_composite=<optimized out>) at proc_req.c:3768
3768	proc_req.c: No such file or directory.
(gdb) p *job_ptr
No symbol "job_ptr" in current context.
(gdb) p *job_ptr->details
No symbol "job_ptr" in current context.

I will begin upgrading to 19.05.7 with your suggested patch.
Hey Steve, can you save a copy of the old slurmctld binary before upgrading, so we can pull more information out of the core later?
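Something like this is enough (the destination path is arbitrary):

cp -p /usr/sbin/slurmctld /root/slurmctld-19.05.1.bak

gdb needs the exact binary that produced a core to resolve symbols correctly, so keeping the pre-upgrade binary lets us keep mining the old cores.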
Also did you run those gdb commands on the second core? If so that explains why the output was nonsensical. Can you type thread 264 instead? That's the thread we're looking for in the second core:

Thread 264 (Thread 0x7f65aaa9c700 (LWP 53163)):
#0  0x00007f6637ce2b8c in makespace (str=str@entry=0x7f65aaa9b8c0, needed=11) at xstring.c:106
#1  0x00007f6637ce2d75 in _xstrcat (str1=str1@entry=0x7f65aaa9b8c0, str2=<optimized out>) at xstring.c:134
#2  0x00000000004401fd in _pack_default_job_details (job_ptr=job_ptr@entry=0x26bb5f60, buffer=buffer@entry=0x7f65e0011e30, protocol_version=protocol_version@entry=8704) at job_mgr.c:10614
We're up with that patch on 19.05.7. Here is the gdb info for thread 264:

(gdb) t 264
[Switching to thread 264 (Thread 0x7f65aaa9c700 (LWP 53163))]
#0  xstrcasecmp (s1=0x7f65e0007ff9 "b test1745.sb test1746.sb test1747.sb test1748.sb test1749.sb test174.sb test1750.sb test1751.sb test1752.sb test1753.sb test1754.sb test1755.sb test1756.sb test1757.sb test1758.sb test1759.sb test175"..., s2=0xb <Address 0xb out of bounds>) at xstring.c:605
605	xstring.c: No such file or directory.
(gdb) f 2
#2  0x0000000026bcd790 in ?? ()
(gdb) p *job_ptr
No symbol "job_ptr" in current context.
(gdb) p *job_ptr->details
No symbol "job_ptr" in current context.

I did point it at the saved copy of the old slurmctld.
Oh well, too bad about the weird backtrace/core file. It does look like that job has a lot of arguments; the arguments that do appear in the gdb output are cut off, and that thread is in the problematic part of the code. By "up" with the patch, do you mean slurmctld is running and responsive, and partitions are up and scheduling again?
Yes. Slurmctld is running and responsive and all partitions are up and jobs are scheduling.
Bumping the priority down based on comment #25
Good morning Steve, is slurmctld still up and running smoothly today? If so, I'll mark this as a duplicate of bug 8978. Many thanks to Nate for recognizing this as a duplicate.
Yes, slurmctld is still up and running smoothly.
Great to hear. I'll close this as a dup of bug 8978. *** This ticket has been marked as a duplicate of ticket 8978 ***