Created attachment 14727 [details]
slurm.conf

We upgraded from Slurm 19.05.5 to 20.02.3 during a recent downtime. About 24 hours after the system was made available to the users again, slurmctld got stuck at 100% CPU load and all other threads stopped working. slurmctld no longer responded to any queries and did not schedule any new jobs. This issue persisted across restarts, with slurmctld becoming unresponsive and getting stuck at 100% CPU load again within seconds to minutes.

Since the thread consuming 100% CPU was named "bckfl", we briefly tried switching to SchedulerType=sched/builtin. This did not solve the problem, but the name of the thread stuck at 100% CPU changed to "sched", so the root cause seems to lie in some common functionality and not necessarily in sched/backfill.

We used DebugFlags=Backfill,SchedType to further debug the issue. The last lines in the log file before slurmctld got stuck were:

[2020-06-18T15:55:34.683] backfill: beginning
[2020-06-18T15:55:34.685] backfill test for JobId=18314566 Prio=3848 Partition=single

Cancelling job 18314566 actually fixed the issue (for the moment).
This is the output of "scontrol show job 18314566" (with some sensitive information redacted):

JobId=18314566 JobName=XXXXXXXXXXX
   UserId=XXXXXX(XXXX) GroupId=XXXXXXX(XXXXXXX) MCS_label=N/A
   Priority=3854 Nice=0 Account=kit QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-06-18T15:54:46 EligibleTime=2020-06-18T15:54:46
   AccrueTime=2020-06-18T15:54:46
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-06-18T15:54:50
   Partition=single AllocNode:Sid=uc2n997:34521
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=40 NumTasks=40 CPUs/Task=N/A ReqB:S:C:T=0:0:*:*
   TRES=cpu=40,mem=45000M,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
   MinCPUsNode=40 MinMemoryCPU=1125M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Power=
   MailUser=(null) MailType=NONE

I also gathered a gdb backtrace of the thread at fault while it was stuck (all other threads were hanging in pthread_cond_timedwait or pthread_rwlock_wrlock):

Thread 58 (Thread 0x7f27b29dd700 (LWP 54041)):
#0  0x00007f27b4dd81e6 in _compute_plane_dist (gres_task_limit=0x0, job_ptr=0x23cd090) at dist_tasks.c:391
#1  dist_tasks (job_ptr=0x23cd090, cr_type=<optimized out>, preempt_mode=<optimized out>, core_array=<optimized out>, gres_task_limit=0x0) at dist_tasks.c:1209
#2  0x00007f27b4ddd552 in _job_test (job_ptr=job_ptr@entry=0x23cd090, node_bitmap=node_bitmap@entry=0x7f27a401efc0, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=1, req_nodes=req_nodes@entry=1, mode=mode@entry=2, cr_type=cr_type@entry=276, job_node_req=job_node_req@entry=NODE_CR_AVAILABLE, cr_part_ptr=cr_part_ptr@entry=0x7f27a4000a00, node_usage=node_usage@entry=0x7f27a4035e00, exc_cores=<optimized out>, exc_cores@entry=0x0, prefer_alloc_nodes=false, qos_preemptor=qos_preemptor@entry=false, preempt_mode=preempt_mode@entry=true) at job_test.c:1569
#3  0x00007f27b4dde548 in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x0, preemptee_candidates=0x0, job_node_req=0, req_nodes=1, max_nodes=1, min_nodes=1, node_bitmap=0x7f27a401efc0, job_ptr=0x23cd090) at job_test.c:1988
#4  common_job_test (job_ptr=0x23cd090, node_bitmap=0x7f27a401efc0, min_nodes=1, max_nodes=1, req_nodes=1, mode=<optimized out>, preemptee_candidates=0x0, preemptee_job_list=0x0, exc_cores=0x0) at job_test.c:2316
#5  0x00007f27b4dd182d in select_p_job_test (job_ptr=0x23cd090, node_bitmap=0x7f27a401efc0, min_nodes=1, max_nodes=1, req_nodes=1, mode=<optimized out>, preemptee_candidates=0x0, preemptee_job_list=0x0, exc_core_bitmap=0x0) at select_cons_tres.c:508
#6  0x00007f27b6bf9cfb in select_g_job_test (job_ptr=job_ptr@entry=0x23cd090, bitmap=0x7f27a401efc0, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=1, req_nodes=req_nodes@entry=1, mode=mode@entry=2, preemptee_candidates=preemptee_candidates@entry=0x0, preemptee_job_list=preemptee_job_list@entry=0x0, exc_core_bitmap=exc_core_bitmap@entry=0x0) at node_select.c:517
#7  0x00007f27b29e4097 in _try_sched (job_ptr=0x23cd090, avail_bitmap=avail_bitmap@entry=0x7f27b29dcc60, min_nodes=1, max_nodes=1, req_nodes=1, exc_core_bitmap=0x0) at backfill.c:613
#8  0x00007f27b29e77b9 in _attempt_backfill () at backfill.c:2348
#9  0x00007f27b29e9658 in backfill_agent (args=<optimized out>) at backfill.c:1062
#10 0x00007f27b6504ea5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f27b622d8cd in clone () from /lib64/libc.so.6

I didn't think of dereferencing job_ptr while gdb was running, so it is unclear whether _compute_plane_dist was actually stuck on job 18314566. slurmctld has been running uninterrupted since the job was cancelled, but if it locks up again we can provide more gdb output. slurm.conf is attached.
This looks like a repeat of the issue in bug#9248. Please apply the patch in bug#9248 comment#22.
It very much looks like a duplicate of bug#9248, yes. We will test the patch tomorrow.
Simon, we are just following up on this. Were you able to apply the patch, and is slurmctld behaving normally again?
Simon, I think we should be able to close this as a duplicate of bug 9248. Before we do that:

- Can you confirm that slurmctld is no longer stuck (I believe you said it isn't stuck right now)?
- Do you have the batch script and options used for job 18314566?

The issue is triggered by requesting --distribution=plane. However, there are bugs in the option-parsing code, so there are ways to request that accidentally; for example, -m=8gb was used in bug 9248. If bad syntax caused the problem here as well, it would be helpful to get more data points on it.
We installed the patched code and I can confirm that slurmctld isn't stuck right now. Feel free to close this as a duplicate of bug 9248.
Thanks for confirming. Marking as a duplicate of bug 9248.

*** This ticket has been marked as a duplicate of ticket 9248 ***