Created attachment 14727 [details]
slurm.conf

We upgraded from Slurm 19.05.5 to 20.02.3 during a recent downtime. About 24 hours after the system was made available to the users again, slurmctld got stuck at 100% CPU load and all other threads stopped working. slurmctld no longer responded to any queries and did not schedule any new jobs. This issue persisted across restarts, with slurmctld becoming unresponsive and getting stuck at 100% CPU load again within seconds to minutes.

Since the thread consuming 100% CPU was named "bckfl", we briefly tried switching to SchedulerType=sched/builtin. This did not solve the problem, but the name of the thread stuck at 100% CPU changed to "sched", so the root cause seems to lie in some common functionality and not necessarily in sched/backfill.

We used DebugFlags=Backfill,SchedType to further debug the issue. The last lines in the log file before slurmctld got stuck were:

[2020-06-18T15:55:34.683] backfill: beginning
[2020-06-18T15:55:34.685] backfill test for JobId=18314566 Prio=3848 Partition=single

Cancelling job 18314566 actually fixed the issue (for the moment).
This is the output of "scontrol show job 18314566" (with some sensitive information redacted):

JobId=18314566 JobName=XXXXXXXXXXX
   UserId=XXXXXX(XXXX) GroupId=XXXXXXX(XXXXXXX) MCS_label=N/A
   Priority=3854 Nice=0 Account=kit QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-06-18T15:54:46 EligibleTime=2020-06-18T15:54:46
   AccrueTime=2020-06-18T15:54:46
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-06-18T15:54:50
   Partition=single AllocNode:Sid=uc2n997:34521
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=40 NumTasks=40 CPUs/Task=N/A ReqB:S:C:T=0:0:*:*
   TRES=cpu=40,mem=45000M,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=40:0:*:1 CoreSpec=*
   MinCPUsNode=40 MinMemoryCPU=1125M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Power=
   MailUser=(null) MailType=NONE

I also gathered a gdb backtrace of the thread at fault while it was stuck (all other threads were hanging in pthread_cond_timedwait or pthread_rwlock_wrlock):

Thread 58 (Thread 0x7f27b29dd700 (LWP 54041)):
#0  0x00007f27b4dd81e6 in _compute_plane_dist (gres_task_limit=0x0, job_ptr=0x23cd090) at dist_tasks.c:391
#1  dist_tasks (job_ptr=0x23cd090, cr_type=<optimized out>, preempt_mode=<optimized out>, core_array=<optimized out>, gres_task_limit=0x0) at dist_tasks.c:1209
#2  0x00007f27b4ddd552 in _job_test (job_ptr=job_ptr@entry=0x23cd090, node_bitmap=node_bitmap@entry=0x7f27a401efc0, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=1, req_nodes=req_nodes@entry=1, mode=mode@entry=2, cr_type=cr_type@entry=276, job_node_req=job_node_req@entry=NODE_CR_AVAILABLE, cr_part_ptr=cr_part_ptr@entry=0x7f27a4000a00, node_usage=node_usage@entry=0x7f27a4035e00, exc_cores=<optimized out>, exc_cores@entry=0x0, prefer_alloc_nodes=false, qos_preemptor=qos_preemptor@entry=false, preempt_mode=preempt_mode@entry=true) at job_test.c:1569
#3  0x00007f27b4dde548 in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x0, preemptee_candidates=0x0, job_node_req=0, req_nodes=1, max_nodes=1, min_nodes=1, node_bitmap=0x7f27a401efc0, job_ptr=0x23cd090) at job_test.c:1988
#4  common_job_test (job_ptr=0x23cd090, node_bitmap=0x7f27a401efc0, min_nodes=1, max_nodes=1, req_nodes=1, mode=<optimized out>, preemptee_candidates=0x0, preemptee_job_list=0x0, exc_cores=0x0) at job_test.c:2316
#5  0x00007f27b4dd182d in select_p_job_test (job_ptr=0x23cd090, node_bitmap=0x7f27a401efc0, min_nodes=1, max_nodes=1, req_nodes=1, mode=<optimized out>, preemptee_candidates=0x0, preemptee_job_list=0x0, exc_core_bitmap=0x0) at select_cons_tres.c:508
#6  0x00007f27b6bf9cfb in select_g_job_test (job_ptr=job_ptr@entry=0x23cd090, bitmap=0x7f27a401efc0, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=1, req_nodes=req_nodes@entry=1, mode=mode@entry=2, preemptee_candidates=preemptee_candidates@entry=0x0, preemptee_job_list=preemptee_job_list@entry=0x0, exc_core_bitmap=exc_core_bitmap@entry=0x0) at node_select.c:517
#7  0x00007f27b29e4097 in _try_sched (job_ptr=0x23cd090, avail_bitmap=avail_bitmap@entry=0x7f27b29dcc60, min_nodes=1, max_nodes=1, req_nodes=1, exc_core_bitmap=0x0) at backfill.c:613
#8  0x00007f27b29e77b9 in _attempt_backfill () at backfill.c:2348
#9  0x00007f27b29e9658 in backfill_agent (args=<optimized out>) at backfill.c:1062
#10 0x00007f27b6504ea5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f27b622d8cd in clone () from /lib64/libc.so.6

I didn't think of dereferencing job_ptr while gdb was running, so it is unclear whether _compute_plane_dist was actually stuck on job 18314566. slurmctld has been running uninterrupted since the job was cancelled, but if it locks up again we can provide more gdb output. slurm.conf is attached.
This looks like a repeat of the issue in bug#9248. Please apply the patch in bug#9248 comment#22.
It very much looks like a duplicate of bug#9248, yes. We will test the patch tomorrow.
Simon, we are just following up on this. Were you able to apply the patch, and is slurmctld behaving normally again?
Simon, I think we should be able to close this as a duplicate of bug 9248. Before we do that:

- Can you confirm that slurmctld is no longer stuck (I believe you said it isn't stuck right now)?
- Do you have the batch script and options used for job 18314566?

The issue is triggered by requesting --distribution=plane. However, there are bugs in the option-parsing code, so there are ways to request that accidentally; for example, -m=8gb was used in bug 9248. If bad syntax caused the problem here as well, it would be helpful to get more data points on it.
We installed the patched code and I can confirm that slurmctld isn't stuck right now. Feel free to close this as a duplicate of bug 9248.
Thanks for confirming. Marking as a duplicate of bug 9248.

*** This ticket has been marked as a duplicate of ticket 9248 ***