Ticket 8818

Summary: slurmctld stuck with 100% CPU usage (error: select/cons_tres: _compute_c_b_task_dist: oversubscribe)
Product: Slurm
Reporter: Kilian Cavalotti <kilian>
Component: slurmctld
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
CC: arc-slurm-admins, bart, cinek
Version: 19.05.6
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8265
Site: Stanford
Machine Name: Sherlock
Version Fixed: 19.05.7 20.02.2 20.11.0pre1
Attachments: output of "thread apply all bt full"
continuation of 6dcb598c6b9(v1)

Description Kilian Cavalotti 2020-04-08 16:47:45 MDT
Created attachment 13679 [details]
output of "thread apply all bt full"

Hi SchedMD, 

We've just experienced the exact same symptoms described in 8265:

"""
a completely unresponsive controller. The slurmctld process doesn't crash nor segfault, but it doesn't seem to be processing any RPC, as commands like sdiag or even "scontrol ping" time out and result in "Socket timed out on send/recv operation" messages. Looking at the threads, the backfill thread seems to be taking 100% CPU and not progressing at all. Nothing is recorded in the logs anymore.
"""

Like in 8265, the last slurmctld message before the hang was:

Apr 08 15:14:06 sh02-sl01.int slurmctld[83899]: _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=65088793 uid 254732
Apr 08 15:14:12 sh02-sl01.int slurmctld[83899]: _slurm_rpc_submit_batch_job: JobId=65107317 InitPrio=55433 usec=13098
Apr 08 15:14:13 sh02-sl01.int slurmctld[83899]: error: select/cons_tres: _compute_c_b_task_dist: oversubscribe for JobId=65101461

The job in question is here:
JobId=65101461 JobName=angela_varselRF
   UserId=[...](337380) GroupId=[...](30289) MCS_label=N/A
   Priority=16306 Nice=0 Account=[...] QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2020-04-08T13:54:01 EligibleTime=2020-04-08T13:54:01
   AccrueTime=2020-04-08T13:54:29
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-04-08T15:40:06
   Partition=owners,quake,normal AllocNode:Sid=sh02-ln02:92085
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=5,mem=124000M,node=1,billing=5
   Socks/Node=* NtasksPerN:B:S:C=5:0:*:* CoreSpec=*
   MinCPUsNode=10 MinMemoryNode=124000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=[...]forest/makeSmallForest.sh
   WorkDir=[...]forest
   StdErr=[...]forest/log/rf.err
   StdIn=/dev/null
   StdOut=[...]forest/log/rf.out
   Power=


After killing and restarting slurmctld, everything looks fine: I can query that job and the LastSchedEval timestamp is being updated, so it doesn't look like this particular job is the culprit; it may just be a coincidence.

Anyway, I've saved a core dump of the blocked process. Here's the backtrace:

(gdb) bt
#0  0x00007fa6149812ce in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x00000000004701e5 in lock_slurmctld (lock_levels=...) at locks.c:125
#2  0x0000000000421542 in _agent_defer () at agent.c:1543
#3  _agent_init (arg=<optimized out>) at agent.c:1409
#4  0x00007fa61497ddd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fa6146a702d in clone () from /lib64/libc.so.6

I'm attaching the output of "thread apply all bt full".

Thanks!
-- 
Kilian
Comment 1 Kilian Cavalotti 2020-04-08 18:23:30 MDT
Whoops, spoke too soon: it happened again with the same error:
error: select/cons_tres: _compute_c_b_task_dist: oversubscribe for JobId=65101461

I'm going to cancel that job, but that really makes it look like 8265 then. And we're running 19.05.6, which I believe includes 8265's fix.

Thanks!
-- 
Kilian
Comment 2 Marcin Stolarek 2020-04-09 02:12:22 MDT
Created attachment 13681 [details]
continuation of 6dcb598c6b9(v1)

Kilian,

It looks like in Bug 8265 we've lost part of the real fix in patch rewrites.
The attached patch should fix the infinite loop you're hitting now, could you please verify?

cheers,
Marcin
Comment 3 Kilian Cavalotti 2020-04-09 13:17:50 MDT
Hi Marcin, 

(In reply to Marcin Stolarek from comment #2)
> It looks like in Bug 8265 we've lost part of the real fix in patch rewrites.
> The attached patch should fix the infinite loop you're hitting now, could
> you please verify?

Thanks! Patch is applied and deployed.

Is there a way we can test it and make sure the problem is fixed? 

Cheers,
-- 
Kilian
Comment 4 Marcin Stolarek 2020-04-09 14:17:36 MDT
Kilian,

It may be exactly the job you found in Bug 8265. Unfortunately, we failed to reproduce it in the test environment (I think it has something to do with MaxMemPerCPU resulting in an increased number of CPUs per task, plus some specific node allocation prior to its resource selection) and, most importantly, I made a mistake rewriting the patch I shared with you in Bug 8265 during the review process. In the end, a security fuse was added in a different loop than the one that was the root cause of Bug 8265, so it really didn't get fixed in 19.05.6.

cheers,
Marcin
Comment 5 Kilian Cavalotti 2020-04-09 15:16:33 MDT
(In reply to Marcin Stolarek from comment #4)
> Kilian,
> 
> It may be exactly the job you found in Bug 8265. Unfortunately, we failed
> to reproduce it in the test environment (I think it has something to do
> with MaxMemPerCPU resulting in an increased number of CPUs per task, plus
> some specific node allocation prior to its resource selection) and, most
> importantly, I made a mistake rewriting the patch I shared with you in
> Bug 8265 during the review process. In the end, a security fuse was added
> in a different loop than the one that was the root cause of Bug 8265, so
> it really didn't get fixed in 19.05.6.

Got it, thanks Marcin!

Cheers,
-- 
Kilian
Comment 21 Marcin Stolarek 2020-04-20 01:54:35 MDT
Kilian,

The corrected fix for Bug 8265 is now merged and will be released in Slurm 19.05.7[1].

Sorry for the previous mistake.

cheers,
Marcin 

[1]https://github.com/SchedMD/slurm/commit/2db5ded74734ab304dd9b6dfec36a96babd99f6d
Comment 22 Kilian Cavalotti 2020-04-20 11:20:46 MDT
(In reply to Marcin Stolarek from comment #21)
> Kilian,
> 
> The corrected fix for Bug 8265 is now merged and will be released in Slurm
> 19.05.7[1].

Great to hear, thank you, Marcin!

Cheers,
-- 
Kilian
Comment 23 Marshall Garey 2020-05-14 10:39:44 MDT
*** Ticket 8959 has been marked as a duplicate of this ticket. ***