| Summary: | slurmctld stuck with 100% CPU usage (error: select/cons_tres: _compute_c_b_task_dist: oversubscribe) | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | slurmctld | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | Priority: | --- |
| CC: | arc-slurm-admins, bart, cinek | Version: | 19.05.6 |
| Hardware: | Linux | OS: | Linux |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8265 | | |
| Site: | Stanford | Machine Name: | Sherlock |
| Version Fixed: | 19.05.7, 20.02.2, 20.11.0pre1 | | |
| Attachments: | output of "thread apply all bt full"; continuation of 6dcb598c6b9(v1) | | |

Created attachment 13679 [details]
output of "thread apply all bt full"

Hi SchedMD,

We've just experienced the exact same symptoms described in Bug 8265:

> a completely unresponsive controller. The slurmctld process doesn't crash nor segfault, but it doesn't seem to be processing any RPC, as commands like sdiag or even "scontrol ping" time out and result in "Socket timed out on send/recv operation" messages. Looking at the threads, bckfl seems to be taking 100% CPU and not progressing at all. Nothing is recorded in the logs anymore.

As in 8265, the last slurmctld messages before the hang were:

```
Apr 08 15:14:06 sh02-sl01.int slurmctld[83899]: _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=65088793 uid 254732
Apr 08 15:14:12 sh02-sl01.int slurmctld[83899]: _slurm_rpc_submit_batch_job: JobId=65107317 InitPrio=55433 usec=13098
Apr 08 15:14:13 sh02-sl01.int slurmctld[83899]: error: select/cons_tres: _compute_c_b_task_dist: oversubscribe for JobId=65101461
```

The job in question:

```
JobId=65101461 JobName=angela_varselRF UserId=[...](337380) GroupId=[...](30289) MCS_label=N/A
Priority=16306 Nice=0 Account=[...] QOS=normal JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
SubmitTime=2020-04-08T13:54:01 EligibleTime=2020-04-08T13:54:01 AccrueTime=2020-04-08T13:54:29
StartTime=Unknown EndTime=Unknown Deadline=N/A SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-04-08T15:40:06
Partition=owners,quake,normal AllocNode:Sid=sh02-ln02:92085 ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
NumNodes=1 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=5,mem=124000M,node=1,billing=5
Socks/Node=* NtasksPerN:B:S:C=5:0:*:* CoreSpec=* MinCPUsNode=10 MinMemoryNode=124000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=[...]forest/makeSmallForest.sh WorkDir=[...]forest
StdErr=[...]forest/log/rf.err StdIn=/dev/null StdOut=[...]forest/log/rf.out Power=
```

After killing and restarting slurmctld, everything looks fine: I can query that job, and its LastSchedEval timestamp is being updated, so it doesn't look like this particular job is the culprit; it may just be a coincidence.

Anyway, I've saved a core dump of the blocked process. Here's the backtrace:

```
(gdb) bt
#0  0x00007fa6149812ce in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x00000000004701e5 in lock_slurmctld (lock_levels=...) at locks.c:125
#2  0x0000000000421542 in _agent_defer () at agent.c:1543
#3  _agent_init (arg=<optimized out>) at agent.c:1409
#4  0x00007fa61497ddd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fa6146a702d in clone () from /lib64/libc.so.6
```

I'm attaching the output of "thread apply all bt full".

Thanks!
-- Kilian

---

Woops, spoke too soon: it did it again, with the same error:

```
error: select/cons_tres: _compute_c_b_task_dist: oversubscribe for JobId=65101461
```

I'm going to cancel that job, but that makes it really look like Bug 8265 then. And we're running 19.05.6, which I believe includes 8265's fix.

Thanks!
-- Kilian

---

Created attachment 13681 [details]
continuation of 6dcb598c6b9(v1)

Kilian,

It looks like in Bug 8265 we've lost part of the real fix in patch rewrites. The attached patch should fix the infinite loop you're hitting now; could you please verify?

cheers, Marcin

---

Hi Marcin,

(In reply to Marcin Stolarek from comment #2)
> It looks like in Bug 8265 we've lost part of the real fix in patch rewrites.
> The attached patch should fix the infinite loop you're hitting now, could
> you please verify?

Thanks! The patch is applied and deployed. Is there a way we can test it and make sure the problem is fixed?

Cheers,
-- Kilian

---

Kilian,

It may be exactly the job you found in Bug 8265. Unfortunately, we failed to reproduce it in the test environment (I think it has something to do with MaxMemPerCPU resulting in an increased number of CPUs per task, plus some specific node allocation prior to its resource selection). Most importantly, I made a mistake rewriting the patch I shared with you in Bug 8265 during the review process: in the end, a safety fuse was added in a different loop than the one that was the root cause of Bug 8265, so it really didn't get fixed in 19.05.6.

cheers, Marcin

---

(In reply to Marcin Stolarek from comment #4)
> It may be exactly the job you found in Bug 8265. Unfortunately, we failed to
> reproduce it in the test environment (I think it has something to do with
> MaxMemPerCPU resulting in an increased number of CPUs per task, plus some
> specific node allocation prior to its resource selection). Most importantly,
> I made a mistake rewriting the patch I shared with you in Bug 8265 during
> the review process: in the end, a safety fuse was added in a different loop
> than the one that was the root cause of Bug 8265, so it really didn't get
> fixed in 19.05.6.

Got it, thanks Marcin!

Cheers,
-- Kilian

---

Kilian,

The corrected fix for Bug 8265 is now merged and will be released in Slurm 19.05.7 [1]. Sorry for the previous mistake.

cheers, Marcin

[1] https://github.com/SchedMD/slurm/commit/2db5ded74734ab304dd9b6dfec36a96babd99f6d

---

(In reply to Marcin Stolarek from comment #21)
> The corrected fix for Bug 8265 is now merged and will be released in Slurm
> 19.05.7 [1].

Great to hear, thank you Marcin!

Cheers,
-- Kilian

---

*** Ticket 8959 has been marked as a duplicate of this ticket. ***
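---

One detail in the job record above is worth spelling out: the job requests NumCPUs=5 (5 tasks × 1 CPU/task) but shows MinCPUsNode=10. That is consistent with Marcin's MaxMemPerCPU remark: when a job's per-CPU memory share exceeds MaxMemPerCPU, Slurm raises the job's CPU count until the share fits under the cap. Below is a minimal sketch of that arithmetic, not the actual Slurm source; `cpus_for_mem` is a made-up name, and MaxMemPerCPU=12400 is a hypothetical value chosen only because it reproduces the observed 5 → 10 inflation (the real Sherlock setting is not given in this ticket).

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of Slurm's documented MaxMemPerCPU enforcement (not the
 * actual Slurm code): if a job's per-CPU memory share would exceed
 * the limit, the CPU count is raised until the share fits. */
static uint32_t cpus_for_mem(uint64_t mem_per_node_mb,
                             uint32_t requested_cpus,
                             uint64_t max_mem_per_cpu_mb)
{
    /* Smallest CPU count whose per-CPU memory share fits under the
     * cap (ceiling division). */
    uint64_t needed = (mem_per_node_mb + max_mem_per_cpu_mb - 1) /
                      max_mem_per_cpu_mb;
    return (needed > requested_cpus) ? (uint32_t)needed : requested_cpus;
}

int main(void)
{
    /* mem=124000M and NumCPUs=5 come from the job record above;
     * MaxMemPerCPU=12400 is a made-up value that reproduces the
     * observed MinCPUsNode=10. */
    printf("effective CPUs: %u\n", cpus_for_mem(124000, 5, 12400));
    return 0;
}
```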
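Marcin's comment #4 also mentions a "safety fuse": an iteration cap that turns a would-be infinite loop into a logged failure, which only works if it guards the loop that actually stops converging. The real fix is in the linked commit; the following is only a generic sketch of the pattern, with `MAX_DIST_PASSES`, `distribute_tasks()`, and the round-robin placement all being illustrative inventions rather than cons_tres code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical iteration cap: large enough never to trip in normal
 * operation, small enough to break an infinite loop quickly. */
#define MAX_DIST_PASSES 10000

/* Generic "safety fuse" sketch: a distribution loop that should
 * converge is bounded, and it also aborts as soon as a full pass
 * makes no progress, so a logic error cannot pin the daemon at
 * 100% CPU forever. */
static bool distribute_tasks(int tasks, int nodes, const int *avail_cpus)
{
    for (int pass = 0; pass < MAX_DIST_PASSES && tasks > 0; pass++) {
        int placed = 0;
        /* Round-robin: on pass N, give one more task to every node
         * that still has more than N CPUs available. */
        for (int n = 0; n < nodes && tasks > 0; n++) {
            if (avail_cpus[n] > pass) {
                tasks--;
                placed++;
            }
        }
        if (placed == 0) {
            /* Fuse blown: log and fail instead of spinning forever. */
            fprintf(stderr, "error: task distribution stalled, "
                    "%d tasks left\n", tasks);
            return false;
        }
    }
    return tasks == 0;
}

int main(void)
{
    int cpus[] = { 4, 4, 2 };
    printf("fits: %d\n", distribute_tasks(10, 3, cpus)); /* 1: placed */
    printf("fits: %d\n", distribute_tasks(12, 3, cpus)); /* 0: stalls */
    return 0;
}
```

The design point is that the guard fails the resource selection for one job rather than hanging slurmctld, which matches the symptom reported here: without it, the scheduler thread spins and the whole controller stops answering RPCs.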