Ticket 15280

Summary: slurmctld hangup
Product: Slurm Reporter: Matt Ezell <ezellma>
Component: slurmctld Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: brian.gilmer, csamuel, davismj, peltzpl, vergaravg
Version: 22.05.2   
Hardware: Linux   
OS: Linux   
Site: ORNL-OLCF Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 22.05.6 23.02.0pre1 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Backtrace while hung
Reduce the number of times we call bit_set_count.
v2 patch that corrects bad logic in step_mgr

Description Matt Ezell 2022-10-25 12:27:58 MDT
Since yesterday, our slurmctld has repeatedly become unresponsive to client commands. Most of the threads are waiting to acquire locks, and one appears to be using 100% of a core. I generated 3 core files over a couple of minutes (not automated, so not back-to-back) and found that the same LWP seemed to be VERY slowly progressing with:

Thread 21 (Thread 0x7faa7ddb4700 (LWP 623)):
#0  0x00007faa995db912 in bit_set_count (b=b@entry=0x7fa8f8081290) at bitstring.c:850
#1  0x00000000004ca5e8 in _handle_core_select (step_ptr=step_ptr@entry=0x7fa8f8010b80, job_resrcs_ptr=job_resrcs_ptr@entry=0x7faa880b2770, avail_core_bitmap=0x7fa8f8081290, job_node_inx=job_node_inx@entry=48, sockets=8, cores=8, use_all_cores=true, oversubscribing_cpus=false, cpu_cnt=0x7faa7ddb38e4, cpus_per_task=8) at step_mgr.c:2017
#2  0x00000000004d0c48 in _pick_step_cores (node_inx=48, cpus_per_core=<optimized out>, task_cnt=<optimized out>, job_node_inx=48, job_resrcs_ptr=0x7faa880b2770, step_ptr=0x7fa8f8010b80) at step_mgr.c:2123
#3  _step_alloc_lps (step_ptr=0x7fa8f8010b80) at step_mgr.c:2471
#4  step_create (step_specs=step_specs@entry=0x7fa8f800ff10, new_step_record=new_step_record@entry=0x7faa7ddb39a0, protocol_version=<optimized out>, err_msg=err_msg@entry=0x7faa7ddb3990) at step_mgr.c:3509
#5  0x00000000004a0a7c in _slurm_rpc_job_step_create (msg=0x7fa8f8007660) at proc_req.c:2453
#6  0x00000000004ad4ea in slurmctld_req (msg=msg@entry=0x7fa8f8007660) at proc_req.c:6660
#7  0x00000000004303f4 in _service_connection (arg=<optimized out>) at controller.c:1380
#8  0x00007faa991426ea in start_thread () from /lib64/libpthread.so.0
#9  0x00007faa988f7a8f in clone () from /lib64/libc.so.6

Thread 21 (Thread 0x7faa7ddb4700 (LWP 623)):
#0  __popcountdi2 (x=578721382704613384) at ../../../libgcc/libgcc2.c:860
#1  0x00007faa995db917 in bit_set_count (b=b@entry=0x7fa8f8081290) at bitstring.c:850
#2  0x00000000004ca5e8 in _handle_core_select (step_ptr=step_ptr@entry=0x7fa8f8010b80, job_resrcs_ptr=job_resrcs_ptr@entry=0x7faa880b2770, avail_core_bitmap=0x7fa8f8081290, job_node_inx=job_node_inx@entry=76, sockets=8, cores=8, use_all_cores=true, oversubscribing_cpus=false, cpu_cnt=0x7faa7ddb38e4, cpus_per_task=8) at step_mgr.c:2017
#3  0x00000000004d0c48 in _pick_step_cores (node_inx=148, cpus_per_core=<optimized out>, task_cnt=<optimized out>, job_node_inx=76, job_resrcs_ptr=0x7faa880b2770, step_ptr=0x7fa8f8010b80) at step_mgr.c:2123
#4  _step_alloc_lps (step_ptr=0x7fa8f8010b80) at step_mgr.c:2471
#5  step_create (step_specs=step_specs@entry=0x7fa8f800ff10, new_step_record=new_step_record@entry=0x7faa7ddb39a0, protocol_version=<optimized out>, err_msg=err_msg@entry=0x7faa7ddb3990) at step_mgr.c:3509
#6  0x00000000004a0a7c in _slurm_rpc_job_step_create (msg=0x7fa8f8007660) at proc_req.c:2453
#7  0x00000000004ad4ea in slurmctld_req (msg=msg@entry=0x7fa8f8007660) at proc_req.c:6660
#8  0x00000000004303f4 in _service_connection (arg=<optimized out>) at controller.c:1380
#9  0x00007faa991426ea in start_thread () from /lib64/libpthread.so.0
#10 0x00007faa988f7a8f in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7faa7ddb4700 (LWP 623)):
#0  __popcountdi2 (x=578721382704613384) at ../../../libgcc/libgcc2.c:860
#1  0x00007faa995db917 in bit_set_count (b=b@entry=0x7fa8f8081290) at bitstring.c:850
#2  0x00000000004ca5e8 in _handle_core_select (step_ptr=step_ptr@entry=0x7fa8f8010b80, job_resrcs_ptr=job_resrcs_ptr@entry=0x7faa880b2770, avail_core_bitmap=0x7fa8f8081290, job_node_inx=job_node_inx@entry=101, sockets=8, cores=8, use_all_cores=true, oversubscribing_cpus=false, cpu_cnt=0x7faa7ddb38e4, cpus_per_task=8) at step_mgr.c:2017
#3  0x00000000004d0c48 in _pick_step_cores (node_inx=173, cpus_per_core=<optimized out>, task_cnt=<optimized out>, job_node_inx=101, job_resrcs_ptr=0x7faa880b2770, step_ptr=0x7fa8f8010b80) at step_mgr.c:2123
#4  _step_alloc_lps (step_ptr=0x7fa8f8010b80) at step_mgr.c:2471
#5  step_create (step_specs=step_specs@entry=0x7fa8f800ff10, new_step_record=new_step_record@entry=0x7faa7ddb39a0, protocol_version=<optimized out>, err_msg=err_msg@entry=0x7faa7ddb3990) at step_mgr.c:3509
#6  0x00000000004a0a7c in _slurm_rpc_job_step_create (msg=0x7fa8f8007660) at proc_req.c:2453
#7  0x00000000004ad4ea in slurmctld_req (msg=msg@entry=0x7fa8f8007660) at proc_req.c:6660
#8  0x00000000004303f4 in _service_connection (arg=<optimized out>) at controller.c:1380
#9  0x00007faa991426ea in start_thread () from /lib64/libpthread.so.0
#10 0x00007faa988f7a8f in clone () from /lib64/libc.so.6

Note that job_node_inx is increasing between samples.
Comment 1 Matt Ezell 2022-10-25 12:32:52 MDT
Created attachment 27445 [details]
Backtrace while hung
Comment 2 Paul Peltz 2022-10-25 12:40:22 MDT
After it becomes unresponsive, we see these agent messages as well from the controller:

[root@slurm2.frontier ~]# grep _update_wdog /var/log/slurmctld.log
[2022-10-25T14:33:17.475] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
[2022-10-25T14:33:52.478] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
[2022-10-25T14:34:27.481] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
[2022-10-25T14:35:02.484] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
[2022-10-25T14:35:37.488] AGENT: _update_wdog_state: agent thread 139911638144768 timed out

The server thread count piles up until it reaches the maximum, at which point even sdiag is unresponsive.

[root@slurm2.frontier ~]# sdiag | head
*******************************************************
sdiag output at Tue Oct 25 14:37:05 2022 (1666723025)
Data since      Tue Oct 25 14:17:18 2022 (1666721838)
*******************************************************
Server thread count:  84

We are also seeing defunct stat processes owned by slurmctld.

[root@slurm2.frontier ~]# ps -eaf | grep slurm
slurm     1250 18173 28 14:15 pts/0    00:06:12 /usr/sbin/slurmctld -D -vv
slurm     1251  1250  0 14:15 pts/0    00:00:00 slurmctld: slurmscriptd
slurm     1254  1250  0 14:15 pts/0    00:00:00 [stat] <defunct>
slurm     5193  1250  0 14:19 pts/0    00:00:00 [stat] <defunct>
slurm     5194  1250  0 14:19 pts/0    00:00:00 [stat] <defunct>
slurm    11794  1250  0 14:26 pts/0    00:00:00 [stat] <defunct>
slurm    11795  1250  0 14:26 pts/0    00:00:00 [stat] <defunct>
slurm    22087  1250  0 14:28 pts/0    00:00:00 [stat] <defunct>
slurm    22088  1250  0 14:28 pts/0    00:00:00 [stat] <defunct>
slurm    23380     1  0 13:46 ?        00:00:14 /usr/sbin/slurmdbd -D -s
slurm    27801  1250  0 14:18 pts/0    00:00:00 [stat] <defunct>
slurm    27802  1250  0 14:18 pts/0    00:00:00 [stat] <defunct>

We originally thought this might be a communication issue between the primary controller on slurm1 and the dbd on slurm2, but since last night we've been running the controller and slurmdbd/mariadb on the same node. After about 6 hours we are back in the same state: we restart slurmctld, and client commands quickly become unresponsive once jobs begin to run.
Comment 3 Jason Booth 2022-10-25 13:01:13 MDT
I think you may be running into an issue fixed in the following commit that was also released as part of 22.05.5.

https://github.com/SchedMD/slurm/commit/ca510f2984

We are still discussing this and will let you know more shortly.
Comment 4 Matt Ezell 2022-10-25 13:06:28 MDT
(In reply to Jason Booth from comment #3)
> I think you may be running into an issue fixed in the following commit that
> was also released as part of 22.05.5.
> 
> https://github.com/SchedMD/slurm/commit/ca510f2984
> 
> We are still discussing this and will let you know more shortly.

Thanks for the quick reply. We are running with bug14229_2205_v3.patch. Looking at the commits from 14229 to see if we need to pull in others (or just move to 22.05.5).
Comment 5 Jason Booth 2022-10-25 13:14:01 MDT
I meant to also post the following commit. This commit was a follow-up to bug#15149.


https://github.com/SchedMD/slurm/commit/49826b926488273e760dcb2fcae80d6df77cab30
Comment 6 Jason Booth 2022-10-25 14:29:53 MDT
Created attachment 27449 [details]
Reduce the number of times we call bit_set_count.

Attached is a patch that reduces the number of times we call bit_set_count. We suspect this is causing part, if not the majority, of the slowdown you are seeing. Please apply it and let us know if things improve.
Comment 7 Jason Booth 2022-10-25 15:06:40 MDT
Created attachment 27450 [details]
v2 patch that corrects bad logic in step_mgr

Please apply and test this over the previous patch.
Comment 8 Matt Ezell 2022-10-25 15:28:08 MDT
We are running with the v2 patch. I'll downgrade to sev-2 while we wait to see if it was effective. Thanks!!!
Comment 9 Jason Booth 2022-10-25 15:37:14 MDT
Great! Sounds positive so far. We will wait to hear back from you. Marcin has been working on that patch; he will handle this bug and any feedback you have for us going forward. Please let us know if any odd behavior occurs with job allocations.
Comment 10 Marcin Stolarek 2022-10-27 04:40:31 MDT
Matt,

I guess no news is good news. We decreased the ticket severity to 3; could you please confirm that things look good while running with the patch applied?

cheers,
Marcin
Comment 11 Matt Ezell 2022-10-27 06:54:53 MDT
(In reply to Marcin Stolarek from comment #10)
> Matt,
> 
> I guess no news is good news. We decreased the ticket severity to 3;
> could you please confirm that things look good while running with the
> patch applied?
> 
> cheers,
> Marcin

Yes, no news is good news. No issues since applying the v2 patch. Thanks!
Comment 12 Matt Ezell 2022-11-01 20:35:06 MDT
Looks like this has hit the 22.05 and master branches. I think we can close this out, thanks!