Since yesterday, our slurmctld has become unresponsive to client commands. Most of the threads are waiting to acquire locks, and the process was using 100% of a core. I generated 3 core files over a couple of minutes (not automated, so not back-to-back) and found that the same LWP seemed to be progressing VERY slowly through:

Thread 21 (Thread 0x7faa7ddb4700 (LWP 623)):
#0  0x00007faa995db912 in bit_set_count (b=b@entry=0x7fa8f8081290) at bitstring.c:850
#1  0x00000000004ca5e8 in _handle_core_select (step_ptr=step_ptr@entry=0x7fa8f8010b80, job_resrcs_ptr=job_resrcs_ptr@entry=0x7faa880b2770, avail_core_bitmap=0x7fa8f8081290, job_node_inx=job_node_inx@entry=48, sockets=8, cores=8, use_all_cores=true, oversubscribing_cpus=false, cpu_cnt=0x7faa7ddb38e4, cpus_per_task=8) at step_mgr.c:2017
#2  0x00000000004d0c48 in _pick_step_cores (node_inx=48, cpus_per_core=<optimized out>, task_cnt=<optimized out>, job_node_inx=48, job_resrcs_ptr=0x7faa880b2770, step_ptr=0x7fa8f8010b80) at step_mgr.c:2123
#3  _step_alloc_lps (step_ptr=0x7fa8f8010b80) at step_mgr.c:2471
#4  step_create (step_specs=step_specs@entry=0x7fa8f800ff10, new_step_record=new_step_record@entry=0x7faa7ddb39a0, protocol_version=<optimized out>, err_msg=err_msg@entry=0x7faa7ddb3990) at step_mgr.c:3509
#5  0x00000000004a0a7c in _slurm_rpc_job_step_create (msg=0x7fa8f8007660) at proc_req.c:2453
#6  0x00000000004ad4ea in slurmctld_req (msg=msg@entry=0x7fa8f8007660) at proc_req.c:6660
#7  0x00000000004303f4 in _service_connection (arg=<optimized out>) at controller.c:1380
#8  0x00007faa991426ea in start_thread () from /lib64/libpthread.so.0
#9  0x00007faa988f7a8f in clone () from /lib64/libc.so.6

Thread 21 (Thread 0x7faa7ddb4700 (LWP 623)):
#0  __popcountdi2 (x=578721382704613384) at ../../../libgcc/libgcc2.c:860
#1  0x00007faa995db917 in bit_set_count (b=b@entry=0x7fa8f8081290) at bitstring.c:850
#2  0x00000000004ca5e8 in _handle_core_select (step_ptr=step_ptr@entry=0x7fa8f8010b80, job_resrcs_ptr=job_resrcs_ptr@entry=0x7faa880b2770, avail_core_bitmap=0x7fa8f8081290, job_node_inx=job_node_inx@entry=76, sockets=8, cores=8, use_all_cores=true, oversubscribing_cpus=false, cpu_cnt=0x7faa7ddb38e4, cpus_per_task=8) at step_mgr.c:2017
#3  0x00000000004d0c48 in _pick_step_cores (node_inx=148, cpus_per_core=<optimized out>, task_cnt=<optimized out>, job_node_inx=76, job_resrcs_ptr=0x7faa880b2770, step_ptr=0x7fa8f8010b80) at step_mgr.c:2123
#4  _step_alloc_lps (step_ptr=0x7fa8f8010b80) at step_mgr.c:2471
#5  step_create (step_specs=step_specs@entry=0x7fa8f800ff10, new_step_record=new_step_record@entry=0x7faa7ddb39a0, protocol_version=<optimized out>, err_msg=err_msg@entry=0x7faa7ddb3990) at step_mgr.c:3509
#6  0x00000000004a0a7c in _slurm_rpc_job_step_create (msg=0x7fa8f8007660) at proc_req.c:2453
#7  0x00000000004ad4ea in slurmctld_req (msg=msg@entry=0x7fa8f8007660) at proc_req.c:6660
#8  0x00000000004303f4 in _service_connection (arg=<optimized out>) at controller.c:1380
#9  0x00007faa991426ea in start_thread () from /lib64/libpthread.so.0
#10 0x00007faa988f7a8f in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7faa7ddb4700 (LWP 623)):
#0  __popcountdi2 (x=578721382704613384) at ../../../libgcc/libgcc2.c:860
#1  0x00007faa995db917 in bit_set_count (b=b@entry=0x7fa8f8081290) at bitstring.c:850
#2  0x00000000004ca5e8 in _handle_core_select (step_ptr=step_ptr@entry=0x7fa8f8010b80, job_resrcs_ptr=job_resrcs_ptr@entry=0x7faa880b2770, avail_core_bitmap=0x7fa8f8081290, job_node_inx=job_node_inx@entry=101, sockets=8, cores=8, use_all_cores=true, oversubscribing_cpus=false, cpu_cnt=0x7faa7ddb38e4, cpus_per_task=8) at step_mgr.c:2017
#3  0x00000000004d0c48 in _pick_step_cores (node_inx=173, cpus_per_core=<optimized out>, task_cnt=<optimized out>, job_node_inx=101, job_resrcs_ptr=0x7faa880b2770, step_ptr=0x7fa8f8010b80) at step_mgr.c:2123
#4  _step_alloc_lps (step_ptr=0x7fa8f8010b80) at step_mgr.c:2471
#5  step_create (step_specs=step_specs@entry=0x7fa8f800ff10, new_step_record=new_step_record@entry=0x7faa7ddb39a0, protocol_version=<optimized out>, err_msg=err_msg@entry=0x7faa7ddb3990) at step_mgr.c:3509
#6  0x00000000004a0a7c in _slurm_rpc_job_step_create (msg=0x7fa8f8007660) at proc_req.c:2453
#7  0x00000000004ad4ea in slurmctld_req (msg=msg@entry=0x7fa8f8007660) at proc_req.c:6660
#8  0x00000000004303f4 in _service_connection (arg=<optimized out>) at controller.c:1380
#9  0x00007faa991426ea in start_thread () from /lib64/libpthread.so.0
#10 0x00007faa988f7a8f in clone () from /lib64/libc.so.6

Note that job_node_inx is increasing between samples.
Created attachment 27445 [details] Backtrace while hung
After it becomes unresponsive, we also see these agent messages from the controller:

[root@slurm2.frontier ~]# grep _update_wdog /var/log/slurmctld.log
[2022-10-25T14:33:17.475] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
[2022-10-25T14:33:52.478] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
[2022-10-25T14:34:27.481] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
[2022-10-25T14:35:02.484] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
[2022-10-25T14:35:37.488] AGENT: _update_wdog_state: agent thread 139911638144768 timed out

The server thread count piles up until even sdiag is unresponsive, once it reaches the max server thread count.

[root@slurm2.frontier ~]# sdiag | head
*******************************************************
sdiag output at Tue Oct 25 14:37:05 2022 (1666723025)
Data since      Tue Oct 25 14:17:18 2022 (1666721838)
*******************************************************
Server thread count: 84

We are also seeing [stat] processes owned by slurmctld that are defunct:

[root@slurm2.frontier ~]# ps -eaf | grep slurm
slurm     1250 18173 28 14:15 pts/0    00:06:12 /usr/sbin/slurmctld -D -vv
slurm     1251  1250  0 14:15 pts/0    00:00:00 slurmctld: slurmscriptd
slurm     1254  1250  0 14:15 pts/0    00:00:00 [stat] <defunct>
slurm     5193  1250  0 14:19 pts/0    00:00:00 [stat] <defunct>
slurm     5194  1250  0 14:19 pts/0    00:00:00 [stat] <defunct>
slurm    11794  1250  0 14:26 pts/0    00:00:00 [stat] <defunct>
slurm    11795  1250  0 14:26 pts/0    00:00:00 [stat] <defunct>
slurm    22087  1250  0 14:28 pts/0    00:00:00 [stat] <defunct>
slurm    22088  1250  0 14:28 pts/0    00:00:00 [stat] <defunct>
slurm    23380     1  0 13:46 ?        00:00:14 /usr/sbin/slurmdbd -D -s
slurm    27801  1250  0 14:18 pts/0    00:00:00 [stat] <defunct>
slurm    27802  1250  0 14:18 pts/0    00:00:00 [stat] <defunct>

We originally thought this might be a communication issue between the primary controller on slurm1 and the dbd on slurm2, but since last night we've been running the controller and slurmdbd/mariadb on the same node. After about 6 hours we are back in the same state: we restart slurmctld, and client commands quickly become unresponsive once jobs begin to run.
I think you may be running into an issue fixed in the following commit that was also released as part of 22.05.5.

https://github.com/SchedMD/slurm/commit/ca510f2984

We are still discussing this and will let you know more shortly.
(In reply to Jason Booth from comment #3)
> I think you may be running into an issue fixed in the following commit that
> was also released as part of 22.05.5.
>
> https://github.com/SchedMD/slurm/commit/ca510f2984
>
> We are still discussing this and will let you know more shortly.

Thanks for the quick reply. We are running with bug14229_2205_v3.patch. We are looking at the commits from bug 14229 to see if we need to pull in others (or just move to 22.05.5).
I meant to also post the following commit, which was a follow-up from bug#15149:

https://github.com/SchedMD/slurm/commit/49826b926488273e760dcb2fcae80d6df77cab30
Created attachment 27449 [details] Reduce the number of times we call bit_set_count.

Attached is a patch that reduces the number of times we call bit_set_count. We suspect this is causing part, if not the majority, of the slowdown you are seeing. Please apply this and let us know if things improve.
Created attachment 27450 [details] v2 patch that corrects bad logic in step_mgr

Please apply and test this in place of the previous patch.
We are running with the v2 patch. I'll downgrade to sev-2 while we wait to see if it was effective. Thanks!!!
Great! Sounds positive so far. We will wait to hear back from you. Marcin has been working on that patch; he will handle this bug going forward, along with any feedback you have for us regarding it. Please let us know if any odd behavior occurs with job allocations.
Matt,

I guess that no news is good news. We decreased the ticket severity to 3; could you please confirm that things look good while running with the patch applied?

cheers,
Marcin
(In reply to Marcin Stolarek from comment #10)
> Matt,
>
> I guess that no news is good news. We decreased the ticket severity to 3;
> could you please confirm that things look good while running with the patch
> applied?
>
> cheers,
> Marcin

Yes, no news is good news. No issues since applying the v2 patch. Thanks!
Looks like the fix has landed in the 22.05 and master branches. I think we can close this out, thanks!