| Summary: | slurmctld hangup | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Matt Ezell <ezellma> |
| Component: | slurmctld | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | brian.gilmer, csamuel, davismj, peltzpl, vergaravg |
| Version: | 22.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | ORNL-OLCF | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 22.05.6 23.02.0pre1 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
Attachments:
- Backtrace while hung
- Reduce the number of times we call bit_set_count
- v2 patch that corrects bad logic in step_mgr
Description
Matt Ezell
2022-10-25 12:27:58 MDT
Created attachment 27445 [details]
Backtrace while hung
After it becomes unresponsive, we see these agent messages as well from the controller:

```
[root@slurm2.frontier ~]# grep _update_wdog /var/log/slurmctld.log
[2022-10-25T14:33:17.475] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
[2022-10-25T14:33:52.478] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
[2022-10-25T14:34:27.481] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
[2022-10-25T14:35:02.484] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
[2022-10-25T14:35:37.488] AGENT: _update_wdog_state: agent thread 139911638144768 timed out
```

The server thread count piles up until even sdiag is unresponsive once it reaches the maximum server thread count.

```
[root@slurm2.frontier ~]# sdiag | head
*******************************************************
sdiag output at Tue Oct 25 14:37:05 2022 (1666723025)
Data since Tue Oct 25 14:17:18 2022 (1666721838)
*******************************************************
Server thread count: 84
```

We are also seeing defunct [stat] processes owned by slurmctld:

```
[root@slurm2.frontier ~]# ps -eaf | grep slurm
slurm  1250 18173 28 14:15 pts/0 00:06:12 /usr/sbin/slurmctld -D -vv
slurm  1251  1250  0 14:15 pts/0 00:00:00 slurmctld: slurmscriptd
slurm  1254  1250  0 14:15 pts/0 00:00:00 [stat] <defunct>
slurm  5193  1250  0 14:19 pts/0 00:00:00 [stat] <defunct>
slurm  5194  1250  0 14:19 pts/0 00:00:00 [stat] <defunct>
slurm 11794  1250  0 14:26 pts/0 00:00:00 [stat] <defunct>
slurm 11795  1250  0 14:26 pts/0 00:00:00 [stat] <defunct>
slurm 22087  1250  0 14:28 pts/0 00:00:00 [stat] <defunct>
slurm 22088  1250  0 14:28 pts/0 00:00:00 [stat] <defunct>
slurm 23380     1  0 13:46 ?     00:00:14 /usr/sbin/slurmdbd -D -s
slurm 27801  1250  0 14:18 pts/0 00:00:00 [stat] <defunct>
slurm 27802  1250  0 14:18 pts/0 00:00:00 [stat] <defunct>
```

We originally thought this might be a communication issue between the primary controller on slurm1 and the dbd on slurm2, but since last night we've been running the controller and slurmdbd/mariadb on the same node. After about 6 hours we are back in the same state, where we restart slurmctld and client commands quickly become unresponsive once jobs begin to run.

I think you may be running into an issue fixed in the following commit, which was also released as part of 22.05.5.

https://github.com/SchedMD/slurm/commit/ca510f2984

We are still discussing this and will let you know more shortly.

(In reply to Jason Booth from comment #3)
> I think you may be running into an issue fixed in the following commit,
> which was also released as part of 22.05.5.
>
> https://github.com/SchedMD/slurm/commit/ca510f2984
>
> We are still discussing this and will let you know more shortly.

Thanks for the quick reply. We are running with bug14229_2205_v3.patch. Looking at the commits from 14229 to see if we need to pull in others (or just move to 22.05.5).

I meant to also post the following commit. This commit was a follow-up from bug#15149.

https://github.com/SchedMD/slurm/commit/49826b926488273e760dcb2fcae80d6df77cab30

Created attachment 27449 [details]
Reduce the number of times we call bit_set_count.
Attached is a patch that reduces the number of times we call bit_set_count. We suspect this is causing part, if not the majority, of the slowdown you are seeing. Please apply this and let us know if things improve.
Created attachment 27450 [details]
v2 patch that corrects bad logic in step_mgr
Please apply and test this over the previous patch.
We are running with the v2 patch. I'll downgrade to sev-2 while we wait to see if it was effective. Thanks!!!

Great! Sounds positive so far. We will wait to hear back from you. Marcin has been working on that patch. He will handle this bug going forward and any feedback you have for us regarding it. Please let us know if any odd behavior occurs with job allocations.

Matt,

I guess no news is good news. We decreased the ticket severity to 3; could you please confirm that things look good while running with the patch applied?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #10)
> Matt,
>
> I guess no news is good news. We decreased the ticket severity to 3;
> could you please confirm that things look good while running with the
> patch applied?
>
> cheers,
> Marcin

Yes, no news is good news. No issues since applying the v2 patch. Thanks!

Looks like this has hit the 22.05 and master branches. I think we can close this out, thanks!