Ticket 16911 - Worker nodes drained
Summary: Worker nodes drained
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: - Unsupported Older Versions
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Oscar Hernández
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-06-07 08:18 MDT by J.P. Waller
Modified: 2023-06-21 07:58 MDT

See Also:
Site: Acadian Asset
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmd.log.06.06 (30.64 MB, application/octet-stream), 2023-06-07 13:52 MDT, J.P. Waller
Unkillable script example (922 bytes, application/x-shellscript), 2023-06-13 08:22 MDT, Oscar Hernández

Description J.P. Waller 2023-06-07 08:18:30 MDT
For some reason 11 of our 12 worker nodes drained last night.  I can't figure out from the logs why they drained.

We see messages like this in the controller logs.

[2023-06-06T22:11:45.176] sched: Allocate JobId=32806682_17025(32860749) NodeList=bos-rndclus03 #CPUs=1 Partition=batch
[2023-06-06T22:12:21.996] cleanup_completing: JobId=32844068_432(32854439) completion process took 74 seconds
[2023-06-06T22:12:22.226] cleanup_completing: JobId=32844068_420(32854297) completion process took 75 seconds
[2023-06-06T22:12:24.561] cleanup_completing: JobId=32844068_353(32853287) completion process took 77 seconds
[2023-06-06T22:12:25.094] cleanup_completing: JobId=32844068_347(32853009) completion process took 78 seconds
[2023-06-06T22:12:33.868] _slurm_rpc_submit_batch_job: JobId=32860750 InitPrio=52611 usec=10416
[2023-06-06T22:12:36.686] _job_complete: JobId=32860130_7(32860463) WEXITSTATUS 0
[2023-06-06T22:12:36.686] _job_complete: JobId=32860130_7(32860463) done
[2023-06-06T22:12:38.242] update_node: node bos-rndclus12 reason set to: Kill task failed
[2023-06-06T22:12:38.242] update_node: node bos-rndclus12 state set to DRAINING
[2023-06-06T22:12:38.242] error: slurmd error running JobId=32856164 on node(s)=bos-rndclus12: Kill task failed
[2023-06-06T22:12:38.678] cleanup_completing: JobId=32844068_561(32856164) completion process took 91 seconds
[2023-06-06T22:12:39.341] error: slurmd error running JobId=32856108 on node(s)=bos-rndclus12: Kill task failed
[2023-06-06T22:12:39.344] error: slurmd error running JobId=32856139 on node(s)=bos-rndclus12: Kill task failed
[2023-06-06T22:12:39.346] error: slurmd error running JobId=32856138 on node(s)=bos-rndclus12: Kill task failed
[2023-06-06T22:12:39.395] error: slurmd error running JobId=32856161 on node(s)=bos-rndclus12: Kill task failed
[2023-06-06T22:12:39.695] cleanup_completing: JobId=32844068_544(32856108) completion process took 92 seconds
[2023-06-06T22:12:40.142] error: slurmd error running JobId=32856067 on node(s)=bos-rndclus12: Kill task failed
[2023-06-06T22:12:40.345] cleanup_completing: JobId=32844068_503(32856067) completion process took 93 seconds
[2023-06-06T22:12:41.068] update_node: node bos-rndclus09 reason set to: Kill task failed
[2023-06-06T22:12:41.068] update_node: node bos-rndclus09 state set to DRAINING
[2023-06-06T22:12:41.069] error: slurmd error running JobId=32856022 on node(s)=bos-rndclus09: Kill task failed
[2023-06-06T22:12:41.107] update_node: node bos-rndclus11 reason set to: Kill task failed
[2023-06-06T22:12:41.107] update_node: node bos-rndclus11 state set to DRAINING
[2023-06-06T22:12:41.109] error: slurmd error running JobId=32856007 on node(s)=bos-rndclus11: Kill task failed
[2023-06-06T22:12:41.109] error: slurmd error running JobId=32856012 on node(s)=bos-rndclus11: Kill task failed
[2023-06-06T22:12:41.211] error: slurmd error running JobId=32856053 on node(s)=bos-rndclus12: Kill task failed
[2023-06-06T22:12:41.212] error: slurmd error running JobId=32856051 on node(s)=bos-rndclus12: Kill task failed
[2023-06-06T22:12:41.215] error: slurmd error running JobId=32856050 on node(s)=bos-rndclus12: Kill task failed
[2023-06-06T22:12:41.266] cleanup_completing: JobId=32844068_466(32856012) completion process took 94 seconds
[2023-06-06T22:12:41.316] cleanup_completing: JobId=32844068_461(32856007) completion process took 94 seconds
[2023-06-06T22:12:41.323] error: slurmd error running JobId=32856059 on node(s)=bos-rndclus12: Kill task failed
[2023-06-06T22:12:41.324] error: slurmd error running JobId=32856055 on node(s)=bos-rndclus12: Kill task failed
[2023-06-06T22:12:42.098] update_node: node bos-rndclus02 reason set to: Kill task failed
[2023-06-06T22:12:42.098] update_node: node bos-rndclus02 state set to DRAINING
[2023-06-06T22:12:42.099] error: slurmd error running JobId=32854286 on node(s)=bos-rndclus02: Kill task failed
[2023-06-06T22:12:42.100] error: slurmd error running JobId=32854035 on node(s)=bos-rndclus02: Kill task failed
[2023-06-06T22:12:42.103] error: slurmd error running JobId=32854420 on node(s)=bos-rndclus02: Kill task failed
[2023-06-06T22:12:42.108] error: slurmd error running JobId=32854591 on node(s)=bos-rndclus09: Kill task failed
[2023-06-06T22:12:42.109] error: slurmd error running JobId=32854248 on node(s)=bos-rndclus02: Kill task failed
[2023-06-06T22:12:42.109] error: slurmd error running JobId=32854151 on node(s)=bos-rndclus09: Kill task failed
[2023-06-06T22:12:42.112] error: slurmd error running JobId=32854620 on node(s)=bos-rndclus09: Kill task failed
[2023-06-06T22:12:42.118] update_node: node bos-rndclus08 reason set to: Kill task failed
[2023-06-06T22:12:42.118] update_node: node bos-rndclus08 state set to DRAINING
[2023-06-06T22:12:42.120] error: slurmd error running JobId=32854006 on node(s)=bos-rndclus08: Kill task failed
[2023-06-06T22:12:42.121] cleanup_completing: JobId=32844068_419(32854286) completion process took 95 seconds
[2023-06-06T22:12:42.132] error: slurmd error running JobId=32854253 on node(s)=bos-rndclus08: Kill task failed
[2023-06-06T22:12:42.132] error: slurmd error running JobId=32854174 on node(s)=bos-rndclus08: Kill task failed
[2023-06-06T22:12:42.132] update_node: node bos-rndclus06 reason set to: Kill task failed
[2023-06-06T22:12:42.132] update_node: node bos-rndclus06 state set to DRAINING
[2023-06-06T22:12:42.133] update_node: node bos-rndclus05 reason set to: Kill task failed
[2023-06-06T22:12:42.133] update_node: node bos-rndclus05 state set to DRAINING
Comment 1 Oscar Hernández 2023-06-07 08:49:20 MDT
Hi,

>error: slurmd error running JobId=32856108 on node(s)=bos-rndclus12: Kill task failed
Looks like some nodes were not able to successfully kill running tasks. Since the controller detected them as problematic nodes, it drained them to avoid allocating new jobs there.

It is difficult to guess the reason without more details. Could you share the slurmd.log of node bos-rndclus12 since yesterday?

Are nodes still drained?
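
For reference, the drained nodes and the reason Slurm recorded for them can be listed with standard commands, e.g. (the node name below is just one taken from your logs):

sinfo -R
scontrol show node bos-rndclus12 | grep -i reason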

Kind regards,
Oscar
Comment 2 J.P. Waller 2023-06-07 13:52:08 MDT
Created attachment 30656 [details]
slurmd.log.06.06

Attached.

Comment 3 Oscar Hernández 2023-06-08 03:09:23 MDT
Thanks for the extra log.

This log shows that the DRAINED nodes had some jobs cancelled just before being set to DRAIN:

>[2023-06-06T22:11:08.229] [32856164.batch] error: *** JOB 32856164 ON bos-rndclus12 CANCELLED AT 2023-06-06T22:11:07 ***
>[2023-06-06T22:11:08.472] [32856139.batch] error: *** JOB 32856139 ON bos-rndclus12 CANCELLED AT 2023-06-06T22:11:08 ***
>[2023-06-06T22:11:08.486] [32856161.batch] error: *** JOB 32856161 ON bos-rndclus12 CANCELLED AT 2023-06-06T22:11:08 ***
>[2023-06-06T22:11:08.498] [32856138.batch] error: *** JOB 32856138 ON bos-rndclus12 CANCELLED AT 2023-06-06T22:11:08 ***
>[2023-06-06T22:11:09.254] [32856108.batch] error: *** JOB 32856108 ON bos-rndclus12 CANCELLED AT 2023-06-06T22:11:08 ***
>[2023-06-06T22:11:09.903] [32856067.batch] error: *** JOB 32856067 ON bos-rndclus12 CANCELLED AT 2023-06-06T22:11:09 ***
>[2023-06-06T22:11:10.311] [32856050.batch] error: *** JOB 32856050 ON bos-rndclus12 CANCELLED AT 2023-06-06T22:11:09 ***
>[2023-06-06T22:11:10.323] [32856051.batch] error: *** JOB 32856051 ON bos-rndclus12 CANCELLED AT 2023-06-06T22:11:09 ***
>[2023-06-06T22:11:10.333] [32856053.batch] error: *** JOB 32856053 ON bos-rndclus12 CANCELLED AT 2023-06-06T22:11:09 ***
>[2023-06-06T22:11:10.380] [32856055.batch] error: *** JOB 32856055 ON bos-rndclus12 CANCELLED AT 2023-06-06T22:11:09 ***
>[2023-06-06T22:11:10.380] [32856059.batch] error: *** JOB 32856059 ON bos-rndclus12 CANCELLED AT 2023-06-06T22:11:09 ***

A cluster user might have cancelled a bunch of jobs for some reason, maybe because they were not performing as expected or were simply stuck.

Afterwards, after approximately 90 s (which I suppose is your configured UnkillableStepTimeout), we can see in the log that the cancelled jobs could not be successfully killed:

>[2023-06-06T22:12:38.163] [32856164.batch] error: *** JOB 32856164 STEPD TERMINATED ON bos-rndclus12 AT 2023-06-06T22:12:37 DUE TO JOB NOT ENDING WITH SIGNALS ***

As stated in the documentation, if the processes belonging to a job cannot be successfully terminated within UnkillableStepTimeout[1], the node is set to DRAIN.
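
For reference, this timeout lives in slurm.conf; a minimal sketch, where the 90 s value is only an assumption based on the ~90 s gap seen in your logs (the actual default is lower):

UnkillableStepTimeout=90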

So I am thinking of 2 possible reasons for these DRAINs.

1. There was some general event/problem on Tuesday, maybe affecting the shared FS, that caused running jobs on the nodes to get stuck, making it impossible for slurmd to kill the processes.

2. There is a particular workflow from some user that gets processes stuck in some way. 

Are nodes still drained? Have they been rebooted?

If the nodes have not been rebooted yet, you could try to ssh into one of the affected compute nodes and search for active user processes. Since the processes could not be killed by slurmd, they should still be running/stuck on some of the nodes, and that could help us understand the reason for the failure.
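
A quick sketch of what to look for once logged in (commands are illustrative; the job ID is one of the failed ones from the errors above):

# the stuck step daemon shows the job id in its process name
ps -ef | grep 'slurmstepd.*32856164'
# processes stuck in uninterruptible sleep (D state) usually point at I/O, e.g. a hung shared FS
ps -eo pid,user,stat,wchan:30,cmd | awk 'NR==1 || $3 ~ /^D/'
# open files of a suspect PID can show which mount it is blocked on
lsof -p <PID>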

Another idea would be to check what the cancelled jobs were actually doing. The jobs on this particular node are the ones shown in the errors above. If all of them belong to the same user and ran the same program/pipeline, it could be worth investigating it a bit.

In any case, while we investigate the possible causes, to get the nodes back into production I would suggest rebooting them (checking for running processes first, as mentioned before) and, if no faulty hardware is detected, resuming the nodes so they can be allocated again. I would recommend doing a first test with just one node and, if it runs a job properly, doing the rest in a batch.
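
Resuming is done with scontrol, e.g. (node name taken from your logs):

scontrol update NodeName=bos-rndclus12 State=RESUME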

Kind regards,
Oscar

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepTimeout
Comment 4 J.P. Waller 2023-06-08 05:42:47 MDT
Hi, 2 more drained last night, 03 and 10. This happened at 3 AM, so it is very unlikely to be a user cancelling jobs. I've brought all of them back online. Is there any way we can get more details on what exactly Slurm doesn't like when trying to kill the jobs?

Comment 5 Oscar Hernández 2023-06-08 06:17:11 MDT
I am sorry to hear that.

Next time this happens, could you check if there is any user process still running on the DRAINED nodes, before putting them back online?

In order to check what is triggering the job cancellation, could you also share the slurmctld.log from the 6th of June? It would be great if you could also share slurm.conf.

If I cannot see anything, I might ask you to raise debug flags for a while, until this happens again, so that we can get a bit more debugging info. But for the moment that will be enough.
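
(For reference, both the debug level and the debug flags can be raised at runtime without restarting the daemons, e.g.:

scontrol setdebug debug2
scontrol setdebugflags +steps

I will suggest concrete levels/flags if it comes to that.)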

Were you able to check whether jobs 32856164, 32856139, 32856161, 32856138, 32856108... had anything in common? In the slurmd logs I can see how these ones fail, but there are others which complete without issues. Could you share the output of:

sacct -j 32856164,32856139,32856161,32856138,32856108 -o job,user,start,node,alloccpus

and then the output of:

sacct -j 32860531,32860666 -o job,user,start,node,alloccpus

The first ones are the jobs that could not get killed, and the second ones are jobs that finished successfully in the same node and at the same time.

Kind regards,
Oscar
Comment 6 Oscar Hernández 2023-06-09 06:16:14 MDT
Hi!

Did you notice any other problem overnight? Were you able to get any of the data requested in the previous comment?

Kind regards,
Oscar
Comment 7 J.P. Waller 2023-06-09 08:40:32 MDT
I see a lot of processes that look like this. Could this be related?

root     1093125       2  0 10:29 ?        00:00:00 [kworker/53:1-cgroup_pidlist_destroy]
root     1093126       2  0 10:29 ?        00:00:00 [kworker/5:1-cgroup_pidlist_destroy]
root     1093128       2  0 10:29 ?        00:00:00 [kworker/21:2-cgroup_pidlist_destroy]
root     1093243       2  0 10:29 ?        00:00:00 [kworker/17:2-cgroup_pidlist_destroy]
root     1093245       2  0 10:29 ?        00:00:00 [kworker/51:2-cgroup_pidlist_destroy]
root     1093481       2  0 10:30 ?        00:00:00 [kworker/58:1-cgroup_pidlist_destroy]
root     1093602       2  0 10:30 ?        00:00:00 [kworker/24:2-cgroup_pidlist_destroy]
root     1093709       2  0 10:30 ?        00:00:00 [kworker/4:2-cgroup_pidlist_destroy]
root     1093912       2  0 10:30 ?        00:00:00 [kworker/16:1-cgroup_pidlist_destroy]
root     1093994       2  0 10:30 ?        00:00:00 [kworker/1:2-cgroup_pidlist_destroy]
root     1094172       2  0 10:30 ?        00:00:00 [kworker/72:1-cgroup_pidlist_destroy]
root     1094362       2  0 10:30 ?        00:00:00 [kworker/48:2-cgroup_pidlist_destroy]
root     1094402       2  0 10:30 ?        00:00:00 [kworker/9:2-cgroup_pidlist_destroy]
root     1094417       2  0 10:30 ?        00:00:00 [kworker/10:1-events]
root     1094419       2  0 10:30 ?        00:00:00 [kworker/64:0-cgroup_pidlist_destroy]
root     1094468       2  0 10:30 ?        00:00:00 [kworker/7:0-cgroup_pidlist_destroy]
root     1094471       2  0 10:30 ?        00:00:00 [kworker/49:2-cgroup_pidlist_destroy]
root     1094472       2  0 10:30 ?        00:00:00 [kworker/42:2-cgroup_pidlist_destroy]
root     1094605       2  0 10:30 ?        00:00:00 [kworker/45:2-cgroup_pidlist_destroy]
root     1094686       2  0 10:30 ?        00:00:00 [kworker/88:2-cgroup_pidlist_destroy]
root     1094891       2  0 10:30 ?        00:00:00 [kworker/12:2-cgroup_pidlist_destroy]
Comment 8 Oscar Hernández 2023-06-09 11:09:12 MDT
Is this from a node currently running jobs, or from a drained node?
I think this could be expected on a busy node (running multiple jobs at the same time), since there could be many cgroups being destroyed. But I will investigate a bit around that.
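
In the meantime, one thing worth checking is whether any Slurm job cgroups are left behind after jobs end; a sketch, assuming cgroup v1 with the usual freezer hierarchy (the path may differ on your setup):

find /sys/fs/cgroup/freezer/slurm -maxdepth 2 -name 'job_*' 2>/dev/null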

Did you get that info at 10:30? That would mean these processes do not run for long.

Any chance you could also share the slurmctld.log from 6/6/2023? I would like to have it as a reference alongside the slurmd log I already have.

Kind regards,
Oscar
Comment 9 Oscar Hernández 2023-06-13 08:15:07 MDT
Hi,

I have been thinking about that. There is actually a way to know why nodes were set to drain: the slurm.conf configuration parameter UnkillableStepProgram[1]. This allows you to configure a script that will be executed on a node where Slurm is not able to successfully terminate some processes.

I am attaching a valid example of a script to put there (but feel free to add any commands you consider interesting). The idea is to detect the job processes currently running on the node and execute some forensics commands like ps, lsof, strace... This information should help us figure out which processes actually caused the trouble.

So, in order to configure it, you should set the following in slurm.conf:

UnkillableStepProgram=/path/to/script

Note that this script must be executable and present or accessible at this path on all compute nodes in the cluster. After setting it in the slurm.conf of all nodes, it is enough to run "scontrol reconfigure".

If you configure it, a file should be generated in the node's /tmp directory when a node is automatically drained. So, next time you see a node set to drain, please share this generated log so we can investigate further.
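
For illustration, a minimal sketch of what such a script can look like (this is not the attached example; the output location and the assumption that SLURM_JOB_ID is set in the program's environment are mine):

#!/bin/bash
# UnkillableStepProgram sketch: dump forensics for the step that could not be killed.
OUT=/tmp/unkillable-${SLURM_JOB_ID:-unknown}-$(date +%s).log
{
  echo "=== all processes (look for D state) ==="
  ps -eo pid,user,stat,wchan:30,cmd
  echo "=== open files of the stuck slurmstepd(s) ==="
  for pid in $(pgrep -f "slurmstepd.*${SLURM_JOB_ID:-}"); do
    lsof -p "$pid"
  done
} > "$OUT" 2>&1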

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram
Comment 10 Oscar Hernández 2023-06-13 08:22:13 MDT
Created attachment 30746 [details]
Unkillable script example

Attaching here the unkillable script example. Feel free to use or tune it.

Also, could you share your slurm.conf? I would like to check some other parameters.

Cheers,
Oscar
Comment 11 Oscar Hernández 2023-06-20 07:37:25 MDT
Hi,

Just checking the status of this issue. Has the situation recurred recently?

Since things seem to be under control, I am dropping severity to 3. Feel free to increase it if necessary.

Cheers,
Oscar
Comment 12 J.P. Waller 2023-06-21 06:54:47 MDT
Hi Oscar, the issue seems to have gone away so feel free to close out this ticket.
Comment 13 Oscar Hernández 2023-06-21 07:58:01 MDT
Good!

Closing the issue then. 

Nevertheless, I would suggest implementing an UnkillableStepProgram as mentioned in comment 9. It will be pretty useful if unexpectedly drained nodes appear again in the future.

Kind regards,
Oscar