Ticket 14265

Summary: Job not ending with signals
Product: Slurm Reporter: rvasquez2
Component: slurmd Assignee: Albert Gil <albert.gil>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 20.11.7   
Hardware: Linux   
OS: Linux   
Site: Dow Chemical Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurmd.log from the machine

Description rvasquez2 2022-06-07 11:48:30 MDT
Created attachment 25399 [details]
slurmd.log from the machine

Looks like we had jobs hit walltime and slurm tried to cancel but wasn't able. Any insight?

[2022-05-31T15:35:33.003] [155896.batch] error: *** JOB 155896 STEPD TERMINATED ON gpu7 AT 2022-05-31T15:35:33 DUE TO JOB NOT ENDING WITH SIGNALS ***
Comment 1 Albert Gil 2022-06-08 04:50:53 MDT
Hi,

> Looks like we had jobs hit walltime and slurm tried to cancel but wasn't
> able. Any insight?
> 
> [2022-05-31T15:35:33.003] [155896.batch] error: *** JOB 155896 STEPD
> TERMINATED ON gpu7 AT 2022-05-31T15:35:33 DUE TO JOB NOT ENDING WITH SIGNALS
> ***

Yes, your description is correct.
When Slurm cancels a job, it sends Linux signals to all the processes in all of the job's steps. It initially sends SIGTERM:

[2022-05-31T15:34:03.256] [155896.batch] Sent signal 15 to StepId=155896.batch

And after some time (KillWait), if they have not ended, it sends a SIGKILL:

[2022-05-31T15:34:32.968] [155896.batch] Sent SIGKILL signal to StepId=155896.batch
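
The SIGTERM-then-SIGKILL escalation described above can be illustrated with a small standalone sketch. This is not Slurm's code: the function names, the child process, and the 0.5 second wait are all hypothetical, chosen only to show the pattern of "ask politely, wait, then force".

```python
import signal
import subprocess
import sys


def ignore_sigterm():
    # Runs in the child before exec; an ignored signal stays ignored across exec.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)


def stop_with_escalation(proc, kill_wait):
    """Send SIGTERM, wait up to kill_wait seconds, then fall back to SIGKILL.

    Returns the name of the last signal that had to be sent.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=kill_wait)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL, which cannot be caught or ignored
        proc.wait()
        return "SIGKILL"


def demo(kill_wait=0.5):
    # Hypothetical stand-in for a job step that does not react to SIGTERM.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        preexec_fn=ignore_sigterm,
    )
    last_signal = stop_with_escalation(child, kill_wait)
    return last_signal, child.returncode


if __name__ == "__main__":
    print(demo())
```

Here the wait between the two signals plays the role that KillWait plays in Slurm. A negative return code from `subprocess` means the child was ended by that signal number.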

With SIGKILL the processes should exit, but if the node is having problems they may keep running. Most typically this is because they are in an "uninterruptible sleep" state (D), which usually happens due to problems with a network filesystem.

Slurm keeps trying to kill those processes for some time (UnkillableStepTimeout), and finally it decides to terminate the step and drain the node to avoid further problems in future jobs:

[2022-05-31T15:35:33.003] [155896.batch] error: *** JOB 155896 STEPD TERMINATED ON gpu7 AT 2022-05-31T15:35:33 DUE TO JOB NOT ENDING WITH SIGNALS ***

I assume that this gpu7 node was drained after this happened, right?

See KillWait option to control the time between SIGTERM and SIGKILL:
- https://slurm.schedmd.com/slurm.conf.html#OPT_KillWait

And UnkillableStepTimeout to control the time between the first SIGKILL and the decision that the step is "unkillable":
- https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepTimeout
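
As a quick cross-check, the intervals between the three log lines quoted in this ticket can be computed and compared with those two settings. The sketch below assumes the documented defaults (KillWait=30, UnkillableStepTimeout=60); the values actually configured on this cluster are not shown in the ticket, and the variable names are only illustrative.

```python
from datetime import datetime

# Timestamps copied from the slurmd.log lines quoted in this ticket.
SIGTERM_SENT = "2022-05-31T15:34:03.256"
SIGKILL_SENT = "2022-05-31T15:34:32.968"
STEPD_TERMINATED = "2022-05-31T15:35:33.003"

# Assumption: documented slurm.conf defaults, in seconds.
KILL_WAIT_DEFAULT = 30
UNKILLABLE_STEP_TIMEOUT_DEFAULT = 60


def seconds_between(start, end):
    """Return the elapsed seconds between two slurmd.log timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


term_to_kill = seconds_between(SIGTERM_SENT, SIGKILL_SENT)
kill_to_giveup = seconds_between(SIGKILL_SENT, STEPD_TERMINATED)

print(f"SIGTERM -> SIGKILL:       {term_to_kill:.3f} s "
      f"(KillWait default: {KILL_WAIT_DEFAULT} s)")
print(f"SIGKILL -> step given up: {kill_to_giveup:.3f} s "
      f"(UnkillableStepTimeout default: {UNKILLABLE_STEP_TIMEOUT_DEFAULT} s)")
```

The two intervals come out at 29.712 s and 60.035 s, which is consistent with both options being at their defaults on gpu7.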

For further troubleshooting you can create and set an UnkillableStepProgram, which will be run on the node when that happens, so you can gather information about possible problems on the node:
- https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram
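
As one possible starting point for such a program, the sketch below lists the processes that are currently in the D state by reading /proc/<pid>/stat. It is only an illustration: the function names are made up, it uses nothing Slurm-specific, and a real script would probably also write its findings to a log file and record things like mount status.

```python
import os


def parse_stat(text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    command = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, command, state


def find_uninterruptible(proc_root="/proc"):
    """Return a sorted list of (pid, command) for processes in state D."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the entry was unreadable
        if state == "D":
            found.append((pid, command))
    return sorted(found)


if __name__ == "__main__":
    for pid, command in find_uninterruptible():
        print(f"{pid}\t{command}\tD")
```

Saved as an executable file on the compute nodes, a script like this could be referenced from slurm.conf through UnkillableStepProgram, with the path being whatever location the site chooses.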


Hope this helps,
Albert
Comment 2 Albert Gil 2022-06-17 10:09:01 MDT
Hi,

If this is OK with you, I'm closing this ticket as infogiven, but please don't hesitate to reopen it if you need further support.

Regards,
Albert
Comment 3 rvasquez2 2022-06-17 11:25:41 MDT
Sounds good, thank you


