Ticket 11189 - scancel is not working for one of the user jobs
Summary: scancel is not working for one of the user jobs
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.11.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Jason Booth
QA Contact:
URL:
Duplicates: 10840
Depends on:
Blocks:
 
Reported: 2021-03-23 15:42 MDT by Nandini
Modified: 2021-04-02 13:16 MDT
CC: 2 users

See Also:
Site: TTU
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
output.txt (556.23 KB, text/plain)
2021-03-25 09:57 MDT, Nandini
slurmctl.log.gz (2.69 MB, application/x-gzip)
2021-03-25 09:57 MDT, Nandini

Description Nandini 2021-03-23 15:42:34 MDT
squeue -u lusilves|more
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         648423_30    nocona      V1D lusilves CG 1-00:09:19      1 cpu-26-21

scancel --signal=TERM --state=CG 648423_30
scancel: error: Kill job error on job id 648423_30: Invalid job id specified
[root@cpu-23-11 ~]# scancel --signal=TERM --state=CG 648423
scancel: error: Kill job error on job id 648423: Invalid job id specified
[root@cpu-23-11 ~]#

I couldn't delete this user's job. Please let us know how we can delete it.

Thanks
Nandini
Texas Tech University
Comment 1 Jason Booth 2021-03-23 16:05:52 MDT
Can you run 'scontrol show job <job_id>' and send us back the output?


Have you tried canceling it by the "ArrayJobId"?

e.g.

$ scontrol show job  5088_2
JobId=5090 ArrayJobId=5088 ArrayTaskId=2 JobName=wrap



scancel 5088
Comment 2 Nandini 2021-03-23 16:12:01 MDT
I see the job was running on node cpu-26-21, which has a Lustre issue and is down. But I am still unable to delete this job using scancel 648423_30.

scontrol show job 648423_30
JobId=648453 ArrayJobId=648423 ArrayTaskId=30 JobName=V1D
   UserId=lusilves(100119) GroupId=EE(240) MCS_label=N/A
   Priority=1996 Nice=0 Account=default QOS=normal
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-00:09:19 TimeLimit=1-12:00:00 TimeMin=N/A
   SubmitTime=2021-03-20T18:05:17 EligibleTime=2021-03-20T18:05:17
   AccrueTime=Unknown
   StartTime=2021-03-20T18:05:17 EndTime=2021-03-21T18:14:36 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-03-20T18:05:17
   Partition=nocona AllocNode:Sid=login-20-25.localdomain:53219
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cpu-26-21
   BatchHost=cpu-26-21
   NumNodes=1 NumCPUs=6 NumTasks=6 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=18G,node=1,billing=21
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/lusilves/V1D[03-17-2021]/EXE.sh
   WorkDir=/home/lusilves/V1D[03-17-2021]
   StdErr=/home/lusilves/V1D[03-17-2021]/array-648423_30.err
   StdIn=/dev/null
   StdOut=/home/lusilves/V1D[03-17-2021]/array-648423_30.out
   Power=
   MailUser=luke.silvestre@ttu.edu MailType=END
   NtasksPerTRES:0

Comment 3 Jason Booth 2021-03-23 16:19:57 MDT
If the node in question is experiencing Lustre issues then I would suggest marking it down, which will cancel that job and any others on that node.

scontrol update nodename=cpu-26-21 state=down reason=lustre


When you have fixed that node you can move it back into an idle state with:

scontrol update nodename=cpu-26-21 state=idle
Comment 4 Nandini 2021-03-24 13:54:30 MDT
Even after draining the node, I still see the job in CG state; I tried scancel again and it didn't work.

Nandini

Comment 5 Nandini 2021-03-25 07:41:11 MDT
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         648423_30    nocona      V1D lusilves CG 1-00:09:19      1 cpu-26-21

I still couldn't delete the user's job. Please let me know how to proceed.

Thanks
Nandini

Comment 6 Jason Booth 2021-03-25 09:48:24 MDT
If Slurm is not able to cancel the job then we need to investigate why that is the case. Please gather the slurmctld.log and the slurmd.log for the job in question, compress them, and attach them to this ticket.

Please also gather the output of dmesg from the node in question, "cpu-26-21", and then reboot that node. If the job does not clear out after a reboot then we can look at another option to remove it.
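The collection steps above can be sketched as a small script. The log paths below are assumptions for illustration; check the SlurmctldLogFile and SlurmdLogFile settings in slurm.conf for the actual locations on your cluster.

```shell
# Sketch: gather diagnostics for this ticket into one compressed archive.
# Paths under /var/log/slurm are assumed defaults, not confirmed for this site.
OUT=bug11189-diag
mkdir -p "$OUT"
dmesg > "$OUT/dmesg-cpu-26-21.txt" 2>/dev/null || true        # run on cpu-26-21
cp /var/log/slurm/slurmctld.log "$OUT/" 2>/dev/null || true   # on the controller
cp /var/log/slurm/slurmd.log "$OUT/" 2>/dev/null || true      # on cpu-26-21
tar czf "$OUT.tar.gz" "$OUT"                                  # attach this file
```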
Comment 7 Nandini 2021-03-25 09:57:56 MDT
Created attachment 18652 [details]
output.txt

I have attached the log file from dmesg of the node cpu-26-21 and the slurmctld.log.

Thanks
Nandini

Comment 8 Nandini 2021-03-25 09:57:57 MDT
Created attachment 18653 [details]
slurmctl.log.gz
Comment 9 Nate Rini 2021-03-25 09:58:42 MDT
*** Ticket 10840 has been marked as a duplicate of this ticket. ***
Comment 10 Jason Booth 2021-03-25 10:39:04 MDT
Do you use a prolog and epilog on these nodes? If so, what do these do and how long do they run for?
Comment 11 Nandini 2021-03-25 11:00:41 MDT
We have these scripts in place. Usually when we cancel a job, the epilog takes care of cleaning up the job's tasks on the node, and it works fine; I am not sure why it failed in this case. However, on the node I killed all of the running processes belonging to this job as root, so the epilog may not have been aware of them. Every node has these scripts in place:

slurm.epilog   slurm.prolog  slurm.task_epilog  slurm.task_prolog

Comment 12 Jason Booth 2021-03-25 11:31:31 MDT
It sounds like there are still some artifacts ("inactive stepd sockets") left over from that job. A reboot of that node should clear these out.
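One way to check for such leftovers before rebooting is to search the slurmd spool directory on the node for files referencing the stuck job. This is a sketch only: /var/spool/slurmd is an assumed default, so substitute the SlurmdSpoolDir value from your slurm.conf, and the file-name pattern is a loose match, not a documented format.

```shell
# Sketch: look for per-job artifacts under the slurmd spool dir on cpu-26-21.
# SPOOL_DIR is an assumption; use SlurmdSpoolDir from slurm.conf.
SPOOL_DIR=/var/spool/slurmd
JOB_ID=648453   # the JobId reported by 'scontrol show job 648423_30'
find "$SPOOL_DIR" -name "*${JOB_ID}*" 2>/dev/null || true
# Any matches are candidate stale artifacts; a reboot of the node clears them.
```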
Comment 14 Jason Booth 2021-03-30 09:57:59 MDT
So it sounds like we may need to manually remove the job in this case.

Jobs are stored in the "StateSaveLocation=" under a hash.* directory.

For example:
hash.2$ ls
job.5092  job.5102

It is possible to manually remove that job, but you will want to first shut down slurmctld, back up the StateSaveLocation directory, and then remove the job's directory under its hash folder.


Take care to only remove the job's directory and nothing else. You will have to search for that specific job under the hash directory.
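A minimal sketch of locating that directory, assuming slurmctld buckets job state into hash.0 through hash.9 by JobId modulo 10; the STATE_SAVE path below is a placeholder, so substitute the StateSaveLocation value from your slurm.conf.

```shell
# Sketch: derive the state directory for the stuck job before manual removal.
# STATE_SAVE is an assumed path; use StateSaveLocation from slurm.conf.
STATE_SAVE=/var/spool/slurmctld
JOB_ID=648453   # the JobId (not the ArrayJobId) shown by 'scontrol show job'

echo "$STATE_SAVE/hash.$((JOB_ID % 10))/job.$JOB_ID"
# -> /var/spool/slurmctld/hash.3/job.648453
# Stop slurmctld and back up STATE_SAVE before removing only that directory.
```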
Comment 15 Jason Booth 2021-04-02 13:16:52 MDT
I am going to close this issue out. If the process in comment #14 does not resolve this, then please reopen this issue.