squeue -u lusilves|more
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
648423_30 nocona V1D lusilves CG 1-00:09:19 1 cpu-26-21

scancel --signal=TERM --state=CG 648423_30
scancel: error: Kill job error on job id 648423_30: Invalid job id specified
[root@cpu-23-11 ~]# scancel --signal=TERM --state=CG 648423
scancel: error: Kill job error on job id 648423: Invalid job id specified
[root@cpu-23-11 ~]#

I couldn't delete this user's job. Please let us know how we can delete it.

Thanks,
Nandini
Texas Tech University
Can you run 'scontrol show job <job_id>' and send us back the output? Have you tried canceling it by the "ArrayJobId"?

e.g.
$ scontrol show job 5088_2
JobId=5090 ArrayJobId=5088 ArrayTaskId=2 JobName=wrap

scancel 5088
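As a rough sketch of the suggestion above, the `ArrayJobId` field can be pulled out of the `scontrol show job` output and handed to `scancel`. The helper name `array_job_id` is hypothetical (it is not part of Slurm), and the live commands are left as comments because they need a running cluster.

```shell
#!/usr/bin/env bash
# Print the ArrayJobId field from `scontrol show job` output read on stdin.
# array_job_id is a hypothetical helper name, not a Slurm command.
array_job_id() {
    grep -o 'ArrayJobId=[0-9]*' | head -n 1 | cut -d= -f2
}

# Offline demonstration using the sample output quoted in this comment.
sample='JobId=5090 ArrayJobId=5088 ArrayTaskId=2 JobName=wrap'
printf '%s\n' "$sample" | array_job_id

# On a live cluster (not run here):
#   parent=$(scontrol show job 5088_2 | array_job_id)
#   scancel "$parent"      # cancels every task of that array job
```

Note that cancelling by the parent ID affects all tasks of the array, not only the stuck one.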
I see the job was running on node cpu-26-21, which has a Lustre issue and is down. But I am still unable to delete this job using scancel 648423_30.

scontrol show job 648423_30
JobId=648453 ArrayJobId=648423 ArrayTaskId=30 JobName=V1D
   UserId=lusilves(100119) GroupId=EE(240) MCS_label=N/A
   Priority=1996 Nice=0 Account=default QOS=normal
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-00:09:19 TimeLimit=1-12:00:00 TimeMin=N/A
   SubmitTime=2021-03-20T18:05:17 EligibleTime=2021-03-20T18:05:17
   AccrueTime=Unknown
   StartTime=2021-03-20T18:05:17 EndTime=2021-03-21T18:14:36 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-03-20T18:05:17
   Partition=nocona AllocNode:Sid=login-20-25.localdomain:53219
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cpu-26-21
   BatchHost=cpu-26-21
   NumNodes=1 NumCPUs=6 NumTasks=6 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=18G,node=1,billing=21
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/lusilves/V1D[03-17-2021]/EXE.sh
   WorkDir=/home/lusilves/V1D[03-17-2021]
   StdErr=/home/lusilves/V1D[03-17-2021]/array-648423_30.err
   StdIn=/dev/null
   StdOut=/home/lusilves/V1D[03-17-2021]/array-648423_30.out
   Power=
   MailUser=luke.silvestre@ttu.edu MailType=END
   NtasksPerTRES:0
If the node in question is experiencing Lustre issues then I would suggest marking it down, which will cancel that job and any others on that node.

scontrol update nodename=cpu-26-21 state=down reason=luster

When you have fixed that node you can move it back into an idle state with:

scontrol update nodename=cpu-26-21 state=idle
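A minimal sketch of that down/idle cycle, written as a dry run so the commands can be reviewed before anyone runs them as root. The helper name `node_state_cmd` is hypothetical; the node name cpu-26-21 is taken from the `squeue` output earlier in this thread, and the reason string is free-form text.

```shell
#!/usr/bin/env bash
# Print (but do not run) the scontrol command that changes a node's state.
# node_state_cmd is a hypothetical helper, not part of Slurm.
node_state_cmd() {
    local node="$1" state="$2" reason="${3:-}"
    if [ -n "$reason" ]; then
        printf 'scontrol update nodename=%s state=%s reason=%s\n' "$node" "$state" "$reason"
    else
        printf 'scontrol update nodename=%s state=%s\n' "$node" "$state"
    fi
}

node_state_cmd cpu-26-21 down lustre   # take the node out of service
node_state_cmd cpu-26-21 idle          # return it once it is repaired

# To check afterwards whether anything is still completing on the node
# (needs a live cluster, so it is not run here):
#   squeue --nodelist=cpu-26-21 --states=COMPLETING
```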
Even after draining the node I still see the job in CG state. I tried scancel again and it didn't work.

Nandini
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
648423_30 nocona V1D lusilves CG 1-00:09:19 1 cpu-26-21

I still couldn't delete the user job. Please let me know what to try next.

Thanks,
Nandini
If Slurm is not able to cancel the job then we need to investigate why that is the case. Please gather the slurmctld.log and the slurmd.log for the job in question, compress them, and attach them to this bug. Please also gather the output of dmesg from the node in question, "cpu-26-21", and then reboot that node. If the job does not clear out after a reboot then we can look at another option to remove it.
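One possible way to prepare those attachments is sketched below. It extracts only the lines that mention the job and compresses them; attaching the full logs may still be preferable. The helper name `collect_job_logs` is hypothetical, and the log paths in the usage comment are assumptions (the real locations come from `SlurmctldLogFile` and `SlurmdLogFile` in slurm.conf). The job ID 648453 is the `JobId` reported by `scontrol show job` for array task 648423_30.

```shell
#!/usr/bin/env bash
# Collect the lines that mention one job from a set of log files and
# write them, gzip-compressed, to a single output file.
# collect_job_logs is a hypothetical helper, not part of Slurm.
collect_job_logs() {
    local jobid="$1" out="$2"
    shift 2
    local log
    for log in "$@"; do
        [ -r "$log" ] || { echo "skipping unreadable $log" >&2; continue; }
        echo "==> $log"
        # -w stops 648453 from matching inside a longer number such as 1648453.
        grep -w -- "$jobid" "$log" || true
    done | gzip -c > "$out"
}

# On the cluster (paths are assumptions, adjust to your slurm.conf):
#   collect_job_logs 648453 job-648453-logs.gz \
#       /var/log/slurm/slurmctld.log /var/log/slurm/slurmd.log
#   dmesg > dmesg-cpu-26-21.txt    # run on cpu-26-21 itself, before rebooting
```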
Created attachment 18652 [details] output.txt

I have attached the dmesg log from node cpu-26-21 and the slurmctld.log.

Thanks,
Nandini
Created attachment 18653 [details] slurmctl.log.gz
*** Ticket 10840 has been marked as a duplicate of this ticket. ***
Do you use a prolog and epilog on these nodes? If so, what do these do and how long do they run for?
We have these scripts in place. Usually when we cancel a job, the epilog takes care of cleaning up the tasks on the node, and I see it works perfectly fine. I am not sure why it failed in this case. However, on the node I killed all the running processes belonging to this job, so the epilog may not know about them due to root intervention. Every node has these scripts in place:

slurm.epilog
slurm.prolog
slurm.task_epilog
slurm.task_prolog
It sounds like there are still some artifacts, "in-active stepd sockets", left over from that job. A reboot of that node should clear these out.
I rebooted the node and that didn't help. I still see the job in CG state.

Nandini
So it sounds like we may need to manually remove the job in this case. Jobs are stored in the "StateSaveLocation=" under a hash.* directory. For example:

hash.2$ ls
job.5092 job.5102

It is possible to manually remove that job, but you will want to first shut down slurmctld, back up the StateSaveLocation directory, and then remove the job's directory under its hash folder. Take care to only remove the job's directory and nothing else. You will have to search for that specific job under the hash directory.
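The search step could be sketched roughly as follows. The helper `find_job_state_dir` is a hypothetical name and only locates the directory; nothing is deleted by the code itself. The commented outline assumes a systemd-managed slurmctld and assumes the directory is named after the numeric `JobId` (648453 in the `scontrol show job` output) instead of the 648423_30 array notation; both should be verified on the actual system.

```shell
#!/usr/bin/env bash
# Locate a job's directory under <StateSaveLocation>/hash.*/ without
# deleting anything. find_job_state_dir is a hypothetical helper name.
find_job_state_dir() {
    local state_dir="$1" jobid="$2"
    find "$state_dir" -mindepth 2 -maxdepth 2 -type d \
        -path '*/hash.*' -name "job.$jobid"
}

# Outline of the manual procedure (run as root; not executed here).
# The service name and the job ID are assumptions, see the note above:
#   state_dir=$(scontrol show config | awk '/^StateSaveLocation/ {print $3}')
#   systemctl stop slurmctld
#   cp -a "$state_dir" "$state_dir.bak"
#   job_dir=$(find_job_state_dir "$state_dir" 648453)
#   echo "$job_dir"        # confirm it is exactly one hash.N/job.648453 path
#   rm -r -- "$job_dir"
#   systemctl start slurmctld
```

Printing the path before removing it is a deliberate pause: an empty or multi-line result means the search did not isolate a single job directory and nothing should be removed.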
I am going to close this issue out. If the process in comment #14 does not resolve this then please re-open this issue.