Ticket 7546

Summary: Jobs failing with NODE_FAIL, slurmd.log reports "already killed, do not launch batch job"
Product: Slurm    Reporter: HMS Research Computing <rc>
Component: slurmd    Assignee: Jason Booth <jbooth>
Status: RESOLVED CANNOTREPRODUCE
Severity: 4 - Minor Issue
Version: 18.08.4
Hardware: Linux
OS: Linux
Site: Harvard Medical School

Description HMS Research Computing 2019-08-09 09:52:02 MDT
Hello, 

We've had a report from one of our cluster users that his jobs were not executing as expected. For these jobs, sacct reports NODE_FAIL status. 

This is what sacct shows for some example job IDs:

[kmk34@login05 ~ 2019-08-09 11:25:30]$ sacct --format=jobid,user,state%20,CPUTime,TotalCPU,ExitCode,MaxRSS,AveRSS,NodeList%30,Partition%15,ReqTRES%30,Start%20,End%20,Elapsed,Timelimit --units=G -j 47770832,47770834,47770829
       JobID      User                State    CPUTime   TotalCPU ExitCode     MaxRSS     AveRSS                       NodeList       Partition                        ReqTRES                Start                  End    Elapsed  Timelimit
------------ --------- -------------------- ---------- ---------- -------- ---------- ---------- ------------------------------ --------------- ------------------------------ -------------------- -------------------- ---------- ----------
47770829         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00
47770829.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38
47770832         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00
47770832.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38
47770834         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00
47770834.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38


In slurmd.log on the compute node, it reports:
[kmk34@compute-e-16-239 ~ 2019-08-06 16:41:58]$ sudo cat /var/log/slurm/slurmd.log | grep 47770829
[sudo] password for kmk34:
[2019-08-06T12:07:51.368] task_p_slurmd_batch_request: 47770829
[2019-08-06T12:07:51.368] task/affinity: job 47770829 CPU input mask for node: 0x0001000
[2019-08-06T12:07:51.368] task/affinity: job 47770829 CPU final HW mask for node: 0x0001000
[2019-08-06T12:08:13.802] _run_prolog: prolog with lock for job 47770829 ran for 22 seconds
[2019-08-06T12:08:13.802] Job 47770829 already killed, do not launch batch job


The "already killed, do not launch batch job" message appears in slurmd.log for all three example job IDs. The user didn't kill these jobs. Do you have any idea what could be causing this problem?
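For reference, this is roughly how we tallied the message per job ID (a minimal sketch; the inline sample below stands in for the real log, which on a node would be /var/log/slurm/slurmd.log):

```shell
# Sketch: count "already killed" events per job ID in a slurmd log.
# The inline sample stands in for /var/log/slurm/slurmd.log on a real node.
log='[2019-08-06T12:08:13.802] Job 47770829 already killed, do not launch batch job
[2019-08-06T12:08:13.803] Job 47770832 already killed, do not launch batch job
[2019-08-06T12:08:13.804] Job 47770834 already killed, do not launch batch job'

printf '%s\n' "$log" |
  awk '/already killed, do not launch batch job/ {print $3}' |
  sort | uniq -c
```

This prints one count per affected job ID, which made it easy to confirm the message covered exactly the jobs sacct flagged as NODE_FAIL.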


Additionally, we looked at other "NODE_FAIL" jobs from this same user from the same time period to see the extent of the issue:
[kmk34@login05 ~ 2019-08-09 11:25:41]$ sacct --format=jobid,user,state%20,CPUTime,TotalCPU,ExitCode,MaxRSS,AveRSS,NodeList%30,Partition%15,ReqTRES%30,Start%20,End%20,Elapsed,Timelimit,Submit -j 47768523,47768524,47768525,47768526,47768527,47768528,47768529,47768530,47768531,47768532,47770827,47770828,47770829,47770830,47770831,47770832,47770833,47770834,47770835,47770836
       JobID      User                State    CPUTime   TotalCPU ExitCode     MaxRSS     AveRSS                       NodeList       Partition                        ReqTRES                Start                  End    Elapsed  Timelimit              Submit
------------ --------- -------------------- ---------- ---------- -------- ---------- ---------- ------------------------------ --------------- ------------------------------ -------------------- -------------------- ---------- ---------- -------------------
47768523         ao147            NODE_FAIL   00:32:41   00:00:00      1:0                                      compute-a-16-37           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T04:10:30   00:32:41   01:30:00 2019-08-05T17:59:01
47768524         ao147            NODE_FAIL   00:34:21   00:00:00      1:0                                      compute-a-16-78           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T04:12:10   00:34:21   01:30:00 2019-08-05T17:59:01
47768525         ao147            NODE_FAIL   00:34:21   00:00:00      1:0                                      compute-a-16-96           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T04:12:10   00:34:21   01:30:00 2019-08-05T17:59:01
47768526         ao147            NODE_FAIL   00:34:21   00:00:00      1:0                                      compute-a-16-98           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T04:12:10   00:34:21   01:30:00 2019-08-05T17:59:01
47768527         ao147            NODE_FAIL   00:34:21   00:00:00      1:0                                     compute-a-16-101           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T04:12:10   00:34:21   01:30:00 2019-08-05T17:59:01
47768528         ao147            NODE_FAIL   00:34:21   00:00:00      1:0                                     compute-a-16-123           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T04:12:10   00:34:21   01:30:00 2019-08-05T17:59:01
47768529         ao147            NODE_FAIL   00:02:26   00:00:00      1:0                                     compute-a-16-129           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T03:40:15   00:02:26   01:30:00 2019-08-05T17:59:01
47768530         ao147            NODE_FAIL   00:02:26   00:00:00      1:0                                     compute-a-16-137           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T03:40:15   00:02:26   01:30:00 2019-08-05T17:59:01
47768531         ao147            NODE_FAIL   00:02:26   00:00:00      1:0                                     compute-a-16-148           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T03:40:15   00:02:26   01:30:00 2019-08-05T17:59:01
47768532         ao147            NODE_FAIL   00:02:26   00:00:00      1:0                                      compute-a-17-59           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T03:40:15   00:02:26   01:30:00 2019-08-05T17:59:01
47770827         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00 2019-08-05T18:03:10
47770827.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38            2019-08-06T12:07:35
47770828         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00 2019-08-05T18:03:10
47770828.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38            2019-08-06T12:07:35
47770829         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00 2019-08-05T18:03:10
47770829.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38            2019-08-06T12:07:35
47770830         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00 2019-08-05T18:03:10
47770830.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38            2019-08-06T12:07:35
47770831         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00 2019-08-05T18:03:10
47770831.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38            2019-08-06T12:07:35
47770832         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00 2019-08-05T18:03:10
47770832.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38            2019-08-06T12:07:35
47770833         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00 2019-08-05T18:03:10
47770833.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38            2019-08-06T12:07:35
47770834         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00 2019-08-05T18:03:11
47770834.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38            2019-08-06T12:07:35
47770835         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00 2019-08-05T18:03:11
47770835.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38            2019-08-06T12:07:35
47770836         ao147            NODE_FAIL   00:00:37   00:00:00      1:0                                     compute-e-16-239           short  billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12   00:00:37   01:30:00 2019-08-05T18:03:11
47770836.ba+                         FAILED   00:00:38   00:00:00      0:0                                     compute-e-16-239                                                 2019-08-06T12:07:35  2019-08-06T12:08:13   00:00:38            2019-08-06T12:07:35


Some of the batch jobs didn't have a step listed at all, which was curious. These jobs didn't appear in the slurmd.log files on the allocated compute nodes. We also confirmed that there were jobs from other users in the same timeframe with the same phenotype (NODE_FAIL and no step created).
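To separate the two cases, we filtered the sacct job IDs for parents that never produced any step (a sketch; the inline sample stands in for the first column of real sacct output, e.g. from sacct -j <ids> --format=jobid --noheader):

```shell
# Sketch: list parent job IDs that have no step line at all.
# Inline sample stands in for the JobID column of sacct output;
# sacct truncates step names (e.g. "batch" shows as "ba+" at default width).
ids='47770827
47770827.ba+
47770828
47770829
47770829.ba+'

printf '%s\n' "$ids" | awk '
  /\./ { split($0, a, "."); seen_step[a[1]] = 1; next }
       { parent[$0] = 1 }
  END  { for (j in parent) if (!(j in seen_step)) print j }' | sort
```

For the sample above this prints only 47770828, i.e. the job that never got a step.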

Could you provide any more information to determine what the root cause of these issues was? It doesn't look like it is continuing to affect the cluster, but it would be helpful to know.


Please let me know if there is any other information that I could provide.


Thanks, 
Kathleen
Comment 1 Jason Booth 2019-08-09 13:18:27 MDT
Hi Kathleen

The info message you see:
> [2019-08-06T12:08:13.802] Job 47770829 already killed, do not launch batch job

is triggered when the job's credential has been revoked, e.g., because the job was canceled or a prolog failed.


I would be curious to know whether you experienced some type of network outage or disk problem at that time. You may want to run dmesg on the node to see if there are any errors listed there.
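For example, something along these lines to narrow the logs to the failure window (a sketch against made-up sample lines; on a node you would run it against /var/log/messages or saved `dmesg -T` output, and adjust the timestamp pattern to your syslog format):

```shell
# Sketch: pull syslog-style lines inside the failure window (~12:07-12:09, Aug 6).
# The sample lines below are fabricated for illustration only; replace the
# inline sample with the real log file on the node.
sample='Aug  6 12:05:01 compute-e-16-239 CROND[1234]: (root) CMD (run-parts /etc/cron.hourly)
Aug  6 12:08:10 compute-e-16-239 kernel: example I/O error message
Aug  6 12:15:30 compute-e-16-239 systemd: Started Session 42 of user root.'

printf '%s\n' "$sample" | awk '/^Aug  6 12:0[7-9]/'
```

Anything kernel- or network-related that lands inside that window would be a strong lead.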


It would also be good to know whether slurmd runs from local disk or whether these are diskless nodes that mount a software image containing slurmd.
Comment 2 HMS Research Computing 2019-08-20 12:29:57 MDT
Hi Jason,

We didn't have any significant (recognizable) network outage during that time, but a closer look at our network graphs suggests there was a hiccup in the network around then.

We store all slurmd state files on local disk and slurmctld state files on an NFS export.

Thanks,
Bhanu
Comment 3 Jason Booth 2019-08-20 16:44:48 MDT
Hi Bhanu

> we didn't have any significant (recognizable) network outage during that time, but looking closely at our network graphs suggest that there's a hiccup in the network around that time.

Without more data, it is hard to say for sure that the network was to blame. As mentioned in comment #1, the most common causes are a canceled job or a prolog failure; however, several minutes of network outage could also cause something similar.

If this happens often, please gather the output of dmesg from a few affected nodes, as well as the syslog, for us to review.
Comment 4 Jason Booth 2019-08-29 14:30:45 MDT
I am closing this out. If you have more information to attach, or a way to reproduce the issue, please feel free to re-open it.