| Summary: | Jobs failing with NODE_FAIL, slurmd.log reports "already killed, do not launch batch job" | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | HMS Research Computing <rc> |
| Component: | slurmd | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 18.08.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Harvard Medical School | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Hi Kathleen,

The info message you see:

> [2019-08-06T12:08:13.802] Job 47770829 already killed, do not launch batch job

is triggered when the job's credential has been revoked, either because the job was canceled or because the prolog failed.

I would be curious to know whether you experienced some type of network outage or disk problem at that time. You may want to run dmesg on the node to see if any errors are listed there.

It would also be good to know whether slurmd is stored on local disk, or whether these are diskless nodes that mount a software image with slurmd on it.
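As a quick first step, the revoked job IDs can be pulled straight out of slurmd.log. A minimal sketch — the helper name is made up here, and the log path varies per site:

```shell
#!/bin/sh
# revoked_job_ids: list the IDs of jobs whose batch launch was refused
# because the credential had already been revoked. (Hypothetical helper,
# not a Slurm tool; pass it the path to a slurmd.log.)
revoked_job_ids() {
    sed -n 's/.*Job \([0-9][0-9]*\) already killed, do not launch batch job.*/\1/p' "$1" | sort -u
}
```

Running it against the log quoted above would print `47770829` once, even if the message repeats.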
Hi Jason,

We didn't have any significant (recognizable) network outage during that time, but looking closely at our network graphs suggests there was a hiccup in the network around then. We store all slurmd state files on local disk, and the slurmctld state files on an NFS export.

Thanks,
Bhanu

Hi Kathleen,

> We didn't have any significant (recognizable) network outage during that time, but looking closely at our network graphs suggests there was a hiccup in the network around then.

Without more data, it is hard to say for sure that the network was to blame. As mentioned in comment #1, the most common cause is a job cancellation or a prolog failure; however, several minutes of network outage could also cause something similar. If this happens often, please gather the output of dmesg from a few nodes, as well as the syslog, for us to review.

I am closing this out. If you have more information to attach, or a way to reproduce, please feel free to re-open this issue.
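When reviewing the saved dmesg/syslog output from the affected nodes, a coarse first pass is to grep for the usual suspects (NFS timeouts, link flaps, I/O errors, OOM kills). The helper name and pattern list below are suggestions, not a Slurm tool:

```shell
#!/bin/sh
# suspect_lines: print lines from a saved "dmesg -T" or syslog file that
# mention common hardware/network failure signatures. (Hypothetical
# helper; extend the pattern list as needed for your site.)
suspect_lines() {
    grep -E -i 'nfs|link is (down|up)|i/o error|segfault|out of memory' "$1"
}
```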
Hello,

We've had a report from one of our cluster users that his jobs were not executing as expected. For these jobs, sacct reports NODE_FAIL status. This is what sacct shows for some example job IDs:

```
[kmk34@login05 ~ 2019-08-09 11:25:30]$ sacct --format=jobid,user,state%20,CPUTime,TotalCPU,ExitCode,MaxRSS,AveRSS,NodeList%30,Partition%15,ReqTRES%30,Start%20,End%20,Elapsed,Timelimit --units=G -j 47770832,47770834,47770829
JobID         User   State      CPUTime   TotalCPU  ExitCode  MaxRSS  AveRSS  NodeList          Partition  ReqTRES                        Start                End                  Elapsed   Timelimit
------------  -----  ---------  --------  --------  --------  ------  ------  ----------------  ---------  -----------------------------  -------------------  -------------------  --------  ---------
47770829      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00
47770829.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38
47770832      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00
47770832.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38
47770834      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00
47770834.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38
```

In slurmd.log on the compute node, it reports:

```
[kmk34@compute-e-16-239 ~ 2019-08-06 16:41:58]$ sudo cat /var/log/slurm/slurmd.log | grep 47770829
[sudo] password for kmk34:
[2019-08-06T12:07:51.368] task_p_slurmd_batch_request: 47770829
[2019-08-06T12:07:51.368] task/affinity: job 47770829 CPU input mask for node: 0x0001000
[2019-08-06T12:07:51.368] task/affinity: job 47770829 CPU final HW mask for node: 0x0001000
[2019-08-06T12:08:13.802] _run_prolog: prolog with lock for job 47770829 ran for 22 seconds
[2019-08-06T12:08:13.802] Job 47770829 already killed, do not launch batch job
```

The "already killed, do not launch batch job" message appears for all three example job IDs in slurmd.log. The cluster user didn't kill these jobs. Do you have an idea of what could be causing this problem?

Additionally, we looked at other NODE_FAIL jobs from this same user from the same time period to see the extent of the issue:

```
[kmk34@login05 ~ 2019-08-09 11:25:41]$ sacct --format=jobid,user,state%20,CPUTime,TotalCPU,ExitCode,MaxRSS,AveRSS,NodeList%30,Partition%15,ReqTRES%30,Start%20,End%20,Elapsed,Timelimit,Submit -j 47768523,47768524,47768525,47768526,47768527,47768528,47768529,47768530,47768531,47768532,47770827,47770828,47770829,47770830,47770831,47770832,47770833,47770834,47770835,47770836
JobID         User   State      CPUTime   TotalCPU  ExitCode  MaxRSS  AveRSS  NodeList          Partition  ReqTRES                        Start                End                  Elapsed   Timelimit  Submit
------------  -----  ---------  --------  --------  --------  ------  ------  ----------------  ---------  -----------------------------  -------------------  -------------------  --------  ---------  -------------------
47768523      ao147  NODE_FAIL  00:32:41  00:00:00  1:0                       compute-a-16-37   short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T04:10:30  00:32:41  01:30:00   2019-08-05T17:59:01
47768524      ao147  NODE_FAIL  00:34:21  00:00:00  1:0                       compute-a-16-78   short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T04:12:10  00:34:21  01:30:00   2019-08-05T17:59:01
47768525      ao147  NODE_FAIL  00:34:21  00:00:00  1:0                       compute-a-16-96   short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T04:12:10  00:34:21  01:30:00   2019-08-05T17:59:01
47768526      ao147  NODE_FAIL  00:34:21  00:00:00  1:0                       compute-a-16-98   short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T04:12:10  00:34:21  01:30:00   2019-08-05T17:59:01
47768527      ao147  NODE_FAIL  00:34:21  00:00:00  1:0                       compute-a-16-101  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T04:12:10  00:34:21  01:30:00   2019-08-05T17:59:01
47768528      ao147  NODE_FAIL  00:34:21  00:00:00  1:0                       compute-a-16-123  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T04:12:10  00:34:21  01:30:00   2019-08-05T17:59:01
47768529      ao147  NODE_FAIL  00:02:26  00:00:00  1:0                       compute-a-16-129  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T03:40:15  00:02:26  01:30:00   2019-08-05T17:59:01
47768530      ao147  NODE_FAIL  00:02:26  00:00:00  1:0                       compute-a-16-137  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T03:40:15  00:02:26  01:30:00   2019-08-05T17:59:01
47768531      ao147  NODE_FAIL  00:02:26  00:00:00  1:0                       compute-a-16-148  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T03:40:15  00:02:26  01:30:00   2019-08-05T17:59:01
47768532      ao147  NODE_FAIL  00:02:26  00:00:00  1:0                       compute-a-17-59   short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T03:37:49  2019-08-06T03:40:15  00:02:26  01:30:00   2019-08-05T17:59:01
47770827      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00   2019-08-05T18:03:10
47770827.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38             2019-08-06T12:07:35
47770828      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00   2019-08-05T18:03:10
47770828.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38             2019-08-06T12:07:35
47770829      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00   2019-08-05T18:03:10
47770829.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38             2019-08-06T12:07:35
47770830      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00   2019-08-05T18:03:10
47770830.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38             2019-08-06T12:07:35
47770831      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00   2019-08-05T18:03:10
47770831.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38             2019-08-06T12:07:35
47770832      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00   2019-08-05T18:03:10
47770832.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38             2019-08-06T12:07:35
47770833      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00   2019-08-05T18:03:10
47770833.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38             2019-08-06T12:07:35
47770834      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00   2019-08-05T18:03:11
47770834.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38             2019-08-06T12:07:35
47770835      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00   2019-08-05T18:03:11
47770835.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38             2019-08-06T12:07:35
47770836      ao147  NODE_FAIL  00:00:37  00:00:00  1:0                       compute-e-16-239  short      billing=1,cpu=1,mem=4G,node=1  2019-08-06T12:07:35  2019-08-06T12:08:12  00:00:37  01:30:00   2019-08-05T18:03:11
47770836.ba+         FAILED     00:00:38  00:00:00  0:0                       compute-e-16-239                                            2019-08-06T12:07:35  2019-08-06T12:08:13  00:00:38             2019-08-06T12:07:35
```

Some of the batch jobs didn't have a step listed, which was really curious. These jobs didn't appear in the slurmd.log files on the allocated compute nodes. We also confirmed that there were jobs from other users during this timeframe with the same phenotype (NODE_FAIL and no step created).

Could you provide any more information to determine what the root cause of these issues was? It doesn't look like it is continuing to affect the cluster, but it would be helpful to know. Please let me know if there is any other information that I could provide.

Thanks,
Kathleen
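To gauge how widespread the phenotype is across users, one approach is to dump a window of accounting data in parsable form and filter for NODE_FAIL records. A sketch — the helper name is made up, and the exact sacct invocation is an assumption:

```shell
#!/bin/sh
# node_fail_ids: read parsable "JobID|User|State" lines on stdin and
# print the JobIDs that ended in NODE_FAIL. A plausible (unverified)
# invocation would be something like:
#   sacct -a -P -S 2019-08-06T00:00:00 -E 2019-08-06T13:00:00 \
#         --format=JobID,User,State | node_fail_ids
node_fail_ids() {
    awk -F'|' '$3 == "NODE_FAIL" { print $1 }'
}
```

Step records (the `.ba+` FAILED rows) fall through the filter, so only the job-level failures are counted.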