Ticket 7890

Summary: Job exited with NODE_FAIL and resumed on another node with the same ID
Product: Slurm
Reporter: ARC Admins <arc-slurm-admins>
Component: Accounting
Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
Version: 18.08.7
Hardware: Linux
OS: Linux
Site: University of Michigan

Description ARC Admins 2019-10-08 07:17:43 MDT
Hello,

We have a user who submitted a job and got a node assigned. The job ran for about 6 days and then exited with NODE_FAIL. The same job ID then resumed on another node and is still running. I see no trace of the job on the original node (I searched slurmd.log for the user's UID and the job ID). The only evidence I found of the job ID existing twice is in the DB:

```
MariaDB [slurm_acct_db]> select derived_ec,time_start,time_end,state,nodelist,nodes_alloc,exit_code from greatlakes_job_table where id_job='675242';
+------------+------------+------------+-------+----------+-------------+-----------+
| derived_ec | time_start | time_end   | state | nodelist | nodes_alloc | exit_code |
+------------+------------+------------+-------+----------+-------------+-----------+
|          0 | 1569282469 | 1569732851 |     7 | gl3111   |           1 |         0 |
|          0 | 1569732972 |          0 |     1 | gl3028   |           1 |         0 |
+------------+------------+------------+-------+----------+-------------+-----------+
```
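For readers unfamiliar with the raw accounting columns, the two rows above can be decoded with a short script. This is a minimal sketch, not an official Slurm tool: the state names follow the `job_states` enum in `slurm.h`, the `0xff` mask for the base state is an assumption about how flags are packed, and the `rows` list and `describe` helper are hypothetical names holding values copied from the query output.

```python
from datetime import datetime, timedelta, timezone

# Base job states, following the job_states enum in slurm.h.
JOB_STATES = {
    0: "PENDING", 1: "RUNNING", 2: "SUSPENDED", 3: "COMPLETED",
    4: "CANCELLED", 5: "FAILED", 6: "TIMEOUT", 7: "NODE_FAIL",
}

# Values copied from the two greatlakes_job_table rows for job 675242.
rows = [
    {"time_start": 1569282469, "time_end": 1569732851, "state": 7, "nodelist": "gl3111"},
    {"time_start": 1569732972, "time_end": 0, "state": 1, "nodelist": "gl3028"},
]

def describe(row):
    """Turn one accounting row into readable values (hypothetical helper)."""
    # Assumption: the low byte holds the base state; higher bits are flags.
    base_state = row["state"] & 0xFF
    start = datetime.fromtimestamp(row["time_start"], tz=timezone.utc)
    # time_end == 0 means the record is still open (job has not ended).
    elapsed = (
        timedelta(seconds=row["time_end"] - row["time_start"])
        if row["time_end"] else None
    )
    return {
        "node": row["nodelist"],
        "state": JOB_STATES.get(base_state, f"UNKNOWN({base_state})"),
        "start_utc": start.isoformat(),
        "elapsed": elapsed,
    }

for row in rows:
    print(describe(row))

# Seconds between the first record ending and the second one starting.
requeue_gap = rows[1]["time_start"] - rows[0]["time_end"]
print("gap before restart (s):", requeue_gap)
```

Read this way, the first record is the NODE_FAIL run on gl3111 and the second is the requeued run on gl3028, which is still open because its `time_end` is 0.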

When I run scontrol show job on the job itself, I see:

```
Requeue=1 Restarts=1 
```
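To check these two fields from a script rather than by eye, something like the following could work. It is a rough sketch under stated assumptions: `parse_scontrol_fields` is a hypothetical helper, the sample string is just the line quoted above, and splitting on whitespace will mis-handle any field whose value contains spaces, so it is only suitable for simple fields such as `Requeue` and `Restarts`.

```python
def parse_scontrol_fields(text):
    """Collect Key=Value tokens from scontrol-style output (hypothetical helper).

    Tokens are split on whitespace, so values containing spaces are not
    handled correctly; that is acceptable for Requeue and Restarts.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not Key=Value pairs
            fields[key] = value
    return fields

# The line quoted from scontrol show job in this ticket.
sample = "Requeue=1 Restarts=1 "

fields = parse_scontrol_fields(sample)
requeue_enabled = fields.get("Requeue") == "1"
restarts = int(fields.get("Restarts", "0"))

if requeue_enabled and restarts > 0:
    print(f"Job is requeueable and has been restarted {restarts} time(s).")
```

In practice the input would come from the real command output for the job in question; how that output is captured is left out here.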

Is this unusual? Is there a way to programmatically tell jobs to restart or requeue if they land on a bum node?

Thanks!

David

Comment 1 ARC Admins 2019-10-08 07:29:59 MDT
It appears we DO have JobRequeue set to 1 according to scontrol show config. So, I think that accounts for it. But I'll leave this open to be sure.

David

Comment 2 ARC Admins 2019-10-08 07:57:52 MDT
After reading a bit, I see how a user can request this option on their own as well. So, I'm going to close this. Thanks for listening, though :)

David