Ticket 7890 - Job exit with NODE_FAIL and resumed on another node with same ID
Summary: Job exit with NODE_FAIL and resumed on another node with same ID
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 18.08.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-10-08 07:17 MDT by ARC Admins
Modified: 2019-10-08 07:57 MDT

See Also:
Site: University of Michigan
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description ARC Admins 2019-10-08 07:17:43 MDT
Hello,

We have a user who submitted a job and was assigned a node. The job ran for about six days and then exited with NODE_FAIL. The same job ID then resumed on another node and is still running. I see no trace of the job on the original node (I searched slurmd.log for the user's UID and the job ID). The only evidence I found of the job ID "existing" twice is in the DB:

```
MariaDB [slurm_acct_db]> select derived_ec,time_start,time_end,state,nodelist,nodes_alloc,exit_code from greatlakes_job_table where id_job='675242';
+------------+------------+------------+-------+----------+-------------+-----------+
| derived_ec | time_start | time_end   | state | nodelist | nodes_alloc | exit_code |
+------------+------------+------------+-------+----------+-------------+-----------+
|          0 | 1569282469 | 1569732851 |     7 | gl3111   |           1 |         0 |
|          0 | 1569732972 |          0 |     1 | gl3028   |           1 |         0 |
+------------+------------+------------+-------+----------+-------------+-----------+
```
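For future reference, the same pair of records should be visible through the accounting tools without querying the DB directly; a sketch, assuming sacct is on the path and using the job ID from above (field names per the sacct man page):

```shell
# Show all accounting records for the job, including the pre-requeue
# instance; --duplicates (-D) lists records that sacct hides by default.
sacct -j 675242 --duplicates \
      --format=JobID,Start,End,State,NodeList,ExitCode
```

With the default view (no --duplicates), sacct tends to show only the most recent record, which is why the duplicate was only obvious in the DB.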

When I run scontrol show job on the job itself, I see:

```
Requeue=1 Restarts=1 
```

Is this unusual? Is there a way to programmatically tell jobs to restart or requeue if they land on a bum node?

Thanks!

David
Comment 1 ARC Admins 2019-10-08 07:29:59 MDT
It appears we DO have JobRequeue set to 1 according to scontrol show config. So, I think that accounts for it. But I'll leave this open to be sure.
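For the record, this is the controller-side setting involved; a minimal slurm.conf fragment (behavior as described in the slurm.conf docs, with JobRequeue=1 being the default):

```
# slurm.conf
# JobRequeue=1 (the default) requeues batch jobs automatically after a
# node failure; set JobRequeue=0 to have such jobs exit with NODE_FAIL
# instead of being restarted.
JobRequeue=1
```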

David
Comment 2 ARC Admins 2019-10-08 07:57:52 MDT
After reading a bit, I see how a user can request this option on their own as well. So, I'm going to close this. Thanks for listening, though :)
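For anyone landing on this ticket: the per-job knobs are the sbatch requeue flags, which override the cluster default either way. A sketch of a batch script using them (the job name and application are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --requeue      # allow this job to be requeued after a node failure
# Use "#SBATCH --no-requeue" instead to opt out even when JobRequeue=1
# is set cluster-wide.

srun ./my_app          # hypothetical application
```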

David