Ticket 7890 - Job exit with NODE_FAIL and resumed on another node with same ID
Summary: Job exit with NODE_FAIL and resumed on another node with same ID
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 18.08.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-10-08 07:17 MDT by ARC Admins
Modified: 2019-10-08 07:57 MDT

See Also:
Site: University of Michigan
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description ARC Admins 2019-10-08 07:17:43 MDT
Hello,

We have a user who submitted a job and was assigned a node. The job ran for about six days and then exited with NODE_FAIL. The same job ID then resumed on another node and is still running. I see no trace of the job on the original node (I searched slurmd.log for the user's UID and the job ID). The only evidence I found of the job ID "existing" twice is in the DB:

```
MariaDB [slurm_acct_db]> select derived_ec,time_start,time_end,state,nodelist,nodes_alloc,exit_code from greatlakes_job_table where id_job='675242';
+------------+------------+------------+-------+----------+-------------+-----------+
| derived_ec | time_start | time_end   | state | nodelist | nodes_alloc | exit_code |
+------------+------------+------------+-------+----------+-------------+-----------+
|          0 | 1569282469 | 1569732851 |     7 | gl3111   |           1 |         0 |
|          0 | 1569732972 |          0 |     1 | gl3028   |           1 |         0 |
+------------+------------+------------+-------+----------+-------------+-----------+
```
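For future reference, the same pair of records should be visible through the accounting tools without querying the DB directly; a sketch, assuming sacct is on the path and using the job ID from above (field names per the sacct man page):

```shell
# Show all accounting records for the job, including the pre-requeue
# instance; --duplicates (-D) lists records that sacct hides by default.
sacct -j 675242 --duplicates \
      --format=JobID,Start,End,State,NodeList,ExitCode
```

With the default view (no --duplicates), sacct tends to show only the most recent record, which is why the duplicate was only obvious in the DB.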

When I run scontrol show job on the job itself, I see:

```
Requeue=1 Restarts=1 
```

Is this unusual? Is there a way to programmatically tell jobs to restart or requeue if they land on a bum node?

Thanks!

David
Comment 1 ARC Admins 2019-10-08 07:29:59 MDT
It appears we DO have JobRequeue set to 1 according to scontrol show config. So, I think that accounts for it. But I'll leave this open to be sure.
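For the record, this is the controller-side setting involved; a minimal slurm.conf fragment (behavior as described in the slurm.conf docs, with JobRequeue=1 being the default):

```
# slurm.conf
# JobRequeue=1 (the default) requeues batch jobs automatically after a
# node failure; set JobRequeue=0 to have such jobs exit with NODE_FAIL
# instead of being restarted.
JobRequeue=1
```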

David
Comment 2 ARC Admins 2019-10-08 07:57:52 MDT
After reading a bit, I see how a user can request this option on their own as well. So, I'm going to close this. Thanks for listening, though :)
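For anyone landing on this ticket: the per-job knobs are the sbatch requeue flags, which override the cluster default either way. A sketch of a batch script using them (the job name and application are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --requeue      # allow this job to be requeued after a node failure
# Use "#SBATCH --no-requeue" instead to opt out even when JobRequeue=1
# is set cluster-wide.

srun ./my_app          # hypothetical application
```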

David