Hi SchedMD,

Since moving to Slurm 17.02 and RHEL7, we're seeing quite regular occurrences of nodes being drained with "batch job complete failure", with no obvious sign of a problem on the nodes. This seems to mainly happen when jobs are terminated due to memory or time limits. For instance:

Jul 2 13:05:28 sh-112-01 slurmstepd[170482]: error: *** JOB 410277 ON sh-112-01 CANCELLED AT 2017-07-02T13:05:28 DUE TO TIME LIMIT ***
Jul 2 13:05:30 sh-112-01 slurmstepd[170476]: done with job
Jul 2 13:06:33 sh-112-01 slurmstepd[170482]: error: *** JOB 410277 STEPD TERMINATED ON sh-112-01 AT 2017-07-02T13:06:32 DUE TO JOB NOT ENDING WITH SIGNALS ***
Jul 2 13:06:33 sh-112-01 slurmstepd[170482]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
Jul 2 13:06:33 sh-112-01 slurmstepd[170482]: done with job

The logs on that node don't show any OOM killer activity or anything like the problems mentioned in #1357 or #3791. There is no D or Z process hanging around on the node, and as far as I can tell, all the processes from that job have eventually exited (the P.S. below shows the kind of check we run by hand).

So my questions:
1. is there any way to debug the "job not ending with signals" situation a little bit more, for instance by getting the PIDs of the processes that are not terminating?
2. what does "error:4001 status 15" mean?
3. is there a way to increase the delay before slurmstepd considers that the job processes won't quit?

I should also add that we haven't seen a single instance of this happening on our RHEL6/Slurm 16.05 cluster in 3+ years of production.

Thanks!
-- Kilian
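P.S. For reference, this is roughly the check we run by hand on a drained node. It is a minimal sketch, assuming the job's leftover processes can be identified by the owning user's UID (the JOB_UID value below is hypothetical), and it only scans /proc for processes stuck in D (uninterruptible sleep) or Z (zombie) state:

    import os

    SUSPECT_STATES = {"D", "Z"}   # uninterruptible sleep, zombie
    JOB_UID = 12345               # hypothetical: UID of the job's owner

    def procs_in_suspect_state(uid):
        """Scan /proc for processes owned by `uid` stuck in D or Z state."""
        found = []
        for entry in os.listdir("/proc"):
            if not entry.isdigit():
                continue
            pid = int(entry)
            try:
                if os.stat(f"/proc/{pid}").st_uid != uid:
                    continue
                name, state = None, None
                with open(f"/proc/{pid}/status") as f:
                    for line in f:
                        if line.startswith("Name:"):
                            name = line.split(None, 1)[1].strip()
                        elif line.startswith("State:"):
                            state = line.split()[1]   # e.g. "D", "Z", "S"
                            break
                if state in SUSPECT_STATES:
                    found.append((pid, state, name))
            except (FileNotFoundError, ProcessLookupError):
                continue   # process exited while we were scanning
        return found

    if __name__ == "__main__":
        for pid, state, name in procs_in_suspect_state(JOB_UID):
            print(f"PID {pid} ({name}) in state {state}")

On the nodes drained with "batch job complete failure", this kind of scan comes back empty, which is why we're puzzled by the "JOB NOT ENDING WITH SIGNALS" message.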
Hello Kilian,

This issue has already been reported in bug 3941 and work is progressing there. I'm closing this bug as a duplicate.

Regards,
Tim

*** This ticket has been marked as a duplicate of ticket 3941 ***
Ah great, thanks!