Ticket 3446 - Jobs go to RUNNING state before node is booted and ready when PrologSlurmctld is set
Summary: Jobs go to RUNNING state before node is booted and ready when PrologSlurmctld...
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 15.08.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-02-02 01:30 MST by Rémi Palancher
Modified: 2017-02-15 00:47 MST
3 users

See Also:
Site: EDF - Electricite de France
Version Fixed: 16.05.10


Attachments
Proposed fix (4.71 KB, patch)
2017-02-10 18:11 MST, Moe Jette
Details | Diff
Patch for version 15.08.13 (1.57 KB, patch)
2017-02-14 11:18 MST, Moe Jette
Details | Diff

Description Rémi Palancher 2017-02-02 01:30:18 MST
Dear Slurm developers,

Unfortunately, we are still facing the same behaviour as reported in #3399: jobs
go to the RUNNING state (instead of waiting in the CONFIGURING state) before the
nodes are booted and ready to run them. The patch you kindly provided worked
nicely in my testing environment, but it does not bring the expected result on
the production cluster.

Actually, after a quick look at the source code, I figured out that the
difference between the two environments comes from the PrologSlurmctld
parameter. As soon as a PrologSlurmctld is set in the Slurm configuration, the
bug comes back. I am able to reproduce it reliably by setting
PrologSlurmctld=/bin/true in my testing environment. As soon as it is set, jobs
no longer wait in the CONFIGURING state.
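For reference, a minimal slurm.conf fragment along these lines reproduces it for me. The power-saving program paths and timings below are placeholders from my test setup, not the production values; only the PrologSlurmctld line matters for the reproduction:

```
# slurm.conf fragment (illustrative -- program paths and timings are
# placeholders; only PrologSlurmctld=/bin/true matters for the repro)
PrologSlurmctld=/bin/true
SuspendProgram=/usr/local/sbin/node_suspend
ResumeProgram=/usr/local/sbin/node_resume
SuspendTime=600
ResumeTimeout=300
```

With this set, a job submitted to a suspended node shows up as R in squeue while the node is still booting, instead of staying in CF.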

I guess the bug is somewhere around prolog_running_decr(), which clears the
JOB_CONFIGURING bit too early, without running the same logic as
job_config_fini() to extend end_time.

Can you please check what is going on here?

Thank you in advance,
Rémi
Comment 1 Alejandro Sanchez 2017-02-07 07:56:55 MST
Rémi - sorry for the late response. I can reproduce this by setting PrologSlurmctld, thanks for the suggestion. We are going to take a look at this and get back to you.
Comment 12 Moe Jette 2017-02-10 18:11:50 MST
Created attachment 4041 [details]
Proposed fix

Alex,
If you could do some testing of this patch for me, that would be much appreciated.

I have tried this using quite a few different configurations:
Suspending and resuming nodes
Jobs with --reboot option
With and without PrologSlurmctld
BlueGene/Q (emulated)
Various timings

I have not yet run on a KNL (rebooting the node into various NUMA and MCDRAM modes) but hope to do that next week.
Comment 13 Alejandro Sanchez 2017-02-13 05:20:01 MST
(In reply to Moe Jette from comment #12)
> If you could do some testing of this patch for me, that would be much
> appreciated.

Moe, I've been testing different configurations too:

- Suspend/Resume with/without PrologSlurmctld
- Suspend/Resume with/without --reboot
- Combinations of the two above
- Different timings
- Changing the timings while powering up, followed by 'scontrol reconfigure': with a ResumeProgram that sleeps 30 seconds before starting slurmd, and ResumeTimeout initially < 30, readjusting it to > 30 and running 'scontrol reconfigure' makes the job transition from CF to R properly, and the node state changes correctly too.
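The sleep-30 ResumeProgram from the last test could look like the following sketch. The script body is illustrative, not the actual test script; slurmctld does pass the nodelist as the first argument, but the ssh-based slurmd start is an assumption about the test setup:

```
#!/bin/sh
# Illustrative ResumeProgram: slurmctld passes the nodelist as $1.
# Simulate a 30-second boot, then start slurmd on each node.
sleep 30
for node in $(scontrol show hostnames "$1"); do
    ssh "$node" systemctl start slurmd
done
```

With ResumeTimeout below 30 this boot never "completes" in time; raising it above 30 and reconfiguring lets the CF-to-R transition happen normally.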

They all work as expected for me, your patch looks good so far.

I've not tested:

- BlueGene/Q (emulated)
- KNL
Comment 14 Moe Jette 2017-02-13 11:31:56 MST
Three of us have tested this patch on quite a few different systems with various configurations and found no problems. 

https://github.com/SchedMD/slurm/commit/f6d42fdbb293ca89da609779db8d8c04a86a8d13.patch

This change will be in version 16.05.10 when released (no date set).
Comment 15 Rémi Palancher 2017-02-14 01:04:30 MST
Hi Moe,

Thank you for this patch! Is there any chance you could backport it to Slurm 15.08?

Best,
Rémi
Comment 16 Moe Jette 2017-02-14 11:18:21 MST
Created attachment 4049 [details]
Patch for version 15.08.13

I do not expect that we will have any more releases of Slurm version 15.08. Also note that the reboot logic in version 16.05 is very different from 15.08 (mostly due to changes required for support of Intel KNL and rebooting to change NUMA or MCDRAM configuration). The attached patch has been tested with version 15.08.13.
Comment 17 Moe Jette 2017-02-14 11:29:49 MST
Also note that support for Slurm version 15.08 will end in May 2017, so upgrading soon is strongly recommended.
Comment 18 Rémi Palancher 2017-02-15 00:47:33 MST
Hi Moe,

Thank you for this new version of the patch; I am going to give it a try soon. Indeed, we will upgrade this cluster next summer — thank you for the reminder.

Best,
Rémi