Ticket 12445 - overlapping of jobs in prologue/epilogue leads to issues
Summary: overlapping of jobs in prologue/epilogue leads to issues
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.11.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marcin Stolarek
 
Reported: 2021-09-08 09:22 MDT by Michael Hebenstreit
Modified: 2021-11-03 14:41 MDT

See Also:
Site: Intel CRT


Description Michael Hebenstreit 2021-09-08 09:22:18 MDT
transferred from https://bugs.schedmd.com/show_bug.cgi?id=12102

The SLURM 19.05.3 release notes say:

“Nodes in COMPLETING state treated as being currently available for job will-run test.”

(reference https://github.com/SchedMD/slurm/blob/master/NEWS). Since jobs still running their epilog are in COMPLETING state, the scheduler can place new work on nodes that are still working through the epilog. In my case, because the nodes reset the ATS cards during the epilog, I ran into cases where I got a node with no GPU until a few seconds after the allocation started. For scripting this is very bad, as the run starts quickly enough that it finds no GPUs and fails.

Can you please confirm my understanding of prologue/epilogue order and the impact of setting "CompleteWait":

slurmd starts the prologue and epilogue on each node.
If the prologue succeeds on EVERY node, the user code will start; if the prologue fails on a single node, execution of the user code will be skipped. Then the epilogue will be executed on each node and the status will go to "COMPLETING", aka "CG".
"CompleteWait" seconds after the epilogue begins, the node is considered free for the next job, even if the epilogue could still be running on some or even all nodes.

So by setting "CompleteWait" to a (ridiculous) value like 1000 I can prevent this behaviour?
Comment 4 Marcin Stolarek 2021-09-10 07:31:29 MDT
Michael,

I'm not sure I fully understand the issue. I'll try to answer/comment on comment 0 to give you a better understanding, and we can discuss further if needed.

The NEWS line:
>“Nodes in COMPLETING state treated as being currently available for job will-run test.”
you're referring to was introduced in 0666db61ca5[1]. It changes the way SELECT_MODE_WILL_RUN works, which is used to determine when a pending job can start. I don't expect that it changed the circumstances (cluster states) under which jobs are actually started. (Do you have any indication/reproducer showing that this commit introduced such a change in behavior?)

>If prologue succeeds on EVERY node, the user code will start; if prologue fails on a single node, execution of user code will be skipped. Then epilogue will be executed on each node and status will go to "COMPLETING" aka "CG".
Correct (since you have PrologFlags=alloc).

From the moment the prolog starts, the job is in the R/RUNNING state, but the user code won't be spawned until all prologs have completed.

>"CompleteWait" seconds after epilogue begins the node is considered free for next job, even if epilogue could still be running on some or even all nodes.
While the Epilog is running, the job is in the CG state and its nodes are in the COMPLETING state too. The COMPLETING state is cleared individually for every node, so once the epilog completes on node A, that node is considered for job scheduling immediately, unless CompleteWait is set. If CompleteWait is set, no job is scheduled while any job has been in the CG state for less than the configured CompleteWait time.

However, unless PrologFlags=Serial[2] is set, there is no guarantee that you won't get a prolog and an epilog running at the same time (or even multiple prologs and epilogs) on the same host. Think about a job getting resources allocated just before another job completes, or multiple jobs starting/ending at the same time.
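For reference, a minimal slurm.conf sketch combining the two settings discussed above; the values shown are illustrative examples, not recommendations:

  # Run the Prolog at allocation time and serialize prolog/epilog execution per slurmd
  PrologFlags=Alloc,Serial
  # Do not schedule new jobs while any job has been COMPLETING for less than 60 seconds
  CompleteWait=60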

In Slurm 21.08 we've added a node_features/helpers plugin, maybe it's something you should consider? You can find more info in Bug 9567.

I hope that helps, please let me know your thoughts.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/0666db61ca5ca45214ab2ad659b2854761b1fd7b
[2]https://slurm.schedmd.com/slurm.conf.html#OPT_PrologFlags
Comment 5 Michael Hebenstreit 2021-09-10 07:46:41 MDT
Our jobs are always exclusive, so there should not be an overlap.

Can I set PrologFlags=alloc,serial?
Comment 6 Marcin Stolarek 2021-09-10 11:52:45 MDT
>Our jobs are always exclusive, so there should not be an overlap.
If that's the goal, why don't you configure partitions with OverSubscribe=EXCLUSIVE[1]?
I'd recommend that, since with the config you shared it really depends on the job specification.

>PrologFlags=alloc,serial
Yes, you can, but with jobs being allocated exclusively it doesn't add much - it simply puts the code running the prolog and epilog under a global slurmd mutex[2,3]. At the same time, with your config the performance penalty shouldn't be noticeable.
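For illustration, a partition definition enforcing whole-node allocation could look roughly like this in slurm.conf (the partition name, node names, and GRES line are hypothetical):

  NodeName=node[001-128] Gres=gpu:4 State=UNKNOWN
  PartitionName=batch Nodes=node[001-128] OverSubscribe=EXCLUSIVE Default=YES State=UP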

cheers,
Marcin
[1]https://slurm.schedmd.com/slurm.conf.html#OPT_OverSubscribe
[2]https://github.com/SchedMD/slurm/blob/560a2eb2524f8d511425c135e8b83f3032707ff6/src/slurmd/slurmd/req.c#L5601-L5603
[3]https://github.com/SchedMD/slurm/blob/560a2eb2524f8d511425c135e8b83f3032707ff6/src/slurmd/slurmd/req.c#L5646-L5648
Comment 7 Michael Hebenstreit 2021-09-11 08:21:34 MDT
Let me correct my statement: 99.9% of our jobs are exclusive; there is a small number of special jobs. We solved that by ensuring --exclusive is part of every job submission instead of configuring it at the partition level.
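For illustration, the flag referred to here can be passed on the command line or embedded in the batch script (job.sh is a placeholder name):

  # on the command line:
  sbatch --exclusive job.sh
  # or inside the batch script itself:
  #SBATCH --exclusive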
Comment 8 Marcin Stolarek 2021-09-14 01:40:40 MDT
Michael,

Did I answer your initial questions?

>Let me correct my statement: 99.9% of our jobs are exclusive; there is a small number of special jobs. We solved that by ensuring --exclusive is part of every job submission instead of configuring it at the partition level.
I can't comment much on that; with "serial" in PrologFlags, the prolog and epilog scripts won't run in parallel, but obviously it's possible that the sequence won't be such that every prolog is followed by an epilog call. Just pointing that out - I don't know whether that may be an issue for you.

cheers,
Marcin
Comment 9 Michael Hebenstreit 2021-09-14 16:44:58 MDT
yes, you can close that ticket

I would recommend, though, that you add an option to slurm.conf that enforces strictly serial behaviour - i.e., as long as any epilogue is running, NONE of the nodes in a job should be available for a new job.
Comment 10 Marcin Stolarek 2021-09-15 02:06:18 MDT
>I would recommend, though, that you add an option to slurm.conf that enforces strictly serial behaviour - i.e., as long as any epilogue is running, NONE of the nodes in a job should be available for a new job.

I can talk about that with our senior developers, but... why would you like such a behavior? I mean, why should resources on a node be blocked if there is no job/post-job activity on it?

cheers,
Marcin
Comment 11 Michael Hebenstreit 2021-09-15 07:00:06 MDT
Your idea of the prologue/epilogue running for only a few seconds is not applicable to all clusters. This is often related to testing and reconfiguring nodes at the end of a job. Multi-node tests are often executed from a single node in the job, and even if the epilogue on some nodes might be complete, from a cluster administration point of view the node still counts as busy.

Within Intel we have at least 2 other clusters besides Endeavour that suffer from the current behaviour, and we are forced to implement workarounds.
Comment 12 Marcin Stolarek 2021-09-16 02:50:29 MDT
>Multi-node tests are often executed from a single node in the job,
As I understand it, this happens when the job is spawned by a mechanism other than srun? I'd suggest checking whether you can change this[1] - for instance, a number of MPI implementations offer a variety of ways to spawn tasks[1].
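As a rough sketch of the srun-based alternative mentioned above (./mpi_app is a placeholder application name, and the MPI library is assumed to be built with Slurm/PMI support):

  #!/bin/bash
  #SBATCH --nodes=4
  #SBATCH --exclusive
  # Launch the MPI ranks through srun so Slurm starts and tracks the tasks on
  # every node, instead of mpirun spawning them over ssh from a single node.
  srun ./mpi_app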

> and even if the epilogue on some nodes might be complete, from a cluster administration point of view the node still counts as busy.
Could you please elaborate on that a little? I'm trying to gather as much information as I can so we can discuss this internally (at SchedMD).

cheers,
Marcin

[1]https://slurm.schedmd.com/mpi_guide.html
[2]https://slurm.schedmd.com/pam_slurm_adopt.html
Comment 13 Michael Hebenstreit 2021-09-16 07:27:58 MDT
I'm talking specifically about the prologue/epilogue here. On Endeavour the prologue/epilogue is executed and controlled from a single node - the prologue/epilogue scripts running on all but the headnode of a job are dummies. This is done for 2 reasons:
a) we need to collect all information into a single result file
b) under certain conditions we need to run a final MPI benchmark validating that the nodes work correctly together before releasing them back into the cluster for the next job. That benchmark can only be run after the single-node tests have completed.
Comment 14 Marcin Stolarek 2021-09-17 03:53:39 MDT
>a) we need to collect all information into a single result file
This sounds like something that should really be part of the job, or maybe SrunEpilog? The Epilog runs as the user executing slurmd, and in any regular HPC environment it should not operate on job results, since that is a natural security risk - the output could potentially inject malicious code into a script executed as root.

>b) under certain conditions we need to run a final MPI benchmark validating that the nodes work correctly together before releasing them back into the cluster for the next job. That benchmark can only be run after the single-node tests have completed.
That's an interesting point. I'd say that with today's Slurm infrastructure you should drain the nodes in the epilog in such circumstances and then, as an admin, take some node recovery/verification steps.
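A minimal sketch of the drain-in-epilog approach suggested here, assuming the Epilog runs as root and SLURMD_NODENAME is available in its environment; run_node_checks is a hypothetical site-specific validation script:

  #!/bin/bash
  # Part of an Epilog: drain the node if post-job validation fails, so an
  # admin can recover/verify it before it accepts new work.
  if ! run_node_checks; then
      scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN \
          Reason="epilog: node validation failed, needs admin attention"
  fi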

Could you please explain further when that happens? I'd like to fully understand the process, to check whether it may be of interest to other sites and whether we can achieve it with the current code.

cheers,
Marcin
Comment 15 Michael Hebenstreit 2021-09-17 09:32:37 MDT
slurmd runs as root, so we are using slurmd_prologue and slurmd_epilogue. Actually, the same scripts are used for the prologue and epilogue, because most of the work is identical.

slurmd executes a shell wrapper script, wrapper.XXXlog.sh:
  wrapper.XXXlog.sh takes $SLURM_NODELIST and sorts it to define a headnode
  for the epilogue, on all nodes except the headnode wrapper.XXXlog.sh goes to sleep
  for the prologue, on all nodes except the headnode wrapper.XXXlog.sh exits
  the headnode executes the real prologue/epilogue script, XXXlogue.slurm.py
  for the epilogue - once XXXlogue.slurm.py is complete, wrapper.XXXlog.sh kills the sleep on all nodes
  for the prologue - the output from XXXlogue.slurm.py is copied to the current directory and ownership is set to the job owner
  another copy of the output is kept on a central NFS file share for debugging
  wrapper.XXXlog.sh exits on all nodes
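A highly simplified sketch of the wrapper flow described above (the names wrapper.XXXlog.sh and XXXlogue.slurm.py come from this comment; the install path, the way the prolog/epilog role is detected, and the use of pdsh are assumptions, not the site's actual implementation):

  #!/bin/bash
  # wrapper.XXXlog.sh - simplified illustration only.
  # How the script distinguishes prolog from epilog is site-specific; a
  # hypothetical check on the script's own name is used here.
  case "$0" in
      *prolog*) mode=prolog ;;
      *)        mode=epilog ;;
  esac

  # Elect the first host of the sorted node list as the job's headnode.
  headnode=$(scontrol show hostnames "$SLURM_NODELIST" | sort | head -n 1)

  if [ "$(hostname -s)" != "$headnode" ]; then
      # Non-head nodes: the epilog sleeps until the headnode releases it,
      # the prolog simply exits.
      [ "$mode" = "epilog" ] && sleep infinity
      exit 0
  fi

  # Headnode: run the real prolog/epilog work for the whole job
  # (/opt/site is an assumed install path).
  /opt/site/XXXlogue.slurm.py "$mode"

  if [ "$mode" = "epilog" ]; then
      # Release the other nodes by terminating their sleep
      # (pdsh as the remote-execution tool is an assumption).
      pdsh -w "$SLURM_NODELIST" "pkill -f 'sleep infinity'" || true
  fi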

XXXlogue.slurm.py takes a minimum of 30s to run. It performs about 30 different tests on the nodes. These tests run every time, as our hardware is unstable and we allow users some configuration options that could impact performance for the next job. One of those tests is a short MPI-based test to ensure functionality of the complete stack across several nodes. It's also critical that we have the output of those tests for later debugging; having it run from a single node makes all of that far easier.
Comment 18 Marcin Stolarek 2021-09-27 07:09:56 MDT
Michael,

Referring to point b) from my comment 14:

We had an internal discussion about the idea of keeping all of a job's nodes in CG while any job on them is still in CG. Our conclusion was that we don't see this as a feature that would be used by other sites.

If you want to keep all the nodes in CG as long as the batch host is in CG, you'd need to prevent the epilogues on those nodes from completing until the "main" epilog completes.

From a Slurm design perspective, prolog/epilog script activity should only affect the nodes where the script runs.

Do you have any remaining questions regarding the case?

cheers,
Marcin
Comment 19 Michael Hebenstreit 2021-09-27 13:02:07 MDT
no, you can close that