Created attachment 21571 [details]
slurm.conf

Yesterday I discovered that SLURM does not process new jobs for scheduling while other jobs are in CG state because the epilogue on them is still running. sched.log shows entries like:

sched: [2021-09-30T16:50:31.364] schedule() returning, some job is still completing

Is this on purpose, a bug, or a configuration issue?
Michael,

This is related to CompleteWait, which you have set to 1200.

https://slurm.schedmd.com/slurm.conf.html#OPT_CompleteWait

-Scott
But why is SLURM waiting on nodes in CG state? Why does it not simply ignore those nodes and consider them busy until they are really free?
Michael,

CG stands for COMPLETING.

> Why does it not simply ignore those nodes and consider them busy until they are really free?

That is what it does by default. While a node is in the COMPLETING state, Slurm cannot allocate its resources to other jobs; the job must be fully completed first. CompleteWait holds all scheduling so that new jobs can have a blank slate to allocate resources from. (For example, if a job is completing and the epilog finishes on nodes 3 and 7 first, the next job would grab them instead of nodes 0 and 1, which are closer together.)

https://slurm.schedmd.com/slurm.conf.html#OPT_CompleteWait

If you want your jobs to start as fast as possible, I recommend turning CompleteWait off, which is the default.

Let me know if you have questions about this.

-Scott
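For illustration, a minimal slurm.conf sketch (the 1200-second value matches the attached config; 0 is the documented default):

    # current setting: hold scheduling for up to 20 minutes while any job is still completing
    CompleteWait=1200

    # default behaviour: do not hold scheduling while jobs complete
    #CompleteWait=0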
Thanks for pointing that out - but that behavior is not really logical. If I have 100 nodes, 20 busy with jobs and 15 busy with jobs in completing state, why does scheduling not do anything with the remaining 65? Is there no other way to prevent jobs from starting on a node in CG state than to block scheduling on ALL nodes?
Michael,

This flag will limit the wait to nodes in the same partitions as the completing job (or overlapping partitions):

https://slurm.schedmd.com/slurm.conf.html#OPT_reduce_completing_frag

What is your use case? Why do you need CompleteWait?

-Scott
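For illustration, the flag is one of the comma-separated values of SchedulerParameters in slurm.conf (a sketch; any parameters already set there would need to be kept in the same list):

    SchedulerParameters=reduce_completing_frag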
Our epilogue takes up to 15 minutes and must not overlap with the prologue of a newly starting job.
Michael,

As I said before, jobs will not be allocated to a node with an epilog running. This is default behavior and is not associated with CompleteWait.

Is there a reason you also need the CompleteWait feature as I described it, or as specified in the documentation? Unless there is, I would suggest turning the feature off by setting it to 0 (the default).

-Scott
And I said before that I have had instances of prologues starting while the epilogue was still running. I need to prevent that. Unfortunately, SLURM's design philosophy is at odds with a cluster that is built on unreliable hardware, needs 15-minute prologues/epilogues, and has all nodes in a job wait until the prologue/epilogue on the head node is complete. CompleteWait seemed to be an option, but it is completely at odds with throughput. Originally I understood it would only affect scheduling to the nodes of a single completing job. Since that is not the case, I will need to find an alternative solution - any ideas?
Michael,

Yes, I see we recommended CompleteWait in bug 12445; sorry for the misunderstanding. CompleteWait was designed for a different use case and has significant drawbacks in high-throughput environments, especially with the long wait times you need. I don't currently have a good solution for your use case. However, if you keep CompleteWait for now, I believe the "reduce_completing_frag" flag will reduce the effect to just the partitions with the completing job.

https://slurm.schedmd.com/slurm.conf.html#OPT_reduce_completing_frag

-Scott
That does not help. Most of the nodes are in multiple partitions, both for different priorities and for convenience (users should find all nodes they can access in one queue - some queues have access to additional nodes, but many nodes are shared between queues). What I need is a way to prevent nodes from being scheduled into a new job while the epilogue is still running. Should I take the nodes offline at the start of the epilogue?
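For illustration, a rough sketch of that idea (untested; assumes the epilogue runs on the compute node with SLURMD_NODENAME set and sufficient privileges to run scontrol, and that the actual 15-minute teardown lives in a separate, hypothetical script):

    #!/bin/bash
    # Drain the node before the real teardown starts so it cannot be handed
    # to a new job, then resume it once teardown has finished.
    scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN Reason="epilogue in progress"
    /path/to/real_epilogue_work.sh   # hypothetical placeholder for the actual teardown
    scontrol update NodeName="$SLURMD_NODENAME" State=RESUME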
Michael, What are you doing in your prolog and epilog? Are you messing with the node configuration or power state? May I see the scripts? -Scott
Michael,

It seems you have quite a complicated situation. Could you give us a list of requirements regarding your job setup and teardown? Perhaps with that we can think about which features would work best for you.

-Scott
The modules are for both checking and configuring items on the OS that are not available without root permissions. All in all it is about 3k lines of Python and shell code. I'm willing to share it under NDA, but not willing to put it on this open bug report.

check_downclock nologin clean_turbostat set_all_cpus userid watchdog cpufreq nm_and_chassis fix_permissions memkind clean_dev_shm getmeminfo logger set_numa_balancing getcfsinfo msr_safe sharpd rsyslog test_swap clean_sep_files tmp_free_size rogue_test hugepages ntpd fuse intel_gpu nvideacard thp cpuonoff fix_truescale dmidecode read_mce config_ofed dev_shm_huge check_mounts config_lustre
Trying to put everything together:

- long prologue/epilogue runtimes
- nodes can be reconfigured and rebooted during prologue/epilogue (changing BIOS options, booting a different kernel or different kernel options, ...)
- prologue/epilogue run from the head node only - currently the epilogues on the nodes are simple "sleep" commands to mimic the behavior SLURM expects
- only a single instance of our special prologue/epilogue script is allowed to run
- strictly serial prologue/job/epilogue behaviour
- nodes must not be reused in a new job before the previous job, including its epilogue, is completely done
- SLURM must ignore changes in #cpus/cores during a run (due to Intel Speed Select Technology)
- low scheduling rate: 15k jobs per month
- job size 1-350 nodes
- minimum job runtime 10 min
- exclusive nodes only
- scheduling only to constraints
- no scheduling for GPUs
- no scheduling to specific cores/threads - actually, almost all SLURM scheduling options are not used
- access regulated via partitions - some nodes are only in specific queues, many nodes are in at least 3 partitions
Michael,

I am resolving this issue as invalid, and I will explain why below. I also want to point out that any future bug you open trying to have us support your rebooting solution will be closed as invalid. The method you are using to reboot nodes is not supportable due to the many issues we have previously mentioned, and your use of CompleteWait with this configuration is not something it was designed for.

As mentioned previously, you need to migrate to the node features helper. This plugin is designed to have nodes reboot, where prolog and epilog are not. We have an outstanding ticket to document the node_features plugin; however, we do support it today, and you should be using it instead of your current solution.

https://bugs.schedmd.com/show_bug.cgi?id=12331

You can find an example of how to configure this, until we document it on the website, here:

https://bugs.schedmd.com/attachment.cgi?id=20965&action=diff
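For illustration, a hypothetical sketch of what a helpers-based setup can look like (syntax follows the helpers.conf format of later Slurm releases and may differ from the pre-release example linked above; the feature names and helper path are made up):

    # slurm.conf
    NodeFeaturesPlugins=node_features/helpers

    # helpers.conf
    # kernel_a/kernel_b are made-up features; the helper program is expected to
    # report the currently active feature when run with no arguments and to
    # reconfigure/reboot the node when asked for a different one.
    Feature=kernel_a,kernel_b Helper=/usr/local/sbin/kernel_helper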
a) "You are not authorized to access bug #12331" b) the second link does not explain how users are supposed to request a feature c) even without reboots prologs/epilogues take 30+ seconds and must be run exclusively. That has nothing to do with reboots and you consider that usage invalid as well? d) from the description: NOTE that any feature under the control of a Helper cannot use the more complex specification language. If any of those more complex specifications are specified by a job using constraints with "[]()!*" then the job will be rejected. that makes the nodefeature plugin already barely usable for my cluster. Is there any plan to change that in the future?
Michael, I plan to send you an update later today. I am catching up from the long holiday weekend. I wanted to at least let you know that I did see this update and intend to reply.
> a) "You are not authorized to access bug #12331" This is an internal bug/task to fully document this and is not meant to be viewed generally. b) the second link does not explain how users are supposed to request a feature I apologize I should have pointed to the parent bug tracking this. https://bugs.schedmd.com/show_bug.cgi?id=9567 This builds off KNL where users would use the -C or --constraints option of Slurm's job submission commands: salloc, sbatch, and srun. > And that's all, a key-value format handled by custom external binaries is the gist of this proposal. > To match the KNL plugin there are also configuration options to tweak boot time and node reboot weight, > but it's straightforward. > The user-facing API is no different than using static constraints declared at the node level: > $ srun --constraint nps=1 ... c) even without reboots prologs/epilogues take 30+ seconds and must be run exclusively. That has nothing to do with reboots and you consider that usage invalid as well? This has been covered already by Marcin and Scott. https://bugs.schedmd.com/show_bug.cgi?id=12102 https://bugs.schedmd.com/show_bug.cgi?id=12588 We do not have any further interest in devoting time into the subject. The way in which you are using these features is just something Slurm was not designed to do. I would suggest looking into other ways to obtain exclusivity on that node in a way that Slurm would expect. This could be moving tasks to a dependent job that runs directly after the primary job, putting a reservation on nodes that last the duration of that job+epilog. As mentioned by Scott in 12588 CompleteWait 'holds all scheduling so that new jobs can have a blank slate to allocate resources from.' this would not scale well in a HTC environment. > d) from the description: NOTE that any feature under the control of a Helper cannot use the more complex specification language. > If any of those more complex specifications are specified by a job using constraints with "[]()!*" then the job will be rejected. > that makes the nodefeature plugin already barely usable for my cluster. Is there any plan to change that in the future? We currently do not have plans to expand this.
So your answer essentially is: SLURM cannot do what you need it to do. We also have no interest in developing the changes you'd need. Please leave us alone and go back to LSF. I won't bother you with these problems anymore, but you can't expect me to be happy about it.
Hi Michael,

> So your answer essentially is: SLURM cannot do what you need it to do. We also have no interest in developing the changes you'd need. Please leave us alone and go back to LSF.
> I won't bother you with these problems anymore, but you can't expect me to be happy about it.

I can understand the frustration here. Our goal is to have you move onto a platform that we can support, such as node features. Locking a node to block jobs while the epilogue runs presents several issues for Slurm (which we have discussed), and I just do not see a solution based on the epilogue that would be supportable. I am aware of some other mechanisms that might do this; however, they too rely on flipping node states, and I am not keen on proposing them since they introduce other issues.

--Previous comment--
> d) From the description: "NOTE that any feature under the control of a Helper cannot use the more complex specification language.
> If any of those more complex specifications are specified by a job using constraints with "[]()!*" then the job will be rejected."
> That makes the node_features plugin barely usable for my cluster as it stands. Is there any plan to change that in the future?

Even though there are no plans to expand this now, I would still suggest proposing use cases that would make it more viable for you. Looking to the future, there are tasks underway that may make things easier for both us and you. Although I do not have the specifics to share right now, we do mention this in the roadmap (page 37, "Truly Dynamic Nodes"):

https://slurm.schedmd.com/SLUG21/Roadmap.pdf
OK, please keep me informed once the situation changes.