Ticket 6810

Summary: Steer job to partition based on size
Product: Slurm    Reporter: Matt Ezell <ezellma>
Component: Other    Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN    QA Contact: ---
Severity: 5 - Enhancement    Priority: ---
Version: 18.08.6    Hardware: Linux    OS: Linux
Site: NOAA    NOAA Site: ORNL

Description Matt Ezell 2019-04-08 06:39:57 MDT
I am trying to determine the best way to steer jobs to a partition based on their size.  The desire is to have jobs that use over 20% of the machine go to a 'novel' partition that is only (manually) activated at certain times.  I set MaxNodes on the 'batch' partition (which also has Default=true) and MinNodes on the 'novel' partition.  If I request a job with more than 20% of the machine but do not specify a partition, it goes to the default (batch) but is blocked.

I started writing some logic in a Lua job_submit script, but it's not obvious to me how to calculate how many nodes a job might use.  If the user specified a node count, obviously that can be used as-is.  But if the user specifies some combination of ntasks, cpus-per-task, sockets-per-node, cores-per-socket, and threads-per-core, I would have to calculate how I think Slurm might lay out the job.  Luckily our cluster is homogeneous and we allocate nodes exclusively; otherwise I could see this getting impossibly complicated.

Is there a more straightforward way of doing this?
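For reference, a minimal sketch of the kind of job_submit.lua filter described above. All constants and the 20% threshold are illustrative, and the field names and NO_VAL sentinels should be checked against the job_submit/lua plugin for your Slurm version; it assumes a homogeneous cluster with exclusive node allocation:

```lua
-- Sketch only: steer large jobs to the 'novel' partition.
-- CPUS_PER_NODE and TOTAL_NODES are assumed values, not from this ticket.
local CPUS_PER_NODE = 32
local TOTAL_NODES   = 100
local THRESHOLD     = math.ceil(TOTAL_NODES * 0.20)

function slurm_job_submit(job_desc, part_list, submit_uid)
    local nodes = 0
    if job_desc.min_nodes ~= slurm.NO_VAL then
        -- User gave an explicit node count; use it as-is.
        nodes = job_desc.min_nodes
    elseif job_desc.num_tasks ~= slurm.NO_VAL then
        -- Estimate from the task layout (exclusive, homogeneous nodes).
        local cpt = 1
        if job_desc.cpus_per_task ~= slurm.NO_VAL16 then
            cpt = job_desc.cpus_per_task
        end
        nodes = math.ceil(job_desc.num_tasks * cpt / CPUS_PER_NODE)
    end
    -- Only steer jobs that did not request a partition explicitly.
    if nodes > THRESHOLD and job_desc.partition == nil then
        job_desc.partition = "novel"
    end
    return slurm.SUCCESS
end
```

As the rest of this ticket discusses, this estimate is only a guess at the layout Slurm will actually choose, which is what motivates the question below.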
Comment 3 Chad Vizino 2019-04-10 11:00:58 MDT
Unless it's explicitly provided, it's not possible to determine (from Slurm) a job's node count at submission time since it will be calculated later. You might be able to write something that would move the job to a different partition later after the node calculation has been made, though.
Comment 4 Chad Vizino 2019-04-12 13:34:05 MDT
Take a look at the partition spec "Alternate" in slurm.conf:

https://slurm.schedmd.com/slurm.conf.html

You may be able to use it with your novel partition together with a job filter. You will still need to do your own size calculation to figure out whether a job is large, but this spec may help with steering it to a different partition.
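A sketch of how the two partitions might be declared (node names and limits are invented for illustration). Per the slurm.conf documentation, jobs submitted to a partition whose state is INACTIVE or DRAIN are placed in its Alternate partition instead:

```
# slurm.conf fragment -- illustrative values only
PartitionName=batch Nodes=node[001-100] Default=YES MaxNodes=20 State=UP
PartitionName=novel Nodes=node[001-100] MinNodes=21 State=INACTIVE Alternate=batch
```

With this layout, a job filter could route large jobs to 'novel'; while 'novel' is inactive they would fall through to 'batch', and manually setting 'novel' to UP at the designated times would let them run there.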
Comment 5 Matt Ezell 2019-04-17 19:08:23 MDT
Thanks Chad.  I feared that would be the answer, but it makes sense.  I'm going to close this.
Comment 6 Matt Ezell 2019-07-22 12:55:31 MDT
I would like to reopen this, possibly as an RFE, to understand if it is feasible or not.

On heterogeneous clusters, the number of nodes a job needs can depend on which nodes it is scheduled to.  Would it make sense to calculate the minimum and maximum node counts based on the resources available in the cluster?
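As a sketch of what such a bound could look like (the node inventory is invented, and exclusive CPU-only allocation is assumed): the minimum node count comes from packing the request onto the largest nodes first, and the maximum from packing onto the smallest first.

```lua
-- Illustrative only: bound the node count for a CPU request on a
-- heterogeneous cluster. Inventory values are assumed, not real.
local node_types = {
    { cpus = 128, count = 32 },   -- "fat" nodes (assumed)
    { cpus = 32,  count = 256 },  -- "thin" nodes (assumed)
}

-- Greedily place cpus_needed CPUs on node types in the given order and
-- return how many whole nodes that consumes (exclusive allocation).
local function nodes_used(types, cpus_needed)
    local nodes = 0
    for _, t in ipairs(types) do
        local take = math.min(t.count, math.ceil(cpus_needed / t.cpus))
        nodes = nodes + take
        cpus_needed = cpus_needed - take * t.cpus
        if cpus_needed <= 0 then return nodes end
    end
    return nil  -- request does not fit on the cluster
end

-- Minimum bound: fattest nodes first; maximum bound: thinnest first.
table.sort(node_types, function(a, b) return a.cpus > b.cpus end)
local min_nodes = nodes_used(node_types, 4096)
table.sort(node_types, function(a, b) return a.cpus < b.cpus end)
local max_nodes = nodes_used(node_types, 4096)
print(min_nodes, max_nodes)  -- 32 and 128 for the assumed inventory
```

The gap between the two bounds shows why a single node count cannot be known at submission time; a filter could still compare the minimum bound against a threshold as a conservative steering test.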
Comment 7 Jason Booth 2019-09-03 15:34:49 MDT
Matt - I do not think we would want to do this at job submission, because it would introduce a "smaller scheduler" to make that determination. However, we will look into this some and let you know.