Ticket 5656 - Heterogeneous job components reserving individually more than available resources
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 18.08.0
Hardware: Linux
Severity: 5 - Enhancement
Assignee: Unassigned Developer
 
Reported: 2018-08-31 09:36 MDT by Alejandro Sanchez
Modified: 2019-08-09 07:01 MDT

Site: SchedMD


Description Alejandro Sanchez 2018-08-31 09:36:16 MDT
This was detected in bug 5579 as a related but separate issue. The core of the problem is that the components of a single heterogeneous job (hetjob) can each request up to the total amount of available resources. For example, on a 10-node cluster, if we submit a hetjob like -N10 : -N1, the first component will reserve all available nodes but will never start, because the second component cannot reserve any resources, and the whole hetjob cannot start until every component has a start time and is runnable. This leaves the queue in a blocked state and currently requires manual intervention:

alex@ibiza:~/t$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
p1*          up   infinite     10   idle compute[1-10]
alex@ibiza:~/t$ sbatch -N10 : -N1 --wrap "sleep 99999"
Submitted batch job 20028
alex@ibiza:~/t$ squeue --start
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
           20028+0        p1     wrap     alex PD 2018-08-31T17:30:08     10 compute[1-10]        (None)
           20028+1        p1     wrap     alex PD 2018-09-01T17:30:00      1 compute1             (None)
alex@ibiza:~/t$ sbatch --wrap "sleep 9999"
Submitted batch job 20030
alex@ibiza:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           20028+1        p1     wrap     alex PD       0:00      1 (None)
             20030        p1     wrap     alex PD       0:00      1 (Priority)
           20028+0        p1     wrap     alex PD       0:00     10 (None)
alex@ibiza:~/t$
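The transcript above boils down to a simple arithmetic check that Slurm does not currently perform: the components' node requests, summed, exceed the cluster size, so no simultaneous reservation can ever exist. A minimal sketch of that check (hypothetical helper, not Slurm code):

```python
# Hypothetical sketch, not Slurm's implementation: a submit-time sanity
# check that flags a hetjob whose components together request more nodes
# than the cluster has, and so can never start.
def hetjob_fits(component_node_counts, total_nodes):
    """Return True if all components could hold node reservations at once."""
    return sum(component_node_counts) <= total_nodes

# The -N10 : -N1 example from above on a 10-node cluster:
print(hetjob_fits([10, 1], 10))  # -> False: the hetjob deadlocks the queue
print(hetjob_fits([9, 1], 10))   # -> True: both components can coexist
```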
Comment 1 Moe Jette 2018-08-31 09:47:42 MDT
(In reply to Alejandro Sanchez from comment #0)
> This was detected in bug 5579 as a related but separate issue. Core of the
> problem is heterogeneous job components belonging to a single hetjob can
> each of them request up to the total amount of available resources.

Note that "available resources" here can include not just nodes on the cluster, but
1. Resources in a specific partition (when multiple hetjob components are submitted to the same partition)
2. GRES when specified as a total count for the job (e.g. two hetjob components specify "--gpus=16" which exceeds the total available on the partition)
3. Licenses
4. Burst buffer space
5. Global limits (total node count for a user, QOS, etc.)

This will be difficult to detect, at least for most of these cases. Ideally this could be detected at submit time, so that we could reject the job. Otherwise, holding the job is probably the best option.
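A submit-time check along the lines described above would need to aggregate each component's request per resource type and compare against the corresponding totals. A hedged sketch, with illustrative resource names only (the partition/GRES/license accounting in Slurm is considerably more involved):

```python
# Hypothetical sketch, not Slurm's implementation: sum each hetjob
# component's per-resource request and report which resource types
# can never be satisfied simultaneously. Resource names ("gpus",
# "nodes") and the flat dict shape are illustrative assumptions.
def validate_hetjob(components, totals):
    """Return resource types whose combined request exceeds the total."""
    combined = {}
    for comp in components:
        for res, count in comp.items():
            combined[res] = combined.get(res, 0) + count
    return [res for res, need in combined.items()
            if need > totals.get(res, 0)]

# Case 2 above: two components each requesting --gpus=16 on a
# partition with only 16 GPUs in total.
over = validate_hetjob([{"gpus": 16}, {"gpus": 16}],
                       {"gpus": 16, "nodes": 10})
print(over)  # -> ['gpus']: reject or hold the hetjob at submit time
```

Global limits (case 5) are harder to sketch this way, since per-user and QOS limits depend on jobs already running, which is presumably why holding rather than rejecting may be the only safe fallback.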