Ticket 9788 - Partition QOS with DenyOnLimit blocking multiple partition jobs when one partition is valid
Summary: Partition QOS with DenyOnLimit blocking multiple partition jobs when one partition is valid
Status: RESOLVED DUPLICATE of ticket 7375
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-09-09 13:29 MDT by Trey Dockendorf
Modified: 2020-10-06 17:06 MDT

See Also:
Site: Ohio State OSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (181.26 KB, text/plain)
2020-09-09 13:29 MDT, Trey Dockendorf

Description Trey Dockendorf 2020-09-09 13:29:44 MDT
Created attachment 15819
slurm.conf

We submit GPU jobs to multiple partitions via a job submit filter so users don't have to know which GPU partition to choose. We have a partition QOS on each GPU partition to try to avoid jobs landing on a node type that cannot satisfy them. Right now the node types are dual-GPU with 48 cores and quad-GPU with 48 cores. The partition QOS settings are shown in the sacctmgr output below.
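
For reference, the partition-to-QOS wiring this implies would look roughly like the following in slurm.conf. This is only a sketch, not the actual attached config: the partition and QOS names come from the ticket, while the node lists and any other options are placeholders.

# Sketch only - see the attached slurm.conf for the real definitions.
# gpuserial-48core holds the dual-GPU 48-core nodes, gpuserial-quad the quad-GPU nodes.
PartitionName=gpuserial-48core QOS=pitzer-gpuserial-partition Nodes=<dual-gpu nodes> ...
PartitionName=gpuserial-quad QOS=pitzer-gpu-quad-partition Nodes=<quad-gpu nodes> ...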

If I remove DenyOnLimit, the job works even in the case that fails below: the job is correctly started on the only partition that can satisfy the request, gpuserial-quad.

I would expect EnforcePartLimits=ANY to mean that if DenyOnLimit blocked one partition, the other, valid partition would still be used.
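
The setting in question is the cluster-wide slurm.conf option below (assumed from the description; the attached slurm.conf is authoritative), and the running value can be confirmed with scontrol:

EnforcePartLimits=ANY   # cluster-wide option in slurm.conf

$ scontrol show config | grep EnforcePartLimits   # should report ANY for this setup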


# sacctmgr show qos format=Name,Flags,MaxTRESPerJob,MaxTRESPerNode,MinTresPerJob --parsable
Name|Flags|MaxTRES|MaxTRESPerNode|MinTRES|
pitzer-gpuserial-partition|DenyOnLimit|gres/gpu=2||gres/gpu=1|
pitzer-gpu-quad-partition|DenyOnLimit||gres/gpu=4|gres/gpu=3|
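
For completeness, limits like these are normally set through sacctmgr along the lines of the commands below. This is a sketch of equivalent commands, not the exact command history used at the site.

$ sacctmgr modify qos pitzer-gpuserial-partition set Flags=DenyOnLimit MaxTRESPerJob=gres/gpu=2 MinTRESPerJob=gres/gpu=1   # sketch, not actual history
$ sacctmgr modify qos pitzer-gpu-quad-partition set Flags=DenyOnLimit MaxTRESPerNode=gres/gpu=4 MinTRESPerJob=gres/gpu=3   # sketch, not actual history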

$ sbatch --gpus-per-node=4 -p gpuserial-48core,gpuserial-quad --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: QOSMaxGRESPerJob
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

$ sbatch --gpus-per-node=4 -p gpuserial-quad --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 27781
Comment 1 Trey Dockendorf 2020-09-09 13:32:20 MDT
It appears the order matters: if I put gpuserial-quad first in the list, the job is accepted:

$ sbatch --gpus-per-node=4 -p gpuserial-quad,gpuserial-48core --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 27819

The same issue occurs if I reduce to --gpus-per-node=2 and try to submit to gpuserial-48core:

$ sbatch --gpus-per-node=2 -p gpuserial-quad,gpuserial-48core --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: QOSMinGRES
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

$ sbatch --gpus-per-node=2 -p gpuserial-48core,gpuserial-quad --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 27818
Comment 3 Marshall Garey 2020-09-09 16:57:38 MDT
Hi Trey, I'm looking into this. I think it might be a duplicate of another bug. I'll check on that and get back to you.
Comment 4 Marshall Garey 2020-10-06 17:06:31 MDT
Trey,

I've confirmed that this is indeed a duplicate of bug 7375. I see you've already commented on that bug, so you're already aware of it. I hope to get that bug fixed by the release of 20.11, and though I can't guarantee the fix will go into 20.02, I could give you a patch to test when it's ready.

Let me know if you have any more questions. For now I'm closing this as a duplicate of bug 7375.

*** This ticket has been marked as a duplicate of ticket 7375 ***