Summary: | Partition QOS with DenyOnLimit blocking multiple partition jobs when one partition is valid | |
---|---|---|---|
Product: | Slurm | Reporter: | Trey Dockendorf <tdockendorf> |
Component: | Scheduling | Assignee: | Marshall Garey <marshall> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 3 - Medium Impact | |
Priority: | --- | CC: | troy
Version: | 20.02.4 | |
Hardware: | Linux | |
OS: | Linux | |
Site: | Ohio State OSC | |
Attachments: | slurm.conf |
Created attachment 15819 [details]: slurm.conf

Trey Dockendorf (reporter):

We submit GPU jobs to multiple partitions via a job_submit filter so users don't have to know which GPU partition to choose. We have a partition QOS on each GPU partition to try to avoid people getting jobs that would not work on a given node type. Right now the node types are dual-GPU with 48 cores and quad-GPU with 48 cores.

If I remove DenyOnLimit, the job works even in the case that fails below: the job is correctly started on the only partition that can satisfy the request, gpuserial-quad. I would expect EnforcePartLimits=ANY to mean that if DenyOnLimit blocks one partition, the other valid partition is still used.

I have these QOSes:

# sacctmgr show qos format=Name,Flags,MaxTRESPerJob,MaxTRESPerNode,MinTresPerJob --parsable
Name|Flags|MaxTRES|MaxTRESPerNode|MinTRES|
pitzer-gpuserial-partition|DenyOnLimit|gres/gpu=2||gres/gpu=1|
pitzer-gpu-quad-partition|DenyOnLimit||gres/gpu=4|gres/gpu=3|

$ sbatch --gpus-per-node=4 -p gpuserial-48core,gpuserial-quad --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: QOSMaxGRESPerJob
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
$ sbatch --gpus-per-node=4 -p gpuserial-quad --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 27781

Trey Dockendorf (follow-up):

It appears the order matters. If I put gpuserial-quad first in the list, the job is accepted:

$ sbatch --gpus-per-node=4 -p gpuserial-quad,gpuserial-48core --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 27819

The same issue occurs if I reduce to --gpus-per-node=2 and try to submit to gpuserial-48core:

$ sbatch --gpus-per-node=2 -p gpuserial-quad,gpuserial-48core --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: QOSMinGRES
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
$ sbatch --gpus-per-node=2 -p gpuserial-48core,gpuserial-quad --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 27818

Marshall Garey (assignee):

Hi Trey, I'm looking into this. I think it might be a duplicate of another bug. I'll check on that and get back to you.

Marshall Garey (assignee):

Trey, I've confirmed that this is indeed a duplicate of bug 7375. I see you've already commented on that bug, so you're already aware of it. I hope to get that bug fixed by the release of 20.11, and though I can't guarantee it will go into 20.02, I could give you a patch to test when it's ready. Let me know if you have any more questions. For now I'm closing this as a duplicate of bug 7375.

*** This ticket has been marked as a duplicate of ticket 7375 ***
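For reference, here is a minimal sketch of how the QOS side of the setup described in the original report could be recreated on a test system. The QOS names and limits are taken from the sacctmgr output above; the exact sacctmgr option spellings and the slurm.conf partition lines are assumptions (the attached slurm.conf is not reproduced here), so verify them against sacctmgr(1) and slurm.conf(5) rather than treating this as the site's actual configuration.

# Create the two partition QOSes with the limits shown in the report
# (option names assumed; check sacctmgr(1) on your version)
sacctmgr add qos pitzer-gpuserial-partition
sacctmgr modify qos pitzer-gpuserial-partition set Flags=DenyOnLimit MaxTRESPerJob=gres/gpu=2 MinTRESPerJob=gres/gpu=1
sacctmgr add qos pitzer-gpu-quad-partition
sacctmgr modify qos pitzer-gpu-quad-partition set Flags=DenyOnLimit MaxTRESPerNode=gres/gpu=4 MinTRESPerJob=gres/gpu=3

# Hypothetical slurm.conf excerpt: each GPU partition carries its partition QOS,
# and EnforcePartLimits=ANY is what the report expects to let a multi-partition
# job proceed when at least one requested partition accepts it (node lists are
# placeholders, not the site's real values):
#   EnforcePartLimits=ANY
#   PartitionName=gpuserial-48core QOS=pitzer-gpuserial-partition Nodes=...
#   PartitionName=gpuserial-quad   QOS=pitzer-gpu-quad-partition  Nodes=...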
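Similarly, a few commands that could be used on a running cluster to confirm how partition limits are being enforced and which QOS each partition carries. The partition and QOS names come from the report; the grep patterns assume the usual scontrol output field names.

scontrol show config | grep -i EnforcePartLimits
scontrol show partition gpuserial-48core | grep -io 'QoS=[^ ]*'
scontrol show partition gpuserial-quad | grep -io 'QoS=[^ ]*'
sacctmgr show qos pitzer-gpuserial-partition,pitzer-gpu-quad-partition format=Name,Flags,MaxTRESPerJob,MaxTRESPerNode,MinTresPerJob --parsable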