| Summary: | A large number of jobs in QOSMinGRES state were blocking the backfill scheduler from looking at the rest of a partition. | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Geoff <geoffrey.ransom> |
| Component: | Scheduling | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 21.08.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Johns Hopkins University Applied Physics Laboratory | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 21.08.5, 22.05.pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Geoff 2021-11-30 12:19:01 MST
Hi Geoff,
Hi Geoff,
It sounds like the backfill scheduler was unable to reach the other queued jobs simply because of the number of jobs ahead of them, combined with your limits on the number of jobs that can be tested per partition and evaluated in total. One thing that would help in a scenario like the one you describe is an additional per-user limit on the number of jobs the backfill scheduler will evaluate, set with the bf_max_job_user SchedulerParameter [1]. Set it to a value lower than the bf_max_job_part limit so that the jobs from a single user can't consume the entire evaluation budget for the partition.
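As a concrete sketch (the numbers here are illustrative placeholders, not tuned recommendations for any particular site), the two limits would be combined in slurm.conf like this:

```
# Illustrative values only: allow backfill to evaluate up to 1000 jobs
# per partition, but no more than 100 from any single user, so one
# user's backlog can't consume the whole per-partition budget.
SchedulerParameters=bf_max_job_part=1000,bf_max_job_user=100
```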
You can also have jobs like this rejected at submit time by setting a flag on the QOS called DenyOnLimit. The behavior you describe makes it sound like this flag isn't set. Here's an example where I configured a MinTRES limit of 1 GPU; a job requesting that QOS is accepted at submit time, but it can't run because of the limit.
$ sacctmgr show qos gpu-test format=name,mintres,flags
Name MinTRES Flags
---------- ------------- --------------------
gpu-test gres/gpu=1
$ sbatch -n1 -qgpu-test --wrap='srun sleep 5'
Submitted batch job 2263
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2263 debug wrap ben PD 0:00 1 (QOSMinGRES)
If I add that flag to the QOS and try again you can see that the job is rejected at submit time.
$ sacctmgr show qos gpu-test format=name,mintres,flags
Name MinTRES Flags
---------- ------------- --------------------
gpu-test gres/gpu=1 DenyOnLimit
$ sbatch -n1 -qgpu-test --wrap='srun sleep 5'
sbatch: error: QOSMinGRES
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
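For reference, the flag can be added to an existing QOS with sacctmgr (using the gpu-test QOS name from the example above):

```
# Set DenyOnLimit on the QOS. Note that flags= replaces the current
# flag list, so include any existing flags you want to keep.
sacctmgr modify qos gpu-test set flags=DenyOnLimit
```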
Let me know if you already have this flag set and jobs aren't being rejected or if you have any other questions.
Thanks,
Ben
[1] https://slurm.schedmd.com/slurm.conf.html#OPT_bf_max_job_user
We set the per-partition limit because our CPU partition has been reaching a queue depth of 2-5 million jobs, and with the default values the backfill scheduler was spending all its time in that partition and never looking at the others.
We decided not to add the user limit because of a concern that lower-priority jobs from one user could be backfilled ahead of higher-priority jobs from another user that could also have been backfilled, and some people here are particularly sensitive about that. In this case, we may just have to live with this possible failure mode if we continue to want to avoid setting a user limit lower than the partition limit.
> Let me know if you already have this flag set
No, we do not have this flag set at this time. I will read up on the flags again and see about adding this flag. Thanks.
That sounds good. Let me know if any questions come up about the DenyOnLimit flag. You've probably already found it, but just in case, here is the documentation for the flag: https://slurm.schedmd.com/sacctmgr.html#OPT_DenyOnLimit
Thanks,
Ben

I was reading up on this and noticed that the documentation for the DenyOnLimit flag talks about Max limits and GrpTRES limits. The limit we care about is a MinTRES limit of gres/gpu=1 (a user asked for a GPU when submitting to the GPU partition). Does this flag work for MinTRES as well? If not, is there a different flag for MinTRES?

Hi Geoff,
The limit does apply to the MinTRES value as well as any defined maximums. You're right that the documentation could be clearer about this. I've put together a patch to make this clearer and I'll have a member of our team review it for inclusion in the docs. Thanks for pointing this out.
Ben

Hi Geoff,
We've checked in an update to the documentation, clarifying that DenyOnLimit applies to minimum-type limits as well. You can see the commit here: https://github.com/SchedMD/slurm/commit/b48961c1e5d1d44d86673211ca2486a827322b3f
This change will be visible in the online documentation with the 21.08.5 release of Slurm.
Thanks,
Ben