Ticket 12933

Summary: A large number of jobs in the QOSMinGRES state were blocking the backfill scheduler from looking at the rest of a partition.
Product: Slurm
Reporter: Geoff <geoffrey.ransom>
Component: Scheduling
Assignee: Ben Roberts <ben>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Version: 21.08.3
Hardware: Linux
OS: Linux
Site: Johns Hopkins University Applied Physics Laboratory
Version Fixed: 21.08.5, 22.05.pre1

Description Geoff 2021-11-30 12:19:01 MST
Hello
   We had a user submit 180,000 individual CPU jobs to our GPU partition without asking for a GPU. Our partition QOS setting blocks jobs on the GPU partition that do not ask for a GPU, so the jobs sat in the queue with the reason QOSMinGRES.

Our SchedulerParameters is set to...
sched_min_interval=1000000,max_rpc_cnt=128,bf_continue,bf_resolution=600,bf_window=10080,defer,max_sched_time=4,bf_max_job_part=10000,bf_max_job_test=30000

There were about 25k unrelated lower-priority jobs that could have run but did not, and the GPU machines sat idle all weekend.

We deleted the 180,000 jobs, and once fewer than roughly 15k unrunnable jobs were left at the front of the GPU partition, the backfill scheduler started picking up the runnable jobs and placing them on systems. I assume the backfill scheduler was not able to look deep enough into the partition to find the runnable jobs until enough jobs had been deleted.

Removing the jobs solved the problem, and we have lectured the user about sending CPU jobs to the GPU partition and about using job arrays where possible, but I was wondering...

Is the backfill scheduler able to skip QOSMinGRES jobs, since they can't run, or is it stuck checking them on every pass, so that they block jobs deeper in the queue than our bf_max_job* settings allow? (Is this a bug or expected behavior?)

Is there a way to kill these jobs automatically in Slurm, akin to kill_invalid_depend for jobs with invalid dependencies, or will we have to add something to our cli_filter to prevent this situation?
Comment 1 Ben Roberts 2021-11-30 13:39:18 MST
Hi Geoff,

It sounds like the backfill scheduler was not able to get to the other queued jobs simply because of the number of jobs ahead of them, combined with your limits on how many jobs can be tested per partition and in total. One thing that would help in a scenario like the one you describe is an additional per-user limit on the number of jobs the backfill scheduler will evaluate, set with the bf_max_job_user SchedulerParameter [1]. Set it to a value lower than the bf_max_job_part limit so that jobs from a single user can't consume the entire evaluation budget for the partition.
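To illustrate, taking the SchedulerParameters line from the description, the extra limit could be appended like this (the value 1000 is only an illustrative choice, kept below the existing bf_max_job_part=10000; pick a value that fits your workload):

```
# slurm.conf -- illustrative sketch only, not a tested configuration.
# bf_max_job_user is set below bf_max_job_part so one user's jobs
# cannot fill the partition's entire evaluation budget.
SchedulerParameters=sched_min_interval=1000000,max_rpc_cnt=128,bf_continue,bf_resolution=600,bf_window=10080,defer,max_sched_time=4,bf_max_job_part=10000,bf_max_job_test=30000,bf_max_job_user=1000
```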

You can also have jobs like this rejected at submit time by setting a flag on the QOS called DenyOnLimit. The behavior you describe makes it sound like this flag isn't set. Here's an example where I configured a MinTRES limit of 1 GPU: a job requesting that QOS is submitted successfully, but can't run because of the limit.

$ sacctmgr show qos gpu-test format=name,mintres,flags
      Name       MinTRES                Flags 
---------- ------------- -------------------- 
  gpu-test    gres/gpu=1                      

$ sbatch -n1 -qgpu-test --wrap='srun sleep 5'
Submitted batch job 2263

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2263     debug     wrap      ben PD       0:00      1 (QOSMinGRES)




If I add that flag to the QOS and try again you can see that the job is rejected at submit time.

$ sacctmgr show qos gpu-test format=name,mintres,flags
      Name       MinTRES                Flags 
---------- ------------- -------------------- 
  gpu-test    gres/gpu=1          DenyOnLimit 

$ sbatch -n1 -qgpu-test --wrap='srun sleep 5'
sbatch: error: QOSMinGRES
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
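For reference, here is a sketch of how that flag could be set on an existing QOS (using the example QOS name from above; sacctmgr's += syntax appends to the current flag list rather than replacing it):

```
$ sacctmgr modify qos gpu-test set Flags+=DenyOnLimit
```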




Let me know if you already have this flag set and jobs aren't being rejected or if you have any other questions.

Thanks,
Ben


[1] https://slurm.schedmd.com/slurm.conf.html#OPT_bf_max_job_user
Comment 2 Geoff 2021-11-30 14:06:09 MST
We set the per-partition limit because our CPU partition has been seeing queue depths of 2-5 million jobs, and with the default values the backfill scheduler was spending all its time in that partition and never looking at the other partitions.

We decided not to add the user limit because of a worry that lower-priority jobs from some users might get backfilled ahead of higher-priority jobs from other users, and some people are a bit overly concerned about that. In this case we may just have to live with this possible failure mode if we continue to want to avoid a user limit lower than the partition limit.

> Let me know if you already have this flag set

No, we do not have this flag set at this time. I will read up on the flags again and see about adding this flag. Thanks.
Comment 3 Ben Roberts 2021-11-30 14:34:45 MST
That sounds good.  Let me know if any questions come up about the DenyOnLimit flag.  You've probably already found it, but just in case, here is the documentation for the flag:
https://slurm.schedmd.com/sacctmgr.html#OPT_DenyOnLimit

Thanks,
Ben
Comment 4 Geoff 2021-12-02 11:23:24 MST
I was reading up on this and noticed that the DenyOnLimit documentation talks about Max limits and GrpTRES limits. The limit we care about is a MinTRES limit of gres/gpu=1 (requiring that a user ask for a GPU when submitting to the GPU partition).

Does this flag work for MinTRES as well? If not, is there a different flag for MinTRES?
Comment 6 Ben Roberts 2021-12-02 13:10:02 MST
Hi Geoff,

The DenyOnLimit flag does apply to the MinTRES value as well as any defined maximums.  You're right that the documentation could be clearer about this.  I've put together a patch to clarify it and will have a member of our team review it for inclusion in the docs.  Thanks for pointing this out.

Ben
Comment 9 Ben Roberts 2021-12-06 15:34:51 MST
Hi Geoff,

We've checked in an update to the documentation, clarifying that DenyOnLimit applies to minimum-type limits as well.  You can see the commit here:

https://github.com/SchedMD/slurm/commit/b48961c1e5d1d44d86673211ca2486a827322b3f

This change will be visible in the online documentation with the 21.08.5 release of Slurm.

Thanks,
Ben