Ticket 9419

Summary: Jobs pending with QOSGrpJobsLimit while cluster nodes still have available resources
Product: Slurm    Reporter: hui.qiu
Component: Accounting    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Version: - Unsupported Older Versions
Hardware: Linux
OS: Linux
Site: BNP Paribas

Description hui.qiu 2020-07-20 03:10:15 MDT
Hi Support, 

We are using Slurm 17.02.5. I understand it is out of support, and we are planning to do the update. 

I just have a general question to understand what the pending reason 'QOSGrpJobsLimit' actually means. 

While there were still a lot of resources available on the cluster, I noticed a few hundred jobs pending with reason 'QOSGrpJobsLimit'. 

What does Slurm check before putting a job in pending with reason code 'QOSGrpJobsLimit'? 

Thanks,
Hui
Comment 1 hui.qiu 2020-07-20 03:17:23 MDT
In addition, 

[gadmin@hkgslaqsdev110 17:13]$ squeue |grep PD|grep QOSGrpJobsLimit
          52420639 emergency probejob     root PD       0:00      1 (QOSGrpJobsLimit)
          52420642     mosek probejob     root PD       0:00      1 (QOSGrpJobsLimit)
          52420644    medium probejob     root PD       0:00      1 (QOSGrpJobsLimit)
          52407304    medium lynx_agg c_guilba PD       0:00      1 (QOSGrpJobsLimit)


Now most jobs have turned to R status. I see that a few probejobs submitted by root to different partitions are still pending with QOSGrpJobsLimit. 


Is there anything special about those root probejobs? Do I need to do anything to clear the status?
Comment 2 Jeff DeGraw 2020-07-20 10:57:26 MDT
Hui,

Thanks for reaching out to us. I would be happy to clarify this for you. Can you give me the output of this command:
> sacctmgr list qos format=name,GrpJobs

Thanks,
- Jeff
Comment 3 hui.qiu 2020-07-20 20:37:31 MDT
Hi Jeff,

Here is the command output: 

[root@hkgslaqsdev110 10:35]$ sacctmgr list qos format=name,GrpJobs
      Name GrpJobs 
---------- ------- 
    normal    1000 
   longjob    1000 
weekendjob     500 
    lowjob     500 
pretestjob     500 
   hugejob     500 
  localjob     500 
    gpujob     200 
      team   10000 

Thanks,
Hui
Comment 4 Jeff DeGraw 2020-07-21 08:59:04 MDT
Hui,

Thanks for providing that information. If I understand correctly, you want to know why jobs are pending with the reason QOSGrpJobsLimit. Every job runs under a QOS, and, as your output shows, each of your QOSes has a limit on the total number of concurrently running jobs (GrpJobs). Once that limit is reached, further jobs stay pending until a running job finishes. From the sacctmgr man page:
> NOTE:  The  group  limits  (GrpJobs, GrpTRES, etc.) are tested when a job is
> being considered for being allocated resources.  If starting a job would
> cause any of its group limit to be exceeded, that job will not be considered
> for scheduling even if that job might preempt other jobs which would release
> sufficient group resources for the pending job to be initiated.

You can increase the GrpJobs value for a QOS with this command:
> sacctmgr modify qos where name=<name> set GrpJobs=<#>
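As a rough mental model of the check described above (a sketch only, not Slurm's implementation; the function name is made up):

```python
def qos_headroom(grp_jobs_limit, running_jobs):
    """How many more jobs this QOS can start before new submissions
    pend with QOSGrpJobsLimit.  Note the man-page caveat: a job that
    would exceed the limit is not even considered for scheduling,
    regardless of whether preempting other jobs could free group
    resources."""
    if grp_jobs_limit is None:  # no GrpJobs set: effectively unlimited
        return float("inf")
    return max(grp_jobs_limit - running_jobs, 0)

print(qos_headroom(1000, 1000))  # 0 -> further jobs get QOSGrpJobsLimit
print(qos_headroom(500, 123))    # 377 jobs of headroom remain
```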

Does that answer your question?

- Jeff
Comment 5 hui.qiu 2020-07-21 18:04:30 MDT
Hi Jeff, 

Understood. One last question on this topic: say a user submits a batch of 200 jobs to a partition where the normal QOS has GrpJobs=1000, and there are already 900 jobs running in that partition. In this case, will 100 of the 200 jobs be scheduled first, or will all 200 jobs be put in pending status? 


Thanks,
Hui
Comment 6 Jeff DeGraw 2020-07-22 09:58:37 MDT
(In reply to hui.qiu from comment #5)
> Last question regarding this topic:  e.g., a user sends 200 jobs
> in a batch to a partition where normal QOS is set with 1000 GrpJobs. There
> are already 900 jobs running in the partition. In this case, will 100 jobs
> out of 200 be scheduled first or will the entire 200 jobs be put in pending
> status? 

That's a great question. From my own testing:
$ sacctmgr list qos format=name,GrpJobs
      Name GrpJobs 
---------- ------- 
    normal         
      gold       5
$ sbatch --array=0-9 -q gold --wrap="sleep 60"
Submitted batch job 296
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
         296_[5-9]     debug     wrap     jeff PD       0:00      1 (QOSGrpJobsLimit) 
             296_0     debug     wrap     jeff  R       0:03      1 linux1 
             296_1     debug     wrap     jeff  R       0:03      1 linux1 
             296_2     debug     wrap     jeff  R       0:03      1 linux1 
             296_3     debug     wrap     jeff  R       0:03      1 linux1 
             296_4     debug     wrap     jeff  R       0:03      1 linux2

So, in your question, 100 of those jobs would run and 100 would be put in a pending state initially.
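The split above can be sketched as a simple admission loop (a minimal model of the GrpJobs check, not Slurm's actual scheduler code; the function name is illustrative):

```python
def admit_jobs(running, grp_jobs, submitted):
    """Model the GrpJobs check: each submitted job starts only while the
    QOS's running-job count is below its GrpJobs limit; the rest pend
    with reason QOSGrpJobsLimit.  Illustrative model only."""
    started = pending = 0
    for _ in range(submitted):
        if grp_jobs is None or running < grp_jobs:
            running += 1
            started += 1
        else:
            pending += 1
    return started, pending

# The array test above: GrpJobs=5, 10 array tasks submitted
print(admit_jobs(running=0, grp_jobs=5, submitted=10))        # (5, 5)
# Your scenario: 900 already running, GrpJobs=1000, 200 submitted
print(admit_jobs(running=900, grp_jobs=1000, submitted=200))  # (100, 100)
```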

- Jeff
Comment 7 Jeff DeGraw 2020-07-23 09:33:11 MDT
Hui,

I'm going to go ahead and close out this ticket now, but feel free to open it back up if you have further questions.

Thanks,
- Jeff