Ticket 11456

Summary: One user's jobs can't be scheduled in a particular partition even though resources remain and other users can still run jobs in that partition
Product: Slurm Reporter: hui.qiu
Component: Scheduling Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: BNP Paribas
Attachments: slurm.conf
sshare -a output
sacctmgr show qos
squeue output
sprio output
sinfo output
sdiag output
scontrol show nodes output
slurmctld.log

Description hui.qiu 2021-04-27 03:02:27 MDT
Hi Support, 

We use Slurm version 17.02.5 on RHEL 6.10 and have run into a scheduling issue.

Our slurm.conf is attached.

The issue affects our heaviest user, v_bazine, who runs jobs mostly in the mmf partition.

Even when there are still plenty of resources (CPU, RAM), many of v_bazine's jobs in the mmf partition sit in PD status with reason 'Priority'. If v_bazine sends the jobs to other partitions, they run. If other users send jobs to the mmf partition, their jobs also run.

It seems there is some sort of per-user cap within the partition, but I can't find or explain why.

User v_bazine runs jobs more often than others and hence has a lower fairshare value. But even when no other users submit jobs, his jobs still get stuck in PD status in the mmf partition. This wastes resources.

Can you please check and advise how to avoid this?

BTW, we plan to upgrade Slurm to the latest version together with the RHEL 8 OS migration later this year. In the meantime, I would like to know whether this can be addressed with settings in the current version.

Thanks,
Hui
Comment 1 hui.qiu 2021-04-27 03:32:43 MDT
Created attachment 19130 [details]
slurm.conf
Comment 2 hui.qiu 2021-04-27 03:34:21 MDT
Created attachment 19131 [details]
sshare -a output
Comment 3 Ben Roberts 2021-04-27 11:36:21 MDT
Hi Hui,

It's possible that there's a limit defined in the QOS associated with the partition that's preventing this user's jobs from starting.  Can I have you send the output of 'sacctmgr show qos'?  

If you have an example job on the system right now I would also like to see the output of a few commands to see what is happening with that job:
squeue
sprio
sinfo
scontrol show job <job id of problem job>
sdiag

I would also like to see the slurmctld.log file that covers the time from when the job in question was submitted to the present.

Thanks,
Ben
Comment 4 hui.qiu 2021-04-28 18:20:40 MDT
Hi Ben, 

Yes, we have a QOS set for each partition. Please find the attached output.

Our default partition has a QOS with GrpJobs=2000. Most others have lower GrpJobs limits.

The strange thing is that the mmf partition has QOS=team, which allows 10000 GrpJobs, yet we never seem to go beyond 2000 jobs even when there are free resources in the cluster.

I don't have an example at the moment. I will ask the heaviest user to reproduce the issue and send you the command outputs as well as the slurmctld.log in the next couple of days.

Regards,
Hui
Comment 5 hui.qiu 2021-04-28 18:22:40 MDT
Created attachment 19177 [details]
sacctmgr show qos
Comment 6 Ben Roberts 2021-04-29 09:24:27 MDT
Hi Hui,

Thank you for sending the sacctmgr output.  It does look like there is a GrpJobs limit defined for the 'team' QOS, but with a value of 10,000 you're probably not hitting that limit.  Looking at your config again I do see that there is another possibility of what might be happening.  If you have a large number of jobs in the queue and this user's jobs are relatively low in priority then it would probably fall to the backfill scheduler to start the jobs.  If the backfill scheduler isn't able to get to these jobs in a single scheduling cycle then the jobs will just sit in the queue until there are fewer jobs waiting.

There is a parameter you can set in your slurm.conf file that will tell the backfill scheduler to keep track of how far it made it in the queue when it stops so that it can pick up from where it left off the next time, rather than the default behavior of starting from the beginning each time.  This parameter is called bf_continue and is set as one of the SchedulerParameters.  

Can you confirm that there were a large number of jobs in the queue when you saw this behavior?  If this behavior does seem to be related to the number of queued jobs then I do think this is likely the cause for the behavior you're seeing.  You may also want to consider increasing bf_max_job_test from the default value of 100.  Let me know if you have questions about this.
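To illustrate, the combined change would look something like this in slurm.conf (the bf_max_job_test value here is only an illustrative placeholder, not a recommendation for your site, and any parameters you already have should stay on the same comma-separated list):

```
# slurm.conf: SchedulerParameters is a single comma-separated list.
# bf_max_job_test=500 is an illustrative value only.
SchedulerParameters=bf_continue,bf_max_job_test=500
```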

Thanks,
Ben
Comment 7 hui.qiu 2021-05-03 22:59:40 MDT
Hi Ben,

Yes, I did see a large number of pending jobs in the queue via the sprio command at one point, but the running jobs were completing quickly that time, and the PD jobs got executed before I finished collecting the logs and command outputs.

I may need to wait until the heaviest user returns from holidays and runs a similarly heavy batch of jobs to gather more evidence.

As you commented, I also suspect the backfill scheduler behavior is the cause of our issue.

Perhaps I can start by adding bf_continue and testing with the user again.

Our current setting of SchedulerParameters is:

SchedulerParameters=sched_min_interval=200000

How do I append the additional bf_continue parameter?

And after the change in slurm.conf, is 'scontrol reconfig' enough to make it take effect?

Thanks,
Hui
Comment 8 Ben Roberts 2021-05-04 08:42:02 MDT
Thanks for confirming that there were a lot of jobs at the time.  It does seem likely that the bf_continue parameter will help in this case.  You can add it to the SchedulerParameters by adding it to the end with a comma separating the parameters, like this:
SchedulerParameters=sched_min_interval=200000,bf_continue

You can just do an 'scontrol reconfigure' for it to take effect.  You can confirm that it was picked up correctly by running 'scontrol show config | grep SchedulerParam' to show what the scheduler recognizes as the current parameters.

Thanks,
Ben
Comment 9 Ben Roberts 2021-05-13 15:09:49 MDT
Hi Hui,

I wanted to follow up and see if you were able to add the bf_continue flag and whether it made a difference in the behavior you were seeing.  Let me know if you still need help with this ticket.

Thanks,
Ben
Comment 10 hui.qiu 2021-05-13 20:31:46 MDT
Hi Ben, 

Yes, we have implemented the bf_continue flag and have been watching for a few days. So far it looks better and there are no user complaints. I think you can close this ticket now.

thanks,
Hui
Comment 11 Ben Roberts 2021-05-14 08:05:09 MDT
I'm glad to hear that helped.  Let us know if anything else comes up.

Thanks,
Ben
Comment 12 hui.qiu 2021-05-27 03:38:48 MDT
Hi Ben, 

Today the same user submitted a batch of big jobs and small jobs. The big jobs were running, but many small jobs were pending on priority even though there were still resources for the small jobs.

e.g. the job below requires only 1 CPU core and 2 GB of memory. According to the worker nodes' resource reports, there is still room for the job, but it was PD for Priority.

[gadmin@hkgslaqsdev110 17:34]$ cat job-391537.txt
JobId=391537 JobName=tmpbeve1l3p
   UserId=v_bazine(2021) GroupId=users(100) MCS_label=N/A
   Priority=13747 Nice=0 Account=research QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-05-27T12:31:03 EligibleTime=2021-05-27T12:31:03
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mmf AllocNode:Sid=hkgslaqsdev100:41794
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=2048,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/tmp/tmpt_ke7fqu
   WorkDir=/beegfs_exch1/scratch4/v_bazine/ML/MAOBT
   StdErr=/home/v_bazine/trash/tmpbeve1l3p/104.stdout
   StdIn=/dev/null
   StdOut=/home/v_bazine/trash/tmpbeve1l3p/104.stdout
   Power=
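As a side note, the state and reason fields can be pulled out of such records with a short pipeline; the sample input below is a trimmed, hypothetical stand-in for the live 'scontrol show job 391537' output:

```shell
# Pull JobState/Reason out of `scontrol show job` output; the sample
# record here is a trimmed stand-in for a live run.
job_record='JobId=391537 JobName=tmpbeve1l3p
   Priority=13747 Nice=0 Account=research QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)'

echo "$job_record" | grep -oE 'JobState=[A-Z]+|Reason=[A-Za-z]+'
```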

Please check the attached command outputs and slurmctld.log.

Thanks,
Hui
Comment 13 hui.qiu 2021-05-27 03:43:27 MDT
Created attachment 19690 [details]
squeue output
Comment 14 hui.qiu 2021-05-27 03:45:39 MDT
Created attachment 19691 [details]
sprio output
Comment 15 hui.qiu 2021-05-27 03:47:35 MDT
Created attachment 19692 [details]
sinfo output
Comment 16 hui.qiu 2021-05-27 03:49:50 MDT
Created attachment 19693 [details]
sdiag output
Comment 17 hui.qiu 2021-05-27 03:51:27 MDT
Created attachment 19694 [details]
scontrol show nodes output
Comment 18 hui.qiu 2021-05-27 03:54:56 MDT
Created attachment 19695 [details]
slurmctld.log
Comment 19 Ben Roberts 2021-05-27 09:46:25 MDT
Hi Hui,

I do see what you're pointing out, that it looks like these small jobs should be able to start.  Unfortunately I can't see what exactly is preventing them from being backfilled right away.  The sdiag output you sent makes it look like all the jobs are being evaluated:

        Last depth cycle: 1229
        Last depth cycle (try sched): 46
        ...
        Last queue length: 1229
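(As an aside, these counters can be grepped straight out of sdiag; the sample text below is illustrative rather than taken from your attachment:)

```shell
# Extract the backfill depth/queue counters from sdiag output; the
# sample text stands in for a live `sdiag` run.
sdiag_sample='Last depth cycle: 1229
Last depth cycle (try sched): 46
Last queue length: 1229'

echo "$sdiag_sample" | grep -E 'depth cycle|queue length'
```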


There must be something that is keeping the jobs from being able to start.  However, with the current log level I wasn't able to see why these jobs weren't starting.  Do you happen to still have jobs queued like this?  If your cluster is still in this state, or if you can make it get in this state again, I would like to have you enable some additional logging by running these commands:
scontrol setdebug debug2
scontrol setdebugflags +backfill


Let the scheduler run for a few minutes with the additional logging enabled.  You can revert to your previous logging settings by running these:
scontrol setdebugflags -backfill
scontrol setdebug info


While you have the debug flags enabled I would like to have you collect the output of several commands too:
sdiag
sinfo
scontrol show nodes (full output)
scontrol show jobs  (full output)


With that information I should be able to get a better idea of what is happening.

Thanks,
Ben
Comment 20 Ben Roberts 2021-06-04 11:07:55 MDT
Hi Hui,

I wanted to follow up and see if this has come up again.  If you are able to collect some debug logging with the 'backfill' flag while this is happening I will be happy to look into what is going on.

Thanks,
Ben
Comment 21 hui.qiu 2021-06-07 18:35:35 MDT
Hi Ben, 

Thanks for the information of getting more debug logs. 
The issue doesn't happen very often. I will follow the steps to gather the debug information when a similar issue occurs again.

You may close the case for now if you don't hear from me in the next few days, since the further steps have been provided. I'll reopen the case whenever I can gather the required information.

Regards,
Hui
Comment 22 Ben Roberts 2021-06-15 15:55:48 MDT
Hi Hui,

I haven't seen an update to this ticket for a week.  I'll go ahead and close it, but if it does come up again and you are able to collect the information we discussed, feel free to update the ticket and I'll review the information.

Thanks,
Ben