Hi Support,

We use Slurm version 17.02.5 on RHEL 6.10 and have run into a scheduling issue. You will find our slurm.conf in the attachments. The issue affects our heaviest user, v_bazine, who runs jobs mostly in the mmf partition. Even when there are still plenty of resources (CPU, RAM), many of v_bazine's jobs in the mmf partition sit in PD status with reason 'Priority'. If v_bazine submits the jobs to other partitions, they run. If other users submit jobs to the mmf partition, those jobs also run. It seems there is some kind of per-user cap within the partition, but I can't find or explain it. Because v_bazine runs jobs more often, he has a lower fairshare value than the other users. But even when no other users are submitting jobs, his jobs still get stuck in PD status in the mmf partition, which wastes resources. Can you please check and advise how to avoid this?

BTW, we plan to upgrade Slurm to the latest version together with the RHEL 8 OS migration later this year, but I would like to see whether this can be solved with some settings in the current version.

Thanks,
Hui
Created attachment 19130 [details] slurm.conf
Created attachment 19131 [details] sshare -a output
Hi Hui,

It's possible that there's a limit defined in the QOS associated with the partition that's preventing this user's jobs from starting. Can I have you send the output of 'sacctmgr show qos'? If you have an example job on the system right now, I would also like to see the output of a few commands to see what is happening with that job:

squeue
sprio
sinfo
scontrol show job <job id of problem job>
sdiag

I would also like to see the slurmctld.log file that covers the time from when the job in question was submitted to the present.

Thanks,
Ben
Hi Ben,

Yes, we have a QOS set for each partition. Please find the attached output. Our default partition's QOS has GrpJobs=2000, and most others have lower GrpJobs limits. The strange thing is that the mmf partition has QOS=team, which allows 10000 GrpJobs, yet we never seem to go beyond 2000 jobs even when there are free resources in the cluster.

I don't have an example at the moment. I will ask the heaviest user to reproduce the issue and will send you the command outputs as well as the slurmctld.log in the next couple of days.

Regards,
Hui
Created attachment 19177 [details] sacctmgr show qos
Hi Hui,

Thank you for sending the sacctmgr output. It does look like there is a GrpJobs limit defined for the 'team' QOS, but with a value of 10,000 you're probably not hitting that limit.

Looking at your config again, I do see another possibility for what might be happening. If you have a large number of jobs in the queue and this user's jobs are relatively low in priority, then it would probably fall to the backfill scheduler to start the jobs. If the backfill scheduler isn't able to get to these jobs in a single scheduling cycle, then the jobs will just sit in the queue until there are fewer jobs waiting.

There is a parameter you can set in your slurm.conf file that tells the backfill scheduler to keep track of how far it made it through the queue when it stops, so that it can pick up from where it left off the next time, rather than the default behavior of starting from the beginning each cycle. This parameter is called bf_continue and is set as one of the SchedulerParameters.

Can you confirm that there were a large number of jobs in the queue when you saw this behavior? If the behavior does seem to be related to the number of queued jobs, then I think this is likely the cause of what you're seeing. You may also want to consider increasing bf_max_job_test from the default value of 100.

Let me know if you have questions about this.

Thanks,
Ben
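For reference, the combined line in slurm.conf might look something like this (the bf_max_job_test value here is only an illustrative placeholder; sched_min_interval is the value from your attached config):

```
SchedulerParameters=sched_min_interval=200000,bf_continue,bf_max_job_test=500
```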
Hi Ben,

Yes, I did see a large number of pending jobs in the queue via the sprio command at one point, but the running jobs were completing quickly that time, and the PD jobs started before I finished collecting the logs and command outputs. I may need to wait until the heaviest user returns from holidays and runs a similarly heavy batch of jobs to gather more evidence.

As you suggested, I also suspect the backfill scheduler behavior is probably the cause of our issue. Perhaps I can add bf_continue and test with the user again. Our current setting of SchedulerParameters is:

SchedulerParameters=sched_min_interval=200000

How do I append the bf_continue parameter? And after the change in slurm.conf, is 'scontrol reconfig' enough to make it effective?

Thanks,
Hui
Thanks for confirming that there were a lot of jobs queued at the time. It does seem likely that the bf_continue parameter will help in this case. You can add it to the end of SchedulerParameters with a comma separating the parameters, like this:

SchedulerParameters=sched_min_interval=200000,bf_continue

You can just do an 'scontrol reconfigure' for it to take effect. You can confirm that it was picked up correctly by running 'scontrol show config | grep SchedulerParam' to see what the scheduler recognizes as the current parameters.

Thanks,
Ben
Hi Hui, I wanted to follow up and see if you were able to add the bf_continue flag and whether it made a difference in the behavior you were seeing. Let me know if you still need help with this ticket. Thanks, Ben
Hi Ben,

Yes, we have implemented the bf_continue flag and have been watching for a few days. So far it looks better and there have been no user complaints. I think you can close this ticket now.

Thanks,
Hui
I'm glad to hear that helped. Let us know if anything else comes up. Thanks, Ben
Hi Ben,

Today the same user submitted a batch of big jobs and small jobs. The big jobs were running, but many small jobs were pending for Priority even though there were still resources for them. For example, the job below requires only 1 CPU core and 2 GB of memory. According to the cluster worker nodes' resource report there is still room for the job, but it was PD for Priority.

[gadmin@hkgslaqsdev110 17:34]$ cat job-391537.txt
JobId=391537 JobName=tmpbeve1l3p
   UserId=v_bazine(2021) GroupId=users(100) MCS_label=N/A
   Priority=13747 Nice=0 Account=research QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-05-27T12:31:03 EligibleTime=2021-05-27T12:31:03
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mmf AllocNode:Sid=hkgslaqsdev100:41794
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=2048,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/tmp/tmpt_ke7fqu
   WorkDir=/beegfs_exch1/scratch4/v_bazine/ML/MAOBT
   StdErr=/home/v_bazine/trash/tmpbeve1l3p/104.stdout
   StdIn=/dev/null
   StdOut=/home/v_bazine/trash/tmpbeve1l3p/104.stdout
   Power=

Please check the attached command outputs and slurmctld.log.

Thanks,
Hui
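As a side note for our own triage, the key=value fields in 'scontrol show job' output can be pulled into a dictionary with a small script like the sketch below (this is our own helper, not part of Slurm; the sample string is abbreviated from the job output above):

```python
# Quick sketch: parse the key=value tokens of `scontrol show job` text
# so pending jobs can be filtered by Reason and requested TRES.
def parse_scontrol_job(text):
    fields = {}
    for token in text.split():
        if "=" in token:
            # Split on the first '=' only, so values like
            # TRES=cpu=1,mem=2048,node=1 stay intact.
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

sample = "JobId=391537 JobState=PENDING Reason=Priority TRES=cpu=1,mem=2048,node=1"
job = parse_scontrol_job(sample)
print(job["JobState"], job["Reason"], job["TRES"])
# → PENDING Priority cpu=1,mem=2048,node=1
```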
Created attachment 19690 [details] squeue output
Created attachment 19691 [details] sprio output
Created attachment 19692 [details] sinfo output
Created attachment 19693 [details] sdiag output
Created attachment 19694 [details] scontrol show nodes output
Created attachment 19695 [details] slurmctld.log
Hi Hui,

I do see what you're pointing out: it does look like these small jobs should be able to start. Unfortunately I can't see what exactly is preventing them from being backfilled right away. The sdiag output you sent makes it look like all the jobs are being evaluated:

Last depth cycle: 1229
Last depth cycle (try sched): 46
...
Last queue length: 1229

Something must be keeping the jobs from being able to start. However, with the current log level I wasn't able to see why these jobs weren't starting. Do you happen to still have jobs queued like this? If your cluster is still in this state, or if you can get it into this state again, I would like to have you enable some additional logging by running these commands:

scontrol setdebug debug2
scontrol setdebugflags +backfill

Let the scheduler run for a few minutes with the additional logging enabled. You can revert to your previous logging settings by running these:

scontrol setdebugflags -backfill
scontrol setdebug info

While you have the debug flags enabled, I would like to have you collect the output of several commands too:

sdiag
sinfo
scontrol show nodes (full output)
scontrol show jobs (full output)

With that information I should be able to get a better idea of what is happening.

Thanks,
Ben
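To watch whether the backfill scheduler is walking the whole queue, the relevant sdiag counters can be scraped with something like the rough sketch below (a hypothetical monitoring helper, not a Slurm tool; the line labels match the sdiag excerpt above):

```python
# Rough sketch: pull "label: value" counters out of sdiag text output
# so the backfill depth per cycle can be compared against queue length.
def parse_sdiag_counters(text, labels):
    counters = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        label = label.strip()
        if sep and label in labels:
            counters[label] = int(value.strip())
    return counters

sample = """Last depth cycle: 1229
Last depth cycle (try sched): 46
Last queue length: 1229"""
wanted = {"Last depth cycle", "Last queue length"}
print(parse_sdiag_counters(sample, wanted))
# → {'Last depth cycle': 1229, 'Last queue length': 1229}
```

If "Last depth cycle" stays well below "Last queue length" across cycles, the backfill scheduler is not reaching the back of the queue.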
Hi Hui, I wanted to follow up and see if this has come up again. If you are able to collect some debug logging with the 'backfill' flag while this is happening I will be happy to look into what is going on. Thanks, Ben
Hi Ben,

Thanks for the information on gathering more debug logs. The issue doesn't happen very often, so I will follow those steps to collect the debug information the next time a similar issue occurs. You may close the case if you don't hear from me in the next few days, since the further steps have been provided. I'll reopen the case whenever I can gather the required information.

Regards,
Hui
Hi Hui, I haven't seen an update to this ticket for a week. I'll go ahead and close it, but if it does come up again and you are able to collect the information we discussed, feel free to update the ticket and I'll review the information. Thanks, Ben