| Summary: | Users' jobs cannot be scheduled in a particular partition although there are still resources for other users to run jobs in the same partition | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | hui.qiu |
| Component: | Scheduling | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | BNP Paribas | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, sshare -a output, sacctmgr show qos, squeue output, sprio output, sinfo output, sdiag output, scontrol show nodes output, slurmctld.log | | |
Description
hui.qiu
2021-04-27 03:02:27 MDT
Created attachment 19130 [details]
slurm.conf
Created attachment 19131 [details]
sshare -a output
Hi Hui,

It's possible that there's a limit defined in the QOS associated with the partition that's preventing this user's jobs from starting. Can I have you send the output of 'sacctmgr show qos'?

If you have an example job on the system right now, I would also like to see the output of a few commands to see what is happening with that job:

squeue
sprio
sinfo
scontrol show job <job id of problem job>
sdiag

I would also like to see the slurmctld.log file covering the time from when the job in question was submitted to the present.

Thanks,
Ben

Hi Ben,

Yes, we have a QOS set for each partition. Please find the attached output. Our default partition has a QOS with GrpJobs=2000; most others have lower GrpJobs values. The strange thing is that the mmf partition has QOS=team, which has GrpJobs=10000, yet we never seem to go beyond 2000 jobs even when there are resources available in the cluster.

I don't have an example at the moment. I will check with the heaviest user to reproduce the issue and send you the command outputs as well as the slurmctld.log in the next couple of days.

Regards,
Hui

Created attachment 19177 [details]
sacctmgr show qos
Hi Hui,

Thank you for sending the sacctmgr output. It does look like there is a GrpJobs limit defined for the 'team' QOS, but with a value of 10,000 you're probably not hitting that limit.

Looking at your config again, I see another possibility. If you have a large number of jobs in the queue and this user's jobs are relatively low in priority, then it would probably fall to the backfill scheduler to start them. If the backfill scheduler isn't able to get to these jobs in a single scheduling cycle, the jobs will just sit in the queue until there are fewer jobs waiting.

There is a parameter you can set in your slurm.conf file that tells the backfill scheduler to keep track of how far it made it through the queue when it stops, so that it can pick up where it left off the next time, rather than the default behavior of starting from the beginning each time. This parameter is called bf_continue and is set as one of the SchedulerParameters.

Can you confirm that there were a large number of jobs in the queue when you saw this behavior? If the behavior does seem to be related to the number of queued jobs, I think this is likely the cause of what you're seeing. You may also want to consider increasing bf_max_job_test from its default value of 100.

Let me know if you have questions about this.

Thanks,
Ben

Hi Ben,

Yes, I did see a large number of pending jobs in the queue with the sprio command at one point, but the running jobs were completing quickly that time and the pending jobs got executed before I finished collecting the logs and command outputs. I may need to wait until the heaviest user returns from holidays to run a similarly heavy batch of jobs again to gather more evidence.

As you commented, I also feel the backfill scheduler behavior is probably the cause of our issue. Perhaps I can add bf_continue and test with the user again.
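The change under discussion is a single slurm.conf edit plus a reconfigure; a minimal sketch, assuming the site's SchedulerParameters line quoted elsewhere in this thread (all commands here are standard Slurm CLI, but run them on your slurmctld host):

```shell
# Sketch: enabling bf_continue alongside the site's existing parameter.
# The sched_min_interval value is taken from this ticket; adjust for your site.

# slurm.conf, before:
#   SchedulerParameters=sched_min_interval=200000
# slurm.conf, after (parameters are comma-separated):
#   SchedulerParameters=sched_min_interval=200000,bf_continue

scontrol reconfigure                          # apply without restarting slurmctld
scontrol show config | grep SchedulerParam    # verify the live value was picked up
```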
Our current setting of SchedulerParameters is:

SchedulerParameters=sched_min_interval=200000

How do I append another parameter, bf_continue? And after the change in slurm.conf, is 'scontrol reconfig' enough to make it effective?

Thanks,
Hui

Thanks for confirming that there were a lot of jobs at the time. It does seem likely that the bf_continue parameter will help in this case. You can add it to SchedulerParameters by appending it to the end, with a comma separating the parameters, like this:

SchedulerParameters=sched_min_interval=200000,bf_continue

You can just do an 'scontrol reconfigure' for it to take effect. You can confirm that it was picked up correctly by running 'scontrol show config | grep SchedulerParam' to show what the scheduler recognizes as the current parameters.

Thanks,
Ben

Hi Hui,

I wanted to follow up and see if you were able to add the bf_continue flag and whether it made a difference in the behavior you were seeing. Let me know if you still need help with this ticket.

Thanks,
Ben

Hi Ben,

Yes, we have implemented the bf_continue flag and have been watching for a few days. So far it looks better and there have been no user complaints. I think you can close this ticket now.

Thanks,
Hui

I'm glad to hear that helped. Let us know if anything else comes up.

Thanks,
Ben

Hi Ben,

Today, the same user sent a batch of big jobs and small jobs. The big jobs were running, and many small jobs were pending with reason Priority although there were still resources for them. For example, the job below requires only 1 CPU core and 2 GB of memory. According to the worker nodes' resource report, there is still room for this job, but it was pending for Priority.
[gadmin@hkgslaqsdev110 17:34]$ cat job-391537.txt
JobId=391537 JobName=tmpbeve1l3p
   UserId=v_bazine(2021) GroupId=users(100) MCS_label=N/A
   Priority=13747 Nice=0 Account=research QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-05-27T12:31:03 EligibleTime=2021-05-27T12:31:03
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mmf AllocNode:Sid=hkgslaqsdev100:41794
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=2048,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/tmp/tmpt_ke7fqu
   WorkDir=/beegfs_exch1/scratch4/v_bazine/ML/MAOBT
   StdErr=/home/v_bazine/trash/tmpbeve1l3p/104.stdout
   StdIn=/dev/null
   StdOut=/home/v_bazine/trash/tmpbeve1l3p/104.stdout
   Power=

Please check the attached command outputs and slurmctld.log.

Thanks,
Hui

Created attachment 19690 [details]
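For quick triage of a saved record like the one above, the pending reason and requested TRES can be pulled out with standard text tools. A minimal sketch; the sample string reuses values from job 391537 in this ticket and needs no live Slurm access:

```shell
# Extract Reason and TRES from a saved 'scontrol show job' record.
# The sample record reuses fields from job 391537 quoted in this ticket.
record='JobId=391537 JobState=PENDING Reason=Priority TRES=cpu=1,mem=2048,node=1'

# Split the space-separated key=value pairs onto lines, then take the value
# part of the field we want (cut -f2- keeps the '=' signs inside TRES intact).
reason=$(printf '%s\n' "$record" | tr ' ' '\n' | grep '^Reason=' | cut -d= -f2-)
tres=$(printf '%s\n' "$record" | tr ' ' '\n' | grep '^TRES=' | cut -d= -f2-)

echo "reason=$reason tres=$tres"
```

For a live queue, `squeue --states=PD -o '%i %r %C %m'` gives the same reason/CPU/memory view across all pending jobs in one shot.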
squeue output
Created attachment 19691 [details]
sprio output
Created attachment 19692 [details]
sinfo output
Created attachment 19693 [details]
sdiag output
Created attachment 19694 [details]
scontrol show nodes output
Created attachment 19695 [details]
slurmctld.log
Hi Hui,
I do see what you're pointing out, that it looks like these small jobs should be able to start. Unfortunately I can't see what exactly is preventing them from being backfilled right away. The sdiag output you sent makes it look like all the jobs are being evaluated:
Last depth cycle: 1229
Last depth cycle (try sched): 46
...
Last queue length: 1229
There must be something that is keeping the jobs from being able to start. However, with the current log level I wasn't able to see why these jobs weren't starting. Do you happen to still have jobs queued like this? If your cluster is still in this state, or if you can make it get in this state again, I would like to have you enable some additional logging by running these commands:
scontrol setdebug debug2
scontrol setdebugflags +backfill
Let the scheduler run for a few minutes with the additional logging enabled. You can revert to your previous logging settings by running these:
scontrol setdebugflags -backfill
scontrol setdebug info
While you have the debug flags enabled I would like to have you collect the output of several commands too:
sdiag
sinfo
scontrol show nodes (full output)
scontrol show jobs (full output)
With that information I should be able to get a better idea of what is happening.
Thanks,
Ben
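One way to sanity-check Ben's reading of the sdiag figures quoted above is to compare 'Last depth cycle' against 'Last queue length': when they are equal, the backfill scheduler evaluated every queued job in its last cycle. A minimal sketch that parses saved sdiag text rather than a live cluster; the numbers are the ones from this ticket:

```shell
# Compare backfill depth against queue length in saved sdiag output.
# The sample text reuses the numbers quoted from this ticket's sdiag attachment.
sdiag_out='Last depth cycle: 1229
Last depth cycle (try sched): 46
Last queue length: 1229'

# '^Last depth cycle:' deliberately excludes the '(try sched)' variant line.
depth=$(printf '%s\n' "$sdiag_out" | awk -F': ' '/^Last depth cycle:/ {print $2}')
queue=$(printf '%s\n' "$sdiag_out" | awk -F': ' '/^Last queue length:/ {print $2}')

if [ "$depth" -eq "$queue" ]; then
  verdict="backfill walked the entire queue"
else
  verdict="backfill stopped early: $depth of $queue jobs evaluated"
fi
echo "$verdict"
```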
Hi Hui,

I wanted to follow up and see if this has come up again. If you are able to collect some debug logging with the 'backfill' flag while this is happening, I will be happy to look into what is going on.

Thanks,
Ben

Hi Ben,

Thanks for the information on getting more debug logs. The issue doesn't happen too often. I will follow the steps to gather the debug information when a similar issue happens again. You may close the case if you don't hear from me in the next few days, as the further steps have been provided. I'll reopen the case whenever I can gather the required information.

Regards,
Hui

Hi Hui,

I haven't seen an update to this ticket for a week. I'll go ahead and close it, but if it does come up again and you are able to collect the information we discussed, feel free to update the ticket and I'll review the information.

Thanks,
Ben