Created attachment 2478 [details] output of spend command

We have an ongoing problem with jobs in a specific partition that appear in PENDING state even when there are sufficient resources to run them. Enclosed are: (1) output from an NIH-written program ("spend") which summarizes pending jobs with reasons of "Resources" or "Priority" - the second section reports available resources in that partition, and the third section identifies the specific nodes with the most free resources (by CPU and memory); (2) slurm.conf.

Looking at the 'spend' output: the few dozen jobs of user patidarr are waiting for resources. These jobs request exclusive 65/240GB nodes, and none are available, so they should indeed be pending. However, users buhuleod, davisbw, and frankg have pending jobs requesting 1 or 8 CPUs and 10GB or less. There are hundreds of free nodes meeting these requirements, yet the jobs remain in a PENDING/Priority state. They generally stay queued several hours before they eventually get scheduled to run. If we bump the priority of these jobs they will generally start.
Created attachment 2479 [details] slurm.conf
Can you confirm that those stuck jobs aren't requesting any other resources such as licenses or features?

Actually - looking at the spend.txt output I think things are working as expected. The highest priority jobs are all exclusive-node, so Slurm is trying to drain out nodes to provide to those jobs. Without seeing the current node usage and remaining run times on the nodes, I'm guessing that the jobs that could fit in the limited available memory all have run times that don't allow them to be backfilled in at present.

The backfill scheduler is conservative - if backfilling a job would delay the launch of the job expected to launch next on that node by even a second, it will not schedule that job to launch. If you're manually increasing the priority, those jobs can launch because they're no longer subject to that constraint - they've become the highest priority jobs in the queue.

If you still think something else is at play, can you provide output from something like:

squeue --format "%.18i %.9P %10S %.7Q %.10l %.2t %.10M %.6D %.4C %R"

rather than the spend command? Increasing the debug level and looking through slurmctld.log may also be helpful. ("scontrol setdebug verbose" and maybe also "scontrol setdebugflags +backfill")

- Tim
Created attachment 2480 [details] squeue output
Hi Tim,

No licenses or features were requested by those pending jobs. Here are the jobs in question:

User      Jobid           Part  Prio  Reason     Excl  Nnodes  Ncpus  Mem
--------------------------------------------------------------------------
buhuleod  7095988[0-199]  b1    51    Priority   -     1       1      10GB/node
buhuleod  7095993[0-199]  b1    51    Priority   -     1       1      10GB/node
davisbw   7091501[0-899]  b1    47    Resources  -     1       1      2GB/node
frankg    7096084[0-848]  b1    32    Priority   -     1       8      9GB/node

The job pending for Resources is asking for 1 CPU/2GB. There are over 500 nodes that the jobs could run on. All of the higher priority pending jobs are waiting on nodes with more memory (to satisfy 65/240GB requests). The higher priority jobs could never run on the 500 free nodes we're talking about.

I've just run your squeue and another spend. Look at job 7114232. It's been queued since yesterday and wants 1 node, 16 CPUs, 12GB. There are at least 40 nodes on which it could fit and on which none of the higher priority jobs could fit.
Created attachment 2481 [details] spend output
(In reply to steven fellini from comment #4)
> I've just run your squeue and another spend. Look at job 7114232. It's
> been queued since yesterday and wants 1 node, 16 cpu, 12 GB. There are at
> least 40 nodes on which it could fit and on which none of the higher
> priority jobs could fit.

There are 40 nodes that it could fit on right now, but only for a brief duration - until more nodes free up and the higher priority job would be expected to run. The accuracy of time limits has a huge impact on our ability to backfill correctly (or at all). Job 7114232 requires 12 days of wall-clock time - Slurm can't backfill that job in without disrupting a higher priority job in the future. If its run time limit were decreased to a day or so, I'd expect it to be able to launch immediately.
Sorry but I still don't understand. What higher priority jobs are you talking about? NONE of the currently pending higher priority jobs could possibly run on nodes that 7114232 can run on now (the 40 nodes we're talking about have 19 GB; the higher priority jobs all want 200+ GB).
Slurm isn't considering just the resources available immediately - it's also building a model of when additional nodes become available in the future, based on the TimeLimits of the active jobs. The nodes that are currently free aren't always going to be free - Slurm expects jobs to launch on them sometime in the future. (There is an open bug somewhere about marking such nodes as "earmarked" or something - calling them "idle" leads to a lot of confusion. They may be idle for the moment, but they are expected to be busy in the future.)

Slurm then works out when the highest priority jobs will be able to start in the future. This is all done with the normal scheduling logic. Higher priority jobs will then have nodes set aside for them to ensure they're able to start - without this mechanism larger jobs would "starve" and never be able to launch as long as smaller jobs were around to keep consuming idle resources.

The backfill scheduler is then able to come in and find gaps - resources that would otherwise sit idle for some time before the higher priority job starts. Jobs that fit within those gaps - gaps of specific resources *for a particular duration of time* - can be backfilled in and started immediately. If a job does not fit a gap - in your case, any job that could fit on the available resources has too long a time limit set - it cannot be run now. If that time limit exceeds the projected window we have those resources available by even a second, it will not be scheduled to launch.
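As a toy illustration of the gap test described above (this is not Slurm's code; the function name and numbers are made up), a pending job can backfill onto idle nodes only if its time limit fits entirely inside the window before those nodes are earmarked for a higher priority job:

```python
# Hypothetical sketch of the backfill "gap" rule: a job backfills only if
# its time limit does not exceed the idle window by even a second.

def can_backfill(job_limit_s: int, window_s: int) -> bool:
    """True if the job's time limit fits the idle window on those nodes."""
    return job_limit_s <= window_s

# Suppose the idle nodes are earmarked for a higher priority job in 1 hour:
window = 3600
assert can_backfill(3599, window)      # fits with a second to spare
assert can_backfill(3600, window)      # fits exactly
assert not can_backfill(3601, window)  # one second too long -> stays pending
```

This is why shortening a job's time limit (rather than raising its priority) can let it launch immediately.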
(In reply to Tim Wickberg from comment #8)

Hi Tim,

The problem is that there are _no_ higher priority jobs that could possibly run on the free nodes. All the higher priority jobs require more memory than is available on those free nodes; therefore these nodes should not logically be earmarked for any jobs. The first job that could fit on those nodes has remained queued for 2 days now, while 150+ nodes that could have run it sit idle.

We can't explain this behaviour and are hoping there is some Slurm config parameter that could ameliorate this problem. Could it have something to do with the fact that this partition is heterogeneous with respect to memory sizes?

Susan.
I think I understand what you're saying - that the nodes that are currently idle would never be used by the highest priority jobs (primarily because they have too little memory), so the other jobs should be launching. But I'm having a hard time piecing this together from the various outputs to visualize where the issue may be coming from. If you don't mind, can you send in a snapshot of the current system:

scontrol show nodes
scontrol show partitions
scontrol show jobs
sinfo
squeue --format "%.18i %.9P %10S %.7Q %.10l %.2t %.10M %.6D %.4C %.9m %R"

- Tim

(Everything in one file is fine, no need to split up commands. If you'd rather not have that attached to the bug you can email it to me directly - tim@schedmd.com.)
I don't see an issue on b1 at the moment - there are only three jobs pending in the dump you sent, and they are all asking for equivalent resources which aren't currently available:

tim@bluedwarf:~$ grep PD\  scontrol.txt | grep -v Depe | grep b1\ 
7391176_[0-199]  b1  N/A  35  7-00:00:00  PD  0:00  1  1  10G  (Priority)
7391085_[199]    b1  N/A  35  7-00:00:00  PD  0:00  1  1  10G  (Resources)
7381792_193      b1  N/A  32  7-00:00:00  PD  0:00  1  1  10G  (Priority)

I'm going to switch my commentary to the ibfdr partition here; I suspect the same has applied to the other partitions at varying points, and I'm hoping this will highlight how the backfill scheduler is behaving on your system. When you took that snapshot, you had these pending on ibfdr:

6394506  ibfdr  N/A   987731  10-00:00:00  PD  0:00  64  1024  1024M  (Resources)
6400501  ibfdr  N/A     1020   7-00:00:00  PD  0:00  32   512  1024M  (QOSMaxCpusPerUserLimit)
7108835  ibfdr  N/A      864   7-00:00:00  PD  0:00   8   128  1024M  (QOSMaxCpusPerUserLimit)
7258475  ibfdr  N/A      567  10-00:00:00  PD  0:00   8   128  1024M  (Priority)
7265324  ibfdr  N/A      558  10-00:00:00  PD  0:00  12   192  1024M  (Priority)
7276712  ibfdr  N/A      535  10-00:00:00  PD  0:00  10   160  1024M  (Priority)
7286572  ibfdr  N/A      429  10-00:00:00  PD  0:00  10   160  1024M  (Priority)
7286587  ibfdr  N/A      420  10-00:00:00  PD  0:00  10   160  1024M  (Priority)
7296561  ibfdr  N/A      337  10-00:00:00  PD  0:00  10   160  1024M  (Resources)
7340048  ibfdr  N/A      149  10-00:00:00  PD  0:00   5    65     4G  (Resources)

At that point in time 138 out of 192 nodes were in use by 15 running jobs[1], leaving 54 nodes idle. Those 54 nodes are being held idle so that job 6394506 can start on 64 nodes. No other pending job would run in a short enough window of time to finish before 6394506 needs to launch, so those nodes are expected to sit vacant at the moment. If we scheduled any of the other jobs - 7340048 for instance - then the expected start time of 6394506 would have to be moved out into the future again.
Without that constraint, several of those other pending jobs could launch now within the 54 free nodes - but then 6394506 might never be able to launch.

I've verified that Slurm does not prevent jobs from running by (incorrectly) assuming that idle nodes could run the highest priority job when that job has memory/CPU requirements that exceed the resources on those nodes.

- Tim

[1] grep R\  scontrol.txt | grep ibfdr | awk '{jobs+=1;nodes+=$8}END {print jobs,nodes}'
We were aware that the "b1" partition problem was not occurring when we sent you the snapshot. We can however reproduce the problem by submitting a particular mix of jobs. Would you like us to do that and send you another snapshot?

Re the "ibfdr" partition: as you point out, it's well behaved, as are the partitions _other_ than "b1". We suspect the problem may have to do with "b1" being heterogeneous with respect to CPU/memory resources (the other partitions are homogeneous).

Also want to point out the following comment from src/plugins/sched/backfill:

\*****************************************************************************\
 *  backfill.c - simple backfill scheduler plugin.
 *
 *  If a partition does not have root only access and nodes are not shared
 *  then raise the priority of pending jobs if doing so does not adversely
 *  effect the expected initiation of any higher priority job. We do not alter
 *  a job's required or excluded node list, so this is a conservative
 *  algorithm.
 *
 *  For example, consider a cluster "lx[01-08]" with one job executing on
 *  nodes "lx[01-04]". The highest priority pending job requires five nodes
 *  including "lx05". The next highest priority pending job requires any
 *  three nodes. Without explicitly forcing the second job to use nodes
 *  "lx[06-08]", we can't start it without possibly delaying the higher
 *  priority job.
\*****************************************************************************/
On 12/09/2015 11:30 AM, bugs@schedmd.com wrote:
> http://bugs.schedmd.com/show_bug.cgi?id=2217
>
> --- Comment #12 from steven fellini <sfellini@nih.gov> ---
> We were aware that the "b1" partition problem was not occurring when we sent
> you the snapshot. We can however reproduce the problem by submitting a
> particular mix of jobs. Would you like us to do that and send you another
> snapshot?

Yes, please.

> Re the "ibfdr" partition, as you point out it's well behaved as are the
> partitions _other_ than "b1". We suspect the problem may have to do with "b1"
> being heterogeneous with respect to cpu/memory resources (the other partitions
> are homogeneous).

I've done some testing with mixed nodes and haven't caught that behavior yet. I'm interested in seeing if you have a mix that triggers this - if you can give me some hints as to what you submit to provoke the behavior, that would help as well. My notes from attempting to reproduce this follow.

- Tim

As an example, I have three nodes defined here: bessie[08-09] have RealMemory=15737, bessie10 has RealMemory=10000.

scontrol create partitionname=mixed nodename=bessie[08-10]
sbatch --wrap "sleep 1000" -p mixed --exclusive -N 1 --mem=12000 -t 10 --job-name=job_1
sbatch --wrap "sleep 1000" -p mixed --exclusive -N 1 --mem=12000 -t 5 --job-name=job_2
sleep 10
sbatch --wrap "sleep 1000" -p mixed --exclusive -N 2 --mem=12000 -t 10 --job-name=job_3
sbatch --wrap "sleep 1000" -p mixed --exclusive -N 1 --mem=1000 -t 10 --job-name=job_4 --nice

job_1 and job_2 start immediately on bessie08 and bessie09. job_3 is blocked waiting on bessie08 and bessie09 becoming free, which won't happen for 10 minutes. If the backfill algorithm were not aware of the heterogeneous nodes and only considered CPU cores, it would instead have assumed bessie09 and bessie10 would suffice, and job_4 would be stalled behind the higher-priority job_3.
Right after submission:

$ squeue
  JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
    812     mixed job_3   tim PD  0:00     2 (Resources)
    813     mixed job_4   tim PD  0:00     1 (Priority)
    810     mixed job_1   tim  R  0:21     1 bessie08
    811     mixed job_2   tim  R  0:21     1 bessie09

This shows job_4 held up for a few seconds - we need a backfill scheduler pass to run. Waiting another 30 seconds (this varies based on your SchedulerParameters) then shows the correct behavior:

$ squeue
  JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
    812     mixed job_3   tim PD  0:00     2 (Resources)
    810     mixed job_1   tim  R  1:01     1 bessie08
    811     mixed job_2   tim  R  1:01     1 bessie09
    813     mixed job_4   tim  R  0:27     1 bessie10

scontrol show job 812 (job_3) shows SchedNodeList=bessie[08-09].
Tim - we are a bit slammed right now. Still owe you some outputs. Please keep this open.
I'm running through and reviewing outstanding bugs - have things stabilized for you in the past few weeks or are you still seeing issues? cheers, and happy new year, - Tim
Tim, We haven't seen instances of the problem lately, so you can probably close the ticket and we'll reopen if we start seeing it again. We did make a number of sched parameter changes which may or may not have made a difference: SchedulerParameters=bf_continue,defer,max_sched_time=5
Alright, closing for now. Let us know if you run into this again. cheers, - Tim
Tim, We're seeing the problem again. User pickardfc's jobs are apparently being "blocked" by user patidarr even though there are free nodes that could run the pickardfc jobs (but not the patidarr jobs). Tarball attached.
Created attachment 2626 [details] cluster state 2016-01-13
Tim has been out sick for the past week.

Judging from your logs, the backfill scheduler isn't even attempting to determine resources and a start time for pickardfc's jobs. I believe this problem will go away if you add "bf_max_job_test=10000" to SchedulerParameters in slurm.conf. Documentation below:

bf_max_job_test=#
    The maximum number of jobs to attempt backfill scheduling for (i.e. the queue depth). Higher values result in more overhead and less responsiveness. Until an attempt is made to backfill schedule a job, its expected initiation time value will not be set. The default value is 100. In the case of large clusters, configuring a relatively small value may be desirable. This option applies only to SchedulerType=sched/backfill.

Assuming that fixes the problem, I'd like you to capture the sdiag output periodically (say, hourly) through a typical workday, and I can help tune the system for better responsiveness and throughput. There are about 20 different scheduling parameters available, and getting the values tuned can help a great deal.
PS. You can change the slurm.conf file and run "scontrol reconfig" for the scheduling parameters to take effect without restarting daemons.
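The bf_max_job_test behavior described above can be sketched as a simple queue truncation (a toy model based on the man-page excerpt; the function name is made up): only the first N pending jobs, in priority order, are ever examined by a backfill pass, so jobs deeper in the queue never even get a projected start time.

```python
# Toy model of bf_max_job_test: backfill only looks at the first N jobs
# of the priority-ordered pending queue.

def backfill_candidates(queue, bf_max_job_test=100):
    """Return the slice of the pending queue that backfill will examine."""
    return queue[:bf_max_job_test]

queue = [f"job{i}" for i in range(250)]        # priority-ordered pending jobs
assert "job99" in backfill_candidates(queue)    # within the default depth
assert "job150" not in backfill_candidates(queue)        # never examined
assert "job150" in backfill_candidates(queue, 10000)     # raised limit
```

This is consistent with the observation that the scheduler "isn't even attempting" to evaluate pickardfc's jobs: they may simply sit below the default depth of 100.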
I've had a chance to study your configuration and logs some more. Here are some more suggestions for SchedulerParameters in your slurm.conf. First of all, your current "SchedulerParameters=bf_continue,defer" is a fine start, but given the number of jobs in your system, more options would provide better performance. See the slurm.conf man page for detailed descriptions of the parameters.

bf_max_job_test=10000
    Number of jobs to test, as already described.

bf_resolution=600
    This will reduce the overhead of the backfill scheduler. It changes the tracking of resource allocations to 10-minute intervals rather than 1-minute intervals, potentially decreasing the overhead by up to about 80%.

bf_window=#
    You _MIGHT_ consider increasing how far in the future the backfill scheduler looks, but that does increase its overhead, which is high to begin with (you are seeing 24 seconds of run time per cycle today). If you do look more than 24 hours into the future, that should be balanced by limiting the number of jobs examined per user and/or partition using bf_max_job_user=# and/or bf_max_job_part=#.
Is your system running fine now?
Moe, Thanks for your suggestions. We're going to hold off config changes until after the weekend. We'll let you know.
Moe, We're still having problems, even with SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600
Could you attach the output of the following commands plus your slurmctld log file:

mkdir slurm_01_28_16_logs
cd slurm_01_28_16_logs
sdiag >sdiag.out
squeue >squeue.out
sinfo >sinfo.out
scontrol show job >scontrol_job.out
scontrol show node >scontrol_node.out
scontrol show partition >scontrol_part.out
scontrol show reservation >scontrol_resv.out
tar -cvf slurm_logs.tar *.out

(Then attach the file "slurm_logs.tar"; you should probably compress the slurmctld log file before sending.)
Created attachment 2665 [details] cluster state 2016-01-28
Are there some specific jobs or partitions that I should look at? There are several thousand pending jobs in these logs. All of the jobs I have checked so far either need more memory than currently available (in the "ccr" partition) or are newly submitted jobs (the "norm" partition).
When I noticed your sdiag output didn't seem to match your configured SchedulerParameters, I checked the version of Slurm that you are running and found that it's 14.11.9, which lacks many scalability improvements for large job counts and job arrays. For one example of the enhancements, each task of a job array occupies a separate job table record in version 14.11 rather than one record for the entire job array, so the number of job records Slurm is managing in your case is on the order of 60,000. Also, the backfill scheduler in your environment is running for at most 30 seconds and only getting through 200 pending jobs (with thousands of running jobs, it's likely trying to run each one at hundreds of different start times).

If possible, I would strongly recommend upgrading to version 15.08. One of our customers with a similar number of nodes and overlapping queues is running >1 million jobs per day with version 15.08.7.

In terms of what you can do with version 14.11, here are some suggested changes to SchedulerParameters, currently:
bf_continue,defer,bf_max_job_test=10000,bf_resolution=600

Add:

bf_max_job_part=50
    The backfill scheduler will only test the first 50 jobs per partition.

bf_max_job_array_resv=10
    The backfill scheduler will only reserve resources for the first 10 tasks of a pending job array.

bf_interval=60
    Run the backfill scheduler once every 60 seconds; this will check more jobs, less frequently.

The net result will be this:
SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=50,bf_max_job_array_resv=10,bf_interval=60

I'd suggest that you give that a try, and if things aren't running a lot better, send another set of logs in a day or two.
>> Are there some specific jobs or partitions that I should look at? It's the 'b1' partition that we notice the problem most often. This partition has a small number of 256GB nodes and a much larger number of 24GB nodes. If there are a large number of jobs requiring large memory (i.e. 256GB nodes) such that most are pending, then any jobs submitted with lesser memory requirements (which could run on the 24GB nodes) stay pending. We could probably solve this by simply pulling out the 256GB nodes from that partition, but our feeling is that slurm should handle this? Thanks, Steve.
(In reply to steven fellini from comment #31)
> >> Are there some specific jobs or partitions that I should look at?
>
> It's the 'b1' partition that we notice the problem most often.
>
> This partition has a small number of 256GB nodes and a much larger number of
> 24GB nodes. If there are a large number of jobs requiring large memory
> (i.e. 256GB nodes) such that most are pending, then any jobs submitted with
> lesser memory requirements (which could run on the 24GB nodes) stay pending.
>
> We could probably solve this by simply pulling out the 256GB nodes from that
> partition, but our feeling is that slurm should handle this?

It could, but the overhead in the backfill scheduler would be huge for the number of jobs that you have. Assuming the top priority job in the partition/queue can't run, the backfill scheduler needs to determine when and where the pending jobs can run. This is how the algorithm works at a high level:

1. Determine which pending jobs could possibly be run (e.g. dependencies and limits OK) and put them into a sorted list. With Slurm version 14.11, each task of a job array will have a separate job record (not so in v15.08).
2. Put every running job into a list sorted by expected end time (based upon its time limit).
3. For each pending job:
       test if it can start immediately
       for each running job:
           simulate the termination of that running job
           test if the pending job can start then
           if so, mark those resources off limits for that job's run time

Given the number of your jobs (you've got on the order of 20,000 pending jobs and 5,000 running jobs), that means 100 million tests need to be performed, each of which is not particularly fast because so many things need to be checked.
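The three steps above can be sketched as a toy single-resource simulation (this is illustrative only and not Slurm's implementation - it assumes whole-node, single-partition jobs and ignores per-node CPU/memory bitmaps, limits, and reservations):

```python
# Minimal sketch of the backfill loop: the top blocked job earmarks idle
# nodes until its projected start time; lower-priority jobs may only
# backfill if they fit now AND their time limit fits before that time.

def backfill(total, running, pending):
    """total: node count; running: [(end_time, nodes_held)];
    pending: [(priority, nodes_needed, time_limit)] sorted by priority desc.
    Returns priorities of jobs that can start immediately."""
    free = total - sum(n for _, n in running)
    started = []
    reserve_at = None  # projected start time of the first blocked job
    for prio, need, limit in pending:
        if need <= free and (reserve_at is None or limit <= reserve_at):
            started.append(prio)       # starts now, or backfills into the gap
            free -= need
        elif need > free and reserve_at is None:
            # Simulate running-job terminations (step 2 list) to project
            # when this job could start, then earmark idle nodes until then.
            avail = free
            for end, held in sorted(running):
                avail += held
                if avail >= need:
                    reserve_at = end
                    break
    return started

# 192 nodes, 138 busy (two running "jobs" for simplicity), 54 idle.
# Top job wants 64 nodes; a long small job is blocked; a short one backfills.
day10 = 864000
assert backfill(192, [(100, 70), (200, 68)],
                [(987731, 64, day10),   # blocked, earmarks nodes until t=100
                 (149, 5, day10),       # fits now but limit > gap -> pending
                 (10, 5, 50)]) == [10]  # fits the gap -> backfilled
```

This also reproduces the ibfdr situation described in comment #10: the 54 idle nodes stay vacant for every pending job whose time limit exceeds the projected start of job 6394506.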
Here are some ways to improve matters:

* Upgrading to Slurm version 15.08 will streamline handling of the job arrays and get you improved scheduling algorithms and more tuning options (those will buy you perhaps a 30% improvement).
* Configuring bf_interval=300 (higher than suggested in comment #30) would run the backfill scheduler less frequently, but go deeper into the queue.
* Configuring bf_max_job_user may be helpful (see the slurm.conf man page for details).
* Putting the larger memory nodes in their own partition would keep the small and large memory jobs in separate queues.
Do you have any update on this?
As of Friday afternoon, we were still seeing the problem in the 'b1' partition. Currently no pending jobs in that partition (early in the week), but we'll continue to watch. SchedulerParameters currently at: SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=50,bf_max_job_array_resv=10,bf_interval=300 Steve.
(In reply to steven fellini from comment #34)
> As of Friday afternoon, we were still seeing the problem in the 'b1'
> partition. Currently no pending jobs in that partition (early in the week),
> but we'll continue to watch.
>
> SchedulerParameters currently at:
>
> SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,
> bf_resolution=600,bf_max_job_part=50,bf_max_job_array_resv=10,bf_interval=300

Given the number of running and pending jobs, the backfill scheduler will need to be tuned to look at a smaller number of them. You've got some options with version 14.11:

* Configuring "bf_max_job_user" should be helpful (see the slurm.conf man page for details).
* The "bf_min_age_reserve" configuration parameter may also be helpful (see the slurm.conf man page for details).
* Putting the larger memory nodes in their own partition would keep the small and large memory jobs in separate queues, but would make the system more difficult to use.

Upgrading to version 15.08 should give you some additional relief: it streamlines handling of job arrays and gets you improved scheduling algorithms and more tuning options (those will buy you perhaps a 30% improvement).
Any updates?
Moe,

We've prevailed on the one or two users who were submitting large numbers of large-memory jobs to the 'b1' partition (even when they didn't need that much memory) to cut back. So we haven't been seeing any problems this week, since the piling up of jobs for the larger memory nodes was what was preventing other jobs from getting scheduled on the smaller memory ones.

We are planning to move to 15.08 asap but are not confident that will change the issue. We really don't want to have to create lots of partitions to suit hardware characteristics - just the opposite, we want to have as few partitions as possible. So, we're still thinking of this as a deficiency, but for the time being not a hardship. That could change tomorrow.

Thanks for the time you've put into this.
Created attachment 2817 [details] cluster state 2016-03-04 We upgraded to 15.08 a few days ago and our scheduling problems are worse than ever. I've raised impact to "High Impact" as we are getting complaints from our users. We have IDLE nodes in the 'norm' and 'gpu' partitions with many jobs pending. Enclosing cluster state as of a few minutes ago.
Do you have any advanced reservations? (see "scontrol show res")
[steve@biowulf ~]$ scontrol show res No reservations in the system
Same slurm.conf as attached, correct? If not, please attach new one.
Created attachment 2818 [details] slurm.conf current slurm.conf attached
I noticed that your squeue output does not include the interactive jobs, while they clearly exist based upon the sinfo and "scontrol show job" output. Since you have overlapping partitions, I really need that information to perform an analysis.

The way Slurm handles overlapping partitions is that once the first job in a partition/queue can't be scheduled due to lack of resources, the nodes associated with that partition are removed from consideration for jobs in other partitions. This prevents the jobs in lower priority partitions from being allocated those resources and delaying jobs in higher priority partitions. (The jobs can still use nodes outside of the blocked higher priority partition.) That can trigger a cascade effect with respect to scheduling resources, which might be what you are seeing.
Moe, Not sure I understand. The squeue output I sent does include several jobs in the interactive partition, all of them in R state. None are pending for resources.
(In reply to steven fellini from comment #45)
> Moe,
> Not sure I understand. The squeue output I sent does include several jobs
> in the interactive partition, all of them in R state. None are pending for
> resources.

Mea culpa. The partition field was truncated:

$ grep inter squeue.out
15397361 interacti 2016-03-03  992060 1-12:00:00  R   19:10:04      1    2        8G cn1738

Exactly which version of 15.08 did you install?
Here are a few more things that may be relevant:

Jobs are sorted for scheduling first by partition priority, then by job priority. That means all of the jobs in partitions interactive, niddk, nimh, ccr, ccrclin, and b1 get tested for resources before any jobs in partitions gpu and norm - so the ordering of the squeue output may be misleading.

Your configuration has "bf_max_job_user=100", so only the first 100 jobs for each user get tested (based upon the ordering described above). The attached squeue output does not include user information, and while the output of "scontrol show job" does contain that information, that's a lot of data to parse through. Perhaps "bf_max_job_user" should be configured higher. Note that each task of a job array counts against the "bf_max_job_user" limit.
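The scheduling order described above - partition priority first, then job priority - can be shown with a tiny made-up example (job names and priority numbers are hypothetical): even a very high job priority in a low-priority partition is tested after everything in higher-priority partitions.

```python
# Toy illustration of the assumed sort order: partition priority takes
# precedence over job priority, both descending.

jobs = [
    ("norm_job",  {"part_prio": 1, "job_prio": 999999}),
    ("b1_job",    {"part_prio": 5, "job_prio": 100}),
    ("inter_job", {"part_prio": 9, "job_prio": 50}),
]
order = sorted(jobs, key=lambda j: (-j[1]["part_prio"], -j[1]["job_prio"]))

# norm_job is tested last despite its huge job priority:
assert [name for name, _ in order] == ["inter_job", "b1_job", "norm_job"]
```

This is why sorting squeue output by job priority alone can give a misleading picture of the testing order.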
I'm not sure exactly what is happening, but would like to suggest increasing the values for bf_max_job_part and bf_max_job_user as shown below, then run "scontrol reconfig" and see if that gets jobs onto those idle nodes.

Current values:
SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=50,bf_max_job_array_resv=10,bf_interval=300,bf_max_job_user=100

Suggested values:
SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=300,bf_max_job_array_resv=10,bf_interval=300,bf_max_job_user=200
(In reply to Moe Jette from comment #46)
> Mea culpa. The partition field was truncated:
> $ grep inter squeue.out
> 15397361 interacti 2016-03-03  992060 1-12:00:00  R   19:10:04      1    2        8G cn1738
>
> Exactly which version of 15.08 did you install?

15.08.8
It may also be helpful if I can get a copy of your slurmctld log file (at /var/log/slurm/ctld.log).
Created attachment 2819 [details] slurmdbd.conf
(In reply to Moe Jette from comment #50) > It may also be helpful if I can get a copy of your slurmctld log file (at > /var/log/slurm/ctld.log). Sorry, misread your msg, log file on its way.
Created attachment 2820 [details] ctld.log
In your job logs, I see the top priority pending job in the "gpu" queue is waiting with "Reason=QOSMaxCpuPerUserLimit", and that same user has a large number of pending jobs in the queue. Can you tell me what is happening with respect to that limit?

That one user's jobs could block the entire queue due to "bf_max_job_part=50".

Here's the job:
JobId=15329061 JobName=job-9.sh
   UserId=joehanesr(34049) GroupId=joehanesr(34049)
   Priority=470137 Nice=0 Account=munson QOS=gpu
   JobState=PENDING Reason=QOSMaxCpuPerUserLimit Dependency=(null)
FYI, there are thousands of entries in the log about users trying to cancel each other's jobs. I'm hoping that's accidental, but that's a lot of accidents. The messages look like this:

[2016-03-01T12:21:13.727] error: Security violation, JOB_CANCEL RPC for jobID 15315274 from uid 34966
[2016-03-02T10:02:01.785] error: Security violation, JOB_CANCEL RPC for jobID 15343675 from uid 36777
[2016-03-03T14:17:50.220] error: Security violation, JOB_CANCEL RPC for jobID 15329225 from uid 34385
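To see who is generating these, a quick summary of JOB_CANCEL attempts per requesting uid can be made with a small script (a hypothetical helper, with the log format assumed from the sample lines above):

```python
import re
from collections import Counter

# Count "Security violation, JOB_CANCEL" messages per requesting uid,
# matching the slurmctld log format shown in the samples above.
PAT = re.compile(r"Security violation, JOB_CANCEL RPC for jobID (\d+) from uid (\d+)")

def cancel_attempts_by_uid(lines):
    counts = Counter()
    for line in lines:
        m = PAT.search(line)
        if m:
            counts[m.group(2)] += 1  # group(2) is the offending uid
    return counts

log = [
    "[2016-03-01T12:21:13.727] error: Security violation, JOB_CANCEL RPC for jobID 15315274 from uid 34966",
    "[2016-03-02T10:02:01.785] error: Security violation, JOB_CANCEL RPC for jobID 15343675 from uid 36777",
    "[2016-03-02T10:05:00.000] error: Security violation, JOB_CANCEL RPC for jobID 15343676 from uid 36777",
]
assert cancel_attempts_by_uid(log) == {"34966": 1, "36777": 2}
```

In practice you would feed it the whole ctld.log (e.g. `cancel_attempts_by_uid(open("ctld.log"))`) and look at the top offenders.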
(In reply to Moe Jette from comment #55) > FYI there are thousands of entries in the log about users trying to cancel > each others jobs. I'm hoping that's accidental, but that's a lot of > accidents. The messages look like this: Yes, we're aware of that and trying to figure out what's going on.
(In reply to Moe Jette from comment #54)
> In your job logs, I see the top priority pending job in the "gpu" queue is
> waiting due with "Reason=QOSMaxCpuPerUserLimit" and that same user has a
> large number of pending jobs in the queue. Can you tell me what is happening
> with respect to that limit?
>
> That one user's jobs could block the entire queue due to
> "bf_max_job_part=50".

Hi Moe,

Right now:
SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=300,bf_max_job_array_resv=10,bf_interval=300,bf_max_job_user=200

The user joehanesr has 1 job queued for QOSMaxCpuPerUserLimit and 363 jobs queued for Priority on this partition.

Shouldn't the scheduler look through 200 of joehanesr's jobs and then move on to the next user? The next user's jobs could run because we have idle GPU nodes available.

Secondly, we had bf_max_job_part=50 until this morning. We routinely have users with 1000s of jobs or job-array subjobs queued. Shouldn't we have seen this problem before now?
Created attachment 2824 [details] sdiag part 1 Moe, I'm attaching 3 files, each of which plots 'sdiag' output over the past 24 hours. Don't know if will help, but...
Created attachment 2825 [details]
sdiag part 2
Created attachment 2826 [details]
sdiag part 3
(In reply to Susan Chacko from comment #57)
> (In reply to Moe Jette from comment #54)
> > In your job logs, I see the top priority pending job in the "gpu" queue is
> > waiting due with "Reason=QOSMaxCpuPerUserLimit" and that same user has a
> > large number of pending jobs in the queue. Can you tell me what is happening
> > with respect to that limit?
> >
> > That one user's jobs could block the entire queue due to
> > "bf_max_job_part=50".
> >
> > Here's the job:
> > JobId=15329061 JobName=job-9.sh
> > UserId=joehanesr(34049) GroupId=joehanesr(34049)
> > Priority=470137 Nice=0 Account=munson QOS=gpu
> > JobState=PENDING Reason=QOSMaxCpuPerUserLimit Dependency=(null)
>
> Hi Moe,
>
> Right now,
> SchedulerParameters =
> bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,
> bf_max_job_part=300,bf_max_job_array_resv=10,bf_interval=300,
> bf_max_job_user=200
>
> The user joehanesr has 1 job queued for QOSMaxCpuPerUserLimit and 363 jobs
> queued for Priority, on this partition.
>
> Shouldn't the scheduler look through 200 of joehanesr's jobs, and then move
> on to the next user? The next user's jobs could run because we have idle GPU
> nodes available.

Correct, but with "bf_max_job_part=50" user "joehanesr" would basically block the queue. That's the conclusion I came to based upon log records like those below:

[2016-03-03T09:31:08.215] debug2: job 15329062 being held, if allowed the job request will exceed QOS gpu max tres(cpu) per user limit 128 with already used 128 + requested 16
[2016-03-03T09:31:08.215] debug3: backfill: Failed to start JobId=15329062: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

> Secondly, we had bf_max_job_part=50 until this morning. We routinely have
> users with 1000s of jobs or job-array-subjobs queued. Shouldn't we have seen
> this problem before now?

There is definitely a distinction between jobs and job arrays here. Job arrays are a single job record, and based upon your "bf_max_job_array_resv=10" configuration, only 10 of the tasks will be examined and counted against the limit of 50. User "joehanesr" has a few hundred separate jobs rather than a job array.
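The interaction Susan is asking about can be illustrated with a toy model. This is not Slurm source code; the function and counters below are made up for illustration, but they follow the behavior described above: the backfill scheduler walks the queue in priority order and stops examining jobs in a partition, or from a user, once the respective bf_max_job_* budget is spent.

```python
# Toy model (not Slurm source) of how bf_max_job_part and bf_max_job_user
# cap which pending jobs the backfill scheduler will even examine. Jobs
# are walked in priority order and count against both budgets.

def jobs_backfill_examines(queue, bf_max_job_part, bf_max_job_user):
    """queue: list of (user, partition) tuples in priority order.
    Returns the jobs examined before the budgets cut things off."""
    per_part, per_user, examined = {}, {}, []
    for user, part in queue:
        if per_part.get(part, 0) >= bf_max_job_part:
            continue  # partition budget exhausted: skip
        if per_user.get(user, 0) >= bf_max_job_user:
            continue  # this user's budget exhausted: skip
        per_part[part] = per_part.get(part, 0) + 1
        per_user[user] = per_user.get(user, 0) + 1
        examined.append((user, part))
    return examined

# 363 jobs from one user at the top of the gpu partition, then other users:
queue = [("joehanesr", "gpu")] * 363 + [("other", "gpu")] * 20

# With the old bf_max_job_part=50, the partition budget is spent before any
# other user's job is even looked at:
seen = jobs_backfill_examines(queue, bf_max_job_part=50, bf_max_job_user=200)
assert all(user == "joehanesr" for user, _ in seen)

# With bf_max_job_part=300, the user budget (200) trips first, and the other
# user's jobs do get examined:
seen = jobs_backfill_examines(queue, bf_max_job_part=300, bf_max_job_user=200)
assert ("other", "gpu") in seen
```

This is also why the job-array distinction matters: an array counts as one job record (with at most bf_max_job_array_resv tasks examined), while hundreds of separate jobs each consume the budget.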
(In reply to Susan Chacko from comment #56)
> (In reply to Moe Jette from comment #55)
> > FYI there are thousands of entries in the log about users trying to cancel
> > each others jobs. I'm hoping that's accidental, but that's a lot of
> > accidents. The messages look like this:
>
> Yes, we're aware of that and trying to figure out what's going on.

Ok, we talked to a couple of users and figured out what was going on. One user had run 'scancel' with no arguments, in an attempt to delete all his own jobs. Another user had run 'scancel --jobname somejob' with a generic enough jobname that it matched some of another user's jobs. So: user error in all cases that we've investigated. We're telling them to use '-u user' for all scancels, to keep our logs cleaner.
> Secondly, we had bf_max_job_part=50 until this morning.

How are things working with the higher value?

Judging from what I know of your workload now, I would expect utilization to be much better.
(In reply to Moe Jette from comment #63)
> > Secondly, we had bf_max_job_part=50 until this morning.
>
> How are things working with the higher value?
>
> Judging from what I know of your workload now, I would expect utilization to
> be much better.

Better in the norm partition and the ccr partition.

In the GPU partition, we still have a clog, but based on your explanation, that's because of joehanesr's 360+ jobs. We have MaxSubmit set to 4000 and MaxArraySize set to 1001, so any user could potentially submit 4000 independent jobs, or an array of 1000 subjobs. We want users to be able to submit large numbers of jobs, as that's the nature of our workload.

What would be the implications of setting bf_max_job_part to a really large value, like 5000, to prevent any single user from blocking the queue?
(In reply to Susan Chacko from comment #64)
> (In reply to Moe Jette from comment #63)
> > > Secondly, we had bf_max_job_part=50 until this morning.
> >
> > How are things working with the higher value?
> >
> > Judging from what I know of your workload now, I would expect utilization to
> > be much better.
>
> Better in the norm partition and the ccr partition.
>
> In the GPU partition, we still have a clog, but based on your explanation,
> that's because of joehanesr's 360+ jobs. We have MaxSubmit set to 4000, and
> MaxArraySize set to 1001 so any user could potentially submit 4000
> independent jobs, or an array of 1000 subjobs. We want users to be able to
> submit large numbers of jobs, as that's the nature of our workload.
>
> What would be the implications of setting bf_max_job_part to a really large
> value, like 5000, to prevent any single user from blocking the queue?

The fundamental issue is that backfill scheduling can be a very time-consuming operation (see comment #32 for the details). The intent of the bf_max_job_part and bf_max_job_user configuration parameters is to reduce the overhead of backfill scheduling so that a greater variety of jobs can be considered for starting. The idea is that a user's jobs will tend to be similar, or subject to similar limits. In this particular case, determining exactly when and where joehanesr's 360+ jobs should start is probably not beneficial.

Setting bf_max_job_part to a value much higher than bf_max_job_user would be beneficial in your case. Rather than increasing bf_max_job_part to 5000, would decreasing bf_max_job_user handle your workload better? Perhaps some combination of changing them both, to keep the user limit well below the partition limit? The downside is that if a user submits large numbers of jobs to multiple partitions, the jobs in the lower-priority partitions would never be considered by the backfill scheduler.

Realistically, you'll want to keep the total number of pending jobs that the backfill scheduler considers on the order of 1000 for performance reasons (as described in comment #32, each of those jobs is being considered for starting at about 5000 different times, for a total of about 5,000,000 fairly heavyweight operations).
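As a back-of-envelope check of those numbers (the per-job evaluation count is just the rough figure cited above, not something Slurm reports directly):

```python
# Back-of-envelope check of the scaling argument above: the backfill cost
# grows roughly as (pending jobs considered) x (times each job is evaluated).
pending_jobs = 1000
eval_points_per_job = 5000   # rough per-job figure cited from comment #32

full_simulation_ops = pending_jobs * eval_points_per_job
assert full_simulation_ops == 5_000_000

# By contrast, a check that only asks "can this job start right now?" is a
# single test per pending job:
quick_check_ops = pending_jobs
assert full_simulation_ops // quick_check_ops == eval_points_per_job
```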
Do you have any updates on this?
(In reply to Moe Jette from comment #66)
> Do you have any updates on this?

So far we seem to be doing okay with most of the partitions, except for the gpu partition, where one user's jobs seem to get in the way of scheduling other users' gpu jobs. We have been manually bumping the priority on those stuck jobs when we see gpu resources available.

OTOH, we are currently at:
SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=300,bf_max_job_array_resv=10,bf_interval=300,bf_max_job_user=200

And we are thinking about increasing bf_max_job_part=300 to bf_max_job_part=600. Do you feel that should be okay or have any benefit?

Thanks for your help with this!
(In reply to rl303f from comment #67)
> (In reply to Moe Jette from comment #66)
> > Do you have any updates on this?
>
> So far we seem to be doing okay with most of the partitions except for the
> gpu partition where one user's jobs seem to get in the way of scheduling
> other users' gpu jobs. We have been manually bumping the priority on those
> stuck jobs when we see gpu resources available.
>
> OTOH, we are currently at:
>
> SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,
> bf_resolution=600,bf_max_job_part=300,bf_max_job_array_resv=10,
> bf_interval=300,bf_max_job_user=200
>
> And we are thinking about increasing bf_max_job_part=300 to
> bf_max_job_part=600.
>
> Do you feel that should be okay or have any benefit?
>
> Thanks for your help with this!

The downside is that the backfill scheduler will likely be running continuously. Even then, it may not get through all of the jobs, since it will start over after 300 seconds (bf_interval=300).

If you do increase those limits (user and partition), I would also increase bf_interval and monitor the output of the sdiag command to see how much time the backfill scheduler is running and how many of the jobs it is processing. System responsiveness (i.e. time responding to any Slurm command) may be slightly adversely affected.
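For the monitoring suggested above, something like the following could track backfill cycle times from sdiag output. The section and field names here come from a sample sdiag run and may differ across Slurm versions, so treat the text format as an assumption rather than a stable interface:

```python
# Sketch for monitoring backfill cycle times from sdiag output. The section
# and field names below come from a sample run and may differ across Slurm
# versions -- treat the text format as an assumption, not a stable interface.

SAMPLE = """\
Main schedule statistics (microseconds):
\tLast cycle:   1500
Backfilling stats
\tTotal backfilled jobs (since last slurm start): 12345
\tLast cycle:   40000000
"""

def backfill_last_cycle_seconds(sdiag_text):
    """Return the backfill section's 'Last cycle' time, converted from
    the microseconds sdiag reports into seconds (None if not found)."""
    in_backfill = False
    for line in sdiag_text.splitlines():
        if line.startswith("Backfilling stats"):
            in_backfill = True
        elif in_backfill and line.strip().startswith("Last cycle:"):
            return int(line.split(":", 1)[1]) / 1_000_000
    return None

# In production this would parse the output of running sdiag (e.g. via
# subprocess) rather than the embedded sample text.
assert backfill_last_cycle_seconds(SAMPLE) == 40.0
```

Note the guard for the "Main schedule statistics" section, which has its own "Last cycle" field; only the backfill section's value is wanted here.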
I wanted to check in with you to see how scheduling is working with your current configuration, and to propose a change in the backfill scheduling algorithm.

As previously discussed, the backfill scheduling algorithm is very heavyweight due to a nested loop in which the expected start time of each pending job is determined by simulating the termination of each running job at the end of its time limit (and for however many running jobs need to release resources for the pending job under consideration).

The proposed change would short-circuit that logic for some jobs. Specifically, if a pending job's priority is below some configured threshold OR its time in the pending state is below some threshold, then only determine whether the job can start immediately. If it can't start immediately, the job will be left pending with no resources reserved for it. The idea is that if most jobs don't satisfy the configured threshold, the backfill algorithm changes from order NxM to order N (where N is the number of pending jobs and M is the number of running jobs), which permits far more jobs to be considered.

Do you think that would be helpful in the case of your workload?
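A rough sketch of the proposed short-circuit, with invented names (`Job`, `fits_now`, `backfill_pass`, and the threshold values are all illustrative, not the actual patch):

```python
from dataclasses import dataclass

@dataclass
class Job:
    priority: int
    pending_age: int   # seconds the job has spent pending
    fits_now: bool     # could it start immediately on idle resources?
    started: bool = False
    reserved: bool = False

def backfill_pass(jobs, prio_threshold, age_threshold):
    """One pass over priority-ordered pending jobs; returns how many
    jobs paid for the expensive start-time simulation."""
    simulations = 0
    for job in jobs:
        if job.priority < prio_threshold or job.pending_age < age_threshold:
            # Short-circuit: only an immediate-start test, and no
            # resources are reserved if the job cannot start now.
            if job.fits_now:
                job.started = True
            continue
        # Full treatment: simulate running jobs ending at their time
        # limits to find a start time, then reserve resources for the
        # job (modeled here as just a counter plus a flag).
        simulations += 1
        if job.fits_now:
            job.started = True
        else:
            job.reserved = True
    return simulations

jobs = [Job(priority=500000, pending_age=7200, fits_now=False)] \
     + [Job(priority=90000, pending_age=60, fits_now=True) for _ in range(100)]
sims = backfill_pass(jobs, prio_threshold=100000, age_threshold=3600)
# Only the one high-priority job pays for the expensive simulation; the
# other 100 still start because they fit immediately.
assert sims == 1
assert sum(j.started for j in jobs) == 100
```

The key property is that the cost of the full simulation is paid only per qualifying job, so with thresholds that exclude most of the queue, the pass stays close to order N.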
We recently did some work on another bug that may dramatically increase the number of jobs Slurm's backfill scheduler can handle. See:
https://bugs.schedmd.com/show_bug.cgi?id=2565#c2

With that change, a full analysis of when and where pending jobs can start will be performed only for jobs above some particular priority, or for jobs that have been pending for at least some threshold. After that, it will do a quick check to see if any of the other jobs can be started immediately on remaining resources.

You would also want to remove or greatly increase these parameters:
bf_max_job_part=50,bf_max_job_user=100

NERSC will be testing in the week of March 21.
(In reply to Moe Jette from comment #70)
> We recently done some work on another bug that may dramatically increase the
> number of jobs Slurm's backfill scheduler can handle. See:
> https://bugs.schedmd.com/show_bug.cgi?id=2565#c2
>
> With that change, a full analysis of when and where pending jobs can start
> will be performed only for jobs above some particular priority or jobs that
> have been pending for at least some threshold. After that, it will do a
> quick check to see if any of the other jobs can be started immediately on
> remaining resources.
>
> You would also look to remove or greatly increase these parameters:
> bf_max_job_part=50,bf_max_job_user=100
>
> NERSC will be testing in the week of March 21.

Thanks Moe, if NERSC reports success we'll apply the patch.
(In reply to steven fellini from comment #71)
> Thanks Moe, if NERSC reports success we'll apply the patch.

NERSC reports very good results. See:
https://bugs.schedmd.com/show_bug.cgi?id=2565#c8
We are running 15.08.9 with the NERSC patch on our development cluster; we'll get it going on production early next week.
(In reply to steven fellini from comment #74)
> We are running 15.08.9 with the NERSC patch on our development cluster;
> we'll get it going on production early next week.

We went ahead and moved our development cluster up to 15.08.10 due to the "backfill scheduler race condition that could cause invalid pointer in select/cons_res plugin. Bug introduced in 15.08.9."

However, our attempt to apply the NERSC patch (bug_2565.patch) from https://bugs.schedmd.com/attachment.cgi?id=2886 resulted in an error and it would not apply. (It did successfully apply to 15.08.9.) Below is the error:

$ git apply /usr/local/src/slurm-15.08/bug_2565.patch
/usr/local/src/slurm-15.08/bug_2565.patch:53: space before tab in indent.
        if (sched_params &&
error: patch failed: src/plugins/sched/backfill/backfill.c:964
error: src/plugins/sched/backfill/backfill.c: patch does not apply

We just want to confirm: is this because 15.08.10 already includes the NERSC bug_2565 code, thereby making the patch unnecessary? Or does 15.08.10 still need the patch, but some code incompatibility prevents it from applying successfully?

Thank you!
Created attachment 2992 [details]
Backport of bug_2565 patch to v15.08.10

We do not plan to add this functionality to version 15.08, for the sake of improved stability of that version relatively late in its release cycle. I've attached a version of the patch that will apply cleanly to v15.08.10.
(In reply to Moe Jette from comment #76)
> Created attachment 2992 [details]
> Backport of bug_2565 patch to v15.08.10
>
> We do not plan to add this functionality to version 15.08 for the sake of
> improved stability of that version, relatively late in it's release cycle.
> I've attached a version of the patch that will apply cleanly to v15.08.10.

Many thanks, Moe. The new patch applied successfully. I guess we better start thinking about moving up to 16.05 soon.
Have you found the bf_min_prio_reserve option helpful? How are things running now?
(In reply to Moe Jette from comment #78)
> Have you found the bf_min_prio_reserve option helpful?
> How are things running now?

Moe,

We put the patch into effect yesterday morning, setting bf_min_prio_reserve to 100000. We think there has been an improvement. We're seeing two things: (1) the time of backfilling cycles has dropped drastically, from mean times of near 300 s during peak hours to around 40 s, and (2) almost all jobs with prio>100000 now have a start time.

We haven't run with the patch long enough to be sure, but I think you can close the ticket; if problems reemerge we can always open it again.

Thanks for your help.
Steve.
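For reference, combining the SchedulerParameters reported earlier in this thread with the bf_min_prio_reserve=100000 value described above would presumably look something like this in slurm.conf (the exact combination is an assumption, not a confirmed site config):

```
SchedulerParameters=bf_continue,defer,bf_max_job_test=10000,bf_resolution=600,bf_max_job_part=300,bf_max_job_array_resv=10,bf_interval=300,bf_max_job_user=200,bf_min_prio_reserve=100000
```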
(In reply to steven fellini from comment #79)
> We haven't run with the patch long enough to be sure, but I think you can
> close the ticket and if problems reemerge we can always open it again.

Excellent! That's about what NERSC found.

WARNING: You will need to treat this as a local patch until upgrading to version 16.05. If you do upgrade to a newer version of 15.08, the patch may not apply properly (it may put some new code in the wrong place) and the result may crash slurmctld. There is a ticket related to the patch apply issue here:
https://bugs.schedmd.com/show_bug.cgi?id=2634