Created attachment 650 [details] slurmctld debug output around a server_thread_count limit event

We've noticed slurmctld regularly hitting its server_thread_count limit. I've attached some debugging output from around the time of the problem. At 10:12:23, ~60 jobs are preempted and requeued, and a number of dump_job_user and dump_job_single RPCs are also being processed that are taking an unusually long time. About a minute later, at 10:13, slurmctld hits the thread limit several times. At that point it's basically unresponsive: it can't process any more RPCs until the data structures that some pending RPCs depend on are unlocked, freeing up slots for new incoming RPCs to be processed.

A few questions and thoughts about this situation:

1. Is this being caused primarily by the job churn (rescheduling ~60 jobs at once)? Is there a way to rate limit this behavior, or to have slurmctld handle it more gracefully so it can gradually schedule/requeue jobs and remain responsive to incoming RPCs? We've seen similar behavior when scheduling large (4k+ index) job arrays.

2. It seems increasing MAX_SERVER_THREADS won't help in this case, since new incoming RPCs are almost certainly going to block in the same way the existing RPCs are. There would be more threads trying to process RPCs, but any increase in MAX_SERVER_THREADS would result in the same unresponsiveness because all the available threads would still be blocking on the same locks.

3. It would be really, really useful to be able to introspect into the thread pool. Being able to tell what type of RPC each thread is processing, plus some other basic information (such as the current elapsed time for that RPC and the user and host initiating it), would let us at least triage these situations, if only by quelling the offending user (if user behavior is in fact causing the RPCs in question in a particular case).

4. It would also be nice if slurmctld had some sort of rate limiting for user-initiated RPCs. We've encountered cases where users run squeue or sacct, become frustrated by the lack of response, and run them repeatedly trying to get them to run to completion. I haven't assembled any evidence that this contributes to slurmctld unresponsiveness, but having a way for slurmctld to shed "unnecessary" load by rate limiting s* tool access by unprivileged users might help, and could leave some headroom so privileged (i.e., administrative) users can troubleshoot slurmctld when an adverse event happens.

Let me know what you think, especially about MAX_SERVER_THREADS, as there's some controversy here about whether increasing its value will help.

thanks,
-john
Could you attach your slurm.conf file? The number of squeue commands being executed seems very high (like 15 at 10:12:17 and a bunch more at 10:12:18). Are you running squeue from prolog or epilog scripts? If so, could you attach those as well? There may be lighter-weight options to accomplish your goal.
I concur with your assessment of MAX_SERVER_THREADS. Raising the limit will only add more contention for the locks and not increase throughput.
Created attachment 651 [details] slurm.conf
Your problem is right here:

SchedulerParameters=default_queue_depth=10000

I would recommend removing default_queue_depth to use the default value of 100. The backfill scheduler will probably be scheduling most jobs anyway. This is from the slurm.conf man page:

default_queue_depth=#
    The default number of jobs to attempt scheduling (i.e. the queue depth) when a running job completes or other routine actions occur. The full queue will be tested on a less frequent basis. The default value is 100. See the partition_job_depth option to limit depth by partition. In the case of large clusters (more than 1000 nodes), configuring a relatively small value may be desirable. Specifying a large value (say 1000 or higher) can be expected to result in poor system responsiveness since this scheduling logic will not release locks for other events to occur. It would be better to let the backfill scheduler process a larger number of jobs (see max_job_bf, bf_continue and other options here for more information).
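In slurm.conf terms, the suggested fix is just shrinking (or deleting, to get the default) that one value; a sketch of the resulting line, assuming no other SchedulerParameters options need to change:

```
# Shallow main-scheduler pass; leave deep queue scans to the backfill
# scheduler, which releases locks between jobs.
SchedulerParameters=default_queue_depth=100
```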
We're running squeue(1) once in our epilog (as 'squeue -ho %A -u "$SLURM_JOB_USER" -w localhost') to determine whether the user has any other jobs running on the current host, so we can kill any leftover processes, such as SSH sessions, etc. We also run srun(1) once in the prolog, if the job is running on the first node in the nodelist, and BatchFlag=1 for the job. We got that section of prolog from another facility and kept rolling with it because reasons, and I've never been entirely sure what it's doing. Is this something we need to be doing? -john
(In reply to Moe Jette from comment #4) > Your problem is right here: > SchedulerParameters=default_queue_depth=10000 > > I would recommend removing default_queue_depth to use the default value of > 100. The backfill scheduler will probably be scheduling most jobs anyway. Hm, we had default_queue_depth at 100 up until about a week ago, and had it at 1000 up until a couple hours ago. I'm not sure why Paul increased it; I'll ask him. In any event, the debug output I sent would have had default_queue_depth at 1000. I figure you still recommend bumping it down to 100?
(In reply to John Morrissey from comment #5)
> We also run srun(1) once in the prolog, if the job is running on the first
> node in the nodelist, and BatchFlag=1 for the job. We got that section of
> prolog from another facility and kept rolling with it because reasons, and
> I've never been entirely sure what it's doing. Is this something we need to
> be doing?

Here's the actual hunk:

--
this_host=$(hostname -s)
hostlist=$(squeue -j "$SLURM_JOB_ID" -ho %N)
lead_node=$(scontrol show hostname "$hostlist" | head -n 1)

# If this is the lead node, and there are hosts other than the lead node
# in this host list...
if [ "$lead_node" = "$this_host" ] &&
   echo "$hostlist" | fgrep -vx "$this_host"; then

    batchmode=$(
        scontrol -o show job "$SLURM_JOB_ID" |
            sed -e 's/.*[[:space:]]BatchFlag=\([^[:space:]]\{1,\}\).*/\1/g'
    )
    if [ "$batchmode" -eq 1 ]; then
        echo "running srun true on $lead_node to get prolog on" \
            'other hosts' | log
        su "$SLURM_JOB_USER" -c 'srun -J prolog true' 2>&1 | log
    fi
fi
--
(In reply to John Morrissey from comment #6)
> (In reply to Moe Jette from comment #4)
> > Your problem is right here:
> > SchedulerParameters=default_queue_depth=10000
> >
> > I would recommend removing default_queue_depth to use the default value of
> > 100. The backfill scheduler will probably be scheduling most jobs anyway.
>
> Hm, we had default_queue_depth at 100 up until about a week ago, and had it
> at 1000 up until a couple hours ago. I'm not sure why Paul increased it;
> I'll ask him.
>
> In any event, the debug output I sent would have had default_queue_depth at
> 1000. I figure you still recommend bumping it down to 100?

Absolutely. The scheduler is probably running constantly and slowing everything down.
(In reply to John Morrissey from comment #7)
> (In reply to John Morrissey from comment #5)
> > We also run srun(1) once in the prolog, if the job is running on the first
> > node in the nodelist, and BatchFlag=1 for the job. We got that section of
> > prolog from another facility and kept rolling with it because reasons, and
> > I've never been entirely sure what it's doing. Is this something we need to
> > be doing?
>
> Here's the actual hunk:
>
> --
> this_host=$(hostname -s)
> hostlist=$(squeue -j "$SLURM_JOB_ID" -ho %N)
> lead_node=$(scontrol show hostname "$hostlist" | head -n 1)
>
> # If this is the lead node, and there are hosts other than the lead node
> # in this host list...
> if [ "$lead_node" = "$this_host" ] &&
>    echo "$hostlist" | fgrep -vx "$this_host"; then
>
>     batchmode=$(
>         scontrol -o show job "$SLURM_JOB_ID" |
>             sed -e 's/.*[[:space:]]BatchFlag=\([^[:space:]]\{1,\}\).*/\1/g'
>     )
>     if [ "$batchmode" -eq 1 ]; then
>         echo "running srun true on $lead_node to get prolog on" \
>             'other hosts' | log
>         su "$SLURM_JOB_USER" -c 'srun -J prolog true' 2>&1 | log
>     fi
> fi
> --

Shooting yourself in the foot here too: The job's hostlist is available in the SLURM_NODELIST environment variable, so the squeue call is redundant. We can get the batch flag with minor code changes and eliminate the scontrol call also. I'll see if I can get you a patch for that.
Created attachment 652 [details] add SLURM_STEP_ID to Prolog environment

This will be in version 2.6.7, but you can apply it now to eliminate the script test for batch jobs. If a batch job, the SLURM_STEP_ID will be set as follows:

SLURM_STEP_ID=4294967294

Otherwise, you should see a small number with a value dependent upon the use case, but probably 0. Your prolog should now have no calls to slurm except if you want to spawn prologs on every node of a batch job. Sample environment for a batch job shown below:

SLURM_NODELIST=tux[12-18]
SLURMD_NODENAME=tux12
SLURM_JOBID=11
SLURM_STEP_ID=4294967294
SLURM_CONF=/home/jette/Desktop/SLURM/install.linux/etc/slurm.conf
SLURM_JOB_ID=11
PWD=/home/jette/Desktop/SLURM/install.linux/sbin
SLURM_JOB_USER=jette
SLURM_UID=1000
SLURM_JOB_UID=1000
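With that patch applied, the prolog's batch-job test could collapse to a plain environment check; a minimal sketch (the function names are mine, and the srun line is illustrative, not the exact original logic):

```shell
#!/bin/sh
# Hypothetical simplified prolog: everything needed is already in the
# environment, so no squeue RPCs to slurmctld are required.

# With the attachment 652 patch, SLURM_STEP_ID=4294967294 marks a
# batch job.
is_batch_job() {
    [ "${SLURM_STEP_ID:-}" = "4294967294" ]
}

# Lead node: first host in the job's node list. "scontrol show
# hostname" only expands the hostlist expression locally.
is_lead_node() {
    [ "$(scontrol show hostname "$SLURM_NODELIST" | head -n 1)" = \
      "$SLURMD_NODENAME" ]
}

if is_batch_job && is_lead_node; then
    # Illustrative: fan the prolog out to the other allocated nodes.
    su "$SLURM_JOB_USER" -c 'srun -J prolog true'
fi
```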
(In reply to John Morrissey from comment #5) > We're running squeue(1) once in our epilog (as 'squeue -ho %A -u > "$SLURM_JOB_USER" -w localhost') to determine whether the user has any other > jobs running on the current host, so we can kill any leftover processes, > such as SSH sessions, etc. > > We also run srun(1) once in the prolog, if the job is running on the first > node in the nodelist, and BatchFlag=1 for the job. We got that section of > prolog from another facility and kept rolling with it because reasons, and > I've never been entirely sure what it's doing. Is this something we need to > be doing? > > -john Whoever developed this script wrote it in the most inefficient fashion possible. The hostname and job host list are available in environment variables. After placing extra load on slurm to gather information that is otherwise available, it launches something on all nodes allocated to a batch job, which does absolutely nothing but slow down batch jobs. If this is the entirety of your Prolog, just remove it. What does your epilog look like? What about PrologSlurmctld? Slurm provides a lot of flexibility, it just looks like you've chained a bunch of anchors to it.
(In reply to Moe Jette from comment #11) > (In reply to John Morrissey from comment #5) > > We're running squeue(1) once in our epilog (as 'squeue -ho %A -u > > "$SLURM_JOB_USER" -w localhost') to determine whether the user has any other > > jobs running on the current host, so we can kill any leftover processes, > > such as SSH sessions, etc. > > Whoever developed this script wrote it in the most inefficient fashion > possible. The hostname and job host list are available in environment > variables. After placing extra load on slurm to gather information that is > otherwise available, it launches something on all nodes allocated to a batch > job, which does absolutely nothing but slow down batch jobs. If this is the > entirety of your Prolog, just remove it. Believe me, you're preaching to the choir. :-) We decided that we don't need anything else we had in the prolog, so we're no longer running it. > What does your epilog look like? The epilog only makes the one call to squeue(1). Are you concerned about its runtime, or just the quantity of RPCs it's making? > What about PrologSlurmctld? PrologSlurmctld saves the complete job script (it reads the job script and environment from Slurm's spool directory, does a little preprocessing and saves it to an SQL database). It's fast enough that I'm not worried about it, and it makes no RPCs to Slurm. Also, we've backed default_queue_depth down to 100, and I've... encouraged everyone here to refrain from turning that knob again without a solid reason. :-)
(In reply to John Morrissey from comment #12)
> (In reply to Moe Jette from comment #11)
> > (In reply to John Morrissey from comment #5)
> > > We're running squeue(1) once in our epilog (as 'squeue -ho %A -u
> > > "$SLURM_JOB_USER" -w localhost') to determine whether the user has any other
> > > jobs running on the current host, so we can kill any leftover processes,
> > > such as SSH sessions, etc.
> >
> > Whoever developed this script wrote it in the most inefficient fashion
> > possible. The hostname and job host list are available in environment
> > variables. After placing extra load on slurm to gather information that is
> > otherwise available, it launches something on all nodes allocated to a batch
> > job, which does absolutely nothing but slow down batch jobs. If this is the
> > entirety of your Prolog, just remove it.
>
> Believe me, you're preaching to the choir. :-)
>
> We decided that we don't need anything else we had in the prolog, so we're
> no longer running it.

Great! That should help.

> > What does your epilog look like?
>
> The epilog only makes the one call to squeue(1). Are you concerned about its
> runtime, or just the quantity of RPCs it's making?

I'm just trying to figure out where the bottlenecks are and to get your throughput up to an acceptable level. I'd be happy to review that to see if it can be sped up.

> > What about PrologSlurmctld?
>
> PrologSlurmctld saves the complete job script (it reads the job script and
> environment from Slurm's spool directory, does a little preprocessing and
> saves it to an SQL database). It's fast enough that I'm not worried about
> it, and it makes no RPCs to Slurm.
>
> Also, we've backed default_queue_depth down to 100, and I've... encouraged
> everyone here to refrain from turning that knob again without a solid
> reason. :-)

That knob can have a big impact upon performance.
Changing the value of any configuration parameter by two orders of magnitude from the default setting is rarely a good thing. Would you let me know how things are working with the new configuration in the next day or two?
Created attachment 663 [details] slurm_epilog

On Thu, Feb 27, 2014 at 11:59:04PM +0000, bugs@schedmd.com wrote:
> --- Comment #13 from Moe Jette <jette@schedmd.com> ---
> (In reply to John Morrissey from comment #12)
> > (In reply to Moe Jette from comment #11)
> > > What does your epilog look like?
> >
> > The epilog only makes the one call to squeue(1). Are you concerned about its
> > runtime, or just the quantity of RPCs it's making?
>
> I'm just trying to figure out where the bottlenecks are and trying to get
> your throughput up to an acceptable level. I'd be happy to review that to
> see if it can be sped up.

Epilog is attached; it typically exits within a second or two, presuming all of the user's processes have already exited. We could probably eliminate the squeue(1) call by getting the PIDs of running slurmstepds, and walking the process tree to get the usernames of their child PIDs, to see if the current user has any other jobs running on the current host. Does that seem reasonable?

> > Also, we've backed default_queue_depth down to 100, and I've... encouraged
> > everyone here to refrain from turning that knob again without a solid
> > reason. :-)
>
> That knob can have a big impact upon performance. Changing the value of any
> configuration parameter by two orders of magnitude from the default setting
> is rarely a good thing.
>
> Would you let me know how things are working with the new configuration in
> the next day or two.

We experienced an RPC lock contention event (for lack of a better term) today that correlated with a user submitting about four thousand jobs in a relatively small time period. slurmctld debug output for that hour is attached.

We wound up SIGTERMing slurmctld a couple of times, and backed default_queue_depth down even further, to 10, which seemed to make slurmctld much more responsive. Since the prolog was already out of the picture, this feels like slurmctld is...
ambitious about the rate at which it can spawn jobs, ends up getting itself deep into the weeds with lots of pending RPCs, and can't get itself back together without some intervention. Beyond whatever tuning or other fixes come from this as part of a more permanent solution, this kind of situation is why we're asking about the triage features I mentioned in my original bug submission. If slurmctld had reserved some RPC slots for administrative users, maybe we could have looked at the queue (dunno if lock contention would have precluded that at that point), or if there were better ways to introspect into a running slurmctld (a way to see what RPC calls were blocking, a breakdown of recent RPCs by user and/or host, etc.), we would have stood a better chance of fixing slurmctld in place instead of resorting to cruder methods. In combination with those introspection features, if we could have blacklisted RPCs from certain users or hosts, we could have dealt with the situation more concisely. In this case, we could have had our job_submit plugin drop further job submissions from this user, but I don't see a way to defend against excessive use of the user-facing s{acct,queue,etc.} tools, so in that case, we'd be harder pressed to implement a fast solution. Does this make sense? I see this as two-part: figuring out how to have slurmctld deal with large bursts of job submissions or other job churn in our environment, and figuring out how to triage and stabilize slurmctld when that first part hasn't completely protected us. john
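The squeue-free epilog check floated above (walk the local slurmstepd process tree instead of asking slurmctld) might look roughly like this. The helper name and the ps/pgrep details are my assumptions about how slurmstepd appears in the process table (its argv typically contains "[jobid.step]"), so treat this as a starting point rather than tested code:

```shell
#!/bin/sh
# Decide locally whether $SLURM_JOB_USER has processes under any other
# job's slurmstepd on this node, with no RPCs to slurmctld.

user_has_other_local_jobs() {
    user=$1 this_job=$2
    for pid in $(pgrep -x slurmstepd); do
        # slurmstepd's argv includes its job id, e.g. "slurmstepd: [12345.0]".
        args=$(ps -o args= -p "$pid") || continue
        case $args in
            *"[$this_job."*) continue ;;   # a step of our own job
        esac
        # Does any direct child of this stepd run as our user?
        if ps -o user= --ppid "$pid" | grep -qx "$user"; then
            return 0
        fi
    done
    return 1
}

if ! user_has_other_local_jobs "$SLURM_JOB_USER" "$SLURM_JOB_ID"; then
    : # safe to kill leftover processes (SSH sessions, etc.) here
fi
```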
Created attachment 664 [details] slurmctld-syslogs.gz
(In reply to John Morrissey from comment #14) > Created attachment 663 [details] > slurm_epilog > > On Thu, Feb 27, 2014 at 11:59:04PM +0000, bugs@schedmd.com wrote: > > --- Comment #13 from Moe Jette <jette@schedmd.com> --- > > (In reply to John Morrissey from comment #12) > > > (In reply to Moe Jette from comment #11) > > > > What does your epilog look like? > > > > > > The epilog only makes the one call to squeue(1). Are you concerned about its > > > runtime, or just the quantity of RPCs it's making? > > > > I'm just trying to figure out where there are bottlenecks are and trying > > to get your throughput up to an acceptable level. I'd be happy to review > > that to see if it can be sped up. > > Epilog is attached; it typically exits within a second or two, presuming all > of the user's processes have already exited. We could probably eliminate the > squeue(1) call by getting the PIDs of running slurmstepds, and walking the > process tree to get the usernames of their child PIDs, to see if the current > user has any other jobs running on the current host. Does that seem > reasonable? The epilog looks reasonable to me.
(In reply to John Morrissey from comment #14)
> > > Also, we've backed default_queue_depth down to 100, and I've... encouraged
> > > everyone here to refrain from turning that knob again without a solid
> > > reason. :-)
> >
> > That knob can have a big impact upon performance. Changing the value of any
> > configuration parameter by two orders of magnitude from the default setting
> > is rarely a good thing.
> >
> > Would you let me know how things are working with the new configuration in
> > the next day or two.
>
> We experienced an RPC lock contention event (for lack of a better term) today
> that correlated with a user submitting about four thousand jobs in a
> relatively small time period. slurmctld debug output for that hour is
> attached.
>
> We wound up SIGTERMing slurmctld a couple of times, and backed
> default_queue_depth down even further, to 10, which seemed to make slurmctld
> much more responsive.

That's why I thought 10000 was a really bad setting.

> Since the prolog was already out of the picture, this
> feels like slurmctld is... ambitious about the rate at which it can spawn
> jobs, ends up getting itself deep into the weeds with lots of pending RPCs,
> and can't get itself back together without some intervention.

Judging from your logs, that seems an accurate assessment. It looks pretty ugly.

Could you impose some sort of limits here? One option is to use account limits that restrict the number of jobs any user can have running at one time, or that limit the number of jobs a user can submit. Another option is to use the SchedulerParameters option bf_max_job_user=#, which limits the number of jobs the backfill scheduler will start at any one time for any single user. That has proven helpful in some similar situations.

> Beyond whatever tuning or other fixes come from this as part of a more
> permanent solution, this kind of situation is why we're asking about the
> triage features I mentioned in my original bug submission.
>
> If slurmctld had reserved some RPC slots for administrative users, maybe we
> could have looked at the queue (dunno if lock contention would have
> precluded that at that point), or if there were better ways to introspect
> into a running slurmctld (a way to see what RPC calls were blocking, a
> breakdown of recent RPCs by user and/or host, etc.), we would have stood a
> better chance of fixing slurmctld in place instead of resorting to cruder
> methods.
>
> In combination with those introspection features, if we could have
> blacklisted RPCs from certain users or hosts, we could have dealt with the
> situation more concisely. In this case, we could have had our job_submit
> plugin drop further job submissions from this user, but I don't see a way to
> defend against excessive use of the user-facing s{acct,queue,etc.} tools, so
> in that case, we'd be harder pressed to implement a fast solution.

I concur, but it's not available today and this problem seems rather pressing.

> Does this make sense? I see this as two-part: figuring out how to have
> slurmctld deal with large bursts of job submissions or other job churn in
> our environment, and figuring out how to triage and stabilize slurmctld when
> that first part hasn't completely protected us.

I'd suggest the SchedulerParameters option bf_max_job_user=# for now.
On Fri, Feb 28, 2014 at 11:04:42PM +0000, bugs@schedmd.com wrote: > --- Comment #17 from Moe Jette <jette@schedmd.com> --- > (In reply to John Morrissey from comment #14) > > Since the prolog was already out of the picture, this feels like > > slurmctld is... ambitious about the rate at which it can spawn jobs, > > ends up getting itself deep into the weeds with lots of pending RPCs, > > and can't get itself back together without some intervention. > > Judging from your logs, that seems an accurate assessment. It looks pretty > ugly. > > Could you impose some sort of limits here? > One option to use the account limits that will restrict the number of jobs that > any user can have running at one time or limit the number of jobs that a user > can submit. > Another option is to use use the SchedulerParameters parameter of > bg_max_job_user=# that will limit the number of jobs the backfill scheduler > will start at any one time for any single user. That has proven helpful in some > similar situations. We already set bf_max_job_user=100, but given what we've seen with even the difference between 100->10 and default_queue_depth, I'm inclined to bring that down to 10, as well. FWIW, here's what we have for bf_* SchedulerParameters: bf_interval=600,bf_continue,bf_resolution=300,max_job_bf=5000,bf_max_job_part=5000,bf_max_job_user=100 Could probably stand to crank down a bunch of the other limits, too. We also set MaxSubmitJobs on each user to about 10k, since we have a number of users that can legitimately have a few thousand jobs pending. Unless there's a better way, maybe we'll implement something in our job_submit script for now that rate limits job submissions to N jobs/minute. In that vein, I'm really looking forward to the next major release since it will let us return descriptive error messages from job_submit plugins. 
We've avoided implementing much administrative policy in ours so far because the possible return values are fairly vague, especially for a user base that's not always the most technically versed. > > Beyond whatever tuning or other fixes come from this as part of a more > > permanent solution, this kind of situation is why we're asking about the > > triage features I mentioned in my original bug submission. > > > > If slurmctld had reserved some RPC slots for administrative users, maybe > > we could have looked at the queue (dunno if lock contention would have > > precluded that at that point), or if there were better ways to > > introspect into a running slurmctld (a way to see what RPC calls were > > blocking, a breakdown of recent RPCs by user and/or host, etc.), we > > would have stood a better chance of fixing slurmctld in place instead of > > resorting to cruder methods. > > > > In combination with those introspection features, if we could have > > blacklisted RPCs from certain users or hosts, we could have dealt with > > the situation more concisely. In this case, we could have had our > > job_submit plugin drop further job submissions from this user, but I > > don't see a way to defend against excessive use of the user-facing > > s{acct,queue,etc.} tools, so in that case, we'd be harder pressed to > > implement a fast solution. > > I concur, but it's not available today and this problem seems rather pressing. nod, I'm mostly trying to describe our use case and paint a picture of the tools that would be useful in our context. I certainly realize (and empathize) that they don't exist today. :-) john
(In reply to John Morrissey from comment #18) > On Fri, Feb 28, 2014 at 11:04:42PM +0000, bugs@schedmd.com wrote: > > --- Comment #17 from Moe Jette <jette@schedmd.com> --- > > (In reply to John Morrissey from comment #14) > > > Since the prolog was already out of the picture, this feels like > > > slurmctld is... ambitious about the rate at which it can spawn jobs, > > > ends up getting itself deep into the weeds with lots of pending RPCs, > > > and can't get itself back together without some intervention. > > > > Judging from your logs, that seems an accurate assessment. It looks pretty > > ugly. > > > > Could you impose some sort of limits here? > > One option to use the account limits that will restrict the number of jobs that > > any user can have running at one time or limit the number of jobs that a user > > can submit. > > Another option is to use use the SchedulerParameters parameter of > > bg_max_job_user=# that will limit the number of jobs the backfill scheduler > > will start at any one time for any single user. That has proven helpful in some > > similar situations. > > We already set bf_max_job_user=100, but given what we've seen with even the > difference between 100->10 and default_queue_depth, I'm inclined to bring > that down to 10, as well. > > FWIW, here's what we have for bf_* SchedulerParameters: > > > bf_interval=600,bf_continue,bf_resolution=300,max_job_bf=5000, > bf_max_job_part=5000,bf_max_job_user=100 > > Could probably stand to crank down a bunch of the other limits, too. > > We also set MaxSubmitJobs on each user to about 10k, since we have a number > of users that can legitimately have a few thousand jobs pending. Unless > there's a better way, maybe we'll implement something in our job_submit > script for now that rate limits job submissions to N jobs/minute. 
There is also a MaxJobs limit, which limits the number of running jobs by user, but I'm thinking the total number of running jobs may be less of an issue than the number of jobs started at the same time. The latter can induce a lot of message traffic.

My next suggestion would be setting bf_max_job_user=10 as a starting point.

If you could send the output of "sdiag", that may be helpful to me also; it reports statistics on the scheduling logic.
On Fri, Feb 28, 2014 at 11:46:18PM +0000, bugs@schedmd.com wrote: > --- Comment #19 from Moe Jette <jette@schedmd.com> --- > (In reply to John Morrissey from comment #18) > > On Fri, Feb 28, 2014 at 11:04:42PM +0000, bugs@schedmd.com wrote: > > > Could you impose some sort of limits here? > > > One option to use the account limits that will restrict the number of jobs that > > > any user can have running at one time or limit the number of jobs that a user > > > can submit. > > > Another option is to use use the SchedulerParameters parameter of > > > bg_max_job_user=# that will limit the number of jobs the backfill scheduler > > > will start at any one time for any single user. That has proven helpful in some > > > similar situations. > > > > We already set bf_max_job_user=100, but given what we've seen with even the > > difference between 100->10 and default_queue_depth, I'm inclined to bring > > that down to 10, as well. > > > > FWIW, here's what we have for bf_* SchedulerParameters: > > > > > > bf_interval=600,bf_continue,bf_resolution=300,max_job_bf=5000, > > bf_max_job_part=5000,bf_max_job_user=100 > > > > Could probably stand to crank down a bunch of the other limits, too. > > > > We also set MaxSubmitJobs on each user to about 10k, since we have a > > number of users that can legitimately have a few thousand jobs pending. > > Unless there's a better way, maybe we'll implement something in our > > job_submit script for now that rate limits job submissions to N > > jobs/minute. > > There is also a MaxJobs, which limits the number of running jobs by user, > but I'm thinking the total number of running jobs may be less of an issue > that the number of jobs started at the same time. The latter can induce a > lot of message traffic. nod, agree. > My next suggestion would be setting bf_max_job_user=10 as a starting > point. k, I'll plan on that, but might not get to it until Monday. > If you could send the output of "sdiag" that may be helpful to me also. 
> It's some statistics on the scheduling logic.

Sure, here you go:

--
[jwm@holy-slurm01:pts/8 ~> sdiag
*******************************************************
sdiag output at Fri Feb 28 18:59:10 2014
Data since      Fri Feb 28 14:24:06 2014
*******************************************************
Server thread count: 5
Agent queue size:    0

Jobs submitted: 14508
Jobs started:   13860
Jobs completed: 14021
Jobs canceled:  1086
Jobs failed:    0

Main schedule statistics (microseconds):
	Last cycle:        243415
	Max cycle:         5768680
	Total cycles:      13419
	Mean cycle:        175773
	Mean depth cycle:  9
	Cycles per minute: 48
	Last queue length: 8558

Backfilling stats (WARNING: data obtained in the middle of backfilling execution)
	Total backfilled jobs (since last slurm start): 850
	Total backfilled jobs (since last stats cycle start): 850
	Total cycles:     3
	Last cycle when:  Fri Feb 28 18:31:24 2014
	Last cycle:       2091267014
	Max cycle:        2091267014
	Mean cycle:       175056444
	Last depth cycle: 5670
	Last depth cycle (try sched): 321
	Depth Mean:       7554
	Depth Mean (try depth): 569
	Last queue length: 8519
	Queue length mean: 11620
--

john
> We also set MaxSubmitJobs on each user to about 10k, since we have a number > of users that can legitimately have a few thousand jobs pending. Unless > there's a better way, maybe we'll implement something in our job_submit > script for now that rate limits job submissions to N jobs/minute. This would involve some change in the usage mode and would not work for everyone, but launching job steps (typically an MPI job invoked by srun) is much more lightweight than running an independent job for each. There are many use cases where a single job allocation spawns hundreds to tens of thousands of job steps. That might even be a preferable mode of operation in some cases; get a lot of work done once your job is allocated resources.
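The "one allocation, many steps" pattern described above could look something like this as a batch script (the task files and process_one program are hypothetical, and LAUNCH is a knob I added so the loop can be exercised without a Slurm cluster):

```shell
#!/bin/sh
# Sketch: submit ONE job with sbatch, then run each unit of work as a
# job step inside it, so slurmctld tracks a single job record instead
# of thousands of independent jobs.
#SBATCH --nodes=1
#SBATCH --ntasks=16

LAUNCH=${LAUNCH:-"srun -n1 --exclusive"}

run_tasks() {
    for input in "$@"; do
        # Each unit of work becomes a lightweight step within this
        # job's existing allocation.
        $LAUNCH ./process_one "$input" &
    done
    wait    # let all steps finish before the job script exits
}

# In the real batch script: run_tasks task.*.dat
```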
How have things been running over the past few days?
On Tue, Mar 04, 2014 at 10:09:31PM +0000, bugs@schedmd.com wrote: > http://bugs.schedmd.com/show_bug.cgi?id=607 > > --- Comment #22 from Moe Jette <jette@schedmd.com> --- > How have things been running over the past few days? Much better, but not completely without problems. We had a couple instances today where the scheduling of a few hundred jobs caused slurmctld to experience the same contention-related unresponsiveness, but was able to dig itself out after 20-30 minutes, which is a huge improvement. I didn't have a chance to change bf_max_job_user until this afternoon, and I'm not sure whether those jobs were scheduled by the primary or backfill scheduler. Thanks for checking in on us, Moe. john
(In reply to John Morrissey from comment #23)
> On Tue, Mar 04, 2014 at 10:09:31PM +0000, bugs@schedmd.com wrote:
> > http://bugs.schedmd.com/show_bug.cgi?id=607
> >
> > --- Comment #22 from Moe Jette <jette@schedmd.com> ---
> > How have things been running over the past few days?
>
> Much better, but not completely without problems. We had a couple instances
> today where the scheduling of a few hundred jobs caused slurmctld to
> experience the same contention-related unresponsiveness, but was able to dig
> itself out after 20-30 minutes, which is a huge improvement.
>
> I didn't have a chance to change bf_max_job_user until this afternoon, and
> I'm not sure whether those jobs were scheduled by the primary or backfill
> scheduler.
>
> Thanks for checking in on us, Moe.
>
> john

I think the new work that I did will solve this problem by limiting the total number of jobs that can be started in a single backfill scheduling cycle. When you upgrade, I would be inclined to set bf_max_job_start to around 100 and see how that goes. The per-user limit may not suit your environment if there are many users with many jobs each. The commit is here:
https://github.com/SchedMD/slurm/commit/1b0c4a33590c24a6509750f48b2108c0baea4fbe
On Wed, Mar 05, 2014 at 04:12:53AM +0000, bugs@schedmd.com wrote: > --- Comment #24 from Moe Jette <jette@schedmd.com> --- > (In reply to John Morrissey from comment #23) > > On Tue, Mar 04, 2014 at 10:09:31PM +0000, bugs@schedmd.com wrote: > > > --- Comment #22 from Moe Jette <jette@schedmd.com> --- > > > How have things been running over the past few days? > > > > Much better, but not completely without problems. We had a couple > > instances today where the scheduling of a few hundred jobs caused > > slurmctld to experience the same contention-related unresponsiveness, > > but was able to dig itself out after 20-30 minutes, which is a huge > > improvement. > > > > I didn't have a chance to change bf_max_job_user until this afternoon, > > and I'm not sure whether those jobs were scheduled by the primary or > > backfill scheduler. > > I think the new work that I did will solve this problem by limiting the total > number of jobs that can be started in a single backfill scheduling cycle. When > you upgrade, I would be inclined to set the bf_max_job_start to around 100 and > see how that goes. The per-user limit may not suite your environment if there > are many users with many jobs each. The commit is here: > https://github.com/SchedMD/slurm/commit/1b0c4a33590c24a6509750f48b2108c0baea4fbe I added this patch to our local packages and set bf_max_job_start=100. john
John, could you update this bug so we can close it if possible?
On Thu, Mar 20, 2014 at 06:46:19PM +0000, bugs@schedmd.com wrote: > --- Comment #26 from Danny Auble <da@schedmd.com> --- > John, could you update this bug, so we can close if possible. The scheduler's looking quite good for us now; there's still a few-minute period once a day or two where slurmctld isn't responding to RPCs, but that's live-withable at this point, and far better than it was. Thanks for checking up on us, Danny, and feel free to resolve this. john
Thanks John.