Created attachment 650 [details] slurmctld debug output around a server_thread_count limit event

We've noticed slurmctld regularly hitting its server_thread_count limit. I've attached some debugging output from around the time of the problem. At 10:12:23, ~60 jobs are preempted and requeued, and a number of dump_job_user and dump_job_single RPCs are also being processed that are taking an unusually long time. About a minute later, at 10:13, slurmctld hits the thread limit several times. At that point it's basically unresponsive: it can't process any more RPCs until the data structures that some pending RPCs depend on are unlocked, freeing up slots for new incoming RPCs to be processed.

A few questions and thoughts about this situation:

1. Is this being caused primarily by the job churn (rescheduling ~60 jobs at once)? Is there a way to rate limit this behavior, or to have slurmctld handle it more gracefully so it can gradually schedule/requeue jobs and remain responsive to incoming RPCs? We've seen similar behavior when scheduling large (4k+ index) job arrays.

2. It seems increasing MAX_SERVER_THREADS won't help in this case, since new incoming RPCs are almost certainly going to block in the same way the existing RPCs are. There would be more threads trying to process RPCs, but any increase in MAX_SERVER_THREADS would result in the same unresponsiveness because all the available threads would still be blocking on the same locks.

3. It would be really, really useful to be able to introspect into the thread pool. Being able to tell what type of RPC each thread is processing, plus some other basic information (such as the current elapsed time for that RPC and the user and host initiating it), would let us at least triage these situations, if only by quelling the offending user (if user behavior is in fact causing the RPCs in question in a particular case).

4. It would also be nice if slurmctld had some sort of rate limiting for user-initiated RPCs. We've encountered cases where users run squeue or sacct, become frustrated by the lack of response, and run them repeatedly trying to get them to run to completion. I haven't assembled any evidence that this contributes to slurmctld unresponsiveness, but having a way for slurmctld to shed "unnecessary" load by rate limiting s* tool access by unprivileged users might help, and could leave some headroom so privileged (i.e., administrative) users can troubleshoot slurmctld when an adverse event happens.

Let me know what you think, especially about MAX_SERVER_THREADS, as there's some controversy here about whether increasing its value will help.

thanks,
-john
Could you attach your slurm.conf file? The number of squeue commands being executed seems very high (like 15 at 10:12:17 and a bunch more at 10:12:18). Are you running squeue from prolog or epilog scripts? If so, could you attach those as well? There may be lighter-weight options to accomplish your goal.
I concur with your assessment of MAX_SERVER_THREADS. Raising the limit will only add more contention for the locks and not increase throughput.
Created attachment 651 [details] slurm.conf
Your problem is right here:

SchedulerParameters=default_queue_depth=10000

I would recommend removing default_queue_depth to use the default value of 100. The backfill scheduler will probably be scheduling most jobs anyway. This is from the slurm.conf man page:

default_queue_depth=#
    The default number of jobs to attempt scheduling (i.e. the queue depth) when a running job completes or other routine actions occur. The full queue will be tested on a less frequent basis. The default value is 100. See the partition_job_depth option to limit depth by partition. In the case of large clusters (more than 1000 nodes), configuring a relatively small value may be desirable. Specifying a large value (say 1000 or higher) can be expected to result in poor system responsiveness since this scheduling logic will not release locks for other events to occur. It would be better to let the backfill scheduler process a larger number of jobs (see max_job_bf, bf_continue and other options here for more information).
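In slurm.conf terms, the suggested fix is just shrinking (or deleting, to get the default) that one value; a sketch of the resulting line, assuming no other SchedulerParameters options need to change:

```
# Shallow main-scheduler pass; leave deep queue scans to the backfill
# scheduler, which releases locks between jobs.
SchedulerParameters=default_queue_depth=100
```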
We're running squeue(1) once in our epilog (as 'squeue -ho %A -u "$SLURM_JOB_USER" -w localhost') to determine whether the user has any other jobs running on the current host, so we can kill any leftover processes, such as SSH sessions, etc. We also run srun(1) once in the prolog, if the job is running on the first node in the nodelist, and BatchFlag=1 for the job. We got that section of prolog from another facility and kept rolling with it because reasons, and I've never been entirely sure what it's doing. Is this something we need to be doing? -john
(In reply to Moe Jette from comment #4) > Your problem is right here: > SchedulerParameters=default_queue_depth=10000 > > I would recommend removing default_queue_depth to use the default value of > 100. The backfill scheduler will probably be scheduling most jobs anyway. Hm, we had default_queue_depth at 100 up until about a week ago, and had it at 1000 up until a couple hours ago. I'm not sure why Paul increased it; I'll ask him. In any event, the debug output I sent would have had default_queue_depth at 1000. I figure you still recommend bumping it down to 100?
(In reply to John Morrissey from comment #5)
> We also run srun(1) once in the prolog, if the job is running on the first
> node in the nodelist, and BatchFlag=1 for the job. We got that section of
> prolog from another facility and kept rolling with it because reasons, and
> I've never been entirely sure what it's doing. Is this something we need to
> be doing?

Here's the actual hunk:

--
this_host=$(hostname -s)
hostlist=$(squeue -j "$SLURM_JOB_ID" -ho %N)
lead_node=$(scontrol show hostname "$hostlist" | head -n 1)

# If this is the lead node, and there are hosts other than the lead node
# in this host list...
if [ "$lead_node" = "$this_host" ] &&
   echo "$hostlist" | fgrep -vx "$this_host"; then

    batchmode=$(
        scontrol -o show job "$SLURM_JOB_ID" |
            sed -e 's/.*[[:space:]]BatchFlag=\([^[:space:]]\{1,\}\).*/\1/g'
    )
    if [ "$batchmode" -eq 1 ]; then
        echo "running srun true on $lead_node to get prolog on" \
            'other hosts' | log
        su "$SLURM_JOB_USER" -c 'srun -J prolog true' 2>&1 | log
    fi
fi
--
(In reply to John Morrissey from comment #6)
> (In reply to Moe Jette from comment #4)
> > Your problem is right here:
> > SchedulerParameters=default_queue_depth=10000
> >
> > I would recommend removing default_queue_depth to use the default value of
> > 100. The backfill scheduler will probably be scheduling most jobs anyway.
>
> Hm, we had default_queue_depth at 100 up until about a week ago, and had it
> at 1000 up until a couple hours ago. I'm not sure why Paul increased it;
> I'll ask him.
>
> In any event, the debug output I sent would have had default_queue_depth at
> 1000. I figure you still recommend bumping it down to 100?

Absolutely. The scheduler is probably running constantly and slowing everything down.
(In reply to John Morrissey from comment #7)
> (In reply to John Morrissey from comment #5)
> > We also run srun(1) once in the prolog, if the job is running on the first
> > node in the nodelist, and BatchFlag=1 for the job. We got that section of
> > prolog from another facility and kept rolling with it because reasons, and
> > I've never been entirely sure what it's doing. Is this something we need to
> > be doing?
>
> Here's the actual hunk:
>
> --
> this_host=$(hostname -s)
> hostlist=$(squeue -j "$SLURM_JOB_ID" -ho %N)
> lead_node=$(scontrol show hostname "$hostlist" | head -n 1)
>
> # If this is the lead node, and there are hosts other than the lead node
> # in this host list...
> if [ "$lead_node" = "$this_host" ] &&
>    echo "$hostlist" | fgrep -vx "$this_host"; then
>
>     batchmode=$(
>         scontrol -o show job "$SLURM_JOB_ID" |
>             sed -e 's/.*[[:space:]]BatchFlag=\([^[:space:]]\{1,\}\).*/\1/g'
>     )
>     if [ "$batchmode" -eq 1 ]; then
>         echo "running srun true on $lead_node to get prolog on" \
>             'other hosts' | log
>         su "$SLURM_JOB_USER" -c 'srun -J prolog true' 2>&1 | log
>     fi
> fi
> --

Shooting yourself in the foot here too: The job's hostlist is available in the SLURM_NODELIST environment variable, so the squeue call is redundant. We can get the batch flag with minor code changes and eliminate the scontrol call also. I'll see if I can get you a patch for that.
Created attachment 652 [details] add SLURM_STEP_ID to Prolog environment

This will be in version 2.6.7, but you can apply it now to eliminate the script test for batch jobs. If a batch job, the SLURM_STEP_ID will be set as follows:

SLURM_STEP_ID=4294967294

Otherwise, you should see a small number with a value dependent upon the use case, but probably 0. Your prolog should now have no calls to slurm except if you want to spawn prologs on every node of a batch job. Sample environment for a batch job shown below:

SLURM_NODELIST=tux[12-18]
SLURMD_NODENAME=tux12
SLURM_JOBID=11
SLURM_STEP_ID=4294967294
SLURM_CONF=/home/jette/Desktop/SLURM/install.linux/etc/slurm.conf
SLURM_JOB_ID=11
PWD=/home/jette/Desktop/SLURM/install.linux/sbin
SLURM_JOB_USER=jette
SLURM_UID=1000
SLURM_JOB_UID=1000
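With that patch applied, the prolog's batch-job test could collapse to a plain environment check; a minimal sketch (the function names are mine, and the srun line is illustrative, not the exact original logic):

```shell
#!/bin/sh
# Hypothetical simplified prolog: everything needed is already in the
# environment, so no squeue RPCs to slurmctld are required.

# With the attachment 652 patch, SLURM_STEP_ID=4294967294 marks a
# batch job.
is_batch_job() {
    [ "${SLURM_STEP_ID:-}" = "4294967294" ]
}

# Lead node: first host in the job's node list. "scontrol show
# hostname" only expands the hostlist expression locally.
is_lead_node() {
    [ "$(scontrol show hostname "$SLURM_NODELIST" | head -n 1)" = \
      "$SLURMD_NODENAME" ]
}

if is_batch_job && is_lead_node; then
    # Illustrative: fan the prolog out to the other allocated nodes.
    su "$SLURM_JOB_USER" -c 'srun -J prolog true'
fi
```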
(In reply to John Morrissey from comment #5) > We're running squeue(1) once in our epilog (as 'squeue -ho %A -u > "$SLURM_JOB_USER" -w localhost') to determine whether the user has any other > jobs running on the current host, so we can kill any leftover processes, > such as SSH sessions, etc. > > We also run srun(1) once in the prolog, if the job is running on the first > node in the nodelist, and BatchFlag=1 for the job. We got that section of > prolog from another facility and kept rolling with it because reasons, and > I've never been entirely sure what it's doing. Is this something we need to > be doing? > > -john Whoever developed this script wrote it in the most inefficient fashion possible. The hostname and job host list are available in environment variables. After placing extra load on slurm to gather information that is otherwise available, it launches something on all nodes allocated to a batch job, which does absolutely nothing but slow down batch jobs. If this is the entirety of your Prolog, just remove it. What does your epilog look like? What about PrologSlurmctld? Slurm provides a lot of flexibility, it just looks like you've chained a bunch of anchors to it.
(In reply to Moe Jette from comment #11) > (In reply to John Morrissey from comment #5) > > We're running squeue(1) once in our epilog (as 'squeue -ho %A -u > > "$SLURM_JOB_USER" -w localhost') to determine whether the user has any other > > jobs running on the current host, so we can kill any leftover processes, > > such as SSH sessions, etc. > > Whoever developed this script wrote it in the most inefficient fashion > possible. The hostname and job host list are available in environment > variables. After placing extra load on slurm to gather information that is > otherwise available, it launches something on all nodes allocated to a batch > job, which does absolutely nothing but slow down batch jobs. If this is the > entirety of your Prolog, just remove it. Believe me, you're preaching to the choir. :-) We decided that we don't need anything else we had in the prolog, so we're no longer running it. > What does your epilog look like? The epilog only makes the one call to squeue(1). Are you concerned about its runtime, or just the quantity of RPCs it's making? > What about PrologSlurmctld? PrologSlurmctld saves the complete job script (it reads the job script and environment from Slurm's spool directory, does a little preprocessing and saves it to an SQL database). It's fast enough that I'm not worried about it, and it makes no RPCs to Slurm. Also, we've backed default_queue_depth down to 100, and I've... encouraged everyone here to refrain from turning that knob again without a solid reason. :-)
(In reply to John Morrissey from comment #12)
> (In reply to Moe Jette from comment #11)
> > (In reply to John Morrissey from comment #5)
> > > We're running squeue(1) once in our epilog (as 'squeue -ho %A -u
> > > "$SLURM_JOB_USER" -w localhost') to determine whether the user has any other
> > > jobs running on the current host, so we can kill any leftover processes,
> > > such as SSH sessions, etc.
> >
> > Whoever developed this script wrote it in the most inefficient fashion
> > possible. The hostname and job host list are available in environment
> > variables. After placing extra load on slurm to gather information that is
> > otherwise available, it launches something on all nodes allocated to a batch
> > job, which does absolutely nothing but slow down batch jobs. If this is the
> > entirety of your Prolog, just remove it.
>
> Believe me, you're preaching to the choir. :-)
>
> We decided that we don't need anything else we had in the prolog, so we're
> no longer running it.

Great! That should help.

> > What does your epilog look like?
>
> The epilog only makes the one call to squeue(1). Are you concerned about its
> runtime, or just the quantity of RPCs it's making?

I'm just trying to figure out where the bottlenecks are and to get your throughput up to an acceptable level. I'd be happy to review that to see if it can be sped up.

> > What about PrologSlurmctld?
>
> PrologSlurmctld saves the complete job script (it reads the job script and
> environment from Slurm's spool directory, does a little preprocessing and
> saves it to an SQL database). It's fast enough that I'm not worried about
> it, and it makes no RPCs to Slurm.
>
> Also, we've backed default_queue_depth down to 100, and I've... encouraged
> everyone here to refrain from turning that knob again without a solid
> reason. :-)

That knob can have a big impact upon performance.
Changing the value of any configuration parameter by two orders of magnitude from the default setting is rarely a good thing. Would you let me know how things are working with the new configuration in the next day or two?
Created attachment 663 [details] slurm_epilog

On Thu, Feb 27, 2014 at 11:59:04PM +0000, bugs@schedmd.com wrote:
> --- Comment #13 from Moe Jette <jette@schedmd.com> ---
> (In reply to John Morrissey from comment #12)
> > (In reply to Moe Jette from comment #11)
> > > What does your epilog look like?
> >
> > The epilog only makes the one call to squeue(1). Are you concerned about its
> > runtime, or just the quantity of RPCs it's making?
>
> I'm just trying to figure out where the bottlenecks are and trying to get
> your throughput up to an acceptable level. I'd be happy to review that to
> see if it can be sped up.

Epilog is attached; it typically exits within a second or two, presuming all of the user's processes have already exited. We could probably eliminate the squeue(1) call by getting the PIDs of running slurmstepds, and walking the process tree to get the usernames of their child PIDs, to see if the current user has any other jobs running on the current host. Does that seem reasonable?

> > Also, we've backed default_queue_depth down to 100, and I've... encouraged
> > everyone here to refrain from turning that knob again without a solid
> > reason. :-)
>
> That knob can have a big impact upon performance. Changing the value of any
> configuration parameter by two orders of magnitude from the default setting
> is rarely a good thing.
>
> Would you let me know how things are working with the new configuration in
> the next day or two.

We experienced an RPC lock contention event (for lack of a better term) today that correlated with a user submitting about four thousand jobs in a relatively small time period. slurmctld debug output for that hour is attached.

We wound up SIGTERMing slurmctld a couple of times, and backed default_queue_depth down even further, to 10, which seemed to make slurmctld much more responsive. Since the prolog was already out of the picture, this feels like slurmctld is...
ambitious about the rate at which it can spawn jobs, ends up getting itself deep into the weeds with lots of pending RPCs, and can't get itself back together without some intervention. Beyond whatever tuning or other fixes come from this as part of a more permanent solution, this kind of situation is why we're asking about the triage features I mentioned in my original bug submission. If slurmctld had reserved some RPC slots for administrative users, maybe we could have looked at the queue (dunno if lock contention would have precluded that at that point), or if there were better ways to introspect into a running slurmctld (a way to see what RPC calls were blocking, a breakdown of recent RPCs by user and/or host, etc.), we would have stood a better chance of fixing slurmctld in place instead of resorting to cruder methods. In combination with those introspection features, if we could have blacklisted RPCs from certain users or hosts, we could have dealt with the situation more concisely. In this case, we could have had our job_submit plugin drop further job submissions from this user, but I don't see a way to defend against excessive use of the user-facing s{acct,queue,etc.} tools, so in that case, we'd be harder pressed to implement a fast solution. Does this make sense? I see this as two-part: figuring out how to have slurmctld deal with large bursts of job submissions or other job churn in our environment, and figuring out how to triage and stabilize slurmctld when that first part hasn't completely protected us. john
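The squeue-free epilog check floated above (walk the local slurmstepd process tree instead of asking slurmctld) might look roughly like this. The helper name and the ps/pgrep details are my assumptions about how slurmstepd appears in the process table (its argv typically contains "[jobid.step]"), so treat this as a starting point rather than tested code:

```shell
#!/bin/sh
# Decide locally whether $SLURM_JOB_USER has processes under any other
# job's slurmstepd on this node, with no RPCs to slurmctld.

user_has_other_local_jobs() {
    user=$1 this_job=$2
    for pid in $(pgrep -x slurmstepd); do
        # slurmstepd's argv includes its job id, e.g. "slurmstepd: [12345.0]".
        args=$(ps -o args= -p "$pid") || continue
        case $args in
            *"[$this_job."*) continue ;;   # a step of our own job
        esac
        # Does any direct child of this stepd run as our user?
        if ps -o user= --ppid "$pid" | grep -qx "$user"; then
            return 0
        fi
    done
    return 1
}

if ! user_has_other_local_jobs "$SLURM_JOB_USER" "$SLURM_JOB_ID"; then
    : # safe to kill leftover processes (SSH sessions, etc.) here
fi
```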
Created attachment 664 [details] slurmctld-syslogs.gz
(In reply to John Morrissey from comment #14) > Created attachment 663 [details] > slurm_epilog > > On Thu, Feb 27, 2014 at 11:59:04PM +0000, bugs@schedmd.com wrote: > > --- Comment #13 from Moe Jette <jette@schedmd.com> --- > > (In reply to John Morrissey from comment #12) > > > (In reply to Moe Jette from comment #11) > > > > What does your epilog look like? > > > > > > The epilog only makes the one call to squeue(1). Are you concerned about its > > > runtime, or just the quantity of RPCs it's making? > > > > I'm just trying to figure out where there are bottlenecks are and trying > > to get your throughput up to an acceptable level. I'd be happy to review > > that to see if it can be sped up. > > Epilog is attached; it typically exits within a second or two, presuming all > of the user's processes have already exited. We could probably eliminate the > squeue(1) call by getting the PIDs of running slurmstepds, and walking the > process tree to get the usernames of their child PIDs, to see if the current > user has any other jobs running on the current host. Does that seem > reasonable? The epilog looks reasonable to me.
(In reply to John Morrissey from comment #14)
> > > Also, we've backed default_queue_depth down to 100, and I've... encouraged
> > > everyone here to refrain from turning that knob again without a solid
> > > reason. :-)
> >
> > That knob can have a big impact upon performance. Changing the value of any
> > configuration parameter by two orders of magnitude from the default setting
> > is rarely a good thing.
> >
> > Would you let me know how things are working with the new configuration in
> > the next day or two.
>
> We experienced an RPC lock contention event (for lack of a better term) today
> that correlated with a user submitting about four thousand jobs in a
> relatively small time period. slurmctld debug output for that hour is
> attached.
>
> We wound up SIGTERMing slurmctld a couple of times, and backed
> default_queue_depth down even further, to 10, which seemed to make slurmctld
> much more responsive.

That's why I thought 10000 was a really bad setting.

> Since the prolog was already out of the picture, this
> feels like slurmctld is... ambitious about the rate at which it can spawn
> jobs, ends up getting itself deep into the weeds with lots of pending RPCs,
> and can't get itself back together without some intervention.

Judging from your logs, that seems an accurate assessment. It looks pretty ugly.

Could you impose some sort of limits here? One option is to use account limits that restrict the number of jobs any user can have running at one time, or that limit the number of jobs a user can submit. Another option is to use the SchedulerParameters option bf_max_job_user=#, which limits the number of jobs the backfill scheduler will start at any one time for any single user. That has proven helpful in some similar situations.

> Beyond whatever tuning or other fixes come from this as part of a more
> permanent solution, this kind of situation is why we're asking about the
> triage features I mentioned in my original bug submission.
>
> If slurmctld had reserved some RPC slots for administrative users, maybe we
> could have looked at the queue (dunno if lock contention would have
> precluded that at that point), or if there were better ways to introspect
> into a running slurmctld (a way to see what RPC calls were blocking, a
> breakdown of recent RPCs by user and/or host, etc.), we would have stood a
> better chance of fixing slurmctld in place instead of resorting to cruder
> methods.
>
> In combination with those introspection features, if we could have
> blacklisted RPCs from certain users or hosts, we could have dealt with the
> situation more concisely. In this case, we could have had our job_submit
> plugin drop further job submissions from this user, but I don't see a way to
> defend against excessive use of the user-facing s{acct,queue,etc.} tools, so
> in that case, we'd be harder pressed to implement a fast solution.

I concur, but it's not available today and this problem seems rather pressing.

> Does this make sense? I see this as two-part: figuring out how to have
> slurmctld deal with large bursts of job submissions or other job churn in
> our environment, and figuring out how to triage and stabilize slurmctld when
> that first part hasn't completely protected us.

I'd suggest the SchedulerParameters option bf_max_job_user=# for now.
On Fri, Feb 28, 2014 at 11:04:42PM +0000, bugs@schedmd.com wrote: > --- Comment #17 from Moe Jette <jette@schedmd.com> --- > (In reply to John Morrissey from comment #14) > > Since the prolog was already out of the picture, this feels like > > slurmctld is... ambitious about the rate at which it can spawn jobs, > > ends up getting itself deep into the weeds with lots of pending RPCs, > > and can't get itself back together without some intervention. > > Judging from your logs, that seems an accurate assessment. It looks pretty > ugly. > > Could you impose some sort of limits here? > One option to use the account limits that will restrict the number of jobs that > any user can have running at one time or limit the number of jobs that a user > can submit. > Another option is to use use the SchedulerParameters parameter of > bg_max_job_user=# that will limit the number of jobs the backfill scheduler > will start at any one time for any single user. That has proven helpful in some > similar situations. We already set bf_max_job_user=100, but given what we've seen with even the difference between 100->10 and default_queue_depth, I'm inclined to bring that down to 10, as well. FWIW, here's what we have for bf_* SchedulerParameters: bf_interval=600,bf_continue,bf_resolution=300,max_job_bf=5000,bf_max_job_part=5000,bf_max_job_user=100 Could probably stand to crank down a bunch of the other limits, too. We also set MaxSubmitJobs on each user to about 10k, since we have a number of users that can legitimately have a few thousand jobs pending. Unless there's a better way, maybe we'll implement something in our job_submit script for now that rate limits job submissions to N jobs/minute. In that vein, I'm really looking forward to the next major release since it will let us return descriptive error messages from job_submit plugins. 
We've avoided implementing much administrative policy in ours so far because the possible return values are fairly vague, especially for a user base that's not always the most technically versed. > > Beyond whatever tuning or other fixes come from this as part of a more > > permanent solution, this kind of situation is why we're asking about the > > triage features I mentioned in my original bug submission. > > > > If slurmctld had reserved some RPC slots for administrative users, maybe > > we could have looked at the queue (dunno if lock contention would have > > precluded that at that point), or if there were better ways to > > introspect into a running slurmctld (a way to see what RPC calls were > > blocking, a breakdown of recent RPCs by user and/or host, etc.), we > > would have stood a better chance of fixing slurmctld in place instead of > > resorting to cruder methods. > > > > In combination with those introspection features, if we could have > > blacklisted RPCs from certain users or hosts, we could have dealt with > > the situation more concisely. In this case, we could have had our > > job_submit plugin drop further job submissions from this user, but I > > don't see a way to defend against excessive use of the user-facing > > s{acct,queue,etc.} tools, so in that case, we'd be harder pressed to > > implement a fast solution. > > I concur, but it's not available today and this problem seems rather pressing. nod, I'm mostly trying to describe our use case and paint a picture of the tools that would be useful in our context. I certainly realize (and empathize) that they don't exist today. :-) john
(In reply to John Morrissey from comment #18) > On Fri, Feb 28, 2014 at 11:04:42PM +0000, bugs@schedmd.com wrote: > > --- Comment #17 from Moe Jette <jette@schedmd.com> --- > > (In reply to John Morrissey from comment #14) > > > Since the prolog was already out of the picture, this feels like > > > slurmctld is... ambitious about the rate at which it can spawn jobs, > > > ends up getting itself deep into the weeds with lots of pending RPCs, > > > and can't get itself back together without some intervention. > > > > Judging from your logs, that seems an accurate assessment. It looks pretty > > ugly. > > > > Could you impose some sort of limits here? > > One option to use the account limits that will restrict the number of jobs that > > any user can have running at one time or limit the number of jobs that a user > > can submit. > > Another option is to use use the SchedulerParameters parameter of > > bg_max_job_user=# that will limit the number of jobs the backfill scheduler > > will start at any one time for any single user. That has proven helpful in some > > similar situations. > > We already set bf_max_job_user=100, but given what we've seen with even the > difference between 100->10 and default_queue_depth, I'm inclined to bring > that down to 10, as well. > > FWIW, here's what we have for bf_* SchedulerParameters: > > > bf_interval=600,bf_continue,bf_resolution=300,max_job_bf=5000, > bf_max_job_part=5000,bf_max_job_user=100 > > Could probably stand to crank down a bunch of the other limits, too. > > We also set MaxSubmitJobs on each user to about 10k, since we have a number > of users that can legitimately have a few thousand jobs pending. Unless > there's a better way, maybe we'll implement something in our job_submit > script for now that rate limits job submissions to N jobs/minute. 
There is also a MaxJobs limit, which limits the number of running jobs by user, but I'm thinking the total number of running jobs may be less of an issue than the number of jobs started at the same time. The latter can induce a lot of message traffic.

My next suggestion would be setting bf_max_job_user=10 as a starting point.

If you could send the output of "sdiag", that may be helpful to me also; it reports statistics on the scheduling logic.
On Fri, Feb 28, 2014 at 11:46:18PM +0000, bugs@schedmd.com wrote: > --- Comment #19 from Moe Jette <jette@schedmd.com> --- > (In reply to John Morrissey from comment #18) > > On Fri, Feb 28, 2014 at 11:04:42PM +0000, bugs@schedmd.com wrote: > > > Could you impose some sort of limits here? > > > One option to use the account limits that will restrict the number of jobs that > > > any user can have running at one time or limit the number of jobs that a user > > > can submit. > > > Another option is to use use the SchedulerParameters parameter of > > > bg_max_job_user=# that will limit the number of jobs the backfill scheduler > > > will start at any one time for any single user. That has proven helpful in some > > > similar situations. > > > > We already set bf_max_job_user=100, but given what we've seen with even the > > difference between 100->10 and default_queue_depth, I'm inclined to bring > > that down to 10, as well. > > > > FWIW, here's what we have for bf_* SchedulerParameters: > > > > > > bf_interval=600,bf_continue,bf_resolution=300,max_job_bf=5000, > > bf_max_job_part=5000,bf_max_job_user=100 > > > > Could probably stand to crank down a bunch of the other limits, too. > > > > We also set MaxSubmitJobs on each user to about 10k, since we have a > > number of users that can legitimately have a few thousand jobs pending. > > Unless there's a better way, maybe we'll implement something in our > > job_submit script for now that rate limits job submissions to N > > jobs/minute. > > There is also a MaxJobs, which limits the number of running jobs by user, > but I'm thinking the total number of running jobs may be less of an issue > that the number of jobs started at the same time. The latter can induce a > lot of message traffic. nod, agree. > My next suggestion would be setting bf_max_job_user=10 as a starting > point. k, I'll plan on that, but might not get to it until Monday. > If you could send the output of "sdiag" that may be helpful to me also. 
> It's some statistics on the scheduling logic.

Sure, here you go:

--
[jwm@holy-slurm01:pts/8 ~> sdiag
*******************************************************
sdiag output at Fri Feb 28 18:59:10 2014
Data since      Fri Feb 28 14:24:06 2014
*******************************************************
Server thread count: 5
Agent queue size:    0

Jobs submitted: 14508
Jobs started:   13860
Jobs completed: 14021
Jobs canceled:  1086
Jobs failed:    0

Main schedule statistics (microseconds):
	Last cycle:        243415
	Max cycle:         5768680
	Total cycles:      13419
	Mean cycle:        175773
	Mean depth cycle:  9
	Cycles per minute: 48
	Last queue length: 8558

Backfilling stats (WARNING: data obtained in the middle of backfilling execution)
	Total backfilled jobs (since last slurm start): 850
	Total backfilled jobs (since last stats cycle start): 850
	Total cycles:     3
	Last cycle when:  Fri Feb 28 18:31:24 2014
	Last cycle:       2091267014
	Max cycle:        2091267014
	Mean cycle:       175056444
	Last depth cycle: 5670
	Last depth cycle (try sched): 321
	Depth Mean:       7554
	Depth Mean (try depth): 569
	Last queue length: 8519
	Queue length mean: 11620
--

john
> We also set MaxSubmitJobs on each user to about 10k, since we have a number > of users that can legitimately have a few thousand jobs pending. Unless > there's a better way, maybe we'll implement something in our job_submit > script for now that rate limits job submissions to N jobs/minute. This would involve some change in the usage mode and would not work for everyone, but launching job steps (typically an MPI job invoked by srun) is much more lightweight than running an independent job for each. There are many use cases where a single job allocation spawns hundreds to tens of thousands of job steps. That might even be a preferable mode of operation in some cases; get a lot of work done once your job is allocated resources.
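The "one allocation, many steps" pattern described above could look something like this as a batch script (the task files and process_one program are hypothetical, and LAUNCH is a knob I added so the loop can be exercised without a Slurm cluster):

```shell
#!/bin/sh
# Sketch: submit ONE job with sbatch, then run each unit of work as a
# job step inside it, so slurmctld tracks a single job record instead
# of thousands of independent jobs.
#SBATCH --nodes=1
#SBATCH --ntasks=16

LAUNCH=${LAUNCH:-"srun -n1 --exclusive"}

run_tasks() {
    for input in "$@"; do
        # Each unit of work becomes a lightweight step within this
        # job's existing allocation.
        $LAUNCH ./process_one "$input" &
    done
    wait    # let all steps finish before the job script exits
}

# In the real batch script: run_tasks task.*.dat
```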
How have things been running over the past few days?
On Tue, Mar 04, 2014 at 10:09:31PM +0000, bugs@schedmd.com wrote: > http://bugs.schedmd.com/show_bug.cgi?id=607 > > --- Comment #22 from Moe Jette <jette@schedmd.com> --- > How have things been running over the past few days? Much better, but not completely without problems. We had a couple instances today where the scheduling of a few hundred jobs caused slurmctld to experience the same contention-related unresponsiveness, but was able to dig itself out after 20-30 minutes, which is a huge improvement. I didn't have a chance to change bf_max_job_user until this afternoon, and I'm not sure whether those jobs were scheduled by the primary or backfill scheduler. Thanks for checking in on us, Moe. john
(In reply to John Morrissey from comment #23)
> On Tue, Mar 04, 2014 at 10:09:31PM +0000, bugs@schedmd.com wrote:
> > http://bugs.schedmd.com/show_bug.cgi?id=607
> >
> > --- Comment #22 from Moe Jette <jette@schedmd.com> ---
> > How have things been running over the past few days?
>
> Much better, but not completely without problems. We had a couple instances
> today where the scheduling of a few hundred jobs caused slurmctld to
> experience the same contention-related unresponsiveness, but was able to dig
> itself out after 20-30 minutes, which is a huge improvement.
>
> I didn't have a chance to change bf_max_job_user until this afternoon, and
> I'm not sure whether those jobs were scheduled by the primary or backfill
> scheduler.
>
> Thanks for checking in on us, Moe.
>
> john

I think the new work that I did will solve this problem by limiting the total number of jobs that can be started in a single backfill scheduling cycle. When you upgrade, I would be inclined to set bf_max_job_start to around 100 and see how that goes. The per-user limit may not suit your environment if there are many users with many jobs each. The commit is here:
https://github.com/SchedMD/slurm/commit/1b0c4a33590c24a6509750f48b2108c0baea4fbe
On Wed, Mar 05, 2014 at 04:12:53AM +0000, bugs@schedmd.com wrote: > --- Comment #24 from Moe Jette <jette@schedmd.com> --- > (In reply to John Morrissey from comment #23) > > On Tue, Mar 04, 2014 at 10:09:31PM +0000, bugs@schedmd.com wrote: > > > --- Comment #22 from Moe Jette <jette@schedmd.com> --- > > > How have things been running over the past few days? > > > > Much better, but not completely without problems. We had a couple > > instances today where the scheduling of a few hundred jobs caused > > slurmctld to experience the same contention-related unresponsiveness, > > but was able to dig itself out after 20-30 minutes, which is a huge > > improvement. > > > > I didn't have a chance to change bf_max_job_user until this afternoon, > > and I'm not sure whether those jobs were scheduled by the primary or > > backfill scheduler. > > I think the new work that I did will solve this problem by limiting the total > number of jobs that can be started in a single backfill scheduling cycle. When > you upgrade, I would be inclined to set the bf_max_job_start to around 100 and > see how that goes. The per-user limit may not suite your environment if there > are many users with many jobs each. The commit is here: > https://github.com/SchedMD/slurm/commit/1b0c4a33590c24a6509750f48b2108c0baea4fbe I added this patch to our local packages and set bf_max_job_start=100. john
John, could you update this bug so we can close it if possible?
On Thu, Mar 20, 2014 at 06:46:19PM +0000, bugs@schedmd.com wrote: > --- Comment #26 from Danny Auble <da@schedmd.com> --- > John, could you update this bug, so we can close if possible. The scheduler's looking quite good for us now; there's still a few-minute period once a day or two where slurmctld isn't responding to RPCs, but that's live-withable at this point, and far better than it was. Thanks for checking up on us, Danny, and feel free to resolve this. john
Thanks John.