| Summary: | 825388 - Slurm freezes when a large number of jobs are canceled | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jason Coverston <jason.coverston> |
| Component: | Cray ALPS | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | brian.gilmer, brian, da |
| Version: | 14.11.3 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | KAUST | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 14.11.7 15.08.0-0pre5 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slow_scheduler2.patch; improved release patch; Don't signal a batch job when cancelling it; Wait in the stepd for a release | ||
|
Description
Jason Coverston
2015-04-27 05:36:34 MDT
From SchedMD: In Slurm v14.11.6 we'll avoid repeated messages to terminate jobs if the slurmctld daemon is busy (lots of RPCs in play). New commit here: https://github.com/SchedMD/slurm/commit/2a9027616b28ca1d9ceefc57f9f284c4fef0ba58 For now, increasing the configured value of KillWait will have a similar effect.

Configuration changes:
alps.conf: maxResv=8000
slurm.conf: MaxArraySize=10001, KillWait=300, SchedulerParameters ... max_rpc_cnt=57

From SchedMD: I've added an option to limit the number of jobs started in Slurm's main scheduling logic (as opposed to the backfill logic). That's SchedulerParameters=sched_max_job_start=#, which is a much more precise option than limits by run time in seconds or the RPC backlog (the only options previously available). That change will be in v14.11. https://github.com/SchedMD/slurm/commit/c0eb47c2677bc9f9f0bfa41ba78907c2254e14fa

The patch attached to this bug (http://bugzilla.us.cray.com/show_bug.cgi?id=825465#c6, http://bugs.schedmd.com/show_bug.cgi?id=1608#c38) appears to have a positive effect on the issue described here.

Testing on crystal: I tested an array job of 1046 elements. It took ~10 minutes for 990 jobs to launch (only 990 ran rather than the full 1046, I believe because of a queue limitation). During the job launches there was not a single socket timeout error, and sinfo and squeue were very responsive. I then ran "scancel -u jcovers" on all 990 running jobs. Slurm did not freeze, and squeue and sinfo remained responsive during the time it took Slurm to cancel all running and pending jobs.

Update from the customer before the patch was installed: a Slurm cancellation incident occurred at KAUST on Apr 23, 2015 12:00 (case 108704). While canceling several jobs, Slurm became unresponsive. The problem appeared to be an array job containing over 200 entries, and it was finally necessary to restart the Slurm daemon. One suggestion from SchedMD: scancel by the array ID instead of by user. Obviously this is not what every user will want to do, but it is much more efficient.

The patch has been installed on the customer system. Slurm no longer freezes when canceling large numbers of jobs at once; however, Slurm commands are unresponsive during this time. The only way to tell that forward progress is being made is to look at apstat output and see that reservations are being deleted. The customer has stated this is not acceptable and believes they should be able to run squeue and sinfo during this time.

Jason, please try the attached patch. It is better than the one I sent on Friday. Thanks, Danny

Created attachment 1846 [details]
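The tuning values quoted in this report can be collected into a configuration sketch. This is illustrative only: the values are the ones stated above, except the sched_max_job_start value, which is an assumed placeholder (the report only names the parameter).

```
# slurm.conf -- sketch of the tuning discussed in this report
MaxArraySize=10001
KillWait=300    # a longer KillWait reduces repeated terminate messages while slurmctld is busy
# max_rpc_cnt from this report; sched_max_job_start value is an assumption, not from the report
SchedulerParameters=max_rpc_cnt=57,sched_max_job_start=20

# alps.conf -- sketch
maxResv=8000
```

On the scancel-by-array-ID suggestion: cancelling the array master job ID (e.g. `scancel 12345`) removes every task of that array in one request, rather than issuing per-job cancellations as `scancel -u <user>` does.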
improved release patch
This patch is a proof of concept implementing the improved release as documented in the BASIL 1.2 spec.
It implements a new configuration parameter named ImprovedRelease. If set, Slurm will not send apkills; instead it will call the ALPS RELEASE method from the slurmd and wait until ALPS reports that the application and reservation are gone.
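As a sketch, enabling this proof-of-concept parameter might look like the fragment below. The option name comes from the patch description above; the file placement and exact syntax are assumptions (cray.conf is where Cray/ALPS-specific options normally live).

```
# cray.conf -- hypothetical sketch based on the attached proof-of-concept patch
# If set, slurmd calls the BASIL 1.2 RELEASE method instead of sending apkill,
# then waits until ALPS reports the application and reservation are gone.
ImprovedRelease=yes
```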
Jason
Created attachment 1848 [details]
Don't signal a batch job when cancelling it.
This patch adds a new cray.conf option, NoAPIDSignalOnKill.
When set to yes, the slurmctld will not signal the APIDs in a job; instead, the stepd relies on the kill RPC coming from the slurmctld to end things correctly.
This makes a dramatic improvement when canceling jobs.
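A minimal sketch of the option described above (the option name and its yes value are from the patch description; the surrounding file syntax is an assumption):

```
# cray.conf -- sketch
# When yes, slurmctld does not signal a job's APIDs directly;
# termination is driven by the kill RPC handled in the stepd instead.
NoAPIDSignalOnKill=yes
```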
I will also note that commit 225a1dea6f7e helps with the scancel issues as well.
Created attachment 1849 [details]
Wait in the stepd for a release
This patch is a variation on Jason's patch: wait in the slurmstepd for the release to happen. In testing on an emulated system, this removed all the transient errors we had seen since the introduction of the inventory_interval added earlier.
These last 2 attachments are commits d4d64877009 and 2eefdbd62b3, respectively.

Hi Danny,
These patches worked great.
We have seen a good improvement in scancel performance.
Our benchmarker reported during our test session today on the machine:
2074 + 1 jobs running at this point. Executed 'scancel -u rwalsh' and noted
a rapid drop in the number of running jobs. Reached under 2000 in about 2 minutes.
Did experience some slowness from squeue and sinfo about halfway into the
mass cancellation, but the process did not stall and there were no socket
timeout messages from 'squeue'. All jobs were cancelled in perhaps 5 or 6 minutes.
Excellent!
Thanks,
Jason
Perfect, I feel this bug is closed. What is your feeling?

(In reply to Danny Auble from comment #12)

Hi Danny,

Yep! Go ahead and close. I'll close the BUG on our end.

Cheers!
Jason

Sounds good, excellent! |