| Summary: | 825388 - Slurm freezes when a large number of jobs are canceled | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jason Coverston <jason.coverston> |
| Component: | Cray ALPS | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | brian.gilmer, brian, da |
| Version: | 14.11.3 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | KAUST | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 14.11.7 15.08.0-0pre5 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slow_scheduler2.patch; improved release patch; Don't signal a batch job when cancelling it; Wait in the stepd for a release | ||
|
Description
Jason Coverston
2015-04-27 05:36:34 MDT
From SchedMD: In Slurm v14.11.6 we'll avoid repeated messages to terminate jobs if the slurmctld daemon is busy (lots of RPCs in play). New commit here: https://github.com/SchedMD/slurm/commit/2a9027616b28ca1d9ceefc57f9f284c4fef0ba58 For now, increasing the configured value of KillWait will have a similar effect.

Configuration changes:
alps.conf: maxResv=8000
slurm.conf: MaxArraySize=10001, KillWait=300, SchedulerParameters ... max_rpc_cnt=57

From SchedMD: I've added an option to limit the number of jobs started in Slurm's main scheduling logic (as opposed to the backfill logic). That's SchedulerParameters=sched_max_job_start=#, which is a much more precise option than limits by run time in seconds or the RPC backlog (the only options previously available). That change will be in v14.11. https://github.com/SchedMD/slurm/commit/c0eb47c2677bc9f9f0bfa41ba78907c2254e14fa

The patch attached to this bug (http://bugzilla.us.cray.com/show_bug.cgi?id=825465#c6, http://bugs.schedmd.com/show_bug.cgi?id=1608#c38) appears to have a positive effect on the issue described here.

Testing on crystal: I tested an array job of 1046 elements. It took ~10 minutes for 990 jobs to launch (only 990 ran rather than the full 1046, I believe because of a queue limitation). During the job launches there was not a single socket timeout error, and sinfo and squeue were very responsive. I then ran "scancel -u jcovers" on all 990 running jobs. Slurm did not freeze, and squeue and sinfo remained responsive during the time it took Slurm to cancel all running and pending jobs.

Update from the customer before the patch was installed: a Slurm cancellation incident occurred at KAUST on Apr 23, 2015 12:00 (case 108704). While canceling several jobs, Slurm became unresponsive. The problem appeared to be an array job containing over 200 entries, and it was finally necessary to restart the Slurm daemon. One suggestion from SchedMD: scancel by the array ID instead of by user. Obviously this is not what every user will want to do, but it is much more efficient.

The patch has been installed on the customer system. Slurm no longer freezes when canceling large numbers of jobs at once; however, Slurm commands are unresponsive during this time. The only way to tell that forward progress is being made is to look at apstat output and see that reservations are being deleted. The customer has stated this is not acceptable and believes they should be able to run squeue and sinfo during this time.

Jason, please try the attached patch. It is better than the one I sent on Friday. Thanks, Danny

Created attachment 1846 [details]
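The tuning values quoted in this report can be collected into a configuration sketch. This is illustrative only: the values are the ones stated above, except the sched_max_job_start value, which is an assumed placeholder (the report only names the parameter).

```
# slurm.conf -- sketch of the tuning discussed in this report
MaxArraySize=10001
KillWait=300    # a longer KillWait reduces repeated terminate messages while slurmctld is busy
# max_rpc_cnt from this report; sched_max_job_start value is an assumption, not from the report
SchedulerParameters=max_rpc_cnt=57,sched_max_job_start=20

# alps.conf -- sketch
maxResv=8000
```

On the scancel-by-array-ID suggestion: cancelling the array master job ID (e.g. `scancel 12345`) removes every task of that array in one request, rather than issuing per-job cancellations as `scancel -u <user>` does.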
improved release patch
This patch is a proof of concept implementing the improved release as documented in the BASIL 1.2 spec.
It implements a new configuration parameter named ImprovedRelease. If set, Slurm will not send apkills; instead it will call the ALPS RELEASE method from the slurmd and wait until ALPS reports that the application and reservation are gone.
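As a sketch, enabling this proof-of-concept parameter might look like the fragment below. The option name comes from the patch description above; the file placement and exact syntax are assumptions (cray.conf is where Cray/ALPS-specific options normally live).

```
# cray.conf -- hypothetical sketch based on the attached proof-of-concept patch
# If set, slurmd calls the BASIL 1.2 RELEASE method instead of sending apkill,
# then waits until ALPS reports the application and reservation are gone.
ImprovedRelease=yes
```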
Jason
Created attachment 1848 [details]
Don't signal a batch job when cancelling it.
This patch adds a new cray.conf option, NoAPIDSignalOnKill.
When set to yes, the slurmctld will not signal the APIDs in a job; instead, the stepd relies on the kill RPC coming from the slurmctld to end things correctly.
This makes a dramatic improvement when canceling jobs.
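A minimal sketch of the option described above (the option name and its yes value are from the patch description; the surrounding file syntax is an assumption):

```
# cray.conf -- sketch
# When yes, slurmctld does not signal a job's APIDs directly;
# termination is driven by the kill RPC handled in the stepd instead.
NoAPIDSignalOnKill=yes
```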
I will also note that commit 225a1dea6f7e helps with the scancel issues as well.
Created attachment 1849 [details]
Wait in the stepd for a release
This patch is a variation on Jason's patch: wait in the slurmstepd for the release to happen. In testing on an emulated system, this removed all the transient errors we had seen since the introduction of the inventory_interval added earlier.
These last 2 attachments are commits d4d64877009 and 2eefdbd62b3, respectively.

Hi Danny,
These patches worked great.
We have seen a good improvement in scancel performance.
Our benchmarker reported during our test session today on the machine:
2074 + 1 jobs running at this point. Executed 'scancel -u rwalsh' and noted
a rapid drop in the number of running jobs. Reached under 2000 in about 2 minutes.
Did experience some slowness from squeue and sinfo about halfway into the
mass cancellation, but the process did not stall and there were no socket
timeout messages from 'squeue'. All jobs were cancelled in perhaps 5 or 6 minutes.
Excellent!
Thanks,
Jason
Perfect, I feel this bug is closed. What is your feeling?

(In reply to Danny Auble from comment #12)

Hi Danny,

Yep! Go ahead and close. I'll close the BUG on our end.

Cheers!
Jason

Sounds good, excellent! |