Created attachment 1844 [details] slow_scheduler2.patch

Opening this bug to track the cancel issue. Some of this has already been captured in http://bugs.schedmd.com/show_bug.cgi?id=1608, but I think we should separate them for clarity.

When a large number of jobs are canceled, slurmctld loses contact with the slurmd on the front-end servers, and a clean restart is required to get Slurm working again.

- The site needs to know how many jobs are safe to cancel at one time.
- Is it possible to throttle the number of terminate-job messages sent to the front end?

--

Test SAT 5.4.11, scheduling fragmentation

Test description:
1. Submit enough single-node MPI jobs to completely fill the system, each job on a single node, running long enough for the purposes of this test.
2. Two minutes after the last job starts execution, cancel 200 jobs chosen at random, sparse across the system.
3. After the 200 nodes are marked as available, submit a 200-node MPI job.
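The three test steps above could be scripted roughly as follows. This is a sketch only, not part of the actual test harness: the job script names, node count variable, and time limit are all hypothetical.

```
# 1. Fill the system: one single-node MPI job per node
#    (single_node_mpi.sh and TOTAL_NODES are hypothetical).
for i in $(seq 1 $TOTAL_NODES); do
    sbatch -N1 --time=2:00:00 single_node_mpi.sh
done

# 2. Two minutes after the last job starts, cancel 200 running jobs
#    chosen at random, sparse across the system.
sleep 120
squeue -h -u $USER -t RUNNING -o '%i' | shuf -n 200 | xargs scancel

# 3. Once the 200 nodes are marked available, submit a 200-node MPI job
#    (big_mpi.sh is hypothetical).
sbatch -N200 big_mpi.sh
```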
From SchedMD:

In Slurm v14.11.6 we'll avoid repeated messages to terminate jobs if the slurmctld daemon is busy (lots of RPCs in play). New commit here: https://github.com/SchedMD/slurm/commit/2a9027616b28ca1d9ceefc57f9f284c4fef0ba58

For now, increasing the configured value of KillWait will have a similar effect.

--

Configuration changes:

alps.conf:
- maxResv=8000

slurm.conf:
- MaxArraySize=10001
- KillWait=300
- SchedulerParameters ... max_rpc_cnt=57

--

From SchedMD:

I've added an option to limit the number of jobs started in Slurm's main scheduling logic (as opposed to the backfill logic): SchedulerParameters=sched_max_job_start=#. This is a much more precise option than the previously available limits based on run time in seconds or the RPC backlog. That change will be in v14.11. https://github.com/SchedMD/slurm/commit/c0eb47c2677bc9f9f0bfa41ba78907c2254e14fa
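Taken together, the configuration changes above might look like the following fragments. The values shown are the ones reported for this site, except sched_max_job_start=100, which is an illustrative value (no specific value is given in this thread).

```
# alps.conf
maxResv=8000

# slurm.conf
MaxArraySize=10001
KillWait=300
# max_rpc_cnt defers scheduling cycles when the RPC backlog is high;
# sched_max_job_start (v14.11+) caps how many jobs the main scheduler
# starts per pass. The value 100 here is illustrative, not from this report.
SchedulerParameters=max_rpc_cnt=57,sched_max_job_start=100
```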
The patch attached to this bug (http://bugzilla.us.cray.com/show_bug.cgi?id=825465#c6, http://bugs.schedmd.com/show_bug.cgi?id=1608#c38) looks like it has a positive effect on the issue described here.

Testing on crystal: I tested an array job of 1046 elements. It took ~10 minutes for 990 jobs to launch (only 990 ran rather than the full 1046 because of a queue limitation, I believe). During the job launches there was not one socket timeout error, and the sinfo and squeue commands were very responsive.

I then ran "scancel -u jcovers" on all 990 running jobs. Slurm did not freeze. Again, squeue and sinfo were responsive during the time it took Slurm to cancel all running and pending jobs.
Update from the customer before the patch was installed:

A Slurm cancelling incident occurred at KAUST on Apr 23, 2015 at 12:00. The case is 108704.

----

While canceling several jobs, Slurm became unresponsive. The problem appeared to be an array job containing over 200 entries. It was finally necessary to restart the Slurm daemon.
One suggestion from SchedMD: ... to scancel by the array ID instead of by user (i.e. pass the array job's ID to scancel, cancelling every task in one request). Obviously this is not what every user will want to do, but it is much more efficient.

--

The patch has been installed on the customer system.
Slurm no longer freezes when canceling large numbers of jobs at once. However, Slurm commands are unresponsive during this time. The only way to tell whether forward progress is being made is to look at apstat output and see that reservations are being deleted.

The customer has stated this is not acceptable and believes they should be able to run squeue and sinfo during this time.
Jason, please try the attached patch. It is better than the one I sent on Friday.

Thanks,
Danny
Created attachment 1846 [details] improved release patch

This patch is a proof of concept implementing the improved release as documented in the BASIL 1.2 spec. It implements a new configuration parameter named ImprovedRelease. If set, Slurm will not send apkills; instead it will call the ALPS RELEASE method from the slurmd and wait until ALPS says the application and reservation are gone.

Jason
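As a sketch, enabling the proof-of-concept option would presumably look like the fragment below. Note the comment does not say which file the parameter lives in; placing it in cray.conf is an assumption based on the other Cray-specific options in this thread.

```
# cray.conf (assumed location for this proof-of-concept parameter)
# When set, slurmd calls the BASIL 1.2 RELEASE method instead of sending
# apkills, then waits for ALPS to report the application and reservation gone.
ImprovedRelease=yes
```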
Created attachment 1848 [details] Don't signal a batch job when cancelling it.

This patch adds a new cray.conf option, NoAPIDSignalOnKill. When set to yes, the slurmctld will not signal the apids in a job; instead it relies on the RPC coming from the slurmctld to kill the job to end things correctly. This makes a dramatic improvement when canceling jobs.

I will also note that commit 225a1dea6f7e will also help with the scancel issues.
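Per the patch description, the option would be enabled like so:

```
# cray.conf
# Don't signal a job's apids from slurmctld on cancel; rely on the kill
# RPC delivered to the node to end the job correctly.
NoAPIDSignalOnKill=yes
```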
Created attachment 1849 [details] Wait in the stepd for a release

This patch is a variation on Jason's patch that waits in the slurmstepd for the release to happen. In testing on an emulated system, this removed all the transient errors we saw from the introduction of the inventory_interval added earlier.
These last 2 attachments are commits d4d64877009 and 2eefdbd62b3, respectively.
Hi Danny,

These patches worked great. We have seen a good improvement in scancel performance.

Our benchmarker reported during our test session today on the machine:

2074 + 1 jobs running at this point. Executed 'scancel -u rwalsh' and noted a rapid drop in the number of running jobs. Reached under 2000 in about 2 minutes. Did experience some slowness from squeue and sinfo about halfway into the mass cancellation, but the process did not stall and there were no socket timeout messages from 'squeue'. All jobs were cancelled in perhaps 5 or 6 minutes.

Excellent!

Thanks,

Jason
Perfect, I feel this bug is closed. What is your feeling?
Hi Danny,

Yep! Go ahead and close. I'll close the bug on our end.

Cheers!

Jason

(In reply to Danny Auble from comment #12)
> Perfect, I feel this bug is closed. What is your feeling?
Sounds good, awesome!