Ticket 525 - slurmctld slowdown when 1000s of jobs finish simultaneously
Summary: slurmctld slowdown when 1000s of jobs finish simultaneously
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 2.6.x
Hardware: Linux
Severity: 2 - High Impact
Assignee: Moe Jette
 
Reported: 2013-11-16 10:07 MST by David Parks
Modified: 2013-11-29 03:47 MST

Site: CRAY


Attachments
slurm.conf (3.57 KB, application/octet-stream)
2013-11-16 11:43 MST, Danny Auble
use qsort() (4.94 KB, patch)
2013-11-18 11:02 MST, David Bigagli

Description David Parks 2013-11-16 10:07:56 MST
We have a simple test that submits 13600 jobs; each runs for 1 minute of compute-bound "spin wait", and then they all finish simultaneously.

It takes over 15 minutes of real time for all 13600 jobs to completely exit the job scheduling subsystem.

Please see routine _build_row_bitmaps in ./select/cons_res/select_cons_res.c.

Best regards,

Dave
Comment 1 Moe Jette 2013-11-16 10:36:45 MST
Can you attach the configuration file?

In particular, I'm interested in whether gang scheduling is configured. Is "Shared" configured on any of the partitions, and if so, what is the value?

Also, what are the PreemptType and PreemptMode values?

The variable "num_rows" should be the oversubscription level. Without oversubscription, num_rows should be 1 and the nested loops should not be executed.
Comment 2 Moe Jette 2013-11-16 10:46:24 MST
From Dave Parks:

So I was looking at the loop in timer section #4. (I'm not trying to be a busybody, but I've spent years optimizing HPC applications, often without having in-depth knowledge of the underlying physical science.)

I have no idea what the truth frequency of the if statement is, but if the true condition is rare, a simple:
if (start[j] > start[i]) continue;
would speed the loop up by eliminating one test of the current if.

If we have ~13600 jobs in the system, the i and j loops represent a combined trip count of 92.4x10^6 iterations. Even if we could execute 1 condition per cycle (impossible), that would represent 0.033 seconds. At 100000 microseconds, that is only 3 cycles per iteration.

Clearly there might be a better algorithmic solution.

Thanks for all your help.

Dave
Comment 3 Moe Jette 2013-11-16 11:12:57 MST
Let me explain what this logic is doing (I'm not the author, but do have some familiarity).

First of all, it really looks like the system is configured to run more than one job on each resource (CPU). This would be due to configuring "Shared=FORCE" on a partition, which by default oversubscribes each CPU with up to 4 jobs (the count is configurable). The internal table identifies which jobs should run on each resource in each time slice, something like this:

         cpu 0   cpu 1   cpu 2   cpu 3
time 0   job 1   job 1   job 2   job 2
time 1   job 1   job 1   job 3   job 4
time 2   job 1   job 1   job 2   job 2
time 3   job 1   job 1   job 5   job 4

Where the time slices go from time 0 to 3 then repeat.

Basically, what the algorithm does is determine what the scheduling matrix should look like after a job ends. For example, after job 5 ends, then job 3 should get an extra time slice; in fact the rows for times 2 and 3 will go away. On a system with 1000s of jobs, this is a high-overhead operation. Turning off gang scheduling is going to save a lot of time, if that's possible.

If that is not possible or desirable, then replacing the bubble sort with its nested loops (this bit of code: for (i = 0; i < num_jobs; i++) { for (j = i+1; j < num_jobs; j++) {) with a better sort could make a huge difference in performance.
Comment 4 Danny Auble 2013-11-16 11:43:32 MST
Created attachment 515 [details]
slurm.conf

The only difference here (I believe) is that they have changed from proctrack/pgid to proctrack/linuxproc, and the node definition is

Sockets=2 CoresPerSocket=10 ThreadsPerCore=1

instead of

CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1

We also changed MinJobAge=2
Comment 5 Moe Jette 2013-11-16 11:55:10 MST
The Shared=FORCE:20 in slurm.conf is the issue I believe.

I would suggest that we switch to qsort, which will reduce the sort from N^2 to N log(N), a huge difference for large job counts without a great deal of effort. We can probably include that in the next release of version 2.6, with a patch to Mr. Parks next week.

We can then study the algorithm for the next major release. I think we can probably come up with a lower-overhead option than what is currently done, which is basically a complete rebuild of the time*resource matrix previously described.
Comment 6 David Parks 2013-11-16 12:20:19 MST
Hi,

With regards to our configuration settings, we did change:
ProctrackType=proctrack/pgid
to
ProctrackType=proctrack/linuxproc

And, we originally had:

NodeName=wbn-[0001-0680] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN

Then changed to:

NodeName=wbn-[0001-0680] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN

And last ran with:

NodeName=wbn-[0001-0680] CPUs=20 ThreadsPerCore=1 State=UNKNOWN


The termination problem existed with all of the above settings.

Thank you for all your help.

Dave

P.S. Mr. Parks is my father ;-)
Comment 7 Moe Jette 2013-11-16 12:24:48 MST
It seems like half the people that I deal with are named Dave or David. Sorry to get you confused ;)
Comment 8 Moe Jette 2013-11-16 12:30:43 MST
On a more serious note, configuring the partitions with Shared=NO will make this problem go away. Not an ideal solution, but it may be sufficient for you to make some progress while we at least get qsort to replace the bubble sort.
Comment 9 Moe Jette 2013-11-17 02:09:09 MST
Danny and I will be at SC13 this coming week, so our availability will be limited. This is how I would suggest we proceed.

Dave Parks can change the partitions to Shared=NO. This will prevent the CPUs from being oversubscribed even if the user submits a job with the --shared option. This will prevent the slow logic from being executed, and I suspect the client really does not want to oversubscribe the CPUs anyway. If that is sufficient for now, I would suggest that you change the severity from 2 to 3.
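As a sketch, the workaround amounts to a partition definition along these lines (the partition name and the options other than Shared=NO are placeholders; see the attached slurm.conf for the real definition):

```
PartitionName=batch Nodes=wbn-[0001-0680] Default=YES State=UP Shared=NO
```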

Starting on Monday, Dave Bigagli of SchedMD will modify the slow logic to replace the bubble sort with quicksort, which will reduce the overhead from n^2 to n*log(n). He can send you a patch with the change and it will also be in the next release of Slurm, version 2.6.5, likely to be available within a couple of weeks.

Later, I will see if the logic can be substantially restructured for additional performance gains. The restructured logic will not appear until the next major release of Slurm in March 2014, but I believe the other changes described above will at least make the situation acceptable until that time.
Comment 10 David Parks 2013-11-17 02:43:54 MST
Hi Moe and Danny,

I changed Shared=yes to Shared=no on the PartitionName line in slurm.conf, reran the test, and *this* version of the test completed in 2 minutes and 15 seconds. With Shared=yes, the test took over 16 minutes.

I will rerun the complete test and validate that we meet the acceptance criteria, but from the above, we're looking much better than yesterday.

Thank you for all your help.

Dave
Comment 11 David Bigagli 2013-11-18 11:02:53 MST
Created attachment 521 [details]
use qsort()
Comment 12 David Bigagli 2013-11-18 11:04:47 MST
Patch attached. I was able to run 15K jobs submitted as:

repeat 15 sbatch -o /dev/null -N 1 -c 20 --array=1-1000 ./sleepme

All jobs exit at the same time. With MessageTimeout=80, no timeouts were observed while running 'sinfo -i 1' and 'squeue -i 1'.

David
Comment 13 Moe Jette 2013-11-29 03:47:54 MST
The attached patch has been incorporated into the Slurm code base for version 2.6.5. Here are the details:

Cosmetic changes:
https://github.com/SchedMD/slurm/commit/c5bcacd6b44faee2a50f28c55398069102279d85.patch

Modify to use quicksort rather than bubble sort:
https://github.com/SchedMD/slurm/commit/74d1a4b4c95b7a860ffe1649613d2a589cf9a5d6.patch

Remove arrays which were made redundant in the previous commit:
https://github.com/SchedMD/slurm/commit/d704c747f268c9151967b34d9f1031972dd8b1d9.patch

I am going to close this bug given the massive performance improvement provided by these patches. Please re-open this bug if necessary.