| Summary: | slurmctld slowdown when 1000s of jobs finish simultaneously | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | David Parks <dparks> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | CC: | da |
| Version: | 2.6.x | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | CRAY | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, use qsort() | | |
Description
David Parks
2013-11-16 10:07:56 MST
Can you attach the configuration file? In particular, I'm interested in whether gang scheduling is configured. Is "Shared" configured on any of the partitions, and if so, what is the value? Also, what are the PreemptType and PreemptMode values? The variable "num_rows" should be the oversubscription level. Without oversubscription, num_rows should be 1 and the nested loops should not be executed.

From Dave Parks: So I was looking at the loop in timer section #4. (I'm not trying to be a busybody, but I've spent years optimizing HPC applications, often without in-depth knowledge of the underlying physical science.) I have no idea how often the if statement evaluates true, but if it is rarely true, a simple

if (start[j] > start[i]) continue;

would speed the loop up by eliminating one test of the current if. If we have ~13600 jobs in the system, the i and j loops represent a combined trip count of 92.4x10^6 iterations. Even if we could get one comparison per cycle (impossible), that would take 0.033 seconds. At 100000 microseconds, that is only 3 cycles per iteration. Clearly there might be a better algorithmic solution. Thanks for all your help. Dave

Let me explain what this logic is doing (I'm not the author, but do have some familiarity).
First of all, it really looks like the system is configured to run more than one job on each resource (CPU). This would be due to setting the partition parameter Shared=FORCE, which oversubscribes each CPU with up to 4 jobs (the default; the count is configurable). The internal table identifies which jobs should run on each resource in each time slice, something like this:
| | cpu 0 | cpu 1 | cpu 2 | cpu 3 |
|---|---|---|---|---|
| time 0 | job 1 | job 1 | job 2 | job 2 |
| time 1 | job 1 | job 1 | job 3 | job 4 |
| time 2 | job 1 | job 1 | job 2 | job 2 |
| time 3 | job 1 | job 1 | job 5 | job 4 |
Where the time slices go from time 0 to 3 then repeat.
Basically, what the algorithm does is determine what the scheduling matrix should look like after a job ends. For example, after job 5 ends, then job 3 should get an extra time slice; in fact the rows for times 2 and 3 will go away. On a system with 1000s of jobs, this is a high-overhead operation. Turning off gang scheduling is going to save a lot of time, if that's possible.
If that is not possible or desirable, then replacing the nested-loop bubble sort (this bit of code: for (i = 0; i < num_jobs; i++) { for (j = i+1; j < num_jobs; j++) {) with a better sort could make a huge difference in performance.
Created attachment 515 [details]
slurm.conf
The only difference here (I believe) is that they have changed from proctrack/pgid to proctrack/linuxproc, and the node definition is
Sockets=2 CoresPerSocket=10 ThreadsPerCore=1
instead of
CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1
We also changed MinJobAge=2
The Shared=FORCE:20 in slurm.conf is the issue, I believe. I would suggest that we switch to qsort, which will reduce the sort from N^2 to N log(N), a huge difference for large job counts without a great deal of effort. We can probably include that in the next release of version 2.6, with a patch to Mr. Parks next week. We can then study the algorithm for the next major release. I think that we can probably come up with a lower-overhead option than what is currently done, which is basically a complete rebuild of the time*resource matrix previously described.

Hi, With regards to our configuration settings, we did change ProctrackType=proctrack/pgid to ProctrackType=proctrack/linuxproc. And, we originally had:
NodeName=wbn-[0001-0680] CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN
Then changed to:
NodeName=wbn-[0001-0680] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN
And last ran with:
NodeName=wbn-[0001-0680] CPUs=20 ThreadsPerCore=1 State=UNKNOWN
The termination problem existed with all of the above settings. Thank you for all your help. Dave P.S. Mr. Parks is my father ;-)

It seems like half the people that I deal with are named Dave or David. Sorry to get you confused ;) On a more serious note, configuring the partitions with Shared=NO will make this problem go away. Not an ideal solution, but it may be sufficient for you to make some progress while we at least get qsort to replace the bubble sort. Danny and I will be at SC13 this coming week, so our availability will be limited.

This is how I would suggest we proceed. Dave Parks can change the partitions to Shared=NO. This will prevent the CPUs from being oversubscribed even if the user submits a job with the --shared option. This will prevent the slow logic from being executed, and I suspect that the client really does not want to oversubscribe the CPUs anyway. If that is sufficient for now, I would suggest that you change the severity from 2 to 3.
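For reference, the workaround amounts to a partition definition along these lines (a hypothetical sketch; the partition name and other options are illustrative, and only the Shared value matters here):

```
# Illustrative slurm.conf fragment, not the site's actual file.
# The reported problem occurred with CPU oversubscription
# (Shared=FORCE:20); Shared=NO disables oversubscription entirely
# and so bypasses the slow gang-scheduling logic.
PartitionName=batch Nodes=wbn-[0001-0680] Shared=NO State=UP
```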
Starting on Monday, Dave Bigagli of SchedMD will modify the slow logic to replace the bubble sort with quicksort, which will reduce the overhead from n^2 to n*log(n). He can send you a patch with the change, and it will also be in the next release of Slurm, version 2.6.5, likely to be available within a couple of weeks. Later, I will see if the logic can be substantially restructured for additional performance gains. The restructured logic will not appear until the next major release of Slurm in March 2014, but I believe the other changes described above will at least make the situation acceptable until that time.

Hi Moe and Danny, I changed the Shared=yes attribute to Shared=no on the PartitionName in slurm.conf, reran the test, and *this* version of the test completed in 2 minutes and 15 seconds. With Shared=yes, the test took over 16 minutes. I will rerun the complete test and validate that we meet the acceptance criteria, but from the above, we're looking much better than yesterday. Thank you for all your help. Dave

Created attachment 521 [details]
use qsort()
Patch attached. I was able to run 15K jobs submitted as: repeat 15 sbatch -o /dev/null -N 1 -c 20 --array=1-1000 ./sleepme with all of them exiting at the same time. MessageTimeout=80. No timeouts were observed while running 'sinfo -i 1' and 'squeue -i 1'. David

The attached patch has been incorporated into the Slurm code base for version 2.6.5. Here are the details:

Cosmetic changes: https://github.com/SchedMD/slurm/commit/c5bcacd6b44faee2a50f28c55398069102279d85.patch
Modify to use quicksort rather than bubble sort: https://github.com/SchedMD/slurm/commit/74d1a4b4c95b7a860ffe1649613d2a589cf9a5d6.patch
Remove arrays which were made redundant in the previous commit: https://github.com/SchedMD/slurm/commit/d704c747f268c9151967b34d9f1031972dd8b1d9.patch

I am going to close this bug given the massive performance improvement provided by these patches. Please re-open this bug if necessary.