| Summary: | squeue scalability issues | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Yiannis Georgiou <yiannis.georgiou> |
| Component: | Other | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | da, jblomqvist |
| Version: | 2.6.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Universitat Dresden (Germany) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | change list sort to merge sort, fix list sort2 | | |
Description
Yiannis Georgiou
2013-06-17 06:25:39 MDT
Yiannis, this patch appears to have a great deal of merit. We have tested it a bit and haven't found any issues yet; as you report, it appears to give a huge speed up as well. We are going to play with it a bit more. I am hesitant to put it into 2.6, but since it appears to speed things up so much we hope to get it in. I have committed this to the master branch, commit 32649d6af01092580411bb14a0484b02594a8eea. You are correct that list.c doesn't follow the normal programming styles. This is because it was lifted from other code, and to avoid breaking diffs against that code we left it the way it was. I'll let Moe reply more on the other parts of this bug.

A site submitting large numbers of jobs at the same time reported problems with Slurm locking. I believe this is the same problem that you report. Write locks have a higher priority than read locks. This is because more than one read lock can be held at the same time (two squeue commands can run concurrently). If the read locks arrive at different times, then a large number of read locks could keep the write lock from happening until all read locks complete, possibly a long time later. However, since write locks have a higher priority than read locks, a large number of write locks (say, to submit many jobs) can prevent the read lock from being made, so a large number of sbatch commands can prevent an squeue from completing. To fix this I added logic so that a read lock would ignore the pending write lock every so often. The code has a ratio of 10 to 1, so on a very heavily loaded system there would be one read lock made for every 10 write locks. The original code had a ratio of 4 to 1, which resulted in the system becoming unresponsive after several days. It did not deadlock, but spent all of its time in some sort of loop of locking and unlocking. The site in question tried different ratios and settled on a ratio of 7 to 1.
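The starvation-avoidance counter described above can be sketched roughly as follows. This is a hypothetical illustration with made-up names; the real logic lives in src/slurmctld/locks.c:

```c
#include <stdbool.h>

/* Sketch of the read-vs-write priority counter: writers normally win,
 * but after WRITE_RATIO consecutive write grants one pending reader is
 * let through so that reads can never starve completely. */
#define WRITE_RATIO 10

struct lock_state {
	int pending_writers;	/* writers currently waiting for the lock */
	int writes_since_read;	/* write grants since the last read grant */
};

/* Return true if a waiting reader may proceed now. */
static bool reader_may_proceed(struct lock_state *s)
{
	if (s->pending_writers == 0)
		return true;		  /* no writer waiting: read freely */
	if (s->writes_since_read > WRITE_RATIO) {
		s->writes_since_read = 0; /* the write streak is long enough... */
		return true;		  /* ...let one reader jump the queue */
	}
	return false;			  /* otherwise writers keep priority */
}
```

Changing WRITE_RATIO corresponds to the "> 10" versus "> 7" tuning discussed below.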
I have never been able to reproduce this problem, but using the Dresden configuration we will try again.

The site in question was not able to supply configuration information or logs, so I do not know exactly what was happening, and I did not enable this logic in version 2.6 because of these unknowns. All of the counters and logic exist, however; you just need to change the "#if 1" to "#if 0" in src/slurmctld/locks.c at line 202 and the ratio logic on line 209 from "> 10" to "> 7". The comment there explains this problem a bit more. You might need to experiment with the ratio based upon the configuration.

Yiannis, it turns out this patch causes memory corruption. If you are running with it anywhere, beware: you will have issues. Commit 7b66e6499d36625f87f808fb2c94b9fdc875d5e8 fixes the memory corruption, but it is unclear if it does the right thing. With that said, I reverted back to the old sort method in commit 61a125bf1243451e505b78803d1bc110ad22a199, leaving the new code as list_sort2 so that when we come up with a legitimate patch we can just switch. If you can have Dominik look at this closer, that would be great. It appears the easiest way to trigger the memory corruption is to add or remove a user from the database; the slurmdbd will either core dump or the message received from the slurmctld will be all bad. Let me know what you find out.

Hi Danny, here is a fix from Dominik Friedrich for list_sort2. I attach the fix, and here are his comments:

"The line removed in the second patch was indeed the cause of the memory corruption. I didn't see that the pointer "tail" in the list structure had to point at the last item's next pointer, and I updated it to point at the last item itself. The next time an item was added to the end of the list, the memory corruption happened. To correct this and enable the new sorting algorithm again, the attached patch against the current Slurm HEAD from github can be used.
I've tested it on my local machine by adding and removing users with sacctmgr, as Danny mentioned."

Let me know if you agree or if you have any comments. Thanks, Yiannis

Created attachment 307 [details]
fix list sort2
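The tail invariant Dominik describes can be illustrated with a minimal sketch. This is a hypothetical list layout mirroring the style of list.c, not Slurm's actual code: `tail` must hold the address of the last node's `next` field, not the last node itself, so that appending never has to walk the list and the empty list needs no special case:

```c
#include <stdlib.h>

struct node {
	int value;
	struct node *next;
};

struct list {
	struct node *head;
	struct node **tail;	/* address of the last node's next pointer */
};

static void list_init(struct list *l)
{
	l->head = NULL;
	l->tail = &l->head;	/* empty list: tail is the head slot itself */
}

static void list_append(struct list *l, struct node *n)
{
	n->next = NULL;
	*l->tail = n;		/* link n after the current last node */
	l->tail = &n->next;	/* restore the invariant for the next append */
}
```

If `tail` were instead set to the last node (the bug described above), the next append would scribble over the node itself rather than through its `next` field, which is exactly the corruption seen in slurmdbd.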
Yiannis, this code looks quite suspicious. It appears to set the tail of the list to the address of next, which is pointing to NULL. While it seems strange, this appears to be what was happening in the older code as well. I will apply the patch, but we have made the decision that this will not make it into 2.6; we will leave the merge sort as list_sort2. If you want to use it, just change the name back. We have looked closer at the list code in general and we believe most of it should be rewritten anyway. There are questionable spots elsewhere in the code as well, like if (!(p->next = *pp)), an assignment used as a truth test, which seems very strange.

Changing "assigned to" since Danny has been actively working on this.

(In reply to Moe Jette from comment #3)
> A site submitting large numbers of jobs at the same time reported
> problems with Slurm locking. I believe this is the same problem that you
> report. Write locks have a higher priority than read locks. This is
> because more than one read lock can take place at the same time (two
> squeue commands can run at the same time). If the read locks arrive at
> different times then a large number of read locks could keep the write
> lock from happening until all read locks complete, possibly a long time
> later.

I think we're occasionally seeing this problem as well.

> However since write locks have a higher priority than read locks, a large
> number of write locks (say to submit many jobs) can prevent the read lock
> from being made, so a large number of sbatch commands can prevent the
> squeue from completing. To fix this I added logic so that a read lock
> would ignore the pending write lock every so often. The code has a ratio
> of 10 to 1 so on a very heavily loaded system, there would be one read
> lock made for every 10 write locks. The original code had a ratio of
> 4 to 1, which resulted in the system becoming unresponsive after several
> days. It did not deadlock, but spent all of its time in some sort of loop
> of locking and unlocking.
> The site in question tried different ratios and settled on a ratio
> of 7 to 1. I have never been able to reproduce this problem, but using
> the Dresden configuration we will try again.
>
> The site in question was not able to supply configuration information or
> logs, so I do not know exactly what was happening and I did not enable
> this logic in version 2.6 because of these unknowns, but all of the
> counters and logic exist, you just need to change the "#if 1" to "#if 0"
> in src/slurmctld/locks.c at line 202 and the ratio logic on line 209 from
> "> 10" to "> 7". The comment explains this problem a bit more. You might
> need to experiment with the ratio based upon the configuration.

I'm not sure this will really fix the issue; by changing the ratio you essentially just need a slightly different workload for the starvation to occur. What can be done instead is to enqueue the waiters on a FIFO queue. To better take advantage of the multi-reader property of the lock, you can pack several readers onto the same queue element and wake all of them at the same time with pthread_cond_broadcast. Googling around a bit, an open source C implementation of (roughly) this idea is at

http://www.shlomifish.org/rwlock/

with the algorithm described at

http://www.shlomifish.org/rwlock/Scheme_RLE.txt

If you're interested, I can take a shot at implementing a FIFO rwlock in the next few weeks; it should be quite straightforward.

Or for something even fancier and somewhat pie-in-the-sky, one could use RCU instead of rwlocks. This provides wait-free readers (that is, readers *never* block). RCU is widely used in the Linux kernel, but there exists a userspace implementation as well at

http://lttng.org/urcu

The urcu library also contains a bunch of concurrent RCU-based data structures such as hash tables, queues, and lists.
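The FIFO rwlock suggested above might be sketched roughly as follows, assuming plain pthreads. All names are hypothetical, and pthread mutexes are only approximately FIFO, but queuing every locker on a common `order` mutex keeps either class from starving the other indefinitely:

```c
#include <pthread.h>

/* Fair-ish rwlock sketch: every locker first queues on `order`, so a
 * stream of writers cannot starve readers (or vice versa).  Readers
 * release `order` as soon as they are admitted, so readers that arrive
 * back to back still run concurrently. */
typedef struct {
	pthread_mutex_t order;	/* waiters line up here, roughly FIFO */
	pthread_mutex_t lock;	/* protects the counters below */
	pthread_cond_t  cond;
	int readers;		/* number of active readers */
	int writer;		/* 1 while a writer holds the lock */
} fair_rwlock_t;

#define FAIR_RWLOCK_INIT { PTHREAD_MUTEX_INITIALIZER, \
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

void fair_rdlock(fair_rwlock_t *rw)
{
	pthread_mutex_lock(&rw->order);		/* take a place in line */
	pthread_mutex_lock(&rw->lock);
	while (rw->writer)
		pthread_cond_wait(&rw->cond, &rw->lock);
	rw->readers++;
	pthread_mutex_unlock(&rw->lock);
	pthread_mutex_unlock(&rw->order);	/* admit the next waiter */
}

void fair_rdunlock(fair_rwlock_t *rw)
{
	pthread_mutex_lock(&rw->lock);
	if (--rw->readers == 0)
		pthread_cond_broadcast(&rw->cond); /* wake a queued writer */
	pthread_mutex_unlock(&rw->lock);
}

void fair_wrlock(fair_rwlock_t *rw)
{
	pthread_mutex_lock(&rw->order);
	pthread_mutex_lock(&rw->lock);
	while (rw->writer || rw->readers)
		pthread_cond_wait(&rw->cond, &rw->lock);
	rw->writer = 1;
	pthread_mutex_unlock(&rw->lock);
	pthread_mutex_unlock(&rw->order);
}

void fair_wrunlock(fair_rwlock_t *rw)
{
	pthread_mutex_lock(&rw->lock);
	rw->writer = 0;
	pthread_cond_broadcast(&rw->cond);	/* wake all queued waiters */
	pthread_mutex_unlock(&rw->lock);
}
```

The rwlock at the link above batches readers onto queue elements explicitly; this sketch gets a similar effect more crudely, since a writer waiting on `order` blocks later arrivals while already-admitted readers drain.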
There's an article describing userspace RCU at

http://www.efficios.com/pub/rcu/urcu-main.pdf
http://www.efficios.com/pub/rcu/urcu-supp.pdf

The comparisons to mutexes and rwlocks in that article are quite interesting!

OK, 13.12 now has this as the default sort, commit 61a125bf1243451e505b78803d1bc110ad22a199. I didn't feel good about putting this into 2.6, but it does appear to work just fine now. The above patch can be used against 2.6 to make it work there if you want.

I am reopening the bug. The new sort algorithm is breaking three tests (21.22, 22.1 and 24.1). In all three cases no element is lost, but the order is different. Sample output from test24.1 is included below, without and with the change. The new sort algorithm is currently in Slurm's "master" branch.

============================================
TEST: 24.1
build_dir is /home/jette/SLURM/build_smd
gcc test24.1.prog.c -g -pthread -o test24.1.prog -I/home/jette/SLURM/build_smd -I/home/jette/SLURM/slurm.git /home/jette/SLURM/build_smd/src/api/libslurm.o /home/jette/SLURM/build_smd/src/slurmctld/locks.o /home/jette/SLURM/build_smd/src/sshare/process.o -ldl -lm -lhwloc -export-dynamic
spawn ./test24.1.prog
test24.1.prog: No last decay (/dev/null/priority_last_decay_ran) to recover
test24.1.prog: error: Can not save priority state information, StateSaveLocation is /dev/null
This error is expected. No worries.
root|||1.000000|100||1.000000|0.500000|0|0|
AccountA||40|0.400000|45|0.450000|0.450000|0.458502|0|0|
AccountB||30|0.300000|20|0.200000|0.387500|0.408479|0|0|
AccountB|User1|1|0.300000|20|0.200000|0.387500|0.408479|0|0|
AccountC||10|0.100000|25|0.250000|0.300000|0.125000|0|0|
AccountC|User2|1|0.050000|25|0.250000|0.275000|0.022097|0|0|
AccountC|User3|1|0.050000|0|0.000000|0.150000|0.125000|0|0|
AccountD||60|0.600000|25|0.250000|0.250000|0.749154|0|0|
AccountE||25|0.250000|25|0.250000|0.250000|0.500000|0|0|
AccountE|User4|1|0.250000|25|0.250000|0.250000|0.500000|0|0|
AccountF||35|0.350000|0|0.000000|0.145833|0.749154|0|0|
AccountF|User5|1|0.350000|0|0.000000|0.145833|0.749154|0|0|
AccountG||0|0.000000|30|0.300000|0.300000|0.000000|0|0|
AccountG|User6|0|-nan|30|0.300000|-nan|nan|0|0|
SUCCESS
============================================

============================================
TEST: 24.1
build_dir is /home/jette/SLURM/build_smd
gcc test24.1.prog.c -g -pthread -o test24.1.prog -I/home/jette/SLURM/build_smd -I/home/jette/SLURM/slurm.git /home/jette/SLURM/build_smd/src/api/libslurm.o /home/jette/SLURM/build_smd/src/slurmctld/locks.o /home/jette/SLURM/build_smd/src/sshare/process.o -ldl -lm -lhwloc -export-dynamic
spawn ./test24.1.prog
test24.1.prog: No last decay (/dev/null/priority_last_decay_ran) to recover
test24.1.prog: error: Can not save priority state information, StateSaveLocation is /dev/null
This error is expected. No worries.
AccountB|User1|1|0.300000|20|0.200000|0.387500|0.408479|0|0|
AccountC|User2|1|0.050000|25|0.250000|0.275000|0.022097|0|0|
AccountC|User3|1|0.050000|0|0.000000|0.150000|0.125000|0|0|
AccountE|User4|1|0.250000|25|0.250000|0.250000|0.500000|0|0|
AccountF|User5|1|0.350000|0|0.000000|0.145833|0.749154|0|0|
AccountG|User6|0|-nan|30|0.300000|-nan|nan|0|0|
AccountA||40|0.400000|45|0.450000|0.450000|0.458502|0|0|
AccountB||30|0.300000|20|0.200000|0.387500|0.408479|0|0|
AccountC||10|0.100000|25|0.250000|0.300000|0.125000|0|0|
AccountD||60|0.600000|25|0.250000|0.250000|0.749154|0|0|
AccountE||25|0.250000|25|0.250000|0.250000|0.500000|0|0|
AccountF||35|0.350000|0|0.000000|0.145833|0.749154|0|0|
AccountG||0|0.000000|30|0.300000|0.300000|0.000000|0|0|
root|||1.000000|100||1.000000|0.500000|0|0|
FAILURE: we didn't get the correct priorities from the plugin (7 != 13)
test24.1 FAILURE
============================================

(In reply to Janne Blomqvist from comment #9)
> (In reply to Moe Jette from comment #3)
> > A site submitting large numbers of jobs at the same time reported
> > problems with Slurm locking. I believe this is the same problem that you
> > report. Write locks have a higher priority than read locks. This is
> > because more than one read lock can take place at the same time (two
> > squeue commands can run at the same time). If the read locks arrive at
> > different times then a large number of read locks could keep the write
> > lock from happening until all read locks complete, possibly a long time
> > later.
>
> I think we're occasionally seeing this problem as well.
>
> > However since write locks have a higher priority than read locks, a
> > large number of write locks (say to submit many jobs) can prevent the
> > read lock from being made, so a large number of sbatch commands can
> > prevent the squeue from completing. To fix this I added logic so that a
> > read lock would ignore the pending write lock every so often.
> > The code has a ratio of 10 to 1, so on a very heavily loaded system
> > there would be one read lock made for every 10 write locks. The
> > original code had a ratio of 4 to 1, which resulted in the system
> > becoming unresponsive after several days. It did not deadlock, but
> > spent all of its time in some sort of loop of locking and unlocking.
> > The site in question tried different ratios and settled on a ratio of
> > 7 to 1. I have never been able to reproduce this problem, but using the
> > Dresden configuration we will try again.
> >
> > The site in question was not able to supply configuration information
> > or logs, so I do not know exactly what was happening and I did not
> > enable this logic in version 2.6 because of these unknowns, but all of
> > the counters and logic exist, you just need to change the "#if 1" to
> > "#if 0" in src/slurmctld/locks.c at line 202 and the ratio logic on
> > line 209 from "> 10" to "> 7". The comment explains this problem a bit
> > more. You might need to experiment with the ratio based upon the
> > configuration.
>
> I'm not sure this will really fix the issue; by changing the ratio you
> essentially just need a slightly different workload for the starvation to
> occur. What can be done instead is to enqueue the waiters on a FIFO
> queue. To better take advantage of the multi-reader property of the lock,
> you can pack several readers onto the same queue element and wake all of
> them at the same time with pthread_cond_broadcast. Googling around a bit,
> an open source C implementation of (roughly) this idea is at
>
> http://www.shlomifish.org/rwlock/
>
> with the algorithm described at
>
> http://www.shlomifish.org/rwlock/Scheme_RLE.txt
>
> If you're interested, I can take a shot at implementing a FIFO rwlock in
> the next few weeks; it should be quite straightforward.
>
> Or for something even fancier and somewhat pie-in-the-sky, one could use
> RCU instead of rwlocks.
> This provides wait-free readers (that is, readers *never* block). RCU is
> widely used in the Linux kernel, but there exists a userspace
> implementation as well at
>
> http://lttng.org/urcu
>
> The urcu library also contains a bunch of concurrent RCU-based data
> structures such as hash tables, queues, and lists.
>
> There's an article describing userspace RCU at
>
> http://www.efficios.com/pub/rcu/urcu-main.pdf
> http://www.efficios.com/pub/rcu/urcu-supp.pdf
>
> The comparisons to mutexes and rwlocks in that article are quite
> interesting!

In the case that I studied, the problem was mostly one of too much competition for the write locks, due to 100 threads submitting batch jobs in parallel. I modified a few RPCs, including batch job submit, to permit only one of them at a time to compete for the write locks. It slightly slows down batch job submission, but greatly improves overall system performance. This logic is in v2.6.0.
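The change described above can be sketched like this. It is a hypothetical illustration of the idea, not the actual v2.6.0 code: a dedicated mutex lets only one submit RPC at a time line up for the controller's write locks, so a burst of submitting threads no longer drowns out readers:

```c
#include <pthread.h>

/* Gate batch-submit RPC handlers so that only one of them at a time
 * competes for the (stand-in) controller write lock.  The other
 * submitting threads wait here instead of piling onto the internal
 * read/write lock, which keeps squeue-style readers responsive. */
static pthread_mutex_t submit_gate = PTHREAD_MUTEX_INITIALIZER;

/* Stand-ins for the controller's write lock and job table. */
static pthread_mutex_t job_write_lock = PTHREAD_MUTEX_INITIALIZER;
static int jobs_submitted;

static void handle_batch_submit_rpc(void)
{
	pthread_mutex_lock(&submit_gate);	/* one submitter at a time */
	pthread_mutex_lock(&job_write_lock);	/* then take the write lock */
	jobs_submitted++;			/* create the job record */
	pthread_mutex_unlock(&job_write_lock);
	pthread_mutex_unlock(&submit_gate);
}
```

Each submission pays one extra uncontended mutex in the common case, which matches the observation that submission gets slightly slower while overall responsiveness improves.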
We should also consider the C library's qsort(), which is fast and leaves us less complex code to maintain. Pseudocode for such a sorting function looks like this:

    /* Drain the list into a flat array of element pointers. */
    n = 0;
    while ((e = list_pop(l)))
        v[n++] = e;
    lsize = n;

    /* Sort the array of pointers with the caller's comparator. */
    qsort(v, lsize, sizeof(void *), (__compar_fn_t)f);

    /* Rebuild the list in sorted order. */
    for (n = 0; n < lsize; n++)
        list_append(l, v[n]);

We are running some performance tests and the results seem to be quite encouraging.
David
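A self-contained version of this approach can be sketched as follows, using a minimal hypothetical list type rather than Slurm's actual list.c API:

```c
#include <stdlib.h>

/* Minimal singly linked list, standing in for Slurm's List type. */
struct node { int value; struct node *next; };

/* qsort() comparator: each array element is a struct node pointer. */
static int cmp_nodes(const void *a, const void *b)
{
	const struct node *x = *(const struct node *const *)a;
	const struct node *y = *(const struct node *const *)b;
	return (x->value > y->value) - (x->value < y->value);
}

/* Sort the list by draining it into an array, qsort()ing the array of
 * node pointers, and relinking the nodes in sorted order. */
static struct node *list_sort_qsort(struct node *head, size_t n)
{
	struct node **v;
	size_t i = 0;

	if (n == 0)
		return NULL;
	v = malloc(n * sizeof(*v));
	for (struct node *e = head; e; e = e->next)	/* drain into array */
		v[i++] = e;
	qsort(v, n, sizeof(*v), cmp_nodes);		/* sort the pointers */
	for (i = 0; i + 1 < n; i++)			/* relink in order */
		v[i]->next = v[i + 1];
	v[n - 1]->next = NULL;
	head = v[0];
	free(v);
	return head;
}
```

One caveat worth noting: qsort() is not required to be stable, so elements that compare equal may come out in a different relative order than a merge sort would produce.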
We have implemented list_sort() using the C library's qsort(). The code is committed to the master branch.

David