Ticket 336

Summary: squeue scalability issues
Product: Slurm    Reporter: Yiannis Georgiou <yiannis.georgiou>
Component: Other    Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 2 - High Impact
Priority: ---    CC: da, jblomqvist
Version: 2.6.x
Hardware: Linux
OS: Linux
Site: Universitat Dresden (Germany)
Attachments: change list sort to merge sort
fix list sort2

Description Yiannis Georgiou 2013-06-17 06:25:39 MDT
Created attachment 291 [details]
change list sort to merge sort

Hello,

In Dresden they have a need for high-throughput computing, with workloads reaching 120K jobs in the Pending state. In this context squeue becomes quite slow. The main culprit seems to be the list_sort function in common/list.c. The submitted patch changes it to a merge sort, which improved squeue's speed considerably with such a full queue.

without the patch:
$ time squeue | wc -l
120034

real    4m16.427s
user    4m13.770s
sys    0m0.900s

with the patch:
$ time /home/bull/dominik/slurm/src/squeue/squeue | wc -l
120034

real    0m5.943s
user    0m4.080s
sys    0m0.790s

I've noticed that the code in common/list.c does not follow the coding conventions used in the rest of Slurm... I kept the new code consistent with the conventions of that file.
The code seems to work fine; let me know if you find any issues.

The patch was written by Dominik Friedrich <D.Friedrich@bull.de>, project manager for the TU-Dresden cluster at Bull. Please credit him in the commit.

Even though this improves the performance of squeue with many pending jobs, we still see issues when submitting a large number of jobs, as we discussed by email last week:

squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation 

Moe mentioned that the version 2.6 code has new variables and counters so that a site can easily enable a different ratio of write locks per read lock in src/slurmctld/locks.c. How can we find the best ratio for a particular configuration? Only by testing, I imagine. Any ideas for a starting ratio? The cluster has 450 nodes, and there could be 20K jobs in the queue, each asking for 1 node with a small duration.
Comment 1 Danny Auble 2013-06-17 09:35:12 MDT
Yiannis, this patch appears to have a great deal of merit; we have tested it a bit and haven't found any issues yet.  As you report, it gives a huge speed-up as well.  We are going to experiment with it a bit more.  I am hesitant to put it into 2.6, but since it appears to speed things up so much we hope to get it in.
Comment 2 Danny Auble 2013-06-17 11:53:45 MDT
I have committed this to the master branch commit 32649d6af01092580411bb14a0484b02594a8eea.

You are correct that list.c doesn't follow our normal programming style.  It was lifted from other code, and we left it as it was so as not to break diffs against the original.

I'll let Moe reply more on the other parts of this bug.
Comment 3 Moe Jette 2013-06-17 17:33:12 MDT
A site submitting large numbers of jobs at the same time reported problems with Slurm locking. I believe this is the same problem that you report. Write locks have a higher priority than read locks because more than one read lock can be held at the same time (two squeue commands can run concurrently). If read locks arrive at different times, a large number of them could keep a write lock from being granted until all read locks complete, possibly a long time later.

However, since write locks have a higher priority than read locks, a large number of write locks (say, to submit many jobs) can prevent a read lock from being granted, so a large number of sbatch commands can prevent squeue from completing. To fix this I added logic so that a read lock would ignore the pending write locks every so often. The code uses a ratio of 10 to 1, so on a very heavily loaded system there would be one read lock granted for every 10 write locks. The original code had a ratio of 4 to 1, which resulted in the system becoming unresponsive after several days. It did not deadlock, but spent all of its time in some sort of loop of locking and unlocking. The site in question tried different ratios and settled on 7 to 1. I have never been able to reproduce this problem, but using the Dresden configuration we will try again.

The site in question was not able to supply configuration information or logs, so I do not know exactly what was happening, and I did not enable this logic in version 2.6 because of these unknowns. All of the counters and logic exist, though: just change the "#if 1" to "#if 0" in src/slurmctld/locks.c at line 202 and the ratio logic on line 209 from "> 10" to "> 7". The comment there explains the problem a bit more. You might need to experiment with the ratio based upon your configuration.
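The counter idea described above can be sketched as a small decision helper. This is a simplified illustration with hypothetical names (`lock_counters`, `reader_may_proceed`), not the actual locks.c code:

```c
#include <stdbool.h>

/* Hypothetical sketch of the ratio logic: pending writers normally win,
 * but after WRITE_WAIT_RATIO deferred reads one reader is let through,
 * so readers cannot be starved indefinitely.  Per the comment above, a
 * heavily loaded site might try 7 instead of 10. */
#define WRITE_WAIT_RATIO 10

struct lock_counters {
	int pending_writers;	/* writers currently queued for the lock */
	int reads_deferred;	/* consecutive reads deferred to writers */
};

/* Return true if a new read lock may proceed despite pending writers. */
bool reader_may_proceed(struct lock_counters *c)
{
	if (c->pending_writers == 0) {
		c->reads_deferred = 0;
		return true;	/* no contention: readers always proceed */
	}
	if (++c->reads_deferred > WRITE_WAIT_RATIO) {
		c->reads_deferred = 0;	/* let one reader jump the queue */
		return true;
	}
	return false;		/* defer to the pending writers */
}
```

With a ratio of 10, ten consecutive reads defer to pending writers and the eleventh is allowed through.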
Comment 4 Danny Auble 2013-06-19 11:50:26 MDT
Yiannis, it turns out this patch causes memory corruption.  If you are running with it anywhere, beware: you will have issues.

Commit 7b66e6499d36625f87f808fb2c94b9fdc875d5e8 fixes the memory corruption, but it is unclear if it does the right thing.  With that said I reverted back to the old sort method in patch 61a125bf1243451e505b78803d1bc110ad22a199 leaving the new code as list_sort2 so when we come up with a legitimate patch we can just switch.

If you can have Dominik look at this more closely, that would be great.

It appears the easiest way to trigger the memory corruption is simply to add or remove a user from the database.  The slurmdbd will either core dump or the message received from the slurmctld will be garbage.  Let me know what you find out.
Comment 5 Yiannis Georgiou 2013-06-24 05:06:40 MDT
Hi Danny,

here is a fix from Dominik Friedrich for list_sort2:

I attach the fix and here are his comments :

"The line removed in the second patch was indeed the cause for the memory corruption. I didn't see that the pointer "tail" in the list structure had to point at the lasts item next pointer and updated it to point at the last item. Next time an item gets added to the end of the list the memory corruption happened.

To correct this and enable the new sorting algorithm again, the attached patch against the current slurm HEAD from github can be used. I've tested it on my local machine by adding and removing users with sacctmgr as Danny mentioned."
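To make the invariant concrete: in this style of list, "tail" holds the address of the last node's next field (or of head when the list is empty), so an append always writes through *tail. A minimal sketch with hypothetical names, not Slurm's actual list.c:

```c
#include <stddef.h>

struct node {
	int value;
	struct node *next;
};

struct xlist {
	struct node *head;
	struct node **tail;	/* address of the last node's next field */
};

void xlist_init(struct xlist *l)
{
	l->head = NULL;
	l->tail = &l->head;	/* empty list: tail points at head itself */
}

void xlist_append(struct xlist *l, struct node *n)
{
	n->next = NULL;
	*l->tail = n;		/* link n after the current last node */
	l->tail = &n->next;	/* advance to n's next FIELD; pointing tail
				 * at n itself is exactly the slip that
				 * caused the corruption described above */
}
```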

Let me know if you agree or if you have any comments.
Thanks,
Yiannis
Comment 6 Yiannis Georgiou 2013-06-24 05:08:06 MDT
Created attachment 307 [details]
fix list sort2
Comment 7 Danny Auble 2013-06-25 11:39:10 MDT
Yiannis, this code looks quite suspicious. It appears to set the tail of the list to the address of a next pointer that is NULL.  Strange as that seems, it is what the older code was doing as well.

I will apply the patch, but we have decided it will not make it into 2.6; we will leave the merge sort as list_sort2.  If you want to use it, just change the name back.

We have looked more closely at the list code in general, and we believe most of it should be rewritten anyway.  There are questionable spots elsewhere in the code as well, such as

 if (!(p->next = *pp))

which seems very strange.
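For what it's worth, that construct is a terse but common linked-list idiom: it assigns *pp to p->next and then tests whether the assigned value was NULL, i.e. whether p just became the last element. A sketch of the equivalent multi-step form, with a hypothetical node type for illustration:

```c
#include <stddef.h>

struct node {
	struct node *next;
};

/* Insert p in front of the node that *pp currently points at.
 * The one-liner  if (!(p->next = *pp))  combines the first assignment
 * and the NULL test below into a single expression. */
void insert_before(struct node *p, struct node **pp)
{
	p->next = *pp;		/* step 1: link p ahead of *pp */
	if (p->next == NULL) {
		/* step 2: p is now the last element; this is where a
		 * tail pointer would typically be updated */
	}
	*pp = p;		/* step 3: make the predecessor point at p */
}
```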
Comment 8 Moe Jette 2013-06-28 10:06:02 MDT
Changing "assigned to" since Danny has been actively working on this
Comment 9 Janne Blomqvist 2013-07-04 21:00:27 MDT
(In reply to Moe Jette from comment #3)
> A site in submitting large numbers of jobs at the same time reported
> problems with Slurm locking. I believe this is the same problem that you
> report. Write locks have a higher priority than read locks. This is because
> more than one read lock can take place at the same time (two squeue commands
> can run at the same time). If the read locks arrive at different times then
> a large number of read locks could keep the write lock from happening until
> all read locks complete, possibly a long time later.

I think we're occasionally seeing this problem as well. 

> However since write locks have a higher priority than read locks, a large
> number of write locks (say to submit many jobs) can prevent the read lock
> from being made, so a large number of sbatch command can prevent the squeue
> from completing. To fix this I added logic so that a read lock could be
> would ignore the pending write lock every so often. The code has a ratio of
> 10 to 1 so on a very heavily loaded system, there would be one read lock
> made for every 10 write locks. The original code had a ratio of 4 to 1,
> which resulted in the system stopping responding after several days. It did
> not deadlock, but spent all of its time in some sort of loop of locking and
> unlocking. The site in question tried different ratio and settled on a ratio
> of 7 to 1. I have never been able to reproduce this problem, but using the
> Dresden configuration we will try again.
> 
> The site in question was not able to supply configuration information or
> logs, so I do not know exactly what was happening and I did not enable this
> logic in version 2.6 because of these unknowns, but all of the counters and
> logic exist, you just need to change the "#if 1" to "#if 0" in
> src/slurmctld/locks.c at line 202 and the ratio logic on line 209 from ">
> 10" to "> 7". The comment explains this problem a bit more. You might need
> to experiment with the ratio based upon the configuration.

I'm not sure this will really fix the issue; by changing the ratio you essentially just need a slightly different workload for the starvation to occur. What can be done instead is to enqueue the waiters on a FIFO queue. To better take advantage of the multi-reader property of the lock, you can pack several readers into the same queue element and wake all of them at the same time with pthread_cond_broadcast. Googling around a bit, an open-source C implementation of (roughly) this idea is at 

http://www.shlomifish.org/rwlock/

with the algorithm described at

http://www.shlomifish.org/rwlock/Scheme_RLE.txt

If you're interested, I can take a shot at implementing a FIFO rwlock in the next few weeks; it should be quite straightforward. 
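A minimal ticket-based sketch of the FIFO idea (illustrative only; this is our simplification, not the shlomifish.org implementation): each waiter takes a ticket; a reader that enters immediately passes the baton to the next ticket, so consecutive readers run concurrently, while a writer holds the baton until it unlocks.

```c
#include <pthread.h>

struct fifo_rwlock {
	pthread_mutex_t mtx;
	pthread_cond_t cond;
	unsigned long next_ticket;	/* handed to each arriving waiter */
	unsigned long now_serving;	/* ticket currently allowed to enter */
	int active_readers;		/* readers currently inside the lock */
};

#define FIFO_RWLOCK_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, 0 }

void fifo_read_lock(struct fifo_rwlock *l)
{
	pthread_mutex_lock(&l->mtx);
	unsigned long t = l->next_ticket++;
	while (l->now_serving != t)
		pthread_cond_wait(&l->cond, &l->mtx);
	l->active_readers++;
	l->now_serving++;	/* pass the baton: a following reader joins
				 * us now; a following writer still waits
				 * for active_readers to reach zero */
	pthread_cond_broadcast(&l->cond);
	pthread_mutex_unlock(&l->mtx);
}

void fifo_read_unlock(struct fifo_rwlock *l)
{
	pthread_mutex_lock(&l->mtx);
	if (--l->active_readers == 0)
		pthread_cond_broadcast(&l->cond);
	pthread_mutex_unlock(&l->mtx);
}

void fifo_write_lock(struct fifo_rwlock *l)
{
	pthread_mutex_lock(&l->mtx);
	unsigned long t = l->next_ticket++;
	while (l->now_serving != t || l->active_readers > 0)
		pthread_cond_wait(&l->cond, &l->mtx);
	/* keep now_serving at our ticket: nobody else may enter */
	pthread_mutex_unlock(&l->mtx);
}

void fifo_write_unlock(struct fifo_rwlock *l)
{
	pthread_mutex_lock(&l->mtx);
	l->now_serving++;	/* release the baton to the next waiter */
	pthread_cond_broadcast(&l->cond);
	pthread_mutex_unlock(&l->mtx);
}
```

Because service is strictly in ticket order, neither readers nor writers can be starved, at the cost of losing the "writers first" bias of the current code.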


Or for something even fancier and somewhat pie-in-the-sky, one could use RCU instead of rwlocks. This provides wait-free readers (that is, readers *never* block). RCU is widely used in the Linux kernel, but there exists a userspace implementation as well at 

http://lttng.org/urcu

The urcu library also contains a bunch of concurrent RCU-based data structures such as hash tables, queues, lists.

There's an article describing userspace RCU at

http://www.efficios.com/pub/rcu/urcu-main.pdf
http://www.efficios.com/pub/rcu/urcu-supp.pdf

The comparisons to mutexes and rwlocks in that article are quite interesting!
Comment 10 Danny Auble 2013-07-09 09:52:40 MDT
OK, 13.12 now has this as the default sort: commit 61a125bf1243451e505b78803d1bc110ad22a199.

I didn't feel good about putting this into 2.6, but it does appear to work just fine now.  The above patch can be used against 2.6 to make it work there if you want.
Comment 11 Moe Jette 2013-07-12 09:36:14 MDT
I am reopening the bug. The new sort algorithm is breaking three tests (21.22, 22.1 and 24.1). In all three cases, no element is lost, but the order is different. Sample output from test24.1 is included below without and with the change. The new sort algorithm is currently in Slurm's "master" branch.

============================================
TEST: 24.1
build_dir is /home/jette/SLURM/build_smd
gcc test24.1.prog.c -g -pthread -o test24.1.prog -I/home/jette/SLURM/build_smd -I/home/jette/SLURM/slurm.git /home/jette/SLURM/build_smd/src/api/libslurm.o /home/jette/SLURM/build_smd/src/slurmctld/locks.o  /home/jette/SLURM/build_smd/src/sshare/process.o -ldl -lm -lhwloc-export-dynamic 
spawn ./test24.1.prog
test24.1.prog: No last decay (/dev/null/priority_last_decay_ran) to recover
test24.1.prog: error: Can not save priority state information, StateSaveLocation is /dev/null
This error is expected.  No worries.
root|||1.000000|100||1.000000|0.500000|0|0|
 AccountA||40|0.400000|45|0.450000|0.450000|0.458502|0|0|
  AccountB||30|0.300000|20|0.200000|0.387500|0.408479|0|0|
   AccountB|User1|1|0.300000|20|0.200000|0.387500|0.408479|0|0|
  AccountC||10|0.100000|25|0.250000|0.300000|0.125000|0|0|
   AccountC|User2|1|0.050000|25|0.250000|0.275000|0.022097|0|0|
   AccountC|User3|1|0.050000|0|0.000000|0.150000|0.125000|0|0|
 AccountD||60|0.600000|25|0.250000|0.250000|0.749154|0|0|
  AccountE||25|0.250000|25|0.250000|0.250000|0.500000|0|0|
   AccountE|User4|1|0.250000|25|0.250000|0.250000|0.500000|0|0|
  AccountF||35|0.350000|0|0.000000|0.145833|0.749154|0|0|
   AccountF|User5|1|0.350000|0|0.000000|0.145833|0.749154|0|0|
 AccountG||0|0.000000|30|0.300000|0.300000|0.000000|0|0|
  AccountG|User6|0|-nan|30|0.300000|-nan|nan|0|0|

SUCCESS
============================================

============================================
TEST: 24.1
build_dir is /home/jette/SLURM/build_smd
gcc test24.1.prog.c -g -pthread -o test24.1.prog -I/home/jette/SLURM/build_smd -I/home/jette/SLURM/slurm.git /home/jette/SLURM/build_smd/src/api/libslurm.o /home/jette/SLURM/build_smd/src/slurmctld/locks.o  /home/jette/SLURM/build_smd/src/sshare/process.o -ldl -lm -lhwloc-export-dynamic 
spawn ./test24.1.prog
test24.1.prog: No last decay (/dev/null/priority_last_decay_ran) to recover
test24.1.prog: error: Can not save priority state information, StateSaveLocation is /dev/null
This error is expected.  No worries.
AccountB|User1|1|0.300000|20|0.200000|0.387500|0.408479|0|0|
AccountC|User2|1|0.050000|25|0.250000|0.275000|0.022097|0|0|
AccountC|User3|1|0.050000|0|0.000000|0.150000|0.125000|0|0|
AccountE|User4|1|0.250000|25|0.250000|0.250000|0.500000|0|0|
AccountF|User5|1|0.350000|0|0.000000|0.145833|0.749154|0|0|
AccountG|User6|0|-nan|30|0.300000|-nan|nan|0|0|
AccountA||40|0.400000|45|0.450000|0.450000|0.458502|0|0|
 AccountB||30|0.300000|20|0.200000|0.387500|0.408479|0|0|
 AccountC||10|0.100000|25|0.250000|0.300000|0.125000|0|0|
AccountD||60|0.600000|25|0.250000|0.250000|0.749154|0|0|
 AccountE||25|0.250000|25|0.250000|0.250000|0.500000|0|0|
 AccountF||35|0.350000|0|0.000000|0.145833|0.749154|0|0|
AccountG||0|0.000000|30|0.300000|0.300000|0.000000|0|0|
root|||1.000000|100||1.000000|0.500000|0|0|

FAILURE: we didn't get the correct priorities from the plugin (7 != 13)
test24.1 FAILURE
============================================
Comment 12 Moe Jette 2013-07-12 09:44:46 MDT
(In reply to Janne Blomqvist from comment #9)

In the case that I studied, the problem was mostly one of too much competition for the write locks, due to 100 threads submitting batch jobs in parallel. I modified a few RPCs, including batch job submit, to permit only one of them at a time to compete for the write locks. This slightly slows down batch job submission but greatly improves overall system performance. This logic is in v2.6.0.
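The serialization described here amounts to a dedicated throttle around the heavyweight RPC. A sketch with hypothetical names (`batch_submit_throttle`, `handle_batch_submit_rpc`), not the actual slurmctld code:

```c
#include <pthread.h>

/* At most one batch-submit thread at a time competes for the global
 * write locks; other RPC types are unaffected by this mutex. */
static pthread_mutex_t batch_submit_throttle = PTHREAD_MUTEX_INITIALIZER;
static int jobs_submitted;	/* stand-in for the real submission work */

void handle_batch_submit_rpc(void)
{
	pthread_mutex_lock(&batch_submit_throttle);
	/* ... acquire the slurmctld write locks and enqueue the job ... */
	jobs_submitted++;
	pthread_mutex_unlock(&batch_submit_throttle);
}

int batch_submit_count(void)
{
	return jobs_submitted;
}
```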
Comment 13 David Bigagli 2013-07-15 09:30:50 MDT
We should also consider libc qsort(), which is fast and leaves less complex code 
to maintain. Pseudocode for such a sorting function looks like this:

/* drain the list into a flat array */
while ((e = list_pop(l))) {
     v[n] = e;
     ++n;
}
lsize = n;

/* sort the array, then rebuild the list in order */
qsort(v, lsize, sizeof(void *), (__compar_fn_t)f);

for (n = 0; n < lsize; n++) {
	list_append(l, v[n]);
}

We are running some performance tests and the results seem 
to be quite encouraging.
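As a self-contained illustration of the same approach, here plain C strings stand in for list elements; the helper names are ours, not Slurm's:

```c
#include <stdlib.h>
#include <string.h>

/* qsort() hands the comparator pointers to the array ELEMENTS, so for an
 * array of char * it receives char ** and must dereference once. */
static int cmp_str(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort an array of C strings in place with libc qsort(). */
void sort_strings(const char **v, size_t n)
{
	qsort(v, n, sizeof(const char *), cmp_str);
}
```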

David
Comment 14 David Bigagli 2013-07-17 09:43:53 MDT
We have implemented list_sort() using libc qsort().
The code is committed to the master branch.

 David