| Summary: | The srun --ntasks-per-socket=<ntasks> option does not work properly with the -m block:block:block option | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Zhengji Zhao <zzhao> |
| Component: | User Commands | Assignee: | Danny Auble <da> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | dmjacobsen, zzhao |
| Version: | 15.08.6 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | fct_edison.tar.gz | ||
|
Description (Zhengji Zhao, 2016-01-05 14:05:39 MST)
Comment from Danny Auble:

Hey Zhengji,

This is most likely a configuration issue. I noticed you don't have task/affinity in your TaskPlugin line; you will most likely want that. The cgroup plugin does affinity differently and might not behave as you expect for all options (I know the nomultithread option doesn't work, for instance).

After talking with Doug, it sounds like you had this in the past but found that *not* using it, and instead setting cons_res to CR_Socket_Memory for most partitions, covered most cases and seemed much easier to use.

I can't remember what issues you had with task/affinity or how it was more difficult to use, but I would suggest turning affinity off in the cgroup plugin and layering the task/affinity plugin on top of the other two, in this manner:

TaskPlugin = affinity,cgroup,cray

There are other suggestions we have for your configuration as well; I have opened bug 2311 for that.

Reply from Zhengji Zhao:

Dear Danny,
Thanks for your advice. I will work with Doug to test your suggestion. I think some basic command options like --ntasks-per-socket (even without using -m) still do not work fully in our current configuration.
I have one more question for you. I read the following in the slurm.conf man page regarding task/cgroup:

    task/cgroup enables resource containment using Linux control cgroups. This enables the --cpu_bind and/or --mem_bind srun options. NOTE: see "man cgroup.conf" for configuration details. *NOTE: This plugin writes to disk and can slightly impact performance.* If you are running lots of short running jobs (less than a couple of seconds) this plugin slows down performance slightly. It should probably be avoided in an HTC environment.
I would like to understand a bit more about the disk writing done by task/cgroup. For a very large scale job, such as one using more than 133k cores, do you think this disk writing would affect MPI performance? Does the disk writing occur only before and after user code execution? I am asking because we see a 50% slowdown in MPI_Alltoall time after switching to Slurm. Can you think of anything that could slow down MPI_Alltoall under Slurm?

Thanks,
Zhengji
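A minimal slurm.conf sketch of the layering Danny describes above (illustrative only; the select/cons_res lines reflect the CR_Socket_Memory setup mentioned in the thread, and actual site values will differ from this fragment):

```ini
# slurm.conf fragment -- a sketch, not Edison's actual configuration.
# Layer task/affinity ahead of the cgroup and cray task plugins,
# letting task/affinity handle CPU binding:
TaskPlugin=affinity,cgroup,cray
# cons_res with socket+memory tracking, as used on most partitions:
SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory
```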
Comment from Danny Auble:

I doubt the writing to disk in this case would account for a 50% slowdown. It is a nominal hit that happens when a new process is spawned, and is usually only noticed when running high throughput workloads (hundreds of jobs a second).

How are you running your alltoall? Perhaps Doug has seen this before?

Could you post your cgroup.conf file so I can look at that as well? If you could give me your slurmdbd.conf file, that would be handy too.

Reply from Zhengji Zhao:

Created attachment 2575 [details] fct_edison.tar.gz

Danny,

Thanks for your info. I hope Doug can get back to you with those .conf files; if not, I will send them to you soon. I am attaching a tar file, fct_edison.tar.gz (~3 MB), which includes all the information needed for the FCT (full configuration test) runs on our Cray XC30, Edison. Please read the README.ZZ file for descriptions of the included files.

Thanks a lot. I am looking forward to getting this performance issue understood; any help from you is highly appreciated.

Zhengji
Comment from Danny Auble:

Zhengji, I'm not sure where the numbers in the README.ZZ are coming from. slurm-340.out should not be used as a benchmark: there was sleeping while waiting for IO that was never going to happen, which in turn prolonged the job long after it was finished. This appears to be discussed in bug 985 (http://bugs.schedmd.com/show_bug.cgi?id=985), but it doesn't look like the root cause was ever figured out.

slurm-313.out appears to be a correct run, with times much closer to what you would expect:

slurm-313.out:MPI_ATOA-1 100368 137625600 1 3.528e+01 3.528e+01 3.528e+01 sec 2.975970e+01 MB/sec

Is there any way you can attempt the full system alltoall again? I would consider 340 a failed job. Some of the other modifications suggested in bug 2311 could also improve performance a bit.

Comment from Danny Auble:

Zhengji, ignore the first line of comment 6; I wrote that before I had fully read the README.ZZ. I understand where your numbers are coming from now and forgot to take that line out :).

Comment from Danny Auble:

Also, as you indicate in README.ZZ, you would like more sorted output. You can always pipe srun's output to sort -V, which should give you what you want. I would expect

srun -n 132486 ./bigmpi5 -mb 2100 -nit 1 | sort -V

to give you nicer output. If you want it sorted by rank, you can do something like

srun -l -n 132486 ./bigmpi5 -mb 2100 -nit 1 | sort -V

The -l option prepends the rank number to each line of output from the various ranks.

I'm not sure why aprun would have more sorted output; my guess is that srun launches things in a more parallel manner than aprun does, and aprun perhaps does some sorting before printing things out.

Reply from Zhengji Zhao:

Thanks a lot for looking into this. I will be able to run another set of full scale runs tomorrow and will update you with new numbers. Doug doubts that task affinity played much of a role here, and we may have had a network issue when I was running the FCT runs last time, so tomorrow we will have more data points. I really appreciate all your help.
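The sort -V trick above can be seen on a toy example. The lines below stand in for srun -l style output, where each line carries a "rank:" prefix (the messages are made up for illustration):

```shell
# sort -V compares the leading rank numbers as version numbers,
# so rank 10 sorts after rank 2 instead of before it (as a plain
# lexicographic sort would order it).
printf '10: MPI_Init done\n2: MPI_Init done\n1: MPI_Init done\n' | sort -V
# prints the lines in rank order: 1, 2, 10
```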
Zhengji

Comment from Danny Auble:

No problem. I also doubt that affinity would have much to do with an all-to-all test; I would expect the network to be behind the issues with 340. I look forward to seeing a new test tomorrow, good luck ;)!

Reply from Zhengji Zhao:

Thanks for the tip. It will be very useful!

Zhengji

Comment from Danny Auble:

Zhengji, any more on this?
Reply from Zhengji Zhao:

Danny,

Thanks for following up. Unfortunately, different task placements did not help; I still got the same slow performance. On 1/7 I ran the same full configuration test (FCT) with the distribution method -m block:block:block, and the results were as follows:

| Batch system | Run date | Output filename:KEY:tag | ntasks | n | it | min (sec) | max (sec) | avg (sec) | BW (MB/sec) | The -m option |
|---|---|---|---|---|---|---|---|---|---|---|
| Torque/Moab | 2/19/14 | fct99p1.o785758:MPI_ATOA-1 | 132367 | 137625600 | 1 | 40.94 | 40.94 | 40.94 | 25.65 | same as block:block:block |
| Slurm | 1/7/16 | slurm-7531-block.out:MPI_ATOA-1 | 132367 | 137625600 | 1 | 61.93 | 61.94 | 61.94 | 16.95 | block:block:block |
| Slurm | 1/7/16 | slurm-7532-block.out:MPI_ATOA-1 | 132367 | 137625600 | 1 | 60.14 | 60.14 | 60.14 | 17.46 | block:block:block |
| Slurm | 1/7/16 | slurm-7534.out:MPI_ATOA-1 | 132367 | 137625600 | 1 | 60.77 | 60.77 | 60.77 | 17.28 | block:cyclic:cyclic |
| Slurm | 1/1/16 | slurm-340.out:MPI_ATOA-1 | 132486 | 137625600 | 1 | 62.54 | 62.55 | 62.55 | 16.79 | block:cyclic:cyclic |
| Slurm | 1/7/16 | slurm-7530.out:MPI_ATOA-1 | 132486 | 137625600 | 1 | 59.71 | 59.72 | 59.71 | 17.58 | block:cyclic:cyclic |
| Slurm | 1/7/16 | slurm-7533-block.out:MPI_ATOA-1 | 132486 | 137625600 | 1 | 59.74 | 59.74 | 59.74 | 17.58 | block:block:block |

We did not have time to test the other suggested Slurm configuration changes because other scheduled activities ran much later than planned. Any advice would be much appreciated, though I also understand that this performance issue might be beyond the scope of your support. I have also filed a bug with Cray to get some advice from them as well.

Thanks again for all your help!

Zhengji
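Result tables like the one above can be collected mechanically from the job output files. A small sketch (the slurm-*.out filenames follow the thread's naming, but the working directory and file contents are hypothetical):

```shell
# Pull the MPI_ATOA result line out of each job output file, with the
# filename prepended (-H), and sort the matches by the numeric job id
# embedded in the filename.
grep -H 'MPI_ATOA-1' slurm-*.out | sort -V
```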
Comment from Danny Auble:

Thanks for the update, Zhengji. As expected, the layout didn't seem to matter. I'm guessing none of these jobs had issues like 340 did; is that correct?

I am wondering if the slowdown comes from the way Slurm does its output. Would it be possible to alter your test to not write any stdout until the end? I understand that isn't an apples-to-apples test, but it would be interesting to see what overhead, if any, that much output adds to the mix. I am pretty sure ALPS handles stdio from the different ranks very differently than Slurm does. Slurm usually isn't expecting a job of this size to produce this kind of output, since that usually kills performance.

Reply from Zhengji Zhao:

Danny,

Thanks for the advice; I think it is worth trying. In my limited experience so far, both my own and that of users I have interacted with, there have indeed been some issues with standard output being delayed or lost. Some users reported that the problem is worse with a binary compiled with a Cray compiler (I am still investigating the issue). I will definitely plan a few runs without the I/O, especially the part where every process prints MPI_Init info. I will update you if I have further info.

Thanks again,
Zhengji

Comment from Danny Auble:

Hmm, I haven't heard of missing stdout; that is strange. Stdout is buffered by default, so perhaps that can add to the delay they are seeing. I am still thinking a network issue may be the cause of this, but I am hoping your non-stdout tests will confirm or debunk that theory.

Comment from Danny Auble:

Zhengji, any more on this?

Reply from Zhengji Zhao:

I will update you right after today's tests. We are doing another set of full system runs now; we may have good news to tell you.

Thanks for following up on this.

Zhengji

Comment from Danny Auble:

How did it go? Any more on this?

Reply from Zhengji Zhao:

Hi Danny,

Sorry for the delay, and thanks a lot for following up. When I told you last time that I might have good news, I was about to conclude that the FCT binary compiled with the Intel compilers was not performing, because on 1/23 I was able to get good timing with an FCT binary built with a Cray compiler (in the production environment). All the slower numbers before that date used a binary built with an Intel compiler.
However, on 1/28, when I had a chance to run both binaries (Intel and Cray compiler builds) side by side, both of them got the good timings we used to get. So we still do not understand exactly what was causing the performance regression last time, but the performance has now returned to the expected level. See the update I provided to Cray support (attached below) for more details. Please feel free to close this bug. I appreciate all your help!

Zhengji

We did the tests suggested in comment #8 during the maintenance on 1/23/2016.

1) Linked FCT against cray-mpich 7.3.1. We did not see any change in the MPI_Alltoall time (still about 58 sec).

2) Tested the DMAPP-optimized MPI_Alltoall using cray-mpich 7.3.1, but the job failed without any meaningful error message (just this: srun: error: nid00008: tasks 0-23: Killed).

We were in contact with SchedMD about this performance issue as well, and they suggested we try disabling the all-process writing in the code. We also tested that on 1/23/2016; unfortunately it did not make any observable difference either (MPI_Alltoall time was still 59 sec).

However, since 1/26/2016 we have been able to get good timing (about 40 sec) again with FCT, with both the Cray and Intel compiler builds, due to some unknown "improvement" on the system (see the data attached below; the last column is the measured MPI_Alltoall time in FCT). We were not able to identify any specific change on the system that could account for this performance improvement.

In summary, the performance is now as good as it used to be, but the performance issue we had earlier is still not understood. I am OK with closing this bug for now.
Thanks,
Zhengji

| Run date | Time | Output/KEY:tag | ntasks | MPI_Alltoall time (sec) |
|---|---|---|---|---|
| 1/1/16 | 18:00 | slurm-314.out:MPI_ATOA-1 | 132486 | 61.05 |
| 1/1/16 | 18:18 | slurm-316.out:MPI_ATOA-1 | 132486 | 62.13 |
| 1/1/16 | 22:26 | slurm-336.out:MPI_ATOA-1 | 132486 | 61.11 |
| 1/1/16 | 22:44 | slurm-340.out:MPI_ATOA-1 | 132486 | 62.55 |
| 1/2/16 | 21:06 | slurm-349.out:MPI_ATOA-1 | 133656 | 62.55 |
| 1/7/16 | 22:44 | slurm-7530.out:MPI_ATOA-1 | 132486 | 59.71 |
| 1/7/16 | 22:51 | slurm-7531.out:MPI_ATOA-1 | 132367 | 61.94 |
| 1/7/16 | 22:56 | slurm-7532.out:MPI_ATOA-1 | 132367 | 60.14 |
| 1/7/16 | 23:02 | slurm-7533.out:MPI_ATOA-1 | 132486 | 59.74 |
| 1/7/16 | 23:07 | slurm-7534.out:MPI_ATOA-1 | 132367 | 60.77 |
| 1/23/16 | 11:31 | slurm-60924.out:MPI_ATOA-1 | 132367 | 58.36 |
| 1/23/16 | 11:44 | slurm-60932.out:MPI_ATOA-1 | 132367 | 59.59 |
| 1/23/16 | 11:50 | slurm-60933.out:MPI_ATOA-1 | 132367 | 70.69 |
| 1/23/16 | 11:56 | slurm-60936.out:MPI_ATOA-1 | 132367 | 57.92 |
| 1/26/16 | 22:10 | slurm-72708.out:MPI_ATOA-1 | 133088 | 40.76 |
| 1/26/16 | 22:15 | slurm-72710.out:MPI_ATOA-1 | 133088 | 40.72 |
| 1/26/16 | 22:21 | slurm-72721.out:MPI_ATOA-1 | 133088 | 40.80 |
| 1/28/16 | 23:06 | slurm-78107.out:MPI_ATOA-1 | 132367 | 39.71 |
| 1/28/16 | 23:49 | slurm-78112.out:MPI_ATOA-1 | 132367 | 45.68 |
| 1/28/16 | 23:11 | slurm-78114.out:MPI_ATOA-1 | 132367 | 39.81 |
| 1/28/16 | 23:16 | slurm-78115.out:MPI_ATOA-1 | 132367 | 39.35 |
| 1/28/16 | 23:22 | slurm-78116.out:MPI_ATOA-1 | 132367 | 39.73 |
| 1/28/16 | 23:26 | slurm-78117.out:MPI_ATOA-1 | 132367 | 39.80 |
| 1/28/16 | 23:31 | slurm-78118.out:MPI_ATOA-1 | 132367 | 38.47 |

Comment from Danny Auble:

Well, I am glad you are now getting better performance :). Thanks for following up.
Looks like things are slightly better than before as well; that is great! Thanks again for the numbers.

*** Ticket 2463 has been marked as a duplicate of this ticket. ***