| Summary: | The srun --ntasks-per-socket=<ntasks> option does not work properly with the -m block:block:block option | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Zhengji Zhao <zzhao> |
| Component: | User Commands | Assignee: | Danny Auble <da> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | dmjacobsen, zzhao |
| Version: | 15.08.6 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | fct_edison.tar.gz | ||
|
Description (Zhengji Zhao, 2016-01-05 14:05:39 MST)
Comment from Danny Auble:

Hey Zhengji,

This is most likely a configuration issue. I noticed you don't have task/affinity in your TaskPlugin line; you will most likely want that. The cgroup plugin does affinity differently and might not behave as you expect for all options (I know the nomultithread option doesn't work, for instance).

After talking with Doug, it sounds like you had this in the past but found that *not* using it, and instead setting cons_res to CR_Socket_Memory for most partitions, covered most cases and seemed much easier to use.

I can't remember what issues you had with task/affinity or how it was more difficult to use, but I would suggest turning affinity off in the cgroup plugin and layering the task/affinity plugin on top of the other two, in this manner:

TaskPlugin = affinity,cgroup,cray

There are other suggestions we have for your configuration as well; I have opened bug 2311 for that.

Reply from Zhengji Zhao:

Dear Danny,
Thanks for your advice. I will work with Doug to test your suggestion. I think some basic command options like --ntasks-per-socket (even without using -m) still do not work fully in our current configuration.
I have one more question for you. I read the following in the slurm.conf man page regarding task/cgroup:

    task/cgroup enables resource containment using Linux control cgroups. This enables the --cpu_bind and/or --mem_bind srun options. NOTE: see "man cgroup.conf" for configuration details. *NOTE: This plugin writes to disk and can slightly impact performance.* If you are running lots of short running jobs (less than a couple of seconds) this plugin slows down performance slightly. It should probably be avoided in an HTC environment.
I would like to understand a bit more about the disk writing done by task/cgroup. For a very large scale job, such as one using more than 133k cores, do you think this disk writing would affect MPI performance? Does the disk writing occur only before and after user code execution? I am asking because we see a 50% slowdown in MPI_Alltoall time after switching to Slurm. Can you think of anything that could slow down MPI_Alltoall under Slurm?

Thanks,
Zhengji
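A minimal slurm.conf sketch of the layering Danny describes above (illustrative only; the select/cons_res lines reflect the CR_Socket_Memory setup mentioned in the thread, and actual site values will differ from this fragment):

```ini
# slurm.conf fragment -- a sketch, not Edison's actual configuration.
# Layer task/affinity ahead of the cgroup and cray task plugins,
# letting task/affinity handle CPU binding:
TaskPlugin=affinity,cgroup,cray
# cons_res with socket+memory tracking, as used on most partitions:
SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory
```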
Comment from Danny Auble:

I doubt the writing to disk in this case would account for a 50% slowdown. It is a nominal hit that happens when a new process is spawned, and is usually only noticed when running high throughput workloads (hundreds of jobs a second).

How are you running your alltoall? Perhaps Doug has seen this before?

Could you post your cgroup.conf file so I can look at that as well? If you could give me your slurmdbd.conf file, that would be handy too.

Reply from Zhengji Zhao:

Created attachment 2575 [details] fct_edison.tar.gz

Danny,

Thanks for your info. I hope Doug can get back to you with those .conf files; if not, I will send them to you soon. I am attaching a tar file, fct_edison.tar.gz (~3 MB), which includes all the information needed for the FCT (full configuration test) runs on our Cray XC30, Edison. Please read the README.ZZ file for descriptions of the included files.

Thanks a lot. I am looking forward to getting this performance issue understood; any help from you is highly appreciated.

Zhengji
Comment from Danny Auble:

Zhengji, I'm not sure where the numbers in the README.ZZ are coming from. slurm-340.out should not be used as a benchmark: there was sleeping while waiting for IO that was never going to happen, which in turn prolonged the job long after it was finished. This appears to be discussed in bug 985 (http://bugs.schedmd.com/show_bug.cgi?id=985), but it doesn't look like the root cause was ever figured out.

slurm-313.out appears to be a correct run, with times much closer to what you would expect:

slurm-313.out:MPI_ATOA-1 100368 137625600 1 3.528e+01 3.528e+01 3.528e+01 sec 2.975970e+01 MB/sec

Is there any way you can attempt the full system alltoall again? I would consider 340 a failed job. Some of the other modifications suggested in bug 2311 could also improve performance a bit.

Comment from Danny Auble:

Zhengji, ignore the first line of comment 6; I wrote that before I had fully read the README.ZZ. I understand where your numbers are coming from now and forgot to take that line out :).

Comment from Danny Auble:

Also, as you indicate in README.ZZ, you would like more sorted output. You can always pipe srun's output to sort -V, which should give you what you want. I would expect

srun -n 132486 ./bigmpi5 -mb 2100 -nit 1 | sort -V

to give you nicer output. If you want it sorted by rank, you can do something like

srun -l -n 132486 ./bigmpi5 -mb 2100 -nit 1 | sort -V

The -l option prepends the rank number to each line of output from the various ranks.

I'm not sure why aprun would have more sorted output; my guess is that srun launches things in a more parallel manner than aprun does, and aprun perhaps does some sorting before printing things out.

Reply from Zhengji Zhao:

Thanks a lot for looking into this. I will be able to run another set of full scale runs tomorrow and will update you with new numbers. Doug doubts that task affinity played much of a role here, and we may have had a network issue when I was running the FCT runs last time, so tomorrow we will have more data points. I really appreciate all your help.
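The sort -V trick above can be seen on a toy example. The lines below stand in for srun -l style output, where each line carries a "rank:" prefix (the messages are made up for illustration):

```shell
# sort -V compares the leading rank numbers as version numbers,
# so rank 10 sorts after rank 2 instead of before it (as a plain
# lexicographic sort would order it).
printf '10: MPI_Init done\n2: MPI_Init done\n1: MPI_Init done\n' | sort -V
# prints the lines in rank order: 1, 2, 10
```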
Zhengji

Comment from Danny Auble:

No problem. I also doubt that affinity would have much to do with an all-to-all test; I would expect the network to be behind the issues with 340. I look forward to seeing a new test tomorrow, good luck ;)!

Reply from Zhengji Zhao:

Thanks for the tip. It will be very useful!

Zhengji

Comment from Danny Auble:

Zhengji, any more on this?
Reply from Zhengji Zhao:

Danny,

Thanks for following up. Unfortunately, different task placements did not help; I still got the same slow performance. On 1/7 I ran the same full configuration test (FCT) with the distribution method -m block:block:block, and the results were as follows:

| Batch system | Run date | Output filename:KEY:tag | ntasks | n | it | min (sec) | max (sec) | avg (sec) | BW (MB/sec) | The -m option |
|---|---|---|---|---|---|---|---|---|---|---|
| Torque/Moab | 2/19/14 | fct99p1.o785758:MPI_ATOA-1 | 132367 | 137625600 | 1 | 40.94 | 40.94 | 40.94 | 25.65 | same as block:block:block |
| Slurm | 1/7/16 | slurm-7531-block.out:MPI_ATOA-1 | 132367 | 137625600 | 1 | 61.93 | 61.94 | 61.94 | 16.95 | block:block:block |
| Slurm | 1/7/16 | slurm-7532-block.out:MPI_ATOA-1 | 132367 | 137625600 | 1 | 60.14 | 60.14 | 60.14 | 17.46 | block:block:block |
| Slurm | 1/7/16 | slurm-7534.out:MPI_ATOA-1 | 132367 | 137625600 | 1 | 60.77 | 60.77 | 60.77 | 17.28 | block:cyclic:cyclic |
| Slurm | 1/1/16 | slurm-340.out:MPI_ATOA-1 | 132486 | 137625600 | 1 | 62.54 | 62.55 | 62.55 | 16.79 | block:cyclic:cyclic |
| Slurm | 1/7/16 | slurm-7530.out:MPI_ATOA-1 | 132486 | 137625600 | 1 | 59.71 | 59.72 | 59.71 | 17.58 | block:cyclic:cyclic |
| Slurm | 1/7/16 | slurm-7533-block.out:MPI_ATOA-1 | 132486 | 137625600 | 1 | 59.74 | 59.74 | 59.74 | 17.58 | block:block:block |

We did not have time to test the other suggested Slurm configuration changes because other scheduled activities ran much later than planned. Any advice would be much appreciated, though I also understand that this performance issue might be beyond the scope of your support. I have also filed a bug with Cray to get some advice from them as well.

Thanks again for all your help!

Zhengji
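Result tables like the one above can be collected mechanically from the job output files. A small sketch (the slurm-*.out filenames follow the thread's naming, but the working directory and file contents are hypothetical):

```shell
# Pull the MPI_ATOA result line out of each job output file, with the
# filename prepended (-H), and sort the matches by the numeric job id
# embedded in the filename.
grep -H 'MPI_ATOA-1' slurm-*.out | sort -V
```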
Comment from Danny Auble:

Thanks for the update, Zhengji. As expected, the layout didn't seem to matter. I'm guessing none of these jobs had issues like 340 did; is that correct?

I am wondering if the slowdown comes from the way Slurm does its output. Would it be possible to alter your test to not write any stdout until the end? I understand that isn't an apples-to-apples test, but it would be interesting to see what overhead, if any, that much output adds to the mix. I am pretty sure ALPS handles stdio from the different ranks very differently than Slurm does. Slurm usually isn't expecting a job of this size to produce this kind of output, since that usually kills performance.

Reply from Zhengji Zhao:

Danny,

Thanks for the advice; I think it is worth trying. In my limited experience so far, both my own and that of users I have interacted with, there have indeed been some issues with standard output being delayed or lost. Some users reported that the problem is worse with a binary compiled with a Cray compiler (I am still investigating the issue). I will definitely plan a few runs without the I/O, especially the part where every process prints MPI_Init info. I will update you if I have further info.

Thanks again,
Zhengji

Comment from Danny Auble:

Hmm, I haven't heard of missing stdout; that is strange. Stdout is buffered by default, so perhaps that can add to the delay they are seeing. I am still thinking a network issue may be the cause of this, but I am hoping your non-stdout tests will confirm or debunk that theory.

Comment from Danny Auble:

Zhengji, any more on this?

Reply from Zhengji Zhao:

I will update you right after today's tests. We are doing another set of full system runs now; we may have good news to tell you.

Thanks for following up on this.

Zhengji

Comment from Danny Auble:

How did it go? Any more on this?

Reply from Zhengji Zhao:

Hi Danny,

Sorry for the delay, and thanks a lot for following up. When I told you last time that I might have good news, I was about to conclude that the FCT binary compiled with the Intel compilers was not performing, because on 1/23 I was able to get good timing with an FCT binary built with a Cray compiler (in the production environment). All the slower numbers before that date used a binary built with an Intel compiler.
However, on 1/28, when I had a chance to run both binaries (Intel and Cray compiler builds) side by side, both of them got the good timings we used to get. So we still do not understand exactly what was causing the performance regression last time, but the performance has now returned to the expected level. See the update I provided to Cray support (attached below) for more details. Please feel free to close this bug. I appreciate all your help!

Zhengji

We did the tests suggested in comment #8 during the maintenance on 1/23/2016.

1) Linked FCT against cray-mpich 7.3.1. We did not see any change in the MPI_Alltoall time (still about 58 sec).

2) Tested the DMAPP-optimized MPI_Alltoall using cray-mpich 7.3.1, but the job failed without any meaningful error message (just this: srun: error: nid00008: tasks 0-23: Killed).

We were in contact with SchedMD about this performance issue as well, and they suggested we try disabling the all-process writing in the code. We also tested that on 1/23/2016; unfortunately it did not make any observable difference either (MPI_Alltoall time was still 59 sec).

However, since 1/26/2016 we have been able to get good timing (about 40 sec) again with FCT, with both the Cray and Intel compiler builds, due to some unknown "improvement" on the system (see the data attached below; the last column is the measured MPI_Alltoall time in FCT). We were not able to identify any specific change on the system that could account for this performance improvement.

In summary, the performance is now as good as it used to be, but the performance issue we had earlier is still not understood. I am OK with closing this bug for now.
Thanks,
Zhengji

| Run date | Time | Output/KEY:tag | ntasks | MPI_Alltoall time (sec) |
|---|---|---|---|---|
| 1/1/16 | 18:00 | slurm-314.out:MPI_ATOA-1 | 132486 | 61.05 |
| 1/1/16 | 18:18 | slurm-316.out:MPI_ATOA-1 | 132486 | 62.13 |
| 1/1/16 | 22:26 | slurm-336.out:MPI_ATOA-1 | 132486 | 61.11 |
| 1/1/16 | 22:44 | slurm-340.out:MPI_ATOA-1 | 132486 | 62.55 |
| 1/2/16 | 21:06 | slurm-349.out:MPI_ATOA-1 | 133656 | 62.55 |
| 1/7/16 | 22:44 | slurm-7530.out:MPI_ATOA-1 | 132486 | 59.71 |
| 1/7/16 | 22:51 | slurm-7531.out:MPI_ATOA-1 | 132367 | 61.94 |
| 1/7/16 | 22:56 | slurm-7532.out:MPI_ATOA-1 | 132367 | 60.14 |
| 1/7/16 | 23:02 | slurm-7533.out:MPI_ATOA-1 | 132486 | 59.74 |
| 1/7/16 | 23:07 | slurm-7534.out:MPI_ATOA-1 | 132367 | 60.77 |
| 1/23/16 | 11:31 | slurm-60924.out:MPI_ATOA-1 | 132367 | 58.36 |
| 1/23/16 | 11:44 | slurm-60932.out:MPI_ATOA-1 | 132367 | 59.59 |
| 1/23/16 | 11:50 | slurm-60933.out:MPI_ATOA-1 | 132367 | 70.69 |
| 1/23/16 | 11:56 | slurm-60936.out:MPI_ATOA-1 | 132367 | 57.92 |
| 1/26/16 | 22:10 | slurm-72708.out:MPI_ATOA-1 | 133088 | 40.76 |
| 1/26/16 | 22:15 | slurm-72710.out:MPI_ATOA-1 | 133088 | 40.72 |
| 1/26/16 | 22:21 | slurm-72721.out:MPI_ATOA-1 | 133088 | 40.80 |
| 1/28/16 | 23:06 | slurm-78107.out:MPI_ATOA-1 | 132367 | 39.71 |
| 1/28/16 | 23:49 | slurm-78112.out:MPI_ATOA-1 | 132367 | 45.68 |
| 1/28/16 | 23:11 | slurm-78114.out:MPI_ATOA-1 | 132367 | 39.81 |
| 1/28/16 | 23:16 | slurm-78115.out:MPI_ATOA-1 | 132367 | 39.35 |
| 1/28/16 | 23:22 | slurm-78116.out:MPI_ATOA-1 | 132367 | 39.73 |
| 1/28/16 | 23:26 | slurm-78117.out:MPI_ATOA-1 | 132367 | 39.80 |
| 1/28/16 | 23:31 | slurm-78118.out:MPI_ATOA-1 | 132367 | 38.47 |

Comment from Danny Auble:

Well, I am glad you are now getting better performance :). Thanks for following up.
Looks like things are slightly better than before as well; that is great! Thanks again for the numbers.

*** Ticket 2463 has been marked as a duplicate of this ticket. ***