| Summary: | Configuration suggestions | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Danny Auble <da> |
| Component: | Configuration | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | dmjacobsen, zzhao |
| Version: | 15.08.6 | | |
| Hardware: | Cray XC | | |
| OS: | Linux | | |
| Site: | NERSC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | Edison |
| CLE Version: | | Version Fixed: | 15.08.7 16.05.0-pre1 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
Description
Danny Auble
2016-01-06 04:46:36 MST
This is a follow-on to bug 2307. Beyond what is suggested there, I recommend the changes below.

Immediate changes I would make:

Change JobAcctGatherType to jobacct_gather/linux. The jobacct_gather/cgroup plugin only reads the memory info from a cgroup; everything else is pulled from the proc filesystem, just as the linux plugin does. It is much slower and adds very little functionality. I can't think of a time I have ever recommended it to anyone for production use.

Change MsgAggregationParams = WindowMsgs=1,WindowTime=100 to something different, or just remove it. With WindowMsgs=1 it adds very little. Perhaps set WindowMsgs=100 and WindowTime=20, which is what I usually tell people, but you should experiment a bit to find the right values.

Other options you might consider:

Since you already have PrologFlags = Alloc, I usually like to add "NoHold" so that the delay of the prolog running doesn't hold up an salloc at submit time; the wait is instead pushed to the first srun, which will usually go unnoticed by the user since the prolog will most likely already have run.

You might find the "fair_tree" style of fairshare more appealing than normal fairshare; add "fair_tree" to PriorityFlags. There is a presentation at http://slurm.schedmd.com/SC14/BYU_Fair_Tree.pdf.

Please let me know if you have any questions/comments.

Comment 2 — Doug Jacobsen:

Hi Danny,

Thanks for taking a look.

I don't actually have message aggregation configured -- it seemed to be causing problems with the recommended settings from SLUG, so I just removed the config:

nid01605:~ # cat /opt/slurm/etc/slurm.conf | grep MsgAggregationParams
nid01605:~ #

and yet the config appears to be set:

nid01605:~ # scontrol show config | grep MsgAgg
MsgAggregationParams = WindowMsgs=1,WindowTime=100
nid01605:~ #

I'm guessing these are the implicit defaults... however, whenever slurmd is started or HUP'd it reports that message aggregation is disabled.
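Taken together, Danny's immediate suggestions might look like the following slurm.conf fragment. This is a sketch, not Edison's actual config; the window values are the starting points suggested above and should be tuned per site.

```
# Suggested changes from this report (starting points, tune per site):
JobAcctGatherType=jobacct_gather/linux
# Either remove MsgAggregationParams entirely or widen the window:
MsgAggregationParams=WindowMsgs=100,WindowTime=20
# NoHold defers the prolog wait from salloc submit to the first srun:
PrologFlags=Alloc,NoHold
# Fair-tree fairshare (see the BYU SC14 presentation linked above):
PriorityFlags=FAIR_TREE
```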
Regarding jobacct_gather/cgroup vs. jobacct_gather/linux: on Cori I have jobacct_gather/linux configured, and on Edison cgroup. I was planning to move Cori to cgroup because many users have been complaining about incorrect termination of job steps, owing to slurmstepd killing them for exceeding resident-memory limits when processes were performing lazy copy-on-write style forking. As far as I know, cgroup-style memory accounting works well for this use case, whereas /proc summations by an outside observer are likely to be inaccurate. I also removed the default memory limits for our Shared=EXCLUSIVE partitions (on both systems).

I will take a look at the fair_tree documentation you sent -- thanks!

-Doug

Danny Auble, in reply to Doug Jacobsen from comment #2:

> I'm guessing these are the implicit defaults... however, whenever slurmd is
> started or HUP'd it reports that message aggregation is disabled.

I see the same thing; I'll see about making it not print that when aggregation is not enabled. Your guess is most likely correct, though: it is the default.
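Doug's copy-on-write point can be illustrated with a rough sketch of the kind of /proc polling an outside observer does (this is illustrative Python, not Slurm's actual plugin code): summing VmRSS across a process tree double-counts pages that a forked child still shares with its parent, so the total can far exceed the memory actually in use, while a cgroup's accounting charges each page once.

```python
import os

def tree_rss_kib(root_pid):
    """Sum VmRSS (resident set size, KiB) over a process and its
    descendants by polling /proc -- roughly what an outside observer
    does.  After a copy-on-write fork, parent and child both report
    the shared pages in VmRSS, so this sum overstates real usage."""
    def children(pid):
        out = []
        try:
            tids = os.listdir("/proc/%d/task" % pid)
        except OSError:
            return out
        for tid in tids:
            try:
                # /proc/<pid>/task/<tid>/children lists child PIDs
                with open("/proc/%d/task/%s/children" % (pid, tid)) as f:
                    out += [int(c) for c in f.read().split()]
            except OSError:
                pass
        return out

    total, stack = 0, [root_pid]
    while stack:
        pid = stack.pop()
        try:
            with open("/proc/%d/status" % pid) as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        total += int(line.split()[1])  # value is in kB
                        break
        except OSError:
            continue  # process exited between listing and reading
        stack.extend(children(pid))
    return total

print(tree_rss_kib(os.getpid()))
```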
> As far as I know, cgroup-style memory accounting works well for this use
> case, whereas /proc summations by an outside observer are likely to be
> inaccurate. I also removed the default memory limits for our
> Shared=EXCLUSIVE partitions (on both systems).

Keep in mind that much of the memory handling is done in the task/cgroup plugin. I would be interested in seeing an example of what you are talking about. What I have witnessed is that the cgroup plugin reports slightly more than the linux plugin, but never very different (<1M difference). What I do know is that it is much slower, though unless you are running HTC workloads (100+ jobs a second) you will probably never notice.

> I will take a look at the fair_tree documentation you sent -- thanks!

No problem. Let me know what you decide.

Danny Auble:

MsgAggregationParams = WindowMsgs=1,WindowTime=100 is fixed in commit af3d1eadbcd. Now it will print the correct NULL. If you have any more questions please reopen, but I think you are in good shape, thanks!

Comment 5 — Danny Auble:

Doug, I just saw your slurm.conf from bug 2350. You may be able to significantly shorten your slurm.conf for Edison, which would help in quite a few ways. I believe all your NodeName lines can be shortened to:

NodeName=DEFAULT CPUS=48 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 gres=craynetwork:4 RealMemory=64523 TmpDisk=32261
NodeName=nid00[008-296] Weight=100
NodeName=nid0[0297-6143] Weight=1000

I would expect this to speed up reading the slurm.conf file dramatically and make administration much easier. I don't believe the NodeAddr is needed; if it is, that would be slightly unfortunate, as you would have to break the nodes up into chunks of 254, but even then the file would be much smaller and more manageable.

Doug Jacobsen:

Hi Danny,

I'm using NodeAddr to force slurmd to listen only on the ipogif0 interface and not on the RSIP interface. If there were another way to communicate this, I would *greatly* appreciate and prefer it.

-Doug
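The bracketed NodeName ranges above are Slurm hostlist expressions. A minimal sketch of how a single bracketed range expands (a simplification; real Slurm hostlists also allow comma-separated lists and multiple brackets, and `scontrol show hostnames` does this expansion for you):

```python
import re

def expand_hostlist(expr):
    """Expand a simple Slurm-style bracketed range like 'nid00[008-296]'
    into individual names, preserving zero-padding.  Sketch only:
    handles a single [lo-hi] range, unlike Slurm's full syntax."""
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if not m:
        return [expr]  # no bracket: already a single host name
    prefix, lo, hi, suffix = m.groups()
    width = len(lo)  # zero-padding width taken from the low bound
    return ["%s%0*d%s" % (prefix, width, i, suffix)
            for i in range(int(lo), int(hi) + 1)]

names = expand_hostlist("nid00[008-296]")
print(len(names), names[0], names[-1])  # → 289 nid00008 nid00296
```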
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

Comment 7 — Danny Auble:

Hmm, I would have expected NoInAddrAny to fix this issue for the slurmd, but I see that isn't the case. It is possible, though :). I'll do this right after I tag 15.08.7 :).

Doug Jacobsen:

Pretty please tag 15.08.7 -- Cori will be out of maintenance soon and I'm hoping to get it in today =)
Danny Auble:

Your wish has been granted; it is available for download now.

Danny Auble:

Doug, I believe commit 4b9cf7319b54f3 will give you what you want. It will be in 15.08.8. Let me know if it doesn't work as you expect; hopefully it will get rid of the crazy number of node lines in your config :).

Tim:

Hey Doug - We were troubleshooting an issue for a non-NERSC system and, while comparing configs to how you have them at NERSC, realized you hadn't had a chance to cut over to this yet. 15.08.9 will have a NoCtldInAddrAny flag as well. With 15.08.8 and earlier, NoInAddrAny also affects the slurmctld; afterwards it will not, and you would have to set NoCtldInAddrAny to get the same behavior (though we think that would be undesirable in your case). We're hoping that with 15.08 you'll be able to set NoInAddrAny, clean up the config substantially, and use a single config throughout the cluster. (Or at least most of the cluster; you may still have some discrepancy with the eslogin nodes?)

- Tim
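For reference, the flags discussed here are CommunicationParameters options in slurm.conf. A sketch of the setup Tim describes (binding slurmd to its hostname/NodeAddr interface instead of INADDR_ANY, so the per-node NodeAddr lines can go away); the version split is as stated in the comment above:

```
# slurm.conf -- sketch, not Edison's actual config
CommunicationParameters=NoInAddrAny
# From 15.08.9 on, NoInAddrAny no longer affects slurmctld; add
# NoCtldInAddrAny only if the controller should also bind that way
# (per Tim's comment, likely undesirable in this case):
#CommunicationParameters=NoInAddrAny,NoCtldInAddrAny
```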