Here is our configuration:

zz217@nid00292:~/tests/coreid> scontrol show config
Configuration data as of 2016-01-05T18:26:45
AccountingStorageBackupHost = edique02
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost = edique01
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/cray
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 1
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = 128.55.143.33
BackupController = edique01
BatchStartTimeout = 60 sec
BOOT_TIME = 2016-01-05T08:29:12
BurstBufferType = (null)
CacheGroups = 0
CheckpointType = checkpoint/none
ChosLoc = (null)
ClusterName = edison
CompleteWait = 0 sec
ControlAddr = nid01605
ControlMachine = nid01605
CoreSpecPlugin = core_spec/cray
CpuFreqDef = OnDemand
CpuFreqGovernors = OnDemand
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerNode = UNLIMITED
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = Yes
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FastSchedule = 1
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = craynetwork
GroupUpdateForce = 0
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 600 sec
JobAcctGatherFrequency = 0
JobAcctGatherType = jobacct_gather/cgroup
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/cncu
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 0
JobSubmitPlugins = cray,lua
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 1
KillWait = 30 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = (null)
LicensesUsed = (null)
MailProg = /bin/mail
MaxArraySize = 65000
MaxJobCount = 200000
MaxJobId = 2147418112
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 128
MemLimitEnforce = Yes
MessageTimeout = 60 sec
MinJobAge = 300 sec
MpiDefault = openmpi
MpiParams = ports=63001-64000
MsgAggregationParams = WindowMsgs=1,WindowTime=100
NEXT_JOB_ID = 3542
OverTimeLimit = 0 min
PluginDir = /opt/slurm/default/lib/slurm
PlugStackConfig = /opt/slurm/etc/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = REQUEUE
PreemptType = preempt/qos
PriorityParameters = (null)
PriorityDecayHalfLife = 8-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags = ACCRUE_ALWAYS
PriorityMaxAge = 21-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 30240
PriorityWeightFairShare = 1440
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 24480
PriorityWeightTRES = (null)
PrivateData = none
ProctrackType = proctrack/cray
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = Alloc
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
RoutePlugin = (null)
SallocDefaultCommand = srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --gres=craynetwork:0 --mpi=none $SHELL
SchedulerParameters = no_backup_scheduling,bf_window=5760,bf_resolution=120,bf_max_job_array_resv=20,default_queue_depth=400,bf_max_job_test=6000,bf_max_job_user=10,bf_continue,nohold_on_prolog_fail,kill_invalid_depend
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cray
SelectTypeParameters = CR_SOCKET_MEMORY,OTHER_CONS_RES
SlurmUser = root(0)
SlurmctldDebug = debug
SlurmctldLogFile = /var/tmp/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = info
SlurmdLogFile = /var/spool/slurmd/%h.log
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPlugstack = (null)
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /opt/slurm/etc/slurm.conf
SLURM_VERSION = 15.08.6
SrunEpilog = (null)
SrunPortRange = 60001-63000
SrunProlog = (null)
StateSaveLocation = /global/syscom/sc/nsg/var/edison-slurm-state
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/cray
TaskEpilog = (null)
TaskPlugin = task/cgroup,task/cray
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 18
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
This is a follow-on to bug 2307. Beyond what is suggested there, I recommend the changes below.

Immediate changes I would make:

Change JobAcctGatherType to jobacct_gather/linux. The jobacct_gather/cgroup plugin only reads the memory info from a cgroup; everything else is pulled from the proc filesystem, just as the linux plugin does. It is much slower and adds very little functionality. I can't think of a time I have ever recommended it to anyone for production use.

Change MsgAggregationParams = WindowMsgs=1,WindowTime=100 to something different, or just remove it. With WindowMsgs=1 this adds very little. Perhaps set WindowMsgs=100 and WindowTime=20 -- that is usually what I tell people -- but you should experiment a bit to find the correct values for your system.

You might also consider these other options:

Since you already have PrologFlags = Alloc, I usually like to add "nohold" so the delay of the prolog running doesn't hold up an salloc at submit time; the wait is instead pushed to the first srun, which will usually go unnoticed by the user since the prolog will most likely already have run by then.

You might also find the "fair_tree" style of fairshare more appealing than the normal fairshare; add "fair_tree" to PriorityFlags. You can see a presentation at http://slurm.schedmd.com/SC14/BYU_Fair_Tree.pdf.

Please let me know if you have any questions/comments.
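Put together, the recommendations above would look roughly like this in slurm.conf. This is an illustrative fragment only -- the WindowMsgs/WindowTime values are the suggested starting points to tune, not measured optima, and the optional flags are shown merged with the existing settings from the config dump above:

```
# /proc-based gatherer; the cgroup gatherer only adds cgroup memory
# reads on top of the same /proc scans, and is much slower.
JobAcctGatherType=jobacct_gather/linux

# Batch outbound messages: flush a window after 100 messages or 20 ms,
# whichever comes first. Tune these values for your cluster.
MsgAggregationParams=WindowMsgs=100,WindowTime=20

# Optional: don't make salloc wait on the prolog at submit; defer the
# wait to the first srun.
PrologFlags=Alloc,NoHold

# Optional: Fair Tree fairshare algorithm, added to the existing flag.
PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
```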
Hi Danny,

Thanks for taking a look.

I don't actually have message aggregation configured -- it seemed to be causing problems with the settings recommended at SLUG, so I just removed the config:

nid01605:~ # cat /opt/slurm/etc/slurm.conf | grep MsgAggregationParams
nid01605:~ #

and yet the config appears to be set:

nid01605:~ # scontrol show config | grep MsgAgg
MsgAggregationParams = WindowMsgs=1,WindowTime=100
nid01605:~ #

I'm guessing these are the implicit defaults... however, whenever slurmd is started or HUPd it reports that message aggregation is disabled.

Regarding jobacct_gather/cgroup vs. jobacct_gather/linux: on cori I have jobacct_gather/linux configured, and on edison cgroup. I was planning to move cori to cgroup because many users have been complaining about incorrect termination of job steps -- slurmstepd kills them for exceeding resident-memory limits when their processes perform lazy, copy-on-write-style forking. As far as I know, cgroup-style memory accounting handles this use case well, whereas /proc summations by an outside observer are likely to be inaccurate. I also removed the default memory limits for our Shared=EXCLUSIVE partitions (on both systems).

I will take a look at the fair_tree documentation you sent -- thanks!

-Doug
(In reply to Doug Jacobsen from comment #2)
> Hi Danny,
>
> Thanks for taking a look.
>
> I don't actually have message aggregation configured -- it seemed to be
> causing problems with the recommended settings from SLUG, so I just removed
> the config:
>
> nid01605:~ # cat /opt/slurm/etc/slurm.conf | grep MsgAggregationParams
> nid01605:~ #
>
> and yet the config appears to be set:
>
> nid01605:~ # scontrol show config | grep MsgAgg
> MsgAggregationParams = WindowMsgs=1,WindowTime=100
> nid01605:~ #
>
> I'm guessing these are the implicit defaults.... however whenever slurmd is
> started or is HUPd it reports that message aggregation is disabled.

I see the same thing; I'll see about making that not print anything when aggregation is not enabled. Your guess is most likely correct, though -- it is the default.

> Regarding jobacctgather/cgroup vs jobacctgather/linux. On cori I have
> jobacctgather/linux configured and on edison cgroup. Was planning to move
> cori to cgroup because many users have been complaining about incorrect
> termination of job steps owing to slurmstepd killing them for exceeding
> resident memory limits when processes were performing lazy copy-on-write
> style forking. As far as I know cgroup-style memory accounting works well
> for this use-case whereas /proc summations by an outside observer are likely
> to be inaccurate. I also removed the default memory limits for our
> Shared=EXCLUSIVE partitions (on both systems).

Keep in mind that much of the memory enforcement is done in the task/cgroup plugin. I would be interested in seeing an example of what you are describing. What I have witnessed is that the cgroup plugin reports slightly more than the linux plugin, but never very different (<1M difference). What I do know is that it is much slower; unless you are running HTC workloads (100+ jobs a second), though, you will probably never notice.

> I will take a look at the fair_tree documentation you sent -- thanks!

No problem. Let me know what you decide.

> -Doug
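For illustration, the copy-on-write double-counting described above can be reproduced with a short script. This is a hypothetical sketch (the 50 MB buffer and the sleeps are arbitrary, and it is Linux-specific), not output from either system: summing VmRSS across a parent and its freshly forked child counts the shared, copy-on-write pages twice, which is exactly why an outside observer walking /proc can overestimate real memory use.

```python
import os
import signal
import time

def rss_kb(pid):
    # Read VmRSS (resident set size, in kB) from /proc/<pid>/status.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

# Touch ~50 MB in the parent; after fork() these pages are shared
# copy-on-write with the child.
data = bytearray(50 * 1024 * 1024)
parent_rss = rss_kb(os.getpid())

pid = os.fork()
if pid == 0:
    time.sleep(5)   # child idles; it never writes, so pages stay shared
    os._exit(0)

time.sleep(0.2)     # give the child a moment to settle
child_rss = rss_kb(pid)
os.kill(pid, signal.SIGKILL)
os.waitpid(pid, 0)

# A /proc-walking observer that sums per-process RSS counts the
# shared 50 MB twice, even though only one physical copy exists.
naive_total = parent_rss + child_rss
print(f"parent={parent_rss} kB child={child_rss} kB naive sum={naive_total} kB")
```

A per-cgroup memory counter, by contrast, charges each physical page once, which is why cgroup accounting behaves better for fork-heavy workloads.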
MsgAggregationParams = WindowMsgs=1,WindowTime=100 is fixed in commit af3d1eadbcd; it will now correctly print null when aggregation is not configured. If you have any more questions please reopen, but I think you are in good shape. Thanks!
Doug, I just saw your slurm.conf from bug 2350. You may be able to shorten your slurm.conf for Edison significantly, which would help in quite a few ways.

I believe all your NodeName lines can be reduced to:

NodeName=DEFAULT CPUs=48 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 RealMemory=64523 TmpDisk=32261
NodeName=nid00[008-296] Weight=100
NodeName=nid0[0297-6143] Weight=1000

I would expect this to speed up reading the slurm.conf file dramatically, and it would make administration much easier. I don't believe the NodeAddr is needed; if it is, that would be slightly unfortunate, as you would have to break the nodes up into chunks of 254, but even then the file would be much smaller and more manageable.
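As an aside, the bracketed hostlist expressions above expand the way you would expect. A minimal sketch of the expansion follows; this is an illustrative re-implementation for a single zero-padded range, not Slurm's actual hostlist code (which also handles comma-separated ranges and multiple brackets):

```python
import re

def expand_hostlist(expr):
    """Expand a simple Slurm-style hostlist like 'nid00[008-296]'.

    Only handles one bracketed numeric range; zero-padding width is
    taken from the lower bound.
    """
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if not m:
        return [expr]
    prefix, lo, hi, suffix = m.groups()
    width = len(lo)
    return [f"{prefix}{i:0{width}d}{suffix}"
            for i in range(int(lo), int(hi) + 1)]

nodes = expand_hostlist("nid00[008-296]")
print(len(nodes), nodes[0], nodes[-1])  # 289 nid00008 nid00296
```

So the two-line form covers nid00008 through nid06143 without thousands of individual NodeName lines.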
Hi Danny,

I'm using NodeAddr to force slurmd to listen only on the ipogif0 interface and not on the RSIP interface. If there were another way to accomplish this, I would *greatly* appreciate and prefer it.

-Doug

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________
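For background on why the listen address matters here: binding a daemon's socket to one interface address, instead of the INADDR_ANY wildcard, restricts which interfaces can reach it. A minimal sketch in Python (loopback stands in for the ipogif0 address in this hypothetical example):

```python
import socket

# Wildcard bind: the socket accepts connections arriving on ANY
# local interface (this is what NodeAddr is being used to avoid).
wild = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
wild.bind(("0.0.0.0", 0))        # INADDR_ANY, ephemeral port

# Specific bind: only traffic addressed to this interface's address
# is accepted. On a Cray node this would be the ipogif0 address.
pinned = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
pinned.bind(("127.0.0.1", 0))

wild_addr = wild.getsockname()[0]
pinned_addr = pinned.getsockname()[0]
print(wild_addr, pinned_addr)    # 0.0.0.0 127.0.0.1

wild.close()
pinned.close()
```

A daemon that binds the wildcard address ends up reachable over RSIP as well, which is the behavior being worked around with per-node NodeAddr entries.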
Hmm, I would have expected NoInAddrAny to fix this for slurmd, but I see that isn't currently the case. It is possible, though :). I'll do this right after I tag 15.08.7 :).
Pretty please tag 15.08.7 -- cori will be out of maintenance soon and I'm hoping to get it in today =)

-Doug
Your wish has been granted, it is available for download now.
Doug, I believe commit 4b9cf7319b54f3 will give you what you want; it will be in 15.08.8. Let me know if it doesn't work as you expect -- hopefully it will get rid of the crazy number of node lines in your config :).
Hey Doug -

We were troubleshooting an issue for a non-NERSC system, were comparing their configs to how you have things set up at NERSC, and realized you hadn't had a chance to cut over to this yet.

15.08.9 will have a NoCtldInAddrAny flag as well. With 15.08.8 and earlier, NoInAddrAny also affects slurmctld; afterwards it will not, and you'd have to set NoCtldInAddrAny to get the same behavior (but we think that behavior would be undesirable in your case).

We're hoping that with 15.08 you'll be able to set NoInAddrAny, clean up the config substantially, and use a single config throughout the cluster. (Or at least most of the cluster -- you may still have some discrepancy with the eslogin nodes?)

- Tim