| Summary: | Configuration suggestions | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Danny Auble <da> |
| Component: | Configuration | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | dmjacobsen, zzhao |
| Version: | 15.08.6 | | |
| Hardware: | Cray XC | | |
| OS: | Linux | | |
| Site: | NERSC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | Edison |
| CLE Version: | | Version Fixed: | 15.08.7 16.05.0-pre1 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
Description
Danny Auble
2016-01-06 04:46:36 MST
This is a follow-on to bug 2307. Beyond what is suggested there, I recommend the changes below.

Immediate changes I would make:

Change JobAcctGatherType to jobacct_gather/linux. The jobacct_gather/cgroup plugin only reads the memory info from a cgroup; everything else is pulled from the proc filesystem, just as the linux plugin does. It is much slower and adds very little functionality. I can't think of a time I have ever recommended it to anyone for production use.

Change MsgAggregationParams = WindowMsgs=1,WindowTime=100 to something different, or just remove it. With WindowMsgs=1 it adds very little. Perhaps set WindowMsgs=100 and WindowTime=20, which is what I usually tell people, but you should experiment a bit to find the right values.

Other options you might consider:

Since you already have PrologFlags = Alloc, I usually like to add "NoHold" so that the delay of the prolog running doesn't hold up an salloc at submit time; the wait is instead pushed to the first srun, which will usually go unnoticed by the user since the prolog will most likely already have run.

You might find the "fair_tree" style of fairshare more appealing than normal fairshare; add "fair_tree" to PriorityFlags. There is a presentation at http://slurm.schedmd.com/SC14/BYU_Fair_Tree.pdf.

Please let me know if you have any questions/comments.

Comment 2 — Doug Jacobsen:

Hi Danny,

Thanks for taking a look.

I don't actually have message aggregation configured -- it seemed to be causing problems with the recommended settings from SLUG, so I just removed the config:

nid01605:~ # cat /opt/slurm/etc/slurm.conf | grep MsgAggregationParams
nid01605:~ #

and yet the config appears to be set:

nid01605:~ # scontrol show config | grep MsgAgg
MsgAggregationParams = WindowMsgs=1,WindowTime=100
nid01605:~ #

I'm guessing these are the implicit defaults... however, whenever slurmd is started or HUP'd it reports that message aggregation is disabled.
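Taken together, Danny's immediate suggestions might look like the following slurm.conf fragment. This is a sketch, not Edison's actual config; the window values are the starting points suggested above and should be tuned per site.

```
# Suggested changes from this report (starting points, tune per site):
JobAcctGatherType=jobacct_gather/linux
# Either remove MsgAggregationParams entirely or widen the window:
MsgAggregationParams=WindowMsgs=100,WindowTime=20
# NoHold defers the prolog wait from salloc submit to the first srun:
PrologFlags=Alloc,NoHold
# Fair-tree fairshare (see the BYU SC14 presentation linked above):
PriorityFlags=FAIR_TREE
```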
Regarding jobacct_gather/cgroup vs. jobacct_gather/linux: on Cori I have jobacct_gather/linux configured, and on Edison cgroup. I was planning to move Cori to cgroup because many users have been complaining about incorrect termination of job steps, owing to slurmstepd killing them for exceeding resident-memory limits when processes were performing lazy copy-on-write style forking. As far as I know, cgroup-style memory accounting works well for this use case, whereas /proc summations by an outside observer are likely to be inaccurate. I also removed the default memory limits for our Shared=EXCLUSIVE partitions (on both systems).

I will take a look at the fair_tree documentation you sent -- thanks!

-Doug

Danny Auble, in reply to Doug Jacobsen from comment #2:

> I'm guessing these are the implicit defaults... however, whenever slurmd is
> started or HUP'd it reports that message aggregation is disabled.

I see the same thing; I'll see about making it not print that when aggregation is not enabled. Your guess is most likely correct, though: it is the default.
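Doug's copy-on-write point can be illustrated with a rough sketch of the kind of /proc polling an outside observer does (this is illustrative Python, not Slurm's actual plugin code): summing VmRSS across a process tree double-counts pages that a forked child still shares with its parent, so the total can far exceed the memory actually in use, while a cgroup's accounting charges each page once.

```python
import os

def tree_rss_kib(root_pid):
    """Sum VmRSS (resident set size, KiB) over a process and its
    descendants by polling /proc -- roughly what an outside observer
    does.  After a copy-on-write fork, parent and child both report
    the shared pages in VmRSS, so this sum overstates real usage."""
    def children(pid):
        out = []
        try:
            tids = os.listdir("/proc/%d/task" % pid)
        except OSError:
            return out
        for tid in tids:
            try:
                # /proc/<pid>/task/<tid>/children lists child PIDs
                with open("/proc/%d/task/%s/children" % (pid, tid)) as f:
                    out += [int(c) for c in f.read().split()]
            except OSError:
                pass
        return out

    total, stack = 0, [root_pid]
    while stack:
        pid = stack.pop()
        try:
            with open("/proc/%d/status" % pid) as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        total += int(line.split()[1])  # value is in kB
                        break
        except OSError:
            continue  # process exited between listing and reading
        stack.extend(children(pid))
    return total

print(tree_rss_kib(os.getpid()))
```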
> As far as I know, cgroup-style memory accounting works well for this use
> case, whereas /proc summations by an outside observer are likely to be
> inaccurate. I also removed the default memory limits for our
> Shared=EXCLUSIVE partitions (on both systems).

Keep in mind that much of the memory handling is done in the task/cgroup plugin. I would be interested in seeing an example of what you are talking about. What I have witnessed is that the cgroup plugin reports slightly more than the linux plugin, but never very different (<1M difference). What I do know is that it is much slower, though unless you are running HTC workloads (100+ jobs a second) you will probably never notice.

> I will take a look at the fair_tree documentation you sent -- thanks!

No problem. Let me know what you decide.

Danny Auble:

MsgAggregationParams = WindowMsgs=1,WindowTime=100 is fixed in commit af3d1eadbcd. Now it will print the correct NULL. If you have any more questions please reopen, but I think you are in good shape, thanks!

Comment 5 — Danny Auble:

Doug, I just saw your slurm.conf from bug 2350. You may be able to significantly shorten your slurm.conf for Edison, which would help in quite a few ways. I believe all your NodeName lines can be shortened to:

NodeName=DEFAULT CPUS=48 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 gres=craynetwork:4 RealMemory=64523 TmpDisk=32261
NodeName=nid00[008-296] Weight=100
NodeName=nid0[0297-6143] Weight=1000

I would expect this to speed up reading the slurm.conf file dramatically and make administration much easier. I don't believe the NodeAddr is needed; if it is, that would be slightly unfortunate, as you would have to break the nodes up into chunks of 254, but even then the file would be much smaller and more manageable.

Doug Jacobsen:

Hi Danny,

I'm using NodeAddr to force slurmd to listen only on the ipogif0 interface and not on the RSIP interface. If there were another way to communicate this, I would *greatly* appreciate and prefer it.

-Doug
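The bracketed NodeName ranges above are Slurm hostlist expressions. A minimal sketch of how a single bracketed range expands (a simplification; real Slurm hostlists also allow comma-separated lists and multiple brackets, and `scontrol show hostnames` does this expansion for you):

```python
import re

def expand_hostlist(expr):
    """Expand a simple Slurm-style bracketed range like 'nid00[008-296]'
    into individual names, preserving zero-padding.  Sketch only:
    handles a single [lo-hi] range, unlike Slurm's full syntax."""
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if not m:
        return [expr]  # no bracket: already a single host name
    prefix, lo, hi, suffix = m.groups()
    width = len(lo)  # zero-padding width taken from the low bound
    return ["%s%0*d%s" % (prefix, width, i, suffix)
            for i in range(int(lo), int(hi) + 1)]

names = expand_hostlist("nid00[008-296]")
print(len(names), names[0], names[-1])  # → 289 nid00008 nid00296
```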
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

Comment 7 — Danny Auble:

Hmm, I would have expected NoInAddrAny to fix this issue for the slurmd, but I see that isn't the case. It is possible, though :). I'll do this right after I tag 15.08.7 :).

Doug Jacobsen:

Pretty please tag 15.08.7 -- Cori will be out of maintenance soon and I'm hoping to get it in today =)
Danny Auble:

Your wish has been granted; it is available for download now.

Danny Auble:

Doug, I believe commit 4b9cf7319b54f3 will give you what you want. It will be in 15.08.8. Let me know if it doesn't work as you expect; hopefully it will get rid of the crazy number of node lines in your config :).

Tim:

Hey Doug - We were troubleshooting an issue for a non-NERSC system and, while comparing configs to how you have them at NERSC, realized you hadn't had a chance to cut over to this yet. 15.08.9 will have a NoCtldInAddrAny flag as well. With 15.08.8 and earlier, NoInAddrAny also affects the slurmctld; afterwards it will not, and you would have to set NoCtldInAddrAny to get the same behavior (though we think that would be undesirable in your case). We're hoping that with 15.08 you'll be able to set NoInAddrAny, clean up the config substantially, and use a single config throughout the cluster. (Or at least most of the cluster; you may still have some discrepancy with the eslogin nodes?)

- Tim
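For reference, the flags discussed here are CommunicationParameters options in slurm.conf. A sketch of the setup Tim describes (binding slurmd to its hostname/NodeAddr interface instead of INADDR_ANY, so the per-node NodeAddr lines can go away); the version split is as stated in the comment above:

```
# slurm.conf -- sketch, not Edison's actual config
CommunicationParameters=NoInAddrAny
# From 15.08.9 on, NoInAddrAny no longer affects slurmctld; add
# NoCtldInAddrAny only if the controller should also bind that way
# (per Tim's comment, likely undesirable in this case):
#CommunicationParameters=NoInAddrAny,NoCtldInAddrAny
```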