Ticket 4468 - Configuration Review Request
Summary: Configuration Review Request
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 17.11.0
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
 
Reported: 2017-12-04 16:14 MST by Ben Matthews
Modified: 2018-01-18 18:20 MST

Site: UCAR


Attachments
config files (4.62 KB, application/x-bzip2)
2017-12-04 16:14 MST, Ben Matthews

Description Ben Matthews 2017-12-04 16:14:35 MST
Created attachment 5668 [details]
config files

Now that we're officially customers, I was wondering if you guys can review our configuration and make sure I'm not doing anything completely unreasonable (my understanding is that this is something that you guys like to do). 

I think we're a good target for federated scheduling, but it wasn't ready on the schedule that we needed, so we're making it work without for now.

Logically, we have a few domains, all are connected to our (publicly addressed) Ethernet Network (10/40/100GigE depending) (anything on the public ethernet has a host firewall, even within our network). 

- Cheyenne login nodes
  - This cluster is mainly scheduled by PBSPro. EDR IB internally and 10GigE (2 lagg'ed ports each). It is submit only into SLURM for now

- Cheyenne batch nodes (SLES12)
  - Also submit only, routed to Ethernet on a limited basis

- Yellowstone (RHEL6)
  - Going away soon, submit only, limited access to the public network, but direct access to SLURM over IB
  - Running the slurmctld and DB for now

- Geyser/Caldera/Pronghorn (RHEL6) 
  - on the Yellowstone IB fabric for now
  - Heterogeneous visualization cluster scheduled by SLURM (migrating from LSF)
  - We want to schedule GPUs, on an opt-in basis (lots of viz software barely uses them)

- "gladeslurm" (RHEL7)
  - 40GigE attached minimal containers with the SLURM runtime and an interface to our tape system only. Scheduled by SLURM. No cgroupfs exposed/don't much care about resource limits. 

- SLURM is defined on the IB networks, and /etc/hosts entries are used to let Ethernet-only hosts talk (since SLURM doesn't have a concept of a separate control path).

- For now, everything shares a single munge key. We'd like to eventually add some semi-trusted submit hosts with alternative keys.

- We have a handful of QoS settings which allow users to request higher priority (with a penalty in our in-house charging system).

- Walltime limits are defined on QoS levels for now; overrides are provided on a user/partition/project basis. I'd be happy to share more of our DB in a less public forum.
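(For illustration, limits of this shape are usually managed through sacctmgr; the QOS and user names below are placeholders, not from our actual DB:

```shell
# Walltime limit defined at the QOS level:
sacctmgr add qos economy
sacctmgr modify qos economy set MaxWall=12:00:00

# Per-user association override for someone who needs longer runs:
sacctmgr modify user someuser set MaxWall=3-00:00:00
```

Association-level limits take precedence in the usual Slurm limit hierarchy, which is how the per-user/project overrides work.)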

- To get around differences in environment (LD_PRELOADs and custom libc are the main problems) between platforms, we erase the majority of the environment from a submit plugin (I'll attach it). We don't have a good solution for the salloc/srun flow yet; for now, we ask that users do something like srun --export=TERM,HOME,SHELL
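(The interactive workaround in practice looks roughly like the following; the job script name is illustrative:

```shell
# Interactive work: whitelist only a safe set of variables explicitly
srun --export=TERM,HOME,SHELL --pty /bin/bash

# Batch work: the submit plugin scrubs the environment, or users can
# opt into a clean environment themselves
sbatch --export=NONE job.sh
```

)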


Thanks for looking (and if I'm missing important parts of the configuration, let me know). 

-Ben
Comment 1 Tim Wickberg 2017-12-04 16:52:58 MST
(In reply to Ben Matthews from comment #0)
> Created attachment 5668 [details]
> config files
> 
> Now that we're officially customers, I was wondering if you guys can review
> our configuration and make sure I'm not doing anything completely
> unreasonable (my understanding is that this is something that you guys like
> to do). 
> 
> I think we're a good target for federated scheduling, but it wasn't ready on
> the schedule that we needed, so we're making it work without for now.

I don't think you really need federation here; just tie all the systems together with a single slurmdbd installation and use -M to switch which cluster the commands are talking to.
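As a sketch of that multi-cluster setup (cluster names are taken from your description and are illustrative): once every cluster is registered against the one slurmdbd, -M/--clusters redirects each command:

```shell
sbatch -M geyser job.sh        # submit to the "geyser" cluster
squeue -M geyser,gladeslurm    # view queues on several clusters at once
sacct -L                       # accounting data across all clusters
```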

> Logically, we have a few domains, all are connected to our (publicly
> addressed) Ethernet Network (10/40/100GigE depending) (anything on the
> public ethernet has a host firewall, even within our network). 

For srun to work directly you'll need to make sure it can receive inbound connections on ephemeral TCP ports from the compute nodes.

And/or set SrunPortRange with appropriate firewall rules in place.
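A hedged sketch of that pairing; the port range and source subnet are placeholders to adjust for your networks:

```ini
# slurm.conf: pin srun's listening ports to a fixed range instead of
# arbitrary ephemeral ports
SrunPortRange=60001-63000
```

Then allow that range inbound from the compute networks on the submit hosts, e.g. `iptables -A INPUT -p tcp -s 10.0.0.0/8 --dport 60001:63000 -j ACCEPT`.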

> -For now, everything shares a single munge key. We'd like to eventually add
> some semi-trusted submit hosts with alternative keys. 

That's not really supported at present; the two-stage keying mechanisms are designed around one key used within the cluster, and an alternate used to communicate from the slurmctld(s) <-> slurmdbd.

> -We have a handful of QoS settings which allow users to request higher
> priority (with a penalty in our in-house charging system). 
> 
> -Walltime limits are defined on QoS levels for now, overrides are provided
> on a user/partition/project basis. I'd be happy to share more of our DB in a
> less public forum. 

Feel free to email it, or I can flag this bug as private if desired.

> -To get around differences in environment (LD_PRELOADs and custom libc are
> the main problem) between platforms, we erase the majority of the
> environment from a submit plugin (I'll attach it). We don't have a good
> solution for the salloc/srun flow yet - for now, we ask that users do
> something like srun --export=TERM,HOME,SHELL

... seems like you want NERSC's cli_filter stuff. Although that is only tentatively scheduled for the 18.08 release.

I don't see anything major in the configs themselves, just a few suggestions:

- use NodeName=default to clean things up
- set JobAcctGatherType to linux instead of cgroup (should have nearly identical stats, and is much faster)
- lower the SlurmctldDebug level down to debug at most (debug4 is insanely verbose, and will slow down your system drastically)
- there is no such thing as JobCompType=jobcomp/slurmdbd
- MaxTime=INFINITE is not really recommended. Unless you never plan to take a maintenance outage ever?
- You might want to set MailDomain, although I'm guessing your email script handles that for you already.
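Pulling those suggestions into a slurm.conf sketch (node names, hardware attributes, and the mail domain are placeholders, not taken from your attached configs):

```ini
# Set shared attributes once with DEFAULT, then only list differences:
NodeName=DEFAULT Sockets=2 CoresPerSocket=8 RealMemory=64000
NodeName=viz[01-16]
NodeName=gpu[01-04] Gres=gpu:2

JobAcctGatherType=jobacct_gather/linux  # near-identical stats, much faster
SlurmctldDebug=debug                    # debug4 is far too verbose for production
# JobCompType=jobcomp/slurmdbd          # invalid; no such plugin exists
MailDomain=example.org                  # placeholder
```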

Also, make sure that StateSaveLocation can keep pace with your expected throughput level, or you'll see some performance problems.

- Tim
Comment 2 Ben Matthews 2017-12-04 17:16:01 MST
(In reply to Tim Wickberg from comment #1)
> (In reply to Ben Matthews from comment #0)
> > Created attachment 5668 [details]
> > config files
> > 
> > Now that we're officially customers, I was wondering if you guys can review
> > our configuration and make sure I'm not doing anything completely
> > unreasonable (my understanding is that this is something that you guys like
> > to do). 
> > 
> > I think we're a good target for federated scheduling, but it wasn't ready on
> > the schedule that we needed, so we're making it work without for now.
> 
> I don't think you really need federation here, just to tie all the systems
> together with a single slurmdbd installation and use -M to switch which
> cluster the commands are talking to.
> 
> > Logically, we have a few domains, all are connected to our (publicly
> > addressed) Ethernet Network (10/40/100GigE depending) (anything on the
> > public ethernet has a host firewall, even within our network). 
> 
> For srun to work directly you'll need to make sure it can receive inbound
> connections on ephemeral TCP ports from the compute nodes.
> 
> And/or set SrunPortRange with appropriate firewall rules in place.

We do have this set for places where interactive jobs make sense. It's hard to do from inside the large batch systems. It'd be nice if this weren't necessary and there were an option to proxy everything through slurmctld.

> 
> > -For now, everything shares a single munge key. We'd like to eventually add
> > some semi-trusted submit hosts with alternative keys. 
> 
> That's not really supported at present; the two-stage keying mechanisms are
> designed around one key used within the cluster, and an alternate used to
> communicate from the slurmctld(s) <-> slurmdbd.
> 
> > -We have a handful of QoS settings which allow users to request higher
> > priority (with a penalty in our in-house charging system). 
> > 
> > -Walltime limits are defined on QoS levels for now, overrides are provided
> > on a user/partition/project basis. I'd be happy to share more of our DB in a
> > less public forum. 
> 
> Feel free to email it, or I can flag this bug as private if desired.
> 
> > -To get around differences in environment (LD_PRELOADs and custom libc are
> > the main problem) between platforms, we erase the majority of the
> > environment from a submit plugin (I'll attach it). We don't have a good
> > solution for the salloc/srun flow yet - for now, we ask that users do
> > something like srun --export=TERM,HOME,SHELL
> 
> ... seems like you want NERSC's cli_filter stuff. Although that is only
> tentatively scheduled for the 18.08 release.
> 
> I don't see anything major in the configs themselves, just a few suggestions:
> 
> - use NodeName=default to clean things up

Sounds like a plan

> - set JobAcctGatherType to linux instead of cgroup (should have nearly
> identical stats, and is much faster)

We'll look into this one. One of the draws to using the cgroup way is that the stats should line up with our (cgroup based) custom out-of-memory management tooling. How much is much faster? I wouldn't think reading cgroup stats should be that expensive. 

> - lower the SlurmctldDebug level down to debug at most (debug4 is insanely
> verbose, and will slow down your system drastically)

Thanks, will do. This is a holdover from trying to get all the network/firewall stuff figured out

> - there is no such thing as JobCompType=jobcomp/slurmdbd

Should this be jobcomp/mysql? Does that mean that the slurmctld needs to be able to talk to the sql server directly? Doesn't that defeat the purpose of slurmdbd? 

> - MaxTime=INFINITE is not really recommended. Unless you never plan to take
> a maintenance outage ever?

There are political reasons for this - it's managed through db associations. I'd love to be able to set per-partition limits, but I need to be able to override them for special people. 

> - You might want to set MailDomain, although I'm guessing your email script
> handles that for you already.

Ok, I did forget about this, but yeah, our mail system should be fixing it (and more importantly, doing rate limiting). 

> 
> Also, make sure that StateSaveLocation can keep pace with your expected
> throughput level, or you'll see some performance problems.

It's SSD backed GPFS at the moment. It's not awful, but not the fastest either. What do you usually recommend for failover configurations? SSD+NFS? drbd? 

> 
> - Tim
Comment 3 Tim Wickberg 2017-12-05 22:39:46 MST
> We do have this set for places where interactive jobs make sense. It's hard
> to do from inside the large batch systems. It'd be nice if this weren't
> necessary and there were an option to proxy everything through slurmctld

That's an interesting idea. I can split that into an enhancement request if you'd like; I can think of at least a few other environments that could make use of some aspects of this.

> > - set JobAcctGatherType to linux instead of cgroup (should have nearly
> > identical stats, and is much faster)
> 
> We'll look into this one. One of the draws to using the cgroup way is that
> the stats should line up with our (cgroup based) custom out-of-memory
> management tooling. How much is much faster? I wouldn't think reading cgroup
> stats should be that expensive. 

It's only a mild suggestion at this point, and if you want the data to perfectly match up you can leave it alone then.

It's proven an order of magnitude or two slower in some internal testing, but you're probably not going to notice it in most instances.

> > - there is no such thing as JobCompType=jobcomp/slurmdbd
> 
> should this be jobcomp/mysql? Does that mean that the slurmctld needs to be
> able to talk to the sql server directly? Doesn't that defeat the purpose of
> slurmdbd? 

Probably not.

The slurmdbd accounting is always enabled when using AccountingStorageType=slurmdbd.

The JobComp accounting is separate, and is not what sacct is using by default. Most sites don't need that, or use it to dump a separate copy to plain-text files as a backup.
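In slurm.conf terms the two records are configured independently; a sketch, with the log path as a placeholder:

```ini
# Accounting (what sacct queries) flows through slurmdbd:
AccountingStorageType=accounting_storage/slurmdbd

# JobComp is a separate, optional completion record; jobcomp/filetxt keeps
# a plain-text backup copy, or it can simply be left unset:
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/job_completions.log
```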

> > Also, make sure that StateSaveLocation can keep pace with your expected
> > throughput level, or you'll see some performance problems.
> 
> It's SSD backed GPFS at the moment. It's not awful, but not the fastest
> either. What do you usually recommend for failover configurations? SSD+NFS?
> drbd? 

That should be fine. The recommendations vary with system throughput, and we don't mandate a specific implementation.

Most sites will do NFS (hopefully backed by SSD/NVMe), or use the Lustre/GPFS filesystem underpinning the rest of the applications/home directories for their cluster.

If you were building something dedicated for this, my personal recommendation is OCFS2 against a SAS/FC array plugged directly into the two control nodes.

I know a few sites are using DRBD, but I don't usually recommend it due to the complexity of getting it configured properly.
Comment 4 Ben Matthews 2017-12-06 11:00:28 MST
(In reply to Tim Wickberg from comment #3)
> > We do have this set for places where interactive jobs make sense. It's hard
> > to do from inside the large batch systems. It'd be nice if this weren't
> > necessary and there were an option to proxy everything through slurmctld
> 
> That's an interesting idea. I can split that into an enhancement request if
> you'd like, I can think of at least a few other environments that could make
> use of some aspects of this.
> 

It's not the most critical thing in the world, but I'd like to see this happen. All the bidirectional connections are quite problematic to (securely) pass through firewalls between internet-addressable hosts. If there were an option to have all communications pass through one socket, preferably a connection from the compute host to slurmctld, that would make things much easier, especially if people start federating remote clusters (something we're certainly looking into). Obviously this would have to be reasonably performant, but I don't think it should be that hard to do at the scale of the vast majority of today's systems.
Comment 6 Tim Wickberg 2018-01-18 18:20:49 MST
Ben - 

Tagging this as resolved/infogiven; I believe most of the questions have moved on to bug 4643 and similar at this point.