Ticket 5285

Summary: slurmsmwd documentation and configuration, with slurmctld running separately on its own service node
Product: Slurm Reporter: S Senator <sts>
Component: Documentation Assignee: Tim Wickberg <tim>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: bsantos, dmjacobsen, fullop, lena, tsskinner
Version: 17.11.7   
Hardware: Cray XC   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=3318
https://bugs.schedmd.com/show_bug.cgi?id=5569
Site: LANL Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: trinity CLE Version: UP05
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description S Senator 2018-06-08 13:03:40 MDT
Would you please provide documentation and configuration guidance for slurmsmwd? In particular, our Cray systems run slurmctld on a separate service node rather than on the SMW, so if that architecture has any particular requirements, could you please note them?

I am opening this ticket because I didn't find any reference through the normal search mechanisms on docs.schedmd.com other than https://bugs.schedmd.com/show_bug.cgi?id=3318 comments 67 and 74.

Thank you.
Comment 1 Doug Jacobsen 2018-06-08 13:05:47 MDT
we have an ethernet pathway wherein the SMW can reach the slurmctld node.  it would be quite a bad idea to run slurmctld on the smw =)
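
A quick way to confirm that such a network path exists is to attempt a TCP connection from the SMW to the slurmctld port. The following is only a minimal sketch: the host name in the usage note is hypothetical, and port 6817 is Slurm's default SlurmctldPort, which a site may have changed in slurm.conf.

```python
import socket


def can_reach(host, port=6817, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    6817 is Slurm's default SlurmctldPort; pass the site's value if it differs.
    """
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

On the SMW this might be called as can_reach("slurmctld-node"), where "slurmctld-node" stands in for whatever name the service node has on the ethernet path. A successful connection only shows that the port is open; it does not prove that MUNGE authentication will succeed.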
Comment 2 S Senator 2018-06-08 13:07:25 MDT
I was misled by the name then. Guidance appreciated.
Comment 3 Tim Wickberg 2018-06-08 14:20:41 MDT
As Doug mentions, you do need a network path available between the SMW and wherever slurmctld lives.

On the SMW:

- MUNGE needs to be installed, and use the same key as the rest of the system
- slurm.conf needs to be accessible in its usual location
- libslurm needs to be available
- slurmsmwd.conf needs to be set up; it has only three config options at present, which you'll need to adjust:

CabinetsPerRow=12
LogFile=/var/opt/cray/log/p0-current/slurmsmwd.log
DebugLevel=debug3
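
A small sanity check for a slurmsmwd.conf like the one above might look like the following. This is a sketch under stated assumptions, not Slurm's own parser: it assumes plain Key=Value lines with '#' comments, it matches key names case-sensitively for simplicity, and the helper names are made up for illustration.

```python
# The three options named in this ticket.
REQUIRED_KEYS = ("CabinetsPerRow", "LogFile", "DebugLevel")


def parse_slurmsmwd_conf(text):
    """Parse Key=Value lines into a dict, skipping blanks and '#' comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a Key=Value line: {raw!r}")
        options[key.strip()] = value.strip()
    return options


def check_slurmsmwd_conf(options):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing option: {k}" for k in REQUIRED_KEYS if k not in options]
    cabinets = options.get("CabinetsPerRow", "")
    if cabinets and not (cabinets.isdigit() and int(cabinets) > 0):
        problems.append(f"CabinetsPerRow should be a positive integer, got {cabinets!r}")
    return problems
```

Reading the file and passing its text through parse_slurmsmwd_conf and then check_slurmsmwd_conf gives a list of missing or malformed options. The check does not validate that the LogFile directory exists or that the DebugLevel value is one slurmsmwd accepts.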
Comment 4 S Senator 2018-06-08 14:56:27 MDT
In particular, I want to be sure that installing those RPMS (slurm, munge) plays well with the standard Cray ansible plays and doesn't require post-RPM removal of any artifacts that they normally install but would be inappropriate on the SMW.
Comment 5 Tim Wickberg 2018-06-08 15:21:14 MDT
(In reply to S Senator from comment #4)
> In particular, I want to be sure that installing those RPMS (slurm, munge)
> plays well with the standard Cray ansible plays and doesn't require post-RPM
> removal of any artifacts that they normally install but would be
> inappropriate on the SMW.

To the best of my knowledge, it should not cause any issues.

Note that you only need the slurm and slurm-slurmsmwd packages.
Comment 6 Tim Skinner 2018-06-13 12:44:33 MDT
We are seeing lots of conflicts trying to install these packages. As I solve some I keep hitting the next one and the next etc. 
Currently Slurm will not install due to a conflict between cray-libcodbc0 and cray-xt-libcodbc0.

The rest of the system has cray-munge-0.5.11-8.1 installed, the SMW has munge-0.5.12-3.1 installed as a requirement of cray-diod and cray-imps-distribution-smw.  The SMW also has a different munge.key 

Please advise.
Comment 7 Tim Wickberg 2018-06-13 12:58:53 MDT
(In reply to Tim Skinner from comment #6)
> We are seeing lots of conflicts trying to install these packages. As I solve
> some I keep hitting the next one and the next etc. 
> Currently Slurm will not install due to a conflict between cray-libcodbc0
> and cray-xt-libcodbc0.

That does not sound like a Slurm problem per se; neither of those packages is ours, nor directly required by our RPM spec file.

Or are those picked up as build-time dependencies?

> The rest of the system has cray-munge-0.5.11-8.1 installed, the SMW has
> munge-0.5.12-3.1 installed as a requirement of cray-diod and
> cray-imps-distribution-smw.  The SMW also has a different munge.key 

What is that alternate MUNGE key used for?

slurmsmwd will need the MUNGE key to match up, or it will be unable to trigger changes in slurmctld.

While it is possible to set up an alternate MUNGE installation and run multiple simultaneously, I wouldn't recommend it. If that install was provided by Cray, I would suggest you get in contact with them and work out what ramifications there may be to changing it over.
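
One way to check whether the SMW's key matches the one used by the rest of the system, without copying the key itself between hosts, is to compare a digest of each file. The sketch below assumes the key is readable by the user running it (normally root); the function names are illustrative and are not part of MUNGE or Slurm.

```python
import hashlib
import hmac


def key_fingerprint(path):
    """Return the SHA-256 hex digest of the key file at path."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def fingerprints_match(fp_a, fp_b):
    """Compare two hex digests in constant time."""
    return hmac.compare_digest(fp_a, fp_b)
```

Running key_fingerprint against the munge.key on the SMW and on the slurmctld node, then comparing the two digests, shows whether the keys are byte-for-byte identical. The digest is derived from a secret, so it is best treated with some care rather than pasted into a public ticket.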
Comment 8 Tim Wickberg 2018-06-27 12:01:19 MDT
I'm assuming this has been worked out by the lack of response. Marking resolved/infogiven at this time, please reopen if there's anything further we can help with.

- Tim
Comment 9 S Senator 2018-06-27 14:42:39 MDT
(In reply to Tim Wickberg from comment #8)
> I'm assuming this has been worked out by the lack of response. Marking
> resolved/infogiven at this time, please reopen if there's anything further
> we can help with.
> 
> - Tim

No, this hasn't been worked out. The lack of response is due to the fact that it has been deferred as a new feature that requires more scope definition, integration with the Cray configuration management mechanisms, coordinated planning and then implementation.

The scope definition will also include a security review, as there are significant questions whether using the same munge key on the external login nodes (as required by the Cray munge implementation on the SMW) as within the whole main frame would increase the attack surface and risk.
Comment 10 Tim Wickberg 2018-06-27 14:45:21 MDT
> No, this hasn't been worked out. The lack of response is due to the fact
> that it has been deferred as a new feature that requires more scope
> definition, integration with the Cray configuration management mechanisms,
> coordinated planning and then implementation.
> 
> The scope definition will also include a security review, as there are
> significant questions whether using the same munge key on the external login
> nodes (as required by the Cray munge implementation on the SMW) as within
> the whole main frame would increase the attack surface and risk.

Okay, I understand. In the meantime I will switch this to resolved/timedout on our end to make it clear we're waiting on further details from your end. If there's anything you need from us, please just flip it back to 'unconfirmed'.

thanks,
- Tim