| Summary: | slurmsmwd documentation and configuration, with slurmctld running separately on its own service node | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | S Senator <sts> |
| Component: | Documentation | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | bsantos, dmjacobsen, fullop, lena, tsskinner |
| Version: | 17.11.7 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=3318 https://bugs.schedmd.com/show_bug.cgi?id=5569 | ||
| Site: | LANL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | trinity | CLE Version: | UP05 |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
S Senator
2018-06-08 13:03:40 MDT
We have an Ethernet pathway through which the SMW can reach the slurmctld node.

It would be quite a bad idea to run slurmctld on the SMW =)

I was misled by the name then. Guidance appreciated.

As Doug mentions, you do need a network path available between the SMW and wherever slurmctld lives. On the SMW:
- MUNGE needs to be installed, and use the same key as the rest of the system
- slurm.conf needs to be accessible in its usual location
- libslurm needs to be available
- slurmsmwd.conf needs to be set up; it has only three config options at present, which you'll need to adjust:
  CabinetsPerRow=12
  LogFile=/var/opt/cray/log/p0-current/slurmsmwd.log
  DebugLevel=debug3

In particular, I want to be sure that installing those RPMs (slurm, munge) plays well with the standard Cray Ansible plays and doesn't require post-RPM removal of any artifacts that they normally install but would be inappropriate on the SMW.

(In reply to S Senator from comment #4)
> In particular, I want to be sure that installing those RPMS (slurm, munge)
> plays well with the standard Cray ansible plays and doesn't require post-RPM
> removal of any artifacts that they normally install but would be
> inappropriate on the SMW.
To the best of my knowledge, it should not cause any issues. Note that you only need the slurm and slurm-slurmsmwd packages.

We are seeing lots of conflicts trying to install these packages. As I solve some I keep hitting the next one and the next etc. Currently Slurm will not install due to a conflict between cray-libcodbc0 and cray-xt-libcodbc0. The rest of the system has cray-munge-0.5.11-8.1 installed; the SMW has munge-0.5.12-3.1 installed as a requirement of cray-diod and cray-imps-distribution-smw. The SMW also has a different munge.key. Please advise.

(In reply to Tim Skinner from comment #6)
> We are seeing lots of conflicts trying to install these packages. As I solve
> some I keep hitting the next one and the next etc.
> Currently Slurm will not install due to a conflict between cray-libcodbc0
> and cray-xt-libcodbc0.
That does not sound like a Slurm problem per se; neither of those packages is ours, or directly required by our RPM spec file. Or are those picked up as build-time dependencies?
> The rest of the system has cray-munge-0.5.11-8.1 installed, the SMW has
> munge-0.5.12-3.1 installed as a requirement of cray-diod and
> cray-imps-distribution-smw. The SMW also has a different munge.key
What is that alternate MUNGE key used for? slurmsmwd will need the MUNGE key to match up, or it will be unable to trigger changes in slurmctld. While it is possible to set up an alternate MUNGE installation and run multiple simultaneously, I wouldn't recommend it. If that install was provided by Cray, I would suggest you get in contact with them and work out what ramifications there may be to changing it over.

I'm assuming this has been worked out by the lack of response. Marking resolved/infogiven at this time, please reopen if there's anything further we can help with.
- Tim

(In reply to Tim Wickberg from comment #8)
> I'm assuming this has been worked out by the lack of response. Marking
> resolved/infogiven at this time, please reopen if there's anything further
> we can help with.
>
> - Tim
No, this hasn't been worked out. The lack of response is due to the fact that it has been deferred as a new feature that requires more scope definition, integration with the Cray configuration management mechanisms, coordinated planning and then implementation.

The scope definition will also include a security review, as there are significant questions whether using the same munge key on the external login nodes (as required by the Cray munge implementation on the SMW) as within the whole main frame would increase the attack surface and risk.

> No, this hasn't been worked out. The lack of response is due to the fact
> that it has been deferred as a new feature that requires more scope
> definition, integration with the Cray configuration management mechanisms,
> coordinated planning and then implementation.
>
> The scope definition will also include a security review, as there are
> significant questions whether using the same munge key on the external login
> nodes (as required by the Cray munge implementation on the SMW) as within
> the whole main frame would increase the attack surface and risk.
Okay, I understand. In the meantime I will switch this to resolved/timedout on our end to make it clear we're waiting on further details from your end. If there's anything you need from us, please just flip it back to 'unconfirmed'.
thanks,
- Tim
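
For anyone revisiting this later: the SMW prerequisites listed earlier in the thread (the three slurmsmwd.conf options and a MUNGE key matching the rest of the system) could be sanity-checked with something like the Python sketch below. This is only an illustration and not part of Slurm. The function names are hypothetical, the parser assumes plain Key=Value lines with exact-case option names, and treating all three options as required is an assumption drawn from the comment above.

```python
import hashlib
from pathlib import Path

# The three options named in this thread. Treating all of them as
# required is an assumption made for this sketch.
EXPECTED_KEYS = ("CabinetsPerRow", "LogFile", "DebugLevel")


def parse_conf(text):
    """Parse simple Key=Value lines, skipping blanks and '#' comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def missing_options(options):
    """Return the expected option names that are absent from the config."""
    return [key for key in EXPECTED_KEYS if key not in options]


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def keys_match(path_a, path_b):
    """Compare two key files by digest rather than printing their contents."""
    return file_digest(path_a) == file_digest(path_b)
```

In practice one would probably compute the digest of munge.key separately on the SMW and on a node of the main system and compare the two strings, rather than copying the key between hosts.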
|