Ticket 6561

Summary: Request for clarification of current working functionality for Crays
Product: Slurm
Reporter: Lena <lena>
Component: Heterogeneous Jobs
Assignee: Tim Wickberg <tim>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: bsantos, dmjacobsen, fullop, lena, sts
Version: 17.11.12
Hardware: Linux
OS: Linux
Site: LANL
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---

Description Lena 2019-02-21 20:02:42 MST
At the QBR, Tim presented slides showing which heterogeneous job submission functionality currently works and which does not. My understanding was that this information for Cray machines differed slightly from the official Slurm documentation.

Could you clarify which heterogeneous job submission functionality works and which does not on Cray machines, for Slurm 17.11.12 and Slurm 18.08.3?
Comment 1 Jason Booth 2019-02-21 21:17:28 MST
Please use the correct severity when logging issues.

For a complete description please see:
https://www.schedmd.com/support.php

SEVERITY LEVELS

Severity 1 — Major Impact

A Severity 1 issue occurs when there is a continued system outage that affects a large set of end users. The system is down and non-functional due to Slurm problem(s) and no procedural workaround exists.

Severity 2 — High Impact

A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system.

Severity 3 — Medium Impact

A Severity 3 issue is a medium-to-low impact problem that includes partial non-critical loss of system access or which impairs some operations on the system but allows the end user to continue to function on the system with workarounds.

Severity 4 — Minor Issues

A Severity 4 issue is a minor issue with limited or no loss in functionality within the customer environment. Severity 4 issues may also be used for recommendations for future product enhancements or modifications.
Comment 2 Lena 2019-02-21 21:38:53 MST
Jason, 
This is a high-impact issue for our site, since we are preparing on a short time frame for a large-scale heterogeneous run. Since Tim had this information at the QBR, we were hoping it would not be too much trouble for you to send it to us. Any assistance you can provide is highly appreciated.
Lena
Comment 7 Tim Wickberg 2019-02-22 19:51:43 MST
(In reply to Lena from comment #0)
> At the QBR, Tim presented slides showing which heterogeneous job
> submission functionality currently works and which does not. My
> understanding was that this information for Cray machines differed
> slightly from the official Slurm documentation.

I may have presented them at the QBR, but Doug wrote them. I will email them to you out of band. (I won't attach them publicly, and you're not in any of the security groups I could enable.[1])

> Could you clarify which heterogeneous job submission functionality works
> and which does not on Cray machines, for Slurm 17.11.12 and Slurm
> 18.08.3?

You can submit heterogeneous jobs, but you cannot run them with a unified MPI_COMM_WORLD under Cray MPI, or under any MPI on the 17.11 release.

I am closing this as a duplicate of enhancement request bug 4105, which has been tracking this issue and has more background on the limitations.

- Tim

[1] If you should be in the LANL security group, please ping Michael and ask him to add you on LANL's behalf. I'll also caution that you're not on our list of technical contacts for LANL, and thus there is some delay in verifying that you're supposed to be opening tickets. Please ask Joshi, Michael, or Steven to open these, or if you should be swapped in, please follow up with us directly.

*** This ticket has been marked as a duplicate of ticket 4105 ***