Ticket 4006

Summary:	mpi_launch time out, unless sbcast used prior - request barrier or equiv. like Cray alps did
Product:	Slurm	Reporter:	S Senator <sts>
Component:	Cray ALPS	Assignee:	Unassigned Developer <dev-unassigned>
Status:	OPEN ---	QA Contact:
Severity:	5 - Enhancement
Priority:	---	CC:	fullop, lena
Version:	17.02.6
Hardware:	Cray XC
OS:	Linux
Site:	LANL	Slinky Site:	---

Description S Senator 2017-07-18 10:47:07 MDT

We have seen differences in MPI launch behavior on Trinity between the current native Slurm environment and prior instantiations that used ALPS from Cray. This has been mitigated in large part by instructing users to use sbcast to distribute executables and data files as part of job submission. Some users are still unable or unwilling to use sbcast, and in those cases we see instances of PMI timing out. ALPS implemented a barrier that awaited the equivalent of an sbcast completion. We request that SchedMD work with Cray to see whether similar or equivalent behavior could be implemented in the native Slurm environment. This could be implemented in the Cray software stack or on the Slurm side.

Comment 1 Moe Jette 2017-07-18 14:20:34 MDT

Did you look at srun's --bcast option?
The application does not launch until after the file transfer completes.

Copy executable file to allocated compute nodes. If a file name is specified, copy the executable to the specified destination file path. If no path is specified, copy the file to a file named "slurm_bcast_<job_id>.<step_id>" in the current working directory. For example, "srun --bcast=/tmp/mine -N3 a.out" will copy the file "a.out" from your current directory to the file "/tmp/mine" on each of the three allocated compute nodes and execute that file. This option applies to step allocations.
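
For jobs that need more than the executable, one possible pattern is to run sbcast for each extra file and then launch with srun --bcast. The following is only a rough sketch under stated assumptions: the helper names (run, stage_and_launch), the DRY_RUN switch, the file names, and the /tmp destination are all hypothetical and are not part of Slurm. Only the sbcast and srun --bcast commands themselves come from Slurm.

```shell
#!/bin/bash
# Sketch only: stage extra files to node-local storage, then launch.
# Helper names, DRY_RUN, and all paths are hypothetical illustrations.

# Set DRY_RUN=1 to print the commands instead of running them
# (useful outside a Slurm allocation).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# stage_and_launch DEST_DIR EXECUTABLE [EXTRA_FILE...]
stage_and_launch() {
    local dest_dir=$1 exe=$2
    shift 2
    local f
    # Each sbcast call runs to completion before the next command starts;
    # stop at the first failure rather than launching with missing files.
    for f in "$@"; do
        run sbcast "$f" "$dest_dir/$(basename "$f")" || return 1
    done
    # srun --bcast copies the executable to the given path on each node
    # and launches it from there.
    run srun --bcast="$dest_dir/$(basename "$exe")" "$exe"
}

# Example call inside a batch script (hypothetical files):
#   stage_and_launch /tmp ./a.out input.dat params.cfg
```

With DRY_RUN=1 the example call only prints the two sbcast commands followed by the srun command, which makes it possible to check the staging order before submitting. This does not help applications whose file lists are not known in advance.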

Comment 2 Brian Christiansen 2017-08-01 16:52:14 MDT

Have you been able to try the --bcast option that Moe mentioned?

Comment 3 Joseph 'Joshi' Fullop 2017-08-17 15:37:31 MDT

We have used bcast in its various incarnations, but many user applications have too many file requirements to make this practical. We have been profiling our launches and have recently identified some problems in the configuration of the environment, which we are starting to resolve. We expect this to help, but do not know if it will be enough to stabilize our launches.

The discussions here have centered on the need for a synchronization mechanism similar to the one ALPS implemented. We understand that this could be rather involved to implement, but wanted to communicate where we currently stand. We also realize that there are ways to address file system contention. But since that is not the only delay-inducing variable at work in job launches, a more encompassing solution for synchronization might be necessary.

We will know more and report as our environment work progresses.

Comment 4 Brian Christiansen 2017-09-26 16:08:25 MDT

Marking this as an enhancement. Please keep us posted as you get new information.