We have seen differences in MPI launch behavior on Trinity between the current native Slurm environment and prior instantiations using Cray's ALPS. This has been largely mitigated by instructing users to use sbcast to distribute executables and data files as part of job submission. We still see cases where users are unable or unwilling to use sbcast, and in those cases we see PMI timing out. We would request that SchedMD work with Cray to see whether behavior similar or equivalent to what we saw under ALPS could be implemented in the native Slurm environment: ALPS implemented a barrier that waited for the equivalent of an sbcast to complete before launching. This could be implemented on the Cray software stack or on the Slurm side.
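For reference, the pattern we have been asking users to follow looks roughly like the sketch below; the executable name and destination path are illustrative only, not taken from any particular user's job. The binary is staged with sbcast inside the batch script, and the node-local copy is what gets launched.

    #!/bin/bash
    #SBATCH -N 4
    #SBATCH -t 00:30:00

    # Broadcast the executable to node-local storage on every allocated node.
    sbcast ./my_app /tmp/my_app

    # Launch the node-local copy rather than reading it from the shared filesystem.
    srun /tmp/my_app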
Have you looked at srun's --bcast option? The application does not launch until after the file transfer completes. It copies the executable file to the allocated compute nodes. If a file name is specified, it copies the executable to the specified destination file path. If no path is specified, it copies the file to a file named "slurm_bcast_<job_id>.<step_id>" in the current working directory. For example, "srun --bcast=/tmp/mine -N3 a.out" will copy the file "a.out" from your current directory to the file "/tmp/mine" on each of the three allocated compute nodes and execute that file. This option applies to step allocations.
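As a sketch only, the same example can be used from inside a batch script (reusing the file names from the example above):

    #!/bin/bash
    #SBATCH -N 3

    # Copies a.out from the submission directory to /tmp/mine on each of the
    # three allocated nodes, then executes /tmp/mine; the step does not start
    # until the transfer has completed.
    srun --bcast=/tmp/mine -N3 a.out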
Have you been able to try the --bcast option that Moe mentioned?
So we have used bcast in its various incarnations, but many user applications have too many file requirements to make this practical (see the sketch below). We have been profiling our launches and have recently identified some problems in the configuration of the environment, and we are starting to resolve those issues. We expect this to help, but we do not know if it will be enough to stabilize our launches. The discussion here has centered on the need for a synchronization mechanism similar to the one ALPS implemented. We understand this could be rather involved to implement, but wanted to communicate where we currently stand. We also realize that there are ways to address file system contention, but since that is not the only delay-inducing variable in job launches, a more encompassing synchronization solution may be necessary. We will know more and report back as our environment work progresses.
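To illustrate the practicality issue, here is a minimal sketch (file names, paths, and node count are purely illustrative) of what staging looks like once an application needs several inputs in addition to the executable: every file requires its own explicit sbcast step before srun, and nothing synchronizes the launch against files that are not staged this way.

    #!/bin/bash
    #SBATCH -N 256
    #SBATCH -t 01:00:00

    # Stage the executable and every required input to node-local storage.
    # Each additional file means another explicit sbcast before the launch.
    sbcast ./my_app    /tmp/my_app
    sbcast ./input.nml /tmp/input.nml
    sbcast ./mesh.dat  /tmp/mesh.dat

    # Launch from the node-local copies.
    srun /tmp/my_app --input /tmp/input.nml --mesh /tmp/mesh.dat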
Marking this as an enhancement. Please keep us posted as you get new information.