Ticket 7604

Summary: Long startup with srun and mpiexec
Product: Slurm Reporter: lenovosng
Component: Other    Assignee: Felip Moll <felip.moll>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 18.08.4   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7475
Site: LRZ Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Job output and slurm.conf

Description lenovosng 2019-08-20 08:28:19 MDT
Created attachment 11288 [details]
Job output and slurm.conf

A user reported sporadic long waiting times to start/run a simple MPI application with srun. The delay varies from several seconds up to 5+ minutes.
Attached are logfiles of a job running into this issue along with our slurm.conf file.
As can be seen from this logfile, there is a 5-minute gap between MPI executions at one point:

Tue 20 Aug 10:15:01 CEST 2019
Tue 20 Aug 10:15:03 CEST 2019

srun: Job 172530 step creation temporarily disabled, retrying
srun: Step created for job 172530
Tue 20 Aug 10:20:06 CEST 2019
Tue 20 Aug 10:20:08 CEST 2019
Comment 1 Felip Moll 2019-08-22 04:03:39 MDT
This bug seems to be a duplicate of bug 7540. The provided examples show that the submission is done every second and eventually you get:

srun: Job 172530 step creation temporarily disabled, retrying

which looks to me like the exact symptom of the other bug.


Is there anything different that makes you think this is not the same issue?
Comment 2 Felip Moll 2019-08-22 04:17:18 MDT
Sorry, in my last comment I meant bug 7475, not 7540.

In the meantime, can you set DefMemPerCPU in slurm.conf, or explicitly set the memory of each task (srun --mem) to a limited value?

Inside an allocation, Slurm allows oversubscribing CPUs, and when you don't set per-task memory, each step will take the whole memory of the node. That memory reservation blocks other steps from starting. Your error can be related to this: if every srun takes the whole memory on the node, the next srun that tries to start a step requesting all that memory must wait, because the previous step has not been cleaned up yet.
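As an illustrative sketch of that workaround (the values below are hypothetical and not taken from the attached slurm.conf), bounding the memory either cluster-wide or per step could look like:

```
# slurm.conf: give each task a bounded default memory per CPU (in MB),
# so a single step no longer reserves the whole node's memory
DefMemPerCPU=2048

# or, per step, cap the memory explicitly on the srun command line
# so consecutive steps can coexist on the node:
#   srun --mem-per-cpu=2048 ./mpi_app
```

Either way, the idea is the same: once steps request only part of the node's memory, a new step does not have to wait for the previous one to be fully cleaned up before it can be scheduled.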
Comment 3 Felip Moll 2019-08-30 09:31:47 MDT
Hi, since we've not received any response, I am marking this ticket as a duplicate of bug 7475.
If it is not the same thing, feel free to mark it as open again.

Thanks

*** This ticket has been marked as a duplicate of ticket 7475 ***