Ticket 7604 - Long startup with srun and mpiexec
Summary: Long startup with srun and mpiexec
Status: RESOLVED DUPLICATE of ticket 7475
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 18.08.4
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-08-20 08:28 MDT by lenovosng
Modified: 2019-08-30 09:31 MDT

See Also:
Site: LRZ
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Job output and slurm.conf (60.00 KB, application/x-tar)
2019-08-20 08:28 MDT, lenovosng
Details

Description lenovosng 2019-08-20 08:28:19 MDT
Created attachment 11288 [details]
Job output and slurm.conf

A user reported sporadic long waiting times to start/run a simple MPI application with srun. The delay varies from several seconds up to 5+ minutes.
Attached are the log files of a job that ran into this issue, along with our slurm.conf file.
As can be seen from this log, there is a 5-minute gap between MPI executions at one point:

Tue 20 Aug 10:15:01 CEST 2019
Tue 20 Aug 10:15:03 CEST 2019

srun: Job 172530 step creation temporarily disabled, retrying
srun: Step created for job 172530
Tue 20 Aug 10:20:06 CEST 2019
Tue 20 Aug 10:20:08 CEST 2019
Comment 1 Felip Moll 2019-08-22 04:03:39 MDT
This bug seems to be a duplicate of bug 7540. The provided examples show that the submission is done every second, and eventually you get:

srun: Job 172530 step creation temporarily disabled, retrying

which looks to me like the exact symptom of the other bug.


Is there anything different that makes you think it is not the same issue?
Comment 2 Felip Moll 2019-08-22 04:17:18 MDT
Sorry, in my last comment I meant bug 7475, not 7540.

In the meantime, can you set DefMemPerCPU in slurm.conf, or explicitly limit the memory of each task (srun --mem) to a bounded value?

Inside an allocation, Slurm allows oversubscribing CPUs, but when you don't set a memory limit for tasks, each step requests the whole memory of the node. Memory then blocks other steps from starting: if every srun takes the node's entire memory, the next step cannot start until the previous step has been cleaned up.
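As a sketch of the suggested workaround (the values and the application name below are illustrative, not recommendations for this site):

```shell
# slurm.conf: set a default per-CPU memory limit so a step no longer
# claims the node's entire memory by default (2000 MB is illustrative):
#
#   DefMemPerCPU=2000
#
# Alternatively, cap memory explicitly on each step launch
# ("./mpi_app" is a hypothetical placeholder for the user's binary):
srun --mem=4000 ./mpi_app    # limit this step to 4000 MB per node
```

With either setting, concurrent steps inside one allocation can be scheduled side by side instead of serializing on the node's full memory.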
Comment 3 Felip Moll 2019-08-30 09:31:47 MDT
Hi, since we've not received any response, I am marking this bug as a duplicate of 7475.
If it is not the same issue, feel free to reopen it.

Thanks

*** This ticket has been marked as a duplicate of ticket 7475 ***