Ticket 11783

Summary:	Second mpirun in a job fails after Slurm upgrade
Product:	Slurm	Reporter:	hpc-cs-hd
Component:	User Commands	Assignee:	Marcin Stolarek <cinek>
Status:	RESOLVED DUPLICATE	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	cinek, sts
Version:	20.11.7
Hardware:	Linux
OS:	Linux
Site:	Cineca	Slinky Site:	---
Attachments:	Strace output; Jobscript for the case; Strace output updated; Slurmctld log for a job where the case is reproduced; Full slurmctld log

Description hpc-cs-hd 2021-06-08 06:45:05 MDT

We recently upgraded to Slurm 20.11.7 and have since encountered a
new problem with consecutive parallel executions in a single job. If
more than one node is requested and the job calls two mpirun instances
consecutively, the second one fails with the following error:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
After re-running with the debugging flag --mca
plm_base_verbose 100, we get additional information for the
second mpirun call:
srun: error: Unable to create step for job 3130006: Memory required by
task is not available
so it appears that the first mpirun (or, more precisely, the first srun called by the first mpirun) isn't able to free the allocated
memory fast enough to make room for the second one.

Our cluster has several compiler families installed. We tried
openmpi/4.0.3, spectrum_mpi/11.4.0, and hpc-sdk/2021. In all these
cases, the mpirun command shipped with the respective compiler family
produced this error. Please note that:
- if we use "srun" instead of "mpirun", the issue doesn't appear and
both parallel instances run smoothly with all the tested
compiler families;
- the problem can be reproduced with a simple "mpirun uname -a", so it's
not related to the compiler used for the code (since even a system command can trigger it);
- a simple workaround is to put a "sleep 5" command between the two
mpirun instances, so that the first parallel execution has time to
release the memory before the second one starts;
- this happens only when multiple nodes are requested. For single-node
jobs the two mpirun instances present no problems, as expected, since mpirun does not call srun on single-node runs.
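
The "sleep 5" workaround described in the list above can be sketched as a small Python driver. This is only an illustration under assumptions: the real jobscript is a shell script attached to the ticket and is not reproduced here, and the names run_consecutively, pause_s and run are hypothetical.

```python
import subprocess
import time


def run_consecutively(commands, pause_s=5.0, run=subprocess.call):
    """Run each command in order, pausing pause_s seconds between launches.

    `run` is any callable that takes an argv list and returns an exit code;
    it defaults to subprocess.call and can be replaced for dry runs.
    Returns the list of exit codes, one per command.
    """
    exit_codes = []
    for index, argv in enumerate(commands):
        if index > 0:
            # Same idea as the "sleep 5" workaround: wait before asking
            # for the next parallel launch.
            time.sleep(pause_s)
        exit_codes.append(run(argv))
    return exit_codes


# Example call on a cluster where mpirun is in PATH (not executed here):
# run_consecutively([["mpirun", "uname", "-a"], ["mpirun", "uname", "-a"]])
```

The pause length is a guess, just as in the original workaround; it does not guarantee that the first launch has been fully cleaned up.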

Comment 1 Marcin Stolarek 2021-06-15 05:22:12 MDT

Sorry for the delay in reply.

Could you please set SlurmctldDebug to at least verbose, enable DebugFlags=steps and share slurmctld log from the time when you reproduce the issue?

Could you please strace mpirun (or use other method) to check options used to execute srun?

cheers,
Marcin
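
For reference, the debug settings requested above might be applied at runtime roughly as follows. This is a sketch, assuming that changing the settings with scontrol is acceptable at the site and that the caller has the privileges scontrol needs for this; the helper name debug_commands is hypothetical.

```python
def debug_commands(level="verbose", flag="Steps", enable=True):
    """Return the scontrol command lines that adjust slurmctld logging.

    The persistent equivalent would be two slurm.conf entries:
        SlurmctldDebug=verbose
        DebugFlags=Steps
    """
    sign = "+" if enable else "-"
    return [
        # Set the slurmctld log level.
        ["scontrol", "setdebug", level],
        # Add (+) or remove (-) a debug flag.
        ["scontrol", "setdebugflags", f"{sign}{flag}"],
    ]


# Print the commands instead of running them, so they can be reviewed first.
for argv in debug_commands():
    print(" ".join(argv))
```

Once the log has been collected, calling debug_commands with the site's usual level and enable=False gives the commands to revert the change.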

Comment 2 Marcin Stolarek 2021-06-22 02:25:19 MDT

Cineca Team,

Could you please take a look at the case?

cheers,
Marcin

Comment 3 hpc-cs-hd 2021-06-22 02:53:49 MDT

Created attachment 20043 [details]
Strace output

Comment 4 hpc-cs-hd 2021-06-22 02:54:17 MDT

Created attachment 20044 [details]
Jobscript for the case

Comment 5 hpc-cs-hd 2021-06-22 02:56:52 MDT

Dear Marcin,
sorry for the wait; we have to coordinate with our system administrators to run the tests you indicated. Hopefully tomorrow we will be able to provide the required slurmctld logs.
Meanwhile, I attached the output of a job launched with strace on both consecutive mpirun calls, and the jobscript used to run this case.

Regards,
Alessandro

Comment 6 Marcin Stolarek 2021-06-22 03:37:21 MDT

I think you need to add a few options to `strace`:
a) `-f` to follow forks
b) `-s strsize` with strsize being big enough to see srun options. My guess is that something like -s250 will be enough.

cheers,
Marcin
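
The strace options suggested above, and a way to pull the srun command line out of the resulting trace, can be sketched as follows. Several things here are assumptions: the helper names strace_argv and srun_invocations are hypothetical, the -o option is added only to write the trace to a file, and the regular expression assumes strace's usual execve("path", ["arg", ...], env) line layout.

```python
import re

# Program path and argv array of an execve() call as printed by strace.
# The argv array is taken up to the "], " that precedes the environment.
EXECVE_RE = re.compile(
    r'execve\("(?P<path>[^"]+)", \[(?P<argv>.*?)\], (?:0x[0-9a-f]+|\[)'
)
# One double-quoted string, allowing backslash escapes inside it.
QUOTED_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')


def strace_argv(command, output_path, strsize=250):
    """Build an strace command line that follows forks and keeps long strings."""
    return ["strace", "-f", "-s", str(strsize), "-o", output_path, *command]


def srun_invocations(strace_lines):
    """Return the argv list of every execve() of a program named srun."""
    found = []
    for line in strace_lines:
        match = EXECVE_RE.search(line)
        if match is None:
            continue
        if match.group("path").rsplit("/", 1)[-1] != "srun":
            continue
        found.append(QUOTED_RE.findall(match.group("argv")))
    return found


# Example (paths are placeholders):
# strace_argv(["mpirun", "uname", "-a"], "mpirun.trace")
# with open("mpirun.trace") as trace:
#     print(srun_invocations(trace))
```

Arguments longer than the -s limit still appear truncated in the trace, so the extracted list is only as complete as the trace itself.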

Comment 7 hpc-cs-hd 2021-06-22 03:46:40 MDT

Created attachment 20046 [details]
Strace output updated

Attached is the strace output with the proposed flags.

Regards,
Alessandro

Comment 8 Marcin Stolarek 2021-06-22 04:09:31 MDT

Is there any `SLURM_*` environment variable set by default - by a mechanism like .bashrc?

cheers,
Marcin

Comment 9 hpc-cs-hd 2021-06-23 03:16:37 MDT

Created attachment 20072 [details]
Slurmctld log for a job where the case is reproduced

Dear Marcin,
attached is the slurmctld log for a job where the case is reproduced, with the debug settings that you requested.
As for your question, no SLURM_* variable is set manually in any way.

Regards,
Alessandro

Comment 10 Marcin Stolarek 2021-06-23 03:34:57 MDT

Is it possible to get the full log between:
>[2021-06-23T10:40:22.912]
>..
>[2021-06-23T10:40:38.149]

not just a grep on JobId?

cheers,
Marcin
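
Extracting every log line in a time window, rather than grepping on the JobId, might look like the sketch below. It assumes each entry starts with a bracketed timestamp in the format quoted above; the function name lines_between is hypothetical.

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def lines_between(log_lines, start, end):
    """Return the log lines whose leading [timestamp] lies in [start, end].

    Lines without a parsable leading timestamp (for example continuation
    lines) are dropped in this simple version.
    """
    first = datetime.strptime(start, STAMP_FORMAT)
    last = datetime.strptime(end, STAMP_FORMAT)
    selected = []
    for line in log_lines:
        if not line.startswith("["):
            continue
        stamp, bracket, _ = line[1:].partition("]")
        if not bracket:
            continue
        try:
            when = datetime.strptime(stamp, STAMP_FORMAT)
        except ValueError:
            continue
        if first <= when <= last:
            selected.append(line)
    return selected


# Example (the log path is a placeholder):
# with open("slurmctld.log") as log:
#     window = lines_between(log, "2021-06-23T10:40:22.912",
#                            "2021-06-23T10:40:38.149")
```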

Comment 11 hpc-cs-hd 2021-06-23 03:49:37 MDT

Created attachment 20073 [details]
Full slurmctld log

Sure, full log attached.

Alessandro

Comment 13 Marcin Stolarek 2021-06-23 06:14:04 MDT

Alessandro,

Looking at the way srun was called and at the logs you shared, it looks like the reported issue is a duplicate of Bug 11857, which is already fixed by 49b1b2681eb[1] and will be released in Slurm 20.11.8.

Let me know if you want to apply it locally to confirm. If I don't hear back, I'll just close the bug as a duplicate.

cheers,
Marcin 
[1]https://github.com/SchedMD/slurm/commit/49b1b2681eb04ee45b9a0e9663903896247f8a8d

Comment 14 hpc-cs-hd 2021-06-23 07:39:12 MDT

Thank you Marcin for the analysis.
It is indeed possible that this bug is a duplicate of the one you linked, although in that case the problem seems to manifest when two srun instances are called, while we see it only with mpirun. But maybe it is the same issue showing up differently because of a different work environment.

Anyway, if this is going to be fixed in 20.11.8 we think you can close the bug.

Thank you for the assistance and best regards,
Alessandro

Comment 15 Marcin Stolarek 2021-06-24 03:31:11 MDT

Alessandro,


I guess that the difference between the way you're calling srun and the way mpirun does is the --nodelist argument used when srun is called by mpirun.
The commit I referred to prevents step creation failure when --nodelist is used and some nodes in the allocation are busy. From the logs you shared we see that the step creation request reached slurmctld before the step completion did - that's why an additional sleep works around the issue. I'm confident it's fixed by the referenced commit.

The fix should be easy to apply, so you can do that locally before the 20.11.8 release.

cheers,
Marcin
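
The ordering problem described above can be illustrated with a toy model. This is not Slurm's implementation, only a sketch of the idea that a step request naming specific nodes fails while those nodes are still recorded as busy; the class ToyAllocation and its methods are hypothetical.

```python
class ToyAllocation:
    """Toy bookkeeping of which nodes of a job are held by a running step."""

    def __init__(self, nodes):
        self.busy = {node: False for node in nodes}

    def create_step(self, nodelist):
        """Try to start a step on exactly the listed nodes."""
        # Refuse the request when any requested node is still held.
        if any(self.busy[node] for node in nodelist):
            return False
        for node in nodelist:
            self.busy[node] = True
        return True

    def complete_step(self, nodelist):
        """Record that the step on the listed nodes has finished."""
        for node in nodelist:
            self.busy[node] = False


alloc = ToyAllocation(["node1", "node2"])
alloc.create_step(["node1", "node2"])      # first launch starts
# The second request arrives before the first completion is recorded:
too_early = alloc.create_step(["node1", "node2"])
alloc.complete_step(["node1", "node2"])    # completion is recorded late
retried = alloc.create_step(["node1", "node2"])
print(too_early, retried)
```

In this model a pause between the two launches only changes the order of events, which is why it hides the problem rather than fixing it.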

*** This ticket has been marked as a duplicate of ticket 11857 ***