We recently upgraded to Slurm 20.11.7 and since then we have encountered a new problem with consecutive parallel executions inside a single job. If more than one node is requested and the job calls two mpirun instances consecutively, the second one fails with the following error:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before communicating
back to mpirun. This could be caused by a number of factors, including an
inability to create a connection back to mpirun due to a lack of common
network interfaces and/or no route found between them. Please check network
connectivity (including firewalls and network routing requirements).
--------------------------------------------------------------------------

After a closer check made by adding the debugging flag --mca plm_base_verbose 100, we get additional information for the second mpirun call:

srun: error: Unable to create step for job 3130006: Memory required by task is not available

so it appears that the first mpirun (or, rather, the first srun called by the first mpirun) is not able to free the allocated memory fast enough to make room for the second one.

In our cluster we have several compiler families installed. We tried with openmpi/4.0.3, spectrum_mpi/11.4.0, and hpc-sdk/2021; in all these cases the mpirun command provided by the corresponding compiler family produced this error. Please note that:

- if we use "srun" instead of "mpirun", the issue does not appear and both parallel instances run smoothly with all the tested compiler families;
- the problem can be reproduced with a simple "mpirun uname -a", so it is not related to the compiler used for the code (even a system command triggers it);
- a simple workaround is to put a "sleep 5" command between the two mpirun instances, so that the first parallel execution has time to release the memory before the second one starts;
- this happens only when multiple nodes are requested: for single-node jobs the two mpirun instances present no problems, as expected, since mpirun does not call srun on single-node runs.
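For reference, a jobscript of the kind that triggers this for us looks roughly like the sketch below (for illustration only; module, partition and resource values are placeholders, not the exact ones from the attached script):

#!/bin/bash
#SBATCH --nodes=2                  # the issue only appears with more than one node
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:05:00
#SBATCH --partition=debug          # placeholder partition name

module load openmpi/4.0.3          # same behaviour with spectrum_mpi/11.4.0 and hpc-sdk/2021

mpirun uname -a                    # first parallel execution: works
# sleep 5                          # uncommenting this is enough to work around the issue
mpirun uname -a                    # second parallel execution: fails with the errors quoted above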
Sorry for the delayed reply. Could you please set SlurmctldDebug to at least verbose, enable DebugFlags=steps, and share the slurmctld log from the time when you reproduce the issue? Could you also strace mpirun (or use another method) to check the options used to execute srun? cheers, Marcin
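PS: assuming admin access, these settings can usually be raised on the running controller without a restart, e.g. (a sketch, to be adapted to your setup):

scontrol setdebug verbose       # raise the SlurmctldDebug level at runtime
scontrol setdebugflags +Steps   # enable the Steps debug flag
# ...reproduce the issue and collect slurmctld.log, then revert:
scontrol setdebugflags -Steps
scontrol setdebug info

The equivalent settings can also be placed in slurm.conf, followed by "scontrol reconfigure".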
Cineca Team,
Could you please take a look at the case?
cheers, Marcin
Created attachment 20043 [details] Strace output
Created attachment 20044 [details] Jobscript for the case
Dear Marcin, sorry for the wait, we had to coordinate with our system administrators to run the tests you indicated. Hopefully tomorrow we will be able to provide the requested slurmctld logs. Meanwhile, I have attached the output of a job run with strace on both consecutive mpirun calls, together with the jobscript used for this case. Regards, Alessandro
I think you need to add a few options to `strace`:
a) `-f` to follow forks
b) `-s strsize` with strsize being big enough to see the srun options. My guess is that something like -s250 will be enough.
cheers, Marcin
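PS: for example, something along these lines should capture the full srun command line (the output file name is just an example):

strace -f -s 250 -o mpirun_trace.out mpirun uname -a
# then look for the srun invocation in the trace:
grep -a 'execve' mpirun_trace.out | grep srun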
Created attachment 20046 [details]
Strace output updated

Attached is the output of strace with the proposed flags. Regards, Alessandro
Is there any `SLURM_*` environment variable set by default, e.g. by a mechanism like .bashrc? cheers, Marcin
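PS: a quick way to check (just an example), run on a login node outside of any allocation:

env | grep '^SLURM_'
grep -n 'SLURM_' ~/.bashrc ~/.bash_profile 2>/dev/null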
Created attachment 20072 [details]
Slurmctld log for a job where the case is reproduced

Dear Marcin, attached is the slurmctld log for a job where the case is reproduced, with the debug settings that you requested. As for your question, there is no SLURM_* variable that is set manually in any way. Regards, Alessandro
Is it possible to get the full log between:
>[2021-06-23T10:40:22.912]
>...
>[2021-06-23T10:40:38.149]
and not just a grep on the JobId?
cheers, Marcin
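PS: for example, something like this would extract that window from the log (the file name is just an example):

sed -n '/2021-06-23T10:40:22/,/2021-06-23T10:40:38/p' slurmctld.log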
Created attachment 20073 [details]
Full slurmctld log

Sure, full log attached. Alessandro
Alessandro,
Looking at the way srun was called and the logs you shared, it looks like the reported issue is a duplicate of Bug 11857, which is already fixed by commit 49b1b2681eb[1] and will be included in Slurm 20.11.8. Let me know if you want to apply it locally to confirm; if I don't hear back, I'll close this bug as a duplicate.
cheers, Marcin
[1] https://github.com/SchedMD/slurm/commit/49b1b2681eb04ee45b9a0e9663903896247f8a8d
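PS: if you decide to apply it locally before the 20.11.8 release, one possible way (a sketch, assuming you build Slurm from an unpacked source tree and the patch applies cleanly to 20.11.7) is to fetch the commit as a patch from GitHub and apply it:

cd slurm-20.11.7    # your Slurm source tree
curl -LO https://github.com/SchedMD/slurm/commit/49b1b2681eb04ee45b9a0e9663903896247f8a8d.patch
patch -p1 < 49b1b2681eb04ee45b9a0e9663903896247f8a8d.patch
# rebuild and reinstall, then restart the affected daemons (slurmctld at least)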
Thank you Marcin for the analysis. It is indeed possible that this bug is a duplicate of the one you linked, although in that case the problem seems to manifest when two srun instances are called, while we see it only with mpirun. But it may well be the same issue showing up differently because of a different working environment. Anyway, if this is going to be fixed in 20.11.8 we think you can close the bug. Thank you for the assistance and best regards, Alessandro
Alessandro,
I guess the difference between the way you call srun and the way mpirun does it is the --nodelist argument that mpirun passes when it invokes srun. The commit I referred to prevents a step-creation failure when --nodelist is used and some nodes in the allocation are still busy. From the logs you shared we can see that the step-creation request reached slurmctld before the previous step had completed, which is why the additional sleep works around the issue. I'm confident it is fixed by the referenced commit. The fix should be easy to apply, so you can do that locally before the 20.11.8 release.
cheers, Marcin
*** This ticket has been marked as a duplicate of ticket 11857 ***