Summary: | Second mpirun in a job fails after Slurm upgrade | ||
---|---|---|---|
Product: | Slurm | Reporter: | hpc-cs-hd |
Component: | User Commands | Assignee: | Marcin Stolarek <cinek> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | cinek, sts |
Version: | 20.11.7 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Cineca | Alineos Sites: | --- |
Attachments: |
Strace output
Jobscript for the case
Strace output updated
Slurmctld log for a job where the case is reproduced
Full slurmctld log |
Description
hpc-cs-hd
2021-06-08 06:45:05 MDT
Sorry for the delay in reply.
Could you please set SlurmctldDebug to at least verbose, enable DebugFlags=steps and share the slurmctld log from the time when you reproduce the issue? Could you please also strace mpirun (or use another method) to check the options used to execute srun?
cheers,
Marcin
Cineca Team,
Could you please take a look at the case?
cheers,
Marcin
Created attachment 20043
Strace output
Created attachment 20044
Jobscript for the case
Dear Marcin,
sorry for the wait, we have to coordinate with our system administrators to run the tests that you indicated. Hopefully tomorrow we will be able to provide the required slurmctld logs. Meanwhile, I attached the output of a job launched with strace on both consecutive calls of mpirun, and the jobscript used to run this case.
Regards,
Alessandro
I think you need to add a few options to `strace`:
a) `-f` to follow forks
b) `-s strsize` with strsize being big enough to see the srun options. My guess is that something like -s250 will be enough.
cheers,
Marcin
Created attachment 20046
Strace output updated
Attached is the output of strace with the proposed flags.
Regards,
Alessandro
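The strace options suggested above (`-f` to follow forks, `-s` with a large enough string size) can be sketched as a small helper. This is only an illustration, not what was actually run for this ticket: the output file name `strace.out`, the application name `./my_app` and both function names are assumptions.

```python
import shlex


def build_strace_cmd(command, strsize=250, output_file="strace.out"):
    """Return an argv list that traces `command` and all of its children."""
    return [
        "strace",
        "-f",                # follow forks, so an srun started by mpirun is traced too
        "-s", str(strsize),  # print up to `strsize` characters of each string argument
        "-o", output_file,   # write the trace to a file instead of stderr
        *command,
    ]


def srun_execve_lines(trace_lines):
    """Keep only the execve() calls that mention srun."""
    return [
        line.rstrip("\n")
        for line in trace_lines
        if "execve(" in line and "srun" in line
    ]


if __name__ == "__main__":
    # Hypothetical application name; substitute the real mpirun command line.
    print(shlex.join(build_strace_cmd(["mpirun", "./my_app"])))
```

Passing the lines of the resulting trace file to `srun_execve_lines` should leave only the lines that show how srun was invoked, including its options, provided the string size was large enough to avoid truncation.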
Is there any `SLURM_*` environment variable set by default, by a mechanism like .bashrc?
cheers,
Marcin
Created attachment 20072
Slurmctld log for a job where the case is reproduced
Dear Marcin,
attached is the slurmctld log for a job where the case is reproduced, with the debug settings that you requested.
As for your question, no SLURM_* variable is set manually in any way.
Regards,
Alessandro
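One way to double-check for `SLURM_*` variables set by default is to list them from a fresh login shell, outside any allocation, since inside a job Slurm sets many of these itself. A minimal sketch; the function name `slurm_variables` is an assumption made for this example.

```python
import os


def slurm_variables(environ=None):
    """Return the SLURM_* entries of an environment mapping, sorted by name."""
    env = os.environ if environ is None else environ
    return {name: env[name] for name in sorted(env) if name.startswith("SLURM_")}


if __name__ == "__main__":
    # Prints nothing when no SLURM_* variable is present in the environment.
    for name, value in slurm_variables().items():
        print(f"{name}={value}")
```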
Is it possible to get the full log between:
>[2021-06-23T10:40:22.912]
>..
>[2021-06-23T10:40:38.149]
not just a grep on JobId?
cheers,
Marcin
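Extracting everything between two timestamps, rather than grepping on the JobId, can be sketched as below. The sketch assumes that each log entry starts with a bracketed timestamp in the format quoted above; the function names are assumptions, and lines without a timestamp are treated as continuations of the preceding entry.

```python
from datetime import datetime


def parse_stamp(line):
    """Parse a leading '[YYYY-MM-DDTHH:MM:SS.mmm]' prefix, or return None."""
    if not line.startswith("["):
        return None
    end = line.find("]")
    if end == -1:
        return None
    try:
        return datetime.strptime(line[1:end], "%Y-%m-%dT%H:%M:%S.%f")
    except ValueError:
        return None


def log_window(lines, start, end):
    """Yield lines stamped within [start, end], plus their continuation lines."""
    keep = False
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None:
            # A new timestamped entry decides whether we are inside the window.
            keep = start <= stamp <= end
        if keep:
            yield line
```

For the window requested here, `start` would be `datetime(2021, 6, 23, 10, 40, 22, 912000)` and `end` would be `datetime(2021, 6, 23, 10, 40, 38, 149000)`, with `lines` being the open slurmctld log file.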
Created attachment 20073
Full slurmctld log
Sure, full log attached.
Alessandro
Alessandro,
Looking at the way srun was called and the logs you shared, it looks like the reported issue is a duplicate of Bug 11857, which is already fixed by 49b1b2681eb[1] and will be released in Slurm 20.11.8. Let me know if you want to apply it locally to confirm. In case of no reply I'll just close the bug as a duplicate.
cheers,
Marcin
[1] https://github.com/SchedMD/slurm/commit/49b1b2681eb04ee45b9a0e9663903896247f8a8d
Thank you Marcin for the analysis. It is indeed possible that this bug is a duplicate of the one you linked, although in that case it seems that the problem manifests when 2 srun instances are called, while we have the problem only with mpirun. But maybe it is the same issue that shows up differently because of a different work environment. Anyway, if this is going to be fixed in 20.11.8 we think you can close the bug.
Thank you for the assistance and best regards,
Alessandro
Alessandro,
I guess that the difference between the way you're calling srun and how mpirun does it is the --nodelist argument used when srun is called by mpirun. The commit I referred to prevents step creation failure when --nodelist is used and some nodes in the allocation are busy. From the logs you shared we see that the step creation request reached slurmctld before the completion of the previous step did; that's why an additional sleep works around the issue. I'm confident it's fixed by the referenced commit. The fix should be easy to apply, so you can do that locally before the 20.11.8 release.
cheers,
Marcin
*** This ticket has been marked as a duplicate of ticket 11857 *** |
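Until the fix is applied, the sleep workaround mentioned in the thread could look roughly like the wrapper below, which pauses between consecutive launches so that the previous step has time to complete. This is only a sketch: the 10 second default is an arbitrary assumption (the thread does not say how long a sleep is needed), and the function name and the application names in the usage note are hypothetical.

```python
import subprocess
import time


def run_with_pause(commands, pause_seconds=10.0,
                   runner=subprocess.call, sleeper=time.sleep):
    """Run each argv in order, sleeping before every command except the first."""
    exit_codes = []
    for index, argv in enumerate(commands):
        if index > 0:
            # Give the previous step time to be reported as complete.
            sleeper(pause_seconds)
        exit_codes.append(runner(argv))
    return exit_codes
```

A call such as `run_with_pause([["mpirun", "./first_app"], ["mpirun", "./second_app"]])` would run the two launches with a pause in between. The `runner` and `sleeper` parameters exist only so the helper can be exercised without actually launching anything.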