Summary: | slurmstepd: error: mpi/pmix_v5: _dmdx_req: owl1 [0]: pmixp_dmdx.c:319: Bad request from owl1: nspace "slurm.pmix.5245.0" has only 2 ranks, asked for -1
---|---
Product: | Slurm
Reporter: | Rodrigo <rodrigo.arias>
Component: | PMIx
Assignee: | Jacob Jenson <jacob>
Status: | OPEN
QA Contact: |
Severity: | 6 - No support contract
Priority: | ---
Version: | 24.05.x
Hardware: | Linux
OS: | Linux
Site: | -Other-
Description
Rodrigo
2024-03-15 05:14:12 MDT
Hmm, this doesn't make a lot of sense. From [1]:

```c
nsptr = pmixp_nspaces_local();
if (nsptr->ntasks <= rank) {
	char *nodename = pmixp_info_job_host(nodeid);
	PMIXP_ERROR("Bad request from %s: nspace \"%s\" has only %d ranks, asked for %d",
		    nodename, ns, nsptr->ntasks, rank);
	_respond_with_error(seq_num, nodeid, sender_ns, PMIX_ERR_BAD_PARAM);
	xfree(nodename);
	goto exit;
}
```

[1]: https://github.com/SchedMD/slurm/blob/924dce610761c30937e88ac334bdb3ca90beab91/src/plugins/mpi/pmix/pmixp_dmdx.c#L316-L325

If the message is 'nspace "slurm.pmix.5257.0" has only 2 ranks, asked for -1' and, assuming this is the right code, nsptr->ntasks = 2 and rank = -1, then "nsptr->ntasks <= rank" should be false.

Ah, I think I see what is happening. The rank is a signed integer (int) but the number of tasks is unsigned, so the rank is promoted to an unsigned integer and its value becomes 0xffffffff, which is very large (see the small demonstration appended after this comment). I cannot test this hypothesis with GDB, as I don't know how to stop execution right there, but I can test a simple patch.

The following patch seems to fix that particular bug:

```diff
--- a/src/plugins/mpi/pmix/pmixp_dmdx.c	2024-03-15 13:05:24.815313882 +0100
+++ b/src/plugins/mpi/pmix/pmixp_dmdx.c	2024-03-15 13:09:53.936900823 +0100
@@ -314,7 +314,7 @@ static void _dmdx_req(buf_t *buf, int no
 	}
 
 	nsptr = pmixp_nspaces_local();
-	if (nsptr->ntasks <= rank) {
+	if ((long) nsptr->ntasks <= (long) rank) {
 		char *nodename = pmixp_info_job_host(nodeid);
 		PMIXP_ERROR("Bad request from %s: nspace \"%s\" has only %d ranks, asked for %d",
 			    nodename, ns, nsptr->ntasks, rank);
```

But now the command simply gets stuck:

```
owl1% PMIX_DEBUG=100 srun -N2 -v osu_bw
srun: defined options
srun: -------------------- --------------------
srun: nodes : 2
srun: verbose : 1
srun: -------------------- --------------------
srun: end of defined options
srun: Nodes owl[1-2] are ready for job
srun: jobid 5275: nodes(2):`owl[1-2]', cpu counts: 2(x2)
srun: CpuBindType=(null type)
srun: launching StepId=5275.0 on host owl1, 1 tasks: 0
srun: launching StepId=5275.0 on host owl2, 1 tasks: 1
srun: topology/default: init: topology Default plugin loaded
srun: Node owl2, 1 tasks started
srun: Node owl1, 1 tasks started
[owl2:03536] psquash: flex128 init
[owl2:03536] psquash: native init
[owl2:03536] psquash: flex128 init
[owl2:03536] psec: munge init
[owl2:03536] psec: munge create_cred
[owl1:04806] psquash: flex128 init
[owl1:04806] psquash: native init
[owl1:04806] psquash: flex128 init
[owl1:04806] psec: munge init
[owl1:04806] psec: munge create_cred
```

The hang is caused by a bug on the MPICH side, fixed by the two patches referenced in https://github.com/pmodels/mpich/issues/6946.

Rodrigo.
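To make the suspected promotion concrete, here is a minimal, self-contained C sketch (an editorial illustration, not Slurm code; the variables merely stand in for nsptr->ntasks and the received rank) showing why the unsigned comparison treats -1 as a huge value, and how the (long) casts from the patch above change the outcome:

```c
/*
 * Illustrative sketch of the suspected bug: comparing a signed int against
 * an unsigned value triggers the usual arithmetic conversions, so the
 * signed operand is converted to unsigned and -1 becomes 0xffffffff.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t ntasks = 2;  /* stands in for nsptr->ntasks (unsigned) */
	int rank = -1;        /* stands in for the requested rank (signed) */

	/* Buggy comparison: rank is converted to unsigned, so 2 <= 4294967295 is true. */
	if (ntasks <= rank)
		printf("unsigned compare: ntasks <= rank is TRUE (rank seen as %u)\n",
		       (unsigned) rank);

	/* Comparison as in the proposed patch: both sides widened to signed long. */
	if ((long) ntasks <= (long) rank)
		printf("signed compare: ntasks <= rank is TRUE\n");
	else
		printf("signed compare: ntasks <= rank is FALSE (2 <= -1)\n");

	return 0;
}
```

As a side note, GCC and Clang warn about the first comparison when built with -Wsign-compare (enabled by -Wextra), which is another way to spot this class of bug.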