Summary: | slurmstepd: error: mpi/pmix_v5: _dmdx_req: owl1 [0]: pmixp_dmdx.c:319: Bad request from owl1: nspace "slurm.pmix.5245.0" has only 2 ranks, asked for -1
---|---
Product: | Slurm
Reporter: | Rodrigo <rodrigo.arias>
Component: | PMIx
Assignee: | Jacob Jenson <jacob>
Status: | OPEN
QA Contact: |
Severity: | 6 - No support contract
Priority: | ---
Version: | 24.05.x
Hardware: | Linux
OS: | Linux
Site: | -Other-
Description
Rodrigo
2024-03-15 05:14:12 MDT
Hmm, this doesn't make a lot of sense. From [1]:

```c
nsptr = pmixp_nspaces_local();
if (nsptr->ntasks <= rank) {
	char *nodename = pmixp_info_job_host(nodeid);
	PMIXP_ERROR("Bad request from %s: nspace \"%s\" has only %d ranks, asked for %d",
		    nodename, ns, nsptr->ntasks, rank);
	_respond_with_error(seq_num, nodeid, sender_ns, PMIX_ERR_BAD_PARAM);
	xfree(nodename);
	goto exit;
}
```

[1]: https://github.com/SchedMD/slurm/blob/924dce610761c30937e88ac334bdb3ca90beab91/src/plugins/mpi/pmix/pmixp_dmdx.c#L316-L325

If the message is 'nspace "slurm.pmix.5257.0" has only 2 ranks, asked for -1' and, assuming this is the right code, nsptr->ntasks = 2 and rank = -1, then "nsptr->ntasks <= rank" should be false.

Ah, I think I see what is happening. The rank is a signed integer (int) but the number of tasks is unsigned, so the rank is promoted to an unsigned integer and its value becomes 0xffffffff, which is very large (see the small demonstration appended after this comment). I cannot test this hypothesis with GDB, as I don't know how to stop execution right there, but I can test a simple patch.

The following patch seems to fix that particular bug:

```diff
--- a/src/plugins/mpi/pmix/pmixp_dmdx.c	2024-03-15 13:05:24.815313882 +0100
+++ b/src/plugins/mpi/pmix/pmixp_dmdx.c	2024-03-15 13:09:53.936900823 +0100
@@ -314,7 +314,7 @@ static void _dmdx_req(buf_t *buf, int no
 	}
 
 	nsptr = pmixp_nspaces_local();
-	if (nsptr->ntasks <= rank) {
+	if ((long) nsptr->ntasks <= (long) rank) {
 		char *nodename = pmixp_info_job_host(nodeid);
 		PMIXP_ERROR("Bad request from %s: nspace \"%s\" has only %d ranks, asked for %d",
 			    nodename, ns, nsptr->ntasks, rank);
```

But now the command simply gets stuck:

```
owl1% PMIX_DEBUG=100 srun -N2 -v osu_bw
srun: defined options
srun: -------------------- --------------------
srun: nodes : 2
srun: verbose : 1
srun: -------------------- --------------------
srun: end of defined options
srun: Nodes owl[1-2] are ready for job
srun: jobid 5275: nodes(2):`owl[1-2]', cpu counts: 2(x2)
srun: CpuBindType=(null type)
srun: launching StepId=5275.0 on host owl1, 1 tasks: 0
srun: launching StepId=5275.0 on host owl2, 1 tasks: 1
srun: topology/default: init: topology Default plugin loaded
srun: Node owl2, 1 tasks started
srun: Node owl1, 1 tasks started
[owl2:03536] psquash: flex128 init
[owl2:03536] psquash: native init
[owl2:03536] psquash: flex128 init
[owl2:03536] psec: munge init
[owl2:03536] psec: munge create_cred
[owl1:04806] psquash: flex128 init
[owl1:04806] psquash: native init
[owl1:04806] psquash: flex128 init
[owl1:04806] psec: munge init
[owl1:04806] psec: munge create_cred
```

The hang is caused by a bug on the MPICH side, fixed by the two patches referenced in https://github.com/pmodels/mpich/issues/6946.

Rodrigo.
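To make the suspected promotion concrete, here is a minimal, self-contained C sketch (an editorial illustration, not Slurm code; the variables merely stand in for nsptr->ntasks and the received rank) showing why the unsigned comparison treats -1 as a huge value, and how the (long) casts from the patch above change the outcome:

```c
/*
 * Illustrative sketch of the suspected bug: comparing a signed int against
 * an unsigned value triggers the usual arithmetic conversions, so the
 * signed operand is converted to unsigned and -1 becomes 0xffffffff.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t ntasks = 2;  /* stands in for nsptr->ntasks (unsigned) */
	int rank = -1;        /* stands in for the requested rank (signed) */

	/* Buggy comparison: rank is converted to unsigned, so 2 <= 4294967295 is true. */
	if (ntasks <= rank)
		printf("unsigned compare: ntasks <= rank is TRUE (rank seen as %u)\n",
		       (unsigned) rank);

	/* Comparison as in the proposed patch: both sides widened to signed long. */
	if ((long) ntasks <= (long) rank)
		printf("signed compare: ntasks <= rank is TRUE\n");
	else
		printf("signed compare: ntasks <= rank is FALSE (2 <= -1)\n");

	return 0;
}
```

As a side note, GCC and Clang warn about the first comparison when built with -Wsign-compare (enabled by -Wextra), which is another way to spot this class of bug.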