| Summary: | Documentation for mpi.conf incorrect PMIxTlsUCX options shown | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | mike coyne <mcoyne> |
| Component: | PMIx | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED FIXED | QA Contact: | Ben Roberts <ben> |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | bsantos, sts |
| Version: | 22.05.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | LANL | Slinky Site: | --- |
| Linux Distro: | RHEL | Machine Name: | kit test cluster |
| CLE Version: | | Version Fixed: | 22.05.4 |
| Target Release: | --- | DevPrio: | --- |
Description
mike coyne
2022-08-17 11:21:09 MDT
(In reply to mike coyne from comment #0)
> issue with setting for mpi.conf
>
> https://slurm.schedmd.com/mpi.conf.html
>
> PMIxTlsUCX={true|false}
> Use TLS for UCX communication. Defaults to not being set.
>
> Using either true or false will cause job submission failure on pmix
> jobs with ucx enabled
> It would appear that the true/false value that worked in 22.05.2 now
> should be set to a legal option for the ucx variable UCX_TLS=xxxx such as
> all which allows 22.05.3 to work .

Hi Mike,

Can you show me the output of 'scontrol show config'?

The PMIxTlsUCX documentation is wrong: it is not a true/false boolean but a string, which according to the UCX documentation can be one of these:

    all - use all the available transports.
    sm / shm - all shared memory transports.
    mm - shared memory transports - only memory mappers.
    ugni - ugni_rdma and ugni_udt.
    rc - rc and ud.
    rc_x - rc with accelerated verbs and ud.
    ud_x - ud with accelerated verbs.

In fact, when I have no PMIxTlsUCX set, my 'scontrol show config' shows the following, which is set to neither true nor false:

    MPI Plugins Configuration:
    PMIxCliTmpDirBase = (null)
    PMIxCollFence = (null)
    PMIxDebug = 0
    PMIxDirectConn = yes
    PMIxDirectConnEarly = no
    PMIxDirectConnUCX = no
    PMIxDirectSameArch = no
    PMIxEnv = (null)
    PMIxFenceBarrier = no
    PMIxNetDevicesUCX = (null)
    PMIxTimeout = 300
    PMIxTlsUCX = (null)

What is your actual value? Which combination of values makes your jobs fail? And what is the error seen?

I will correct our documentation about this parameter.

(In reply to Felip Moll from comment #1)
> I will correct our documentation about this parameter.

One additional question on mpi.conf: our Cray XCs will need some env vars pushed:

    PMIxEnv=UCX_MEM_MALLOC_HOOKS=no,UCX_MEM_MALLOC_RELOC=no,UCX_MEM_EVENTS=no,UCX_UNIFIED_MODE=1
    PMIxNetDevicesUCX=ipogif0

Is this the correct syntax?

Current configuration, toss3 (rhel7) x86_64, that works. Note that scontrol show config does not seem to be showing the settings; I tried this on both a compute node and the front end. As a note, this cluster (kit) has an OPA fabric and is built using UCX 1.12.1.

scontrol show config:
    MPI Plugins Configuration:
    PMIxCliTmpDirBase = (null)
    PMIxCollFence = (null)
    PMIxDebug = 0
    PMIxDirectConn = yes
    PMIxDirectConnEarly = no
    PMIxDirectConnUCX = no
    PMIxDirectSameArch = no
    PMIxEnv = (null)
    PMIxFenceBarrier = no
    PMIxNetDevicesUCX = (null)
    PMIxTimeout = 300
    PMIxTlsUCX = (null)

    -bash-4.2$ cat mpi.conf
    PMIxDebug=1
    PMIxDirectConn=true
    PMIxDirectConnEarly=true
    PMIxDirectConnUCX=true
    #PMIxEnv=
    PMIxNetDevicesUCX=hfi1_0:1
    #PMIxTimeout=10
    PMIxTlsUCX=all

With the setting as true:

    [mcoyne@kit005 lib64]$ cat /etc/slurm/mpi.conf
    PMIxDebug=1
    PMIxDirectConn=true
    PMIxDirectConnEarly=true
    PMIxDirectConnUCX=true
    #PMIxEnv=
    PMIxNetDevicesUCX=hfi1_0:1
    #PMIxTimeout=10
    PMIxTlsUCX=true
    [mcoyne@kit005 lib64]$ module load gcc openmpi
    [mcoyne@kit005 lib64]$ srun -N2 /users/mcoyne/Wip/supermagic/buildomp4/supermagic
    srun: launch/slurm: launch_p_step_launch: StepId=3102145.0 aborted before step completely launched.
    srun: error: task 1 launch failed: Unspecified error
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    srun: error: task 0 launch failed: Unspecified error

In the slurmd.log file:

    [2022-08-18T07:30:43.806] [3102145.0] cred/munge: init: Munge credential signature plugin loaded
    [2022-08-18T07:30:43.808] [3102145.0] spank/lua: Loaded 2 plugins in this context
    [2022-08-18T07:30:43.830] [3102145.0] error: mpi/pmix_v3: pmixp_dconn_ucx_prepare: kit005 [0]: pmixp_dconn_ucx.c:248: Fail to init UCX: No such device
    [2022-08-18T07:30:43.830] [3102145.0] error: mpi/pmix_v3: pmixp_dconn_init: kit005 [0]: pmixp_dconn.c:74: Cannot get polling fd
    [2022-08-18T07:30:43.830] [3102145.0] error: mpi/pmix_v3: pmixp_stepd_init: kit005 [0]: pmixp_server.c:402: pmixp_dconn_init() failed
    [2022-08-18T07:30:43.830] [3102145.0] error: mpi/pmix_v3: mpi_p_slurmstepd_prefork: (null) [0]: mpi_pmix.c:224: pmixp_stepd_init() failed
    [2022-08-18T07:30:43.833] [3102145.0] error: Failed mpi_g_slurmstepd_prefork
    [2022-08-18T07:30:43.833] [3102145.0] Sent signal 9 to StepId=3102145.0
    [2022-08-18T07:30:43.834] [3102145.0] Sent signal 9 to StepId=3102145.0

scontrol show config still does not show the change in configuration (on the compute node):

    MPI Plugins Configuration:
    PMIxCliTmpDirBase = (null)
    PMIxCollFence = (null)
    PMIxDebug = 0
    PMIxDirectConn = yes
    PMIxDirectConnEarly = no
    PMIxDirectConnUCX = no
    PMIxDirectSameArch = no
    PMIxEnv = (null)
    PMIxFenceBarrier = no
    PMIxNetDevicesUCX = (null)
    PMIxTimeout = 300
    PMIxTlsUCX = (null)

I should note that I do not have an mpi.conf or an oci.conf on the master in the /etc/slurm configuration directory. Is this needed so that scontrol show config works?

(In reply to mike coyne from comment #5)
> should note i do not have a mpi.conf or a oci.conf on the master in the
> /etc/slurm configuration directory Is this needed ? so scontrol show config
> works?
    MPI Plugins Configuration:
    PMIxCliTmpDirBase = (null)
    PMIxCollFence = (null)
    PMIxDebug = 1
    PMIxDirectConn = yes
    PMIxDirectConnEarly = yes
    PMIxDirectConnUCX = yes
    PMIxDirectSameArch = no
    PMIxEnv = (null)
    PMIxFenceBarrier = no
    PMIxNetDevicesUCX = hfi1_0:1
    PMIxTimeout = 300
    PMIxTlsUCX = all
    Slurmctld(primary) at kit-master is UP

I put the mpi.conf file in the slurmctld's Slurm conf directory and it does show up.

Mike:

> PMIxEnv=UCX_MEM_MALLOC_HOOKS=no,UCX_MEM_MALLOC_RELOC=no,UCX_MEM_EVENTS=no,UCX_UNIFIED_MODE=1

Your syntax is correct.

Thanks for the other information; it confirms what you already explained. Setting true or false is not correct for UCX_TLS (PMIxTlsUCX); the acceptable values are described in the UCX documentation. I uploaded a patch for review to change the documentation. Please set one of the correct values for your system:
https://openucx.readthedocs.io/en/master/faq.html

About 'scontrol show config': it only shows the config that is on the controller, so if you changed the nodes locally it won't make a difference, and you must check every mpi.conf manually.

Thanks for your comments Mike,

The docs are fixed in 22.05.4, commit a5e5b88ea, and will be available on the webpage after the next release.

Regards
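Since 'scontrol show config' only reflects the controller's copy, each node's mpi.conf has to be checked by hand. A minimal sketch of such a check follows. It is not part of Slurm: the helper names (parse_mpi_conf, check_tls, diff_conf) are hypothetical, keys are compared exactly as written, and the only values flagged are the true/false settings reported as failing in this ticket. It does not validate a value against the UCX transport list.

```python
from typing import Dict, List, Optional, Tuple

# Values this ticket reports as breaking PMIx/UCX jobs on 22.05.3.
# Assumption: nothing else is rejected here; UCX itself decides what is valid.
BOOLEAN_LIKE = {"true", "false"}


def parse_mpi_conf(text: str) -> Dict[str, str]:
    """Parse Key=Value lines, skipping blank lines and '#' comments."""
    conf: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so PMIxEnv=A=1,B=2 keeps its value whole.
        key, sep, value = line.partition("=")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def check_tls(conf: Dict[str, str]) -> List[str]:
    """Return warnings if PMIxTlsUCX is set to a boolean-looking value."""
    problems: List[str] = []
    value = conf.get("PMIxTlsUCX")
    if value is not None and value.lower() in BOOLEAN_LIKE:
        problems.append(
            f"PMIxTlsUCX={value}: expected a UCX_TLS transport string "
            "such as 'all', not a boolean"
        )
    return problems


def diff_conf(
    controller: Dict[str, str], node: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing key to (controller_value, node_value)."""
    keys = set(controller) | set(node)
    return {
        k: (controller.get(k), node.get(k))
        for k in sorted(keys)
        if controller.get(k) != node.get(k)
    }


if __name__ == "__main__":
    # Sample contents modelled on the mpi.conf files shown in this ticket.
    controller = parse_mpi_conf("PMIxDirectConnUCX=true\nPMIxTlsUCX=all\n")
    node = parse_mpi_conf("PMIxDirectConnUCX=true\nPMIxTlsUCX=true\n")
    for warning in check_tls(node):
        print(warning)
    for key, (ctl_value, node_value) in diff_conf(controller, node).items():
        print(f"{key}: controller={ctl_value} node={node_value}")
```

In practice the two strings would be read from the controller's and the node's own mpi.conf; how those files are collected (ssh, configuration management, shared filesystem) is left open here.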