Created attachment 23873 [details]
mpi/pmi2 invalid client request

Hello,

We are trying to submit a cfdace job using Slurm 20.11.8. The job shows as running in the Slurm queue, but it is not actually running on the system. Checking the slurmd logs, we found the following messages:

[2022-03-15T10:29:13.822] [847.1] error: mpi/pmi2: invalid client request
[2022-03-15T10:29:13.823] [847.1] error: mpi/pmi2: value not properly terminated in client request
[2022-03-15T10:29:13.823] [847.1] error: mpi/pmi2: request not begin with 'cmd='
[2022-03-15T10:29:13.823] [847.1] error: mpi/pmi2: full request is:
00000000000000000000000000000000000000
cmd=put kvsname=847.1 key=bc-5-1 value=00000$

Queue entry for the job:

847 e162968 R normal cfdace 20.5 /tmp/tmp.H2kFCzK6Lw Mar 15 10:29 24 N/A (null) 1 dcaldh003

The current slurmd daemon is running on the compute node:

[root@dcaldh003 ~]# ps auxfw | grep -i slurmd
root 13178 0.0 0.0 112728 2396 pts/0 S+ 10:40 0:00 \_ grep --color=auto -i slurmd
root 30755 0.0 0.0 232108 8160 ? Ss Mar11 0:00 /cm/shared/apps/slurm/20.11.8/sbin/slurmd -D

Attaching the complete log for the job. Could you please let us know how to interpret and resolve this?

Thank you
Shraddha Kiran
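For context, the "request not begin with 'cmd='" error means slurmd's pmi2 plugin received bytes (here a run of zero padding) ahead of the expected cmd= prefix of a PMI2 wire request. A minimal shell sketch of that prefix check, illustrative only and not Slurm's actual parser:

```shell
# Illustrative only: a PMI2 wire request is a line of key=value pairs
# and must begin with "cmd="; anything else is rejected as invalid.
check_pmi2_request() {
  case "$1" in
    cmd=*) echo valid ;;
    *)     echo invalid ;;
  esac
}

check_pmi2_request 'cmd=put kvsname=847.1 key=bc-5-1 value=00000'  # prints "valid"
check_pmi2_request '00000000000000000000000000000000000000'        # prints "invalid": padding before cmd=
```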
Hi Shraddha,

Assuming you have compiled and installed Slurm's pmi2 plugin, that you are running Intel MPI, and that you run "mpirun/srun <cfdace binary>" inside an sbatch job:

Can you try setting this in your batch script before executing the "mpirun/srun"?

export I_MPI_PMI_LIBRARY=/path/to/slurm/lib/libpmi2.so

Be aware not to use the system's libpmi2.so, but Slurm's one.
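As a sketch, the suggestion above placed in an sbatch script might look like this (the Slurm prefix and the binary name are placeholders for your site's paths):

```shell
#!/bin/bash
#SBATCH -n 42
#SBATCH -p normal
#SBATCH --job-name=cfdace

# Point Intel MPI at Slurm's PMI2 library, not the system one.
# Placeholder path: substitute your site's Slurm installation prefix.
export I_MPI_PMI_LIBRARY=/path/to/slurm/lib/libpmi2.so

srun --mpi=pmi2 <cfdace binary>
```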
(In reply to Felip Moll from comment #1)
> Hi Shraddha,
>
> Assuming you have compiled and installed Slurm's pmi2, that you are running
> Intel MPI and that you run "mpirun/srun <cfdace binary>" inside an sbatch
> job..
>
> Can you try setting this in your batch script before executing the
> "mpirun/srun"?:
>
> export I_MPI_PMI_LIBRARY=/path/to/slurm/lib/libpmi2.so
>
> Be aware not to use system's libpmi2.so but Slurm's one.

Hi Shraddha, can you please confirm that my proposed workaround fixes the issue? Thanks
Hello Felip,

We are troubleshooting at our end. Request you to kindly wait until EOD.

Thank You
Regards
Shraddha Kiran
HPC-Apps | GIS | Applied Materials India
E-mail: HPC_Unified_Support@amat.com
Hello Felip,

We verified again by running both our compiled version and the vanilla version from the Bright Computing cluster manager, and we run into the same error:

[2022-03-22T09:12:17.751] [857.1] error: mpi/pmi2: invalid client request
[2022-03-22T09:12:17.751] [857.1] error: mpi/pmi2: value not properly terminated in client request
[2022-03-22T09:12:17.751] [857.1] error: mpi/pmi2: request not begin with 'cmd='
[2022-03-22T09:12:17.751] [857.1] error: mpi/pmi2: full request is:
00000000000000000000000000000000000000
cmd=put kvsname=857.1 key=bc-7-1 value=00000$

Could you please suggest the next steps? Let us know if you want to take a look at the packages from Bright.

Thank You
Regards
Shraddha Kiran
(In reply to Shraddha Kiran from comment #5)
> Hello Felip,
>
> We verified again by running the compiled version of ours and also the
> vanilla version from bright computing cluster manager and we run into the
> same error of
>
> [2022-03-22T09:12:17.751] [857.1] error: mpi/pmi2: invalid client request
> [2022-03-22T09:12:17.751] [857.1] error: mpi/pmi2: value not properly
> terminated in client request
> [2022-03-22T09:12:17.751] [857.1] error: mpi/pmi2: request not begin with
> 'cmd='
> [2022-03-22T09:12:17.751] [857.1] error: mpi/pmi2: full request is:
> 00000000000000000000000000000000000000
> cmd=put kvsname=857.1 key=bc-7-1 value=00000$

Hi Shraddha, but did you try my suggestion, or did you just rerun? I commented that you needed to do this; please look at my comment 1.

> export I_MPI_PMI_LIBRARY=/path/to/slurm/lib/libpmi2.so

Can you show me the batch script you are using?
Created attachment 24018 [details]
20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_22p.job

Hello Felip,

Yes, we did try implementing your suggestion, which failed with the following error (different from the mpi/pmi2 invalid client request):

iPMI_Virtualization(): PMI calls are forwarded to /cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
Abort(567055) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703):
MPID_Init(762).......: PMI_Init returned -1
Attempting to use an MPI routine before initializing MPICH

Attaching the logs for more details.

This happened with both our compiled version and Bright's vanilla version of the Slurm packages. Hence we tried using the original Slurm package from Bright (to ensure we aren't missing anything), which also resulted in the mpi/pmi2 invalid client request.

Thank You
Regards
Shraddha Kiran
Thanks, Please send me the batch script you use for submitting the job.
Created attachment 24025 [details]
cfdaceslurm-dev

Hello Felip,

Please find attached the batch script used for job submission.

Thank You
Regards
Shraddha Kiran
Hello,

Could you provide any further update? I hope you have received the batch script that you requested.

Please let me know for any queries.

Regards
Shraddha
(In reply to Shraddha Kiran from comment #10)
> Hello,
>
> Could you provide any further update? Hope you have received the batch
> script that you had requested for
>
> Please let me know for any queries
>
> Regards
>
> Shraddha

Hi,

Yes, I received the script, but it is a wrapper, not the actual script. I need the output of $SCRIPT, which is generated in the code.

In any case, I see you have this line:

export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so

but it is not inside $SCRIPT; it is in the wrapper. Can you try setting it inside $SCRIPT (in the cat <<EOFEOF > $SCRIPT block) and try again?

If that one does not work, please try this one instead:

export I_MPI_PMI_VALUE_LENGTH_MAX=512

To help debug this, I suggest simplifying the use case: if you can omit the final "rmdir -f $SCRIPT" and keep the script, plus get the exact 'sbatch' line which is executed and attach it to the bug, it would help.

Also, send me the output of:

srun --mpi=list

and the slurm.conf of your site.

What kind of MPI is this running?: /hpc_lsf/application/ESI_Software/$DIR/UTILS/bin/CFD-SOLVER. Is it MPICH2? IntelMPI? OpenMPI? Do other MPI jobs (OpenMPI or IntelMPI) work, or is it only this CFD-SOLVER?

This is likely a bug in pmi1, so I need you to force this CFD-SOLVER to use pmi2 to confirm the case.

Please provide all the requested info.
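For reference, the requested information can be gathered with commands along these lines (the paths are the ones mentioned in this bug and may differ on your site):

```shell
# PMI plugins this Slurm installation offers to srun:
srun --mpi=list

# MPI-related settings in the site configuration:
grep -Ei 'mpidefault|mpiparams' /etc/slurm/slurm.conf

# Which MPI library the solver binary is linked against:
ldd /hpc_lsf/application/ESI_Software/$DIR/UTILS/bin/CFD-SOLVER | grep -i mpi
```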
Created attachment 24110 [details]
with-PMI-LIB.txt

Hello Felip,

As per your suggestions, below are the data points:

1. Added export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so under the $SCRIPT block and observed the following error message (logs attached):

MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
iPMI_Virtualization(): PMI calls are forwarded to /cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
Abort(567055) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703):
MPID_Init(762).......: PMI_Init returned -1
Attempting to use an MPI routine before initializing MPICH

2. Added export I_MPI_PMI_VALUE_LENGTH_MAX=512 instead and observed the following error message:

[0] MPI startup(): libfabric version: 1.9.0a1-impi
[0] MPI startup(): libfabric provider: mlx
[1648486227.896231] [dcaldh003:40733:0] select.c:445 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy
Abort(1091471) on node 41 (rank 41 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 887.2 ON dcaldh001 CANCELLED AT 2022-03-28T09:50:27 ***
srun: error: dcaldh003: task 41: Killed
srun: error: dcaldh001: tasks 0-40: Killed
DEBUG: executing cleanupAndDie 10
Trouble running CFD-ACE-SOLVER

We have omitted the "rmdir -rf $SCRIPT" and retained the script in both 1. and 2.

3. Exact sbatch command used while submitting:

sbatch -n 42 --use-min-nodes -N 2 --job-name=cfdace 20.5 -p normal /tmp/tmp.cvc2GujShO

4. -bash-4.2$ srun --mpi=list
srun: MPI types are...
srun: pmix
srun: pmix_v3
srun: none
srun: pmi2

5. Attached the slurm.conf.

6. CFDACE runs intel_mpi-19.6, which is bundled within the application: <path-to-CFD>/ACE+Suite/2020.5/Linux_x86_64_2.17/UTILS/mpirt/intel_mpi-19.6/lib

7. Yes, other MPI jobs run successfully on the cluster. The application Ansys Mechanical uses Intel MPI (<path-to ansys>/MPI/Intel/2018.3.222).

Please let me know if any other info is needed.

Thank You
Regards
Shraddha Kiran
Created attachment 24111 [details] with-PMI-VALUE-LENGTH.txt
Created attachment 24112 [details] without-any-parmeter.txt
Created attachment 24113 [details] slurm.conf
Hello Felip,

Could you please provide any update on this issue? Let me know if any other information needs to be shared.

Thank you,
Shraddha
(In reply to Shraddha Kiran from comment #16) > Hello Felip, > > Could you please provide any update on this issue? Let me know if any other > information needs to be shared > > Thank you, > > Shraddha Sorry I've been out these past 3 days. I need more time to investigate. Are you using UCX?
Hello Felip,

Yes, we are using UCX.

On head node:
[root@dcaldh000 ~]# rpm -qa | grep ucx
ucx-rdmacm-1.9.0-1.el7.x86_64
ucx-cma-1.9.0-1.el7.x86_64
ucx-1.9.0-1.el7.x86_64
ucx-ib-1.9.0-1.el7.x86_64
cm-ucx-1.6.1-100022_cm9.0_00647aba5a.x86_64
ucx-devel-1.9.0-1.el7.x86_64

On compute nodes:
----------------
cm-ucx-1.6.1-100022_cm9.0_00647aba5a.x86_64
ucx-1.9.0-1.el7.x86_64

Thank You
Regards
Shraddha Kiran
Shraddha,

I've been looking more into the provided information. I am still missing the real script from $SCRIPT. In your last post you said you retained it; could you upload the script here?

> sbatch command while submitting
> sbatch -n 42 --use-min-nodes -N 2 --job-name=cfdace 20.5 -p normal /tmp/tmp.cvc2GujShO

If this was the line, I'd need tmp.cvc2GujShO.

I have one other theory. UCX does not support the mlx provider (https://github.com/ofiwg/libfabric/pull/5281), so your issue may be related, as shown here:

[0] MPI startup(): libfabric version: 1.9.0a1-impi
[0] MPI startup(): libfabric provider: mlx
[1648486227.896231] [dcaldh003:40733:0] select.c:445 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy
Abort(1091471) on node 41 (rank 41 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed

Can you show which transports are available by running ucx_info? IMPI 2019 U6 seems to use the dc transport by default; you may try to tune UCX_TLS (e.g. UCX_TLS=ud,sm,self).

Also, is Slurm compiled with UCX support? (--with-ucx)

Finally, I'd also ask you to rerun the tests, keeping the temporary scripts and setting "export I_MPI_DEBUG=5" to get more verbosity in the logs. Upload everything here, please. And if you could set the debug level of slurmd to "debug3" and provide the slurmd logs, that would be great too.
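As a sketch, the UCX checks suggested above could look like this. The PMIx plugin filename is an assumption based on the `srun --mpi=list` output in this bug, and the UCX_TLS values are an example, not a prescription:

```shell
# List the transports UCX can actually use on a compute node:
ucx_info -d | grep -i transport | sort -u

# Restrict UCX to conservative transports for a test run (example values):
export UCX_TLS=ud,sm,self

# One way to see whether this Slurm build's PMIx plugin links against UCX
# (plugin path/name assumed from the site's install prefix and mpi list):
ldd /cm/shared/apps/slurm/20.11.8/lib64/slurm/mpi_pmix_v3.so | grep -i ucx
```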
Hello Felip,

I re-ran the tests with export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so and export I_MPI_PMI_VALUE_LENGTH_MAX=512.

Also ensured the environment is set as per your suggestion for slurmd debug and I_MPI_DEBUG; details below:

[e162968@dcaldh000 bin]$ grep -i debug /etc/slurm/slurm.conf
SlurmctldDebug=4
#DebugFlags = Gang,CPU_Bind
#DebugFlags=SelectType,Gres,Backfill,BackfillMap
SlurmdDebug=3
#SlurmdDebug=5
DebugFlags=Elasticsearch,Agent,Protocol

[e162968@dcaldh000 bin]$ echo $I_MPI_DEBUG
5

Jobs ran as below, along with the exact sbatch commands:

sbatch -n 42 --use-min-nodes -N 1-2 --job-name=cfdace 20.5 -p normal /tmp/tmp.kzJK1Rwkk2
Submitted batch job 932
[e162968@dcaldh000 uncommitted]$ sq
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
932 normal cfdace 2 e162968 R 0:01 1 dcaldh001

sbatch -n 42 --use-min-nodes -N 1-2 --job-name=cfdace 20.5 -p normal /tmp/tmp.JzyDl3pD0M
Submitted batch job 934
[e162968@dcaldh000 uncommitted]$ sq
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
934 normal cfdace 2 e162968 R 0:01 1 dcaldh001

I am also attaching the tmp.xxx files, the slurmd logs, the job files, and the ucx_info details.

I am working on confirming whether Slurm was compiled --with-ucx in our environment; will keep you posted.

Meanwhile, please let me know if you need any other information from my end.

Thank you
Shraddha Kiran
Created attachment 24249 [details] 932-job-log.txt
Created attachment 24250 [details] 934-slurmd-logs.txt
Created attachment 24251 [details] 932-slurmd-logs.txt
Created attachment 24252 [details] tmp.JzyDl3pD0M.txt
Created attachment 24253 [details] 934-job-log.txt
Created attachment 24254 [details] tmp.kzJK1Rwkk2.txt
> I am working on whether slurm is being compiled --with-ucx on our
> environment, will keep you posted
>
> Meanwhile, please let me know if you need any other information from my end

Yes. I am decoding all the steps the ESI Software is doing. At the end it calls this:

srun --mpi=pmi2 --multi-prog 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.Config.1011

Can you get this config file and upload it here, please? I am not sure how you can do this because the file is autogenerated by CFD-SOLVER at runtime, but please see if you can retain it:

20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.Config.1011

Also, taking a look at the slurmd logs, I see they are truncated and without enough information. Have you reconfigured Slurm after modifying the debug level? Note you set:

SlurmdDebug=3

while it should be:

SlurmdDebug=debug3

A single '3' is equivalent to log level "info"; the numeric equivalent of debug3 is '7'. I suggest always using the human-readable form. Same for the slurmctld debug level.

Please repeat the tests with the correct debug level.
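For context, the file passed to `srun --multi-prog` maps task ranks to commands, one rank range per line. A sketch of the format (the ranks and arguments below are hypothetical, not what CFD-SOLVER actually generates):

```shell
# srun --multi-prog configuration format: "<task ranks> <executable> [args]".
# %t expands to the task's rank, %o to its offset within the rank range.
# Ranks, binary name, and flags here are hypothetical examples.
0     ./CFD-SOLVER -master
1-41  ./CFD-SOLVER -worker -rank %t
```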
Created attachment 24311 [details]
tmp.OANZAHdmiK

Hello Felip,

Resubmitted the tests with the below details:

[e162968@dcaldh000 e162968]$ echo $I_MPI_DEBUG
5
[e162968@dcaldh000 e162968]$ grep -i debug /etc/slurm/slurm.conf
SlurmctldDebug=debug3
#DebugFlags = Gang,CPU_Bind
#DebugFlags=SelectType,Gres,Backfill,BackfillMap
SlurmdDebug=debug3
#SlurmdDebug=5
DebugFlags=Elasticsearch,Agent,Protocol

sbatch -n 42 --use-min-nodes -N 2 --job-name=cfdace 20.5 -p normal /tmp/tmp.LwTZssRAvv
Submitted batch job 945
[e162968@dcaldh000 bin]$ sq
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
945 normal cfdace 2 e162968 R 0:10 2 dcaldh[001-002]

sbatch -n 42 --use-min-nodes -N 2 --job-name=cfdace 20.5 -p normal /tmp/tmp.OANZAHdmiK
Submitted batch job 946
[e162968@dcaldh000 bin]$ sq
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
946 normal cfdace 2 e162968 R 0:06 2 dcaldh[001-002]

Attaching the corresponding files, along with 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.Config.1011. The file 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.Config.24589 belongs to the retested results (both .1011 and .24589 are the same, though).

Thank You
Regards
Shraddha Kiran
Created attachment 24312 [details] 946-slurmctld-logs-with-pmi-val.txt
Created attachment 24313 [details] 946-002-slurmd-logs-with-pmi-val.txt
Created attachment 24314 [details] 946-slurmd-logs-with-pmi-val.txt
Created attachment 24315 [details] tmp.LwTZssRAvv
Created attachment 24316 [details] 945-slurmctld-logs-with-pmi-lib.txt
Created attachment 24317 [details] 945-slurmd-logs-with-pmi-lib.txt
Created attachment 24318 [details] 945-002-slurmd-logs-with-pmi-lib.txt
Created attachment 24319 [details] 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.Config.1011
Created attachment 24320 [details] 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.Config.24589
While I look at these logs: is there anything in the ESI Software documentation that says which PMI implementations are supported, and whether it can work with UCX?

Note I am still waiting for this feedback from you:

> Can you show which transports are available running ucx_info? IMPI 2019 U6 seems
> to use dc transport by default, you may try to tune UCX_TLS (e.g. UCX_TLS=ud,sm,self).
> Also, is Slurm compiled with UCX support? (--with-ucx)

Thanks
Hello Felip,

Is it possible to discuss this over a quick meeting?

Thank You
Regards
Shraddha Kiran
Created attachment 24337 [details]
ucx_info.txt

Hello Felip,

I did send the ucx_info information last time; I am sending it again. I did not see the UCX_TLS option in the ucx_info output. Could you let me know if I need to do anything more to tune it (e.g. UCX_TLS=ud,sm,self)? Also, which option should I tune it to? All three?

Slurm 20.11.8 was compiled by one of my senior team members. I tried a few things to determine whether it was compiled with UCX (e.g. I ran ldd and objdump over the Slurm libraries under /cm/shared/apps/slurm/20.11.8/lib64/slurm/) but wasn't able to get much information. Could you let me know if there's a better way to check?

Thank You
Regards
Shraddha Kiran
HPC-Apps | GIS | Applied Materials India
E-mail: HPC_Unified_Support@amat.com
Hello Felip,

Below is the response from ESI CFD:

"We support whatever PMI implementations Intel MPI 19.9 supports. The same applies to UCX: if Intel MPI 19.9 works with it, we should work with it."

Thank You
Regards
Shraddha Kiran
HPC-Apps | GIS | Applied Materials India
E-mail: HPC_Unified_Support@amat.com
Hello Felip,

Below is the response from the ESI CFD team regarding this:

"Hello Shraddha, I see the following when I run with --slurmLauncher=srun in the CFD-SOLVER command, and the case ran to completion. So I guess at least PMI2 is supported:

srun --mpi=pmi2 --multi-prog ccp_par4.Config.224538

Note that when using srun, you need to request 2 more cores than the number of parallel processes to account for the dtfioserver and wmserver processes."

Awaiting your response on the other queries from my end:

1. Could you let me know if I need to do anything more to tune UCX (e.g. UCX_TLS=ud,sm,self)? Also, which option should I tune it to? All three?
2. Could you let me know if there's a better way to check whether Slurm was compiled with --with-ucx or not?

Thank You
Regards
Shraddha Kiran
HPC-Apps | GIS | Applied Materials India
E-mail: HPC_Unified_Support@amat.com
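The core-count rule from the ESI note can be sketched as simple arithmetic when building the sbatch line (NPROCS and NTASKS are illustrative names, not from the actual wrapper):

```shell
# Per the ESI note: srun-launched runs need two extra tasks beyond
# the solver ranks, for the dtfioserver and wmserver helper processes.
NPROCS=40                  # solver ranks (-num 40 in this thread)
NTASKS=$((NPROCS + 2))     # + dtfioserver + wmserver
echo "sbatch -n ${NTASKS} --use-min-nodes -N 1-2 --job-name=cfdace"
```

This matches the sbatch -n 42 invocation seen in the wrapper output earlier in the thread.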
Shraddha Kiran - I talked with Jess and Felip about this issue. Felip has been looking over the data you have sent in so far and is trying to understand the root cause. He will reply to you a little later today on what he has found so far.

I also talked with Felip about doing a remote session with your site. Felip will reply and organize this after he looks over a few leads he has.

My questions for you are:

1. Are you able to provide ssh access with a test user and a few nodes that we can use to look at this issue directly on your system?
2. If ssh access is not possible, would your site be willing to do a shared-screen session with Felip and me to look at the issue directly?
Hello Jason,

Sure, we can have a shared session with your team to look at the issue directly. Please let me know what time would be feasible.

Thank You
Regards
Shraddha Kiran
HPC-Apps | GIS | Applied Materials India
E-mail: HPC_Unified_Support@amat.com
> I see the following when I run with --slurmLauncher=srun in the CFD-SOLVER
> command and the case ran to completion. So, I guess at least PMI2 is
> supported.
> srun --mpi=pmi2 --multi-prog ccp_par4.Config.224538
> Note that when using srun, you need to request 2 more cores than the num of
> parallel processes to account for the dtfioserver and wmserver processes.

Okay. Let's try the overlap option of srun, which allows steps to share CPUs. Since you are running many sruns (steps) in parallel on the same node (because of --multi-prog), it is possible they are not all running.

First test: add this to the $SCRIPT before the call to CFD-SOLVER, and run the test again:

export SLURM_OVERLAP=1

Second test: run with a smaller number of tasks in your environment. Your wrapper runs this:

sbatch -n 42 --use-min-nodes -N 1-2 --job-name=cfdace 20.5 -p normal /tmp/tmp.kzJK1Rwkk2

Could we run the same but with, for example, 4 tasks (-n 4)? I want to see if there is any "overflow" in the PMI packets being sent. Send back the output of the jobs.

If neither of these two tests makes any difference, we'll schedule the screen sharing.

-----

> 1. Could you let me know if I need to do anything more in order to tune
> it to (e.g. UCX_TLS=ud,sm,self)? Also which option should I tune it to? All
> three?

Let's table this for now. I see the correct transports in the ucx_info -d output you sent, and moreover we're going to try with PMI2 only, which does not use UCX. I am not sure why the UCX error ever showed up. Just for your information, the UCX_TLS parameter must be set as an environment variable of slurmd when starting it up.

> 2. Could you let me know if there's a better way to check the information
> on whether slurm was compiled with --with-ucx or not

You were doing it correctly.
The idea is to check this library and see if *at least* these symbols show up:

]$ objdump -T mpi_pmix_v3.so | grep ucx
0000000000022f91 g    DF .text  00000000000000b9  Base  pmixp_dconn_ucx_stop
000000000002304a g    DF .text  00000000000005d9  Base  pmixp_dconn_ucx_finalize
00000000000324c0 g    DO .bss   0000000000000028  Base  _ucx_worker_lock
0000000000023623 g    DF .text  000000000000006b  Base  _ucx_process_msg
000000000002297e g    DF .text  0000000000000613  Base  pmixp_dconn_ucx_prepare
000000000000f0b6 g    DF .text  0000000000000019  Base  pmixp_info_srv_direct_conn_ucx

But as I said previously, let's focus on pmi2.
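The symbol check can be wrapped in a small convenience script (a sketch, not part of Slurm; the plugin path below is the install path used elsewhere in this thread, so adjust it to your site):

```shell
# Check whether Slurm's pmix plugin exports UCX symbols, which would
# indicate that Slurm was configured with --with-ucx.
PLUGIN=/cm/shared/apps/slurm/20.11.8/lib64/slurm/mpi_pmix_v3.so
if objdump -T "$PLUGIN" 2>/dev/null | grep -q ucx; then
    echo "UCX symbols present: built with --with-ucx"
else
    echo "no UCX symbols found (or plugin missing)"
fi
```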
Created attachment 24480 [details]
953-slurm-overlap-pmi-val-slurmd.txt

Hello Felip,

Attaching the slurmctld, slurmd, and job logs for test 1 and test 2 as suggested by you. I kept the $I_MPI_DEBUG value at 5.

1. Test one: export SLURM_OVERLAP=1 with export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so, and export SLURM_OVERLAP=1 with export I_MPI_PMI_VALUE_LENGTH_MAX=512

sbatch -n 42 --use-min-nodes -N 2 --job-name=cfdace 20.5 -p normal /tmp/tmp.4GAtRQUf3C
Submitted batch job 952

[e162968@dcaldh000 bin]$ sq
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
952 normal cfdace 2 e162968 R 0:06 2 dcaldh[001,003]

[e162968@dcaldh000 bin]$ cat /tmp/tmp.4GAtRQUf3C
#!/bin/sh
#SK: 04-04-2022 : adding as per schedmd
export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
#export I_MPI_PMI_VALUE_LENGTH_MAX=512
export SLURM_OVERLAP=1
if [ -n "" ]; then
    input=`sed -n p `
fi
if [ $SLURM_NNODES = 1 ]; then
    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp
#    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 -verbose 3 -job -keepTmpFiles -nodecomp
else
    HOSTFILE=`mktemp --tmpdir=./`
    srun hostname > $HOSTFILE
#    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    rm -f $HOSTFILE
fi

sbatch -n 42 --use-min-nodes -N 2 --job-name=cfdace 20.5 -p normal /tmp/tmp.42gHi19mGV
Submitted batch job 953

[e162968@dcaldh000 bin]$ sq
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
953 normal cfdace 2 e162968 R 0:03 2 dcaldh[001,003]

[e162968@dcaldh000 bin]$ cat /tmp/tmp.42gHi19mGV
#!/bin/sh
#SK: 04-04-2022 : adding as per schedmd
#export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
export I_MPI_PMI_VALUE_LENGTH_MAX=512
export SLURM_OVERLAP=1
if [ -n "" ]; then
    input=`sed -n p `
fi
if [ $SLURM_NNODES = 1 ]; then
    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp
#    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 -verbose 3 -job -keepTmpFiles -nodecomp
else
    HOSTFILE=`mktemp --tmpdir=./`
    srun hostname > $HOSTFILE
#    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    rm -f $HOSTFILE
fi

2. Test two: fewer ntasks (n = 6)

sbatch -n 6 --use-min-nodes -N 1 --job-name=cfdace 20.5 -p normal /tmp/tmp.Oi95I4dxmC
Submitted batch job 957

[e162968@dcaldh000 bin]$ sq
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
957 normal cfdace 2 e162968 R 0:02 1 dcaldh001

[e162968@dcaldh000 bin]$ cat /tmp/tmp.Oi95I4dxmC
#!/bin/sh
#SK: 04-04-2022 : adding as per schedmd
export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
#export I_MPI_PMI_VALUE_LENGTH_MAX=512
#export SLURM_OVERLAP=1
if [ -n "" ]; then
    input=`sed -n p `
fi
if [ $SLURM_NNODES = 1 ]; then
    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp
#    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 -verbose 3 -job -keepTmpFiles -nodecomp
else
    HOSTFILE=`mktemp --tmpdir=./`
    srun hostname > $HOSTFILE
#    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    rm -f $HOSTFILE
fi

sbatch -n 6 --use-min-nodes -N 1 --job-name=cfdace 20.5 -p normal /tmp/tmp.cz2sedCg6R
Submitted batch job 958

[e162968@dcaldh000 bin]$ sq
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
958 normal cfdace 2 e162968 R 0:01 1 dcaldh001

[e162968@dcaldh000 bin]$ cat /tmp/tmp.cz2sedCg6R
#!/bin/sh
#SK: 04-04-2022 : adding as per schedmd
#export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
export I_MPI_PMI_VALUE_LENGTH_MAX=512
#export SLURM_OVERLAP=1
if [ -n "" ]; then
    input=`sed -n p `
fi
if [ $SLURM_NNODES = 1 ]; then
    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp
#    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 -verbose 3 -job -keepTmpFiles -nodecomp
else
    HOSTFILE=`mktemp --tmpdir=./`
    srun hostname > $HOSTFILE
#    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    rm -f $HOSTFILE
fi

Kindly validate and let me know if you need additional details.

Thank You
Regards
Shraddha Kiran
HPC-Apps | GIS | Applied Materials India
E-mail: HPC_Unified_Support@amat.com
Created attachment 24481 [details] 953-003-slurm-overlap-pmi-val-slurmd.txt
Created attachment 24482 [details] 953-slurm-overlap-pmi-val-slurmctld.txt
Created attachment 24483 [details] 952-slurm-overlap-pmi-lib-slurmd.txt
Created attachment 24484 [details] 952-003-slurm-overlap-pmi-lib-slurmd.txt
Created attachment 24485 [details] 952-slurm-overlap-pmi-lib-slurmctld.txt
Created attachment 24486 [details] 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.job
Created attachment 24487 [details] 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.job
Created attachment 24488 [details] RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.job
Created attachment 24489 [details] 957-pmi-lib-slurmd-ntasks4.txt
Created attachment 24490 [details] 957-pmi-lib-slurmctld-ntasks4.txt
Created attachment 24491 [details] 958-pmi-val-slurmctld-ntasks4.txt
Created attachment 24492 [details] 958-pmi-val-slurmd-ntasks4.txt
Created attachment 24493 [details] RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.job
Hello Felip,

Please let me know if you were able to gather any information from the logs provided. Can we have a screen-sharing session anytime soon?

Thank You
Regards
Shraddha Kiran
HPC-Apps | GIS | Applied Materials India
E-mail: HPC_Unified_Support@amat.com
20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE rm -f $HOSTFILE fi sbatch -n 42 --use-min-nodes -N 2 --job-name=cfdace 20.5 -p normal /tmp/tmp.42gHi19mGV Submitted batch job 953 [e162968@dcaldh000 bin]$ sq JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 953 normal cfdace 2 e162968 R 0:03 2 dcaldh[001,003] [e162968@dcaldh000 bin]$ cat /tmp/tmp.42gHi19mGV #!/bin/sh #SK: 04-04-2022 : adding as per schedmd #export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so export I_MPI_PMI_VALUE_LENGTH_MAX=512 export SLURM_OVERLAP=1 if [ -n ""]; then input=`sed -n p ` fi if [ $SLURM_NNODES = 1 ]; then /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp # /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 -verbose 3 -job -keepTmpFiles -nodecomp else HOSTFILE=`mktemp --tmpdir=./` srun hostname > $HOSTFILE #/hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -num 40 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE rm -f $HOSTFILE fi 2. 
Test two: Few ntasks ( n = 6) sbatch -n 6 --use-min-nodes -N 1 --job-name=cfdace 20.5 -p normal /tmp/tmp.Oi95I4dxmC Submitted batch job 957 [e162968@dcaldh000 bin]$ sq JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 957 normal cfdace 2 e162968 R 0:02 1 dcaldh001 [e162968@dcaldh000 bin]$ cat /tmp/tmp.Oi95I4dxmC #!/bin/sh #SK: 04-04-2022 : adding as per schedmd export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so #export I_MPI_PMI_VALUE_LENGTH_MAX=512 #export SLURM_OVERLAP=1 if [ -n ""]; then input=`sed -n p ` fi if [ $SLURM_NNODES = 1 ]; then /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp # /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 -verbose 3 -job -keepTmpFiles -nodecomp else HOSTFILE=`mktemp --tmpdir=./` srun hostname > $HOSTFILE #/hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE rm -f $HOSTFILE fi sbatch -n 6 --use-min-nodes -N 1 --job-name=cfdace 20.5 -p normal /tmp/tmp.cz2sedCg6R Submitted batch job 958 [e162968@dcaldh000 bin]$ sq JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 958 normal cfdace 2 e162968 R 0:01 1 dcaldh001 [e162968@dcaldh000 bin]$ [e162968@dcaldh000 bin]$ cat /tmp/tmp.cz2sedCg6R #!/bin/sh #SK: 04-04-2022 : adding as per schedmd #export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so export I_MPI_PMI_VALUE_LENGTH_MAX=512 #export SLURM_OVERLAP=1 if [ -n ""]; then 
    input=`sed -n p `
fi
if [ $SLURM_NNODES = 1 ]; then
    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp
    # /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 -verbose 3 -job -keepTmpFiles -nodecomp
else
    HOSTFILE=`mktemp --tmpdir=./`
    srun hostname > $HOSTFILE
    #/hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    /hpc_lsf/application/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17//UTILS/bin/CFD-SOLVER -dtf RPOex_CH_Wi_SVI_0p9T_3p5SLM_4p.DTF -num 4 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    rm -f $HOSTFILE
fi

Kindly validate and let me know if you need additional details.

Thank You

Regards
Shraddha Kiran
HPC-Apps | GIS | Applied Materials India
E-mail : HPC_Unified_Support@amat.com

From: bugs@schedmd.com
Sent: 13 April 2022 00:33
To: Shraddha Kiran <Shraddha_Kiran@amat.com>
Subject: [EXTERNAL] [Bug 13622] mpi/pmi2 invalid client request
Comment # 49 on bug 13622 from Felip Moll

> I see the following when I run with --slurmLauncher=srun in the CFD-SOLVER
> command and the case ran to completion. So, I guess at least PMI2 is
> supported.
> srun --mpi=pmi2 --multi-prog ccp_par4.Config.224538
> Note that when using srun, you need to request 2 more cores than the num of
> parallel processes to account for the dtfioserver and wmserver processes.

Oookay. Let's try the overlap option of srun. This allows steps to share cpus. Since you are running many sruns (steps) in parallel on the same node (because of --multi-prog), it is possible they are not all running.

First test: add this to the $SCRIPT before the call to CFD-SOLVER:

export SLURM_OVERLAP=1

.. and run the test again.

Second test: run in your environment with a fewer number of tasks. Your wrapper runs this:

sbatch -n 42 --use-min-nodes -N 1-2 --job-name=cfdace 20.5 -p normal /tmp/tmp.kzJK1Rwkk2

Could we switch to run the same but with, for example, 4 (-n 4) tasks? I want to see if there's any "overflow" in pmi packets being sent. Send back the output of the jobs.

If none of these 2 tests makes any difference, we'll schedule the screen sharing.

-----

> 1. Could you let me know if I need to do anything more in order to tune
> it to (e.g. UCX_TLS=ud,sm,self)? Also which option should I tune it to? All
> three?

Let's table this for now. I see the correct transports in the ucx_info -d output you sent, and moreover we're going to try with PMI2 only, which does not use UCX.
I am not sure why the UCX error ever showed up. Just for your info, the UCX_TLS parameter must be set as an environment variable of slurmd when starting it up.

> 2. Could you let me know if there's a better way to check the information
> on whether slurm was compiled with -ucx or no

You were doing it correctly. The idea is to check this library and see if *at least* these symbols do show up:

]$ objdump -T mpi_pmix_v3.so|grep ucx
0000000000022f91 g DF .text 00000000000000b9 Base pmixp_dconn_ucx_stop
000000000002304a g DF .text 00000000000005d9 Base pmixp_dconn_ucx_finalize
00000000000324c0 g DO .bss 0000000000000028 Base _ucx_worker_lock
0000000000023623 g DF .text 000000000000006b Base _ucx_process_msg
000000000002297e g DF .text 0000000000000613 Base pmixp_dconn_ucx_prepare
000000000000f0b6 g DF .text 0000000000000019 Base pmixp_info_srv_direct_conn_ucx

But as I said previously, let's focus on pmi2.

________________________________
You are receiving this mail because:
* You reported the bug.
The content of this message is APPLIED MATERIALS CONFIDENTIAL. If you are not the intended recipient, please notify me, delete this email and do not use or distribute this email.
(In reply to Shraddha Kiran from comment #64) > Hello Felip, > > Please let me know if you were able to gather any information from the logs > provided. Can we have a screen-sharing session anytime soon? > > Thank You > > Regards > > Shraddha Kiran Hi Shraddha, I am not seeing anything conclusive. What is your timezone to do a screensharing? Mine is UTC+2. We need to sync also with Jason, which is UTC-6. I am not available today anymore. I will talk with Jason too for available days and let him coordinate.
Felip and I have time to meet tomorrow, Wednesday the 20th, at 7:30 AM PST / 8:30 AM MST. We will have to limit the time to just 30 minutes. What we would like to do is see the issue directly and verify your node configuration. If this time works for you I will send out the invite.
Sure Jason, please send me the invite.

Thank You

Regards
Shraddha Kiran
HPC-Apps | GIS | Applied Materials India
E-mail : HPC_Unified_Support@amat.com

From: bugs@schedmd.com
Sent: 19 April 2022 22:47
To: Shraddha Kiran <Shraddha_Kiran@amat.com>
Subject: [EXTERNAL] [Bug 13622] mpi/pmi2 invalid client request
(In reply to Shraddha Kiran from comment #67) > Sure Jason > Please send me the invite > Thank You > > Regards > > Shraddha Kiran Invite has been sent.
TEST #0 for screensharing

multiprog.txt
------------------
0 /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/wmServer-MPICH-MPI
1 /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/dtfIoServer-MPICH-MPI
2-41 /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/CFD-ACE-SOLVER-MPICH-MPI -useDtfServer -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -sim 1 -parallel

run.sh
----------------
#!/bin/sh
srun -vvv --mpi=pmi2 --multi-prog multiprog.txt

command to run:
-------------------
sbatch -n 42 -p normal run.sh
------------------------
NOTES:
export I_MPI_PMI_VALUE_LENGTH_MAX=512
export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
export I_MPI_DEBUG=5
TEST #1 for screensharing

multiprog.txt
------------------
0 hostname
1-41 .../intelmpi/compilers_and_libraries_2018.2.199/linux/mpi/intel64/bin/mpirun mpi_hello

run.sh
----------------
#!/bin/sh
export SLURM_OVERLAP=1
srun -vvv --mpi=pmi2 --multi-prog multiprog.txt

command to run:
-------------------
sbatch -n 42 -p normal run.sh

mpi_hello.c
------------
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);
    const char* s = getenv("SLURMD_NODENAME");
    printf("Hello world from processor %s, rank %d out of %d processors, %s\n",
           processor_name, world_rank, world_size, s);
    MPI_Finalize();
}

mpi_hello compile:
--------------------
mpicc -o mpi_hello mpi_hello.c
------------------------
NOTES:
export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
export I_MPI_DEBUG=5
export SLURM_OVERLAP=1
Hello Felip,

Could you please share a Teams invite if possible?

Thank You

Regards
Shraddha Kiran
HPC-Apps | GIS | Applied Materials India
E-mail : HPC_Unified_Support@amat.com

From: bugs@schedmd.com
Sent: 20 April 2022 20:01
To: Shraddha Kiran <Shraddha_Kiran@amat.com>
Subject: [EXTERNAL] [Bug 13622] mpi/pmi2 invalid client request
(In reply to Shraddha Kiran from comment #71) > Hello Felip, > > Could you please share a Teams invite if possible? > > Thank You > > Regards Hi Shraddha, sorry, we don't have Teams. We saw you joining. Can you try again? meet.google.com/ykd-xcwk-yhi Thanks
Shraddha, just FYI, I am talking with Abraham from ESI and I will probably get a demo license. Thanks!
Shraddha, Is it possible to get one small example model to do some testing like: 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF ??
Hello Felip,

We can share something similar with you, is that okay?

Thank You

Regards
Shraddha Kiran
HPC-Apps | GIS | Applied Materials India
E-mail : HPC_Unified_Support@amat.com

From: bugs@schedmd.com
Sent: 20 April 2022 23:35
To: Shraddha Kiran <Shraddha_Kiran@amat.com>
Subject: [EXTERNAL] [Bug 13622] mpi/pmi2 invalid client request
(In reply to Shraddha Kiran from comment #76) > Hello Felip, > > We can share something similar with you is that okay? > > Thank You > > Of course. Any model that fails in your setup will be good for my testing. Let me know about how test 0 and 1 ended :)
Created attachment 24638 [details] sample case
Shraddha,

I tested your Pipe_run.DTF and it seems to work for me, except for resolution errors:

Number of processes : 22
Number of zones : 1
ACE Solver requires that number of processes should be equal to number of zones.
Please check if you have decomposed it and using right simulation number.

Is it because the sample you sent is prepared to run on one single processor? Can you send me one that can be split across 22 processors, or tell me how to do it?

Thanks
Created attachment 24646 [details] decomposed case
(In reply to Shraddha Kiran from comment #80) > Created attachment 24646 [details] > decomposed case Ok, with this one I am getting 4 workers. Can I get one example that may run with 24 or 42 workers?
Added to my last request, it is important that you do test #0 and test #1. Let me know when you're done, and the results. Thanks
Sure Felip,

I will let you know the results of both tests by Thursday, 04/28.

Thank You

Regards
Shraddha Kiran
HPC-Apps | GIS | Applied Materials India
E-mail : HPC_Unified_Support@amat.com

From: bugs@schedmd.com
Sent: 26 April 2022 03:56
To: Shraddha Kiran <Shraddha_Kiran@amat.com>
Subject: [EXTERNAL] [Bug 13622] mpi/pmi2 invalid client request
Created attachment 24661 [details] decomposed-for24-ntasks
Shraddha,

Just to inform you that the software runs without issues on my machine. I think you may have an installation problem with the PMI libraries, with the fabric setup, or similar.

To summarize what I investigated:

A) If you run the software without any modification you get the following error:

[2022-03-15T10:29:13.823] [847.1] error: mpi/pmi2: full request is: 00000000000000000000000000000000000000 cmd=put kvsname=847.1 key=bc-5-1 value=00000$

In that case, CFD uses its internal PMI client-side implementation, which talks to our internal pmi2 server plugin (because CFD calls Slurm with srun --mpi=pmi2). Slurm's internal pmi2 may have a bug which manifests as the error you see when a pmi-1 message is sent. That one can be worked around by setting:

export I_MPI_PMI_VALUE_LENGTH_MAX=512

This doesn't happen on my machine, maybe because I don't have so many processors.

B) But as we have seen... if you set I_MPI_PMI_VALUE_LENGTH_MAX, you then see this error:

[0] MPI startup(): libfabric provider: mlx
[1648486227.896231] [dcaldh003:40733:0] select.c:445 UCX ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy

I am guessing then that Slurm works correctly, but since CFD uses Intel's internal PMI-1 and internal UCX, it shows this error. It is probably due to a CFD configuration issue. I don't see it on my machine since I don't have any fabric. I think you can try to modify the UCX setup that Intel's PMI-1 will read by setting both of:

export I_MPI_PMI_VALUE_LENGTH_MAX=512
export UCX_TLS=ud,sm,self

And see if this fixes the issues.

C) There's another possibility: we could just force Intel to use Slurm's PMI-1 instead of its own, which would avoid the use of UCX. Slurm only uses UCX with PMIx. Note we would still need to avoid the Slurm server-side PMI2 bug.
So, to try this, just set:

export I_MPI_PMI_LIBRARY=path_to_slurm/libpmi.so
export I_MPI_PMI_VALUE_LENGTH_MAX=512

D) The best option we have is to make CFD work with Slurm's client-side PMI-2. This would avoid scalability issues compared to PMI-1. This can be done by specifying the following:

export I_MPI_PMI_LIBRARY=path_to_slurm/libpmi2.so

but then.. you saw the generic error:

"Abort(567055) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703):
MPID_Init(762).......: PMI_Init returned -1"

I am suspecting here a bad configuration of the fabrics. You can try to set:

export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx
export I_MPI_PMI_LIBRARY=path_to_slurm/libpmi2.so

E) If this does not work, we will need to get more information by setting:

export I_MPI_PMI_LIBRARY=path_to_slurm/libpmi2.so
export I_MPI_DEBUG=5

F) Also, please read and make sure your environment is properly set in accordance with CFDACE:

https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/running-applications/fabrics-control/ofi-providers-support.html

----

I need you to go deeply through these points and let me know what you see. I cannot do that in my environment because of our very different configurations. Please let me know if you have any doubt while reading this post. Note I am still interested in tests #0 and #1.

Thanks
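For reference, the settings from option D could be combined into the batch wrapper roughly like this. This is only a sketch: the library path is the site-specific one quoted earlier in this thread, and the srun line assumes the multiprog setup from the screensharing tests, so treat it as an illustration rather than the verified fix.

```shell
#!/bin/sh
# Sketch of option D (paths are site-specific assumptions from this thread).
# Point Intel MPI at Slurm's client-side PMI-2 library instead of its
# internal PMI-1, and raise the PMI value length to avoid the pmi2
# "value not properly terminated" error seen earlier.
export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
export I_MPI_PMI_VALUE_LENGTH_MAX=512
export I_MPI_DEBUG=5          # verbose MPI startup output for diagnosis

srun --mpi=pmi2 --multi-prog multiprog.txt
```

This fragment only runs inside an sbatch allocation on a cluster with the paths above, so it is shown as a job-script template rather than something runnable standalone.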
Hello Felip,

I tried performing test 0 and test 1 on my machine as below:

Test 0:

[e162968@dcaldh000 cfd-test-slurmu-upgrade-20.11.8]$ echo $I_MPI_PMI_VALUE_LENGTH_MAX
512
[e162968@dcaldh000 cfd-test-slurmu-upgrade-20.11.8]$ echo $I_MPI_PMI_LIBRARY
/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
[e162968@dcaldh000 cfd-test-slurmu-upgrade-20.11.8]$ echo $I_MPI_DEBUG
5
[e162968@dcaldh000 test-0]$ sbatch -n 42 -p normal run.sh
Submitted batch job 986
[e162968@dcaldh000 test-0]$ sq
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
986 normal run.sh e162968 R 0:01 1 dcaldh001

Note:
/sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/CFD-ACE-SOLVER-MPICH-MPI: error while loading shared libraries: libfabric.so.1: cannot open shared object file: No such file or directory

env|grep libfabric.so.1
LIBRARY_PATH=/usr/lib64/libfabric.so.1:/cm/shared/apps/slurm/19.05.7/lib64/slurm:/cm/shared/apps/slurm/19.05.7/lib64:/cm/shared/apps/slurm/20.11.8/lib64/slurm:/cm/shared/apps/slurm/20.11.8/lib64

Test 1:

[e162968@dcaldh000 ~]$ echo $I_MPI_PMI_VALUE_LENGTH_MAX

[e162968@dcaldh000 ~]$ echo $I_MPI_PMI_LIBRARY
/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
[e162968@dcaldh000 ~]$ echo $I_MPI_DEBUG
5

Note:
SLURM_OVERLAP=1 was added in run.sh as suggested.
The Intel lib used was 2018.3.222 instead of 2018.2.199.

[e162968@dcaldh000 test-1]$ sbatch -n 42 -p normal run.sh
Submitted batch job 976
[e162968@dcaldh000 test-1]$ sq
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
976 normal run.sh e162968 R 0:01 1 dcaldh001

I will be attaching the files accordingly, please validate.

Thank you
Shraddha
Created attachment 24693 [details] slurm output logs
Created attachment 24694 [details] test-0-run
Created attachment 24695 [details] test1-slurm-logs
Created attachment 24696 [details] test1-run
Hello Felip,

Please note I get errors for both test 0 (the libfabric.so.1 error, in spite of its presence in the environment) and test 1:

[proxy:0:0@dcaldh001] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file mpi_hello (No such file or directory)

Am I missing something? Please advise.

Thank you
Shraddha
(In reply to Shraddha Kiran from comment #91)
> Hello Felip,
>
> Please note I get errors for both test0 ( libfabric.so.1 error inspite of
> its presence on environment ) and test1 ( [proxy:0:0@dcaldh001]

Can you check that all the nodes have libfabric installed??? I see this:

LIBRARY_PATH=/usr/lib64/libfabric.so.1:/cm/shared/apps/slurm/19.05.7/lib64/slurm:/cm/shared/apps/slurm/19.05.7/lib64:/cm/shared/apps/slurm/20.11.8/lib64/slurm:/cm/shared/apps/slurm/20.11.8/lib64

Please, check where libfabric.so.1 is. I see here:

1. The variable is LD_LIBRARY_PATH, not LIBRARY_PATH.
2. /usr/lib64/libfabric.so.1 is not valid; this must refer to a directory where libfabric.so.1 resides, which can be /usr/lib64/.
3. If it is /usr/lib64, it must be available on all nodes.

> HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file
> mpi_hello (No such file or directory)
> )

Please check the path in multiprog.txt to your mpi_hello, and that it is accessible from the nodes.
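To illustrate point 2 above, here is a small helper (not from the thread, just a sketch) that flags any colon-separated LD_LIBRARY_PATH entry that is a plain file rather than a directory, which the dynamic loader would silently ignore:

```shell
#!/bin/sh
# Sketch: each LD_LIBRARY_PATH component must be a directory.
# An entry like /usr/lib64/libfabric.so.1 (the .so file itself) is
# useless to the loader; this helper prints any such entry.
check_ld_path() {
    echo "$1" | tr ':' '\n' | while read -r entry; do
        [ -n "$entry" ] || continue
        if [ -f "$entry" ]; then
            echo "invalid (file, not directory): $entry"
        fi
    done
}

# Example: the first entry below is wrong, the second is fine.
check_ld_path "/usr/lib64/libfabric.so.1:/usr/lib64"
```

Running it against the LIBRARY_PATH value quoted above would flag the first component.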
Created attachment 24724 [details]
resending-test-0-slurm-log

Hi Felip,

Test 0 results (again): I observed the same error that we got earlier:

Abort(567055) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703):
MPID_Init(762).......: PMI_Init returned -1

Attached logs for the same. Working on test 1...

Thank You
Shraddha
(In reply to Shraddha Kiran from comment #93)
> Created attachment 24724 [details]
> resending-test-0-slurm-log
>
> Hi Felip,
>
> Test 0 results ( again ):
> I observed the same error Abort(567055) on node 0 (rank 0 in comm 0): Fatal
> error in PMPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(703):
> MPID_Init(762).......: PMI_Init returned -1
> that we got earlier

Please check your module slurm/20.11.8; it seems to be using "setenv" and this command is not found:

>Loading app_env/slurm
> Loading requirement: slurm/20.11.8
>/cm/local/apps/slurm/var/spool/job00988/slurm_script: line 12: setenv: command not found

If it still fails we have a simpler case to work on. Let's see test #1.
Created attachment 24726 [details] resend-test1-slurm-log Hi Felip, Attaching test 1 logs( got errors with MPI_INIT this time) and multiprog.txt. Please validate Thank You Shraddha
Created attachment 24727 [details] multiprog.txt
(In reply to Shraddha Kiran from comment #96) > Created attachment 24727 [details] > multiprog.txt Shraddha, Test #1 confirms that MPI is not working at all in your cluster and it is not related to CFDACE. Do you agree? Can you repeat TEST #1 but WITHOUT setting these? export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so export I_MPI_DEBUG=5 export SLURM_OVERLAP=1
Hi Felip,

Yes, it appears to me that it is not CFDACE related now, but that the MPI itself has issues. Any ideas / further suggestions as to how we can make it work? Recompile maybe?

Performed test 1 without the environment setup:

-bash-4.2$ echo $I_MPI_PMI_LIBRARY

-bash-4.2$ echo $I_MPI_DEBUG

Commented the SLURM_OVERLAP=1 inside run.sh.

The job seems to be stuck in this state and the slurm out file doesn't update anymore after this point (attaching the logs for clarity):

JOBID USER ST PARTITION NAME COMMAND SUBMIT_TIME CPUS TRES_PER_NODFEATURES NODES NODELIST(REASON)
991 e162968 R normal run.sh /dat/usr/e162968/cfdApr 28 10:27 42 N/A (null) 1 dcaldh001

Please suggest.

Thank You
Shraddha
Created attachment 24729 [details] test-1-slurm-logs-without env setup
Hi Felip, Could you please highlight again what should I expect with and without environment set up in case of test 1? Thank you Shraddha
(In reply to Shraddha Kiran from comment #100)
> Hi Felip,
>
> Could you please highlight again what should I expect with and without
> environment set up in case of test 1?
>
> Thank you
>
> Shraddha

Yeah, without SLURM_OVERLAP I expect exactly what you are seeing: the job stuck.

I think we need to check how you installed Slurm, from the beginning. Also, it is very important that all nodes have access to the same libraries; I am still concerned about libfabric.

Which exact steps did you follow to install Slurm?
Hi Felip,

libfabric is present on all compute nodes, except for one (002) because it is down:

----------------
dcaldh001
----------------
libfabric-1.7.2-1.el7.x86_64
libfabric-devel-1.7.2-1.el7.x86_64
----------------
dcaldh[003-004]
----------------
libfabric-devel-1.7.2-1.el7.x86_64
libfabric-1.7.2-1.el7.x86_64

rpm -ql libfabric
/usr/bin/fi_info
/usr/bin/fi_pingpong
/usr/bin/fi_strerror
/usr/lib64/libfabric.so.1
/usr/lib64/libfabric.so.1.10.2
/usr/lib64/pkgconfig/libfabric.pc
/usr/share/doc/libfabric-1.7.2
/usr/share/doc/libfabric-1.7.2/AUTHORS
/usr/share/doc/libfabric-1.7.2/README
/usr/share/licenses/libfabric-1.7.2
/usr/share/licenses/libfabric-1.7.2/COPYING

As always, we had compiled the Bright Cluster Manager's version of the Slurm spec file in order to accommodate our environment. Please let us know if there are any changes to be made.

Thank You
Shraddha
(In reply to Shraddha Kiran from comment #102)
> Hi Felip,
>
> The libfabric is present on all compute nodes, except for one( 002) because
> it is down
>..
> /usr/lib64/libfabric.so.1

If this is really there, I can only think that it is not picking up the correct one. In the LIBRARY_PATH variable (which was wrong; it should be LD_LIBRARY_PATH), I see things from Slurm 19.05 mixed in:

>env|grep libfabric.so.1
>LIBRARY_PATH=/usr/lib64/libfabric.so.1:/cm/shared/apps/slurm/19.05.7/lib64/slurm:/cm/shared/apps/slurm/19.05.7/lib64:/cm/shared/apps/slurm/20.11.8/lib64/slurm:/cm/shared/apps/slurm/20.11.8/lib64

> As always, we had compiled the Bright CLuster Manager's version of SLURM
> spec file in order to accomodate our environment. Please let us know if
> there are any changes to be made

If you cannot get a simple MPI job to work then there is some basic thing in the installation/compilation/network setup that doesn't work. I don't know how Bright sets up the environment, how it configures and compiles Slurm, or how it does the Slurm configuration, network, libraries and the rest. It is clearly something in your environment. I'd recommend uninstalling it and installing the latest Slurm version from sources, following the procedures in the schedmd.com documentation. The very basic steps are:

1. Download the code and uncompress it.
2. Create a build directory and cd into it.
3. Run:

$ ./<path_to_source>/configure --enable-debug --prefix=<path_to_shared_inst_dir> --enable-developer
$ make -j install
$ cd contribs/pmi2
$ make -j install
Hi Felip, Shall I repeat test1 again with a clean? As per your comment #97? Thank You Shraddha
(In reply to Shraddha Kiran from comment #104) > Hi Felip, > > Shall I repeat test1 again with a clean? As per your comment #97? > > Thank You > > Shraddha You mean repeat test 1 with a clean install (not Bright?). Sorry, I am not following. AFAIU in comment #98 you already repeated test 1 without these variables I said in comment #97.
(In reply to Felip Moll from comment #105) > (In reply to Shraddha Kiran from comment #104) > > Hi Felip, > > > > Shall I repeat test1 again with a clean? As per your comment #97? > > > > Thank You > > > > Shraddha > > You mean repeat test 1 with a clean install (not Bright?). Sorry, I am not > following. > > AFAIU in comment #98 you already repeated test 1 without these variables I > said in comment #97. Hi Felip, I meant shall I repeat the test1 with a clean environment setup? without the interruption of slurm 19.05 as you had observed it earlier Thank you Shraddha
Hi Felip, Also, could you tell me again what output should I expect once mpi_hello ( test 1 ) runs successfully? Thank you Shraddha
> I meant shall I repeat the test1 with a clean environment setup? without the
> interruption of slurm 19.05 as you had observed it earlier

If the environment was not OK, as we saw, the test is not valid. You can repeat it. Your goal is to be able to run a simple hello world as described in comment 70.

> Also, could you tell me again what output should I expect once mpi_hello (
> test 1 ) runs successfully?

You should see one string with the hostname of one node, plus 41 messages like this (as in the source code). You should not see any MPI errors, and the job must terminate correctly:

Hello world from processor %s, rank %d out of %d processors, %s

e.g.:

node001
Hello world from processor node001, rank 0 out of 12 processors, (null)
Hello world from processor node001, rank 1 out of 12 processors, (null)
Hello world from processor node001, rank 2 out of 12 processors, (null)
Hello world from processor node001, rank 3 out of 12 processors, (null)
.... etc ...
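One way to sanity-check a run's output automatically is to count the hello lines in the Slurm output file. This is only a sketch (the helper name and the sample file are hypothetical), but the pattern matches the printf format in the mpi_hello.c from comment 70:

```shell
#!/bin/sh
# Sketch: count MPI hello lines in a Slurm output file. For the
# 42-task test 1 (rank 0 runs `hostname`), a successful run should
# yield 41 such lines.
count_hello() {
    grep -c "^Hello world from processor" "$1"
}
```

For example, `count_hello slurm-976.out` would print 41 on a fully successful test 1 run, and something smaller if ranks failed to start.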
Shraddha, I am curious if you have done any progress with this issue. Just let me know about any step forward. Thanks!
Created attachment 24944 [details] single-node-run-test-1
Created attachment 24945 [details] multi-node( 2 node )-run-test1
Hi Felip, I had run the test1 for both multi and single node run. However, I did not see the expected output in both cases. I see a message that says "srun: Job 996 step creation still disabled, retrying (Requested nodes are busy)" in both the slurm out files ( please find the attached files for today, i.e., 10-05-2022) Please advise Thank you Shraddha
(In reply to Shraddha Kiran from comment #112) > Hi Felip, > > I had run the test1 for both multi and single node run. However, I did not > see the expected output in both cases. I see a message that says "srun: Job > 996 step creation still disabled, retrying (Requested nodes are busy)" in > both the slurm out files ( please find the attached files for today, i.e., > 10-05-2022) > > Please advise > > Thank you > > Shraddha This is happening with "export SLURM_OVERLAP=1" set as in comment 70?
Hi Felip,

I don't think so. Jobs 995 and 996 are still running:

JOBID USER ST PARTITION NAME COMMAND SUBMIT_TIME CPUS TRES_PER_NODFEATURES NODES NODELIST(REASON)
996 e162968 R normal run.sh /dat/usr/e162968/cfdMay 10 10:30 54 N/A (null) 2 dcaldh[001,003]
995 e162968 R normal run.sh /dat/usr/e162968/cfdMay 9 9:00 42 N/A (null) 1 dcaldh003

I have also echoed the SLURM_OVERLAP variable:

-bash-4.2$ echo $SLURM_OVERLAP

-bash-4.2$

Please let me know about any other things.

Thank you
Shraddha
It is still running, with the same message:

srun: Job 996 step creation still disabled, retrying (Requested nodes are busy)

As I had mentioned, I commented out the SLURM_OVERLAP variable inside the script that was set before.

Thank you
Shraddha
Hello Felip, Is it possible to provide further help? Thank you Shraddha
Hello Felip,

Is this issue related to the Slurm CVEs (CVE-2022-29500, 29501, 2950) released lately? Kindly comment.

Thank you
Shraddha
(In reply to Shraddha Kiran from comment #114)
> Hi Felip,
>
> I don't think so. The jobs 995 and 996 are still running
>
> JOBID USER ST PARTITION NAME COMMAND
> SUBMIT_TIME CPUS TRES_PER_NODFEATURES NODES
> NODELIST(REASON)
> 996 e162968 R normal run.sh /dat/usr/e162968/cfdMay
> 10 10:30 54 N/A (null) 2 dcaldh[001,003]
> 995 e162968 R normal run.sh /dat/usr/e162968/cfdMay
> 9 9:00 42 N/A (null) 1 dcaldh003
>
> I have also echoed the SLURM_OVERLAP variable
> -bash-4.2$ echo $SLURM_OVERLAP
>
> -bash-4.2$
>
> Please let me know for any other things
>
> Thank you
>
> Shraddha

Shraddha, I think we almost got it in our last comments, but we missed a detail.

You repeated test #1 in a clean environment, and then your job got stuck and received "step creation still disabled, retrying (Requested nodes are busy)", as you explained in comment 112. Then in comment 113 I asked whether SLURM_OVERLAP=1 was set, and you responded that it was not set. That's because on the first run of test #1 the test was not successful, so I suggested removing the environment variables. You repeated it and it failed again, but I took the test as valid (comment 101); then you told me the environment was not ok and you repeated it, but without the environment variables. I was expecting you to repeat test #1 with these environment variables and a clean environment. Can you do so?

--

I think this bug is already too long, and we need to recap where we are and what we're looking for:

- Where we are now: the initial errors, when you reported the bug back in March, happened because CFDACE uses its internal Intel MPI implementation (version 2019), which used PMI-1, and Intel called our pmi2 library with some kind of message, sent when running with a large number of ranks, that makes it fail. So I suggested instructing Intel MPI 2019 to use PMI-2, but then it failed with other errors.

- Then I tested CFDACE on my machine and it seemed to work just fine.
- Then I wanted to see if a simple MPI program works in your environment, and that is where we are now. It seems not to be working, as per comments 95 and 112.

- But test #1, which is a simple MPI program, was run once with an unclean environment, and then with a clean environment but without SLURM_OVERLAP and the other variables.

- We need test #1 (as in comment 70) with a clean environment and SLURM_OVERLAP=1, not like in comment 92, which had a wrong LIBRARY_PATH (it should have been LD_LIBRARY_PATH, etc.).

- SLURM_OVERLAP=1 is needed for multiple steps to run at the same time; this will avoid this error: "step creation still disabled, retrying (Requested nodes are busy)"

Please confirm that all is clear at this point and that we still haven't run test #1 correctly. Then run it and attach the logs as usual. Attach also the slurmd log.
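As a minimal sketch of the requested setup (the libpmi2.so path is the one quoted earlier in this bug, and the exact multiprog contents are per comment 70; both are assumptions to adjust for your site), the batch script for test #1 could look like this:

```shell
#!/bin/sh
# Sketch only: generate the test #1 batch script with a clean environment
# and SLURM_OVERLAP=1 set before the srun call.
cat > run.sh <<'EOF'
#!/bin/bash
# Clean environment: no slurm/19.05 modules loaded here.
# Allow several steps to run at the same time (required since 20.11):
export SLURM_OVERLAP=1
# Use Slurm's libpmi2.so, not the system one (site-specific path):
export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
srun --multi-prog multiprog.txt
EOF
chmod +x run.sh
# Submitted as: sbatch -n 42 -p normal run.sh
grep -n 'SLURM_OVERLAP' run.sh
```

The key point is that both exports happen inside run.sh, so every step launched by the job inherits them.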
Hello Felip, I shall rerun the test#1 and attach the logs as requested. Thank you for understanding Shraddha
(In reply to Shraddha Kiran from comment #117)
> Hello Felip,
>
> Is this issue related to SLURM CVEs ,CVE-2022-29500, 29501, 2950 released
> lately?

Regarding this question: these CVEs are totally unrelated to this bug.
Hello Felip,

As requested, I performed test #1 as follows:

1. Ensured a clean environment
2. Ensured SLURM_OVERLAP=1 is set

Attaching the run logs and slurmd logs. Please validate and let me know if anything further is needed.

Thank you
Shraddha
Created attachment 25181 [details] mpi-job-run-logs
Created attachment 25182 [details] mpi-job-run-logs

Hello Felip,

Sorry for the naming mix-up: the attachment I uploaded previously actually contains the slurmd logs for this corresponding job (jobid 1047).

Thanks for understanding
Excellent! This looks good, which means test #1 passes, so Slurm and MPI are working and the problem must be elsewhere.

But one more thing: please repeat *exactly* the same test but REPLACING:

sbatch -n 42 -p normal run.sh

BY

sbatch -n 1 -p normal run.sh

Send me the same logs (slurmd + output of job) you have just sent.
Hello Felip,

Sure. Should I make any changes in the multiprog.txt file too?

-bash-4.2$ cat multiprog.txt
0 hostname
1-41 /sw/intel-ps/intel-ps-2018u3/compilers_and_libraries_2018.3.222/linux/mpi/intel64/bin/mpirun /dat/usr/e162968/cfd-test-slurmu-upgrade-20.11.8/test-1/mpi_hello

Thank You
Shraddha
(In reply to Shraddha Kiran from comment #126) > Hello Felip, > > Sure, Should I make any changes in the multiprog.txt file too? > > -bash-4.2$ cat multiprog.txt > 0 hostname > 1-41 > /sw/intel-ps/intel-ps-2018u3/compilers_and_libraries_2018.3.222/linux/mpi/ > intel64/bin/mpirun > /dat/usr/e162968/cfd-test-slurmu-upgrade-20.11.8/test-1/mpi_hello > > Thank You > Shraddha No, not yet. Thanks
Created attachment 25184 [details] mpi-job-logs-1-CPU
Created attachment 25185 [details] mpi-slurmd-logs
Yep, exactly what I expected; I just wanted to be sure you saw the same as me.

Test #1 is a success.

Conclusion: Slurm works. Intel MPI works with Slurm. And Intel MPI works with srun and pmi2. I think your installation is good and the problem is in CFDACE or an external configuration, not in Slurm.

--

1. Since test #1 works, let's start from there. Revert this last change: go back to "sbatch -n 42 -p normal run.sh" instead of -n1. Then modify ONLY multiprog.txt and leave it like this:

0 /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/dtfIoServer-MPICH-MPI
1 /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/wmServer-MPICH-MPI
2-41 /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/CFD-ACE-SOLVER-MPICH-MPI -useDtfServer -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -sim 7 -parallel

2. Then run the same test again.

3. Remember that when running these binaries directly instead of through your wrappers, it may be necessary to set some LD_LIBRARY_PATH variables, or CFD-ACE-SOLVER may fail to start up. If you manage to start these binaries but still get MPI issues, it is most likely an issue in your configuration related to CFDACE, or some environment variable that it needs, but that would be something to ask ACE.

Please do this test and let me know.
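Regarding point 3 above, a quick pre-flight check can catch missing LD_LIBRARY_PATH entries before submitting. This is a sketch only; the binary path is the one from the multiprog.txt above and may differ on your system:

```shell
#!/bin/sh
# Sketch: list any shared libraries the solver cannot resolve with the
# current LD_LIBRARY_PATH. Any "not found" line means the job step would
# fail at startup with an "error while loading shared libraries" message.
BIN=/sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/CFD-ACE-SOLVER-MPICH-MPI
if [ -x "$BIN" ]; then
    ldd "$BIN" | grep 'not found' || echo "all libraries resolved"
else
    echo "binary not found or not executable: $BIN"
fi
```

Running this on a compute node (with the job's environment loaded) shows exactly which libraries are missing, which is easier to iterate on than resubmitting the full job.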
If you cannot start CFD-ACE I can give you a hint on how I did it in my machine. Let me know.
(In reply to Felip Moll from comment #130) > Yep. Exactly what I expected, I just wanted to be sure you saw the same than > me. > > Test #1 is success. > > Conclusion: > > Slurm works. Intel MPI works with Slurm. And Intel MPI works with srun and > pmi2. > I think your installation is good and the problem is in CFDACE or an > external configuration, not in Slurm. > > > -- > > 1. Since test #1 works, let's start from there. > > Revert this last change we've done, get back to have "sbatch -n 42 -p normal > run.sh" instead of -n1. > > Then, modify ONLY multiprog.txt and leave it like this: > > 0 > /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/ > dtfIoServer-MPICH-MPI > 1 > /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/ > wmServer-MPICH-MPI > 2-41 > /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/ > CFD-ACE-SOLVER-MPICH-MPI -useDtfServer -dtf > 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_40p.DTF -sim 7 -parallel > > > 2. Then run again the same test. > > 3. Remember that running this directly instead of with your wrappers, it may > be required to set some LD_LIBRARY_PATH variables or CFD-ACE-SOLVER may fail > to startup. If you manage to start these binaries but still get MPI issues, > most likely is an issue in your configuration related to CFDACE, or some > environment variable that it needs, but that would be something to ask to > ACE. > > Please do these test and let me know. Should I add in extra environment variables before running? export I_MPI_PMI_VALUE_LENGTH_MAX=512 #export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so #export I_MPI_DEBUG=5 Anything?
> export I_MPI_PMI_VALUE_LENGTH_MAX=512 > #export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so > #export I_MPI_DEBUG=5 > > Anything? No. *Exactly* as test #1 that worked, but just changing the multiprog.txt.
Created attachment 25263 [details] cfd-job-logs

Hello Felip,

I tried running as you had suggested, but it failed on a few libraries. Attaching the logs.

Could you suggest how you were able to run it on your machine?

Thank you
Shraddha
Hello Felip, Note: The input file has changed ( but the issue remains on any cfd case ) due to machine restrictions. Have changed the multiprog.txt also accordingly
(In reply to Shraddha Kiran from comment #136)
> Hello Felip,
>
> Note: The input file has changed ( but the issue remains on any cfd case )
> due to machine restrictions. Have changed the multiprog.txt also accordingly

(In reply to Shraddha Kiran from comment #135)
> Created attachment 25263 [details]
> cfd-job-logs
>
> Hello Felip,
>
> I tried running as you had suggested but failed on few libraries. Attaching
> the logs
>
> May you suggest how you were able to run on your machine?
>
> Thank you
> Shraddha

Before proceeding further, I see this line at the top of your recently uploaded log:

Loading app_env/slurm
Loading requirement: slurm/19.05.7

while there is no such line in the log uploaded in comment 123. Please ensure the test is exactly the same and that you are only changing multiprog.txt.
On my local machine, to run CFD-ACE manually I needed the following batch script. Read it carefully and change it accordingly if you want to use it.

------ my batch script --------
#!/bin/bash
#
# Customer runs this one as:
# sbatch -n 42 --use-min-nodes -N 1-2 --job-name=cfdace 20.5 -p normal /tmp/tmp.kzJK1Rwkk2

ulimit -s unlimited

#####################################################
########## START LOADING ENVIRONMENT ################
#####################################################
export CFD_ROOT=/<change this path>/root/ACE+Suite/2021.0/linux-x64-intel19-glibc2.17/UTILS
export CFD_BINDIR=$CFD_ROOT/bin/
export PAM_LMD_LICENSE_FILE=/<change this path>/felip_license.lic
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CFD_ROOT"/mpirt/mpich3/lib/"
export PATH=$PATH:$CFD_BINDIR

### ESI-Group begin ###
#
# ESI-Group Software environment
#
PAMHOME=/<change this path>/root
PAMENV=$PAMHOME/env-`uname`
export PAMHOME PAMENV
ESI_HOME=${PAMHOME}; export ESI_HOME
ESI_CANONICAL=`${PAMHOME}/getppgdir.sh s`; export ESI_CANONICAL
if [ -d ${PAMHOME}/ACE+Suite/2021.0/${ESI_CANONICAL} ]; then
    PATH=${PAMHOME}/ACE+Suite/2021.0/${ESI_CANONICAL}/UTILS/bin:${PATH}; export PATH
fi
if [ -r $PAMENV/psi.Baenv ]; then
    . $PAMENV/psi.Baenv
fi
# Next line to avoid error
BASH_ENV=${BASH_ENV:-""}
if [ -z "$BASH_ENV" ]; then
    BASH_ENV=$HOME/.bashrc
    export BASH_ENV
fi
### ESI-Group end ###
#####################################################
########## END LOADING ENVIRONMENT ##################
#####################################################

export I_MPI_PMI_LIBRARY=<path_to_slurm_installation>/lib/libpmi2.so

##export I_MPI_FABRICS=shm:ofi ##THIS LINE IS A TEST ONLY, DONT UNCOMMENT##

if [ $SLURM_NNODES = 1 ]
then
    CFD-SOLVER -dtf $PWD/Pipe_run_To_Decompose_24.DTF -num 22 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -sim 1 #-nodecomp
else
    HOSTFILE=`mktemp --tmpdir=./`
    srun hostname > $HOSTFILE
    CFD-SOLVER -dtf $PWD/Pipe_run_To_Decompose_24.DTF -num 20 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    rm -f $HOSTFILE
fi
Created attachment 25264 [details] cfd-job-logs2 Hello Felip, reattached the logs Thank you Shraddha
(In reply to Felip Moll from comment #138) > In my local machine, to run CFD-ACE manually I needed the following batch > script. Read carefully and change accordingly if you want to use it. > > ------ my batch script -------- > #!/bin/bash > # > # Customer runs this one as: > # sbatch -n 42 --use-min-nodes -N 1-2 --job-name=cfdace 20.5 -p normal > /tmp/tmp.kzJK1Rwkk2 > > ulimit -s unlimited > > ##################################################### > ########## START LOADING ENVIRONMENT ################ > ##################################################### > export CFD_ROOT=/<change this > path>/root/ACE+Suite/2021.0/linux-x64-intel19-glibc2.17/UTILS > export CFD_BINDIR=$CFD_ROOT/bin/ > export PAM_LMD_LICENSE_FILE=/<change this path>/felip_license.lic > export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CFD_ROOT"/mpirt/mpich3/lib/" > export PATH=$PATH:$CFD_BINDIR > > ### ESI-Group begin ### > # > # ESI-Group Software environment > # > PAMHOME=/<change this path>/root > PAMENV=$PAMHOME/env-`uname` > export PAMHOME PAMENV > ESI_HOME=${PAMHOME}; export ESI_HOME > ESI_CANONICAL=`${PAMHOME}/getppgdir.sh s`; export ESI_CANONICAL > if [ -d ${PAMHOME}/ACE+Suite/2021.0/${ESI_CANONICAL} ]; then > PATH=${PAMHOME}/ACE+Suite/2021.0/${ESI_CANONICAL}/UTILS/bin:${PATH}; > export PATH > fi > if [ -r $PAMENV/psi.Baenv ]; then > . 
$PAMENV/psi.Baenv > fi > # Next line to avoid error > BASH_ENV=${BASH_ENV:-""} > if [ -z "$BASH_ENV" ];then > BASH_ENV=$HOME/.bashrc > export BASH_ENV > fi > ### ESI-Group end ### > ##################################################### > ##########END LOADING ENVIRONMENT#################### > ##################################################### > > export I_MPI_PMI_LIBRARY=<path_to_slurm_installation>/lib/libpmi2.so > > ##export I_MPI_FABRICS=shm:ofi ##THIS LINE IS A TEST ONLY, DONT UNCOMMENT## > > if [ $SLURM_NNODES = 1 ] > then > CFD-SOLVER -dtf $PWD/Pipe_run_To_Decompose_24.DTF -num 22 > --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -sim 1 #-nodecomp > else > HOSTFILE=`mktemp --tmpdir=./` > srun hostname > $HOSTFILE > CFD-SOLVER -dtf $PWD/Pipe_run_To_Decompose_24.DTF -num 20 > --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp --hosts > $HOSTFILE > rm -f $HOSTFILE > fi Thank you, will try this and let you know
Hello Felip,

Could you please explain the below portion of your script? What are we trying to achieve with:

if [ -r $PAMENV/psi.Baenv ]; then
    . $PAMENV/psi.Baenv
fi
# Next line to avoid error
BASH_ENV=${BASH_ENV:-""}
if [ -z "$BASH_ENV" ];then
    BASH_ENV=$HOME/.bashrc
    export BASH_ENV
fi

Thank you
Shraddha
Created attachment 25291 [details] cfd-job-logs3

Hello Felip,

After reviewing your script I made the corresponding changes, and now I get this error:

/sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/CFD-ACE-SOLVER-MPICH-MPI: error while loading shared libraries: libifport.so.5: cannot open shared object file: No such file or directory

Logs attached (cfd-job-logs3).

After that I changed the paths to point to the right libraries, and now I get the error below:

srun: debug2: Entering _file_write
Fatal error in PMPI_Recv: Invalid rank, error stack:
PMPI_Recv(171): MPI_Recv(buf=0x4ad3fe0, count=1, MPI_INT, src=20, tag=10660668, comm=0x84000002, status=0x7fffffff0d80) failed
PMPI_Recv(108): Invalid rank has value 20 but must be nonnegative and less than 20
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
slurmstepd: error: *** STEP 6039136.0 ON dcalph034 CANCELLED AT 2022-05-31T02:31:53 ***

Logs attached for this as well (cfd-job-logs4).

Thank you
Shraddha
Created attachment 25292 [details] cfd-job-logs4
Is it also possible to test in future without the SLURM_OVERLAP option?
(In reply to Shraddha Kiran from comment #141)
> Hello Felip,
>
> Could you please explain below portion of your scirpt? What is the goal we
> are trying to achieve by:
>
> if [ -r $PAMENV/psi.Baenv ]; then
> . $PAMENV/psi.Baenv
> fi
> # Next line to avoid error
> BASH_ENV=${BASH_ENV:-""}
> if [ -z "$BASH_ENV" ];then
> BASH_ENV=$HOME/.bashrc
> export BASH_ENV
> fi
>
> Thank you
> Shraddha

This was copied from CFDACE; it is something the installation of this software put into my .bashrc. I am not sure if it is relevant.
> PMPI_Recv(108): Invalid rank has value 20 but must be nonnegative and less
> than 20
> srun: debug2: Leaving _file_write
> srun: debug2: Called _file_readable
> srun: debug2: Called _file_writable
> srun: debug2: Called _file_writable
> srun: debug2: Entering _file_write
> slurmstepd: error: *** STEP 6039136.0 ON dcalph034 CANCELLED AT
> 2022-05-31T02:31:53 ***

Oh, it is possible I made a mistake; please change this 22 to a 20:

------------ before:

if [ $SLURM_NNODES = 1 ]
then
    CFD-SOLVER -dtf $PWD/Pipe_run_To_Decompose_24.DTF -num 22 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -sim 1 #-nodecomp
else
...

------------ after:

if [ $SLURM_NNODES = 1 ]
then
    CFD-SOLVER -dtf $PWD/Pipe_run_To_Decompose_24.DTF -num 20 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -sim 1 #-nodecomp
else
...

If it doesn't work, try adjusting this value.
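To keep this value from drifting out of sync with the allocation, one option is to derive it instead of hard-coding it. This is a sketch only, built on the assumption (suggested by the multiprog.txt layout in this thread) that the solver rank count equals the total task count minus the two server ranks (dtfIoServer and wmServer):

```shell
#!/bin/sh
# Sketch: compute the solver's -num from the allocation size.
# SLURM_NTASKS is set by Slurm inside a real job; the fallback value here
# exists only so this sketch runs standalone.
SLURM_NTASKS=${SLURM_NTASKS:-22}
NUM=$((SLURM_NTASKS - 2))   # subtract the dtfIoServer and wmServer ranks
echo "would pass: -num $NUM"
```

The CFD-SOLVER line would then use `-num $NUM`. Whether the minus-two convention holds for every case layout is an assumption to verify against your multiprog.txt.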
(In reply to Shraddha Kiran from comment #144)
> Is it also possible to test in future without the SLURM_OVERLAP option?

Not for the moment. We need SLURM_OVERLAP because otherwise several sruns may not run at the same time on the node. This is a change introduced in 20.11.
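Since the variable earlier turned out to be commented out, a quick sanity check (sketch only) inside the batch script can confirm it is actually exported so that child steps inherit it:

```shell
#!/bin/sh
# Sketch: verify SLURM_OVERLAP is really exported before the srun calls.
# `export -p` lists only exported variables, so a plain (unexported)
# assignment would not show up here.
export SLURM_OVERLAP=1
if export -p | grep -q 'SLURM_OVERLAP'; then
    echo "SLURM_OVERLAP is exported: $SLURM_OVERLAP"
else
    echo "SLURM_OVERLAP is NOT exported"
fi
```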
Shraddha, Any news after my previous 2 comments? Thanks!
(In reply to Felip Moll from comment #148)
> Shraddha,
>
> Any news after my previous 2 comments?
>
> Thanks!

Hello Felip,

I tried submitting after your comments, but it looks like I am missing some data points; I am working with the CFDACE vendor to help me debug.

I have made the changes in the run.sh file to accommodate 22 CPUs, as below:

if [ $SLURM_NNODES = 1 ]
then
    CFD-SOLVER -dtf $PWD/20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_22p.DTF -num 20 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -sim 1 #-nodecomp
else
    HOSTFILE=`mktemp --tmpdir=./`
    srun hostname > $HOSTFILE
    CFD-SOLVER -dtf $PWD/20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_22p.DTF -num 20 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    rm -f $HOSTFILE
fi

And the remaining +2 CPUs are specified in the multiprog.txt file as:

0 /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/dtfIoServer-MPICH-MPI
1 /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/wmServer-MPICH-MPI
2-21 /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/CFD-ACE-SOLVER-MPICH-MPI -useDtfServer -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_22p.DTF -sim 7 -parallel

Still I get the error saying:

Warning! num > max_num in num2string(): 21 20
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: eio_message_socket_accept: got message connection from 10.141.1.200:41762 16
srun: debug2: received job step complete message
srun: Complete job step 6042425.0 received
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: debug2: eio_message_socket_accept: got message connection from 10.141.1.200:41768 16
srun: debug2: received job step complete message
srun: Complete job step 6042425.0 received
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
Warning! num > max_num in num2string(): 22 20
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
Fatal error in PMPI_Recv: Invalid rank, error stack:
PMPI_Recv(171): MPI_Recv(buf=0x4ad3fe0, count=1, MPI_INT, src=20, tag=10660668, comm=0x84000002, status=0x7fffffff0f00) failed
PMPI_Recv(108): Invalid rank has value

I am not sure why the two numbers for num and max_num don't match, or whether I am missing something.

Shraddha
> Hello Felip
>
> I tried submitting after your comments but looks like I am missing out on
> some data points working with CFDACE vendor to help me debug

That sounds good. Please let me know if you have anything on their side.

> I have made the changes in the run.sh file in order to accomodate for 22
> CPUs as below
>...
> Still I get the error saying:
>
> Warning! num > max_num in num2string(): 21 20
>...
> I am not sure why the two numbers for num and max_num aren't matching or I
> am missing something..

That's out of my knowledge too.

Have you tried different lower numbers? E.g. decreasing it to 10 or something but keeping the sbatch -n to 42? What tests have you done?

I can retry it here.
(In reply to Felip Moll from comment #150)
> > Hello Felip
> >
> > I tried submitting after your comments but looks like I am missing out on
> > some data points working with CFDACE vendor to help me debug
>
> That sounds good. Please let me know if you have anything on their side.
>
> > I have made the changes in the run.sh file in order to accomodate for 22
> > CPUs as below
> >...
> > Still I get the error saying:
> >
> > Warning! num > max_num in num2string(): 21 20
> >...
> > I am not sure why the two numbers for num and max_num aren't matching or I
> > am missing something..
>
> That's out of my knowledge too.
>
> Have you tried different lower numbers?
>
> E.g. decreasing it to 10 or something but keeping the sbatch -n to 42? What
> tests have you done?
>
> I can retry it here.

Hello Felip,

The CFD case I have is 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_22p.DTF, implying it needs 22 + 2 CPU cores.

I did the following. I kept the multiprog.txt as below:

0 /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/dtfIoServer-MPICH-MPI
1 /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/wmServer-MPICH-MPI
2-21 /sw/ESI_Software/2020.5/ACE+Suite/2020.5/Linux_x86_64_2.17/ACE_SOLVER/bin/CFD-ACE-SOLVER-MPICH-MPI -useDtfServer -dtf 20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_22p.DTF -sim 7 -parallel

I changed num to 18:

if [ $SLURM_NNODES = 1 ]
then
    CFD-SOLVER -dtf $PWD/20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_22p.DTF -num 18 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -sim 1 #-nodecomp
else
    HOSTFILE=`mktemp --tmpdir=./`
    srun hostname > $HOSTFILE
    CFD-SOLVER -dtf $PWD/20mm_40mil_ZonalWidth_WO_Liner_CASE_01_Updated_NEW_22p.DTF -num 18 --slurmLauncher=srun -verbose 3 -job -keepTmpFiles -nodecomp --hosts $HOSTFILE
    rm -f $HOSTFILE
fi

It errors out as:

Warning! num > max_num in num2string(): 21 20
srun: debug2: received job step complete message
srun: Complete job step 6042546.0 received
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: debug2: eio_message_socket_accept: got message connection from 10.141.1.200:39138 16
srun: debug2: received job step complete message
srun: Complete job step 6042546.0 received
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
Warning! num > max_num in num2string(): 22 20
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
Fatal error in PMPI_Recv: Invalid rank, error stack:
PMPI_Recv(171): MPI_Recv(buf=0x4ad3fe0, count=1, MPI_INT, src=20, tag=10660668, comm=0x84000002, status=0x7fffffff0f00) failed
PMPI_Recv(108): Invalid rank has value 20 but must be nonnegative and less than 20

I then changed the num value to 16, and it errors out as:

Warning! num > max_num in num2string(): 21 20
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: eio_message_socket_accept: got message connection from 10.141.1.200:40868 16
srun: debug2: received job step complete message
srun: Complete job step 6042547.0 received
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: debug2: eio_message_socket_accept: got message connection from 10.141.1.200:40876 16
srun: debug2: received job step complete message
srun: Complete job step 6042547.0 received
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
Warning! num > max_num in num2string(): 22 20
srun: debug2: Leaving _file_write
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
Fatal error in PMPI_Recv: Invalid rank, error stack:
PMPI_Recv(171): MPI_Recv(buf=0x4ad3fe0, count=1, MPI_INT, src=20, tag=10660668, comm=0x84000002, status=0x7fffffff0f00) failed
PMPI_Recv(108): Invalid rank has value 20 but must be nonnegative and less than 20

It looks like changing the number doesn't affect the result. Could you please try at your end and let me know.

Thank you
Shraddha
Also, let me know when you would like to discuss this integration issue with the CFD vendor; I can set up some time with you, CFD and AMAT accordingly.
Ok, I will do the test locally asap and inform you.

(In reply to Shraddha Kiran from comment #152)
> Also let me know when you would want to discuss this integration issue with
> CFD vendor. I can setup sometime with you, CFD and AMAT accordingly

I understood you were talking with CFD on your side, which I think is more appropriate than involving SchedMD in a talk with CFD at the moment, especially taking into account that this works on my side :)

I will let you know about my tests.
(In reply to Felip Moll from comment #153) > Ok, I will do the test locally asap and inform you. > > (In reply to Shraddha Kiran from comment #152) > > Also let me know when you would want to discuss this integration issue with > > CFD vendor. I can setup sometime with you, CFD and AMAT accordingly > > I understood you were talking with CFD on your side, which is what I think > more appropiate rather than involving SchedMD on a talk with CFD at the > moment. And moreover taking into account this works on my side :) > > I will let you know about my tests. Hello Felip, Did you get a chance to confirm on your tests?
Wanted to update you on one more data point, we see this issue of

slurmstepd: error: mpi/pmi2: invalid client request
slurmstepd: error: mpi/pmi2: request not begin with 'cmd='

on Slurm 19.05.7 version too

Do you suggest trying the same parameters as
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/libfabric.so.1 and SLURM_OVERLAP=1?

Thank You
> Did you get a chance to confirm on your tests?

Not yet. I will inform you ASAP.

> Wanted to update you on one more data point, we see this issue of
> slurmstepd: error: mpi/pmi2: invalid client request
> slurmstepd: error: mpi/pmi2: request not begin with 'cmd='
>
> on Slurm 19.05.7 version too

Oh, that's relevant. Thanks for this information; it means that the Slurm version doesn't matter and it is probably an application issue (or a bug that has not been identified yet).

> Do you suggest trying the same parameters as
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/libfabric.so.1 and
> SLURM_OVERLAP=1
> ?

No. SLURM_OVERLAP=1 doesn't exist in 19.05 (the behavior there is different), and libfabric.so.1 was only set to make CFDACE happy. These have nothing to do with the issue you see.
Hmm... okay. However, the issue has arisen again in another application on Slurm 19.05.7 that uses intel-ps modules. Do you suggest running the basic MPI test again on Slurm 19?
Shraddha,

I successfully ran with "-num 22" and with "#SBATCH -n 24". I ran it with 'sbatch run.sh'.

My license is expired, so my job ends with a license error anyway.

I don't suggest running tests under a 19 version, which is already very old. Is it the same software (CFDACE) you ran, or did you see this error with another software?
(In reply to Felip Moll from comment #158)
> Shraddha,
>
> I successfully run with "-num 22", and with "#SBATCH -n 24".
> I run with 'sbatch run.sh'
>
> My license is expired anyway, so my job ends up with an error because of
> licenses.
>
> I don't suggest running tests under a 19 version which is already very old.
> Is it the same software (CFDACE) you ran or you saw this error with another
> software?

Hello Felip,

Yes, this is another software (built internally) using intel-ps-2020u4 modules.
(In reply to Shraddha Kiran from comment #155)
> Wanted to update you on one more data point, we see this issue of
> slurmstepd: error: mpi/pmi2: invalid client request
> slurmstepd: error: mpi/pmi2: request not begin with 'cmd='
>
> on Slurm 19.05.7 version too
>
> Do you suggest trying the same parameters as
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/libfabric.so.1 and
> SLURM_OVERLAP=1
> ?
>
> Thank You

Shraddha, at Intel I see people with the same issue who needed to set I_MPI_PMI_LIBRARY to Slurm's libpmi2.so; otherwise Intel MPI uses PMI-1 and fails as I explained at some point in this bug. Make sure that in your Slurm 19.05.7 setup you export the correct I_MPI_PMI_LIBRARY variable, pointing Intel to the corresponding libpmi2.so of Slurm.

Do you have any news from CFDACE about what could be needed for your issue in comment 151? As I noted, it worked for me in comment 158.
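As a sketch of that check for a batch script (the path shown is the 20.11.8 one quoted earlier in this bug; for a 19.05.7 installation substitute that installation's lib directory):

```shell
#!/bin/sh
# Sketch: point Intel MPI at Slurm's PMI-2 library and sanity-check that
# the variable names a file called libpmi2.so and that it exists on disk.
export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/20.11.8/lib64/libpmi2.so
if [ "${I_MPI_PMI_LIBRARY##*/}" = "libpmi2.so" ]; then
    echo "I_MPI_PMI_LIBRARY points at a libpmi2.so"
else
    echo "warning: $I_MPI_PMI_LIBRARY does not look like a libpmi2.so"
fi
[ -f "$I_MPI_PMI_LIBRARY" ] || echo "note: file not present on this host"
```

The important detail, as noted earlier in this bug, is that the path must be Slurm's libpmi2.so and not the system one.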
Hi Shraddha, Did you get any information from CFDACE on what could go on? How do you want to proceed with this issue? From Slurm side we've verified this works, and for the moment we have no more suggestions.
(In reply to Felip Moll from comment #161) > Hi Shraddha, > > Did you get any information from CFDACE on what could go on? > > How do you want to proceed with this issue? From Slurm side we've verified > this works, and for the moment we have no more suggestions. Hello Felip, Not yet, we are currently working on our internal environment setup. Appreciate the suggestions that you had already given. You may tentatively close this ticket at the moment. We will reach out to you in any case required Thank you Shraddha
(In reply to Shraddha Kiran from comment #162)
> (In reply to Felip Moll from comment #161)
> > Hi Shraddha,
> >
> > Did you get any information from CFDACE on what could go on?
> >
> > How do you want to proceed with this issue? From Slurm side we've verified
> > this works, and for the moment we have no more suggestions.
>
> Hello Felip,
>
> Not yet, we are currently working on our internal environment setup.
> Appreciate the suggestions that you had already given.
> You may tentatively close this ticket at the moment. We will reach out to
> you in any case required
>
> Thank you
> Shraddha

Shraddha, I appreciate your response. Know that we're here in case you have more feedback from the CFDACE side.

I am closing this out, but you can freely mark this bug as OPEN again at any time; or, if you want to start fresh (we have already reached 162 comments here, which can be a bit confusing), just open a new one and reference me there and I will take it.

Regards and thanks