Created attachment 22834 [details]
The slurm.conf file that is used for both versions

Hi,

Hope you had a wonderful holiday! I am resubmitting this as a new ticket on Jason's recommendation. This is essentially a cut-and-paste from the original bug 10889; I changed the title of the ticket to avoid confusion about the salloc vs. sbatch discrepancy discussed in the previous ticket. With the new version it is currently not possible to use Forge at all, so this is more disruptive for users.

--------- cut and paste from previous ticket ---------

I was not sure whether I should submit a new ticket, but I am replying to this one because it was never quite resolved, and since Tony has retired we are short-staffed and could not get you the logs/files you needed.

The issue now is this: if you remember, Forge worked fine with salloc but not with sbatch; that was the original problem. We now have two versions of Slurm on our test system, and Forge does not work at all when run with the new version. Whether I use salloc or sbatch, it does not work. For the time being, let us ignore the difference between salloc and sbatch and focus on salloc not working with the new version.

On our test system we are using Slurm 21.08.2:

jfe01.% ll /apps/slurm
lrwxrwxrwx 1 root root 9 Oct 18 10:40 /apps/slurm -> slurmtest/
jfe01.%
jfe01.% ll /apps/slurm*/default
lrwxrwxrwx 1 slurm slurm  7 Oct 18 12:22 /apps/slurm/default -> 21.08.2/
lrwxrwxrwx 1 slurm slurm 10 Sep 14 14:17 /apps/slurmprod/default -> 20.11.7.p2/
lrwxrwxrwx 1 slurm slurm  7 Oct 18 12:22 /apps/slurmtest/default -> 21.08.2/
jfe01.%

Our production system is still using 20.11.7.p2. Using salloc, Forge works fine on our production system but does not work on our test system.
I am including the output below:

jfe01.% cd S2/forge-test
jfe01.% module load intel impi forge
jfe01.% salloc -A nesccmgmt -q admin -n 128
salloc: Granted job allocation 4966599
salloc: Waiting for resource configuration
salloc: Nodes j1c[08-11] are ready for job
[Raghu.Reddy@j1c08 forge-test]$ setenv SLURM_OVERLAP 1
[Raghu.Reddy@j1c08 forge-test]$ map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Arm Forge 21.0 - Arm MAP
srun: error: Unable to create step for job 4966599: Invalid Trackable RESource (TRES) specification
MAP: Arm MAP could not launch the debuggers:
MAP: srun exited with code 1
[Raghu.Reddy@j1c08 forge-test]$

After setting ALLINEA_USE_SSH_STARTUP to 1:

jfe01.% salloc -A nesccmgmt -q admin -n 128
salloc: Granted job allocation 4966600
salloc: Waiting for resource configuration
salloc: Nodes j1c[08-11] are ready for job
[Raghu.Reddy@j1c08 forge-test]$ module load intel impi forge
[Raghu.Reddy@j1c08 forge-test]$ setenv SLURM_OVERLAP 1
[Raghu.Reddy@j1c08 forge-test]$ setenv ALLINEA_USE_SSH_STARTUP 1
[Raghu.Reddy@j1c08 forge-test]$ map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Arm Forge 21.0 - Arm MAP

Profiling             : srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Allinea sampler       : not preloading (Express Launch)
MPI implementation    : Auto-Detect (SLURM (MPMD))
* number of processes : 128
* number of nodes     : 4
* Allinea MPI wrapper : not preloading (Express Launch)

MAP: Process 6:
MAP:
MAP: The Allinea sampler was not preloaded.
MAP: Check the user guide for instructions on how to link with the Allinea sampler.
[Raghu.Reddy@j1c08 forge-test]$

On our production system it works (I have edited the program output for brevity):

hfe03.% module load intel impi forge
hfe03.% salloc -A nesccmgmt -q admin -n 128
salloc: Pending job allocation 27033112
salloc: job 27033112 queued and waiting for resources
salloc: job 27033112 has been allocated resources
salloc: Granted job allocation 27033112
salloc: Waiting for resource configuration
salloc: Nodes h13c[24,26,28,31] are ready for job
h13c24.%
h13c24.% setenv SLURM_OVERLAP 1
h13c24.% map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Arm Forge 21.0 - Arm MAP

Profiling             : srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Allinea sampler       : preload (Express Launch)
MPI implementation    : Auto-Detect (SLURM (MPMD))
* number of processes : 128
* number of nodes     : 4
* Allinea MPI wrapper : preload (JIT compiled) (Express Launch)

 NAS Parallel Benchmarks 3.3 -- MG Benchmark
 No input file. Using compiled defaults
 Size: 1024x1024x1024 (class D)
 Iterations: 50
 Number of processes: 128

 Initialization time: 0.663 seconds

  iter    1
  ...
  iter   50

 Benchmark completed
 VERIFICATION SUCCESSFUL
 L2 Norm is 0.1583275060429E-09
 Error is   0.6697470786978E-11

 MG Benchmark Completed.
 Class      = D
 Size       = 1024x1024x1024
 Iterations = 50
 ...
 ...
 MS T27A-1
 NASA Ames Research Center
 Moffett Field, CA 94035-1000
 Fax: 650-604-3957

MAP analysing program...
MAP gathering samples...
MAP generated /scratch2/SYSADMIN/nesccmgmt/Raghu.Reddy/forge-test/mg-intel-impi.D_128p_4n_1t_2021-12-30_15-08.map
h13c24.% exit
salloc: Relinquishing job allocation 27033112
hfe03.%

I will upload the slurm.conf file; it is the same in both versions.

Wish you a Very Happy New Year!

Thank you!
This is a duplicate of bug 12880 and is fixed in 21.08.5 by commit 8b7b1e7128f. You just need to upgrade to the latest 21.08 (and we always encourage being on the latest micro release of whatever major version you are on anyway). You can read bug 12880 for the details.

Summary: --gres=none was broken, and Forge DDT uses --gres=none.

*** This ticket has been marked as a duplicate of ticket 12880 ***
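For anyone who wants to confirm this is the same bug before upgrading, the breakage can be checked without involving Forge at all, since it is srun's handling of --gres=none that fails. Below is a minimal sketch; the account/QOS values are just the ones from this ticket, and the exact error text is taken from the session output above.

```shell
# Inside an allocation on the affected 21.08.2 test system:
salloc -A nesccmgmt -q admin -n 128

# A plain step launch works:
srun hostname

# Explicitly requesting no GRES is what trips the broken TRES parsing
# (this is the flag Forge passes to srun internally), producing:
#   srun: error: Unable to create step for job <jobid>:
#   Invalid Trackable RESource (TRES) specification
srun --gres=none hostname
```

If the --gres=none step launches cleanly after upgrading to 21.08.5 (or later), the Forge failure should be resolved as well.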