Ticket 13102

Summary: ARM Forge tool is no longer working with 21.08.2
Product: Slurm
Reporter: Raghu Reddy <Raghu.Reddy>
Component: User Commands
Assignee: Marshall Garey <marshall>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact
Version: 21.08.2
Hardware: Linux
OS: Linux
Site: NOAA
NOAA Site: NESCC
Attachments: The slurm.conf file that is used for both versions

Description Raghu Reddy 2022-01-03 08:20:35 MST
Created attachment 22834 [details]
The slurm.conf file that is used for both versions

Hi,

Hope you had a wonderful Holiday!

I am resubmitting this as a new ticket on recommendation from Jason.

I am just doing a cut-and-paste from the original bug 10889; I changed the title of the ticket to avoid confusion with the salloc vs. sbatch discrepancy discussed in the previous ticket.

With the new version it is currently not possible to use Forge at all, so this is a little more disruptive for users.

--------- cut and paste from previous ticket ---------

I am not sure if I should submit a new ticket, but I am replying to this one because it was never quite resolved. Since Tony has retired, we are short-staffed and could not get you the logs/files you needed.

Now the issue is this: if you remember, the basic problem was that Forge worked fine with salloc but not with sbatch.

Now we have two versions of Slurm on our test system, and Forge does not work at all when run with the new version of Slurm! It fails whether I try salloc or sbatch.

For the time being let us ignore the difference between salloc and sbatch.

Let us focus on salloc not working with the new version.

On our test system we are using the 21.08.2 version of Slurm:

jfe01.% ll /apps/slurm
lrwxrwxrwx 1 root root 9 Oct 18 10:40 /apps/slurm -> slurmtest/
jfe01.%

jfe01.% ll /apps/slurm*/default
lrwxrwxrwx 1 slurm slurm  7 Oct 18 12:22 /apps/slurm/default -> 21.08.2/
lrwxrwxrwx 1 slurm slurm 10 Sep 14 14:17 /apps/slurmprod/default -> 20.11.7.p2/
lrwxrwxrwx 1 slurm slurm  7 Oct 18 12:22 /apps/slurmtest/default -> 21.08.2/
jfe01.%

Our production system is still using 20.11.7.p2.

Using salloc, Forge works fine on our production system but does not work on our test system.

I am including the output below:

jfe01.% cd S2/forge-test
jfe01.% module load intel impi forge
jfe01.% salloc -A nesccmgmt -q admin -n 128
salloc: Granted job allocation 4966599
salloc: Waiting for resource configuration
salloc: Nodes j1c[08-11] are ready for job
[Raghu.Reddy@j1c08 forge-test]$ setenv SLURM_OVERLAP 1
[Raghu.Reddy@j1c08 forge-test]$ map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Arm Forge 21.0 - Arm MAP

srun: error: Unable to create step for job 4966599: Invalid Trackable RESource (TRES) specification
MAP: Arm MAP could not launch the debuggers:
MAP: srun exited with code 1
[Raghu.Reddy@j1c08 forge-test]$
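For isolation, the step-creation failure can likely be reproduced without Forge at all, since the error above comes from srun itself rather than from MAP. A minimal sketch, run inside the same salloc shell (assuming, as in the eventual resolution, that Forge's launcher is what adds --gres=none to the srun line):

```shell
# Hypothetical minimal reproduction, without MAP.
# Assumption: Forge's wrapper adds --gres=none, and that flag is what breaks
# step creation under 21.08.2; the guard below is only so the sketch is
# harmless on hosts without Slurm.
if command -v srun >/dev/null 2>&1; then
    export SLURM_OVERLAP=1
    # On the affected 21.08.2 system this should fail with:
    #   "Unable to create step ... Invalid Trackable RESource (TRES) specification"
    srun --gres=none hostname
else
    echo "srun not available on this host"
fi
```

If the bare srun line fails the same way, Forge can be ruled out as the cause.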

After setting ALLINEA_USE_SSH_STARTUP to 1:

jfe01.% salloc -A nesccmgmt -q admin -n 128
salloc: Granted job allocation 4966600
salloc: Waiting for resource configuration
salloc: Nodes j1c[08-11] are ready for job
[Raghu.Reddy@j1c08 forge-test]$ module load intel impi forge
[Raghu.Reddy@j1c08 forge-test]$ setenv SLURM_OVERLAP 1
[Raghu.Reddy@j1c08 forge-test]$ setenv ALLINEA_USE_SSH_STARTUP 1
[Raghu.Reddy@j1c08 forge-test]$ map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Arm Forge 21.0 - Arm MAP

Profiling             : srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Allinea sampler       : not preloading (Express Launch)
MPI implementation    : Auto-Detect (SLURM (MPMD))
* number of processes : 128
* number of nodes     : 4
* Allinea MPI wrapper : not preloading (Express Launch)

MAP: Process 6:
MAP:
MAP: The Allinea sampler was not preloaded.
MAP: Check the user guide for instructions on how to link with the Allinea sampler.
[Raghu.Reddy@j1c08 forge-test]$



On our production system it works (I have edited program output for brevity):

hfe03.% module load intel impi forge
hfe03.% salloc -A nesccmgmt -q admin -n 128
salloc: Pending job allocation 27033112
salloc: job 27033112 queued and waiting for resources
salloc: job 27033112 has been allocated resources
salloc: Granted job allocation 27033112
salloc: Waiting for resource configuration
salloc: Nodes h13c[24,26,28,31] are ready for job
h13c24.%
h13c24.% setenv SLURM_OVERLAP 1
h13c24.% map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Arm Forge 21.0 - Arm MAP

Profiling             : srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Allinea sampler       : preload (Express Launch)
MPI implementation    : Auto-Detect (SLURM (MPMD))
* number of processes : 128
* number of nodes     : 4
* Allinea MPI wrapper : preload (JIT compiled) (Express Launch)



 NAS Parallel Benchmarks 3.3 -- MG Benchmark

 No input file. Using compiled defaults
 Size: 1024x1024x1024  (class D)
 Iterations:   50
 Number of processes:    128

 Initialization time:           0.663 seconds

  iter    1
...
...
  iter   50

 Benchmark completed
 VERIFICATION SUCCESSFUL
 L2 Norm is  0.1583275060429E-09
 Error is    0.6697470786978E-11


 MG Benchmark Completed.
 Class           =                        D
 Size            =           1024x1024x1024
 Iterations      =                       50
...
...
 MS T27A-1
 NASA Ames Research Center
 Moffett Field, CA  94035-1000

 Fax: 650-604-3957



MAP analysing program...
MAP gathering samples...
MAP generated /scratch2/SYSADMIN/nesccmgmt/Raghu.Reddy/forge-test/mg-intel-impi.D_128p_4n_1t_2021-12-30_15-08.map
h13c24.% exit
salloc: Relinquishing job allocation 27033112
hfe03.%

I will upload the slurm.conf file; it is the same for both versions.

Wish you a Very Happy New Year!

Thank you!
Comment 1 Marshall Garey 2022-01-03 09:33:45 MST
This is a duplicate of bug 12880 and is fixed in 21.08.5 by commit 8b7b1e7128f. You just need to upgrade to the latest 21.08 (and we always encourage being on the latest micro release of whatever major version you're on anyway).

You can read bug 12880 for details. Summary: --gres=none was broken, and Forge DDT uses --gres=none.

*** This ticket has been marked as a duplicate of ticket 12880 ***