|
Description
Raghu Reddy
2021-02-17 11:04:20 MST
Can you tell us what changed on the cluster? Did you do any upgrades, or has this issue started recently with no change to the cluster? It would be helpful to see the logs from the node and from the controller during the time that the scheduler thought the node was busy.

Created attachment 17987 [details]
Juno test system slurmctld.log
Created attachment 17988 [details]
Job syslog extraction containing slurmd logs - at debug level 5
Created attachment 17989 [details]
Job syslog extraction containing slurmd logs - at debug level 3
Created attachment 17990 [details]
latest slurm.conf file
(In reply to Jason Booth from comment #1)
> Can you tell us what changed on the cluster? Did you do any upgrades or has
> this issue recently started with no change to the cluster?

Hi Jason,

Yes, this issue has cropped up since we upgraded to Slurm 20.11.3 from 20.02.6.

I've attached the latest slurm.conf file as well as the slurmd logs (they go to syslog) in the job syslog. I've provided two job syslogs, one at debug level 5 and another at level 3.

Glad to provide anything else you may need.

Best,
Tony

Hi Tony and Raghu,

Thanks for the additional logs and config files. This is interesting; I'm going to give a recap to make sure I'm understanding it correctly. It looks like you're getting an allocation of 128 CPUs on 4 nodes using salloc. Then you are using srun to run the 'mg-intel-impi.D.128' program, which I assume uses all 128 CPUs. When you make the call to srun directly it works fine, but if you use 'map --profile' to make the same srun call it will work once, then fail when you try it a second time. Is that correct?

I have a few things I'd like to clarify. If you make a second call to srun (without map), do you see a similar failure? How much time passes between the two calls using map? Does the error message you get on the second attempt show up immediately, or does it take some time to come back with it? Before you make the second call to 'map', can you verify that the previous step isn't running (you can show job steps with 'squeue -j <job_id> -s')?

Thanks,
Ben

(In reply to Ben Roberts from comment #7)
> Hi Tony and Raghu,
>
> Thanks for the additional logs and config files. This is interesting, I'm
> going to give a recap to make sure I'm understanding it correctly. It looks
> like you're getting an allocation of 128 CPUs on 4 nodes using salloc. Then
> you are using srun to run the 'mg-intel-impi.D.128' program, which I assume
> uses all 128 CPUs. When you make the call to srun directly it works fine.
> But if you use 'map --profile' to make the same srun call it will work once,
> but fail when you try it a second time. Is that correct?
>
> I have a few things I'd like to clarify. If you make a second call to srun
> (without map) do you see a similar failure? How much time passes between
> the two calls using map? Does the error message you get on the second
> attempt show up immediately, or does it take some time to come back with it?
> Before you make the second call to 'map' can you verify that the previous
> step isn't running (you can show job steps with 'squeue -j <job_id> -s')?
>
> Thanks,
> Ben

Hi Ben,

The first part is correct: we request 128 CPUs (4 nodes get allocated) and then use srun to run on 128 CPUs. The choice of "salloc" vs. "sbatch" does not matter.

The following will work (assuming the intel, impi, and forge modules are loaded):

sbatch -A nesccmgmt -n 128 --wrap 'srun ~Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128'

The output will look like this (showing only the top part with "head"):

jfe01.% head slurm-5174726.out

 NAS Parallel Benchmarks 3.3 -- MG Benchmark

 No input file.
 Using compiled defaults
 Size: 1024x1024x1024 (class D)
 Iterations: 50
 Number of processes: 128

 Initialization time: 1.218 seconds
jfe01.%

Whereas the following will fail:

jfe01.% sbatch -A nesccmgmt -n 128 --wrap 'map --profile srun ~Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128'
Submitted batch job 5174727
jfe01.%
jfe01.% cat slurm-5174727.out
Arm Forge 20.1 - Arm MAP

QIODevice::write (QPipeWriter): device not open
srun: error: Unable to create step for job 5174727: Requested nodes are busy
MAP: Arm MAP could not launch the debuggers:
MAP: srun exited with code 1
_______________________________________________________________
Start Epilog v20.08.28 on node j1c08 for job 5174727 :: Thu Feb 18 18:07:39 UTC 2021
Job 5174727 (not serial) finished for user Raghu.Reddy in partition juno with exit code 1:0
_______________________________________________________________
End Epilogue v20.08.28 Thu Feb 18 18:07:39 UTC 2021
jfe01.%

So to clarify, it is not that a second attempt fails: the run works fine without "map --profile" but fails whenever it is used. This used to work fine just before the upgrade. The previous tests were done with "salloc" just for convenience.

Please let us know if you need any additional information. Thanks!

Ok, thanks for clarifying that. If you run something other than the mg-intel-impi.D.128 script do you see a similar failure? I would like to have you try running 'map --profile srun -vvv hostname' in the job and send the output.

I would also like to see if there is a difference when using the oversubscribe option:
'map --profile srun -vvv --oversubscribe ~Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128'

Thanks,
Ben

(In reply to Ben Roberts from comment #9)
> Ok, thanks for clarifying that. If you run something other than the
> mg-intel-impi.D.128 script do you see a similar failure? I would like to
> have you try running 'map --profile srun -vvv hostname' in the job and send
> the output.
>
> I would also like to see if there is a difference when using the
> oversubscribe option:
> 'map --profile srun -vvv --oversubscribe
> ~Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128'
>
> Thanks,
> Ben

Hi Ben,

Here are the output files from "-vvv hostname" and "-vvv --oversubscribe hostname"; the filenames indicate which is which.

Thanks!

Created attachment 18000 [details]
Output with map --profile srun -vvv hostname
Created attachment 18001 [details]
map --profile srun -vvv --oversubscribe hostname
Hi Raghu,

My apologies, I meant to have you try the '--overlap' flag, but I guess old habits die hard and I typed '--oversubscribe'. Can I have you try one more time like this:
'map --profile srun -vvv --overlap hostname'

I think this might be some side effect of the changes made in Slurm 20.11.3 that were described in greater detail in bug 10769. Let me know if that allows the job to run.

Thanks,
Ben

(In reply to Ben Roberts from comment #13)
> Hi Raghu,
>
> My apologies, I meant to have you try the '--overlap' flag, but I guess old
> habits die hard and I typed '--oversubscribe'. Can I have you try one more
> time like this:
> 'map --profile srun -vvv --overlap hostname'
>
> I think this might be some side effect of the changes made to Slurm 20.11.3
> that were described in greater detail in bug 10769. Let me know if that
> allows the job to run.
>
> Thanks,
> Ben

Hi Ben,

I did this and it is still not working:

sbatch -A nesccmgmt -n 128 --wrap 'set -x;map --profile srun -vvv --overlap hostname'

The top part of the output file is included below, and I will upload the full file separately in a moment:

+ map --profile srun -vvv --overlap hostname
Arm Forge 20.1 - Arm MAP

QIODevice::write (QPipeWriter): device not open
srun: defined options
srun: -------------------- --------------------
srun: (null)              : j1c[08-11]
srun: export              : LD_LIBRARY_PATH=/apps/forge/20.1/lib/64:/apps/intel/compilers_and_libraries_2018/linux/mpi/intel64/lib:/apps/slurm/default/lib:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/compiler/lib/intel64:/
...

Please let me know if I can provide any additional information. Thanks

Created attachment 18026 [details]
Output from 'set -x;map --profile srun -vvv --overlap hostname'
Output from:
sbatch -A nesccmgmt -n 128 --wrap 'set -x;map --profile srun -vvv --overlap hostname'
Thanks for trying the --overlap flag as well. I'm trying to trace what's happening that would lead to this error message appearing, but wonder if I could get one more piece of information from you. Can you start a job interactively again (with salloc) and then get details of the nodes assigned to the job, with and without 'map', like this:
scontrol show node $SLURM_NODELIST
map --profile scontrol show node $SLURM_NODELIST

Thanks,
Ben

(In reply to Ben Roberts from comment #16)
> Thanks for trying the --overlap flag as well. I'm trying to trace what's
> happening that would lead to this error message appearing, but wonder if I
> could get one more piece of information from you. Can you start a job
> interactively again (with salloc) and then get details of the nodes assigned
> to the job with and without 'map', like this:
> scontrol show node $SLURM_NODELIST
> map --profile scontrol show node $SLURM_NODELIST
>
> Thanks,
> Ben

Hi Ben,

It looks like that does not work with this tool anymore.
Here is the output; instead of using 128 tasks, I am asking for just 4 tasks to simplify the output:

---------------------------
jfe01.% salloc -A nesccmgmt -t 30 -n 4
salloc: Granted job allocation 5174735
salloc: Waiting for resource configuration
salloc: Nodes j1c12 are ready for job
j1c12.%
j1c12.% module load intel impi forge
j1c12.%
j1c12.% scontrol show node "$SLURM_NODELIST" |& tee out-without-map
NodeName=j1c12 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=40 CPUTot=40 CPULoad=0.02
   AvailableFeatures=compute,smallmem,Xeon6148,2.40G
   ActiveFeatures=compute,smallmem,Xeon6148,2.40G
   Gres=(null)
   NodeAddr=j1c12 NodeHostName=j1c12 Version=20.11.3
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
   RealMemory=94500 AllocMem=90000 FreeMem=90488 Sockets=2 Boards=1
   MemSpecLimit=2000
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=juno,novel
   BootTime=2021-02-11T15:55:20 SlurmdStartTime=2021-02-11T16:03:30
   CfgTRES=cpu=40,mem=94500M,billing=40
   AllocTRES=cpu=40,mem=90000M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

j1c12.%
j1c12.% map --profile scontrol show node "$SLURM_NODELIST" |& tee out-with-map
Arm Forge 20.1 - Arm MAP

/apps/slurm/default/bin/scontrol: unrecognized option '--task-prolog=/apps/forge/20.1/libexec/slurm-preloads'
Try "scontrol --help" for more information
MAP: Arm MAP could not launch your MPI job using the 'scontrol' command.
MAP:
MAP: Check that the correct MPI implementation is selected.
MAP:
MAP: /apps/slurm/default/bin/scontrol exited before reaching MPIR_Breakpoint.
j1c12.%
-------------------------

Please let me know if you need any other information. Thanks!

That's interesting that the 'scontrol show node' command doesn't work from inside the map command.
I'm not sure where the '--task-prolog=/apps/forge/20.1/libexec/slurm-preloads' that it reports as an unrecognized option is coming from, unless something is happening to the $SLURM_NODELIST variable we're passing to the command.

I would like to have you run a similar test where you call 'env' by itself and then again from within 'map --profile'.

I would also like to see what the cgroups look like for an interactive job while it's running. If you can, start a job with 'salloc', go to the following directory, and get a directory listing:
/sys/fs/cgroup/cpu/<node name>/<user id>/<job id>/

This assumes you have '/sys/fs/cgroup' configured as your cgroup mount point. You can verify the path configured by running:
scontrol show config | grep CgroupMountpoint

One more thing that I think would be worth trying would be to add '--core-spec=1' to the sbatch or salloc command you use to get the job allocation. This will reserve one core for system usage inside the job. If the problem originates because the map command is occupying one of the cores, then this may be a way to avoid that conflict.

Thanks,
Ben

(In reply to Ben Roberts from comment #18)
> That's interesting that the 'scontrol show node' command doesn't work from
> inside the map command. I'm not sure where the
> '--task-prolog=/apps/forge/20.1/libexec/slurm-preloads' is coming from that
> it says is an unrecognized option unless something is happening to the
> $SLURM_NODELIST variable we're passing to the command.
>
> I would like to have you run a similar test where you call 'env' by itself
> and then again from within 'map --profile'.
>
> I would also like to see what the cgroups look like for an interactive job
> while it's running. If you can start a job with 'salloc' and go to the
> following directory and get a directory listing:
> /sys/fs/cgroup/cpu/<node name>/<user id>/<job id>/
>
> This assumes you have '/sys/fs/cgroup' configured as your cgroup mount
> point.
> You can verify the path configured by running:
> scontrol show config | grep CgroupMountpoint
>
> One more thing that I think would be worth trying would be to add
> '--core-spec=1' to the sbatch or salloc command you use to get the job
> allocation. This will reserve one core for system usage inside the job. If
> the problem is originating because the map command is occupying one of the
> cores then this may be a way to avoid that conflict.
>
> Thanks,
> Ben

Hi Ben,

Tony and I did another test this morning, and it confirms that this problem is definitely caused by the recent Slurm upgrade. Tony reverted our test system to the previous version of Slurm, I ran the same test, and everything works like it should. The job completed successfully and generated the ".map" file like it should.

jfe01.% ll
total 1052
-rw-r--r-- 1 Raghu.Reddy nesccmgmt     942 Feb 22 17:52 hostname.5174742
-rw-r--r-- 1 Raghu.Reddy nesccmgmt 1065596 Feb 22 17:54 mg-intel-impi.D_128p_4n_2021-02-22_17-53.map
-rw-r--r-- 1 Raghu.Reddy nesccmgmt    2434 Feb 22 17:54 slurm-5174743.out
jfe01.%

I submitted two jobs, one with the "map" utility and one without, and both completed successfully. The two submit lines were:

sbatch -A nesccmgmt -n 128 --wrap 'srun ~Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128'
sbatch -A nesccmgmt -n 128 --wrap 'map --profile srun ~Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128'

Then I renamed the output files so that the names indicate which is which.
The difference between the two output files is the expected output:

jfe01.% diff -irw slurm-5174744-without-map.out slurm-with-map-5174743.out
0a1,14
> Arm Forge 20.1 - Arm MAP
>
> QIODevice::write (QPipeWriter): device not open
> Profiling             : srun /home/Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128
> Allinea sampler       : preload (Express Launch)
> MPI implementation    : Auto-Detect (SLURM (MPMD))
> * number of processes : 128
> * number of nodes     : 4
> * Allinea MPI wrapper : preload (JIT compiled) (Express Launch)
>
>
> MAP analysing program...
> MAP gathering samples...
> MAP generated /tds_scratch1/SYSADMIN/nesccmgmt/Raghu.Reddy/temp/Forge-test/mg-intel-impi.D_128p_4n_2021-02-22_17-53.map
10c24
<  Initialization time: 0.673 seconds
---
>  Initialization time: 0.828 seconds
34c48
<  Time in seconds = 14.11
---
>  Time in seconds = 15.12
37,38c51,52
<  Mop/s total     = 220637.25
<  Mop/s/process   = 1723.73
---
>  Mop/s total     = 205972.99
>  Mop/s/process   = 1609.16
69,70c83,84
< Start Epilog v20.08.28 on node j1c08 for job 5174744 :: Mon Feb 22 18:27:56 UTC 2021
< Job 5174744 (not serial) finished for user Raghu.Reddy in partition juno with exit code 0:0
---
> Start Epilog v20.08.28 on node j1c08 for job 5174743 :: Mon Feb 22 17:54:24 UTC 2021
> Job 5174743 (not serial) finished for user Raghu.Reddy in partition juno with exit code 0:0
72c86
< End Epilogue v20.08.28 Mon Feb 22 18:27:56 UTC 2021
---
> End Epilogue v20.08.28 Mon Feb 22 17:54:24 UTC 2021
jfe01.%

Please let me know if you need any additional information. Thanks,

Hi Raghu,

Thanks for doing that test to confirm that the version change did make a difference on your test system. I still think it would be good to get the information I was asking for previously, in case the difference is in how we handle a change in the environment.
I'll paste the instructions to gather what I'm looking for again, for your convenience:

I would like to have you run a similar test where you call 'env' by itself and then again from within 'map --profile'.

I would also like to see what the cgroups look like for an interactive job while it's running. If you can, start a job with 'salloc', go to the following directory, and get a directory listing:
/sys/fs/cgroup/cpu/<node name>/<user id>/<job id>/

This assumes you have '/sys/fs/cgroup' configured as your cgroup mount point. You can verify the path configured by running:
scontrol show config | grep CgroupMountpoint

One more thing that I think would be worth trying would be to add '--core-spec=1' to the sbatch or salloc command you use to get the job allocation. This will reserve one core for system usage inside the job. If the problem originates because the map command is occupying one of the cores, then this may be a way to avoid that conflict.

Thanks,
Ben

Created attachment 18090 [details]
Output files using Slurm 20.02.6 with patch p1
I ran the following jobs using a previous version of Slurm:
sbatch -A nesccmgmt -n 128 -o env.%j --wrap 'env'
sbatch -A nesccmgmt -n 128 -o map--profile_env.%j --wrap 'map --profile env'
module load intel impi forge
sbatch -A nesccmgmt -o map-profile_srun_mg.o%j -n 128 --wrap 'map --profile srun ~Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128'
sbatch -A nesccmgmt -o srun_mg.o%j -n 128 --wrap 'srun ~Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128'
Then I did an:
salloc -A nesccmgmt -t 30 -n 4
and then ran a bunch of commands interactively and saved the output in the file named "salloc.out"
I did the same with the Slurm 20.11.3 version and will upload the tar file.
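As an aside, the per-job cgroup directory Ben asks about can be composed from the pieces he lists. This is only a sketch following the layout in his instructions; job_cgroup_dir is a hypothetical helper, and the real layout on a given system should be checked against 'scontrol show config | grep CgroupMountpoint'.

```shell
# Sketch (assumption): build the cgroup directory path from Ben's template
#   <mount point>/cpu/<node name>/<user id>/<job id>/
job_cgroup_dir() {
    # $1 = cgroup mount point, $2 = node name, $3 = user id, $4 = job id
    echo "$1/cpu/$2/$3/$4"
}

# Usage from inside an salloc session (job id 5174735 is a placeholder):
#   ls -l "$(job_cgroup_dir /sys/fs/cgroup j1c12 "$(id -u)" 5174735)"
```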
Created attachment 18091 [details]
output files using Slurm 20.11.03 version
I ran the following jobs using Slurm 20.11.03:
sbatch -A nesccmgmt -n 128 -o env.%j --wrap 'env'
sbatch -A nesccmgmt -n 128 -o map--profile_env.%j --wrap 'map --profile env'
module load intel impi forge
sbatch -A nesccmgmt -o map-profile_srun_mg.o%j -n 128 --wrap 'map --profile srun ~Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128'
sbatch -A nesccmgmt -o srun_mg.o%j -n 128 --wrap 'srun ~Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128'
Then I did an:
salloc -A nesccmgmt -t 30 -n 4
and then ran a bunch of commands interactively and saved the output in the file named "salloc.out"
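One way to compare the 'env' dumps collected above, with and without 'map --profile', is to diff them after sorting so that ordering differences do not show up as noise. diff_envs is a hypothetical helper, not part of Slurm or Forge:

```shell
# Sketch: diff two 'env' output files while ignoring variable ordering.
# Pass the file written by the plain 'env' job and the one written by the
# 'map --profile env' job.
diff_envs() {
    diff <(sort "$1") <(sort "$2")
}

# Usage (filenames follow the -o patterns above; <jobid> is a placeholder):
#   diff_envs env.<jobid> map--profile_env.<jobid>
```

Lines prefixed '>' in the diff are variables present only in the map run; missing variables would show up prefixed '<'.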
Hi Ben,

I have uploaded two tar files containing similar tests, one done with the previous version that was used in production and one with the current version.

One thing that may help troubleshoot this problem more quickly is for you to watch over my shoulder while I run things, via screen sharing. Please let me know if doing a Google Meet with screen sharing would be possible.

Please let me know if you need any additional information.

Thanks!
Raghu

Thanks for collecting that information. It does look like the environment variables are still set correctly in the case that doesn't work. That doesn't explain where the --task-prolog option was coming from when you tried to run the 'scontrol' command. Is it possible that the map command is passing arguments to the Slurm commands that it runs, or do you have some other kind of filter that would be adding options?

I would like to see one more thing, to find out exactly what is being picked up by the srun command when it is called inside the map command. Can you run one more test and send the output of:
'map --profile srun -vvv ~Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128'
The '-vvv' causes srun to print debug output to the screen, and it will include a list of the options it was passed.

I do think it would be helpful to do a conference call if we can do some experimenting on your test system. If you put the test system back on 20.11 you do see the same behavior there, right? If we can enable/disable various features on the test system I think we should be able to make some progress.

Thanks,
Ben

Created attachment 18096 [details]
output files using Slurm 20.11.03 version with srun -vvv
Hi Ben,
Here is what you requested. The job was submitted with:
module load intel impi forge
sbatch -A nesccmgmt -o map-profile_srun-vvv_mg.o%j -n 128 --wrap 'map --profile srun -vvv ~Raghu.Reddy/S2/Testsuite/NPB3.3-MPI/bin/mg-intel-impi.D.128'
Thanks!
Raghu
Hi Raghu,

Do you have time for a debugging session some time today? I have another commitment at 1:00 Eastern Time, but I'm free besides that. Let me know when would work for you.

Thanks,
Ben

(In reply to Ben Roberts from comment #26)
> Hi Raghu,
>
> Do you have time for a debugging session some time today? I have another
> commitment at 1:00 Eastern Time, but I'm free besides that. Let me know
> when would work for you.
>
> Thanks,
> Ben

I have meetings at 2 and again at 4 ET. How about tomorrow at 11 ET or later in the day? And how do we do the screen share? Can I do a Google Meet and invite you? Or do you have other ways?

Thanks,
Raghu

Hi Raghu,

Tomorrow at 11 Eastern Time sounds good to me. If it's just going to be the two of us I can send a Zoom invitation; otherwise a Google meeting would be good.

Thanks,
Ben

Created attachment 18153 [details]
slurmd/syslogs for today's TS session with job 5174819
Hi Tony and Raghu,

Thanks for sending the logs from our debugging session on Friday. I did spend time going through them but still didn't see any smoking guns. The things I saw that looked like they might be related had to do with cgroups, which we disabled during our test on Friday, so I don't think they're the problem.

I did discuss this problem with a colleague and we feel like we're still shooting in the dark. The fact that the srun command works fine in a job, but the same srun command fails when called from within the 'map' profiler, means that something we can't see is modifying the command. We also see evidence of this from trying to run an scontrol command from within 'map': it's adding a requirement for a task prolog that scontrol doesn't recognize.

-----------------------
j1c12.% map --profile scontrol show node "$SLURM_NODELIST" |& tee out-with-map
Arm Forge 20.1 - Arm MAP

/apps/slurm/default/bin/scontrol: unrecognized option '--task-prolog=/apps/forge/20.1/libexec/slurm-preloads'
-----------------------

We would like to know if you can get any information about what is being done to the commands run by 'map'. If we can get an idea of what's being done, we will have a better idea of what to look for in the changes made in 20.11 that would be causing the difference in behavior.

Thanks,
Ben

(In reply to Ben Roberts from comment #31)
> Hi Tony and Raghu,
>
> Thanks for sending the logs from our debugging session on Friday. I did
> spend time going through them but still didn't see any smoking guns. The
> things I saw that looked like they might be related had to do with cgroups,
> which we disabled during our test on Friday, so I don't think they're the
> problem.
>
> I did discuss this problem with a colleague and we feel like we're shooting
> in the dark still.
> Since doing the srun command in a job works fine, but
> the same srun command fails when called from within the 'map' profiler means
> that something is being done to modify the command that we can't see. We
> also see evidence of this from trying to run an scontrol command from within
> 'map'. It's adding a requirement for a task prolog that scontrol doesn't
> recognize.
> -----------------------
> j1c12.% map --profile scontrol show node "$SLURM_NODELIST" |& tee
> out-with-map
> Arm Forge 20.1 - Arm MAP
>
> /apps/slurm/default/bin/scontrol: unrecognized option
> '--task-prolog=/apps/forge/20.1/libexec/slurm-preloads'
> -----------------------
>
> We would like to know if you can get any information about what is being
> done to the commands run by 'map'. If we can get an idea of what's being
> done we will have a better idea of what to look for in changes made in 20.11
> that would be causing the difference in behavior.
>
> Thanks,
> Ben

Hi Ben,

I have forwarded your question on to the ARM/Forge vendor and am waiting for a reply. I will let you know once I hear from them. Thanks!

Hi Raghu,

I just wanted to follow up and see if you've been able to get any information about what changes are being made when running srun from within the map command.

Thanks,
Ben

(In reply to Ben Roberts from comment #33)
> Hi Raghu,
>
> I just wanted to follow up and see if you've been able to get any
> information about what changes are being made when running srun from within
> the map command.
>
> Thanks,
> Ben

Hi Ben,

Unfortunately they have not been very responsive; I have pinged them again.

This is definitely related to an srun change that went in, as it works fine on our test cluster if we revert to the previous version of Slurm we were using.

Is there anything more we can do while we wait to hear from them? Are there other sites using ARM Forge that may be running into the same problem?

Thanks!
(In reply to Raghu Reddy from comment #34)
> Hi Ben,
>
> Unfortunately they have not been very responsive, I have pinged them again.
>
> This is definitely related to srun change that went in as it works fine on
> our test cluster if we revert to the previous version Slurm we were using.
>
> Any thing more we can do while we wait to hear from them? Are there other
> sites that could be using ARM Forge that may be running into the same
> problem?
>
> Thanks!

Hi Ben,

I got a response from ARM, and I am including it here as an FYI so that you can also look into it simultaneously.

<quote>
I was able to dig up the following information:

When you call 'map srun <executable>', it first runs 'srun --help'. It looks for --export in the output, and if it is detected MAP will use that to preload the sampling libraries. If it does not find --export in that output, it will instead use --task-prologue.

Based on the output you have provided, it seems that --export is not being found. Could you please paste the output of:

srun --version
srun --help
</quote>

I have sent that information to them. I will keep you posted; please let me know if you need any additional information. Thanks!

Thanks for passing on that information. I'm sure you've seen this, but the export option is still there for srun in 20.11.
$ srun --help | grep export
--export=env_vars|NONE environment variables passed to launcher with
$ srun -V
slurm 20.11.4
But it sounds like it's doing the same check for the scontrol command, which doesn't have an '--export' or '--task-prologue' option. So that explains why the scontrol command inside of map is failing like it is.
I ran some tests with different environment variables and the --export option but didn't see any difference between 20.02 and 20.11. I also looked through the list of changes and don't see anything that would affect the way --export works between 20.02 and 20.11.
I know I had you send the output of 'env' with and without 'map' in the different versions, and I didn't see any differences before, but I was looking primarily at the SLURM_ environment variables. I looked again at all the environment variables to see if something wasn't being set, but I didn't see anything missing. The output when env was called from within map did have some additional variables, but I think the more relevant thing is whether variables are missing.
I wonder if we can try the export command manually as a test on your system. Can you get a job allocation and then set an environment variable like TEST=RAGHU. Then run the following commands to verify that it is passed correctly.
srun --export=TEST env
map --profile env
It would be nice if we could see 'map --profile srun env', but I assume that will still fail. Let's start with the above to make sure it passes the variable correctly with --export.
Thanks,
Ben
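For reference, the launcher-detection heuristic ARM described above (probe the launcher's --help output for --export, otherwise fall back to a task prolog) can be sketched as follows. This is an assumption based on ARM's description, not Forge's actual code, and pick_preload_mechanism is a hypothetical name; it also shows why scontrol, whose help never mentions --export, ended up receiving --task-prolog.

```shell
# Rough sketch (assumption based on ARM's description): read a launcher's
# --help text on stdin and pick the mechanism used to preload the
# sampling libraries.
pick_preload_mechanism() {
    if grep -q -- '--export'; then
        echo "--export"        # srun 20.11 still advertises this flag
    else
        echo "--task-prolog"   # fallback; this is what scontrol received
    fi
}

# Example: srun's help mentions --export, scontrol's does not.
printf -- '--export=env_vars|NONE  environment variables\n' | pick_preload_mechanism   # prints --export
printf -- 'Usage: scontrol [OPTIONS...]\n' | pick_preload_mechanism                    # prints --task-prolog
```

Under this sketch, the observed failure would need the probe itself (or the injected flag) to behave differently on 20.11, rather than the --export flag being absent.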
Hi Raghu,

I wanted to check in with you to see if you've had a chance to try passing variables with the --export flag of srun. I also wonder if you've heard any additional feedback from ARM/Forge about what happens to the srun command. The lack of the '--export' flag for the scontrol command explains why it's throwing that error message, but I don't see any problem with exporting variables in my own tests with srun that would explain why that would cause any difference in behavior when run with 'map --profile'.

Thanks,
Ben

(In reply to Ben Roberts from comment #37)
> Hi Raghu,
>
> I wanted to check in with you to see if you've had a chance to try passing
> variables with the --export flag of srun. I also wonder if you've heard any
> additional feedback from ARM/Forge about what happens to the srun command.
> The lack of the '--export' flag for the scontrol command explains why it's
> throwing that error message, but I don't see any problem with exporting
> variables in my own test with srun that would explain why that would cause
> any difference in behavior when run with 'map --profile'.
>
> Thanks,
> Ben

Hi Ben,

I have been working with ARM (the vendor for Forge) and they have provided me with a fix. First I had to download their latest version, and for the time being we have to set the following environment variable for it to work:

setenv ALLINEA_DEBUG_SRUN_ARGS " %jobid% -I -W0 --overlap --oversubscribe --gres=none"

With this setting and the new version of Forge, the problem has been resolved. This has been tested on our test system, and we are in the process of installing it on our production system.

Thank you very much for your help, and please feel free to close this ticket.

Thanks for the update. That does make sense with the --overlap flag needing to be added. We did try that in comment 13, but it sounds like Forge may have been overriding the flag we passed. I'm glad to hear that it's working. Let us know if there's anything else we can do to help.
Thanks,
Ben

Hi Ben,
I have to reopen this issue as I am seeing a strange behavior that I don't understand.
I run the Forge application in two different ways, which I thought were identical, but one of them works and the other does not.
Approach 1: Do an salloc and then execute the program
Approach 2: Do sbatch with --wrap option
This problem is seen only when used with the "map" utility of Forge.
I am including the screen output from the two methods above; Both methods are done within the same session.
Approach 1:
jfe01.% module load intel impi forge
jfe01.% salloc -A nesccmgmt -t 120 -n 128
salloc: Granted job allocation 5192738
salloc: Waiting for resource configuration
salloc: Nodes j1c[08-11] are ready for job
j1c08.%
j1c08.% map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Arm Forge 21.0 - Arm MAP
Profiling : srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Allinea sampler : preload (Express Launch)
MPI implementation : Auto-Detect (SLURM (MPMD))
* number of processes : 128
* number of nodes : 4
* Allinea MPI wrapper : preload (JIT compiled) (Express Launch)
NAS Parallel Benchmarks 3.3 -- MG Benchmark
No input file. Using compiled defaults
Size: 1024x1024x1024 (class D)
Iterations: 50
Number of processes: 128
Initialization time: 0.807 seconds
iter 1
iter 5
iter 10
iter 15
iter 20
iter 25
iter 30
iter 35
iter 40
iter 45
iter 50
Benchmark completed
VERIFICATION SUCCESSFUL
L2 Norm is 0.1583275060429E-09
Error is 0.6697470786978E-11
MG Benchmark Completed.
Class = D
Size = 1024x1024x1024
Iterations = 50
Time in seconds = 15.20
Total processes = 128
Compiled procs = 128
Mop/s total = 204821.06
Mop/s/process = 1600.16
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3
Compile date = 12 Feb 2015
Compile options:
MPIF77 = ifort
FLINK = $(MPIF77)
FMPI_LIB = -lmpi
FMPI_INC = -I/usr/local/include
FFLAGS = -O3 -mcmodel medium -shared-intel
FLINKFLAGS = -mcmodel medium -shared-intel
RAND = randi8
Please send the results of this run to:
NPB Development Team
Internet: npb@nas.nasa.gov
If email is not available, send this to:
MS T27A-1
NASA Ames Research Center
Moffett Field, CA 94035-1000
Fax: 650-604-3957
MAP analysing program...
MAP gathering samples...
MAP generated /home/Raghu.Reddy/mg-intel-impi.D_128p_4n_2021-03-31_14-24.map
j1c08.%
Approach 2:
j1c08.% sbatch -A nesccmgmt -t 30 -n 128 --wrap 'map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128'
Submitted batch job 5192739
j1c08.%
j1c08.% cat slurm-5192739.out
Arm Forge 21.0 - Arm MAP
srun: error: Unable to create step for job 5192739: Requested nodes are busy
MAP: Arm MAP could not launch the debuggers:
MAP: srun exited with code 1
_______________________________________________________________
Start Epilog v20.08.28 on node j1c01 for job 5192739 :: Wed Mar 31 14:25:44 UTC 2021
Job 5192739 (not serial) finished for user Raghu.Reddy in partition juno with exit code 1:0
_______________________________________________________________
End Epilogue v20.08.28 Wed Mar 31 14:25:44 UTC 2021
j1c08.%
I believe our slurm.conf file was already uploaded to the case; if not, I can upload it again.
Can you please let us know why there is this difference in behavior?
Hi Raghu,

My apologies for the delayed response; I was out of the office last week. This seems like it is probably a difference in how Forge is handling the script you are passing with --wrap. If you run the same test with sbatch being passed a job script rather than --wrap, does the job work?

In your previous message you said that you had to set an environment variable like this:

setenv ALLINEA_DEBUG_SRUN_ARGS " %jobid% -I -W0 --overlap --oversubscribe --gres=none"

Where in the process is that being set? Is that happening for job scripts that are being passed with the --wrap flag?

Thanks,
Ben

Hi Ben,

That is correct: it works with "salloc" but not with "sbatch", whether I do it with "--wrap" or in a script file. I do the module loads and setenv outside of the job file in all cases. I don't think that matters, but I will also try doing the setenv inside the job file; that output is included at the end. Here is a copy and paste from my terminal session:

jfe01.% module load intel impi forge
jfe01.% setenv ALLINEA_DEBUG_SRUN_ARGS " %jobid% -I -W0 --overlap --oversubscribe --gres=none"
jfe01.%
jfe01.% cat ~/forge-test.job
#!/bin/bash -l
map --profile srun /tds_scratch1/SYSADMIN/nesccmgmt/Raghu.Reddy/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
jfe01.%
jfe01.% sbatch -A nesccmgmt -N2 -n 4 ~/forge-test.job
Submitted batch job 5192752
jfe01.%
jfe01.% cat slurm-5192752.out
Arm Forge 21.0 - Arm MAP
srun: error: Unable to create step for job 5192752: Requested nodes are busy
MAP: Arm MAP could not launch the debuggers:
MAP: srun exited with code 1
_______________________________________________________________
Start Epilog v20.08.28 on node j1c12 for job 5192752 :: Tue Apr 6 13:34:26 UTC 2021
Job 5192752 (not serial) finished for user Raghu.Reddy in partition juno with exit code 1:0
_______________________________________________________________
End Epilogue v20.08.28 Tue Apr 6 13:34:26 UTC 2021
jfe01.%
jfe01.% salloc -A nesccmgmt -q admin -t 30 -n 128
salloc: Granted job allocation 5192753
salloc: Waiting for resource configuration
salloc: Nodes j1c[08-11] are ready for job
j1c08.%
j1c08.% map --profile srun /tds_scratch1/SYSADMIN/nesccmgmt/Raghu.Reddy/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Arm Forge 21.0 - Arm MAP
Profiling : srun /tds_scratch1/SYSADMIN/nesccmgmt/Raghu.Reddy/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Allinea sampler : preload (Express Launch)
MPI implementation : Auto-Detect (SLURM (MPMD))
* number of processes : 128
* number of nodes : 4
* Allinea MPI wrapper : preload (JIT compiled) (Express Launch)
NAS Parallel Benchmarks 3.3 -- MG Benchmark
<edited for brevity>
MAP analysing program...
MAP gathering samples...
MAP generated /home/Raghu.Reddy/mg-intel-impi.D_128p_4n_2021-04-06_13-35.map
j1c08.%
jfe01.% cat ~/forge-test.job
#!/bin/bash -l
export ALLINEA_DEBUG_SRUN_ARGS=" %jobid% -I -W0 --overlap --oversubscribe --gres=none"
map --profile srun /tds_scratch1/SYSADMIN/nesccmgmt/Raghu.Reddy/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
jfe01.%
jfe01.% sbatch -A nesccmgmt -N2 -n 4 ~/forge-test.job
Submitted batch job 5192754
jfe01.%
jfe01.% cat slurm-5192754.out
Arm Forge 21.0 - Arm MAP
srun: error: Unable to create step for job 5192754: Requested nodes are busy
MAP: Arm MAP could not launch the debuggers:
MAP: srun exited with code 1
_______________________________________________________________
Start Epilog v20.08.28 on node j1c12 for job 5192754 :: Tue Apr 6 13:37:47 UTC 2021
Job 5192754 (not serial) finished for user Raghu.Reddy in partition juno with exit code 1:0
_______________________________________________________________
End Epilogue v20.08.28 Tue Apr 6 13:37:47 UTC 2021
jfe01.%

Please let me know if you need any additional information! Thanks!
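The workaround hinges on Forge substituting the real job id for the %jobid% placeholder in ALLINEA_DEBUG_SRUN_ARGS before it launches its debugger step. As a minimal illustration (this is not ARM Forge's actual code — the function name and expansion mechanics are assumptions), the substitution can be sketched in shell:

```shell
# Illustrative sketch only, NOT ARM Forge source. It shows how a template
# such as ALLINEA_DEBUG_SRUN_ARGS could have every %jobid% token replaced
# with the allocation's numeric job id before the extra arguments are
# appended to the srun command line.
expand_srun_args() {
    local template="$1" jobid="$2"
    # Bash pattern substitution: replace all occurrences of %jobid%.
    echo "${template//%jobid%/$jobid}"
}

expand_srun_args " %jobid% -I -W0 --overlap --oversubscribe --gres=none" 5192738
```

With --overlap and --oversubscribe present, the debugger step is allowed to share CPUs already allocated to the application step, which is why the "Requested nodes are busy" error disappears when the variable takes effect.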
(In reply to Ben Roberts from comment #41)

Hi Raghu,

Can you verify that the ALLINEA_DEBUG_SRUN_ARGS environment variable is being recognized (by running env) before you run the map command? I assume that variable is set when you're in the salloc job, is that right? In the sbatch job, can you also try setting the SLURM_OVERLAP environment variable to get srun to use that flag?

Thanks,
Ben

Hi Raghu,

I'm just following up to see if you've had a chance to look at my suggestions in comment 43. Let me know if you still need help with this ticket.

Thanks,
Ben

Hi Ben,
Here are the two attempts, with and without the SLURM_OVERLAP setting. In the first, SLURM_OVERLAP is commented out.
jfe01.% cat ~/forge-test.job
#!/bin/bash -l
module load intel impi forge
###export SLURM_OVERLAP=1
env
map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
jfe01.%
jfe01.% sbatch -A nesccmgmt -n 128 ~/forge-test.job
Submitted batch job 5192755
jfe01.%
jfe01.% cat slurm-5192755.out
MKLROOT=/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/mkl
LMOD_FAMILY_COMPILER_VERSION=18.0.5.274
I_MPI_HYDRA_BRANCH_COUNT=128
SLURM_NODELIST=j1c[08-11]
REMOTEHOST=hera-rsa.princeton.rdhpcs.noaa.gov
SLURM_JOB_NAME=forge-test.job
MANPATH=/apps/forge/21.0/share/man:/apps/intel/compilers_and_libraries_2018/linux/mpi/man:/apps/intel/parallel_studio_xe_2018.4.057/man/common:/apps/intel/parallel_studio_xe_2018.4.057/documentation_2018/en/debugger/gdb-ia/man:/apps/intel/parallel_studio_xe_2018.4.057/documentation_2018/en/debugger/gdb-igfx/man:/apps/lmod/lmod/share/man:/apps/local/man:/apps/slurm/default/share/man:/apps/slurm/tools/sbank/share/man::
XDG_SESSION_ID=24789
_ModuleTable003_=bGludXgtY2VudG9zNy14ODZfNjQiLCIvYXBwcy9sbW9kL2xtb2QvbW9kdWxlZmlsZXMvQ29yZSIsIi9hcHBzL21vZHVsZXMvbW9kdWxlZmlsZXMvTGludXgiLCIvYXBwcy9tb2R1bGVzL21vZHVsZWZpbGVzIiwiL29wdC9jcmF5L21vZHVsZWZpbGVzIiwiL29wdC9jcmF5L2NyYXlwZS9kZWZhdWx0L21vZHVsZWZpbGVzIiwiL2FwcHMvbW9kdWxlcy9tb2R1bGVmYW1pbGllcy9pbnRlbCIsIi9hcHBzL21vZHVsZXMvbW9kdWxlZmFtaWxpZXMvaW50ZWxfaW1waSIsfSxbInN5c3RlbUJhc2VNUEFUSCJdPSIvYXBwcy9sbW9kL2xtb2QvbW9kdWxlZmlsZXMvQ29yZTovYXBwcy9tb2R1bGVzL21vZHVsZWZpbGVzL0xpbnV4Oi9hcHBzL21vZHVsZXMvbW9kdWxlZmlsZXM6L29wdC9jcmF5
SLURMD_NODENAME=j1c08
SLURM_TOPOLOGY_ADDR=jroot0.s2.j1c08
SPACK_ROOT=/home/Raghu.Reddy/spack
LMOD_ANCIENT_TIME=1
HOSTNAME=j1c08
__LMOD_REF_COUNT_CLASSPATH=/apps/intel/compilers_and_libraries_2018/linux/mpi/intel64/lib/mpi.jar:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/daal/lib/daal.jar:1
SLURM_PRIO_PROCESS=0
INTEL_LICENSE_FILE=/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/licenses:/apps/intel/licenses
SLURM_NODE_ALIASES=(null)
__LMOD_REF_COUNT_MODULEPATH=/home/Raghu.Reddy/modules:2;/tds_scratch1/SYSADMIN/nesccmgmt/Raghu.Reddy/spack/spack/local/modules/linux-centos7-x86_64:1;/apps/lmod/lmod/modulefiles/Core:1;/apps/modules/modulefiles/Linux:1;/apps/modules/modulefiles:1;/opt/cray/modulefiles:1;/opt/cray/craype/default/modulefiles:1;/apps/modules/modulefamilies/intel:1;/apps/modules/modulefamilies/intel_impi:1
SBATCH_IGNORE_PBS=1
HOST=jfe01
SHELL=/bin/tcsh
TERM=screen
HISTSIZE=1000
SLURM_JOB_QOS=Added as default
GDBSERVER_MIC=/apps/intel/parallel_studio_xe_2018.4.057/debugger_2018/gdb/targets/intel64/x200/bin/gdbserver
TMPDIR=/tmp
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
MODULEPATH_ROOT=/apps/modules/modulefiles
SSH_CLIENT=140.208.152.8 47055 22
LIBRARY_PATH=/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/compiler/lib/intel64_lin:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/mkl/lib/intel64_lin:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/tbb/lib/intel64/gcc4.7:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/daal/lib/intel64_lin:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/daal/../tbb/lib/intel64_lin/gcc4.4
LMOD_PACKAGE_PATH=/apps/lmod/etc
LMOD_PKG=/apps/lmod/7.7.18
QTDIR=/usr/lib64/qt-3.3
QTINC=/usr/lib64/qt-3.3/include
LMOD_VERSION=7.7.18
LMOD_SHORT_TIME=86400
SSH_TTY=/dev/pts/2
__LMOD_REF_COUNT_LOADEDMODULES=intel/18.0.5.274:1;impi/2018.4.274:1;forge/21.0:1
SLURM_MEM_PER_CPU=2250
QT_GRAPHICSSYSTEM_CHECKED=1
TARG_TITLE_BAR_PAREN=
SLURM_NNODES=4
GROUP=nesccmgmt
USER=Raghu.Reddy
LD_LIBRARY_PATH=/apps/intel/compilers_and_libraries_2018/linux/mpi/intel64/lib:/apps/slurm/default/lib:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/compiler/lib/intel64:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/ipp/lib/intel64:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/compiler/lib/intel64_lin:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/mkl/lib/intel64_lin:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/tbb/lib/intel64/gcc4.7:/apps/intel/parallel_studio_xe_2018.4.057/debugger_2018/libipt/intel64/lib:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/daal/lib/intel64_lin:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/daal/../tbb/lib/intel64_lin/gcc4.4
LMOD_sys=Linux
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
_ModuleTable004_=L21vZHVsZWZpbGVzOi9vcHQvY3JheS9jcmF5cGUvZGVmYXVsdC9tb2R1bGVmaWxlcyIsfQ==
CPATH=/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/ipp/include:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/mkl/include:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/pstl/include:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/tbb/include:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/daal/include
SLURM_JOBID=5192755
HOSTTYPE=x86_64-linux
TERMCAP=SC|screen|VT 100/ANSI X3.64 virtual terminal:\
:DO=\E[%dB:LE=\E[%dD:RI=\E[%dC:UP=\E[%dA:bs:bt=\E[Z:\
:cd=\E[J:ce=\E[K:cl=\E[H\E[J:cm=\E[%i%d;%dH:ct=\E[3g:\
:do=^J:nd=\E[C:pt:rc=\E8:rs=\Ec:sc=\E7:st=\EH:up=\EM:\
:le=^H:bl=^G:cr=^M:it#8:ho=\E[H:nw=\EE:ta=^I:is=\E)0:\
:li#47:co#123:am:xn:xv:LP:sr=\EM:al=\E[L:AL=\E[%dL:\
:cs=\E[%i%d;%dr:dl=\E[M:DL=\E[%dM:dc=\E[P:DC=\E[%dP:\
:im=\E[4h:ei=\E[4l:mi:IC=\E[%d@:ks=\E[?1h\E=:\
:ke=\E[?1l\E>:vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l:\
:ti=\E[?1049h:te=\E[?1049l:us=\E[4m:ue=\E[24m:so=\E[3m:\
:se=\E[23m:mb=\E[5m:md=\E[1m:mr=\E[7m:me=\E[m:ms:\
:Co#8:pa#64:AF=\E[3%dm:AB=\E[4%dm:op=\E[39;49m:AX:\
:vb=\Eg:G0:as=\E(0:ae=\E(B:\
:ac=\140\140aaffggjjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~..--++,,hhII00:\
:po=\E[5i:pf=\E[4i:Km=\E[M:k0=\E[10~:k1=\EOP:k2=\EOQ:\
:k3=\EOR:k4=\EOS:k5=\E[15~:k6=\E[17~:k7=\E[18~:\
:k8=\E[19~:k9=\E[20~:k;=\E[21~:F1=\E[23~:F2=\E[24~:\
:F3=\E[1;2P:F4=\E[1;2Q:F5=\E[1;2R:F6=\E[1;2S:\
:F7=\E[15;2~:F8=\E[17;2~:F9=\E[18;2~:FA=\E[19;2~:kb:\
:K2=\EOE:kB=\E[Z:kF=\E[1;2B:kR=\E[1;2A:*4=\E[3;2~:\
:*7=\E[1;2F:#2=\E[1;2H:#3=\E[2;2~:#4=\E[1;2D:%c=\E[6;2~:\
:%e=\E[5;2~:%i=\E[1;2C:kh=\E[1~:@1=\E[1~:kH=\E[4~:\
:@7=\E[4~:kN=\E[6~:kP=\E[5~:kI=\E[2~:kD=\E[3~:ku=\EOA:\
:kd=\EOB:kr=\EOC:kl=\EOD:km:
__LMOD_REF_COUNT__LMFILES_=/apps/modules/modulefiles/intel/18.0.5.274:1;/apps/modules/modulefamilies/intel/impi/2018.4.274:1;/apps/modules/modulefiles/forge/21.0:1
SLURM_NTASKS=128
LMOD_FAMILY_MPI_VERSION=2018.4.274
LMOD_PREPEND_BLOCK=normal
NLSPATH=/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/compiler/lib/intel64/locale/%l_%t/%N:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/mkl/lib/intel64_lin/locale/%l_%t/%N:/apps/intel/parallel_studio_xe_2018.4.057/debugger_2018/gdb/intel64/share/locale/%l_%t/%N
__LMOD_REF_COUNT_NLSPATH=/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/compiler/lib/intel64/locale/%l_%t/%N:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/mkl/lib/intel64_lin/locale/%l_%t/%N:1;/apps/intel/parallel_studio_xe_2018.4.057/debugger_2018/gdb/intel64/share/locale/%l_%t/%N:1
SLURM_TASKS_PER_NODE=32(x4)
_ModuleTable001_=X01vZHVsZVRhYmxlXz17WyJNVHZlcnNpb24iXT0zLFsiY19yZWJ1aWxkVGltZSJdPWZhbHNlLFsiY19zaG9ydFRpbWUiXT1mYWxzZSxkZXB0aFQ9e30sZmFtaWx5PXtbImNvbXBpbGVyIl09ImludGVsIixbIm1waSJdPSJpbXBpIix9LG1UPXtmb3JnZT17WyJmbiJdPSIvYXBwcy9tb2R1bGVzL21vZHVsZWZpbGVzL2ZvcmdlLzIxLjAiLFsiZnVsbE5hbWUiXT0iZm9yZ2UvMjEuMCIsWyJsb2FkT3JkZXIiXT0zLHByb3BUPXt9LFsic3RhY2tEZXB0aCJdPTAsWyJzdGF0dXMiXT0iYWN0aXZlIixbInVzZXJOYW1lIl09ImZvcmdlIix9LGltcGk9e1siZm4iXT0iL2FwcHMvbW9kdWxlcy9tb2R1bGVmYW1pbGllcy9pbnRlbC9pbXBpLzIwMTguNC4yNzQiLFsiZnVsbE5hbWUiXT0iaW1w
MAIL=/var/spool/mail/Raghu.Reddy
PATH=/apps/forge/21.0/bin:/apps/intel/compilers_and_libraries_2018/linux/mpi/intel64/bin:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/bin/intel64:/scratch4/SYSADMIN/nesccmgmt/Raghu.Reddy/FGA/TensorFlow/anaconda3/bin:/home/Raghu.Reddy/spack/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/apps/local/bin:/apps/local/sbin:/apps/slurm/default/slurm/tools:/apps/slurm/default/bin:/apps/slurm/default/sbin:/apps/slurm/tools/sbank/bin:.:/home/Raghu.Reddy/bin:/apps/slurm/default/tools:/home/Raghu.Reddy/.local/bin
SLURM_WORKING_CLUSTER=juno:10.187.0.170:6817:9216:101
SLURM_CONF=/apps/slurm/20.11.3/etc/slurm.conf
STY=248079.pts-2.jfe01
IPATH_NO_BACKTRACE=1
SLURM_JOB_ID=5192755
LMOD_SETTARG_CMD=:
SLURM_JOB_USER=Raghu.Reddy
PWD=/home/Raghu.Reddy/S2/temp/Forge-test
_LMFILES_=/apps/modules/modulefiles/intel/18.0.5.274:/apps/modules/modulefamilies/intel/impi/2018.4.274:/apps/modules/modulefiles/forge/21.0
X509_USER_CERT=/home/Raghu.Reddy/.globus/usercert.pem
GDB_CROSS=/apps/intel/parallel_studio_xe_2018.4.057/debugger_2018/gdb/intel64/bin/gdb-ia
KDE_IS_PRELINKED=1
MODULEPATH=/home/Raghu.Reddy/modules:/tds_scratch1/SYSADMIN/nesccmgmt/Raghu.Reddy/spack/spack/local/modules/linux-centos7-x86_64:/apps/lmod/lmod/modulefiles/Core:/apps/modules/modulefiles/Linux:/apps/modules/modulefiles:/opt/cray/modulefiles:/opt/cray/craype/default/modulefiles:/apps/modules/modulefamilies/intel:/apps/modules/modulefamilies/intel_impi
LOADEDMODULES=intel/18.0.5.274:impi/2018.4.274:forge/21.0
__LMOD_REF_COUNT_INFOPATH=/apps/intel/parallel_studio_xe_2018.4.057/documentation_2018/en/debugger/gdb-ia/info:1;/apps/intel/parallel_studio_xe_2018.4.057/documentation_2018/en/debugger/gdb-igfx/info:1
SLURM_JOB_UID=537
_ModuleTable_Sz_=4
KDEDIRS=/usr
SLURM_NODEID=0
I_MPI_F90=ifort
SLURM_SUBMIT_DIR=/tds_scratch1/SYSADMIN/nesccmgmt/Raghu.Reddy/temp/Forge-test
I_MPI_CC=icc
SLURM_TASK_PID=68144
SLURM_NPROCS=128
LMOD_CMD=/apps/lmod/7.7.18/libexec/lmod
SLURM_CPUS_ON_NODE=40
SQUEUE_FORMAT=%.10i %.9P %.9q %.20j %.15u %.3t %.6M %.9l %.6D %R
MPM_LAUNCHER=/apps/intel/parallel_studio_xe_2018.4.057/debugger_2018/mpm/mic/bin/start_mpm.sh
HISTCONTROL=ignoredups
SLURM_PROCID=0
ENVIRONMENT=BATCH
LMOD_REDIRECT=yes
INTEL_PYTHONHOME=/apps/intel/parallel_studio_xe_2018.4.057/debugger_2018/python/intel64/
SLURM_JOB_NODELIST=j1c[08-11]
I_MPI_HYDRA_PMI_CONNECT=alltoall
ALLINEA_DEBUG_SRUN_ARGS= %jobid% -I -W0 --overlap --oversubscribe --gres=none
SHLVL=3
HOME=/home/Raghu.Reddy
__LMOD_REF_COUNT_PATH=/apps/forge/21.0/bin:1;/apps/intel/compilers_and_libraries_2018/linux/mpi/intel64/bin:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/bin/intel64:1;/scratch4/SYSADMIN/nesccmgmt/Raghu.Reddy/FGA/TensorFlow/anaconda3/bin:1;/home/Raghu.Reddy/spack/bin:2;/usr/lib64/qt-3.3/bin:1;/usr/local/bin:1;/usr/bin:1;/usr/local/sbin:1;/usr/sbin:1;/opt/ibutils/bin:1;/apps/local/bin:2;/apps/local/sbin:2;/apps/slurm/default/slurm/tools:2;/apps/slurm/default/bin:3;/apps/slurm/default/sbin:3;/apps/slurm/tools/sbank/bin:3;.:2;/home/Raghu.Reddy/bin:3;/apps/slurm/default/tools:1;/home/Raghu.Reddy/.local/bin:1
SLURM_LOCALID=0
GLOBUS_TCP_PORT_RANGE=40000,46999
OSTYPE=linux
__LMOD_REF_COUNT_CPATH=/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/ipp/include:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/mkl/include:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/pstl/include:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/tbb/include:2;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/daal/include:1
X509_USER_PROXY=/home/Raghu.Reddy/.globus/usercert.pem
SLURM_JOB_GID=18001
SLURM_JOB_CPUS_PER_NODE=40(x4)
SLURM_CLUSTER_NAME=juno
_ModuleTable002_=aS8yMDE4LjQuMjc0IixbImxvYWRPcmRlciJdPTIscHJvcFQ9e30sWyJzdGFja0RlcHRoIl09MCxbInN0YXR1cyJdPSJhY3RpdmUiLFsidXNlck5hbWUiXT0iaW1waSIsfSxpbnRlbD17WyJmbiJdPSIvYXBwcy9tb2R1bGVzL21vZHVsZWZpbGVzL2ludGVsLzE4LjAuNS4yNzQiLFsiZnVsbE5hbWUiXT0iaW50ZWwvMTguMC41LjI3NCIsWyJsb2FkT3JkZXIiXT0xLHByb3BUPXt9LFsic3RhY2tEZXB0aCJdPTAsWyJzdGF0dXMiXT0iYWN0aXZlIixbInVzZXJOYW1lIl09ImludGVsIix9LH0sbXBhdGhBPXsiL2hvbWUvUmFnaHUuUmVkZHkvbW9kdWxlcyIsIi90ZHNfc2NyYXRjaDEvU1lTQURNSU4vbmVzY2NtZ210L1JhZ2h1LlJlZGR5L3NwYWNrL3NwYWNrL2xvY2FsL21vZHVsZXMv
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=jfe01
SLURM_JOB_PARTITION=juno
BASH_ENV=/apps/lmod/lmod/init/bash
VENDOR=unknown
LMOD_arch=x86_64
MACHTYPE=x86_64
LOGNAME=Raghu.Reddy
QTLIB=/usr/lib64/qt-3.3/lib
GLOBUS_TCP_SOURCE_RANGE=40000,46999
WINDOW=2
CLASSPATH=/apps/intel/compilers_and_libraries_2018/linux/mpi/intel64/lib/mpi.jar:/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/daal/lib/daal.jar
SLURM_JOB_ACCOUNT=nesccmgmt
SSH_CONNECTION=140.208.152.8 47055 140.208.193.20 22
__LMOD_REF_COUNT_LIBRARY_PATH=/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/compiler/lib/intel64_lin:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/mkl/lib/intel64_lin:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/tbb/lib/intel64/gcc4.7:2;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/daal/lib/intel64_lin:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/daal/../tbb/lib/intel64_lin/gcc4.4:1
SLURM_JOB_NUM_NODES=4
MODULESHOME=/apps/lmod/7.7.18
PKG_CONFIG_PATH=/apps/intel/compilers_and_libraries_2018/linux/mkl/bin/pkgconfig
__LMOD_REF_COUNT_LD_LIBRARY_PATH=/apps/intel/compilers_and_libraries_2018/linux/mpi/intel64/lib:1;/apps/slurm/default/lib:2;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/compiler/lib/intel64:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/ipp/lib/intel64:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/compiler/lib/intel64_lin:2;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/mkl/lib/intel64_lin:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/tbb/lib/intel64/gcc4.7:2;/apps/intel/parallel_studio_xe_2018.4.057/debugger_2018/libipt/intel64/lib:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/daal/lib/intel64_lin:1;/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/daal/../tbb/lib/intel64_lin/gcc4.4:1
LMOD_SETTARG_FULL_SUPPORT=no
LESSOPEN=||/usr/bin/lesspipe.sh %s
LMOD_FAMILY_COMPILER=intel
INFOPATH=/apps/intel/parallel_studio_xe_2018.4.057/documentation_2018/en/debugger/gdb-ia/info:/apps/intel/parallel_studio_xe_2018.4.057/documentation_2018/en/debugger/gdb-igfx/info
LMOD_FULL_SETTARG_SUPPORT=no
__LMOD_REF_COUNT_PKG_CONFIG_PATH=/apps/intel/compilers_and_libraries_2018/linux/mkl/bin/pkgconfig:1
__LMOD_REF_COUNT_INTEL_LICENSE_FILE=/apps/intel/parallel_studio_xe_2018.4.057/compilers_and_libraries_2018/linux/licenses:1;/apps/intel/licenses:1
XDG_RUNTIME_DIR=/run/user/537
DISPLAY=localhost:11.0
QT_PLUGIN_PATH=/usr/lib64/kde4/plugins:/usr/lib/kde4/plugins
__LMOD_REF_COUNT_MANPATH=/apps/forge/21.0/share/man:1;/apps/intel/compilers_and_libraries_2018/linux/mpi/man:1;/apps/intel/parallel_studio_xe_2018.4.057/man/common:1;/apps/intel/parallel_studio_xe_2018.4.057/documentation_2018/en/debugger/gdb-ia/man:1;/apps/intel/parallel_studio_xe_2018.4.057/documentation_2018/en/debugger/gdb-igfx/man:1;/apps/lmod/lmod/share/man:1;/apps/local/man:2;/apps/slurm/default/share/man:3;/apps/slurm/tools/sbank/share/man:3
LMOD_DIR=/apps/lmod/7.7.18/libexec
I_MPI_TMI_PROVIDER=psm
LMOD_FAMILY_MPI=impi
LMOD_COLORIZE=yes
I_MPI_ROOT=/apps/intel/compilers_and_libraries_2018/linux/mpi
BASH_FUNC_module()=() { eval $($LMOD_CMD bash "$@") && eval $(${LMOD_SETTARG_CMD:-:} -s sh)
}
BASH_FUNC_ml()=() { eval $($LMOD_DIR/ml_cmd "$@")
}
_=/usr/bin/env
Arm Forge 21.0 - Arm MAP
srun: error: Unable to create step for job 5192755: Requested nodes are busy
MAP: Arm MAP could not launch the debuggers:
MAP: srun exited with code 1
_______________________________________________________________
Start Epilog v20.08.28 on node j1c08 for job 5192755 :: Fri Apr 16 13:38:44 UTC 2021
Job 5192755 (not serial) finished for user Raghu.Reddy in partition juno with exit code 1:0
_______________________________________________________________
End Epilogue v20.08.28 Fri Apr 16 13:38:44 UTC 2021
jfe01.%
jfe01.% cat ~/forge-test.job
#!/bin/bash -l
module load intel impi forge
export SLURM_OVERLAP=1
env | grep SLURM
map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
jfe01.%
jfe01.% sbatch -A nesccmgmt -n 128 ~/forge-test.job
Submitted batch job 5192756
jfe01.%
jfe01.% cat slurm-5192756.out
SLURM_NODELIST=j1c[08-11]
SLURM_JOB_NAME=forge-test.job
SLURMD_NODENAME=j1c08
SLURM_TOPOLOGY_ADDR=jroot0.s2.j1c08
SLURM_PRIO_PROCESS=0
SLURM_OVERLAP=1
SLURM_NODE_ALIASES=(null)
SLURM_JOB_QOS=Added as default
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
SLURM_MEM_PER_CPU=2250
SLURM_NNODES=4
SLURM_JOBID=5192756
SLURM_NTASKS=128
SLURM_TASKS_PER_NODE=32(x4)
SLURM_WORKING_CLUSTER=juno:10.187.0.170:6817:9216:101
SLURM_CONF=/apps/slurm/20.11.3/etc/slurm.conf
SLURM_JOB_ID=5192756
SLURM_JOB_USER=Raghu.Reddy
SLURM_JOB_UID=537
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/tds_scratch1/SYSADMIN/nesccmgmt/Raghu.Reddy/temp/Forge-test
SLURM_TASK_PID=74844
SLURM_NPROCS=128
SLURM_CPUS_ON_NODE=40
SLURM_PROCID=0
SLURM_JOB_NODELIST=j1c[08-11]
SLURM_LOCALID=0
SLURM_JOB_GID=18001
SLURM_JOB_CPUS_PER_NODE=40(x4)
SLURM_CLUSTER_NAME=juno
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=jfe01
SLURM_JOB_PARTITION=juno
SLURM_JOB_ACCOUNT=nesccmgmt
SLURM_JOB_NUM_NODES=4
Arm Forge 21.0 - Arm MAP
srun: error: Unable to create step for job 5192756: Requested nodes are busy
MAP: Arm MAP could not launch the debuggers:
MAP: srun exited with code 1
jfe01.%
Please let me know if you need any additional information. Also, if you think another shared session would be helpful, we can try it next week.
Thanks!
Hi Raghu,

I asked a colleague to take a look at this with me today, and he noticed that we have debug logs from slurmctld and debug information from the srun command, but they don't overlap. He was looking specifically at job 5174731, from comment 15. We would like to get the slurmctld logs and the slurmd logs from j1c08 for that period of time if you still have them. We would also like to see the output of 'sacct -D -o all -p -j 5174731'. If you don't have the logs from that job any more, would you be able to run a similar test and gather the logging and sacct information for it? If you could increase the debug level to debug2 for both slurmctld and slurmd for the duration of that test, it would be helpful.

Thanks,
Ben

Hi Raghu,

I wanted to follow up on this ticket. I know we've asked for a lot of output from you already and are asking for more still to try to get to the bottom of things. Is this something you can reproduce one more time for us, gathering the information I requested in comment 46?

Thanks,
Ben

Hi Ben,

You were looking for some logs, and since Tony was away on vacation we have not had time to look into this. Tony is back now, so we should be able to try again next week (he may still be catching up after coming back from vacation). Thanks!

(In reply to Ben Roberts from comment #47)

Sounds good.
Thanks for the update.

Hi Ben,

Before I gather the logs I just want to make sure I am gathering the information correctly. Since there was a somewhat long gap, I wanted to reconfirm. You want us to gather the following:

- slurmctld logs
- slurmd logs
- output of 'sacct -D -o all -p -j <jobID>'

This is what I have in my job file; is this what you want me to run to collect the logs above?

jfe01.% cat ~/forge-test.job
#!/bin/bash -l
module load intel impi forge
export SLURM_OVERLAP=1
env | grep SLURM
map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
jfe01.%

Please let me know if I have it right, and also if you think any additional information is needed. To summarize the problem precisely: the map utility from ARM Forge works fine if I do "salloc" and then execute the job, but fails if I put the same commands in a batch script and submit it using sbatch. Thanks!

(In reply to Ben Roberts from comment #49)

Sure, it's good to clarify exactly what we are hoping to gather from this test. Before beginning the test I'd like you to increase the log level of the controller for the duration of the test (scontrol setdebug debug2). If you could restart slurmd to increase the debug level on the node you're going to be testing, that would also be helpful. You can then run a test job to reproduce the issue. If you could add a test where you just try to run 'hostname' with the --overlap flag added, that would be helpful; you could add it to the beginning of the job script. And just to be clear, I'm expecting that the ALLINEA_DEBUG_SRUN_ARGS environment variable will be set.
Maybe check for that variable as well as the SLURM variables. With those changes the script might look like this:

#!/bin/bash -l
module load intel impi forge
map --profile srun -vvv --overlap hostname
export SLURM_OVERLAP=1
env | egrep 'SLURM|ALLINEA'
map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128

Then we would like to see the logs that were generated during that time along with the following output:

- the job output file
- scontrol show job <jobid>
- sacct -D -o all -p -j <jobid>

Thanks,
Ben

Created attachment 19292 [details]
slurmd/syslogs for today's TS session with job 5453035
Created attachment 19293 [details]
sacct output for job 5453035
Created attachment 19294 [details]
terminal output for job 5453035
Created attachment 19295 [details]
job output for job 5453035
Created attachment 19296 [details]
job output for job 5453035
Created attachment 19297 [details]
core files generated during job 5453035
Hi Tony and Raghu,

Thanks for collecting that information. I'm looking into some cgroup-related log entries from the nodes, but I did notice that the output you sent didn't include the slurmctld logs. I don't know for sure that we'll need them, but since you just ran the test, now is the time to ask for them in case we do. Would you mind attaching those logs as well? I'll keep looking into the entries in the slurmd logs.

Thanks,
Ben

Created attachment 19334 [details]
juno slurmctld.log as of 2021-05-05
Thanks, Tony. The thing I find most suspicious is that the task prolog is being skipped because Slurm can't find the environment variables that communicate the resource limits that should be set:

May 4 13:32:18 j1c08 PROLOG-TASK: Job 5453035: Skipping slurm task prolog for step 0 and procid 30
May 4 13:32:18 j1c08 slurmstepd[178507]: debug: Couldn't find SLURM_RLIMIT_CPU in environment
May 4 13:32:18 j1c08 slurmstepd[178507]: debug: Couldn't find SLURM_RLIMIT_FSIZE in environment
May 4 13:32:18 j1c08 slurmstepd[178507]: debug: Couldn't find SLURM_RLIMIT_DATA in environment
May 4 13:32:18 j1c08 slurmstepd[178507]: debug: Couldn't find SLURM_RLIMIT_STACK in environment
May 4 13:32:18 j1c08 slurmstepd[178507]: debug: Couldn't find SLURM_RLIMIT_CORE in environment
May 4 13:32:18 j1c08 slurmstepd[178507]: debug: Couldn't find SLURM_RLIMIT_RSS in environment
May 4 13:32:18 j1c08 slurmstepd[178507]: debug: Couldn't find SLURM_RLIMIT_NPROC in environment
May 4 13:32:18 j1c08 slurmstepd[178507]: debug: Couldn't find SLURM_RLIMIT_NOFILE in environment
May 4 13:32:18 j1c08 slurmstepd[178507]: debug: Couldn't find SLURM_RLIMIT_MEMLOCK in environment
May 4 13:32:18 j1c08 slurmstepd[178507]: debug: Couldn't find SLURM_RLIMIT_AS in environment
May 4 13:32:18 j1c08 slurmstepd[178507]: debug2: Set task rss(90000 MB)

Since you're seeing this on your test system, can I have you confirm that this is the problem by disabling the TaskProlog long enough to run one more test?

Thanks,
Ben

(In reply to Ben Roberts from comment #61)

Ben, I'm confused about your request:

> Since you're seeing this on your test system can I have you confirm that
> this is the problem by disabling the TaskProlog long enough to run one more
> test?

The TaskProlog script was in use a very long time ago, but is no longer.
It has the following initial lines:

#!/bin/bash

# Slurm Task Prolog Script

# Identify where we're coming from and for what job
JMSG="PROLOG-TASK Job $SLURM_JOB_ID:"
logger -t $JMSG "Skipping slurm task prolog for step $SLURM_STEP_ID and procid $SLURM_PROCID"
exit 0
...

It just runs long enough to log what step and procid it is at in the job and immediately exits. It's the first line in the output of your last comment:

> May 4 13:32:18 j1c08 PROLOG-TASK: Job 5453035: Skipping slurm task prolog for step 0 and procid 30

Is your concern that this tidbit of code is affecting Forge? Or are you referring to something else? Tony.

Sorry I didn't expound on what I was thinking a little more. I think the problem isn't with the prolog itself, but Slurm does a check for environment variables that it expects to be available to the task prolog. It looks like it's failing to find those variables and so failing to run the prolog, which I suspect is causing the task to fail to start. To confirm this theory: if we don't have the task prolog in place, then it won't fail on the check for those variables and will be able to complete the task, or it may fail at the task epilog, but at least it would be able to run the main part of the task. So if you could disable the task prolog and run through the test the same way you did last time (with the debug logging enabled), that should confirm whether that is the problem.

Thanks,
Ben

Created attachment 19356 [details]
tarred data requested for job 5453036
Thank you for running that test again without the prolog. It did make it past the task prolog in the most recent test, but had similar errors for the step itself. As I looked into where the logs for the step came from, I could see that they shouldn't be enough to stop the job from running. I looked over this with a colleague and we noticed something else in the logs that has us wondering what's happening. In the slurmctld logs from the test you ran on the 4th, it shows that the step was cancelled by the user.

[2021-05-04T13:32:30.541] debug2: Processing RPC: REQUEST_CANCEL_JOB_STEP from UID=537

I don't think that Raghu is manually running scancel, but we can't explain where that came from. We'd like to focus the investigation a little more around that with some additional debug flags. Can I ask you to run this test one more time with the following debug flags?

SlurmctldDebug=debug3
DebugFlags=select_type,steps

You can run it with or without the task prolog, but the information you gathered previously, along with the slurmd logs from the nodes and the slurmctld logs, should help us get some more information about what's going on with the steps.

Thanks,
Ben

Hi Tony and Raghu,

I'm sure you guys are busy with other projects. I'm going to lower the severity of this ticket to 4 while we wait for you to have a chance to collect the logs one more time with the 'select_type' and 'steps' debug flags enabled.

Thanks,
Ben

(In reply to Ben Roberts from comment #67)
> Hi Tony and Raghu,
>
> I'm sure you guys are busy with other projects. I'm going to lower the
> severity of this ticket to 4 while we wait for you to have a chance to
> collect the logs one more time with the 'select_type' and 'steps' debug
> flags enabled.
>
> Thanks,
> Ben

Yes, Ben. You know us well. Sorry for the delay. We had to switch slurm trees in order to test 20.11.6 for use at Boulder. They just went to that version during their Tuesday monthly maintenance. Yes - the problem existed there too.
We're back to the 20.11.3 branch now. I just added your flags to our config as: DebugFlags=Reservation,SelectType,steps and we're ready for testing. I'll get those uploaded to you within the next hour or so. Thanks for your patience. Tony. Created attachment 19500 [details]
tarred data requested for job 5453570
Ben,
Let us know if you want to have another interactive session - if that makes things easier for you.
Tony
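For readers following along: the debug settings described above reduce to a couple of slurm.conf lines plus a reconfigure. This is a sketch, not the exact commands used on this cluster; the parameter and flag names are the standard slurm.conf/scontrol ones, so double-check them against your Slurm version's man pages.

```shell
# slurm.conf (controller side): raise log verbosity and enable
# step/select-plugin debugging
#   SlurmctldDebug=debug3
#   DebugFlags=SelectType,Steps
# Apply the file change without restarting the daemons:
scontrol reconfigure

# The same settings can also be toggled at runtime, without editing slurm.conf:
scontrol setdebug debug3
scontrol setdebugflags +Steps
```

Running `scontrol show config | grep -i debug` afterwards confirms what the controller actually picked up.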
Thank you for running that test one more time with the extra debug flags; it does show more information about what is happening. What I see in the slurmctld logs is that the nodes attempt to find processors that can be allocated to the job, but it fails with a message about there not being enough usable memory.

[2021-05-14T15:23:46.371] STEPS: _pick_step_nodes: JobId=5453570 Currently running steps use 40 of allocated 40 CPUs on node j1c08
[2021-05-14T15:23:46.371] STEPS: _pick_step_nodes: JobId=5453570 Based on --mem-per-cpu=2250 we have 0 usable cpus on node, usable memory was: 0
[2021-05-14T15:23:46.371] STEPS: _pick_step_nodes: JobId=5453570 No task can start on node

When this happens it does prompt the scheduler to cancel that step.

[2021-05-14T15:23:47.765] STEPS: Processing RPC details: REQUEST_CANCEL_JOB_STEP StepId=5453570.0
[2021-05-14T15:23:47.765] STEPS: _kill_job_step: Cancel of JobId=5453570 StepId=0 by UID=537 usec=231

This does look like we're getting closer to the root cause. To confirm this theory, could you run one more test for me? You can have Slurm not enforce memory requirements by changing the select type parameter to 'SelectTypeParameters=CR_Core'. Then if you can run the same test with the 'steps' debug flag enabled, I'll review the logs one more time to see what happened (if it fails this time). If we can confirm that the problem is that the step doesn't think there is any memory available, then we can dig deeper into what is keeping the step from having access to that memory.

Thanks,
Ben

Created attachment 19559 [details]
tarred data requested 20210519
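The CR_Core experiment requested above amounts to swapping one slurm.conf parameter. A sketch of the change and how it would be applied (the systemd unit name is an assumption on my part, not taken from the ticket):

```shell
# slurm.conf: stop treating memory as a consumable resource for scheduling
#   SelectTypeParameters=CR_Core        # was: CR_Core_Memory
#
# Changes to select plugin parameters generally need a daemon restart
# rather than just 'scontrol reconfigure':
systemctl restart slurmctld
```

With CR_Core in place, step creation no longer checks --mem-per-cpu against the node's remaining memory, which fits the test results reported below: the runs with CR_Core passed and the run with the normal CR_Core_Memory settings failed.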
(In reply to Ben Roberts from comment #71) Ben, I agree ...

> This does look like we're getting closer to the root cause.

I uploaded a bunch of data for you to review. Raghu ran three tests/jobs this time. The first was as you requested. It took us by surprise that the job worked fine that time, so we wanted to confirm that was not a fluke. We then changed the parameters back in slurm.conf, restarted the slurmctld again, and ran a second job. That proved the job still failed with the normal slurm.conf settings. And finally, we ran a third job to reconfirm that the setting you requested is what made the difference. Sure enough, the third job passed. All three jobs used the exact same compute nodes. I uploaded those compute node syslogs, the slurmctld logs, and the normal data we've been providing you that cover all three jobs. Standing by for your direction. Tony.

Hi Tony,

I'm glad to hear that we're getting it narrowed down. I'm struggling to make sense of this, though, since there isn't another step active that would occupy the memory, and I'm sure the map command isn't designed to occupy the memory of the command it's trying to profile. I know we've tried this in the past, but it was a while back and I would like to have you do one last sanity check. Can you do one more test where you call an srun command with and without map? Something like this:

#!/bin/bash -l
srun -vvv --overlap hostname
module load intel impi forge
map --profile srun -vvv --overlap hostname

Thanks,
Ben

Hi Ben,

Before I run this test we need some additional clarification please.
- Do you want this test run with our "production" setting or with the "modified" setting with cgroups turned off?
- Do you want just the output of this job or do you need the log files also?

Thanks!
Raghu

Hi Raghu,

Sorry to not clarify better what I was looking for. For this test I would like to have you use 'SelectTypeParameters=CR_Core_Memory', as you have it on your production system.
I would also like to have you enable the 'steps' debug flag and have you send the logs for me to review. Thanks, Ben Created attachment 19944 [details]
tarred data requested for job 5453744
Hi Tony and Raghu,

Thank you for running that test one more time and sending the output. My apologies for not responding sooner, but I've been trying to make sense of what is happening. I don't have an answer for why yet, but since it's been a while I wanted to send an explanation of what the logs show. The first part of the job script you have is to run 'srun --overlap hostname', which runs without a problem. In the logs I can see that it allocates the resources correctly to the job steps.

[2021-06-14T16:00:51.925] STEPS: _pick_step_nodes: JobId=5453744 Currently running steps use 0 of allocated 40 CPUs on node j1c12
[2021-06-14T16:00:51.925] STEPS: _pick_step_nodes: JobId=5453744 Based on --mem-per-cpu=2250 we have 40 usable cpus on node, usable memory was: 90000
[2021-06-14T16:00:51.925] STEPS: _pick_step_nodes: JobId=5453744 Currently running steps use 0 of allocated 40 CPUs on node j1c13
[2021-06-14T16:00:51.925] STEPS: _pick_step_nodes: JobId=5453744 Based on --mem-per-cpu=2250 we have 40 usable cpus on node, usable memory was: 90000

When the job step completes it also frees the resources as I would expect.

[2021-06-14T16:00:52.091] STEPS: step dealloc on job node 0 (j1c12) used: 0 of 40 CPUs
[2021-06-14T16:00:52.091] STEPS: step dealloc on job node 1 (j1c13) used: 0 of 40 CPUs
[2021-06-14T16:00:52.091] STEPS: _slurm_rpc_step_complete StepId=5453744.0 usec=316

But when it tries to start the second step, it shows that the usable memory is 0, so it reports that the nodes are busy.

[2021-06-14T16:01:35.241] STEPS: _pick_step_nodes: JobId=5453744 Currently running steps use 40 of allocated 40 CPUs on node j1c12
[2021-06-14T16:01:35.241] STEPS: _pick_step_nodes: JobId=5453744 Based on --mem-per-cpu=2250 we have 0 usable cpus on node, usable memory was: 0
[2021-06-14T16:01:35.241] STEPS: _pick_step_nodes: JobId=5453744 No task can start on node
[2021-06-14T16:01:35.241] STEPS: _slurm_rpc_job_step_create for JobId=5453744: Requested nodes are busy

There are a few questions I don't have an answer for yet about this. The biggest is: if there is something about calling srun from within 'map --profile' that uses the available memory, then why don't we see the same behavior when you run the same command from inside an 'salloc' job? I tried looking through the documentation for the 'map' command (https://developer.arm.com/documentation/101136/2102/MAP) to see if there are any options to control how much memory it requests/consumes, but I didn't see anything like that. I am still thinking about the issue and getting some input from colleagues as well, but I wanted to send an update. If you have any thoughts about why it might not report any usable memory when called with map, I would be glad to hear them as well.

Thanks,
Ben

Hi Tony and Raghu,

We have a few more pieces of information we'd like to have you collect in a test job. I know you've already run test jobs several times for this issue, so I appreciate your willingness to keep running tests to collect different pieces of information. We would like to have you modify your job script to look like this:

#!/bin/bash -l
#SBATCH -A nesccmgmt -N2 -n 2 -t 5
srun -vvv --overlap hostname
module load intel impi forge
env
strace -o ./outfile -f -s250 map --profile srun -vvv --overlap hostname

You can set up the test just like you did for the last one, where you use CR_Core_Memory and enable the 'steps' debug flag and 'debug3' for the log level. Then we would like to have you do the following:
1. Submit the above test job.
2. Run 'scontrol show step <jobid>'
3. Start an 'salloc' job with similar options.
4. Call 'env' and the 'strace' commands like you did in the sbatch job.
5. Run 'scontrol show step <jobid>' for the salloc job.
6. Collect output for these two jobs and the slurmctld logs for us to review.

Thanks,
Ben

Created attachment 20159 [details]
tarred data requested for jobs 5453967 and 5453968
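The six collection steps above could be driven roughly as follows. This is a sketch: the script name, the archive name, and the slurmctld log path are placeholders I introduced, not values from the ticket.

```shell
# 1-2: submit the sbatch test and inspect its steps while they are active
jobid=$(sbatch --parsable test.job)   # test.job: the script listed above
scontrol show step "$jobid"

# 3-5: repeat the same commands interactively under salloc
salloc -A nesccmgmt -N2 -n 2 -t 5
# inside the allocation:
#   env
#   strace -o ./outfile -f -s250 map --profile srun -vvv --overlap hostname
#   scontrol show step "$SLURM_JOB_ID"

# 6: bundle job output and controller logs for review
tar czf forge-debug.tar.gz slurm-*.out outfile /var/log/slurm/slurmctld.log
```

Note that `scontrol show step` only reports steps while they exist, so it is worth running it while the strace'd map command is still executing.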
The output from this test run is interesting. It looks like the job steps that were launched with 'map' completed successfully this time. Here are the last few lines of output from job 5453968, showing that it successfully executed 'hostname' on the two nodes allocated: $ tail slurm-5453968.out MAP: Check that the correct MPI implementation is selected. MAP: MAP: /apps/slurm/default/bin/srun exited before reaching MPIR_Breakpoint. j1c13 j1c12 _______________________________________________________________ Start Epilog v20.08.28 on node j1c12 for job 5453968 :: Tue Jun 29 18:06:13 UTC 2021 Job 5453968 (not serial) finished for user Raghu.Reddy in partition juno with exit code 1:0 _______________________________________________________________ End Epilogue v20.08.28 Tue Jun 29 18:06:13 UTC 2021 The slurmctld logs that I referenced last time, showing that it couldn't start the second job step because the usable memory was 0, show the correct amount of usable memory this time. I've been combing through logs and output from this run and previous runs to see if there is something I can point to that has changed that would explain why this run succeeded. The only change I can identify is that the job step launched with 'map' was wrapped in an 'strace' command. Can you confirm that this is the case by running one more test using sbatch where one step is created with just 'srun' a second step is created with 'map : srun' and a third is created with 'strace : map : srun'? Collecting all the data may make the testing more involved for you. If so you don't need to worry about collecting all the logs and job output, I'm just interested to see if using strace prevents the issue with the memory for the job that we've been trying to track down. Thanks, Ben (In reply to Ben Roberts from comment #88) > The output from this test run is interesting.... Ben - We left out a key piece of info. We just upgraded to 20.11.7 in preparation for next week's maintenance period. 
Sorry about that. Hi Ben,
I just wanted to be sure that the problem had not gone away because of the upgrade so I ran the "original" test again; what I confirmed is the following:
- The original issue is still there that Forge map still does not work if I use sbatch
- It does work correctly if I do the same commands with "salloc"
jfe01.% cat ~/forge-test.job
#!/bin/bash -l
module load intel impi forge
map --profile srun -vvv --overlap hostname
export SLURM_OVERLAP=1
env | egrep 'SLURM|ALLINEA'
map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
jfe01.%
jfe01.% sbatch -A nesccmgmt -n 128 -q urgent ~/forge-test.job
The output file from the above submit will be uploaded shortly.
The output with salloc is included below:
jfe01.% salloc -A nesccmgmt -n 128 -q urgent
salloc: Granted job allocation 6053293
salloc: Waiting for resource configuration
salloc: Nodes j1c[01-02,04-05] are ready for job
[Raghu.Reddy@j1c01 Forge-test-2021-06-29-prod]$ module load intel impi forge
[Raghu.Reddy@j1c01 Forge-test-2021-06-29-prod]$ export SLURM_OVERLAP=1
export: Command not found.
[Raghu.Reddy@j1c01 Forge-test-2021-06-29-prod]$ setenv SLURM_OVERLAP 1
[Raghu.Reddy@j1c01 Forge-test-2021-06-29-prod]$ map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Arm Forge 21.0 - Arm MAP
Profiling : srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Allinea sampler : preload (Express Launch)
MPI implementation : Auto-Detect (SLURM (MPMD))
* number of processes : 128
* number of nodes : 4
* Allinea MPI wrapper : preload (JIT compiled) (Express Launch)
NAS Parallel Benchmarks 3.3 -- MG Benchmark
No input file. Using compiled defaults
Size: 1024x1024x1024 (class D)
Iterations: 50
Number of processes: 128
Initialization time: 0.752 seconds
iter 1
iter 5
iter 10
iter 15
iter 20
iter 25
iter 30
iter 35
iter 40
iter 45
iter 50
Benchmark completed
VERIFICATION SUCCESSFUL
L2 Norm is 0.1583275060429E-09
Error is 0.6697470786978E-11
MG Benchmark Completed.
Class = D
Size = 1024x1024x1024
Iterations = 50
Time in seconds = 14.79
Total processes = 128
Compiled procs = 128
Mop/s total = 210535.22
Mop/s/process = 1644.81
Operation type = floating point
Verification = SUCCESSFUL
Version = 3.3
Compile date = 12 Feb 2015
Compile options:
MPIF77 = ifort
FLINK = $(MPIF77)
FMPI_LIB = -lmpi
FMPI_INC = -I/usr/local/include
FFLAGS = -O3 -mcmodel medium -shared-intel
FLINKFLAGS = -mcmodel medium -shared-intel
RAND = randi8
Please send the results of this run to:
NPB Development Team
Internet: npb@nas.nasa.gov
If email is not available, send this to:
MS T27A-1
NASA Ames Research Center
Moffett Field, CA 94035-1000
Fax: 650-604-3957
MAP analysing program...
MAP gathering samples...
MAP generated /tds_scratch1/SYSADMIN/nesccmgmt/Raghu.Reddy/temp/Forge-test-2021-06-29-prod/mg-intel-impi.D_128p_4n_2021-07-01_12-20.map
[Raghu.Reddy@j1c01 Forge-test-2021-06-29-prod]$
[Raghu.Reddy@j1c01 Forge-test-2021-06-29-prod]$ ll -rt
total 18352
-rw-r--r-- 1 Raghu.Reddy nesccmgmt 187 Jun 29 16:52 strace.job
-rw-r--r-- 1 Raghu.Reddy nesccmgmt 32320 Jun 29 18:03 slurm-5453967.out
-rw-r--r-- 1 Raghu.Reddy nesccmgmt 31996 Jun 29 18:06 slurm-5453968.out
-rw-r--r-- 1 Raghu.Reddy nesccmgmt 17628428 Jun 29 18:09 outfile
-rw-r--r-- 1 Raghu.Reddy nesccmgmt 35550 Jun 29 18:16 terminal-output-2021-06-29
-rw-r--r-- 1 Raghu.Reddy nesccmgmt 10313 Jun 30 20:01 slurm-5822895.out
-rw-r--r-- 1 Raghu.Reddy nesccmgmt 1040770 Jul 1 12:20 mg-intel-impi.D_128p_4n_2021-07-01_12-20.map
[Raghu.Reddy@j1c01 Forge-test-2021-06-29-prod]$ exit
salloc: Relinquishing job allocation 6053293
jfe01.%
Thanks!
Created attachment 20197 [details]
Job output file from today with the 20.11.7 version
Hi Raghu,

Thanks for confirming that you still see the issue in 20.11.7 as it was reported (where sbatch fails and salloc works). I would still like to see if starting a task with strace consistently prevents the problem from manifesting, both for jobs that just run 'hostname' and for your other test job. Can you run a test job that looks like this:

-----------------------------------------
#!/bin/bash -l
#SBATCH -A nesccmgmt -N2 -n 2 -t 5
echo "Task 1"
srun -vvv --overlap hostname
echo "Task 2"
module load intel impi forge
map --profile srun -vvv --overlap hostname
echo "Task 3"
strace -o ./outfile -f -s250 map --profile srun -vvv --overlap hostname
echo "Task 4"
strace -o ./outfile -f -s250 map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
-----------------------------------------

At this point I'm just interested in the status of these tasks and don't need the debug logging, if that makes it easier for you to run this test.

Thanks,
Ben

Hi Raghu,

I'm sure you have been busy with other projects, but I wanted to follow up and see if you have had a chance to run one more test, seeing if we can get strace to give us the information we're hoping for. I appreciate your patience in running all these tests to try to track down the issue.

Thanks,
Ben

Hi Ben,

Tony is off this week (and I believe next too); we will pick it up again after he is back. It is strange that it works with "salloc" (and then running interactively) but not from a batch script and sbatch. Thanks!

Thanks for the update Raghu. I agree that the difference between the two submission methods is very strange. I'll wait to hear what you find when Tony is back in the office.

Thanks,
Ben

Created attachment 20770 [details]
patch
Hi Raghu,
One more update on this bug. One of my colleagues recently put together a patch for an issue that he thinks might be related to what you're seeing. I'm attaching it to this ticket. Would you be able to apply this on your test system and see if it helps? The patch should apply to 20.11. If you have any problems with it please let me know.
Thanks,
Ben
Hi Raghu and Tony,

I wanted to follow up and see if you've had a chance to try the patch I sent on your test system. Let me know if you have any questions about applying it.

Thanks,
Ben

(In reply to Ben Roberts from comment #99) Hi Ben

> I wanted to follow up and see if you've had a chance to try the patch I sent
> on your test system. Let me know if you have any questions about applying
> it.

My fault for the lack of progress on this. I've been on PTO and working the account fairshare/priority issue. Now that it's behind us, I'll make this a priority and work with Raghu this week to see if we can make progress. Tony.

(In reply to Ben Roberts from comment #99) Ben, Is this patch already included in some newer version of 20.11? Tony

It isn't included in a version of 20.11, but it was added to the 21.08 branch. It will be in 21.08.1, which we are hoping to get released this week or next.

Thanks,
Ben

Ben, Installed the patch, restarted slurm with the patches, and retested. Same result - sorry. What would you like to see next? Tony

Hi Tony,

My apologies for the delay in getting back to you. I've been mulling this over and spent some time today talking with a colleague about the behavior we've observed and what might be going on. Since this ticket has been going on for so long, I've put together a brief recap of what we've found.
1. After upgrading from 20.02.6 to 20.11.3 you saw job steps failing to be created when using 'map --profile' to call 'srun' to create the steps.
2. What we thought was the solution initially was that Forge provided you with an option to set an environment variable (ALLINEA_DEBUG_SRUN_ARGS) that would allow you to pass the '--overlap' flag to srun when called by 'map'.
3. This worked for salloc jobs, but you realized later that jobs created with sbatch still failed when created in the same way as the salloc jobs.
4. We have compared environments between the two types of jobs and verified that they are the same.
We see that calling srun without 'map --profile' works correctly in both types of jobs. We also tried to collect information about these steps using strace, but when using strace both types of steps worked correctly.

After discussing this with a colleague, the thing that kept coming up is that it sounds like there's a job step trying to create another job step, and that's what is causing the failure. We obviously don't see direct evidence of something like 'srun' calling 'srun' to create a step, but we're suspicious of the map command. It doesn't seem like it would use its own step creation tool, but we can't rule it out. We see that you call 'module load intel impi forge' each time before you use map. Are the intel and impi modules required for the tool to work? It looks like map is designed to be integrated with different message-passing programs. Do you know if it makes any calls to srun or mpirun or anything similar?

We also have a copy of your slurm.conf, but it was uploaded early in the life of this ticket. Would you mind uploading a new copy from your test system (where I think you're doing most of these tests) so we can make sure we're looking at up-to-date information?

Thanks,
Ben

Hi Ben,

Your summary is correct. The intel and impi modules have to be loaded because without them the MPI application will not work. I don't know the answer to your question; I have forwarded your message to the Forge team to see if they can provide any information. Thank you for looking into this!

Hi Ben,

Sorry it took a while to update this case, as we didn't have any additional information. Here is some information from ARM (the vendor of the Forge tool) that may provide some additional information:

-------------------------
To answer SchedMD's questions about map's behavior: Yes, we do another srun for Slurm to "attach to the current job" and launch our backend on all job nodes. It should be possible to disable this by setting the environment variable ALLINEA_USE_SSH_STARTUP=1, assuming SSH is enabled (and ideally password-less). There are potential scalability issues with using SSH - how many compute nodes are anticipated to be used with the Forge tools? Other systems, including our own internal system, have updated Slurm but do not experience any issues with the Forge tools. Can you think of anything special that was done to set up Slurm on your system? Is there a restriction where it is not possible to create additional job steps? We could help you further if you can provide us some log files. Regards,
--------------------------

Please let me know if you have any questions.

Hi Raghu,

Thanks for the update. It is good to hear confirmation that the map tool is making an srun call and that there is a workaround to prevent it from making that call. Have you tried setting the ALLINEA_USE_SSH_STARTUP environment variable to confirm that it allows these jobs to start?

Thanks,
Ben

Hi Raghu,

I wanted to follow up and verify that you were able to test using the ALLINEA_USE_SSH_STARTUP environment variable and that things work as expected with that variable.

Thanks,
Ben

Hi Raghu,

It seems like you have a solution with the ALLINEA_USE_SSH_STARTUP environment variable. I haven't heard any follow-up questions, so I'll go ahead and close this ticket. Let us know if there's anything else we can do to help.

Thanks,
Ben

Hi Ben,

Hope you had a wonderful holiday! Not sure if I should submit a new ticket, but I am replying to this one because it was never quite resolved, and since Tony has retired we are short-staffed and could not get you the logs/files you needed. Now the issue is this. If you remember, Forge worked fine with salloc but not with sbatch; this was the basic problem.
Now we have two versions of Slurm on our test system, and Forge does not work at all when run with the new version of Slurm! Whether I try salloc or sbatch, it does not work. For the time being let us ignore the difference between salloc and sbatch; let us focus on salloc not working with the new version. On our test system we are using the 21.08.2 version of Slurm:

jfe01.% ll /apps/slurm
lrwxrwxrwx 1 root root 9 Oct 18 10:40 /apps/slurm -> slurmtest/
jfe01.%
jfe01.% ll /apps/slurm*/default
lrwxrwxrwx 1 slurm slurm 7 Oct 18 12:22 /apps/slurm/default -> 21.08.2/
lrwxrwxrwx 1 slurm slurm 10 Sep 14 14:17 /apps/slurmprod/default -> 20.11.7.p2/
lrwxrwxrwx 1 slurm slurm 7 Oct 18 12:22 /apps/slurmtest/default -> 21.08.2/
jfe01.%

Our production system is still using 20.11.7.p2. Using salloc, Forge works fine on our production system but does not work on our test system. I am including the output below:

jfe01.% cd S2/forge-test
jfe01.% module load intel impi forge
jfe01.% salloc -A nesccmgmt -q admin -n 128
salloc: Granted job allocation 4966599
salloc: Waiting for resource configuration
salloc: Nodes j1c[08-11] are ready for job
[Raghu.Reddy@j1c08 forge-test]$ setenv SLURM_OVERLAP 1
[Raghu.Reddy@j1c08 forge-test]$ map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
Arm Forge 21.0 - Arm MAP
srun: error: Unable to create step for job 4966599: Invalid Trackable RESource (TRES) specification
MAP: Arm MAP could not launch the debuggers:
MAP: srun exited with code 1
[Raghu.Reddy@j1c08 forge-test]$

After setting ALLINEA_USE_SSH_STARTUP to 1:

jfe01.% salloc -A nesccmgmt -q admin -n 128
salloc: Granted job allocation 4966600
salloc: Waiting for resource configuration
salloc: Nodes j1c[08-11] are ready for job
[Raghu.Reddy@j1c08 forge-test]$ module load intel impi forge
[Raghu.Reddy@j1c08 forge-test]$ setenv SLURM_OVERLAP 1
[Raghu.Reddy@j1c08 forge-test]$ setenv ALLINEA_USE_SSH_STARTUP 1
[Raghu.Reddy@j1c08 forge-test]$ map --profile srun
/home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128 Arm Forge 21.0 - Arm MAP Profiling : srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128 Allinea sampler : not preloading (Express Launch) MPI implementation : Auto-Detect (SLURM (MPMD)) * number of processes : 128 * number of nodes : 4 * Allinea MPI wrapper : not preloading (Express Launch) MAP: Process 6: MAP: MAP: The Allinea sampler was not preloaded. MAP: Check the user guide for instructions on how to link with the Allinea sampler. [Raghu.Reddy@j1c08 forge-test]$ On our production system it works (I have edited program output for brevity): hfe03.% module load intel impi forge hfe03.% salloc -A nesccmgmt -q admin -n 128 salloc: Pending job allocation 27033112 salloc: job 27033112 queued and waiting for resources salloc: job 27033112 has been allocated resources salloc: Granted job allocation 27033112 salloc: Waiting for resource configuration salloc: Nodes h13c[24,26,28,31] are ready for job h13c24.% h13c24.% setenv SLURM_OVERLAP 1 h13c24.% map --profile srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128 Arm Forge 21.0 - Arm MAP Profiling : srun /home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128 Allinea sampler : preload (Express Launch) MPI implementation : Auto-Detect (SLURM (MPMD)) * number of processes : 128 * number of nodes : 4 * Allinea MPI wrapper : preload (JIT compiled) (Express Launch) NAS Parallel Benchmarks 3.3 -- MG Benchmark No input file. Using compiled defaults Size: 1024x1024x1024 (class D) Iterations: 50 Number of processes: 128 Initialization time: 0.663 seconds iter 1 ... ... iter 50 Benchmark completed VERIFICATION SUCCESSFUL L2 Norm is 0.1583275060429E-09 Error is 0.6697470786978E-11 MG Benchmark Completed. Class = D Size = 1024x1024x1024 Iterations = 50 ... ... MS T27A-1 NASA Ames Research Center Moffett Field, CA 94035-1000 Fax: 650-604-3957 MAP analysing program... MAP gathering samples... 
MAP generated /scratch2/SYSADMIN/nesccmgmt/Raghu.Reddy/forge-test/mg-intel-impi.D_128p_4n_1t_2021-12-30_15-08.map h13c24.% exit salloc: Relinquishing job allocation 27033112 hfe03.% I will upload the slurm.conf file and that file is same in both versions. Wish you a Very Happy New Year! Thank you! Created attachment 22823 [details]
The slurm.conf file mentioned in the comment
The slurm.conf file used with both versions is the same.
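For quick reference, the workaround discussed in this ticket can be condensed into a small shell fragment. This is a sketch only (POSIX sh syntax; the interactive sessions above use csh-style setenv), and the account, QOS, and binary path are copied verbatim from the transcript, so they are site-specific.

```shell
# Sketch of the Forge/MAP workaround from this ticket: have Forge start its
# daemons over SSH rather than via srun, which fails on the 21.08.2 test system.
export SLURM_OVERLAP=1             # let the profiled step overlap the allocation's CPUs
export ALLINEA_USE_SSH_STARTUP=1   # Forge switch to use SSH startup instead of srun

# Site-specific values taken from the transcript above:
APP=/home/Raghu.Reddy/S2/Testsuite3/NPB3.3-MPI/bin/mg-intel-impi.D.128
CMD="map --profile srun $APP"

# Inside an allocation obtained with:
#   salloc -A nesccmgmt -q admin -n 128
# one would then run the command held in $CMD.
echo "$CMD"
```

In the csh sessions shown above, the two exports correspond to `setenv SLURM_OVERLAP 1` and `setenv ALLINEA_USE_SSH_STARTUP 1`.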
Hi Raghu,

Would you open a new bug for this? I am going to have a fresh pair of eyes look into this issue for you.

Hi Jason,

I have submitted a new ticket, so please feel free to close this one. Thanks!