Thought to also attach what we see from the controller and node logs, for the job that originally failed.

slurmctld:

[2021-11-17T15:22:10.985] _slurm_rpc_submit_batch_job: JobId=17466 InitPrio=40144 usec=7721
[2021-11-17T15:22:11.156] sched: Allocate JobId=17466 NodeList=nid00013 #CPUs=40 Partition=acceptance
[2021-11-17T15:22:12.408] prolog_running_decr: Configuration for JobId=17466 is complete
[2021-11-17T15:22:23.806] _job_complete: JobId=17466 WEXITSTATUS 1
[2021-11-17T15:22:23.809] _job_complete: JobId=17466 done

slurmd:

[2021-11-17T15:22:12.411] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 17466
[2021-11-17T15:22:12.411] task/affinity: batch_bind: job 17466 CPU input mask for node: 0xFFFFFFFFFF
[2021-11-17T15:22:12.411] task/affinity: batch_bind: job 17466 CPU final HW mask for node: 0xFFFFFFFFFF
[2021-11-17T15:22:12.422] [17466.extern] core_spec/cray_aries: init: core_spec/cray_aries: init
[2021-11-17T15:22:12.430] [17466.extern] task/cgroup: _memcg_initialize: job: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2021-11-17T15:22:12.430] [17466.extern] task/cgroup: _memcg_initialize: step: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2021-11-17T15:22:14.175] Launching batch job 17466 for UID 22892
[2021-11-17T15:22:14.185] [17466.batch] core_spec/cray_aries: init: core_spec/cray_aries: init
[2021-11-17T15:22:14.193] [17466.batch] task/cgroup: _memcg_initialize: job: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2021-11-17T15:22:14.193] [17466.batch] task/cgroup: _memcg_initialize: step: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2021-11-17T15:22:17.116] launch task StepId=17466.0 request from UID:22892 GID:22892 HOST:10.128.0.14 PORT:56240
[2021-11-17T15:22:17.116] task/affinity: lllp_distribution: JobId=17466 binding: threads, dist 1
[2021-11-17T15:22:17.116] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-11-17T15:22:17.116] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [17466]: mask_cpu, 0x000000000F
[2021-11-17T15:22:17.126] [17466.0] core_spec/cray_aries: init: core_spec/cray_aries: init
[2021-11-17T15:22:17.176] [17466.0] (switch_cray_aries.c: 656: switch_p_job_init) gres_cnt: 2072 0
[2021-11-17T15:22:17.186] [17466.0] task/cgroup: _memcg_initialize: job: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2021-11-17T15:22:17.186] [17466.0] task/cgroup: _memcg_initialize: step: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2021-11-17T15:22:20.153] [17466.0] get_exit_code task 0 died by signal: 9
[2021-11-17T15:22:23.000] [17466.0] done with job
[2021-11-17T15:22:23.805] [17466.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[2021-11-17T15:22:23.811] [17466.batch] done with job
[2021-11-17T15:22:23.978] [17466.extern] done with job

which doesn't say all that much about the TRes condition that my colleague saw in the ReFrame output.

Spoiler alert: Note the "nid00013" in the above job's "sched: Allocate". Well, these log snippets were from the ReFrame job, however, that node is defined with Feature=ivybridge, which suggests that the #SBATCH --constraint=haswell in the submission script isn't being honoured. As I say though: one for the future.

Bit more info. I ran the same job submission script on our production Cray, albeit without the --constraint (because we've taken the view that we don't specify features for homogeneous platforms, which, whilst it makes testing harder, means there's less for the users to have to know).
I see similar results; the first line is from the srun without the --exact:

SLURM_CPUS_ON_NODE 48
SLURM_CPUS_ON_NODE 48
SLURM_CPUS_ON_NODE 48
SLURM_CPUS_ON_NODE 8

I think the difference in the first line is down to a different configuration as regards hyperthreading; however, the fact that the "srun --exact" returns the same value on both test and prod now has me thinking that it's "doing the right thing" and that the 8 may just be the --cpus-per-task=4 multiplied by two because of the hyperthreading. If that is the case, then there may be nothing to see here?

Even more info, this one a difference between the TDS (21.08) and the production system (20.11):

Dumping job script Slurm environment    Dumping subordinate Slurm environment
SLURM_SUBMIT_HOST chaos-1               SLURM_SUBMIT_HOST chaos-1
SLURM_SUBMIT_HOST magnus-1              SLURM_SUBMIT_HOST nid00025

so what that appears to be saying is that on 20.11, the originating batch job has an EnvVar saying that its submit host was the eLogin node, however, the subordinate job thinks that its submit host was the node the batch script got allocated to. For the 21.08 deployment though, both the originating batch job and the subordinate job think their submit host was the eLogin node.

Created attachment 22328 [details]
slurmctld and slurmd logs at debug 5
Ramped up the debugging to 5 and ran just the three ReFrame jobs
that exhibit the
srun: error: Unable to create step for job 17479: Invalid Trackable RESource (TRES) specification
messages.
These are three jobs that run the Arm Forge Ddt debugger on
C, C++ and F90 codes.
The attached logs have been cut down to just the sections that
mention the 17479 JobID, which ran on nid00017, which _is_ a
Haswell node.
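For anyone repeating the cutting-down, it can be sketched like so (purely illustrative: the sample log lines, file name and exact pattern here are made up, not the ones actually used):

```shell
# Illustrative only: filter a controller log down to one JobID.
cat > slurmctld.sample.log <<'EOF'
[2021-11-18T10:00:01] _slurm_rpc_submit_batch_job: JobId=17479 InitPrio=40144
[2021-11-18T10:00:02] sched: Allocate JobId=17480 NodeList=nid00018
[2021-11-18T10:00:09] [17479.batch] done with job
EOF

# Match both the "JobId=17479" and the "[17479.step]" slurmstepd prefix forms.
grep -E 'JobId=17479|\[17479\.' slurmctld.sample.log
```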
It's interesting that when running only three jobs, all three
do observe the constraint, so maybe the failure to observe the
constraint is tied into all of the Haswell nodes being in use when
the Ddt jobs start (they start late in the set), and something then
clobbers the honouring of the constraint, letting them run on whatever is free?
What I also find interesting is that the logs don't appear to
mention invalid tres, though I don't claim to be an expert log
interpreter, especially at Level 5!
Hi Kevin,

It seems like there are a few questions here, so I'm taking some time to parse them and try to understand everything. I'll get back to you with more details soon.

Created attachment 22370 [details]
20.11 configs for comparison
Occurred to me that you might want to "compare and contrast" the
configs from the 20.11 system, where things work, with the 21.08
configs I supplied before.
FWIW: gres.conf on the 20.11 system is just comments and blank
lines, so not included.
More grist f't mill:

We have since run the ReFrame checks that were seeing the TRes issue on our Cray XC TDS, on an ancillary, so non-Cray, system that runs SLES12. On the production part of it, 20.11.8, things are fine, but on the testing part of it, 21.08.4, we see the same issue. I can provide the configs for the ancillary systems if you want.

Also thought to mention that, on the Cray TDS, my notes indicate that we did see the same failures when first deploying 21.08.1 in mid-September and updating to 21.08.2, but we didn't follow up on it at the time (mainly because we weren't looking to put 21.08 into production), so the suggestion is that it's something in 21.08 compared to 20.11.

HTH
Kevin

And the info hits just keep on coming.

So I thought to run the ReFrame Ddt jobs in the same way that I ran my "sbatch a subordinate script" job, which enables me to get the environment within the srun, as well as the sbatch, so, instead of

ddt <options> srun executable

I ran

srun subordinate_ddt.sh

where

$ cat subordinate_ddt.sh
#!/bin/bash
#
ddt --offline --output=ddtreport.txt --trace-at _jacobi.c:91,residual ./jacobi
$

That runs without issue.

This suggests to me that it is the Ddt interacting with Slurm 21.08 in a way that is different to what happens when it runs against Slurm 20.11.

Not clear if that makes this an issue for Ddt, Slurm 21.08, or both?

Alright, I've got a few things for you.

RE --constraint not being honoured in some cases - I can't reproduce this, but feel free to open a new bug on it.

RE --cpus-per-task - this is supposed to imply --exact. This was new for 21.08. This did work when that documentation was written, but unfortunately before 21.08 was released there was a regression, so this behavior doesn't work as documented. I've opened bug 12909 to handle this, which has an example that shows it is broken. It's a publicly viewable bug so you can view that or add yourself to CC if you want to follow it.
RE the number of CPUs with respect to --overlap, --exact, hyperthreads, etc. I think you found that things are probably behaving as expected.

* Without --exact/--exclusive, a step will use all the CPUs available to the job on the nodes allocated to the step.
* Multiple threads per core influences this.
* We have a documentation bug open to improve documentation of how many CPUs will be allocated to a step. This is bug 11310, which has some more examples.
* If you have any specific questions about this, then can you open a new bug?

With this in mind, I'm changing the title of this bug to reflect what I believe to be the main problem - When using Arm Forge Ddt to run a step, this error happens:

Unable to create step for job 17466: Invalid Trackable RESource (TRES) specification

(In reply to Kevin Buckley from comment #8)
> And the info hits just keep on coming.
> 
> So I thought to run the reframe Ddt jobs is the same way that
> I ran my "sbatch a subordinate script" job, which enables me
> to get the environment within the srun, as well as the sbatch,
> so, instead of
> 
> ddt <options> srun executable
> 
> I ran
> 
> srun subordinate_ddt.sh
> 
> where
> 
> $ cat subordinate_ddt.sh
> #!/bin/bash
> #
> ddt --offline --output=ddtreport.txt --trace-at _jacobi.c:91,residual
> ./jacobi
> $
> 
> That runs without issue.
> 
> This suggests to me that it is the Ddt interacting with
> Slurm 21.08 in a way that is different to what happens
> when it runs against Slurm 20.11.
> 
> Not clear if that makes this an issue for Ddt, Slurm 21.08, or both?

That's really good info. Can you run the following test?

scontrol setdebugflags +steps
Run your job with the Ddt step that fails
Also run a step that succeeds
Finish the job
scontrol setdebugflags -steps

Then can you upload the slurmctld log file? Can you also let me know the job ID so I can parse the log file?
I'm not sure if this will turn up anything more, since this invalid TRES error only happens in one place at the beginning of step creation and we may not see any new logs. But I still think it's worth looking at. (This error happens in other places for jobs, but it's not the job that's failing, just the step.)

> With this in mind, I'm changing the title of this bug to reflect what I believe
> to be the main problem -
> 
> When using Arm Forge Ddt to run a step, this error happens:
> 
> Unable to create step for job 17466: Invalid Trackable RESource (TRES)
> specification

Not sure the title needs a site-specific JobID, but it's your call.

> Can you run the following test?

Will do on that, though one other thing of note that we've since seen, is that on the Crays, if we put the Ddt command into a subordinate script then things don't run. Not clear what this is telling us, as regards which component in all this is falling over, as yet. My colleague on the ticket knows more about Ddt than I do, so he may be able to give you some insight as to what should be happening.

BTW, when you say

> RE --constraint not being honoured in some cases - I can't
> reproduce this, but feel free to open a new bug on it.

do you mean that you've run an instance of Ddt using our config, or something else?

> I've opened bug 12909 to handle this, ...

Will take a look at that: cheers.

Created attachment 22400 [details]
Requested logs
The JobIds to look for are 17563, which has the Ddt invoked
from within the subordinate script, and 17564, which just has
the "echo out all the Slurm EnvVars" code in the subordinate
script.
FYI: the debug level goes all the way to five in these logs.
Apologies for all the Elasticsearch server messages: we really
should erase that from the TDS config.
I can reproduce the error with
% srun --gres=none hello_c
srun: error: Unable to create step for job 1655: Invalid Trackable RESource (TRES) specification
I believe Forge uses --gres=none so it can launch the backend even when all the resource is taken up.
> Not sure the title needs a site-specific JobID, but it's your call.

It doesn’t really matter to me. I just copy-pasted the message. I would be just as happy with “<jobid>” instead of an actual number.

(In reply to Kevin Mooney from comment #12)
> I can reproduce the error with
> 
> % srun --gres=none hello_c
> srun: error: Unable to create step for job 1655: Invalid Trackable
> RESource (TRES) specification
> 
> I believe Forge uses --gres=none so it can launch the backend even when all
> the resource is taken up.

Nice find, Kevin! I can easily reproduce that, too. I'm pretty sure that --gres=none used to work. I'm looking into this more.

> Will do on that, though one other thing of note that we've
> since seen, is that on the Crays, if we put the Ddt command
> into a subordinate script then things don't run.

I wonder - Is this caused by the same thing with --gres=none?

> BTW, when you say
> 
> > RE --constraint not being honoured in some cases - I can't
> > reproduce this, but feel free to open a new bug on it.
> 
> do you mean that you've run a instance of Ddt using our config,
> or something else?

I didn’t run Ddt. I just ran various tests with --constraint and --constraint worked for me every time.

> (In reply to Kevin Mooney from comment #12)
> > I can reproduce the error with
> > 
> > % srun --gres=none hello_c
> > srun: error: Unable to create step for job 1655: Invalid Trackable
> > RESource (TRES) specification
> > 
> > I believe Forge uses --gres=none so it can launch the backend even when all
> > the resource is taken up.
> 
> Nice find, Kevin! I can easily reproduce that, too. I'm pretty sure that
> --gres=none used to work. I'm looking into this more.

Full disclosure, there are 2 Kevins in this thread, and I am a developer of Forge (DDT & MAP), so I had an advantage when diagnosing the issue.

--gres=none is documented in srun --help.
> > Will do on that, though one other thing of note that we've
> > since seen, is that on the Crays, if we put the Ddt command
> > into a subordinate script then things don't run.
> 
> I wonder - Is this caused by the same thing with --gres=none?

I suspect this is a separate issue. The first case is equivalent to

ddt srun exe

which launches ddt on the login/batch node, which attaches a debugger to srun, and then distributes the backend to the compute nodes with

srun --jobid=1666 --gres=none --mem-per-cpu=0 -I -W0 --gpus=0 --overlap forge-backend

The second case is equivalent to

srun ddt exe

This will launch ddt's frontend on all the compute nodes and start a separate session for each exe. While this can work, there are plenty of edge cases that might prevent it from working. Usually the first is what you want.

Looking at alternatives to get the first case working. It mostly works for me if I remove the --gres=none with

ALLINEA_DEBUG_SRUN_ARGS="%jobid% --mem-per-cpu=0 -I -W0 --gpus=0 --overlap" ddt --log=ddt_log.xml srun exe

The debug session works fine for me, but on exit I see the following (potentially harmless?) error from slurm

slurmstepd: error: *** STEP 1668.1 ON node-1 CANCELLED AT 2021-11-30T11:53:31 ***

(In reply to Kevin Mooney from comment #20)
> Full disclosure, there are 2 Kevins in this thread, and I am a developer of
> Forge (DDT & MAP), so I had an advantage when diagnosing the issue.

I realized that. I really appreciate you stepping in and identifying the issue. It made debugging the issue in Slurm much easier.

> --gres=none is documented in srun --help.

Yes, it's definitely supposed to work and this is a regression in Slurm 21.08. There are actually two issues:

* Allow jobs and steps that use --gres=none to run
* Steps that use --gres=none should unset GRES environment variables, but they do not anymore (another regression in 21.08)

I've identified the exact commits where we had these regressions. The first issue is really easy to fix. The second will be a little bit trickier, but hopefully not too bad.

> I suspect this is a separate issue. The first case is equivalent to
> 
> ddt srun exe
> 
> which launches ddt on the login/batch node, which attaches a debugger to
> srun, and then distributes the backend to the compute nodes with
> 
> srun --jobid=1666 --gres=none --mem-per-cpu=0 -I -W0 --gpus=0 --overlap
> forge-backend
> 
> The second case is equivalent to
> 
> srun ddt exe

Does ddt call srun internally? Or does ddt merely set Slurm environment variables or pass arguments to srun? If it is the latter, it makes sense to me why srun ddt exe would magically work - because in this case, srun is called first and wouldn't have --gres=none set.

> This will launch ddt's frontend on all the compute nodes and start a
> separate session for each exe. While this can work, there are plenty of edge
> cases that might prevent it from working. Usually the first is what you want.

Agreed.

> Looking at alternatives to get the first case working. It mostly works for
> me if I remove the --gres=none with
> 
> ALLINEA_DEBUG_SRUN_ARGS="%jobid% --mem-per-cpu=0 -I -W0 --gpus=0
> --overlap" ddt --log=ddt_log.xml srun exe
> 
> The debug session works fine for me, but on exit I see the following
> (potentially harmless?) error from slurm
> 
> slurmstepd: error: *** STEP 1668.1 ON node-1 CANCELLED AT
> 2021-11-30T11:53:31 ***

For jobs that don't have GPUs, simply not using --gres at all works. For jobs that do have GPUs, then this should be fine as a temporary workaround.

I'll keep you updated on my progress with the fixes.

> > I suspect this is a separate issue. The first case is equivalent to
> > 
> > ddt srun exe
> > 
> > which launches ddt on the login/batch node, which attaches a debugger to
> > srun, and then distributes the backend to the compute nodes with
> > 
> > srun --jobid=1666 --gres=none --mem-per-cpu=0 -I -W0 --gpus=0 --overlap
> > forge-backend
> > 
> > The second case is equivalent to
> > 
> > srun ddt exe
> 
> Does ddt call srun internally? Or does ddt merely set Slurm environment
> variables or pass arguments to srun? If it is the latter, it makes sense to
> me why srun ddt exe would magically work - because in this case, srun is
> called first and wouldn't have --gres=none set.

It's the former, DDT calls srun internally, and it shouldn't make any changes to the user's command line. We launch the user's srun under a debugger and then distribute our backend using

srun --jobid=<job_id> --gres=none --mem-per-cpu=0 -I -W0 --gpus=0 --overlap forge-backend

which then attaches a debugger to each process. Similar to what you said, `srun ddt exe` avoids the --gres=none issue because it doesn't use the srun launch mechanism.

> > Looking at alternatives to get the first case working. It mostly works for
> > me if I remove the --gres=none with
> > 
> > ALLINEA_DEBUG_SRUN_ARGS="%jobid% --mem-per-cpu=0 -I -W0 --gpus=0
> > --overlap" ddt --log=ddt_log.xml srun exe
> > 
> > The debug session works fine for me, but on exit I see the following
> > (potentially harmless?) error from slurm
> > 
> > slurmstepd: error: *** STEP 1668.1 ON node-1 CANCELLED AT
> > 2021-11-30T11:53:31 ***
> 
> For jobs that don't have GPUs, simply not using --gres at all works. For
> jobs that do have GPUs, then this should be fine as a temporary workaround.
> 
> I'll keep you updated on my progress with the fixes.

I should spend some time trying to understand what all these flags do. They have grown organically over time and it's likely some of them aren't needed. We don't have an easy way to test different versions of Slurm, so changes here are generally made with caution.

Hi Kevin and Kevin,

Quick update. I submitted patches to our review queue to fix issues with --gres=none. I'll keep you updated on our progress.

We've fixed --gres=none ahead of 21.08.5. Thanks for reporting this! I'm closing this bug as fixed.

commit 8b7b1e7128fcf1ac013ab308cf0c09fbc20485f2
Author: Marshall Garey <marshall@schedmd.com>
AuthorDate: Tue Nov 30 16:28:15 2021 -0700

    Fix regression that broke --gres=none

    This was broken by commit 6300d47c2d (and related) in 21.08.0pre1
    which prepended "gres:" to all GRES, including "none". This caused
    jobs and steps that used --gres=none to fail.

    Jobs would fail with this error message:
    Invalid generic resource (gres) specification

    Steps would fail with this similar error message:
    Invalid Trackable RESource (TRES) specification

    Bug 12880.

Brilliant, thank you Marshall!

*** Ticket 13102 has been marked as a duplicate of this ticket. ***
Created attachment 22310 [details]
Tarball of local config, job scripts and output

So I think of this as a "Job environment" issue, so maybe it could be classed as a "Scheduling" component but, as I'm not sure, went with "Other". Started off as 4-minor as we are only seeing it on a test system.

Trying to debug the scheduling and execution of a script that our Services team have set up within our ReFrame acceptance testing suite that has failed on our TDS since we went to 21.08 there, the latest failure being seen after Tuesday's upgrade to 21.08.4, but which wasn't seen to fail under 20.11.x. The job seems to get to a node but then bombs with:

srun: error: Unable to create step for job 17466: Invalid Trackable RESource (TRES) specification

The actual job submission script is running a Ddt-launched srun of an MPI binary, but in my testing, I just have the same, in respect of SBATCH defines and locally set EnvVars, job submission script that just srun-s a shell script.

My submission script (full script attached) sets up Slurm-related things as

#SBATCH --job-name="IS-2656"
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --output="IS-2656-%j.out"
#SBATCH --error="IS-2656-%j.err"
#SBATCH --time=0:10:0
#SBATCH --constraint=haswell

export SLURM_OVERLAP=1

and then does

srun subordinate.sh

Looking at the dumped Slurm-related EnvVars exhibits this one significant difference:

Submission                Subordinate
SLURM_CPUS_ON_NODE 24     SLURM_CPUS_ON_NODE 48

which might be expected, given past issues with the Crays not allowing us to disable hyperthreading at the BIOS level.

However, having read about SLURM_OVERLAP possibly affecting the --exact that one gets for free when specifying --cpus-per-task, I thought to also submit the subordinate task with

srun --exact subordinate.sh

I then see, in the dumped Slurm-related EnvVars

Submission                Subordinate
SLURM_CPUS_ON_NODE 24     SLURM_CPUS_ON_NODE 8

but where's that factor of three come from?
There's another issue we are seeing with the ReFrame tests, in that the --constraint=haswell doesn't appear to be respected, as we have seen the job run on a node that doesn't have a definition of Feature=haswell but, as I have been able to avoid that when running my own jobs from the shell, as opposed to the ones run by ReFrame, we'll leave that for a separate ticket, if needs be!

Any thoughts on the numbers seen above?

Kevin

PS Initial attachment contains

cgroup.conf
gres.conf
slurm.conf
reframe-job.sh
is-2656-srun.sh
subordinate.sh