Thought to also attach what we see from the controller and node logs, for the job that originally failed.

slurmctld:

[2021-11-17T15:22:10.985] _slurm_rpc_submit_batch_job: JobId=17466 InitPrio=40144 usec=7721
[2021-11-17T15:22:11.156] sched: Allocate JobId=17466 NodeList=nid00013 #CPUs=40 Partition=acceptance
[2021-11-17T15:22:12.408] prolog_running_decr: Configuration for JobId=17466 is complete
[2021-11-17T15:22:23.806] _job_complete: JobId=17466 WEXITSTATUS 1
[2021-11-17T15:22:23.809] _job_complete: JobId=17466 done

slurmd:

[2021-11-17T15:22:12.411] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 17466
[2021-11-17T15:22:12.411] task/affinity: batch_bind: job 17466 CPU input mask for node: 0xFFFFFFFFFF
[2021-11-17T15:22:12.411] task/affinity: batch_bind: job 17466 CPU final HW mask for node: 0xFFFFFFFFFF
[2021-11-17T15:22:12.422] [17466.extern] core_spec/cray_aries: init: core_spec/cray_aries: init
[2021-11-17T15:22:12.430] [17466.extern] task/cgroup: _memcg_initialize: job: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2021-11-17T15:22:12.430] [17466.extern] task/cgroup: _memcg_initialize: step: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2021-11-17T15:22:14.175] Launching batch job 17466 for UID 22892
[2021-11-17T15:22:14.185] [17466.batch] core_spec/cray_aries: init: core_spec/cray_aries: init
[2021-11-17T15:22:14.193] [17466.batch] task/cgroup: _memcg_initialize: job: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2021-11-17T15:22:14.193] [17466.batch] task/cgroup: _memcg_initialize: step: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2021-11-17T15:22:17.116] launch task StepId=17466.0 request from UID:22892 GID:22892 HOST:10.128.0.14 PORT:56240
[2021-11-17T15:22:17.116] task/affinity: lllp_distribution: JobId=17466 binding: threads, dist 1
[2021-11-17T15:22:17.116] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-11-17T15:22:17.116] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [17466]: mask_cpu, 0x000000000F
[2021-11-17T15:22:17.126] [17466.0] core_spec/cray_aries: init: core_spec/cray_aries: init
[2021-11-17T15:22:17.176] [17466.0] (switch_cray_aries.c: 656: switch_p_job_init) gres_cnt: 2072 0
[2021-11-17T15:22:17.186] [17466.0] task/cgroup: _memcg_initialize: job: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2021-11-17T15:22:17.186] [17466.0] task/cgroup: _memcg_initialize: step: alloc=28000MB mem.limit=26600MB memsw.limit=26600MB
[2021-11-17T15:22:20.153] [17466.0] get_exit_code task 0 died by signal: 9
[2021-11-17T15:22:23.000] [17466.0] done with job
[2021-11-17T15:22:23.805] [17466.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[2021-11-17T15:22:23.811] [17466.batch] done with job
[2021-11-17T15:22:23.978] [17466.extern] done with job

which doesn't say all that much about the TRes condition that my colleague saw in the ReFrame output.

Spoiler alert: Note the "nid00013" in the above job's "sched: Allocate". Well, these log snippets were from the ReFrame job, however, that node is defined with Feature=ivybridge, which suggests that the #SBATCH --constraint=haswell in the submission script isn't being honoured. As I say though: one for the future.

Bit more info. I ran the same job submission script on our production Cray, albeit without the --constraint (because we've taken the view that we don't specify features for homogeneous platforms, which, whilst it makes testing harder, means there's less for the users to have to know).
I see similar results; the first line is from the srun without the --exact:

SLURM_CPUS_ON_NODE 48
SLURM_CPUS_ON_NODE 48
SLURM_CPUS_ON_NODE 48
SLURM_CPUS_ON_NODE 8

I think the difference in the first line is down to a different configuration as regards hyperthreading; however, the fact that the "srun --exact" returns the same value on both test and prod now has me thinking that it's "doing the right thing" and that the 8 may just be the --cpus-per-task=4 multiplied by two because of the hyperthreading. If that is the case, then there may be nothing to see here?

Even more info, this one a difference between the TDS (21.08) and the production system (20.11):

Dumping job script Slurm environment    Dumping subordinate Slurm environment
SLURM_SUBMIT_HOST chaos-1               SLURM_SUBMIT_HOST chaos-1
SLURM_SUBMIT_HOST magnus-1              SLURM_SUBMIT_HOST nid00025

so what that appears to be saying is that on 20.11, the originating batch job has an EnvVar saying that its submit host was the eLogin node, however, the subordinate job thinks that its submit host was the node the batch script got allocated to. For the 21.08 deployment though, both the originating batch job and the subordinate job think their submit host was the eLogin node.

Created attachment 22328 [details]
slurmctld and slurmd logs at debug 5
Ramped up the debugging to 5 and ran just the three ReFrame jobs
that exhibit the
srun: error: Unable to create step for job 17479: Invalid Trackable RESource (TRES) specification
messages.
These are three jobs that run the Arm Forge Ddt debugger on
C, C++ and F90 codes.
The attached logs have been cut down to just the sections that
mention the 17479 JobID, which ran on nid00017, which _is_ a
Haswell node.
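For anyone repeating the cutting-down, it can be sketched like so (purely illustrative: the sample log lines, file name and exact pattern here are made up, not the ones actually used):

```shell
# Illustrative only: filter a controller log down to one JobID.
cat > slurmctld.sample.log <<'EOF'
[2021-11-18T10:00:01] _slurm_rpc_submit_batch_job: JobId=17479 InitPrio=40144
[2021-11-18T10:00:02] sched: Allocate JobId=17480 NodeList=nid00018
[2021-11-18T10:00:09] [17479.batch] done with job
EOF

# Match both the "JobId=17479" and the "[17479.step]" slurmstepd prefix forms.
grep -E 'JobId=17479|\[17479\.' slurmctld.sample.log
```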
It's interesting that when running only three jobs, all three
do observe the constraint, so maybe the failure to observe the
constraint is tied into all of the Haswell nodes being in use when
the Ddt jobs start (they start late in the set), and something then
clobbers the honouring of the constraint, letting them run on whatever is free?
What I also find interesting is that the logs don't appear to
mention invalid tres, though I don't claim to be an expert log
interpreter, especially at Level 5!
Hi Kevin,

It seems like there are a few questions here, so I'm taking some time to parse them and try to understand everything. I'll get back to you with more details soon.

Created attachment 22370 [details]
20.11 configs for comparison
Occurred to me that you might want to "compare and contrast" the
configs from the 20.11 system, where things work, with the 21.08
configs I supplied before.
FWIW: gres.conf on the 20.11 system is just comments and blank
lines, so not included.
More grist f't mill:

We have since run the ReFrame checks that were seeing the TRes issue on our Cray XC TDS, on an ancillary, so non-Cray, system that runs SLES12. On the production part of it, 20.11.8, things are fine, but on the testing part of it, 21.08.4, we see the same issue. I can provide the configs for the ancillary systems if you want.

Also thought to mention that, on the Cray TDS, my notes indicate that we did see the same failures when first deploying 21.08.1 in mid-September and updating to 21.08.2, but we didn't follow up on it at the time (mainly because we weren't looking to put 21.08 into production), so the suggestion is that it's something in 21.08 compared to 20.11.

HTH
Kevin

And the info hits just keep on coming.

So I thought to run the ReFrame Ddt jobs in the same way that I ran my "sbatch a subordinate script" job, which enables me to get the environment within the srun, as well as the sbatch, so, instead of

ddt <options> srun executable

I ran

srun subordinate_ddt.sh

where

$ cat subordinate_ddt.sh
#!/bin/bash
#
ddt --offline --output=ddtreport.txt --trace-at _jacobi.c:91,residual ./jacobi
$

That runs without issue.

This suggests to me that it is the Ddt interacting with Slurm 21.08 in a way that is different to what happens when it runs against Slurm 20.11.

Not clear if that makes this an issue for Ddt, Slurm 21.08, or both?

Alright, I've got a few things for you.

RE --constraint not being honoured in some cases - I can't reproduce this, but feel free to open a new bug on it.

RE --cpus-per-task - this is supposed to imply --exact. This was new for 21.08. This did work when that documentation was written, but unfortunately before 21.08 was released there was a regression, so this behavior doesn't work as documented. I've opened bug 12909 to handle this, which has an example that shows it is broken. It's a publicly viewable bug so you can view that or add yourself to CC if you want to follow it.
RE the number of CPUs with respect to --overlap, --exact, hyperthreads, etc. I think you found that things are probably behaving as expected.

* Without --exact/--exclusive, a step will use all the CPUs available to the job on the nodes allocated to the step.
* Multiple threads per core influences this.
* We have a documentation bug open to improve documentation of how many CPUs will be allocated to a step. This is bug 11310, which has some more examples.
* If you have any specific questions about this, then can you open a new bug?

With this in mind, I'm changing the title of this bug to reflect what I believe to be the main problem - When using Arm Forge Ddt to run a step, this error happens:

Unable to create step for job 17466: Invalid Trackable RESource (TRES) specification

(In reply to Kevin Buckley from comment #8)
> And the info hits just keep on coming.
> 
> So I thought to run the reframe Ddt jobs is the same way that
> I ran my "sbatch a subordinate script" job, which enables me
> to get the environment within the srun, as well as the sbatch,
> so, instead of
> 
> ddt <options> srun executable
> 
> I ran
> 
> srun subordinate_ddt.sh
> 
> where
> 
> $ cat subordinate_ddt.sh
> #!/bin/bash
> #
> ddt --offline --output=ddtreport.txt --trace-at _jacobi.c:91,residual
> ./jacobi
> $
> 
> That runs without issue.
> 
> This suggests to me that it is the Ddt interacting with
> Slurm 21.08 in a way that is different to what happens
> when it runs against Slurm 20.11.
> 
> Not clear if that makes this an issue for Ddt, Slurm 21.08, or both?

That's really good info. Can you run the following test?

scontrol setdebugflags +steps
Run your job with the Ddt step that fails
Also run a step that succeeds
Finish the job
scontrol setdebugflags -steps

Then can you upload the slurmctld log file? Can you also let me know the job ID so I can parse the log file?
I'm not sure if this will turn up anything more, since this invalid TRES error only happens in one place at the beginning of step creation and we may not see any new logs. But I still think it's worth looking at. (This error happens in other places for jobs, but it's not the job that's failing, just the step.)

> With this in mind, I'm changing the title of this bug to reflect what I believe
> to be the main problem -
> 
> When using Arm Forge Ddt to run a step, this error happens:
> 
> Unable to create step for job 17466: Invalid Trackable RESource (TRES)
> specification

Not sure the title needs a site-specific JobID, but it's your call.

> Can you run the following test?

Will do on that, though one other thing of note that we've since seen, is that on the Crays, if we put the Ddt command into a subordinate script then things don't run. Not clear what this is telling us, as regards which component in all this is falling over, as yet. My colleague on the ticket knows more about Ddt than I do, so he may be able to give you some insight as to what should be happening.

BTW, when you say

> RE --constraint not being honoured in some cases - I can't
> reproduce this, but feel free to open a new bug on it.

do you mean that you've run an instance of Ddt using our config, or something else?

> I've opened bug 12909 to handle this, ...

Will take a look at that: cheers.

Created attachment 22400 [details]
Requested logs
The JobIds to look for are 17563, which has the Ddt invoked
from within the subordinate script, and 17564, which just has
the "echo out all the Slurm EnvVars" code in the subordinate
script.
FYI: the debug level goes all the way to five in these logs.
Apologies for all the Elasticsearch server messages: we really
should erase that from the TDS config.
I can reproduce the error with
% srun --gres=none hello_c
srun: error: Unable to create step for job 1655: Invalid Trackable RESource (TRES) specification
I believe Forge uses --gres=none so it can launch the backend even when all the resource is taken up.
> Not sure the title needs a site-specific JobID, but it's your call.

It doesn’t really matter to me. I just copy-pasted the message. I would be just as happy with “<jobid>” instead of an actual number.

(In reply to Kevin Mooney from comment #12)
> I can reproduce the error with
> 
> % srun --gres=none hello_c
> srun: error: Unable to create step for job 1655: Invalid Trackable
> RESource (TRES) specification
> 
> I believe Forge uses --gres=none so it can launch the backend even when all
> the resource is taken up.

Nice find, Kevin! I can easily reproduce that, too. I'm pretty sure that --gres=none used to work. I'm looking into this more.

> Will do on that, though one other thing of note that we've
> since seen, is that on the Crays, if we put the Ddt command
> into a subordinate script then things don't run.

I wonder - Is this caused by the same thing with --gres=none?

> BTW, when you say
> 
> > RE --constraint not being honoured in some cases - I can't
> > reproduce this, but feel free to open a new bug on it.
> 
> do you mean that you've run a instance of Ddt using our config,
> or something else?

I didn’t run Ddt. I just ran various tests with --constraint and --constraint worked for me every time.

> (In reply to Kevin Mooney from comment #12)
> > I can reproduce the error with
> > 
> > % srun --gres=none hello_c
> > srun: error: Unable to create step for job 1655: Invalid Trackable
> > RESource (TRES) specification
> > 
> > I believe Forge uses --gres=none so it can launch the backend even when all
> > the resource is taken up.
> 
> Nice find, Kevin! I can easily reproduce that, too. I'm pretty sure that
> --gres=none used to work. I'm looking into this more.

Full disclosure, there are 2 Kevins in this thread, and I am a developer of Forge (DDT & MAP), so I had an advantage when diagnosing the issue.

--gres=none is documented in srun --help.
> > Will do on that, though one other thing of note that we've
> > since seen, is that on the Crays, if we put the Ddt command
> > into a subordinate script then things don't run.
> 
> I wonder - Is this caused by the same thing with --gres=none?

I suspect this is a separate issue. The first case is equivalent to

ddt srun exe

which launches ddt on the login/batch node, which attaches a debugger to srun, and then distributes the backend to the compute nodes with

srun --jobid=1666 --gres=none --mem-per-cpu=0 -I -W0 --gpus=0 --overlap forge-backend

The second case is equivalent to

srun ddt exe

This will launch ddt's frontend on all the compute nodes and start a separate session for each exe. While this can work, there are plenty of edge cases that might prevent it from working. Usually the first is what you want.

Looking at alternatives to get the first case working. It mostly works for me if I remove the --gres=none with

ALLINEA_DEBUG_SRUN_ARGS="%jobid% --mem-per-cpu=0 -I -W0 --gpus=0 --overlap" ddt --log=ddt_log.xml srun exe

The debug session works fine for me, but on exit I see the following (potentially harmless?) error from slurm

slurmstepd: error: *** STEP 1668.1 ON node-1 CANCELLED AT 2021-11-30T11:53:31 ***

(In reply to Kevin Mooney from comment #20)
> Full disclosure, there are 2 Kevins in this thread, and I am a developer of
> Forge (DDT & MAP), so I had an advantage when diagnosing the issue.

I realized that. I really appreciate you stepping in and identifying the issue. It made debugging the issue in Slurm much easier.

> --gres=none is documented in srun --help.

Yes, it's definitely supposed to work and this is a regression in Slurm 21.08. There are actually two issues:

* Allow jobs and steps that use --gres=none to run
* Steps that use --gres=none should unset GRES environment variables, but they do not anymore (another regression in 21.08)

I've identified the exact commits where we had these regressions. The first issue is really easy to fix. The second will be a little bit trickier, but hopefully not too bad.

> I suspect this is a separate issue. The first case is equivalent to
> 
> ddt srun exe
> 
> which launches ddt on the login/batch node, which attaches a debugger to
> srun, and then distributes the backend to the compute nodes with
> 
> srun --jobid=1666 --gres=none --mem-per-cpu=0 -I -W0 --gpus=0 --overlap
> forge-backend
> 
> The second case is equivalent to
> 
> srun ddt exe

Does ddt call srun internally? Or does ddt merely set Slurm environment variables or pass arguments to srun? If it is the latter, it makes sense to me why srun ddt exe would magically work - because in this case, srun is called first and wouldn't have --gres=none set.

> This will launch ddt's frontend on all the compute nodes and start a
> separate session for each exe. While this can work, there are plenty of edge
> cases that might prevent it from working. Usually the first is what you want.

Agreed.

> Looking at alternatives to get the first case working. It mostly works for
> me if I remove the --gres=none with
> 
> ALLINEA_DEBUG_SRUN_ARGS="%jobid% --mem-per-cpu=0 -I -W0 --gpus=0
> --overlap" ddt --log=ddt_log.xml srun exe
> 
> The debug session works fine for me, but on exit I see the following
> (potentially harmless?) error from slurm
> 
> slurmstepd: error: *** STEP 1668.1 ON node-1 CANCELLED AT
> 2021-11-30T11:53:31 ***

For jobs that don't have GPUs, simply not using --gres at all works. For jobs that do have GPUs, then this should be fine as a temporary workaround.

I'll keep you updated on my progress with the fixes.

> > I suspect this is a separate issue. The first case is equivalent to
> > 
> > ddt srun exe
> > 
> > which launches ddt on the login/batch node, which attaches a debugger to
> > srun, and then distributes the backend to the compute nodes with
> > 
> > srun --jobid=1666 --gres=none --mem-per-cpu=0 -I -W0 --gpus=0 --overlap
> > forge-backend
> > 
> > The second case is equivalent to
> > 
> > srun ddt exe
> 
> Does ddt call srun internally? Or does ddt merely set Slurm environment
> variables or pass arguments to srun? If it is the latter, it makes sense to
> me why srun ddt exe would magically work - because in this case, srun is
> called first and wouldn't have --gres=none set.

It's the former, DDT calls srun internally, and it shouldn't make any changes to the user's command line. We launch the user's srun under a debugger and then distribute our backend using

srun --jobid=<job_id> --gres=none --mem-per-cpu=0 -I -W0 --gpus=0 --overlap forge-backend

which then attaches a debugger to each process. Similar to what you said, `srun ddt exe` avoids the --gres=none issue because it doesn't use the srun launch mechanism.

> > Looking at alternatives to get the first case working. It mostly works for
> > me if I remove the --gres=none with
> > 
> > ALLINEA_DEBUG_SRUN_ARGS="%jobid% --mem-per-cpu=0 -I -W0 --gpus=0
> > --overlap" ddt --log=ddt_log.xml srun exe
> > 
> > The debug session works fine for me, but on exit I see the following
> > (potentially harmless?) error from slurm
> > 
> > slurmstepd: error: *** STEP 1668.1 ON node-1 CANCELLED AT
> > 2021-11-30T11:53:31 ***
> 
> For jobs that don't have GPUs, simply not using --gres at all works. For
> jobs that do have GPUs, then this should be fine as a temporary workaround.
> 
> I'll keep you updated on my progress with the fixes.

I should spend some time trying to understand what all these flags do. They have grown organically over time and it's likely some of them aren't needed. We don't have an easy way to test different versions of Slurm, so changes here are generally made with caution.

Hi Kevin and Kevin,

Quick update. I submitted patches to our review queue to fix issues with --gres=none. I'll keep you updated on our progress.

We've fixed --gres=none ahead of 21.08.5. Thanks for reporting this! I'm closing this bug as fixed.

commit 8b7b1e7128fcf1ac013ab308cf0c09fbc20485f2
Author: Marshall Garey <marshall@schedmd.com>
AuthorDate: Tue Nov 30 16:28:15 2021 -0700

    Fix regression that broke --gres=none

    This was broken by commit 6300d47c2d (and related) in 21.08.0pre1
    which prepended "gres:" to all GRES, including "none". This caused
    jobs and steps that used --gres=none to fail.

    Jobs would fail with this error message:
    Invalid generic resource (gres) specification

    Steps would fail with this similar error message:
    Invalid Trackable RESource (TRES) specification

    Bug 12880.

Brilliant, thank you Marshall!

*** Ticket 13102 has been marked as a duplicate of this ticket. ***
Created attachment 22310 [details]
Tarball of local config, job scripts and output

So I think of this as a "Job environment" issue, so maybe it could be classed as a "Scheduling" component but, as I'm not sure, went with "Other". Started off as 4-minor as we are only seeing it on a test system.

Trying to debug the scheduling and execution of a script that our Services team have set up within our ReFrame acceptance testing suite that has failed on our TDS since we went to 21.08 there, the latest failure being seen after Tuesday's upgrade to 21.08.4, but which wasn't seen to fail under 20.11.x. The job seems to get to a node but then bombs with:

srun: error: Unable to create step for job 17466: Invalid Trackable RESource (TRES) specification

The actual job submission script is running a Ddt-launched srun of an MPI binary, but in my testing, I just have the same, in respect of SBATCH defines and locally set EnvVars, job submission script that just srun-s a shell script.

My submission script (full script attached) sets up Slurm-related things as

#SBATCH --job-name="IS-2656"
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --output="IS-2656-%j.out"
#SBATCH --error="IS-2656-%j.err"
#SBATCH --time=0:10:0
#SBATCH --constraint=haswell

export SLURM_OVERLAP=1

and then does

srun subordinate.sh

Looking at the dumped Slurm-related EnvVars exhibits this one significant difference:

Submission                Subordinate
SLURM_CPUS_ON_NODE 24     SLURM_CPUS_ON_NODE 48

which might be expected, given past issues with the Crays not allowing us to disable hyperthreading at the BIOS level.

However, having read about SLURM_OVERLAP possibly affecting the --exact that one gets for free when specifying --cpus-per-task, I thought to also submit the subordinate task with

srun --exact subordinate.sh

I then see, in the dumped Slurm-related EnvVars

Submission                Subordinate
SLURM_CPUS_ON_NODE 24     SLURM_CPUS_ON_NODE 8

but where's that factor of three come from?
There's another issue we are seeing with the ReFrame tests, in that the --constraint=haswell doesn't appear to be respected, as we have seen the job run on a node that doesn't have a definition of Feature=haswell but, as I have been able to avoid that when running my own jobs from the shell, as opposed to the ones run by ReFrame, we'll leave that for a separate ticket, if needs be!

Any thoughts on the numbers seen above?

Kevin

PS Initial attachment contains

cgroup.conf
gres.conf
slurm.conf
reframe-job.sh
is-2656-srun.sh
subordinate.sh