Ticket 633

Summary: Cannot specify 0 value for gres
Product: Slurm
Reporter: David Gloe <david.gloe>
Component: Other
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Priority: ---
CC: da
Version: 14.03.x
Hardware: Linux
OS: Linux
Site: CRAY
Version Fixed: 14.11.0-pre4

Description David Gloe 2014-03-07 08:48:46 MST
We've defined a generic resource 'craynetwork' without a corresponding craynetwork plugin. The only way to ask for none of this resource is using --gres=none, whereas --gres=<name>:0 works for other generic resources.

c16817@opal-p2:~> srun -n 1 --gres=gpu:0 hostname
nid00032
c16817@opal-p2:~> srun -n 1 --gres=craynetwork:0 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification
c16817@opal-p2:~> salloc -n 1
salloc: Granted job allocation 622
c16817@opal-p2:~> srun -n 1 --gres=craynetwork:0 hostname
srun: error: Unable to create job step: Invalid generic resource (gres) specification

[2014-03-07T16:14:29.659] debug:  gres: Couldn't find the specified plugin name for gres/craynetwork looking at all files
[2014-03-07T16:14:29.664] debug:  Cannot find plugin of type gres/craynetwork, just track gres count
...
[2014-03-07T16:36:09.516] Invalid gres job specification craynetwork:0
[2014-03-07T16:36:09.516] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification

This makes it impossible for a user to, for example, run a job with the gpu but without using the network (since we have a job_submit plugin that adds the craynetwork gres by default).
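
For context, the defaulting behavior of such a job_submit plugin can be sketched roughly as below. This is only an illustration of the logic in Python (real job_submit plugins are written in C or Lua), and the site's actual plugin is not shown in this ticket: the function name `add_default_gres` and the default count of 1 are assumptions. The point is that an explicit `craynetwork:0` has to be accepted by the controller for a user to opt out of the default.

```python
def add_default_gres(requested, default_name="craynetwork", default_count=1):
    """Append a default gres entry unless the user already addressed it.

    Hypothetical helper: the name and the default count are assumptions,
    not taken from any real job_submit plugin.
    """
    # No --gres given at all: apply the site default.
    if requested is None or requested.strip() == "":
        return f"{default_name}:{default_count}"

    # The user explicitly asked for no generic resources.
    if requested.strip().lower() == "none":
        return requested

    # Collect the gres names the user mentioned, ignoring type and count.
    names = [item.split(":", 1)[0].strip() for item in requested.split(",")]

    # An explicit entry, even with a count of 0, overrides the default.
    if default_name in names:
        return requested

    return f"{requested},{default_name}:{default_count}"


if __name__ == "__main__":
    for spec in (None, "gpu:1", "gpu:1,craynetwork:0", "none"):
        print(repr(spec), "->", add_default_gres(spec))
```

With this logic, `--gres=gpu:1,craynetwork:0` is passed through unchanged, which is exactly the request that is currently rejected as invalid.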
Comment 1 Moe Jette 2014-03-07 08:56:37 MST
In your slurm.conf is craynetwork defined?

GresTypes=craynetwork,gpu,etc.

If not defined in slurm.conf, it's going to be reported as an invalid gres.
Comment 2 Moe Jette 2014-03-07 08:57:52 MST
You definitely do not require a gres plugin to use one, but you definitely need to define it in slurm.conf.
Comment 3 David Gloe 2014-03-07 09:03:59 MST
(In reply to Moe Jette from comment #2)
> You definitely do not require a gres plugin to use one, but you definitely
> need to define it in slurm.conf.

It's defined in slurm.conf:
c16817@opal-p2:~> grep craynetwork /etc/opt/slurm/slurm.conf
GresTypes=craynetwork,gpu
NodeName=nid000[24-27] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 RealMemory=32768
NodeName=nid000[32-35] Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4,gpu:1 RealMemory=32768
NodeName=nid000[48-51] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4 RealMemory=65536

I can run with positive values fine, but using 0 is invalid:
c16817@opal-p2:~> srun -n 1 --gres=craynetwork:0 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification
c16817@opal-p2:~> srun -n 1 --gres=craynetwork:1 hostname
nid00032
c16817@opal-p2:~> srun -n 1 --gres=craynetwork:2 hostname
nid00032
Comment 4 Moe Jette 2014-03-07 09:57:34 MST
I've been able to reproduce this.
Comment 5 Moe Jette 2014-03-07 10:38:13 MST
It turns out this failure depends on the order in which the GRES names are listed in GresTypes in slurm.conf. Anyway, a fix is available here:
https://github.com/SchedMD/slurm/commit/22803f86555b97efb22855bec555f5b6413904c6
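
The following is not the code from the commit above, just a minimal sketch of the expected behavior, assuming plain name:count requests (no model types or count suffixes). A request should be valid when every name appears anywhere in GresTypes and every count is a non-negative integer, zero included. The function name `is_valid_gres_request` is hypothetical.

```python
def is_valid_gres_request(spec, gres_types):
    """Check a request such as 'gpu:1,craynetwork:0' against GresTypes.

    spec       -- value given to --gres (plain name or name:count items)
    gres_types -- comma-separated GresTypes value from slurm.conf

    Illustrative only; this is not how slurmctld implements the check.
    """
    # A set makes the lookup independent of the order in GresTypes.
    configured = {name.strip() for name in gres_types.split(",") if name.strip()}

    for item in spec.split(","):
        name, _, count = item.strip().partition(":")
        if name not in configured:
            return False
        if count == "":
            continue  # bare name, treated here as a count of 1
        if not count.isdigit():
            return False  # negative or non-numeric count
        # A count of 0 is a legitimate value and is accepted.
    return True


if __name__ == "__main__":
    for order in ("craynetwork,gpu", "gpu,craynetwork"):
        for spec in ("craynetwork:0", "gpu:0"):
            print(order, spec, is_valid_gres_request(spec, order))
```

Both orderings of GresTypes give the same answer for both requests, which is the property that was missing here.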
Comment 6 David Gloe 2014-08-01 04:09:54 MDT
This is back in 14.11.0-0pre2:

dgloe@tiger:~> srun -n 1 --gres=craynetwork:0 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification
dgloe@tiger:~> srun --version
slurm 14.11.0-0pre2

Also fails for gpu gres:
dgloe@galaxy:~> srun -n 1 --gres=gpu:0 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

The slurmctld log shows
[2014-08-01T11:01:26.304] Invalid gres job specification gpu:0
[2014-08-01T11:01:26.304] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification
Comment 7 Moe Jette 2014-08-01 09:27:26 MDT
(In reply to David Gloe from comment #6)

This was broken when support for GRES model types was added (e.g. "--gres=gpu:kepler:1,gpu:tesla:2"). The bug exists only in the version 14.11 code base and is now fixed in the commit below:
https://github.com/SchedMD/slurm/commit/a2db773549624e141088eda6dc5084b6107b4cc9

I discovered a problem in matching model types for a job step, but that would have no effect without using various GRES model types. In any event, I fixed that also:
https://github.com/SchedMD/slurm/commit/34cc72318ec7281564c85b50138c0ed2abf8a356
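
For readers unfamiliar with the model-type syntax, here is a minimal sketch of how a request such as gpu:kepler:1 could be split into name, model type, and count while keeping an explicit count of 0 distinct from an omitted count. This is illustrative Python, not the Slurm parser. `GresRequest` and `parse_gres` are hypothetical names, and treating a non-numeric second field as a model type with a count of 1 is an assumption of the sketch.

```python
from typing import List, NamedTuple, Optional


class GresRequest(NamedTuple):
    name: str
    model: Optional[str]  # None when no model type was given
    count: int


def parse_gres(spec: str) -> List[GresRequest]:
    """Split a comma-separated gres request into (name, model, count) items."""
    requests = []
    for item in spec.split(","):
        parts = item.strip().split(":")
        if len(parts) == 1:
            # Bare name: no model type, count assumed to be 1.
            name, model, count = parts[0], None, 1
        elif len(parts) == 2 and parts[1].isdigit():
            # name:count, where the count may legitimately be 0.
            name, model, count = parts[0], None, int(parts[1])
        elif len(parts) == 2:
            # name:model with the count omitted (assumed to mean 1).
            name, model, count = parts[0], parts[1], 1
        elif len(parts) == 3:
            # name:model:count; int() raises ValueError on a bad count.
            name, model, count = parts[0], parts[1], int(parts[2])
        else:
            raise ValueError(f"too many fields in gres item: {item!r}")
        if not name:
            raise ValueError(f"missing gres name in item: {item!r}")
        requests.append(GresRequest(name, model, count))
    return requests


if __name__ == "__main__":
    print(parse_gres("gpu:kepler:1,gpu:tesla:2"))
    print(parse_gres("gpu:0"))
```

Note that the count is stored as an integer and never tested for truthiness, so gpu:0 stays a valid request for zero GPUs rather than being mistaken for a missing count.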
Comment 8 David Gloe 2014-08-27 09:20:02 MDT
I'm still seeing this problem, for the craynetwork gres but not gpu:

dgloe@opal-p2:~> srun -N 2 -n 2 --gres=craynetwork:0 hostname
srun: error: Unable to create job step: Invalid generic resource (gres) specification
srun: Force Terminated job 41717
dgloe@opal-p2:~> srun -N 2 -n 2 --gres=gpu:0 hostname
nid00024
nid00025

In the slurmctld log:
[2014-08-27T16:12:50.541] Invalid gres step 41717.4294967294 specification craynetwork:0
[2014-08-27T16:12:50.541] _slurm_rpc_job_step_create for job 41717: Invalid generic resource (gres) specification

In slurm.conf:
GresTypes=craynetwork,gpu
Comment 9 David Gloe 2014-08-27 09:21:51 MDT
This is happening on commit c8a43fe5f0b7cb2760f897e3ebfb149b6c1fc8d0.
Comment 10 David Gloe 2014-08-27 09:26:14 MDT
Also fails with --gres=none

dgloe@opal-p2:~> srun -N 2 -n 2 --gres=none -w nid000[24-25] hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

[2014-08-27T16:24:23.394] Invalid gres job specification none
[2014-08-27T16:24:23.395] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification
Comment 11 Moe Jette 2014-08-27 09:45:15 MDT
It works for me. Check your configuration.

$ srun -N 2 -n 2 --gres=craynetwork:0 hostname
smd1
smd2

My slurm.conf:
GresTypes=gpu,craynetwork
NodeName=...  Gres=gpu:2,craynetwork:4

Plus in gres.conf on the compute nodes:
Name=craynetwork Count=4
Comment 12 David Gloe 2014-08-27 09:48:41 MDT
Looks like it's the ordering problem again. When I switch GresTypes from craynetwork,gpu to gpu,craynetwork, it now fails for gpu but works for craynetwork:

dgloe@opal-p2:~> srun  -n 1 --gres=craynetwork:0 hostname
nid00032
dgloe@opal-p2:~> srun  -n 1 --gres=gpu:0 hostname
srun: error: Unable to create job step: Invalid generic resource (gres) specification
srun: Force Terminated job 41728

Also --gres=none still fails.
Comment 13 Moe Jette 2014-08-27 10:46:49 MDT
Fix for gres count of zero:
https://github.com/SchedMD/slurm/commit/8b220351f9217d58fbfd596cd762b291500fe27b

Fix for --gres=none when run without an existing job allocation:
https://github.com/SchedMD/slurm/commit/bf669dab0fbf281a48d0b4d41381bf16e8d4fcae
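
Since this bug has come back more than once and depends on the GresTypes ordering, it may be worth re-checking every case after each change. Below is a small sketch that generates the srun commands used in this ticket for a given GresTypes value. The helper name `regression_commands` is hypothetical, and the script only prints the commands, it does not run them.

```python
import shlex


def regression_commands(gres_types, nodes=1, tasks=1):
    """Build srun command lines requesting a zero count of each gres, plus 'none'.

    gres_types -- comma-separated GresTypes value from slurm.conf
    """
    names = [n.strip() for n in gres_types.split(",") if n.strip()]
    specs = [f"{name}:0" for name in names] + ["none"]
    return [
        shlex.join(
            ["srun", "-N", str(nodes), "-n", str(tasks), f"--gres={spec}", "hostname"]
        )
        for spec in specs
    ]


if __name__ == "__main__":
    for cmd in regression_commands("craynetwork,gpu"):
        print(cmd)
```

The commands should be run both directly (job allocation path) and inside an salloc session (job step path), and again after swapping the order of the names in GresTypes, since each of those combinations failed at some point in this ticket.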