We've defined a generic resource 'craynetwork' without a corresponding craynetwork plugin. The only way to ask for none of this resource is using --gres=none, whereas --gres=<name>:0 works for other generic resources.

c16817@opal-p2:~> srun -n 1 --gres=gpu:0 hostname
nid00032
c16817@opal-p2:~> srun -n 1 --gres=craynetwork:0 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification
c16817@opal-p2:~> salloc -n 1
salloc: Granted job allocation 622
c16817@opal-p2:~> srun -n 1 --gres=craynetwork:0 hostname
srun: error: Unable to create job step: Invalid generic resource (gres) specification

[2014-03-07T16:14:29.659] debug: gres: Couldn't find the specified plugin name for gres/craynetwork looking at all files
[2014-03-07T16:14:29.664] debug: Cannot find plugin of type gres/craynetwork, just track gres count
...
[2014-03-07T16:36:09.516] Invalid gres job specification craynetwork:0
[2014-03-07T16:36:09.516] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification

This makes it impossible for a user to, for example, run a job with the gpu but without using the network (since we have a job_submit plugin that adds the craynetwork gres by default).
Is craynetwork defined in your slurm.conf? For example: GresTypes=craynetwork,gpu,etc. If it is not defined in slurm.conf, it will be reported as an invalid gres.
You definitely do not require a gres plugin to use one, but you definitely need to define it in slurm.conf.
(In reply to Moe Jette from comment #2)
> You definitely do not require a gres plugin to use one, but you definitely
> need to define it in slurm.conf.

It's defined in slurm.conf:

c16817@opal-p2:~> grep craynetwork /etc/opt/slurm/slurm.conf
GresTypes=craynetwork,gpu
NodeName=nid000[24-27] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 RealMemory=32768
NodeName=nid000[32-35] Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4,gpu:1 RealMemory=32768
NodeName=nid000[48-51] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4 RealMemory=65536

I can run with positive values fine, but using 0 is invalid:

c16817@opal-p2:~> srun -n 1 --gres=craynetwork:0 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification
c16817@opal-p2:~> srun -n 1 --gres=craynetwork:1 hostname
nid00032
c16817@opal-p2:~> srun -n 1 --gres=craynetwork:2 hostname
nid00032
I've been able to reproduce this.
It turns out this failure depends on the order in which the GRES are listed in GresTypes in slurm.conf. Anyway, a fix is available here:
https://github.com/SchedMD/slurm/commit/22803f86555b97efb22855bec555f5b6413904c6
This is back in 14.11.0-0pre2:

dgloe@tiger:~> srun -n 1 --gres=craynetwork:0 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification
dgloe@tiger:~> srun --version
slurm 14.11.0-0pre2

Also fails for gpu gres:

dgloe@galaxy:~> srun -n 1 --gres=gpu:0 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

The slurmctld log shows:

[2014-08-01T11:01:26.304] Invalid gres job specification gpu:0
[2014-08-01T11:01:26.304] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification
(In reply to David Gloe from comment #6)
> This is back in 14.11.0-0pre2:
>
> dgloe@tiger:~> srun -n 1 --gres=craynetwork:0 hostname
> srun: error: Unable to allocate resources: Invalid generic resource (gres)
> specification
> dgloe@tiger:~> srun --version
> slurm 14.11.0-0pre2
>
> Also fails for gpu gres:
> dgloe@galaxy:~> srun -n 1 --gres=gpu:0 hostname
> srun: error: Unable to allocate resources: Invalid generic resource (gres)
> specification
>
> The slurmctld log shows
> [2014-08-01T11:01:26.304] Invalid gres job specification gpu:0
> [2014-08-01T11:01:26.304] _slurm_rpc_allocate_resources: Invalid generic
> resource (gres) specification

This was broken when adding support for GRES model types (e.g. "--gres=gpu:kepler:1,gpu:tesla:2"). The bug exists only in the version 14.11 code base and is now fixed in the commit below:
https://github.com/SchedMD/slurm/commit/a2db773549624e141088eda6dc5084b6107b4cc9

I also discovered a problem in matching model types for a job step, but that would have no effect unless various GRES model types are in use. In any event, I fixed that as well:
https://github.com/SchedMD/slurm/commit/34cc72318ec7281564c85b50138c0ed2abf8a356
I'm still seeing this problem, for the craynetwork gres but not gpu:

dgloe@opal-p2:~> srun -N 2 -n 2 --gres=craynetwork:0 hostname
srun: error: Unable to create job step: Invalid generic resource (gres) specification
srun: Force Terminated job 41717
dgloe@opal-p2:~> srun -N 2 -n 2 --gres=gpu:0 hostname
nid00024
nid00025

In the slurmctld log:

[2014-08-27T16:12:50.541] Invalid gres step 41717.4294967294 specification craynetwork:0
[2014-08-27T16:12:50.541] _slurm_rpc_job_step_create for job 41717: Invalid generic resource (gres) specification

In slurm.conf:

GresTypes=craynetwork,gpu
This is happening on c8a43fe5f0b7cb2760f897e3ebfb149b6c1fc8d0.
Also fails with --gres=none:

dgloe@opal-p2:~> srun -N 2 -n 2 --gres=none -w nid000[24-25] hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

[2014-08-27T16:24:23.394] Invalid gres job specification none
[2014-08-27T16:24:23.395] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification
It works for me. Check your configuration.

$ srun -N 2 -n 2 --gres=craynetwork:0 hostname
smd1
smd2

My slurm.conf:

GresTypes=gpu,craynetwork
NodeName=... Gres=gpu:2,craynetwork:4

Plus in gres.conf on the compute nodes:

Name=craynetwork Count=4
Looks like it's the ordering problem again. When I switch GresTypes=craynetwork,gpu to GresTypes=gpu,craynetwork, it now fails for gpu but works for craynetwork:

dgloe@opal-p2:~> srun -n 1 --gres=craynetwork:0 hostname
nid00032
dgloe@opal-p2:~> srun -n 1 --gres=gpu:0 hostname
srun: error: Unable to create job step: Invalid generic resource (gres) specification
srun: Force Terminated job 41728

Also, --gres=none still fails.
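For reference, the behaviour being asked for in this report can be written down as a small Python sketch. This is only an illustration of the expected semantics, not Slurm's parser: the function name parse_gres and the (name, model, count) tuple layout are hypothetical, and it assumes a count defaults to 1 when omitted and ignores count suffixes. The point is that "none" and a count of 0 are accepted, and that the result does not depend on the order of the names in GresTypes.

```python
def parse_gres(spec, gres_types):
    """Parse a --gres style string into a list of (name, model, count).

    Hypothetical helper for illustration only; not Slurm's implementation.
    """
    spec = spec.strip()
    if spec == "none":
        # Explicit request for no generic resources at all.
        return []

    # Membership test on a set, so the configured order cannot matter.
    known = set(gres_types)
    requests = []
    for item in spec.split(","):
        parts = item.split(":")
        name = parts[0]
        if name not in known:
            raise ValueError("Invalid generic resource (gres) specification: " + item)

        model, count = None, 1  # assumed default count of 1
        if len(parts) == 2:
            # Either name:count or name:model.
            if parts[1].isdigit():
                count = int(parts[1])  # a count of 0 is valid
            else:
                model = parts[1]
        elif len(parts) == 3:
            # name:model:count
            if not parts[2].isdigit():
                raise ValueError("Invalid gres count: " + item)
            model, count = parts[1], int(parts[2])
        elif len(parts) > 3:
            raise ValueError("Invalid generic resource (gres) specification: " + item)

        requests.append((name, model, count))
    return requests
```

With this sketch, parse_gres("craynetwork:0", ["craynetwork", "gpu"]) and parse_gres("craynetwork:0", ["gpu", "craynetwork"]) give the same result, which is the property that both configurations in this report should satisfy.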
Fix for a gres count of zero:
https://github.com/SchedMD/slurm/commit/8b220351f9217d58fbfd596cd762b291500fe27b

Fix for --gres=none when run without an existing job allocation:
https://github.com/SchedMD/slurm/commit/bf669dab0fbf281a48d0b4d41381bf16e8d4fcae