| Summary: | Cannot specify 0 value for gres | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | David Gloe <david.gloe> |
| Component: | Other | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | da |
| Version: | 14.03.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CRAY | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 14.11.0-pre4 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
David Gloe
2014-03-07 08:48:46 MST
In your slurm.conf is craynetwork defined?
GresTypes=craynetwork,gpu,etc.
If not defined in slurm.conf, it's going to be reported as an invalid gres.

You definitely do not require a gres plugin to use one, but you definitely need to define it in slurm.conf.

(In reply to Moe Jette from comment #2)
> You definitely do not require a gres plugin to use one, but you definitely
> need to define it in slurm.conf.

It's defined in slurm.conf:

c16817@opal-p2:~> grep craynetwork /etc/opt/slurm/slurm.conf
GresTypes=craynetwork,gpu
NodeName=nid000[24-27] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4 RealMemory=32768
NodeName=nid000[32-35] Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4,gpu:1 RealMemory=32768
NodeName=nid000[48-51] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=craynetwork:4 RealMemory=65536

I can run with positive values fine, but using 0 is invalid:

c16817@opal-p2:~> srun -n 1 --gres=craynetwork:0 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification
c16817@opal-p2:~> srun -n 1 --gres=craynetwork:1 hostname
nid00032
c16817@opal-p2:~> srun -n 1 --gres=craynetwork:2 hostname
nid00032

I've been able to reproduce this. It turns out this failure depends on the order of the GRES names defined in GresTypes in slurm.conf.
Anyway, fix available here:
https://github.com/SchedMD/slurm/commit/22803f86555b97efb22855bec555f5b6413904c6

This is back in 14.11.0-0pre2:

dgloe@tiger:~> srun -n 1 --gres=craynetwork:0 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification
dgloe@tiger:~> srun --version
slurm 14.11.0-0pre2

Also fails for gpu gres:

dgloe@galaxy:~> srun -n 1 --gres=gpu:0 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

The slurmctld log shows

[2014-08-01T11:01:26.304] Invalid gres job specification gpu:0
[2014-08-01T11:01:26.304] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification

(In reply to David Gloe from comment #6)
> This is back in 14.11.0-0pre2:
>
> dgloe@tiger:~> srun -n 1 --gres=craynetwork:0 hostname
> srun: error: Unable to allocate resources: Invalid generic resource (gres)
> specification
> dgloe@tiger:~> srun --version
> slurm 14.11.0-0pre2
>
> Also fails for gpu gres:
> dgloe@galaxy:~> srun -n 1 --gres=gpu:0 hostname
> srun: error: Unable to allocate resources: Invalid generic resource (gres)
> specification
>
> The slurmctld log shows
> [2014-08-01T11:01:26.304] Invalid gres job specification gpu:0
> [2014-08-01T11:01:26.304] _slurm_rpc_allocate_resources: Invalid generic
> resource (gres) specification

This was broken when adding support for GRES model types (e.g. "--gres=gpu:kepler:1,gpu:tesla:2"). The bug exists only in the version 14.11 code base and is now fixed in the commit below:
https://github.com/SchedMD/slurm/commit/a2db773549624e141088eda6dc5084b6107b4cc9

I discovered a problem in matching model types for a job step, but that would have no effect without using various GRES model types.
In any event, I fixed that also:
https://github.com/SchedMD/slurm/commit/34cc72318ec7281564c85b50138c0ed2abf8a356

I'm still seeing this problem, for the craynetwork gres but not gpu:

dgloe@opal-p2:~> srun -N 2 -n 2 --gres=craynetwork:0 hostname
srun: error: Unable to create job step: Invalid generic resource (gres) specification
srun: Force Terminated job 41717
dgloe@opal-p2:~> srun -N 2 -n 2 --gres=gpu:0 hostname
nid00024
nid00025

In the slurmctld log:

[2014-08-27T16:12:50.541] Invalid gres step 41717.4294967294 specification craynetwork:0
[2014-08-27T16:12:50.541] _slurm_rpc_job_step_create for job 41717: Invalid generic resource (gres) specification

In slurm.conf:
GresTypes=craynetwork,gpu

This is happening on c8a43fe5f0b7cb2760f897e3ebfb149b6c1fc8d0.

Also fails with --gres=none

dgloe@opal-p2:~> srun -N 2 -n 2 --gres=none -w nid000[24-25] hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

[2014-08-27T16:24:23.394] Invalid gres job specification none
[2014-08-27T16:24:23.395] _slurm_rpc_allocate_resources: Invalid generic resource (gres) specification

It works for me. Check your configuration.

$ srun -N 2 -n 2 --gres=craynetwork:0 hostname
smd1
smd2

My slurm.conf:
GresTypes=gpu,craynetwork
NodeName=... Gres=gpu:2,craynetwork:4

Plus in gres.conf on the compute nodes:
Name=craynetwork Count=4

Looks like it's the ordering problem again. When I switch GresTypes=craynetwork,gpu to gpu,craynetwork now it fails for gpu but works for craynetwork:

dgloe@opal-p2:~> srun -n 1 --gres=craynetwork:0 hostname
nid00032
dgloe@opal-p2:~> srun -n 1 --gres=gpu:0 hostname
srun: error: Unable to create job step: Invalid generic resource (gres) specification
srun: Force Terminated job 41728

Also --gres=none still fails.

Fix for gres count of zero:
https://github.com/SchedMD/slurm/commit/8b220351f9217d58fbfd596cd762b291500fe27b

Fix for gres=none when run without having an existing job allocation:
https://github.com/SchedMD/slurm/commit/bf669dab0fbf281a48d0b4d41381bf16e8d4fcae
|
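
For readers following the thread, the behaviour the reporter expected can be summarised in a small sketch. This is an illustration only, not Slurm's code and not what the commits above do: `parse_gres` and `GresRequest` are hypothetical names, and the defaults (a missing count meaning 1, a purely numeric last field being the count) are assumptions made for the example. It shows the three properties discussed in this bug: a count of 0 is accepted, "none" yields an empty request, and the result does not depend on the order of the names in GresTypes.

```python
from typing import List, NamedTuple, Optional, Sequence


class GresRequest(NamedTuple):
    """One parsed GRES request (hypothetical type, for illustration only)."""
    name: str
    model: Optional[str]
    count: int


def parse_gres(spec: str, gres_types: Sequence[str]) -> List[GresRequest]:
    """Parse a --gres style string such as "gpu:kepler:1,craynetwork:0".

    Assumptions of this sketch (not taken from Slurm's source):
      * "none" or an empty string means "request no GRES".
      * A missing count defaults to 1; a count of 0 is valid.
      * A trailing all-digit field is the count, anything before it is
        the model type. Count suffixes are not handled.
    """
    spec = spec.strip()
    if spec == "" or spec.lower() == "none":
        return []

    # Membership test on a set, so the order of GresTypes cannot matter.
    configured = set(gres_types)
    requests = []
    for item in spec.split(","):
        fields = item.strip().split(":")
        name, rest = fields[0], fields[1:]
        if name not in configured:
            raise ValueError(f"Invalid generic resource (gres) specification: {item!r}")

        count = 1
        if rest and rest[-1].isdigit():
            count = int(rest.pop())  # zero is deliberately allowed here
        if len(rest) > 1 or (rest and rest[0] == ""):
            raise ValueError(f"Invalid generic resource (gres) specification: {item!r}")

        model = rest[0] if rest else None
        requests.append(GresRequest(name, model, count))
    return requests


if __name__ == "__main__":
    for order in (["craynetwork", "gpu"], ["gpu", "craynetwork"]):
        print(order, parse_gres("craynetwork:0", order), parse_gres("gpu:0", order))
    print(parse_gres("none", ["craynetwork", "gpu"]))
    print(parse_gres("gpu:kepler:1,gpu:tesla:2", ["gpu"]))
```

Running the script with both GresTypes orderings gives the same parsed requests for craynetwork:0 and gpu:0, which is the property the two reports above found missing.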