Ticket 5739

Summary: srun --gres=gpu fails with Invalid generic resource (gres) specification
Product: Slurm Reporter: David Gloe <david.gloe>
Component: slurmctld Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 18.08.0   
Hardware: Linux   
OS: Linux   
Site: CRAY Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: Cray Internal
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 18.08.1 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description David Gloe 2018-09-17 06:47:11 MDT
On a Cray internal system with GPUs, a user reported a change in behavior between 17.11.8 and 18.08.0. It appears that --gres now requires a count to be specified.

On 18.08.0:

lanton@tiger:~/cray/gpu-omp-tests> srun -C P100 --gres=gpu  -n 1 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

lanton@tiger:~/cray/gpu-omp-tests> srun -C P100 --gres=gpu:1  -n 1 hostname
nid00192

On 17.11.8:
dgloe@tiger:~> srun --gres=gpu hostname
nid00012
dgloe@tiger:~> srun --version
slurm 17.11.8
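
Until a fix is available, one possible stopgap is to make the count explicit before the request reaches srun, as the working `--gres=gpu:1` case above suggests. The following is only a minimal sketch of that idea: the function name `add_default_gres_count` is hypothetical and not part of Slurm, and it assumes each entry has the form name[:type][:count], that a missing count means 1 (the documented default), and that a trailing component made of digits with an optional K/M/G/T/P suffix is a count.

```python
import re

# Assumption: a count is digits with an optional multiplier suffix.
_COUNT_RE = re.compile(r"^\d+[KMGTP]?$", re.IGNORECASE)


def add_default_gres_count(spec: str) -> str:
    """Return a gres list in which every entry carries an explicit count.

    Hypothetical helper, not part of Slurm. Entries that already end in a
    count are left unchanged; all others get ":1" appended.
    """
    entries = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue  # ignore empty items such as a trailing comma
        last = entry.rsplit(":", 1)[-1]
        if ":" in entry and _COUNT_RE.match(last):
            entries.append(entry)  # count already present
        else:
            entries.append(entry + ":1")  # make the default count explicit
    return ",".join(entries)


if __name__ == "__main__":
    for spec in ("gpu", "gpu:1", "craynetwork:4,gpu"):
        print(spec, "->", add_default_gres_count(spec))
```

A wrapper script could pass the result to srun, so that `--gres=gpu` is submitted as `--gres=gpu:1`. This only works around the symptom on the submitting side and does not change slurmctld's behavior.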

We have the GPU gres defined as follows:
slurm.conf:
NodeName=nid000[12-15,20-23] Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=K40 # RealMemory=32768
NodeName=nid000[24-35] Sockets=1 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=K20 # RealMemory=32768
NodeName=nid000[36-43] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[44-47] Sockets=1 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=K20 # RealMemory=32768
NodeName=nid000[48-59] Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=K40 # RealMemory=32768
NodeName=nid00[192-203,224-231] Sockets=1 CoresPerSocket=18 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=P100 # RealMemory=65536

gres.conf:
NodeName=nid000[12-15,20-23,36-43] Name=craynetwork Count=4
NodeName=nid000[12-15,20-23] Name=gpu File=/dev/nvidia0
NodeName=nid000[24-35,44-47] Name=craynetwork Count=4
NodeName=nid000[24-35,44-47] Name=gpu File=/dev/nvidia0
NodeName=nid000[48-59] Name=craynetwork Count=4
NodeName=nid000[48-59] Name=gpu File=/dev/nvidia0
NodeName=nid00[192-203,224-235] Name=craynetwork Count=4
NodeName=nid00[192-203,224-235] Name=gpu File=/dev/nvidia0

Comment 1 Dominik Bartkiewicz 2018-09-18 08:17:16 MDT
Hi

I am working on it and will let you know when it is fixed.

Dominik

Comment 2 Dominik Bartkiewicz 2018-10-01 03:06:53 MDT
Hi

This commit should fix this issue:
https://github.com/SchedMD/slurm/commit/8042bb5fdc076b4
I'm marking this ticket as resolved/info given.
As always, please feel free to reopen if you have additional questions.

Dominik