Ticket 5739 - srun --gres=gpu fails with Invalid generic resource (gres) specification
Summary: srun --gres=gpu fails with Invalid generic resource (gres) specification
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 18.08.0
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
Reported: 2018-09-17 06:47 MDT by David Gloe
Modified: 2018-10-01 03:06 MDT
Site: CRAY
Cray Sites: Cray Internal
Version Fixed: 18.08.1


Description David Gloe 2018-09-17 06:47:11 MDT
On a Cray internal system with GPUs, a user reported a change in behavior between 17.11.8 and 18.08.0. It appears that --gres now requires a count to be specified.

On 18.08.0:

lanton@tiger:~/cray/gpu-omp-tests> srun -C P100 --gres=gpu  -n 1 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

lanton@tiger:~/cray/gpu-omp-tests> srun -C P100 --gres=gpu:1  -n 1 hostname
nid00192

On 17.11.8:
dgloe@tiger:~> srun --gres=gpu hostname
nid00012
dgloe@tiger:~> srun --version
slurm 17.11.8

We have the GPU gres defined as follows:
slurm.conf:
NodeName=nid000[12-15,20-23] Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=K40 # RealMemory=32768
NodeName=nid000[24-35] Sockets=1 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=K20 # RealMemory=32768
NodeName=nid000[36-43] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[44-47] Sockets=1 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=K20 # RealMemory=32768
NodeName=nid000[48-59] Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=K40 # RealMemory=32768
NodeName=nid00[192-203,224-231] Sockets=1 CoresPerSocket=18 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=P100 # RealMemory=65536

gres.conf:
NodeName=nid000[12-15,20-23,36-43] Name=craynetwork Count=4
NodeName=nid000[12-15,20-23] Name=gpu File=/dev/nvidia0
NodeName=nid000[24-35,44-47] Name=craynetwork Count=4
NodeName=nid000[24-35,44-47] Name=gpu File=/dev/nvidia0
NodeName=nid000[48-59] Name=craynetwork Count=4
NodeName=nid000[48-59] Name=gpu File=/dev/nvidia0
NodeName=nid00[192-203,224-235] Name=craynetwork Count=4
NodeName=nid00[192-203,224-235] Name=gpu File=/dev/nvidia0
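
For reference, the behavior we expected can be sketched as below. The srun man page describes each --gres entry as name[[:type]:count] with a default count of 1, so a bare "gpu" should be equivalent to "gpu:1". This is only an illustrative Python sketch, not Slurm's actual parser: the names GresRequest and parse_gres are hypothetical, and it assumes plain integer counts (no K/M/G suffixes).

```python
from typing import List, NamedTuple, Optional


class GresRequest(NamedTuple):
    """One parsed --gres entry (hypothetical type, for illustration only)."""
    name: str
    type: Optional[str]
    count: int


def parse_gres(spec: str) -> List[GresRequest]:
    """Parse a comma-separated gres list such as "gpu", "gpu:2" or "gpu:p100:2".

    Assumption: counts are plain non-negative integers. A missing count
    defaults to 1, which matches the documented name[[:type]:count] format.
    """
    requests = []
    for entry in spec.split(","):
        parts = entry.strip().split(":")
        if len(parts) > 3 or any(part == "" for part in parts):
            raise ValueError(f"invalid gres entry: {entry!r}")

        name, gres_type, count = parts[0], None, 1  # count defaults to 1
        if len(parts) == 2:
            # "gpu:2" carries a count; "gpu:p100" carries a type only.
            if parts[1].isdigit():
                count = int(parts[1])
            else:
                gres_type = parts[1]
        elif len(parts) == 3:
            if not parts[2].isdigit():
                raise ValueError(f"invalid gres count in: {entry!r}")
            gres_type, count = parts[1], int(parts[2])

        requests.append(GresRequest(name, gres_type, count))
    return requests


if __name__ == "__main__":
    # The two requests from this ticket should parse identically.
    print(parse_gres("gpu"))
    print(parse_gres("gpu:1"))
```

Under this reading, the 17.11.8 behavior (accepting --gres=gpu) is the expected one, and specifying --gres=gpu:1 explicitly works on both versions.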
Comment 1 Dominik Bartkiewicz 2018-09-18 08:17:16 MDT
Hi

I am working on it and will let you know when we have a fix.

Dominik
Comment 2 Dominik Bartkiewicz 2018-10-01 03:06:53 MDT
Hi

This commit should fix this issue:
https://github.com/SchedMD/slurm/commit/8042bb5fdc076b4
I'm marking this ticket as resolved/info given.
As always, please feel free to reopen if you have additional questions.

Dominik