5739 – srun --gres=gpu fails with Invalid generic resource (gres) specification

Status:	RESOLVED FIXED

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	slurmctld
Version:	18.08.0
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Dominik Bartkiewicz
Reported:	2018-09-17 06:47 MDT by David Gloe
Modified:	2018-10-01 03:06 MDT
CC List:	0 users

See Also:
Site:	CRAY
Cray Sites:	Cray Internal
Version Fixed:	18.08.1

Description David Gloe 2018-09-17 06:47:11 MDT

On a Cray internal system with GPUs, a user reported a change in behavior between 17.11.8 and 18.08.0. It appears that --gres now requires a count to be specified.

On 18.08.0:

lanton@tiger:~/cray/gpu-omp-tests> srun -C P100 --gres=gpu  -n 1 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

lanton@tiger:~/cray/gpu-omp-tests> srun -C P100 --gres=gpu:1  -n 1 hostname
nid00192

On 17.11.8:
dgloe@tiger:~> srun --gres=gpu hostname
nid00012
dgloe@tiger:~> srun --version
slurm 17.11.8
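
For reference, the srun man page documents each --gres entry as name[[:type]:count], with count defaulting to 1 when omitted, so --gres=gpu and --gres=gpu:1 would be expected to request the same thing. The sketch below illustrates that defaulting rule only. It is not Slurm's parser: the function name parse_gres is hypothetical, and it assumes plain integer counts (no size suffixes) and treats a non-numeric second field as a type.

```python
def parse_gres(spec, default_count=1):
    """Parse a comma-separated gres request such as "gpu", "gpu:2" or "gpu:p100:2".

    Illustrative only (hypothetical helper, not Slurm code).
    Returns a list of (name, type_or_None, count) tuples.
    """
    requests = []
    for item in spec.split(","):
        parts = item.strip().split(":")
        # Reject empty fields ("", "gpu:", ":1") and entries with too many fields.
        if len(parts) > 3 or any(p == "" for p in parts):
            raise ValueError("invalid gres entry %r" % item)

        name = parts[0]
        gres_type = None
        count = default_count  # an omitted count falls back to the default of 1

        if len(parts) == 2:
            # "name:X": X is a count if numeric, otherwise a type.
            if parts[1].isdigit():
                count = int(parts[1])
            else:
                gres_type = parts[1]
        elif len(parts) == 3:
            # "name:type:count": the last field must be a count.
            if not parts[2].isdigit():
                raise ValueError("invalid gres count in %r" % item)
            gres_type = parts[1]
            count = int(parts[2])

        requests.append((name, gres_type, count))
    return requests
```

Under this rule, parse_gres("gpu") and parse_gres("gpu:1") yield the same request, which matches the 17.11.8 behaviour shown above. On 18.08.0, spelling out the count (--gres=gpu:1) sidesteps the error.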

We have the GPU gres defined as follows:
slurm.conf:
NodeName=nid000[12-15,20-23] Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=K40 # RealMemory=32768
NodeName=nid000[24-35] Sockets=1 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=K20 # RealMemory=32768
NodeName=nid000[36-43] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4 # RealMemory=65536
NodeName=nid000[44-47] Sockets=1 CoresPerSocket=10 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=K20 # RealMemory=32768
NodeName=nid000[48-59] Sockets=1 CoresPerSocket=12 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=K40 # RealMemory=32768
NodeName=nid00[192-203,224-231] Sockets=1 CoresPerSocket=18 ThreadsPerCore=2 Gres=craynetwork:4,gpu Feature=P100 # RealMemory=65536

gres.conf:
NodeName=nid000[12-15,20-23,36-43] Name=craynetwork Count=4
NodeName=nid000[12-15,20-23] Name=gpu File=/dev/nvidia0
NodeName=nid000[24-35,44-47] Name=craynetwork Count=4
NodeName=nid000[24-35,44-47] Name=gpu File=/dev/nvidia0
NodeName=nid000[48-59] Name=craynetwork Count=4
NodeName=nid000[48-59] Name=gpu File=/dev/nvidia0
NodeName=nid00[192-203,224-235] Name=craynetwork Count=4
NodeName=nid00[192-203,224-235] Name=gpu File=/dev/nvidia0

Comment 1 Dominik Bartkiewicz 2018-09-18 08:17:16 MDT

Hi

I am working on it; I will let you know when this is fixed.

Dominik

Comment 2 Dominik Bartkiewicz 2018-10-01 03:06:53 MDT

Hi

This commit should fix the issue:
https://github.com/SchedMD/slurm/commit/8042bb5fdc076b4
I'm marking this ticket as resolved/info given.
As always, please feel free to reopen it if you have additional questions.

Dominik