15535 – GPU setup help

Status:	RESOLVED INFOGIVEN

Product:	Slurm
Classification:	Unclassified
Component:	User Commands
Version:	22.05.6
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Jason Booth

Reported:	2022-12-01 13:20 MST by DRW GridOps
Modified:	2022-12-02 08:33 MST
CC List:	0 users

Site:	DRW Trading

Attachments
slurm.conf from the relevant cluster (5.33 KB, text/plain) 2022-12-01 13:20 MST, DRW GridOps

Description DRW GridOps 2022-12-01 13:20:19 MST

Created attachment 27995
slurm.conf from the relevant cluster

Hello, I'm trying to add a single GPU to my test cluster. I've got the card in, CUDA installed, and the gres showing up in sinfo, but I am unable to actually run jobs against it. I'm not sure what I've got wrong here, and although I can find numerous examples online, none seem to help. Could I get some insight into what I'm doing wrong?

The gist of it is that I have a single R650 compute node (chhq-sulgcmp006) in the cluster with a Tesla A2. There are no other GPUs. I have AutoDetect disabled because we did not compile Slurm with NVML support, but from what I have read it should not be needed.

My gres.conf:
AutoDetect=off
NodeName=chhq-sulgcmp006 Name=gpu Type=A2 File=/dev/nvidia0 COREs=0 

Relevant bits of my slurm.conf (will attach the whole thing as well):
GresTypes=disk,io_disk,io_nic,gpu
NodeName=DEFAULT Sockets=2 CoresPerSocket=18 RealMemory=500000 ThreadsPerCore=1
NodeName=chhq-sulgcmp006 TmpDisk=4000 Gres=disk:20000,io_nic:1,io_disk:1,gpu:A2:1

sinfo:
# sinfo -o "%20N  %6t %3c  %10m  %10f %50G"
NODELIST              STATE  CPU  MEMORY      AVAIL_FEAT GRES
chhq-sulgcmp006,chhq  idle   4+   12000+      (null)     disk:20000,io_disk:1,io_nic:1
chhq-vulgcmp[003-005  idle   4    12000       (null)     disk:10000,io_disk:1,io_nic:1

srun attempt:
$ srun --gres=gpu:1 uname -n
srun: error: Unable to allocate resources: Requested node configuration is not available
$ srun --gres=gpu:A2:1 uname -n
srun: error: Unable to allocate resources: Requested node configuration is not available

Help?

Comment 1 Jason Booth 2022-12-01 16:08:10 MST

Please try the following:


> $ srun -p lab21pbar --gres=gpu:1 uname -n
> $ srun -p lab21pbar --gres=gpu:A2:1 uname -n

Note the addition of "-p lab21pbar", the partition that the node is configured to run in.

The default partition is "lab21pmain". As the error states, the request could never run there, since none of the nodes in that partition has the GPU.
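
For anyone hitting the same error, a minimal sketch of how the partition mismatch could be checked from the command line. It assumes bash and a working sinfo on the submit host. The helper names node_partitions and node_in_default_partition are hypothetical and not part of Slurm. The sketch relies on sinfo's %P field, which prints the partition name followed by "*" for the default partition.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Slurm): show which partitions a node
# belongs to and whether any of them is the cluster's default partition.

# Print one partition name per line for the given node.
# -h suppresses the header, -N gives node-oriented output,
# -n restricts the output to the named node, %P is the partition name.
node_partitions() {
    local node="$1"
    sinfo -h -N -n "$node" -o '%P'
}

# Succeed (exit status 0) only if the node is in the default partition.
# sinfo marks the default partition with a trailing "*" in the %P field.
node_in_default_partition() {
    node_partitions "$1" | grep -q '\*$'
}

# Example usage, assuming the node name from this ticket:
#   node_partitions chhq-sulgcmp006
#   node_in_default_partition chhq-sulgcmp006 || echo "pass -p <partition> to srun"
```

If the second helper fails for the GPU node, a job submitted without "-p" goes to the default partition and cannot be satisfied there, which matches the "Requested node configuration is not available" error above.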

Comment 2 DRW GridOps 2022-12-02 07:46:39 MST

oh my god.

Please close this ticket as 'user is a moron'.  Thank you.

Comment 3 Jason Booth 2022-12-02 08:33:49 MST

Resolving