Ticket 15535

Summary: GPU setup help
Product: Slurm Reporter: DRW GridOps <gridadm>
Component: User Commands    Assignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 22.05.6   
Hardware: Linux   
OS: Linux   
Site: DRW Trading
Attachments: slurm.conf from the relevant cluster

Description DRW GridOps 2022-12-01 13:20:19 MST
Created attachment 27995 [details]
slurm.conf from the relevant cluster

Hello, I'm trying to add a single GPU to my test cluster.  I've got the card in and CUDA installed, and the gres shows up in sinfo, but I'm unable to actually run jobs against it.  I'm not sure what I've got wrong here, and although I can find numerous examples online, none seem to help.  Could I get some insight into what I'm doing wrong?

The gist of it is that I have a single R650 compute node (chhq-sulgcmp006) in the cluster with a Tesla A2.  There are no other GPUs.  I have AutoDetect disabled because we did not compile Slurm with NVML support, but from what I have read it should not be needed.

My gres.conf:
AutoDetect=off
NodeName=chhq-sulgcmp006 Name=gpu Type=A2 File=/dev/nvidia0 COREs=0 
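
(Editor's note, a suggested sanity check not in the original report: one way to confirm the controller actually registered the gres on the node is to query it with scontrol, which prints the configured Gres= line when gres.conf and slurm.conf agree:)

$ scontrol show node chhq-sulgcmp006 | grep -i gres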

Relevant bits of my slurm.conf (will attach the whole thing as well):
GresTypes=disk,io_disk,io_nic,gpu
NodeName=DEFAULT Sockets=2 CoresPerSocket=18 RealMemory=500000 ThreadsPerCore=1
NodeName=chhq-sulgcmp006 TmpDisk=4000 Gres=disk:20000,io_nic:1,io_disk:1,gpu:A2:1

sinfo:
# sinfo -o "%20N  %6t %3c  %10m  %10f %50G"
NODELIST              STATE  CPU  MEMORY      AVAIL_FEAT GRES
chhq-sulgcmp006,chhq  idle   4+   12000+      (null)     disk:20000,io_disk:1,io_nic:1
chhq-vulgcmp[003-005  idle   4    12000       (null)     disk:10000,io_disk:1,io_nic:1

srun attempt:
$ srun --gres=gpu:1 uname -n
srun: error: Unable to allocate resources: Requested node configuration is not available
$ srun --gres=gpu:A2:1 uname -n
srun: error: Unable to allocate resources: Requested node configuration is not available
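
(Editor's note, an assumed diagnostic not in the original report: when a request is reported as unsatisfiable, listing each partition, its default flag, and its member nodes shows where a --gres=gpu request can actually land:)

$ scontrol show partition | grep -E 'PartitionName|Default|Nodes='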

Help?
Comment 1 Jason Booth 2022-12-01 16:08:10 MST
Please try the following:


> $ srun -p lab21pbar --gres=gpu:1 uname -n
> $ srun -p lab21pbar --gres=gpu:A2:1 uname -n

Note the addition of "-p lab21pbar", the partition the node is configured to run in.

The default partition is "lab21pbar"'s sibling "lab21pmain"; as the error states, the request could never be satisfied there, since no node in that partition has the GPU.
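
(Editor's note, an assumed verification step not in the original reply: sinfo can confirm the node-to-partition mapping and the gres the node advertises:)

> $ sinfo -N -n chhq-sulgcmp006 -o "%P %N %G"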
Comment 2 DRW GridOps 2022-12-02 07:46:39 MST
oh my god.

Please close this ticket as 'user is a moron'.  Thank you.
Comment 3 Jason Booth 2022-12-02 08:33:49 MST
Resolving