Ticket 9450

Summary: trouble configuring gpus
Product: Slurm
Reporter: Todd Merritt <tmerritt>
Component: GPU
Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 19.05.6   
Hardware: Linux   
OS: Linux   
Site: U of AZ

Description Todd Merritt 2020-07-22 08:14:51 MDT
I'm probably missing something basic, but I'm having trouble getting our GPU node schedulable in Slurm.

I think I've gotten it set up, but when I try to run a job asking for GPUs I get:

tmerritt@junonia:~ $ srun --nodes=2 --ntasks-per-node=1 --mem-per-cpu=1GB --time=01:00:00 --job-name=interactive --account=windfall --gres=gpu:1  --pty bash -i
srun: error: Unable to allocate resources: Requested node configuration is not available

I have the following in gres.conf and slurm.conf:

[root@r5u31n1 ~]# cat /etc/slurm/gres.conf 
Name=gpu Type=volta  File=/dev/nvidia0
Name=gpu Type=volta  File=/dev/nvidia1
Name=gpu Type=volta  File=/dev/nvidia2
Name=gpu Type=volta  File=/dev/nvidia3

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

GresTypes=gpu
AccountingStorageTRES=gres/gpu:volta

NodeName=r5u31n1 Sockets=2 CoresPerSocket=48 ThreadsPerCore=1 RealMemory=515852 State=UNKNOWN Weight=25 CPUSpecList=0,1 Gres=gpu:volta:4
PartitionName=windfall Default=YES QOS=part_qos_windfall DefaultTime=96:00:00 State=UP Nodes=r1u0[3-9]n[1-2],r1u1[0-8]n[1-2],r1u2[5-9]n[1-2],r1u3[0-6]n[1-2],r2u0[3-9]n[1-2],r2u1[0-8]n[1-2],r2u2[5-9]n[1-2],r2u3[0-6]n[1-2],r5u19n1,r5u24n1,r5u31n1

[root@r5u31n1 ~]# ls /dev/nvidia?
/dev/nvidia0  /dev/nvidia1  /dev/nvidia2  /dev/nvidia3

I'm not sure whether I should see any indication of GPUs being present at slurmd startup, but I don't. All that's there is this:
[2020-07-22T07:02:29.568] CPUs=96 Boards=1 Sockets=2 Cores=48 Threads=1 Memory=515852 TmpDisk=1364841 Uptime=133711 CPUSpecList=0-1 FeaturesAvail=(null) FeaturesActive=(null)

Thanks!
Comment 1 Todd Merritt 2020-07-22 10:17:55 MDT
I started slurmd with -Dvvvv and it does seem to see the GPUs, so perhaps I'm just using srun incorrectly:

slurmd: debug3: _merge_gres2: From gres.conf, using gpu:volta:1:/dev/nvidia0
slurmd: debug3: _merge_gres2: From gres.conf, using gpu:volta:1:/dev/nvidia1
slurmd: debug3: _merge_gres2: From gres.conf, using gpu:volta:1:/dev/nvidia2
slurmd: debug3: _merge_gres2: From gres.conf, using gpu:volta:1:/dev/nvidia3
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/gpu_generic.so
slurmd: debug:  init: GPU Generic plugin loaded
slurmd: debug3: Success.
slurmd: debug3: gres_device_major : /dev/nvidia0 major 195, minor 0
slurmd: debug3: gres_device_major : /dev/nvidia1 major 195, minor 1
slurmd: debug3: gres_device_major : /dev/nvidia2 major 195, minor 2
slurmd: debug3: gres_device_major : /dev/nvidia3 major 195, minor 3
slurmd: Gres Name=gpu Type=volta Count=1
slurmd: Gres Name=gpu Type=volta Count=1
slurmd: Gres Name=gpu Type=volta Count=1
slurmd: Gres Name=gpu Type=volta Count=1
Comment 2 Todd Merritt 2020-07-22 10:23:00 MDT
I saw the error of my ways. I was accidentally asking for two nodes (--nodes=2), and I only have one GPU node.
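
The mismatch described above can be caught before submitting. Because --gres is a per-node request, a job with --nodes=2 --gres=gpu:1 needs a GPU on each of two nodes. The following is a minimal sketch of such a check, not part of Slurm: the helper name count_gpu_nodes is hypothetical, and it assumes that sinfo -N -h -o "%N %G" prints one "node gres" pair per line, with "(null)" for nodes that have no GRES.

```shell
# count_gpu_nodes (hypothetical helper): read "NODE GRES" lines on stdin
# and print how many distinct nodes advertise a gpu GRES.
count_gpu_nodes() {
    awk '$2 ~ /(^|,)gpu/ { seen[$1] = 1 }
         END { n = 0; for (k in seen) n++; print n }'
}

# Usage on a cluster (assumes the windfall partition from this ticket):
#   sinfo -N -h -p windfall -o "%N %G" | count_gpu_nodes
# If the printed count is lower than the --nodes value, a --gres=gpu:N
# request cannot be satisfied. With a single GPU node, request one node:
#   srun --nodes=1 --ntasks-per-node=1 --mem-per-cpu=1GB --time=01:00:00 \
#        --job-name=interactive --account=windfall --gres=gpu:1 --pty bash -i
```

With the configuration shown in this ticket, only r5u31n1 carries Gres=gpu:volta:4, so the expected count for the windfall partition would be 1.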