I'm probably missing something basic, but I'm having trouble getting our GPU node schedulable in Slurm. I think I've gotten it set up, but when I try to run a job asking for GPUs I get:

    tmerritt@junonia:~ $ srun --nodes=2 --ntasks-per-node=1 --mem-per-cpu=1GB --time=01:00:00 --job-name=interactive --account=windfall --gres=gpu:1 --pty bash -i
    srun: error: Unable to allocate resources: Requested node configuration is not available

I have:

    [root@r5u31n1 ~]# cat /etc/slurm/gres.conf
    Name=gpu Type=volta File=/dev/nvidia0
    Name=gpu Type=volta File=/dev/nvidia1
    Name=gpu Type=volta File=/dev/nvidia2
    Name=gpu Type=volta File=/dev/nvidia3

and the relevant slurm.conf settings:

    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory
    GresTypes=gpu
    AccountingStorageTRES=gres/gpu:volta
    NodeName=r5u31n1 Sockets=2 CoresPerSocket=48 ThreadsPerCore=1 RealMemory=515852 State=UNKNOWN Weight=25 CPUSpecList=0,1 Gres=gpu:volta:4
    PartitionName=windfall Default=YES QOS=part_qos_windfall DefaultTime=96:00:00 State=UP Nodes=r1u0[3-9]n[1-2],r1u1[0-8]n[1-2],r1u2[5-9]n[1-2],r1u3[0-6]n[1-2],r2u0[3-9]n[1-2],r2u1[0-8]n[1-2],r2u2[5-9]n[1-2],r2u3[0-6]n[1-2],r5u19n1,r5u24n1,r5u31n1

The device files are there:

    [root@r5u31n1 ~]# ls /dev/nvidia?
    /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3

I'm not sure whether I should see any indication of GPUs being present at slurmd startup, but I don't. All that's there is this:

    [2020-07-22T07:02:29.568] CPUs=96 Boards=1 Sockets=2 Cores=48 Threads=1 Memory=515852 TmpDisk=1364841 Uptime=133711 CPUSpecList=0-1 FeaturesAvail=(null) FeaturesActive=(null)

Thanks!

I started slurmd with -Dvvvv and it does seem to see the GPUs, so perhaps I'm just using srun incorrectly:

    slurmd: debug3: _merge_gres2: From gres.conf, using gpu:volta:1:/dev/nvidia0
    slurmd: debug3: _merge_gres2: From gres.conf, using gpu:volta:1:/dev/nvidia1
    slurmd: debug3: _merge_gres2: From gres.conf, using gpu:volta:1:/dev/nvidia2
    slurmd: debug3: _merge_gres2: From gres.conf, using gpu:volta:1:/dev/nvidia3
    slurmd: debug3: Trying to load plugin /usr/lib64/slurm/gpu_generic.so
    slurmd: debug: init: GPU Generic plugin loaded
    slurmd: debug3: Success.
    slurmd: debug3: gres_device_major : /dev/nvidia0 major 195, minor 0
    slurmd: debug3: gres_device_major : /dev/nvidia1 major 195, minor 1
    slurmd: debug3: gres_device_major : /dev/nvidia2 major 195, minor 2
    slurmd: debug3: gres_device_major : /dev/nvidia3 major 195, minor 3
    slurmd: Gres Name=gpu Type=volta Count=1
    slurmd: Gres Name=gpu Type=volta Count=1
    slurmd: Gres Name=gpu Type=volta Count=1
    slurmd: Gres Name=gpu Type=volta Count=1

I saw the error of my ways. I was accidentally asking for two nodes and I only have one gpu node.
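
For anyone who hits the same message: --gres is a per-node request, so --nodes=2 --gres=gpu:1 asks for two nodes that each have at least one GPU, and with a single GPU node that can never be satisfied. Changing the request to --nodes=1 is what fixes it. Below is a toy Python model of that check, just to illustrate the reasoning. It is not how Slurm's scheduler is actually implemented; the Node class and can_satisfy function are made up for illustration, and the GPU counts for r5u19n1 and r5u24n1 are assumed to be zero because r5u31n1 is the only GPU node.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    gpus: int  # GPU count the node advertises (Gres=gpu:... in slurm.conf)


def can_satisfy(nodes, nodes_requested, gpus_per_node):
    """Toy check: are there enough nodes that EACH have gpus_per_node GPUs?

    This mirrors the per-node meaning of --gres; it is a simplification,
    not Slurm's real selection logic.
    """
    eligible = [n for n in nodes if n.gpus >= gpus_per_node]
    return len(eligible) >= nodes_requested


# Hypothetical slice of the windfall partition: only r5u31n1 has GPUs.
partition = [
    Node("r5u19n1", 0),  # assumed to have no GPUs
    Node("r5u24n1", 0),  # assumed to have no GPUs
    Node("r5u31n1", 4),  # Gres=gpu:volta:4 from the posted config
]

# --nodes=2 --gres=gpu:1 : needs two GPU nodes, only one exists
print(can_satisfy(partition, nodes_requested=2, gpus_per_node=1))

# --nodes=1 --gres=gpu:1 : one GPU node is enough
print(can_satisfy(partition, nodes_requested=1, gpus_per_node=1))
```

The first call corresponds to the failing srun line and the second to the same request with --nodes=1.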