Ticket 11697

Summary: Cannot schedule all GPUs on node
Product: Slurm
Reporter: Will Dennis <wdennis>
Component: Scheduling
Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue
Version: 20.11.5
Hardware: Linux
OS: Linux
Site: NEC Labs
Linux Distro: Ubuntu

Description Will Dennis 2021-05-24 17:45:41 MDT
We have a new cluster where (for now) all the GPU nodes are the same: 8 x NVIDIA RTX A6000s each. The GRES is defined in slurm.conf as follows:

NodeName=ma-gpu0[1-4] SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=386890 Gres=gpu:8 Feature=A6000,nvarch_ampere,has_tcores

The gres.conf is using "AutoDetect=nvml", and it seems to be working:

root@ma-slurm-ctlr:~# scontrol show node ma-gpu04
NodeName=ma-gpu04 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=96 CPULoad=0.00
   AvailableFeatures=A6000,nvarch_ampere,has_tcores
   ActiveFeatures=A6000,nvarch_ampere,has_tcores
   Gres=gpu:8(S:0-1)
   NodeAddr=ma-gpu04 NodeHostName=ma-gpu04 Version=20.11.5
   OS=Linux 4.15.0-142-generic #146-Ubuntu SMP Tue Apr 13 01:11:19 UTC 2021
   RealMemory=386890 AllocMem=0 FreeMem=332378 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu
   BootTime=2021-05-03T21:08:20 SlurmdStartTime=2021-05-05T14:58:41
   CfgTRES=cpu=96,mem=386890M,billing=96,gres/gpu=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
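
(For reference, the auto-detect setup described above presumably amounts to a single gres.conf directive, shown below; Will's full gres.conf appears later in comment 6:)

    AutoDetect=nvml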

However, when we go to schedule a job using GPUs, we find that we can only schedule up to 7 of the 8 GPUs on a node. When we try to schedule all 8, we get the following failure:

wdennis@ma-slurm-submit01:~$ srun --pty -c 48 -t 8:00 --gres=gpu:8 --mem=128G /bin/bash -l
srun: error: Unable to create step for job 205: Invalid generic resource (gres) specification

Please help me to figure out why this is happening, and to be able to use all 8 GPUs on my nodes.
Comment 1 Michael Hinton 2021-05-25 09:40:07 MDT
Hi Will,

This is a known issue with AutoDetect. See bug 11693 comment 8 for the workaround. To summarize, turn off AutoDetect and specify your GPUs in gres.conf in order of PCI bus ID, and it should fix the issue.
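
(A sketch of what such a hand-written gres.conf could look like for these nodes, borrowing the Type label and device files that show up later in comment 6 - and note comment 2 below, which drops the bus-ID ordering step:)

    # AutoDetect=nvml
    NodeName=ma-gpu[01-04] Name=gpu Type=a6000 File=/dev/nvidia[0-7]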

Thanks,
-Michael
Comment 2 Michael Hinton 2021-05-25 09:51:24 MDT
(In reply to Michael Hinton from comment #1)
> This is a known issue with AutoDetect. See bug 11693 comment 8 for the
> workaround. To summarize, turn off AutoDetect and specify your GPUs in
> gres.conf in order of PCI bus ID, and it should fix the issue.
Actually, now that I think about it, I don't think the order the GPUs are specified in gres.conf matters, because Links= is not being set incorrectly. So you can skip that step.
Comment 3 Will Dennis 2021-05-25 12:08:50 MDT
FYI:

root@ma-gpu04:~# nvidia-smi -q  | grep -Ei "minor|bus ID"
    Minor Number                          : 3
        Bus Id                            : 00000000:01:00.0
    Minor Number                          : 2
        Bus Id                            : 00000000:25:00.0
    Minor Number                          : 1
        Bus Id                            : 00000000:41:00.0
    Minor Number                          : 0
        Bus Id                            : 00000000:61:00.0
    Minor Number                          : 7
        Bus Id                            : 00000000:81:00.0
    Minor Number                          : 6
        Bus Id                            : 00000000:A1:00.0
    Minor Number                          : 5
        Bus Id                            : 00000000:C1:00.0
    Minor Number                          : 4
        Bus Id                            : 00000000:E1:00.0

CPU: AMD EPYC 7402 24-Core Processor (x 2)
Comment 4 Will Dennis 2021-05-25 12:16:56 MDT
Also:

root@ma-gpu04:~# nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity
GPU0	 X 	NODE	NODE	NODE	SYS	SYS	SYS	SYS	0-23,48-71	0
GPU1	NODE	 X 	NODE	NODE	SYS	SYS	SYS	SYS	0-23,48-71	0
GPU2	NODE	NODE	 X 	NODE	SYS	SYS	SYS	SYS	0-23,48-71	0
GPU3	NODE	NODE	NODE	 X 	SYS	SYS	SYS	SYS	0-23,48-71	0
GPU4	SYS	SYS	SYS	SYS	 X 	NODE	NODE	NODE	24-47,72-95	1
GPU5	SYS	SYS	SYS	SYS	NODE	 X 	NODE	NODE	24-47,72-95	1
GPU6	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE	24-47,72-95	1
GPU7	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 	24-47,72-95	1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
Comment 5 Michael Hinton 2021-05-25 12:23:05 MDT
Perfect; that clearly shows that the device files/minor numbers and the PCI bus ordering are mismatched (for example, minor number 0, i.e. /dev/nvidia0, sits at bus ID 61:00.0 rather than the lowest bus ID, 01:00.0), which is what causes the issue. We've seen this happen on AMD EPYC machines, but it has also come up on certain Intel machines.

Is the workaround clear to you? Or do you need assistance?

-Michael
Comment 6 Will Dennis 2021-05-25 12:54:10 MDT
I re-wrote the gres.conf as shown here:

-----
root@ma-gpu04:~# cat /run/slurm/conf/gres.conf
##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Use NVML to gather GPU configuration information
##################################################################

# GPU auto-detect doesn't work quite right yet, see:
# https://bugs.schedmd.com/show_bug.cgi?id=10827
# https://bugs.schedmd.com/show_bug.cgi?id=11693
# https://bugs.schedmd.com/show_bug.cgi?id=11697
#
# AutoDetect=nvml

# So for now, specify the old/direct way...
NodeName=ma-gpu[01-04] Name=gpu Type=a6000 File=/dev/nvidia[0-7]
-----

But after restarting slurmctld and doing a “scontrol reconfigure” to “HUP” the nodes (we're using configless), I'm still getting an error when I try to schedule 8 GPUs on a node:

wdennis@ma-slurm-submit01:~$ srun --pty -c 48 -t 8:00 --gres=gpu:8 --mem=128G /bin/bash -l
srun: error: Unable to create step for job 212: Invalid generic resource (gres) specification
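
(The restart/reconfigure sequence mentioned above isn't shown in the ticket; on a systemd-managed controller it would presumably look something like this:)

    root@ma-slurm-ctlr:~# systemctl restart slurmctld
    root@ma-slurm-ctlr:~# scontrol reconfigure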
Comment 7 Michael Hinton 2021-05-25 13:42:45 MDT
I believe that an `scontrol reconfigure` may not be enough when changing GRES - could you try explicitly restarting the slurmds to see if that fixes things?
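
(Assuming the slurmds are managed by systemd, that would be something along these lines on each GPU node; the exact service setup on this cluster isn't shown in the ticket:)

    root@ma-gpu04:~# systemctl restart slurmd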
Comment 8 Will Dennis 2021-05-25 16:42:46 MDT
Yes, looks like a restart of slurmd on the login/worker nodes made it work...

wdennis@ma-slurm-submit01:~$ srun --pty -c 48 -t 8:00 --gres=gpu:8 --mem=128G /bin/bash -l
srun: job 214 queued and waiting for resources

(sadly, not enough open GPUs on a node at this point for me to get a shell... but at least the job allocation is working now.)

Any idea on a fix timeline? We have widely disparate GPU nodes in this cluster (since they were bought in groups over a long period of time) and AutoDetect=nvml would save lots of hand-config in gres.conf...
Comment 9 Michael Hinton 2021-05-25 22:07:49 MDT
(In reply to Will Dennis from comment #8)
> Yes, looks like a restart of slurmd on the login/worker nodes made it work...
Great!

> Any idea on a fix timeline? We have widely disparate GPU nodes in this
> cluster (since they were bought in groups over a long period of time) and
> AutoDetect=nvml would save lots of hand-config in gres.conf...
I'm hoping we can get a fix into the next 20.11 minor release, but we'll see.

In the meantime, you can keep AutoDetect on for all GPU nodes where the minor numbers/device files are in ascending PCI bus ID order (you can verify with the command `nvidia-smi -q | grep -Ei "minor|bus ID"`). So it could look something like this:

    AutoDetect=nvml
    NodeName=ma-gpu[01-04] AutoDetect=off Name=gpu Type=a6000 File=/dev/nvidia[0-7]

This would leave AutoDetect on for all nodes but ma-gpu[01-04]. That should reduce the pain while you wait for the patch.

-Michael
Comment 10 Michael Hinton 2021-06-02 17:37:42 MDT
Hey Will,

I'm going to go ahead and mark this as a duplicate of bug 10827. Stay tuned there for a patch.

Thanks!
-Michael

*** This ticket has been marked as a duplicate of ticket 10827 ***