| Summary: | Cannot schedule all GPUs on node | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Will Dennis <wdennis> |
| Component: | Scheduling | Assignee: | Director of Support <support> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 20.11.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NEC Labs | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | Ubuntu | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
Description
Will Dennis
2021-05-24 17:45:41 MDT
Michael Hinton

Hi Will,

This is a known issue with AutoDetect. See bug 11693 comment 8 for the workaround. To summarize, turn off AutoDetect and specify your GPUs in gres.conf in order of PCI bus ID, and it should fix the issue.

Thanks,
-Michael

Michael Hinton

(In reply to Michael Hinton from comment #1)
> This is a known issue with AutoDetect. See bug 11693 comment 8 for the
> workaround. To summarize, turn off AutoDetect and specify your GPUs in
> gres.conf in order of PCI bus ID, and it should fix the issue.

Actually, now that I think about it, I don't think the order in which the GPUs are specified in gres.conf matters, because Links= is not being set incorrectly. So you can proceed without doing that step.

Will Dennis

FYI:
root@ma-gpu04:~# nvidia-smi -q | grep -Ei "minor|bus ID"
Minor Number : 3
Bus Id : 00000000:01:00.0
Minor Number : 2
Bus Id : 00000000:25:00.0
Minor Number : 1
Bus Id : 00000000:41:00.0
Minor Number : 0
Bus Id : 00000000:61:00.0
Minor Number : 7
Bus Id : 00000000:81:00.0
Minor Number : 6
Bus Id : 00000000:A1:00.0
Minor Number : 5
Bus Id : 00000000:C1:00.0
Minor Number : 4
Bus Id : 00000000:E1:00.0
CPU: AMD EPYC 7402 24-Core Processor (x 2)
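A quick way to see the mismatch at a glance is to pair each PCI bus ID with its device minor number and sort by bus ID; on an unaffected node the minors come out in ascending order. This is a minimal sketch (not part of the original ticket), and it assumes the NVIDIA kernel driver's standard /proc interface (/proc/driver/nvidia/gpus/<bus-id>/information) is present:

-----
# Sketch: print "PCI bus ID -> /dev/nvidiaN" pairs, sorted by bus ID.
# On a node like ma-gpu04 above, the minors come out as 3 2 1 0 7 6 5 4
# instead of 0..7. Assumes /proc/driver/nvidia/gpus/<bus-id>/information
# exists (standard with the NVIDIA kernel driver).
for d in /proc/driver/nvidia/gpus/*/; do
    bus_id=$(basename "$d")
    minor=$(awk '/^Device Minor/ {print $NF}' "${d}information")
    printf '%s -> /dev/nvidia%s\n' "$bus_id" "$minor"
done | sort
-----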
Also:

root@ma-gpu04:~# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-23,48-71      0
GPU1    NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     0-23,48-71      0
GPU2    NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     0-23,48-71      0
GPU3    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     0-23,48-71      0
GPU4    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    24-47,72-95     1
GPU5    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    24-47,72-95     1
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    24-47,72-95     1
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      24-47,72-95     1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Michael Hinton

Perfect; that clearly shows that the device files/minor numbers and the PCI bus order are mismatched, causing issues. We've seen this happen with AMD EPYC machines, but it has also happened on certain Intel machines, too.

Is the workaround clear to you? Or do you need assistance?

-Michael

Will Dennis

I re-wrote the gres.conf as shown here:

-----
root@ma-gpu04:~# cat /run/slurm/conf/gres.conf
##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Use NVML to gather GPU configuration information
##################################################################
# GPU auto-detect doesn't work quite right yet, see:
#   https://bugs.schedmd.com/show_bug.cgi?id=10827
#   https://bugs.schedmd.com/show_bug.cgi?id=11693
#   https://bugs.schedmd.com/show_bug.cgi?id=11697
#
# AutoDetect=nvml
# So for now, specify the old/direct way...
NodeName=ma-gpu[01-04] Name=gpu Type=a6000 File=/dev/nvidia[0-7]
-----

But after restarting slurmctld and doing an "scontrol reconfigure" to "HUP" the nodes (using configless), I'm still getting an error when I try to schedule 8 GPUs per node:

wdennis@ma-slurm-submit01:~$ srun --pty -c 48 -t 8:00 --gres=gpu:8 --mem=128G /bin/bash -l
srun: error: Unable to create step for job 212: Invalid generic resource (gres) specification

Michael Hinton

I believe that an `scontrol reconfigure` may not be enough when changing GRES - could you try explicitly restarting the slurmds and see if that fixes things?
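For reference, the restart-and-verify step looks roughly like the following. This is a sketch, not from the ticket: it assumes systemd-managed slurmd daemons and root SSH access to the nodes, and the loop over ma-gpu[01-04] is illustrative.

-----
# Sketch: restart slurmd on the GPU nodes after editing gres.conf, then check
# that the controller registers all 8 GPUs per node. Assumes systemd units and
# root SSH access; adapt to your own configuration management.
for n in ma-gpu01 ma-gpu02 ma-gpu03 ma-gpu04; do
    ssh "root@$n" systemctl restart slurmd
done

# From a login node, confirm the registered GRES:
sinfo -N -o '%N %G' | grep ma-gpu
scontrol show node ma-gpu04 | grep -i gres
-----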
Will Dennis

Yes, looks like a restart of slurmd on the login/worker nodes made it work...

wdennis@ma-slurm-submit01:~$ srun --pty -c 48 -t 8:00 --gres=gpu:8 --mem=128G /bin/bash -l
srun: job 214 queued and waiting for resources

(Sadly, not enough open GPUs on a node at this point for me to get a shell... but at least the job allocation is working now.)

Any idea on a fix timeline? We have widely disparate GPU nodes in this cluster (since they were bought in groups over a long period of time), and AutoDetect=nvml would save lots of hand-config in gres.conf...

Michael Hinton

(In reply to Will Dennis from comment #8)
> Yes, looks like a restart of slurmd on the login/worker nodes made it work...
Great!

> Any idea on a fix timeline? We have widely disparate GPU nodes in this
> cluster (since they were bought in groups over a long period of time) and
> AutoDetect=nvml would save lots of hand-config in gres.conf...
I'm hoping we can get a fix into the next 20.11 minor release, but we'll see.

In the meantime, you can keep AutoDetect on for all GPU nodes where the minor numbers/device files are ordered in ascending PCI bus ID order (you can verify with the command `nvidia-smi -q | grep -Ei "minor|bus ID"`). So it could look something like this:

-----
AutoDetect=nvml
NodeName=ma-gpu[01-04] AutoDetect=off Name=gpu Type=a6000 File=/dev/nvidia[0-7]
-----

This would leave AutoDetect on for all nodes but ma-gpu[01-04]. That should reduce the pain while you wait for the patch.

-Michael

Michael Hinton

Hey Will,

I'm going to go ahead and mark this as a duplicate of bug 10827. Stay tuned there for a patch.

Thanks!
-Michael

*** This ticket has been marked as a duplicate of ticket 10827 ***
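As a sanity check for the mixed-AutoDetect workaround above, the GRES a node ends up with can be inspected both on the node and from the controller. A short sketch using standard Slurm commands, with node names as in this ticket:

-----
# Sketch: verify the GRES under the mixed gres.conf.
# On the node itself (as root), print the GRES slurmd detects/parses and exit:
slurmd -G

# From a login node, confirm the controller shows gpu:a6000:8 on each node:
scontrol show node ma-gpu[01-04] | grep -iE 'NodeName=|Gres='
-----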