| Summary: | GPU no such file | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | mukunda |
| Component: | Configuration | Assignee: | Ben Glines <ben.glines> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 21.08.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Synopsys | | |
| Attachments: | slurmd log with MIG and missing cap files | | |
|
Description
mukunda
2022-10-17 20:56:58 MDT
I am currently investigating this further to see what may be happening, but the issue you are seeing may be related to bug 15026. That bug was resolved in commit 742e6784 (https://github.com/SchedMD/slurm/commit/742e6784c4ed731db32fe8d6af5dc8a29c155d3a) for 22.05. I'll send you updates as I try to reproduce your scenario and check whether this commit fixes what you are seeing.

Clarification for the previous reply: that fix is in 22.05.5.

Thanks. We will try to upgrade to 22.05 and see if that works.
Regards,
Mukund

Okay, sounds good. Could you also reply with a fuller slurmd log from its startup? I want to see all of the GRES lines. Also, could you manually check whether the /dev/nvidia-caps/nvidia-cap* device files that Slurm is looking for exist?

Created attachment 27409 [details]
slurmd log with MIG and missing cap files

Please find the attached log; it shows the missing cap files.
Thanks,
Mukund

Could you send the full output from `nvidia-smi -l`? Also, could you send your node definitions from your slurm.conf?

(In reply to Ben Glines from comment #8)
> Could you send the full output from `nvidia-smi -l`?
Correction: I meant just `nvidia-smi`.

Here are the details:
```
nvidia-smi -l
Mon Oct 24 21:52:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...   On  | 00000000:01:00.0 Off |                    0 |
| N/A   27C    P0    59W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...   On  | 00000000:41:00.0 Off |                    0 |
| N/A   29C    P0    58W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...   On  | 00000000:81:00.0 Off |                    0 |
| N/A   26C    P0    59W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...   On  | 00000000:C1:00.0 Off |                   On |
| N/A   25C    P0    53W / 500W |     45MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  3    7   0   0  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    8   0   1  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    9   0   2  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   11   0   3  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   12   0   4  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   13   0   5  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   14   0   6  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
The node definition is:

```
NodeName=snpsitml12 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1031800 Gres=gpu:a100_1g.10gb:8 Features=qsc-0,CS7.9 State=IDLE MemSpecLimit=1024
```
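Since the node advertises MIG instances as GRES (gpu:a100_1g.10gb:8), how slurmd discovers those devices matters here. Slurm's documented way to let slurmd enumerate GPUs and MIG instances itself is NVML autodetection in gres.conf; the fragment below is an illustrative sketch and is not taken from this ticket:

```
# gres.conf (sketch, not from this ticket):
# let slurmd enumerate GPUs and MIG instances through the NVIDIA NVML library
AutoDetect=nvml
```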
Also, I had tried 22.05 and it also has missing cap files when slurmd starts, so there is some fundamental issue with the setup.
Regards,
Mukund

Progress is being tracked in Bug 15259. Closing this as a duplicate.

*** This ticket has been marked as a duplicate of ticket 15259 ***

The issue of /dev/nvidia-caps/nvidia-cap* not existing is fixed by a patch that is under review in 15259.
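The manual check requested earlier in the ticket (whether the capability device files Slurm tries to open exist) can be scripted. This is an illustrative sketch, not a command taken from the ticket; the directory layout is the one the ticket itself references (/dev/nvidia-caps/nvidia-cap*):

```shell
# Check whether the MIG capability device files Slurm tries to open exist.
# On a MIG-enabled node the NVIDIA driver exposes these as
# /dev/nvidia-caps/nvidia-cap<minor>; slurmd failing to find them is the
# symptom reported in this ticket.
if [ -d /dev/nvidia-caps ]; then
    ls -l /dev/nvidia-caps/
else
    echo "/dev/nvidia-caps does not exist on this node"
fi
```

On the affected node this should list the nvidia-cap* device files if the driver created them, or report that the directory is missing (the failure mode seen here).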