Ticket 15204

Summary: GPU no such file
Product: Slurm    Reporter: mukunda
Component: Configuration    Assignee: Ben Glines <ben.glines>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue
Priority: ---
Version: 21.08.6
Hardware: Linux
OS: Linux
Site: Synopsys
Attachments: slurmd log with MIG and missing cap files

Description mukunda 2022-10-17 20:56:58 MDT
In the slurmd startup log at debug2 level, we are seeing errors like this:

slurmd: debug:      GRES[gpu] Type:a100_1g.10gb Count:1 Cores(64):16-31  Links:0,0,0,0,0,-1,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3,/dev/nvidia-caps/nvidia-cap390,/dev/nvidia-caps/nvidia-cap391 UniqueId:MIG-GPU-ce145927-1844-28fd-251b-c665dc1fa511/13/0
slurmd: debug:      GRES[gpu] Type:a100_1g.10gb Count:1 Cores(64):16-31  Links:0,0,0,0,0,-1,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2,/dev/nvidia-caps/nvidia-cap516,/dev/nvidia-caps/nvidia-cap517 UniqueId:MIG-GPU-30e7d5a8-cd64-9239-7133-4d2f7e6e9562/12/0
slurmd: debug:      GRES[gpu] Type:a100_1g.10gb Count:1 Cores(64):0-15  Links:0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia1,/dev/nvidia-caps/nvidia-cap120,/dev/nvidia-caps/nvidia-cap121 UniqueId:MIG-GPU-fd4e1dd1-b91e-5943-eea7-fb298b303091/13/0
slurmd: debug:      GRES[gpu] Type:a100_1g.10gb Count:1 Cores(64):0-15  Links:0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia0,/dev/nvidia-caps/nvidia-cap264,/dev/nvidia-caps/nvidia-cap265 UniqueId:MIG-GPU-7dcc8c3b-3a54-592e-6001-7f0cdc745d98/14/0
slurmd: debug:      GRES[gpu] Type:a100_1g.10gb Count:1 Cores(64):16-31  Links:0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3,/dev/nvidia-caps/nvidia-cap399,/dev/nvidia-caps/nvidia-cap400 UniqueId:MIG-GPU-ce145927-1844-28fd-251b-c665dc1fa511/14/0
slurmd: debug:      GRES[gpu] Type:a100_1g.10gb Count:1 Cores(64):16-31  Links:0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2,/dev/nvidia-caps/nvidia-cap525,/dev/nvidia-caps/nvidia-cap526 UniqueId:MIG-GPU-30e7d5a8-cd64-9239-7133-4d2f7e6e9562/13/0
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap93): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap94): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap498): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap499): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap264): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap265): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap399): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap400): No such file or directory


We are not sure what the problem is with these devices. This is with AutoDetect=nvml in gres.conf.

It looks like around 8 devices are being misconfigured by the autodetection, but we are not sure. Please let us know.

We have 28 MIG devices, and 8 of them appear to be bad due to the above errors.
How can we configure Slurm to use only the 20 good ones and ignore the other 8?
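
One possible workaround, sketched here only as an illustration and not confirmed anywhere in this ticket, would be to drop AutoDetect=nvml on the affected node and list the good MIG devices explicitly in gres.conf. The MultipleFiles parameter groups a MIG device with its cap files; the device paths below are copied from the log above purely as examples:

    # gres.conf -- hypothetical manual configuration with AutoDetect removed
    NodeName=snpsitml12 Name=gpu Type=a100_1g.10gb MultipleFiles=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap390,/dev/nvidia-caps/nvidia-cap391
    NodeName=snpsitml12 Name=gpu Type=a100_1g.10gb MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap516,/dev/nvidia-caps/nvidia-cap517
    # ...one line per remaining good MIG device...

The Gres= count in the node's slurm.conf definition would then need to be lowered to match the number of gres.conf entries.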
Comment 1 Ben Glines 2022-10-19 13:44:42 MDT
I am currently investigating this further to see what may be happening, but it seems that the issue you are seeing may be related to bug 15026. That bug was resolved in commit 742e6784 (https://github.com/SchedMD/slurm/commit/742e6784c4ed731db32fe8d6af5dc8a29c155d3a) for 22.05.

I'll send you updates as I try to reproduce your scenario and check if this commit fixes what you are seeing.
Comment 2 Ben Glines 2022-10-19 14:20:53 MDT
Clarification for the previous reply: That fix is in 22.05.5
Comment 3 mukunda 2022-10-19 17:57:59 MDT
Thanks.
We will try to upgrade to 22.05 and see if that works.

Regards
Mukund
Comment 4 Ben Glines 2022-10-20 09:41:39 MDT
Okay, sounds good.

Could you also reply with a fuller slurmd log from its startup? I want to see all of the GRES lines. Also, could you manually check whether those /dev/nvidia-caps/nvidia-cap* device files that Slurm is looking for exist?
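
For that manual check, a small shell loop like the one below should be enough; the cap numbers are copied from the error messages above, and /proc/driver/nvidia-caps/mig-minors, where the driver publishes the expected cap minors, can serve as a cross-check:

    for n in 93 94 264 265 399 400 498 499; do
        # mirror the stat() call that gres_device_major performs
        stat /dev/nvidia-caps/nvidia-cap$n >/dev/null 2>&1 \
            && echo "nvidia-cap$n: exists" \
            || echo "nvidia-cap$n: MISSING"
    done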
Comment 5 mukunda 2022-10-22 23:04:25 MDT

The attached file has the messages and the log as well.
Thanks
mukund
Comment 6 mukunda 2022-10-22 23:05:10 MDT
Created attachment 27409 [details]
slurmd log with MIG and missing cap files
Comment 7 mukunda 2022-10-22 23:05:35 MDT
Please find the attached log.
It shows the missing cap files.

Thanks
Mukund
Comment 8 Ben Glines 2022-10-24 12:18:25 MDT
Could you send the full output from `nvidia-smi -l`?
Comment 9 Ben Glines 2022-10-24 13:28:57 MDT
Also, could you send your node definitions from your slurm.conf?
Comment 10 Ben Glines 2022-10-24 14:35:55 MDT
(In reply to Ben Glines from comment #8)
> Could you send the full output from `nvidia-smi -l`?
Correction: I meant just `nvidia-smi`
Comment 11 mukunda 2022-10-24 22:54:23 MDT
Here are the details:

nvidia-smi -l
Mon Oct 24 21:52:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   27C    P0    59W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   29C    P0    58W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   26C    P0    59W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                   On |
| N/A   25C    P0    53W / 500W |     45MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  3    7   0   0  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    8   0   1  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    9   0   2  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   11   0   3  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   12   0   4  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   13   0   5  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   14   0   6  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+




The node entry is this:

NodeName=snpsitml12 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1031800 Gres=gpu:a100_1g.10gb:8 Features=qsc-0,CS7.9 State=IDLE  MemSpecLimit=1024
Comment 12 mukunda 2022-10-24 22:56:23 MDT
Also, I tried 22.05, and it also reports missing cap files when slurmd starts.

So there seems to be some more fundamental issue with the setup.

Regards
Mukund
Comment 13 Ben Glines 2022-10-27 15:20:32 MDT
Progress is being tracked in Bug 15259. Closing this as duplicate.

*** This ticket has been marked as a duplicate of ticket 15259 ***
Comment 14 Ben Glines 2022-10-27 15:29:44 MDT
This issue of /dev/nvidia-caps/nvidia-cap* not existing is fixed by a patch that is under review in 15259.