In the slurmd debug2 startup output, we are seeing errors like this:

slurmd: debug: GRES[gpu] Type:a100_1g.10gb Count:1 Cores(64):16-31 Links:0,0,0,0,0,-1,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3,/dev/nvidia-caps/nvidia-cap390,/dev/nvidia-caps/nvidia-cap391 UniqueId:MIG-GPU-ce145927-1844-28fd-251b-c665dc1fa511/13/0
slurmd: debug: GRES[gpu] Type:a100_1g.10gb Count:1 Cores(64):16-31 Links:0,0,0,0,0,-1,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2,/dev/nvidia-caps/nvidia-cap516,/dev/nvidia-caps/nvidia-cap517 UniqueId:MIG-GPU-30e7d5a8-cd64-9239-7133-4d2f7e6e9562/12/0
slurmd: debug: GRES[gpu] Type:a100_1g.10gb Count:1 Cores(64):0-15 Links:0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia1,/dev/nvidia-caps/nvidia-cap120,/dev/nvidia-caps/nvidia-cap121 UniqueId:MIG-GPU-fd4e1dd1-b91e-5943-eea7-fb298b303091/13/0
slurmd: debug: GRES[gpu] Type:a100_1g.10gb Count:1 Cores(64):0-15 Links:0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia0,/dev/nvidia-caps/nvidia-cap264,/dev/nvidia-caps/nvidia-cap265 UniqueId:MIG-GPU-7dcc8c3b-3a54-592e-6001-7f0cdc745d98/14/0
slurmd: debug: GRES[gpu] Type:a100_1g.10gb Count:1 Cores(64):16-31 Links:0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3,/dev/nvidia-caps/nvidia-cap399,/dev/nvidia-caps/nvidia-cap400 UniqueId:MIG-GPU-ce145927-1844-28fd-251b-c665dc1fa511/14/0
slurmd: debug: GRES[gpu] Type:a100_1g.10gb Count:1 Cores(64):16-31 Links:0,0,0,0,0,0,-1 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2,/dev/nvidia-caps/nvidia-cap525,/dev/nvidia-caps/nvidia-cap526 UniqueId:MIG-GPU-30e7d5a8-cd64-9239-7133-4d2f7e6e9562/13/0
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap93): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap94): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap498): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap499): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap264): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap265): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap399): No such file or directory
slurmd: error: gres_device_major: stat(/dev/nvidia-caps/nvidia-cap400): No such file or directory

We are not sure what the problem is with these devices. This is with gres.conf AutoDetect=nvml, and it looks like around 8 devices are being configured badly by the autodetection, but we are not certain. Please let us know. We have 28 MIG devices in total, and 8 appear to be bad due to the errors above. How can we configure Slurm to use only the 20 good ones and ignore the other 8?
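Until the root cause is identified, one common workaround (a sketch only, not a verified config for this node) is to drop AutoDetect=nvml in gres.conf and list just the MIG instances whose device files actually exist, using MultipleFiles= (the gres.conf keyword for MIG instances, since each one spans a parent GPU node plus two capability nodes). The two example paths below are copied from the debug lines above; the node name and per-node count are assumptions you would adjust:

```
# gres.conf (sketch): replace "AutoDetect=nvml" with one explicit line per
# MIG instance whose cap files exist under /dev/nvidia-caps.
NodeName=snpsitml12 Name=gpu Type=a100_1g.10gb Cores=16-31 MultipleFiles=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap390,/dev/nvidia-caps/nvidia-cap391
NodeName=snpsitml12 Name=gpu Type=a100_1g.10gb Cores=16-31 MultipleFiles=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap516,/dev/nvidia-caps/nvidia-cap517
# ...repeat for each remaining good instance, then set a matching count in
# slurm.conf, e.g. Gres=gpu:a100_1g.10gb:<number of good instances per node>
```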
I am currently investigating this further to see what may be happening, but it seems that the issue you are seeing may be related to bug 15026. That bug was resolved in commit 742e6784 (https://github.com/SchedMD/slurm/commit/742e6784c4ed731db32fe8d6af5dc8a29c155d3a) for 22.05. I'll send you updates as I try to reproduce your scenario and check whether this commit fixes what you are seeing.
Clarification for the previous reply: That fix is in 22.05.5
Thanks. We will try upgrading to 22.05 and see if that works. Regards, Mukund
Okay, sounds good. Could you also reply with a more complete slurmd log from its startup? I want to see all of the GRES lines. Also, could you manually check whether those /dev/nvidia-caps/nvidia-cap* device files that Slurm is looking for actually exist?
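To script that manual check, something like the following works (a small sketch; the capability minor numbers are the ones from the error messages above):

```shell
#!/bin/sh
# For each capability minor taken from the slurmd errors, report whether
# the corresponding /dev/nvidia-caps device node exists on this host.
check_caps() {
    for cap in "$@"; do
        f="/dev/nvidia-caps/nvidia-cap${cap}"
        if [ -e "$f" ]; then
            echo "OK      $f"
        else
            echo "MISSING $f"
        fi
    done
}

check_caps 93 94 498 499 264 265 399 400
```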
The attached file has the messages and the log as well. Thanks, Mukund
Created attachment 27409 [details] slurmd log with MIG and missing cap files
Please find the attached log; it shows the missing cap files. Thanks, Mukund
Could you send the full output from `nvidia-smi -l`?
Also could you send your node definitions from your slurm.conf?
(In reply to Ben Glines from comment #8) > Could you send the full output from `nvidia-smi -l`? Correction: I meant just `nvidia-smi`
Here are the details.

nvidia-smi -l:

Mon Oct 24 21:52:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   27C    P0    59W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   29C    P0    58W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   26C    P0    59W / 500W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                   On |
| N/A   25C    P0    53W / 500W |     45MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  3    7   0   0  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    8   0   1  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    9   0   2  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   11   0   3  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   12   0   4  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   13   0   5  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   14   0   6  |      6MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The node entry is this:

NodeName=snpsitml12 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1031800 Gres=gpu:a100_1g.10gb:8 Features=qsc-0,CS7.9 State=IDLE MemSpecLimit=1024
Also, I tried 22.05, and it too reports missing cap files when slurmd starts. So there seems to be some fundamental issue with the setup. Regards, Mukund
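One way to narrow down whether this is a driver/udev problem rather than a Slurm one (a sketch; the paths are the standard nvidia-caps locations and may be absent on non-MIG hosts, hence the guards) is to compare the capability minors the driver assigned with the device nodes that were actually created:

```shell
#!/bin/sh
# Print the MIG capability minors assigned by the nvidia driver, and the
# device nodes present under /dev/nvidia-caps. A minor listed in the proc
# file but missing under /dev points at device-node creation (e.g. udev)
# rather than at Slurm itself.
check_mig_caps() {
    if [ -r /proc/driver/nvidia-caps/mig-minors ]; then
        cat /proc/driver/nvidia-caps/mig-minors
    else
        echo "no mig-minors file (no nvidia driver, or MIG not enabled)"
    fi
    ls /dev/nvidia-caps 2>/dev/null || echo "no /dev/nvidia-caps directory"
}

check_mig_caps
```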
Progress is being tracked in Bug 15259. Closing this as duplicate. *** This ticket has been marked as a duplicate of ticket 15259 ***
This issue of /dev/nvidia-caps/nvidia-cap* not existing is fixed by a patch that is under review in 15259.