Ticket 14754

Summary: Slurm not working on node after CUDA upgrade
Product: Slurm Reporter: Wayfinder Infrastructure Support <infrastructure-support.wayfinder>
Component: GPU    Assignee: Jason Booth <jbooth>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=16131
Site: wfr
Version Fixed: NA
Attachments: slurmd
slurmctld
gres.conf
slurmdbd
slurm.conf
unit_filies.out

Description Wayfinder Infrastructure Support 2022-08-15 14:03:12 MDT
Created attachment 26328 [details]
slurmd

Hi,
I get the following error after upgrading CUDA on our machines.

(base) deric.chau.ctr@nl-202-31:~$ srun --gres=gpu:1 -c 16 --mem 30G --nodelist=ng-201-1 --pty bash
srun: Required node not available (down, drained or reserved)
srun: job 148606 queued and waiting for resources

Undraining did not help.

I am attaching logs and configs.
Comment 1 Wayfinder Infrastructure Support 2022-08-15 14:04:23 MDT
Created attachment 26329 [details]
slurmctld
Comment 2 Wayfinder Infrastructure Support 2022-08-15 14:04:46 MDT
Created attachment 26330 [details]
gres.conf
Comment 3 Wayfinder Infrastructure Support 2022-08-15 14:05:22 MDT
Created attachment 26331 [details]
slurmdbd
Comment 4 Wayfinder Infrastructure Support 2022-08-15 14:05:48 MDT
Created attachment 26332 [details]
slurm.conf
Comment 5 Jason Booth 2022-08-15 14:11:04 MDT
The node is configured with GPUs; however, no GPU device was found.

[2022-08-15T17:36:17.347] error: Waiting for gres.conf file /dev//nvidia0
[2022-08-15T17:36:36.350] fatal: can't stat gres.conf file /dev//nvidia0: No such file or directory

> NodeName=ng-[201,202]-[1,5] Name=gpu Type=v100 File=/dev/nvidia[0-3] Count=4 Cores=0-19,40-59


This error is considered fatal. Can you verify that the devices are present, and whether nvidia-smi returns any information?
Comment 6 Wayfinder Infrastructure Support 2022-08-15 15:11:08 MDT
nvidia-smi does work.

root@ng-201-1:~# nvidia-smi
Mon Aug 15 21:08:35 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   30C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:1B:00.0 Off |                    0 |
| N/A   33C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3D:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:3E:00.0 Off |                    0 |
| N/A   28C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   29C    P0    41W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   32C    P0    43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   29C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Comment 7 Jason Booth 2022-08-15 15:21:11 MDT
What does the output for the following commands return?

> $ ls -lta /dev/nvidia*
> $ nvidia-smi -L

What happens if you start the slurmd process by hand with the -D option? If you see any errors, send back the output of that command as well. If the process starts up and you do not see anything, you can exit with ctrl+c.


> slurmd -D

If the process does not exit with errors, then after you ctrl+c, start the process with the systemctl command.
Comment 8 Wayfinder Infrastructure Support 2022-08-15 16:33:00 MDT
root@ng-201-1:~# ls -lta /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Aug 15 19:50 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Aug 15 19:50 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Aug 15 19:50 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Aug 15 19:50 /dev/nvidia3
crw-rw-rw- 1 root root 195,   4 Aug 15 19:50 /dev/nvidia4
crw-rw-rw- 1 root root 195,   5 Aug 15 19:50 /dev/nvidia5
crw-rw-rw- 1 root root 195,   6 Aug 15 19:50 /dev/nvidia6
crw-rw-rw- 1 root root 195,   7 Aug 15 19:50 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Aug 15 19:50 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Aug 15 19:50 /dev/nvidia-modeset
crw-rw-rw- 1 root root 503,   0 Aug 15 19:50 /dev/nvidia-uvm
crw-rw-rw- 1 root root 503,   1 Aug 15 19:50 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
drwxr-xr-x 21 root root   4800 Aug 15 20:12 ..
drwxr-xr-x  2 root root     80 Aug 15 19:51 .
cr--------  1 root root 506, 1 Aug 15 19:51 nvidia-cap1
cr--r--r--  1 root root 506, 2 Aug 15 19:51 nvidia-cap2

root@ng-201-1:~# nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-208d67f9-a34e-95c9-029f-23e9da111d76)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-6d7f0405-8c54-4eaf-724b-a6e67d9c48a4)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-733003ad-4a94-5c7e-e5af-b9653c465289)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-e62bac0b-2253-598c-4dad-8015eaec8bba)
GPU 4: Tesla V100-SXM2-32GB (UUID: GPU-3bd35d51-81ee-1a78-6068-ad5d74f843cc)
GPU 5: Tesla V100-SXM2-32GB (UUID: GPU-3be1606c-667e-0f90-185d-edb480e2a094)
GPU 6: Tesla V100-SXM2-32GB (UUID: GPU-eb123203-f3ae-3686-e05b-9f4c0c3630c2)
GPU 7: Tesla V100-SXM2-32GB (UUID: GPU-45a4c6f1-177c-856c-66ea-7cb6604ca6fc)
root@ng-201-1:~#


I'm not sure how to run "slurmd -D", because the command is not found; it is only available as a service. Should this be run on the failing node?
Comment 9 Jason Booth 2022-08-15 19:25:22 MDT
It sounds like the binaries are not in your PATH on that compute node. What happens if you start the slurmd service as usual?

Since the device files are there, it looks like the systemd unit just needs a dependency on a CUDA/NVIDIA service.
Comment 10 Wayfinder Infrastructure Support 2022-08-15 20:06:50 MDT
root@ng-201-1:/cm/local/apps/slurm/var# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2022-08-15 19:51:12 UTC; 6h ago
 Main PID: 17571 (slurmd)
    Tasks: 1
   CGroup: /system.slice/slurmd.service
           └─17571 /cm/shared/apps/slurm/18.08.9/sbin/slurmd

Aug 15 19:51:12 ng-201-1 systemd[1]: Starting Slurm node daemon...
Aug 15 19:51:12 ng-201-1 systemd[1]: slurmd.service: Can't open PID file /var/run/slurmd.pid (yet?) after start: No such file or directory
Aug 15 19:51:12 ng-201-1 systemd[1]: Started Slurm node daemon.
root@ng-201-1:/cm/local/apps/slurm/var# cd /cm/shared/apps/slurm/18.08.9/sbin/


root@ng-201-1:/cm/shared/apps/slurm/18.08.9/sbin# ls
capmc_resume  capmc_suspend  slurmctld  slurmd  slurmdbd  slurmsmwd  slurmstepd
root@ng-201-1:/cm/shared/apps/slurm/18.08.9/sbin# ./slurmd -D
slurmd: Message aggregation disabled
slurmd: error: Core count (39) not multiple of socket count (2)
slurmd: Gres Name=gpu Type=v100 Count=4
slurmd: Gres Name=gpu Type=v100 Count=4
slurmd: gpu device number 0(/dev/nvidia0):c 195:0 rwm
slurmd: gpu device number 1(/dev/nvidia1):c 195:1 rwm
slurmd: gpu device number 2(/dev/nvidia2):c 195:2 rwm
slurmd: gpu device number 3(/dev/nvidia3):c 195:3 rwm
slurmd: gpu device number 4(/dev/nvidia4):c 195:4 rwm
slurmd: gpu device number 5(/dev/nvidia5):c 195:5 rwm
slurmd: gpu device number 6(/dev/nvidia6):c 195:6 rwm
slurmd: gpu device number 7(/dev/nvidia7):c 195:7 rwm
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: Munge cryptographic signature plugin loaded
slurmd: slurmd version 18.08.9 started
slurmd: killing old slurmd[17571]
slurmd: slurmd started on Tue, 16 Aug 2022 02:04:10 +0000
slurmd: CPUs=80 Boards=1 Sockets=2 Cores=20 Threads=2 Memory=772695 TmpDisk=100665 Uptime=22442 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Comment 11 Jason Booth 2022-08-16 09:28:15 MDT
Please attach the following output. 

> systemctl list-unit-files >  /tmp/unit_filtes.out

I suspect you will have an NVIDIA service that is responsible for loading the driver. That service will need to be added to the dependency line in the slurmd service file.

For example: 
nvidia-persistenced


> [Unit]
> Description=Slurm node daemon
> After=munge.service network-online.target remote-fs.target nvidia-persistenced
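The same dependency can also be added without editing the packaged unit file, via a systemd drop-in (a sketch; the drop-in filename nvidia-dep.conf is arbitrary, and the path is the standard systemd override location):

```
# /etc/systemd/system/slurmd.service.d/nvidia-dep.conf  (hypothetical drop-in)
[Unit]
After=nvidia-persistenced.service
```

After creating the drop-in, run `systemctl daemon-reload` and restart slurmd. A drop-in survives package upgrades, whereas direct edits to the shipped unit file may be overwritten.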
Comment 12 Wayfinder Infrastructure Support 2022-08-16 09:53:43 MDT
Created attachment 26345 [details]
unit_filies.out

unit_filies.out
Comment 13 Jason Booth 2022-08-16 11:16:55 MDT
Thank you for that information. Please make a change to your slurmd.service file to include the cuda-driver and nvidia-persistenced services.

> [Unit]
> Description=Slurm node daemon
> After=munge.service network-online.target remote-fs.target cuda-driver.service nvidia-persistenced.service 

You will need to reload the change to the systemd unit file after making this change.

> systemctl daemon-reload

Reboot the node and see if slurmd starts and joins the cluster.
Comment 14 Jason Booth 2022-08-29 11:02:14 MDT
I am just checking in to see how things are going. Were you able to make the change mentioned in comment #13, and did that help solve the issue you were facing?
Comment 15 Wayfinder Infrastructure Support 2022-08-29 17:26:34 MDT
Thank you, issue is resolved.

The /lib/systemd/system/nvidia-persistenced.service was set to "--no-persistence-mode" after updating CUDA:

ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose

Removing --no-persistence-mode and restarting the service fixed the issue.
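
For anyone hitting the same problem, the corrected ExecStart line would look like this (same flags as quoted above, minus --no-persistence-mode):

```
# /lib/systemd/system/nvidia-persistenced.service
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --verbose
```

followed by `systemctl daemon-reload` and `systemctl restart nvidia-persistenced`. Persistence mode can also be enabled at runtime with `nvidia-smi -pm 1` (requires root), though that alone does not persist across a reboot without the service change.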