| Summary: | Slurm not working on node after CUDA upgrade | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Wayfinder Infrastructure Support <infrastructure-support.wayfinder> |
| Component: | GPU | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=16131 | ||
| Site: | wfr | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | NA | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmd, slurmctld, gres.conf, slurmdbd, slurm.conf, unit_filies.out | | |
Created attachment 26329 [details]
slurmctld
Created attachment 26330 [details]
gres.conf
Created attachment 26331 [details]
slurmdbd
Created attachment 26332 [details]
slurm.conf
The node is configured with GPUs, but no GPU device was found.
[2022-08-15T17:36:17.347] error: Waiting for gres.conf file /dev//nvidia0
[2022-08-15T17:36:36.350] fatal: can't stat gres.conf file /dev//nvidia0: No such file or directory
> NodeName=ng-[201,202]-[1,5] Name=gpu Type=v100 File=/dev/nvidia[0-3] Count=4 Cores=0-19,40-59
This error is considered fatal. Can you verify that the devices are present, and also whether nvidia-smi returns any information?
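For reference, the fatal check can be approximated outside slurmd. This is a sketch, not SchedMD code: it expands the File=/dev/nvidia[0-3] range from the gres.conf line quoted above and stats each device path the way slurmd does at startup.

```shell
# Approximate slurmd's per-device existence check for the gres.conf
# entry File=/dev/nvidia[0-3] (device paths taken from the report).
for dev in /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3; do
    if [ -e "$dev" ]; then
        echo "found: $dev"
    else
        echo "MISSING: $dev"   # slurmd treats a missing device as fatal
    fi
done
```

Any "MISSING" line here corresponds to the "can't stat gres.conf file" fatal above.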
nvidia-smi does work.

root@ng-201-1:~# nvidia-smi
Mon Aug 15 21:08:35 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   30C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:1B:00.0 Off |                    0 |
| N/A   33C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3D:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:3E:00.0 Off |                    0 |
| N/A   28C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   29C    P0    41W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   32C    P0    43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   29C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

What does the output of the following commands return?
> $ ls -lta /dev/nvidia*
> $ nvidia-smi -L

What happens if you start the slurmd process by hand with the -D option? Send back any errors from the output of this command too. You can use Ctrl+C to exit if you do not see anything and the process starts up.
> slurmd -D

If the process does not exit with errors, then after you Ctrl+C, start the process with the systemctl command.

root@ng-201-1:~# ls -lta /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Aug 15 19:50 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Aug 15 19:50 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Aug 15 19:50 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Aug 15 19:50 /dev/nvidia3
crw-rw-rw- 1 root root 195,   4 Aug 15 19:50 /dev/nvidia4
crw-rw-rw- 1 root root 195,   5 Aug 15 19:50 /dev/nvidia5
crw-rw-rw- 1 root root 195,   6 Aug 15 19:50 /dev/nvidia6
crw-rw-rw- 1 root root 195,   7 Aug 15 19:50 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Aug 15 19:50 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Aug 15 19:50 /dev/nvidia-modeset
crw-rw-rw- 1 root root 503,   0 Aug 15 19:50 /dev/nvidia-uvm
crw-rw-rw- 1 root root 503,   1 Aug 15 19:50 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
drwxr-xr-x 21 root root 4800 Aug 15 20:12 ..
drwxr-xr-x  2 root root   80 Aug 15 19:51 .
cr--------  1 root root 506, 1 Aug 15 19:51 nvidia-cap1
cr--r--r--  1 root root 506, 2 Aug 15 19:51 nvidia-cap2

root@ng-201-1:~# nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-208d67f9-a34e-95c9-029f-23e9da111d76)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-6d7f0405-8c54-4eaf-724b-a6e67d9c48a4)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-733003ad-4a94-5c7e-e5af-b9653c465289)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-e62bac0b-2253-598c-4dad-8015eaec8bba)
GPU 4: Tesla V100-SXM2-32GB (UUID: GPU-3bd35d51-81ee-1a78-6068-ad5d74f843cc)
GPU 5: Tesla V100-SXM2-32GB (UUID: GPU-3be1606c-667e-0f90-185d-edb480e2a094)
GPU 6: Tesla V100-SXM2-32GB (UUID: GPU-eb123203-f3ae-3686-e05b-9f4c0c3630c2)
GPU 7: Tesla V100-SXM2-32GB (UUID: GPU-45a4c6f1-177c-856c-66ea-7cb6604ca6fc)
root@ng-201-1:~#

I'm not sure how to run "slurmd -D" because the command is not found and is only available as a service. Should this be run on the failing node?

It sounds like the binaries are not in your PATH on that compute node. What happens if you start the slurmd service as usual? Since the device files are there, it looks like the systemd unit just needs a dependency on a CUDA/NVIDIA service.

root@ng-201-1:/cm/local/apps/slurm/var# systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2022-08-15 19:51:12 UTC; 6h ago
Main PID: 17571 (slurmd)
Tasks: 1
CGroup: /system.slice/slurmd.service
└─17571 /cm/shared/apps/slurm/18.08.9/sbin/slurmd
Aug 15 19:51:12 ng-201-1 systemd[1]: Starting Slurm node daemon...
Aug 15 19:51:12 ng-201-1 systemd[1]: slurmd.service: Can't open PID file /var/run/slurmd.pid (yet?) after start: No such file or directory
Aug 15 19:51:12 ng-201-1 systemd[1]: Started Slurm node daemon.
root@ng-201-1:/cm/local/apps/slurm/var# cd /cm/shared/apps/slurm/18.08.9/sbin/
root@ng-201-1:/cm/shared/apps/slurm/18.08.9/sbin# ls
capmc_resume capmc_suspend slurmctld slurmd slurmdbd slurmsmwd slurmstepd
root@ng-201-1:/cm/shared/apps/slurm/18.08.9/sbin# ./slurmd -D
slurmd: Message aggregation disabled
slurmd: error: Core count (39) not multiple of socket count (2)
slurmd: Gres Name=gpu Type=v100 Count=4
slurmd: Gres Name=gpu Type=v100 Count=4
slurmd: gpu device number 0(/dev/nvidia0):c 195:0 rwm
slurmd: gpu device number 1(/dev/nvidia1):c 195:1 rwm
slurmd: gpu device number 2(/dev/nvidia2):c 195:2 rwm
slurmd: gpu device number 3(/dev/nvidia3):c 195:3 rwm
slurmd: gpu device number 4(/dev/nvidia4):c 195:4 rwm
slurmd: gpu device number 5(/dev/nvidia5):c 195:5 rwm
slurmd: gpu device number 6(/dev/nvidia6):c 195:6 rwm
slurmd: gpu device number 7(/dev/nvidia7):c 195:7 rwm
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: Munge cryptographic signature plugin loaded
slurmd: slurmd version 18.08.9 started
slurmd: killing old slurmd[17571]
slurmd: slurmd started on Tue, 16 Aug 2022 02:04:10 +0000
slurmd: CPUs=80 Boards=1 Sockets=2 Cores=20 Threads=2 Memory=772695 TmpDisk=100665 Uptime=22442 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Please attach the following output.
> systemctl list-unit-files > /tmp/unit_filtes.out

I suspect you will have an NVIDIA process that is responsible for loading the driver. That service will need to be added to the dependency line in the slurmd service file, for example nvidia-persistenced:
> [Unit]
> Description=Slurm node daemon
> After=munge.service network-online.target remote-fs.target nvidia-persistenced

Created attachment 26345 [details]
unit_filies.out
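The "After=" dependency suggested above can also be applied without editing the packaged unit file, using a systemd drop-in. This is a sketch, not taken from the ticket; the path is the standard systemd override location, and the exact NVIDIA unit name should be confirmed against the attached unit_filies.out (e.g. with `systemctl list-unit-files | grep -i nvidia`).

```ini
# /etc/systemd/system/slurmd.service.d/nvidia.conf  (illustrative drop-in)
[Unit]
After=nvidia-persistenced.service
```

After creating the drop-in, run `systemctl daemon-reload` and restart slurmd so systemd picks up the new ordering; a drop-in survives package upgrades of slurmd.service, which a direct edit may not.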
Thank you for that information. Please make a change to your slurmd.service file to include the cuda-driver and nvidia-persistenced services.
> [Unit]
> Description=Slurm node daemon
> After=munge.service network-online.target remote-fs.target cuda-driver.service nvidia-persistenced.service

You will need to reload systemd after making this change.
> systemctl daemon-reload

Reboot the node and see if slurmd starts and joins the cluster.

I am just checking in to see how things are going. Were you able to make the change mentioned in comment#13, and did that help solve the issue you were facing?

Thank you, the issue is resolved. After updating CUDA, /lib/systemd/system/nvidia-persistenced.service was set to "--no-persistence-mode":
> ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose

Removing --no-persistence-mode and restarting the service fixed the issue.
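The root cause described above can be checked for quickly on other nodes. A minimal sketch, assuming the unit path from this report (/lib/systemd/system/nvidia-persistenced.service); on hosts without that unit the script simply says so.

```shell
# Report whether nvidia-persistenced is configured with
# --no-persistence-mode, the setting that broke GPU detection here.
unit=/lib/systemd/system/nvidia-persistenced.service
if grep -q -- '--no-persistence-mode' "$unit" 2>/dev/null; then
    echo "persistence mode DISABLED in $unit"
else
    echo "persistence mode not disabled (or unit file not present)"
fi
```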
Created attachment 26328 [details]
slurmd

Hi, I get the following error after upgrading CUDA on our machines.

(base) deric.chau.ctr@nl-202-31:~$ srun --gres=gpu:1 -c 16 --mem 30G --nodelist=ng-201-1 --pty bash
srun: Required node not available (down, drained or reserved)
srun: job 148606 queued and waiting for resources

Undraining did not help. I am attaching logs and configs.
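When srun reports a node as down/drained like this, the scheduler's recorded reason usually names the failing daemon. A hedged sketch of the usual first checks (node name taken from the command above; requires the Slurm client tools on the submit host):

```shell
# Show why ng-201-1 is unavailable; degrade gracefully when the Slurm
# client tools are not on PATH (as on the compute node in this ticket).
node=ng-201-1
if command -v sinfo >/dev/null 2>&1; then
    sinfo -R --nodes="$node"                        # drain/down reason
    scontrol show node "$node" | grep -E 'State|Reason'
else
    echo "Slurm client tools not found on this host"
fi
```

If the reason is something like "Not responding", the next step is the slurmd log on the node itself, which is what surfaced the gres.conf fatal here.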