Ticket 7769

Summary: Missing systemd dep causes slurmd start problems for nodes with gres GPUs
Product: Slurm    Reporter: darrellp
Component: GPU    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---    CC: wdennis
Version: 19.05.2
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8222
Site: Allen AI
Attachments: slurmd.service

Description darrellp 2019-09-18 13:50:04 MDT
Folks
We have an issue where, when a GPU node is restarted, slurmd often fails to start because the gres.conf file references the file location for the Nvidia GPU, which has not yet been created while the Nvidia drivers are still loading. The effect is that slurmd does not start (error below) and has to be restarted by hand afterward.
Please note that this does not happen every time, but it happens often enough that I would class it as frequent.
This is on Slurm 19.05.2 cluster with 4 nodes. 
All running Ubuntu 18.04
All with various Nvidia GPUs (RTX Titan, Quadro RTX 8000, GTX 1080)

Error reported in the logs and systemctl status slurmd:
slurmd[1453]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory

This appears to be similar to this existing bug:
https://bugs.schedmd.com/show_bug.cgi?id=3798

We have thought about creating a fix using something in udev and an 'After' in the service file, something like this:
https://unix.stackexchange.com/questions/186899/how-to-wait-for-a-dev-video0-before-starting-a-service/186903

This is definitely not ideal, as we would need to maintain a mapping for every machine and how many GPUs it has.
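For concreteness, the udev approach we were considering would look something like this. All file and unit names here are hypothetical, and it assumes the nvidia device nodes generate udev events that systemd can track, which may not hold for all driver versions:

```
# /etc/udev/rules.d/99-nvidia-systemd.rules (hypothetical)
# Tag the device so systemd creates a dev-nvidia0.device unit for it
KERNEL=="nvidia0", TAG+="systemd"

# Drop-in override, e.g. /etc/systemd/system/slurmd.service.d/override.conf
[Unit]
After=dev-nvidia0.device
Wants=dev-nvidia0.device
```

As noted, this only covers /dev/nvidia0; each additional device file would need its own rule and After= entry.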

Can you give us a better solution? 
Thanks
D

Steps to reproduce:
1. On a multi-GPU node, reboot.
2. Run `sudo systemctl status slurmd` and observe the failure.
Comment 1 Michael Hinton 2019-09-20 10:34:46 MDT
Hey Darrell,

Could you attach the relevant systemd files you are using with slurmd?

As you've indicated, this seems more like a systemd dependency/udev issue rather than something we can or should solve within Slurm itself. I can't guarantee that I'll find a better solution than the one you've found, but I'll look into it.

Thanks,
-Michael
Comment 2 darrellp 2019-09-20 10:40:19 MDT
I would respectfully argue that if Slurm is going to support Nvidia GPUs, this is something you would want to have an answer for. It is pretty disruptive when running a cluster. Anyway, attaching the systemd file. Note that this is exactly what is packaged in the 19.05.2 release.

Cheers
D
Comment 3 darrellp 2019-09-20 10:41:23 MDT
Created attachment 11639 [details]
slurmd.service
Comment 4 Michael Hinton 2019-09-20 11:34:37 MDT
(In reply to darrellp from comment #2)
> I would respectfully argue that if Slurm is going to support Nvidia GPUs,
> this is something you would want to have an answer for. It is pretty
> disruptive when running a cluster.
You are probably right about that :)

My first thought would be to somehow add an "nvidia.service" target to After= that knows when the nvidia driver is up and running (so all /dev/nvidiaX files exist). The closest thing I can find is `nvidia-persistenced.service` (seen in `systemctl list-units --no-pager | grep -i nvidia`). Maybe adding that would solve the issue. But I'm not sure if this service is automatically started on boot for all systems with NVIDIA drivers, or even what the nvidia-persistenced.service does.

If that doesn't work, maybe you can create some kind of Path unit file and a separate service unit file that waits for the path to exist. Something like https://lists.fedoraproject.org/pipermail/devel/2012-January/160910.html. But it looks like this still requires specifying each device file manually.
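A sketch of that path-unit idea, with hypothetical unit names throughout (as the linked thread notes, you would still need one pair of units per device file):

```
# /etc/systemd/system/dev-nvidia0-wait.path (hypothetical)
[Path]
PathExists=/dev/nvidia0

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/dev-nvidia0-wait.service (hypothetical)
# Activated by the path unit once /dev/nvidia0 exists
[Service]
Type=oneshot
ExecStart=/bin/systemctl start slurmd.service
```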

As a practical solution, you could try to just wait on `/dev/nvidia0` only, even if there are more devices. It's possible that by the time the first device file shows up, the bulk of the waiting time is over and others quickly follow suit, so that when slurmd starts and gets around to detecting GPU devices with NVML, all device files are up.
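That wait could be sketched as a small helper invoked from ExecStartPre= (the function name and timeout are illustrative, not anything shipped with Slurm or the driver):

```shell
# Hypothetical ExecStartPre helper: poll until a device node exists,
# giving up after a timeout (in seconds).
wait_for_dev() {
  dev="$1"
  timeout="${2:-60}"
  i=0
  while [ ! -e "$dev" ]; do
    if [ "$i" -ge "$timeout" ]; then
      echo "timed out waiting for $dev" >&2
      return 1
    fi
    sleep 1
    i=$((i+1))
  done
  return 0
}
```

Wired in as something like `ExecStartPre=/bin/sh -c '. /usr/local/lib/wait-for-dev.sh && wait_for_dev /dev/nvidia0 120'` (path hypothetical).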

Hopefully that points you in a useful direction for now. We'll discuss this internally and see what we can do to properly address this.

Thanks,
-Michael
Comment 5 darrellp 2019-09-20 11:40:23 MDT
Thanks Michael. We appreciate it

Cheers
D
Comment 6 Michael Hinton 2019-10-16 10:18:26 MDT
Hey D,

Here is a solution that you can try. Check this out: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-verifications

On every node, before you start slurmd, run a variation of this script to manually create the NVIDIA device files if they don't yet exist. Do you think that would work?
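A hedged adaptation of the device-file creation approach from that doc, with the GPU counting factored out so it can be tested and adjusted (the lspci match strings are an assumption and may need tweaking per machine; mknod requires root):

```shell
# Print the number of NVIDIA GPUs found in `lspci` output read from stdin.
# Match strings assumed from the CUDA install guide; adjust as needed.
nvidia_device_count() {
  grep -i nvidia | grep -Ec "3D controller|VGA compatible controller"
}

# Create /dev/nvidiaN and /dev/nvidiactl if they do not yet exist.
# 195 is the NVIDIA character device major number used in that doc.
create_nvidia_devs() {
  /sbin/modprobe nvidia || return 1
  n=$(lspci | nvidia_device_count)
  [ "$n" -gt 0 ] || return 1
  i=0
  while [ "$i" -lt "$n" ]; do
    [ -e /dev/nvidia$i ] || mknod -m 666 /dev/nvidia$i c 195 $i
    i=$((i+1))
  done
  [ -e /dev/nvidiactl ] || mknod -m 666 /dev/nvidiactl c 195 255
}
```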

-Michael
Comment 7 darrellp 2019-10-16 10:41:44 MDT
Maybe. That script does not work on our systems out of the box (lspci gives
different strings for different cards), but let me play with it this
afternoon and see if I can get something to work. If so, we should be
able to add it as an ExecStartPre to create the dev files if they do
not exist. I will let you know what I find.

Thanks
D

Comment 8 darrellp 2019-10-16 16:30:23 MDT
Ok, we played with that, and while the script that Nvidia presented did not work as-is, the information in that doc got us to something that did. We added this to the slurmd.service file and it has worked so far in our testing. We still need to test it on a cluster with larger machines, but for our setup this looks promising.


ExecStartPre=/usr/bin/nvidia-smi -L 


Thanks
D
Comment 9 Michael Hinton 2019-10-18 11:29:41 MDT
Is your solution still working for you?

I'm curious to understand why it works. Is it because /usr/bin/nvidia-smi stalls until the driver is properly running and devices are accessible? Do you have any insight on that?

My only worry would be if e.g. nvidia0 comes up, allowing /usr/bin/nvidia-smi to run, but nvidia8 is still loading, and then Slurm starts prematurely. But it sounds like this isn't happening.
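One hypothetical guard against that race: have ExecStartPre wait until nvidia-smi reports the expected per-node GPU count rather than merely succeeding (the drop-in path, the count of 8, and the timeout are all assumptions for illustration):

```
# /etc/systemd/system/slurmd.service.d/gpu-wait.conf (hypothetical)
[Service]
ExecStartPre=/bin/sh -c 'for i in $(seq 1 60); do [ "$(/usr/bin/nvidia-smi -L 2>/dev/null | wc -l)" -ge 8 ] && exit 0; sleep 2; done; exit 1'
```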
Comment 10 darrellp 2019-10-18 11:37:44 MDT
It is in testing, but we have not tested it in our full clusters with the 8 GPU nodes yet as the rebooting would be disruptive and we have upcoming paper deadlines. If we have a node failure in the meantime, that would be informative as well. 

As for why it works: nvidia-smi needs the driver to be loaded, so I can only speculate that there is some time-out that allows it to wait for the driver to load and see all the devices. This may not work for everyone or on every Nvidia GPU, but so far so good.

We will know more when we can test the full clusters once we get past the paper season. 

Until then, I think we can mark this as resolved/info given.

Thanks
D
Comment 11 Will Dennis 2019-10-23 19:46:33 MDT
Hi all,

We have tried the same thing, as this has been a problem for us as well... Here's the unit file we drop in `/etc/systemd/system`:

```
# cat /etc/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmd(8)

[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmd
ExecStartPre=-/usr/bin/nvidia-smi
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm-llnl/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target
```

In our experience, the `ExecStartPre=-/usr/bin/nvidia-smi` works _most_ of the time, but not all of the time, so (I'm guessing) there must sometimes be a lag between the running of `nvidia-smi` and the instantiation of the `/dev/nvidia[#]` files... It is a minor pain point when we periodically have to reboot the GPU systems.

FYI, we are running 17.11.7 on Ubuntu 16.04 x86_64.