| Summary: | Missing systemd dep causes slurmd start problems for nodes with gres GPUs | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | darrellp |
| Component: | GPU | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | wdennis |
| Version: | 19.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8222 | | |
| Site: | Allen AI | | |
| Attachments: | slurmd.service | ||
Hey Darrell,

Could you attach the relevant systemd files you are using with slurmd? As you've indicated, this seems more like a systemd dependency/udev issue than something we can or should solve within Slurm itself. I can't guarantee that I'll find a better solution than the one you've found, but I'll look into it.

Thanks,
-Michael

I would respectfully argue that if Slurm is going to support Nvidia GPUs, this is something you would want to have an answer for. It is quite disruptive when running a cluster. Anyway, attaching the systemd file. Note that this is exactly what is packaged in the 19.05.2 release.

Cheers,
D

Created attachment 11639 [details]
slurmd.service
(In reply to darrellp from comment #2)
> I would respectfully argue if Slurm is going to support Nvidia GPUs that
> this would be something that you would want have an answer for. It is pretty
> disruptive to when running a cluster.

You are probably right about that :)

My first thought would be to somehow add an "nvidia.service" target to After= that knows when the nvidia driver is up and running (so all /dev/nvidiaX files exist). The closest thing I can find is `nvidia-persistenced.service` (seen in `systemctl list-units --no-pager | grep -i nvidia`). Maybe adding that would solve the issue. But I'm not sure whether this service is automatically started on boot for all systems with NVIDIA drivers, or even what nvidia-persistenced.service does.

If that doesn't work, maybe you can create some kind of Path unit file and a separate service unit file that waits for the path to exist. Something like https://lists.fedoraproject.org/pipermail/devel/2012-January/160910.html. But it looks like this still requires specifying each device file manually.

As a practical solution, you could try to just wait on `/dev/nvidia0` only, even if there are more devices. It's possible that by the time the first device file shows up, the bulk of the waiting time is over and the others quickly follow suit, so that when slurmd starts and gets around to detecting GPU devices with NVML, all the device files are up.

Hopefully that points you in a useful direction for now. We'll discuss this internally and see what we can do to properly address this.

Thanks,
-Michael

Thanks Michael.
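For reference, the Path-unit approach described above might look roughly like this (a sketch, untested; the unit file name and the choice of watching only `/dev/nvidia0` are assumptions, and as noted each additional device file would need its own entry):

```
# /etc/systemd/system/dev-nvidia0.path (hypothetical name)
# Enable this .path unit instead of enabling slurmd.service directly;
# systemd starts the named service once the watched path exists.
[Unit]
Description=Wait for the first NVIDIA device file before starting slurmd

[Path]
PathExists=/dev/nvidia0
Unit=slurmd.service

[Install]
WantedBy=multi-user.target
```

`PathExists=` also triggers immediately if the file is already present at boot, so this should be safe on nodes where the driver loads quickly.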
We appreciate it.

Cheers,
D

Hey D,

Here is a solution that you can try. Check this out: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-verifications

On every node, before you start slurmd, run a variation of this script to manually create the NVIDIA device files if they don't yet exist. Do you think that would work?

-Michael

Maybe. That script does not work on our systems out of the box (lspci gives different strings for different cards), but let me play with it this afternoon and see if I can get something to work. If so, then we should be able to add it as an ExecStartPre to create the dev files if they do not exist. I will let you know what I find.

Thanks,
D

Ok, we played with that, and while the script Nvidia presented did not work, the information in that doc did get us to something that did. We added this to the slurmd.service file and it has worked so far in our testing. We still need to test it on a cluster with larger machines, but for our setup this looks promising.

ExecStartPre=/usr/bin/nvidia-smi -L

Thanks,
D

Is your solution still working for you? I'm curious to understand why it works.
Is it because /usr/bin/nvidia-smi stalls until the driver is properly running and the devices are accessible? Do you have any insight on that? My only worry would be if, e.g., nvidia0 comes up, allowing /usr/bin/nvidia-smi to run, but nvidia8 is still loading, and Slurm then starts prematurely. But it sounds like this isn't happening.

It is in testing, but we have not tested it on our full clusters with the 8-GPU nodes yet, as the rebooting would be disruptive and we have upcoming paper deadlines. If we have a node failure in the meantime, that would be informative as well. As for why it works: nvidia-smi needs the driver to be loaded, so I can only speculate that there is some timeout that allows it to wait for the driver to load and see all the devices. This may not work for everyone or on every Nvidia GPU, but so far so good. We will know more when we can test the full clusters once we get past the paper season. Until then, I think we can mark this as resolved/info given.

Thanks,
D

Hi all,

We have tried the same thing, as this has been a problem for us as well. Here's the unit file we drop in `/etc/systemd/system`:

```
# cat /etc/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmd(8)

[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmd
ExecStartPre=-/usr/bin/nvidia-smi
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm-llnl/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target
```

In our experience, the `ExecStartPre=-/usr/bin/nvidia-smi` works _most_ of the time, but not all of the time, so (I'm guessing) there must sometimes be a lag between the running of `nvidia-smi` and the instantiation of the `/dev/nvidia[#]` files. It is a minor pain point when we periodically have to reboot the GPU systems.
FYI, we are running 17.11.7 on Ubuntu 16.04 x86_64.
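One way to close the remaining gap between `nvidia-smi` returning and the device files appearing would be an ExecStartPre helper that polls until the expected number of device files exists or a timeout expires. This is only a sketch, not something from this thread: the script name, function name, and all arguments are made up for illustration.

```shell
#!/bin/sh
# wait-for-nvidia-devices.sh (hypothetical helper script)
#
# Poll until device files nvidia0 .. nvidia<COUNT-1> all exist under
# DEV_DIR, or give up after TIMEOUT seconds.
#
# Usage: wait-for-nvidia-devices.sh DEV_DIR COUNT [TIMEOUT]
wait_for_nvidia_devices() {
    dir=$1
    count=$2
    timeout=${3:-120}

    elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        missing=0
        i=0
        # Check every expected device file on each pass.
        while [ "$i" -lt "$count" ]; do
            [ -e "$dir/nvidia$i" ] || missing=1
            i=$((i + 1))
        done
        # All device files present: success.
        [ "$missing" -eq 0 ] && return 0
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "timed out waiting for $count NVIDIA device files in $dir" >&2
    return 1
}

# When invoked as a script with arguments, run the check directly.
if [ "$#" -ge 2 ]; then
    wait_for_nvidia_devices "$@"
fi
```

In the unit file this could be wired in with something like `ExecStartPre=/usr/local/sbin/wait-for-nvidia-devices.sh /dev 8 120` (path and numbers illustrative), alongside or instead of the `nvidia-smi` call, so slurmd only starts once every expected `/dev/nvidia[#]` file exists.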
Folks,

We have an issue where, when a GPU node gets restarted, slurmd will often not start because the gres.conf file references the device file for the Nvidia GPU, which has not yet been created because the Nvidia drivers are still loading. The effect is that slurmd does not start (error below) and has to be restarted by hand after the fact. Please note that this does not happen every time, but often enough that I would class it as frequent.

This is a Slurm 19.05.2 cluster with 4 nodes, all running Ubuntu 18.04, with various Nvidia GPUs (RTX Titan, Quadro RTX 8000, GTX 1080).

Error reported in the logs and `systemctl status slurmd`:

slurmd[1453]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory

This appears to be similar to this existing bug: https://bugs.schedmd.com/show_bug.cgi?id=3798

We have thought about creating a fix using something in udev and an 'After' in the service file, something like this: https://unix.stackexchange.com/questions/186899/how-to-wait-for-a-dev-video0-before-starting-a-service/186903

This is definitely not ideal, as it would require maintaining a mapping for every machine and how many GPUs it has. Can you give us a better solution?

Thanks,
D

Steps to reproduce: on a multi-GPU node, reboot, then run `sudo systemctl status slurmd`.
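For completeness, the udev + `After=` idea from that Stack Exchange link could be sketched like this (untested, and it carries exactly the maintenance burden noted above, since the watched devices are hard-coded per machine). A udev rule tags the NVIDIA device nodes so systemd exposes them as `.device` units, and slurmd then orders itself after the first one:

```
# /etc/udev/rules.d/99-nvidia-systemd.rules (hypothetical file name)
# Tag NVIDIA device nodes so systemd creates .device units for them.
KERNEL=="nvidia[0-9]*", TAG+="systemd"
```

A drop-in for slurmd.service (e.g. via `systemctl edit slurmd`) would then add:

```
[Unit]
Wants=dev-nvidia0.device
After=dev-nvidia0.device
```

`dev-nvidia0.device` is the systemd unit name for `/dev/nvidia0`; waiting on only the first device file has the same caveat Michael raised about later devices (nvidia1..nvidiaN) possibly still loading.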