| Summary: | Error: can't stat gres.conf file | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Rex Chen <shuningc> |
| Component: | GPU | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 19.05.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7769, https://bugs.schedmd.com/show_bug.cgi?id=16131 | | |
| Site: | AWS+Sixnines Social | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | Configuration files (slurm.conf, gres.conf) | | |
(In reply to Rex Chen from comment #0)

> We are seeing the error:
>
> [2019-12-11T20:47:37.133] error: Waiting for gres.conf file /dev//nvidia0
> [2019-12-11T20:47:56.134] fatal: can't stat gres.conf file /dev//nvidia0: No such file or directory
>
> We ran nvidia-smi and everything looked normal. Here is the output of ls /dev; we can see that /dev/nvidia0 is there.

It might be there now, but it may not have been there when Slurm started. Some sites have had trouble with Slurm finding the GPU device files right after boot when slurmd is started through systemd. If you restart Slurm without rebooting the node, do you get the same error?

In bug 7769, this issue was solved for the most part by adding this to the slurmd service file:

ExecStartPre=/usr/bin/nvidia-smi -L

This seems to force slurmd to wait until the device files are actually instantiated after a system boot. Could you try that and see if it solves the issue?

---

Were you able to solve your issue?

-Michael

---

No update from the client. We can go ahead and close this ticket for now. Thanks.

Rex

---

Ok. Feel free to reopen if things change.

Thanks,
-Michael
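The `ExecStartPre` workaround can be applied without editing the packaged unit file by using a systemd drop-in. A minimal sketch follows; the drop-in path and file name are conventional choices, not from the ticket, and the `nvidia-smi` path may differ on your system:

```
# /etc/systemd/system/slurmd.service.d/wait-for-gpus.conf
# Hypothetical drop-in: delay slurmd startup until the NVIDIA device
# files exist; running `nvidia-smi -L` triggers their creation.
[Service]
ExecStartPre=/usr/bin/nvidia-smi -L
```

After creating the drop-in, run `systemctl daemon-reload` and restart slurmd for it to take effect.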
Created attachment 12545 [details]
Configuration files (slurm.conf, gres.conf)

Hi,

We are seeing the error:

```
[2019-12-11T20:47:37.133] error: Waiting for gres.conf file /dev//nvidia0
[2019-12-11T20:47:56.134] fatal: can't stat gres.conf file /dev//nvidia0: No such file or directory
```

We ran nvidia-smi and everything looked normal. Here is the output of ls /dev; we can see that /dev/nvidia0 is there.

```
ubuntu@ip-10-100-117-224:~$ ls /dev
autofs           loop2                nvme1n1    tty16  tty41  ttyS0
block            loop3                nvme1n1p1  tty17  tty42  ttyS1
btrfs-control    loop4                nvme2      tty18  tty43  ttyS2
char             loop5                nvme2n1    tty19  tty44  ttyS3
console          loop6                nvme2n1p1  tty2   tty45  ttyprintk
core             loop7                port       tty20  tty46  uinput
cpu_dma_latency  mapper               ppp        tty21  tty47  urandom
cuse             mcelog               psaux      tty22  tty48  vcs
disk             mem                  ptmx       tty23  tty49  vcs1
dm-0             memory_bandwidth     pts        tty24  tty5   vcs2
dri              mqueue               random     tty25  tty50  vcs3
ecryptfs         net                  rfkill     tty26  tty51  vcs4
fd               network_latency      rtc        tty27  tty52  vcs5
full             network_throughput   rtc0       tty28  tty53  vcs6
fuse             null                 shm        tty29  tty54  vcsa
hpet             nvidia0              snapshot   tty3   tty55  vcsa1
hugepages        nvidia1              stderr     tty30  tty56  vcsa2
hwrng            nvidia2              stdin      tty31  tty57  vcsa3
infiniband       nvidia3              stdout     tty32  tty58  vcsa4
initctl          nvidia4              tty        tty33  tty59  vcsa5
input            nvidia5              tty0       tty34  tty6   vcsa6
kmsg             nvidia6              tty1       tty35  tty60  vfio
lightnvm         nvidia7              tty10      tty36  tty61  vg.01
lnet             nvidiactl            tty11      tty37  tty62  vga_arbiter
log              nvme0                tty12      tty38  tty63  vhost-net
loop-control     nvme0n1              tty13      tty39  tty7   vhost-vsock
loop0            nvme0n1p1            tty14      tty4   tty8   zero
loop1            nvme1                tty15      tty40  tty9
```

The configuration files are attached. Any idea on this issue? Thank you!
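The fatal error is raised because slurmd stat()s each device path listed in gres.conf before starting. A quick way to mimic that check by hand is a small helper like the sketch below; the function name and the device list in the example are assumptions for illustration, not part of Slurm:

```shell
#!/bin/sh
# check_devs: report whether each given device file exists, mirroring
# the per-device stat() check slurmd performs on gres.conf entries.
check_devs() {
  missing=0
  for dev in "$@"; do
    if [ -e "$dev" ]; then
      echo "OK: $dev"
    else
      echo "MISSING: $dev"
      missing=1
    fi
  done
  return $missing
}

# Example (device names taken from the ls /dev output above):
# check_devs /dev/nvidia0 /dev/nvidia1 /dev/nvidiactl
```

Running this right after boot, before slurmd starts, would show whether the NVIDIA device files have been created yet.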