Ticket 8222

Summary: Error: can't stat gres.conf file
Product: Slurm Reporter: Rex Chen <shuningc>
Component: GPU    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 19.05.3   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7769
https://bugs.schedmd.com/show_bug.cgi?id=16131
Site: AWS+Sixnines Social Slinky Site: ---
Attachments: Configuration files (slurm.conf, gres.conf)

Description Rex Chen 2019-12-12 01:51:56 MST
Created attachment 12545 [details]
Configuration files (slurm.conf, gres.conf)

Hi,

We are seeing the error:
[2019-12-11T20:47:37.133] error: Waiting for gres.conf file /dev//nvidia0
[2019-12-11T20:47:56.134] fatal: can't stat gres.conf file /dev//nvidia0: No such file or directory

We ran nvidia-smi and everything looked normal. Here is the output of ls /dev; we can see that /dev/nvidia0 is present.

ubuntu@ip-10-100-117-224:~$ ls /dev
autofs           loop2               nvme1n1    tty16  tty41  ttyS0
block            loop3               nvme1n1p1  tty17  tty42  ttyS1
btrfs-control    loop4               nvme2      tty18  tty43  ttyS2
char             loop5               nvme2n1    tty19  tty44  ttyS3
console          loop6               nvme2n1p1  tty2   tty45  ttyprintk
core             loop7               port       tty20  tty46  uinput
cpu_dma_latency  mapper              ppp        tty21  tty47  urandom
cuse             mcelog              psaux      tty22  tty48  vcs
disk             mem                 ptmx       tty23  tty49  vcs1
dm-0             memory_bandwidth    pts        tty24  tty5   vcs2
dri              mqueue              random     tty25  tty50  vcs3
ecryptfs         net                 rfkill     tty26  tty51  vcs4
fd               network_latency     rtc        tty27  tty52  vcs5
full             network_throughput  rtc0       tty28  tty53  vcs6
fuse             null                shm        tty29  tty54  vcsa
hpet             nvidia0             snapshot   tty3   tty55  vcsa1
hugepages        nvidia1             stderr     tty30  tty56  vcsa2
hwrng            nvidia2             stdin      tty31  tty57  vcsa3
infiniband       nvidia3             stdout     tty32  tty58  vcsa4
initctl          nvidia4             tty        tty33  tty59  vcsa5
input            nvidia5             tty0       tty34  tty6   vcsa6
kmsg             nvidia6             tty1       tty35  tty60  vfio
lightnvm         nvidia7             tty10      tty36  tty61  vg.01
lnet             nvidiactl           tty11      tty37  tty62  vga_arbiter
log              nvme0               tty12      tty38  tty63  vhost-net
loop-control     nvme0n1             tty13      tty39  tty7   vhost-vsock
loop0            nvme0n1p1           tty14      tty4   tty8   zero
loop1            nvme1               tty15      tty40  tty9


The configuration files are attached. Any ideas about this issue?
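(For context only; the attached files are not reproduced in this ticket. A gres.conf for a node like this one, with eight GPU device nodes /dev/nvidia0 through /dev/nvidia7, typically looks something like the following sketch. The NodeName is taken from the hostname in the ls output above; everything else is illustrative.)

```
# gres.conf -- illustrative sketch, not the attached file
NodeName=ip-10-100-117-224 Name=gpu File=/dev/nvidia[0-7]
```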
Thank you!
Comment 1 Michael Hinton 2019-12-12 06:53:29 MST
(In reply to Rex Chen from comment #0)
> We are seeing the error:
> [2019-12-11T20:47:37.133] error: Waiting for gres.conf file /dev//nvidia0
> [2019-12-11T20:47:56.134] fatal: can't stat gres.conf file /dev//nvidia0: No
> such file or directory
> 
> We ran nvidia-smi and everything looked normal. Here is the output of ls
> /dev, we can see that /dev/nvidia0 is there.

It might be there now, but it may not have been there when slurmd started. Some sites have had trouble with Slurm finding GPU device files right after boot when slurmd is started through systemd.

If you restart Slurm without rebooting the node, do you get the same error?
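(Editorial note: the timeline in the log above shows slurmd retrying for about 20 seconds between the "Waiting for gres.conf file" warning and the fatal error. A minimal sketch of that kind of device-node polling is below; the function name, path, and timeout are illustrative, not slurmd's actual code.)

```shell
#!/bin/sh
# wait_for_dev: poll for a device node until it exists or a timeout
# (in seconds) expires. Returns 0 if the node appeared, 1 otherwise.
wait_for_dev() {
    path=$1
    timeout=${2:-20}        # default roughly matches slurmd's retry window
    i=0
    while [ "$i" -lt "$timeout" ]; do
        [ -e "$path" ] && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Example: wait up to 20 seconds for the first GPU device node
# wait_for_dev /dev/nvidia0 20 || echo "device never appeared"
```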
Comment 2 Michael Hinton 2019-12-12 07:10:37 MST
In bug 7769, this issue was mostly solved by adding this line to the slurmd systemd service file:

ExecStartPre=/usr/bin/nvidia-smi -L

This seems to force slurmd to wait until the device files have actually been created after a system boot.

Could you try that and see if it solves the issue?
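(Editorial note: one way to apply that line without editing the packaged unit file is a systemd drop-in. The drop-in path and file name below are illustrative; run `systemctl daemon-reload` after creating it.)

```
# /etc/systemd/system/slurmd.service.d/wait-for-gpus.conf
[Service]
ExecStartPre=/usr/bin/nvidia-smi -L
```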
Comment 3 Michael Hinton 2019-12-18 15:20:32 MST
Were you able to solve your issue?

-Michael
Comment 4 Rex Chen 2019-12-18 17:31:33 MST
No update from the client. We can go ahead and close this ticket for now.
Thanks.

Rex
Comment 5 Michael Hinton 2019-12-18 17:36:04 MST
Ok. Feel free to reopen if things change.

Thanks,
-Michael