Out of 2600 nodes, 50-200 nodes have had slurmd crash. They leave behind a core file:

[root@r7i7n35 ~]# ls -al /var/spool/slurmd/
total 7832
drwxr-xr-x  2 root root      140 Jan 21 10:27 .
drwxr-xr-x 12 root root      240 Jan 14 12:44 ..
-rw-------  1 root root 53854208 Jan 21 10:26 core.4873

The Slurm build is pretty simple. It's on a CentOS 7.7 platform and was built with (this snippet is from config.log in the build dir):

./configure --prefix=/nopt/slurm/21.08.5/ --sysconfdir=/nopt/slurm/etc

I just now enabled this in slurm.conf:

SlurmdDebug=debug

Then did a scontrol reconfig.

Anything else I should do? Any known problems like this in 21.08.5? Should I upload the core file?
Hi Bill,

> Out of 2600 nodes, 50-200 nodes have had slurmd crash.

That's not good.

> They leave behind a core file:
> Should I upload the core file?

Not the core dump itself, but could you attach a backtrace (bt) taken from a core dump with gdb?

> The Slurm build is pretty simple. It's on a CentOS 7.7 platform and was built
> with (this snippet is from config.log in the build dir):
> ./configure --prefix=/nopt/slurm/21.08.5/ --sysconfdir=/nopt/slurm/etc

Thanks for the info.

> I just now enabled this in slurm.conf:
> SlurmdDebug=debug
>
> Then did a scontrol reconfig.
>
> Anything else I should do?

Thanks for adding the debug line, it might help. But please attach the logs of a crashed slurmd. Even without debug enabled they may contain important info.

> Any known problems like this in 21.08.5?

Not really, but we would need the logs and the backtrace to verify that.

Regards,
Albert
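For reference, a minimal sketch of one way to pull such a backtrace out of a core file non-interactively. The binary path is an assumption derived from the --prefix shown in the configure line (sbin/slurmd under that prefix), and the function name slurmd_backtrace is purely illustrative. Adjust both to the actual install. Debug symbols must be available for the output to be useful.

```shell
#!/bin/sh
# Sketch: dump a backtrace of every thread from a slurmd core file.
# SLURMD_BIN is an assumption based on the configure --prefix in this ticket.
SLURMD_BIN=${SLURMD_BIN:-/nopt/slurm/21.08.5/sbin/slurmd}

# slurmd_backtrace is a hypothetical helper name, not part of Slurm.
slurmd_backtrace() {
    core=$1
    if [ ! -r "$core" ]; then
        echo "core file not readable: $core" >&2
        return 1
    fi
    if [ ! -x "$SLURMD_BIN" ]; then
        echo "slurmd binary not found: $SLURMD_BIN" >&2
        return 1
    fi
    # -batch runs the -ex commands and exits instead of opening a prompt.
    gdb -batch -ex "thread apply all bt full" "$SLURMD_BIN" "$core"
}

# Example (run as root on the affected node):
#   slurmd_backtrace /var/spool/slurmd/core.4873 > slurmd-bt.txt 2>&1
```

The resulting text file is small enough to attach to the ticket, unlike the core file itself.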
Ah, sorry, it looks like our compute node image had an old pam_slurm_adopt.so, and when we upgraded the .so file slurmd was not restarted. The changed pam_slurm_adopt.so seems to have made slurmd unstable until slurmd was restarted, so we lost 1600 nodes when I did a scontrol reconfig last Friday. Since Friday we haven't lost a single node's slurmd (out of 2600 nodes), so I think the problem is solved. It's not slurmd's fault that it acts weird when a .so it uses changes.

Closing this ticket.
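A rough sketch of how a stale library could be spotted before it causes trouble, assuming the upgrade replaced the file by unlinking and recreating it (as package managers typically do). In that case Linux marks the old mapping as "(deleted)" in /proc/<pid>/maps. A file overwritten in place keeps its inode and would not show up this way. The function name stale_libs is hypothetical, and which daemon actually maps the library is left open here. Point it at whichever PID is suspected.

```shell
#!/bin/sh
# Sketch: list shared objects a process still maps although the file on disk
# was deleted or replaced. Takes a file in /proc/<pid>/maps format.
# stale_libs is a hypothetical helper name.
stale_libs() {
    maps_file=$1
    # Mappings of unlinked files end in " (deleted)"; field 6 is the path.
    # Paths containing spaces are not handled by this sketch.
    awk '/\.so[^ ]* \(deleted\)$/ { print $6 }' "$maps_file" | sort -u
}

# Example (assumes a single slurmd process on the node):
#   stale_libs "/proc/$(pidof slurmd)/maps"
```

Any path printed means the process is still running code from the old copy and should be restarted.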
Closing the ticket. Thanks for the information Bill!