Ticket 13239 - slurmd 21.08.5 crashing
Summary: slurmd 21.08.5 crashing
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.5
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Albert Gil
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-01-21 15:50 MST by Bill Broadley
Modified: 2022-01-24 11:29 MST

See Also:
Site: NREL
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Bill Broadley 2022-01-21 15:50:23 MST
Out of 2600 nodes, 50-200 nodes have slurmd crashing.

They leave behind a core file:
[root@r7i7n35 ~]# ls -al /var/spool/slurmd/
total 7832
drwxr-xr-x  2 root root      140 Jan 21 10:27 .
drwxr-xr-x 12 root root      240 Jan 14 12:44 ..
-rw-------  1 root root 53854208 Jan 21 10:26 core.4873

The Slurm build is pretty simple: it's on a CentOS 7.7 platform and built with (this snippet is from config.log in the build dir):
./configure --prefix=/nopt/slurm/21.08.5/ --sysconfdir=/nopt/slurm/etc

I just now enabled this in slurm.conf:
SlurmdDebug=debug

Then did a scontrol reconfig.

Anything else I should do?  Any known problems like this in 21.08.5?

Should I upload the core file?
Comment 1 Albert Gil 2022-01-24 03:36:38 MST
Hi Bill,

> Out of 2600 nodes, 50-200 nodes have slurmd crashing.

That's not good.

> They leave behind a core file:
> Should I upload the core file?

Not the coredump, but could you attach a backtrace (bt) from a coredump using gdb?
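For example, a backtrace can be pulled from the core in one batch gdb invocation, something like this (a sketch; the slurmd binary path is assumed from the --prefix shown below, and the core file name is the one from your listing):

```shell
# Sketch: extract a full backtrace from the slurmd core dump with gdb.
# Both paths are assumptions based on this ticket; adjust to your install.
SLURMD_BIN=/nopt/slurm/21.08.5/sbin/slurmd
CORE_FILE=/var/spool/slurmd/core.4873

if command -v gdb >/dev/null 2>&1; then
    gdb -batch \
        -ex "set pagination off" \
        -ex "thread apply all bt full" \
        "$SLURMD_BIN" "$CORE_FILE" > slurmd-bt.txt || true
fi
```

Attaching the resulting slurmd-bt.txt to the ticket is usually enough; we rarely need the coredump itself.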


> The Slurm build is pretty simple: it's on a CentOS 7.7 platform and built
> with (this snippet is from config.log in the build dir):
> ./configure --prefix=/nopt/slurm/21.08.5/ --sysconfdir=/nopt/slurm/etc

Thanks for the info.

> I just now enabled this in slurm.conf:
> SlurmdDebug=debug
> 
> Then did a scontrol reconfig.
> 
> Anything else I should do?

Thanks for adding the debug line, it might help.
But please attach the logs of a crashed slurmd.
Even without the debug it may contain important info.
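If it helps, the log location can be read from the running configuration and the tail of the file collected from a crashed node, along the lines of this sketch (whether slurmd logs to a file or to syslog depends on SlurmdLogFile; the /var/log path below is an assumption):

```shell
# Sketch: locate and collect slurmd's log from a node whose slurmd crashed.
# If SlurmdLogFile is unset, slurmd logs via syslog instead of a file.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | grep -i SlurmdLogFile
fi

# Assumed path; replace with the value printed above.
LOG=/var/log/slurmd.log
if [ -f "$LOG" ]; then
    tail -n 500 "$LOG" > slurmd-crash.log
fi
```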

>  Any known problems like this in 21.08.5?

Not really, but we would need the logs and the backtrace to verify that.

Regards,
Albert
Comment 2 Bill Broadley 2022-01-24 09:48:20 MST
Ah, sorry, it looks like our compute node image had an old pam_slurm_adopt.so, and when we upgraded the .so file slurmd was not restarted.

So the stale pam_slurm_adopt.so seemed to make slurmd unstable until slurmd was restarted; we lost 1600 nodes when I did a scontrol reconfig last Friday.

But since Friday we haven't lost a single node's slurmd (out of 2600 nodes), so I think the problem is solved; it's not really slurmd's fault that it acts strangely when a .so it uses changes underneath it.
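As a quick check for this situation, one can list the Slurm-related shared objects a running slurmd currently has mapped and compare their timestamps against what's on disk (a sketch; assumes a Linux /proc filesystem and that pidof is available):

```shell
# Sketch: spot a slurmd that is still holding old .so mappings after a
# library upgrade. The grep pattern is an assumption; narrow it as needed.
PID=$(pidof slurmd || true)
if [ -n "$PID" ] && [ -r "/proc/$PID/maps" ]; then
    grep -E 'slurm.*\.so' "/proc/$PID/maps" | awk '{print $NF}' | sort -u
fi
```

Any library listed there that was replaced on disk after slurmd started is a candidate for exactly this kind of instability, and restarting slurmd clears it.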

Closing this ticket.
Comment 3 Albert Gil 2022-01-24 11:29:01 MST
Closing the ticket.
Thanks for the information Bill!