Ticket 13239

Summary: slurmd 21.08.5 crashing
Product: Slurm Reporter: Bill Broadley <bill.broadley>
Component: slurmd    Assignee: Albert Gil <albert.gil>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 21.08.5   
Hardware: Linux   
OS: Linux   
Site: NREL

Description Bill Broadley 2022-01-21 15:50:23 MST
Out of 2600 nodes, 50-200 nodes have slurmd crashing.

They leave behind a core file:
[root@r7i7n35 ~]# ls -al /var/spool/slurmd/
total 7832
drwxr-xr-x  2 root root      140 Jan 21 10:27 .
drwxr-xr-x 12 root root      240 Jan 14 12:44 ..
-rw-------  1 root root 53854208 Jan 21 10:26 core.4873

The Slurm build is pretty simple: it's on a CentOS 7.7 platform, built with (this snippet from config.log in the build dir):
./configure --prefix=/nopt/slurm/21.08.5/ --sysconfdir=/nopt/slurm/etc

I just now enabled this in slurm.conf:
SlurmdDebug=debug

Then did a scontrol reconfig.

Anything else I should do?  Any known problems like this in 21.08.5?

Should I upload the core file?
Comment 1 Albert Gil 2022-01-24 03:36:38 MST
Hi Bill,

> Out of 2600 nodes, 50-200 nodes have slurmd crashing.

That's not good.

> They leave behind a core file:
> Should I upload the core file?

Not the coredump, but could you attach a backtrace (bt) from a coredump using gdb?
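For example (binary path assumed from your --prefix, core path from your listing; adjust both for your site), something like this should produce a full backtrace:

```shell
# Run gdb non-interactively against the slurmd binary and the core it left
# behind, dumping a backtrace for every thread into a text file.
gdb /nopt/slurm/21.08.5/sbin/slurmd /var/spool/slurmd/core.4873 \
    -batch -ex 'thread apply all bt full' > slurmd-bt.txt
```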


> The Slurm build is pretty simple: it's on a CentOS 7.7 platform, built
> with (this snippet from config.log in the build dir):
> ./configure --prefix=/nopt/slurm/21.08.5/ --sysconfdir=/nopt/slurm/etc

Thanks for the info.

> I just now enabled this in slurm.conf:
> SlurmdDebug=debug
> 
> Then did a scontrol reconfig.
> 
> Anything else I should do?

Thanks for adding the debug line, it might help.
But please attach the logs of a crashed slurmd.
Even without the debug it may contain important info.
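If you're unsure where slurmd writes its log on those nodes, something along these lines should locate it (assuming a fairly standard setup; the log may go to syslog/journald if SlurmdLogFile is unset):

```shell
# Ask the running configuration where slurmd logs.
scontrol show config | grep -i SlurmdLogFile

# On systemd-based nodes, crash messages may also be in the journal.
journalctl -u slurmd --since "2022-01-21"
```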

>  Any known problems like this in 21.08.5?

Not really, but we would need the logs and the backtrace to verify that.

Regards,
Albert
Comment 2 Bill Broadley 2022-01-24 09:48:20 MST
Ah, sorry. It looks like our compute node image had an old pam_slurm_adopt.so, and when we upgraded the .so file slurmd was not restarted.

So the stale pam_slurm_adopt.so seemed to make slurmd unstable until slurmd was restarted; we lost 1600 nodes when I did a scontrol reconfig last Friday.

But since Friday we haven't lost a single node's slurmd (out of 2600 nodes), so I think the problem is solved. It's not slurmd's fault that it acts weird when a .so it uses changes underneath it.
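For the record, the fix amounted to restarting slurmd everywhere the new library had been pushed. With pdsh that might look like the following (node range is hypothetical; adjust to your cluster):

```shell
# Restart slurmd on every node that received the new pam_slurm_adopt.so,
# so the daemon stops referencing the replaced shared object.
pdsh -w 'r7i7n[0-35]' systemctl restart slurmd
```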

Closing this ticket.
Comment 3 Albert Gil 2022-01-24 11:29:01 MST
Closing the ticket.
Thanks for the information Bill!