Ticket 13239 - slurmd 21.08.5 crashing
Summary: slurmd 21.08.5 crashing
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.5
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Albert Gil
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-01-21 15:50 MST by Bill Broadley
Modified: 2022-01-24 11:29 MST

See Also:
Site: NREL
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Bill Broadley 2022-01-21 15:50:23 MST
Out of 2600 nodes, 50-200 nodes have slurmd crashing.

They leave behind a core file:
[root@r7i7n35 ~]# ls -al /var/spool/slurmd/
total 7832
drwxr-xr-x  2 root root      140 Jan 21 10:27 .
drwxr-xr-x 12 root root      240 Jan 14 12:44 ..
-rw-------  1 root root 53854208 Jan 21 10:26 core.4873

The Slurm build is pretty simple: it's on a CentOS 7.7 platform and built with (this snippet is from config.log in the build dir):
./configure --prefix=/nopt/slurm/21.08.5/ --sysconfdir=/nopt/slurm/etc

I just now enabled this in slurm.conf:
SlurmdDebug=debug

Then did a scontrol reconfig.

Anything else I should do?  Any known problems like this in 21.08.5?

Should I upload the core file?
Comment 1 Albert Gil 2022-01-24 03:36:38 MST
Hi Bill,

> Out of 2600 nodes, 50-200 nodes have slurmd crashing.

That's not good.

> They leave behind a core file:
> Should I upload the core file?

Not the coredump, but could you attach a backtrace (bt) from a coredump using gdb?
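For example, a backtrace can be pulled from the core in one batch gdb invocation, something like this (a sketch; the slurmd binary path is assumed from the --prefix shown below, and the core file name is the one from your listing):

```shell
# Sketch: extract a full backtrace from the slurmd core dump with gdb.
# Both paths are assumptions based on this ticket; adjust to your install.
SLURMD_BIN=/nopt/slurm/21.08.5/sbin/slurmd
CORE_FILE=/var/spool/slurmd/core.4873

if command -v gdb >/dev/null 2>&1; then
    gdb -batch \
        -ex "set pagination off" \
        -ex "thread apply all bt full" \
        "$SLURMD_BIN" "$CORE_FILE" > slurmd-bt.txt || true
fi
```

Attaching the resulting slurmd-bt.txt to the ticket is usually enough; we rarely need the coredump itself.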


> The Slurm build is pretty simple: it's on a CentOS 7.7 platform and built
> with (this snippet is from config.log in the build dir):
> ./configure --prefix=/nopt/slurm/21.08.5/ --sysconfdir=/nopt/slurm/etc

Thanks for the info.

> I just now enabled this in slurm.conf:
> SlurmdDebug=debug
> 
> Then did a scontrol reconfig.
> 
> Anything else I should do?

Thanks for adding the debug line, it might help.
But please attach the logs of a crashed slurmd.
Even without the debug it may contain important info.
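If it helps, the log location can be read from the running configuration and the tail of the file collected from a crashed node, along the lines of this sketch (whether slurmd logs to a file or to syslog depends on SlurmdLogFile; the /var/log path below is an assumption):

```shell
# Sketch: locate and collect slurmd's log from a node whose slurmd crashed.
# If SlurmdLogFile is unset, slurmd logs via syslog instead of a file.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | grep -i SlurmdLogFile
fi

# Assumed path; replace with the value printed above.
LOG=/var/log/slurmd.log
if [ -f "$LOG" ]; then
    tail -n 500 "$LOG" > slurmd-crash.log
fi
```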

>  Any known problems like this in 21.08.5?

Not really, but we would need the logs and the backtrace to verify that.

Regards,
Albert
Comment 2 Bill Broadley 2022-01-24 09:48:20 MST
Ah, sorry, it looks like our compute node image had an old pam_slurm_adopt.so, and when we upgraded the .so file slurmd was not restarted.

So the stale pam_slurm_adopt.so seemed to make slurmd unstable until slurmd was restarted; we lost 1600 nodes when I did a scontrol reconfig last Friday.

But since Friday we haven't lost a single node's slurmd (out of 2600 nodes), so I think the problem is solved; it's not really slurmd's fault that it acts strangely when a .so it uses changes underneath it.
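As a quick check for this situation, one can list the Slurm-related shared objects a running slurmd currently has mapped and compare their timestamps against what's on disk (a sketch; assumes a Linux /proc filesystem and that pidof is available):

```shell
# Sketch: spot a slurmd that is still holding old .so mappings after a
# library upgrade. The grep pattern is an assumption; narrow it as needed.
PID=$(pidof slurmd || true)
if [ -n "$PID" ] && [ -r "/proc/$PID/maps" ]; then
    grep -E 'slurm.*\.so' "/proc/$PID/maps" | awk '{print $NF}' | sort -u
fi
```

Any library listed there that was replaced on disk after slurmd started is a candidate for exactly this kind of instability, and restarting slurmd clears it.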

Closing this ticket.
Comment 3 Albert Gil 2022-01-24 11:29:01 MST
Closing the ticket.
Thanks for the information Bill!