Ticket 13239

Summary: slurmd 21.08.5 crashing
Product: Slurm Reporter: Bill Broadley <bill.broadley>
Component: slurmd    Assignee: Albert Gil <albert.gil>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 21.08.5   
Hardware: Linux   
OS: Linux   
Site: NREL

Description Bill Broadley 2022-01-21 15:50:23 MST
Out of 2600 nodes, 50-200 nodes have slurmd crashing.

They leave behind a core file:
[root@r7i7n35 ~]# ls -al /var/spool/slurmd/
total 7832
drwxr-xr-x  2 root root      140 Jan 21 10:27 .
drwxr-xr-x 12 root root      240 Jan 14 12:44 ..
-rw-------  1 root root 53854208 Jan 21 10:26 core.4873

The Slurm build is pretty simple: it's on a CentOS 7.7 platform, built with (this snippet from config.log in the build dir):
./configure --prefix=/nopt/slurm/21.08.5/ --sysconfdir=/nopt/slurm/etc

I just now enabled this in slurm.conf:
SlurmdDebug=debug

Then did a scontrol reconfig.

Anything else I should do?  Any known problems like this in 21.08.5?

Should I upload the core file?
Comment 1 Albert Gil 2022-01-24 03:36:38 MST
Hi Bill,

> Out of 2600 nodes, 50-200 nodes have slurmd crashing.

That's not good.

> They leave behind a core file:
> Should I upload the core file?

Not the coredump, but could you attach a backtrace (bt) from a coredump using gdb?
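For example (binary path assumed from your --prefix, core path from your listing; adjust both for your site), something like this should produce a full backtrace:

```shell
# Run gdb non-interactively against the slurmd binary and the core it left
# behind, dumping a backtrace for every thread into a text file.
gdb /nopt/slurm/21.08.5/sbin/slurmd /var/spool/slurmd/core.4873 \
    -batch -ex 'thread apply all bt full' > slurmd-bt.txt
```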


> The Slurm build is pretty simple: it's on a CentOS 7.7 platform, built
> with (this snippet from config.log in the build dir):
> ./configure --prefix=/nopt/slurm/21.08.5/ --sysconfdir=/nopt/slurm/etc

Thanks for the info.

> I just now enabled this in slurm.conf:
> SlurmdDebug=debug
> 
> Then did a scontrol reconfig.
> 
> Anything else I should do?

Thanks for adding the debug line, it might help.
But please attach the logs of a crashed slurmd.
Even without the debug it may contain important info.
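If you're unsure where slurmd writes its log on those nodes, something along these lines should locate it (assuming a fairly standard setup; the log may go to syslog/journald if SlurmdLogFile is unset):

```shell
# Ask the running configuration where slurmd logs.
scontrol show config | grep -i SlurmdLogFile

# On systemd-based nodes, crash messages may also be in the journal.
journalctl -u slurmd --since "2022-01-21"
```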

>  Any known problems like this in 21.08.5?

Not really, but we would need the logs and the backtrace to verify that.

Regards,
Albert
Comment 2 Bill Broadley 2022-01-24 09:48:20 MST
Ah, sorry. It looks like our compute node image had an old pam_slurm_adopt.so, and when we upgraded the .so file slurmd was not restarted.

So the stale pam_slurm_adopt.so seemed to make slurmd unstable until slurmd was restarted; we lost 1600 nodes when I did a scontrol reconfig last Friday.

But since Friday we haven't lost a single node's slurmd (out of 2600 nodes), so I think the problem is solved. It's not slurmd's fault that it acts weird when a .so it uses changes underneath it.
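For the record, the fix amounted to restarting slurmd everywhere the new library had been pushed. With pdsh that might look like the following (node range is hypothetical; adjust to your cluster):

```shell
# Restart slurmd on every node that received the new pam_slurm_adopt.so,
# so the daemon stops referencing the replaced shared object.
pdsh -w 'r7i7n[0-35]' systemctl restart slurmd
```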

Closing this ticket.
Comment 3 Albert Gil 2022-01-24 11:29:01 MST
Closing the ticket.
Thanks for the information Bill!