| Summary: | slurmd 21.08.5 crashing | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Bill Broadley <bill.broadley> |
| Component: | slurmd | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 21.08.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NREL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Bill Broadley
2022-01-21 15:50:23 MST

Hi Bill,

> Out of 2600 nodes, 50-200 nodes have slurmd crash.

That's not good.

> They leave behind a core file:
> Should I upload the core file?

Not the coredump, but could you attach a backtrace (bt) from a coredump using gdb?

> The slurm build is pretty simple, it's on a CentOS 7.7 platform and built
> with (this snippet from config.log in the build dir):
> ./configure --prefix=/nopt/slurm/21.08.5/ --sysconfdir=/nopt/slurm/etc

Thanks for the info.

> I just now enabled this in slurm.conf:
> SlurmdDebug=debug
>
> Then did a scontrol reconfig.
>
> Anything else I should do?

Thanks for adding the debug line; it might help. But please also attach the logs of a crashed slurmd. Even without the debug setting they may contain important info.

> Any known problems like this in 21.08.5?

Not really, but we would need the logs and the backtrace to verify that.

Regards,
Albert

---

Ah, sorry, it looks like our compute node image had an old pam_slurm_adopt.so, and when we upgraded the .so file slurmd was not restarted. The stale pam_slurm_adopt.so seemed to make slurmd unstable until slurmd was restarted. So we lost 1600 nodes when I did a scontrol reconfig last Friday. But since Friday we haven't lost a single node's slurmd (out of 2600 nodes), so I think the problem is solved; it's not slurmd's fault that it acts weird when a .so it uses changes underneath it.

Closing this ticket.

---

Closing the ticket. Thanks for the information Bill!