| Summary: | slurmctld not logging, slurmctld restart resolved the issue | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Scott Lucas <slucas> |
| Component: | slurmctld | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | brian.mccaul, kwhetham, nate, slucas |
| Version: | 20.11.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | FB (PSLA) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | Ubuntu |
| Machine Name: | h2slurm1.h2.fair | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Scott Lucas
2021-10-07 16:20:31 MDT
Did you happen to take a `lsof -p $(pgrep slurmctld)` before restarting?

We did not, but will make sure we run that should it happen in the future. Anything else we should check?

(In reply to Scott Lucas from comment #2)
> We did not, but will make sure we run that should it happen in the future.
> Anything else we should check?

Please take a coredump at the same time (and generate a backtrace). I will need to look at both to figure out what is going on.

So the next time this happens, run: `lsof -p $(pgrep slurmctld)` and `scontrol abort` (to generate a coredump)?

(In reply to Scott Lucas from comment #4)
> scontrol abort (to generate a coredump)

That would kill the server and potentially cause data loss. Instead use gcore to grab the core without killing the daemon:
> gcore $(pgrep slurmctld)

Will do
Scott

I'm going to mark this issue as timed out. Once the logs are ready, please reply with them and we can continue debugging.

Thanks,
--Nate
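
For anyone hitting the same symptom, the two captures requested above (the open-file list and a core taken without stopping the daemon) could be wrapped in one small helper so both are collected at the same moment. This is only a rough sketch, not something provided in this ticket: the function name `capture_slurmctld_state`, the default output directory `/tmp/slurmctld-debug`, and the file naming are all assumptions. It likely needs to run as root or as the user that owns the daemon.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the ticket): capture the open-file list and
# a core of a running slurmctld WITHOUT killing it, before any restart.

capture_slurmctld_state() {
    # $1: output directory (assumed default), $2: process name to look up
    local outdir="${1:-/tmp/slurmctld-debug}"
    local proc="${2:-slurmctld}"
    local pid stamp

    # Oldest process with exactly this name; fail if none is running.
    pid="$(pgrep -o -x "$proc")" || {
        echo "$proc is not running" >&2
        return 1
    }

    stamp="$(date +%Y%m%dT%H%M%S)"
    mkdir -p "$outdir" || return 1

    # Open files first: shows whether the log file descriptor is still open.
    lsof -p "$pid" > "$outdir/lsof.$pid.$stamp.txt" 2>&1

    # gcore (shipped with gdb) writes <prefix>.<pid> and leaves the daemon running.
    gcore -o "$outdir/core.$stamp" "$pid"
}
```

Calling `capture_slurmctld_state` with no arguments uses the assumed defaults. A backtrace can then usually be produced from the resulting core with something like `gdb -batch -ex 'thread apply all bt full' "$(command -v slurmctld)" <core file>`, assuming gdb and the matching slurmctld binary are available on that host.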