Ticket 12631

Summary: slurmctld not logging, slurmctld restart resolved the issue
Product: Slurm Reporter: Scott Lucas <slucas>
Component: slurmctld Assignee: Nate Rini <nate>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: brian.mccaul, kwhetham, nate, slucas
Version: 20.11.8   
Hardware: Linux   
OS: Linux   
Site: FB (PSLA) Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: Ubuntu
Machine Name: h2slurm1.h2.fair CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Scott Lucas 2021-10-07 16:20:31 MDT
We were troubleshooting some job issues in Slurm. Slurm commands were all working; however, slurmctld.log was empty. Restarting slurmctld resolved the issue, but we're not sure what caused the log file to be empty.
Comment 1 Nate Rini 2021-10-07 16:33:04 MDT
Did you happen to take a `lsof -p $(pgrep slurmctld)` before restarting?
Comment 2 Scott Lucas 2021-10-08 12:43:43 MDT
We did not, but will make sure we run that should it happen in the future. Anything else we should check?
Comment 3 Nate Rini 2021-10-08 12:57:15 MDT
(In reply to Scott Lucas from comment #2)
> We did not, but will make sure we run that should it happen in the future.
> Anything else we should check?

Please take a coredump at the same time (and generate a backtrace). I will need to look at both to figure out what is going on.
Comment 4 Scott Lucas 2021-10-08 13:54:10 MDT
So the next time this happens, run:

`lsof -p $(pgrep slurmctld)`

and

scontrol abort (to generate a coredump)

?
Comment 5 Nate Rini 2021-10-08 14:01:22 MDT
(In reply to Scott Lucas from comment #4)
> scontrol abort (to generate a coredump)

That would kill the server and potentially cause data loss.

Instead use gcore to grab the core without killing the daemon:
> gcore $(pgrep slurmctld)
Comment 6 Scott Lucas 2021-10-08 17:31:26 MDT
Will do
Comment 7 Nate Rini 2021-10-13 10:57:19 MDT
Scott,

I'm going to mark this issue as timed out. Once the logs are ready, please reply with them and we can continue debugging.

Thanks,
--Nate
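
The capture steps discussed in comments 1 through 5 (`lsof`, then `gcore`, then a backtrace) could be combined into one helper so that nothing is missed the next time the log stops. The following is only a sketch under stated assumptions: the function name `capture_slurmctld_state`, its arguments, and the output file names are hypothetical and not part of Slurm; it assumes a Linux host with `lsof`, `gcore`, and `gdb` installed, and a user with enough privileges to attach to the daemon (typically root).

```shell
# Sketch only: capture_slurmctld_state is a hypothetical helper, not a Slurm tool.
# Usage: capture_slurmctld_state PID [OUTPUT_DIR]
capture_slurmctld_state() {
    local pid="$1"
    local outdir="${2:-.}"

    # Accept exactly one numeric PID; reject empty input or several PIDs.
    case "$pid" in
        ''|*[!0-9]*)
            echo "capture_slurmctld_state: expected a single numeric PID, got '$pid'" >&2
            return 1
            ;;
    esac

    # Linux-specific check that the process exists.
    if [ ! -d "/proc/$pid" ]; then
        echo "capture_slurmctld_state: no process with PID $pid" >&2
        return 1
    fi

    mkdir -p "$outdir" || return 1

    # List the open files of the daemon, as requested in comment 1.
    lsof -p "$pid" > "$outdir/lsof.$pid.txt" 2>&1

    # gcore writes <prefix>.<pid> and leaves the process running (comment 5).
    gcore -o "$outdir/core" "$pid" || return 1

    # Backtrace of all threads from the saved core (comment 3).
    # /proc/<pid>/exe resolves to the binary of the still-running daemon.
    gdb -batch -ex 'thread apply all bt full' \
        "/proc/$pid/exe" "$outdir/core.$pid" \
        > "$outdir/backtrace.$pid.txt" 2>&1
}
```

A possible invocation, with the output directory chosen only for illustration, is `capture_slurmctld_state "$(pgrep -x slurmctld)" /tmp/slurmctld-diag`. If `pgrep` matches more than one process, the helper refuses to run rather than guessing which PID is meant.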