Ticket 12631 - slurmctld not logging, slurmctld restart resolved the issue
Summary: slurmctld not logging, slurmctld restart resolved the issue
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.11.8
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-10-07 16:20 MDT by Scott Lucas
Modified: 2021-10-13 10:57 MDT

See Also:
Site: FB (PSLA)
Linux Distro: Ubuntu
Machine Name: h2slurm1.h2.fair
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---



Description Scott Lucas 2021-10-07 16:20:31 MDT
We were troubleshooting some job issues in Slurm. The Slurm commands were all working, but slurmctld.log was empty. Restarting slurmctld resolved the issue, but we're not sure what caused the log file to be empty.
Comment 1 Nate Rini 2021-10-07 16:33:04 MDT
Did you happen to take a `lsof -p $(pgrep slurmctld)` before restarting?
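
For illustration only, and not part of the original request: on Linux the open-file information that `lsof -p` reports is also visible under `/proc/<pid>/fd`. The sketch below swaps in that technique and assumes a Linux host with permission to read the daemon's `/proc` entries; the function name `list_open_files` is hypothetical. A file that was removed or rotated away while still held open would appear with a ` (deleted)` suffix, which is one thing such a listing could reveal about the log file.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): list the open file descriptors
# of a process by reading /proc/<pid>/fd, a Linux-only stand-in for
# `lsof -p <pid>`. Usually needs root or the daemon's own user.
list_open_files() {
    pid="$1"
    for fd in /proc/"$pid"/fd/*; do
        # Skip descriptors that closed in the meantime (or an unmatched glob).
        target="$(readlink "$fd" 2>/dev/null)" || continue
        # Print "<fd number> -> <what it points at>".
        printf '%s -> %s\n' "${fd##*/}" "$target"
    done
}
```

A possible invocation, reusing the PID lookup from the comment above, would be `list_open_files "$(pgrep slurmctld)" | grep -i log`.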
Comment 2 Scott Lucas 2021-10-08 12:43:43 MDT
We did not, but will make sure we run that should it happen in the future. Anything else we should check?
Comment 3 Nate Rini 2021-10-08 12:57:15 MDT
(In reply to Scott Lucas from comment #2)
> We did not, but will make sure we run that should it happen in the future.
> Anything else we should check?

Please take a coredump at the same time (and generate a backtrace). I will need to look at both to figure out what is going on.
Comment 4 Scott Lucas 2021-10-08 13:54:10 MDT
So the next time this happens, run:

`lsof -p $(pgrep slurmctld)`

and

scontrol abort (to generate a coredump)

?
Comment 5 Nate Rini 2021-10-08 14:01:22 MDT
(In reply to Scott Lucas from comment #4)
> scontrol abort (to generate a coredump)

That would kill the server and potentially cause data loss.

Instead use gcore to grab the core without killing the daemon:
> gcore $(pgrep slurmctld)
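
The requests so far (open-file list, core file, backtrace) could be gathered in one pass. The following is a minimal sketch under stated assumptions, not a procedure confirmed in this ticket: it assumes `lsof`, `gcore`, and `gdb` are installed, that the caller is allowed to attach to the daemon (typically root), and that `/proc` is available. The function name `collect_diag` and the output file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): collect the open-file list, a
# core file, and a backtrace from a running process without stopping it.
collect_diag() {
    pid="$1"
    outdir="$2"
    # Binary that gdb resolves the core against when printing the backtrace.
    exe="$(readlink -f "/proc/$pid/exe")" || return 1

    mkdir -p "$outdir" || return 1

    # 1. Open files, captured while the daemon is still in the bad state.
    lsof -p "$pid" > "$outdir/lsof.txt"

    # 2. Core file; gcore attaches briefly and leaves the process running.
    #    With -o PREFIX, the core is written to PREFIX.<pid>.
    gcore -o "$outdir/core" "$pid" || return 1

    # 3. Backtrace of every thread, generated from the core file.
    gdb -batch -ex "thread apply all bt full" "$exe" "$outdir/core.$pid" \
        > "$outdir/backtrace.txt" 2>&1
}
```

It might be called as `collect_diag "$(pgrep slurmctld)" /var/tmp/slurmctld-diag`, where the output directory is only an example path. Taking the `lsof` snapshot first keeps it as close as possible to the moment the problem is noticed.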
Comment 6 Scott Lucas 2021-10-08 17:31:26 MDT
Will do
Comment 7 Nate Rini 2021-10-13 10:57:19 MDT
Scott

I'm going to mark this issue as timed out. Once the logs are ready, please reply with them and we can continue debugging.

Thanks,
--Nate