Ticket 11547

Summary: Seeing "Could not open job state file" message, warns "Jobs may be lost!"
Product: Slurm
Reporter: Will Dennis <wdennis>
Component: slurmctld
Assignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.11.5   
Hardware: Linux   
OS: Linux   
Site: NEC Labs
Linux Distro: Ubuntu
Machine Name: ma-slurm-ctlr

Description Will Dennis 2021-05-05 13:45:36 MDT
Hello,

When shutting down slurmctld after its first run (started via "slurmctld -Dvvvv"), I see this in the log output:

slurmctld: Terminate signal (SIGINT or SIGTERM) received
slurmctld: debug:  sched: slurmctld terminating
slurmctld: debug3: _slurmctld_rpc_mgr shutting down
slurmctld: Saving all slurm state
slurmctld: debug:  create_mmap_buf: Failed to open file `/var/lib/slurm/slurmctld/job_state`, No such file or directory
slurmctld: error: Could not open job state file /var/lib/slurm/slurmctld/job_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: debug:  create_mmap_buf: Failed to open file `/var/lib/slurm/slurmctld/job_state.old`, No such file or directory
slurmctld: No job state file (/var/lib/slurm/slurmctld/job_state.old) found
slurmctld: debug3: Writing job id 0 to header record of job_state file
slurmctld: debug3: _slurmctld_background shutting down

Should I be concerned about this? If so, what should I do to fix it?
Comment 1 Jason Booth 2021-05-05 15:19:16 MDT
Will - if this is the first time the scheduler has started and shut down, then what you are seeing is normal. Slurm writes state information to the StateSaveLocation on shutdown. This includes information about jobs, partitions, nodes, associations, cluster name, federation, database messages, config state, TRES, QOS, priority, reservations, and triggers.

Since this is the first time the cluster has started, the StateSaveLocation will not contain any state files until the first shutdown, at which point Slurm writes them out. That is why neither job_state nor job_state.old could be opened in your log.
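
To tell the harmless first-run case apart from a real problem, it can help to look at what is actually in the StateSaveLocation. The sketch below is only an illustration, not Slurm's own logic: the function name classify_state_dir is hypothetical, and the path in the usage note is the one from the log above. It checks for job_state and then job_state.old, in the same order the log messages report.

```python
from pathlib import Path


def classify_state_dir(state_dir):
    """Report which job state file is present in a StateSaveLocation.

    Hypothetical helper for illustration only. It follows the order seen
    in the log output: job_state first, then job_state.old.
    """
    root = Path(state_dir)
    if (root / "job_state").is_file():
        return "primary"   # normal case once state has been saved
    if (root / "job_state.old").is_file():
        return "backup"    # only the older copy is left
    return "none"          # expected on a fresh controller


if __name__ == "__main__":
    # Path taken from the log above; adjust to your StateSaveLocation.
    print(classify_state_dir("/var/lib/slurm/slurmctld"))
```

A result of "none" before the first shutdown matches the situation described in this ticket. A result of "backup" on a controller that has already been running would be the case where the "Jobs may be lost!" note deserves a closer look.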
Comment 2 Will Dennis 2021-05-05 20:12:19 MDT
It is the first time, but since this message is new to me, I wanted to check. You may go ahead and close, thanks!