Ticket 572 - The node is down after scontrol reconfigure
Summary: The node is down after scontrol reconfigure
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 2.6.x
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: David Bigagli
 
Reported: 2014-01-26 11:21 MST by NAOKI
Modified: 2016-09-19 14:51 MDT
See Also:
Site: CRAY


Attachments

Description NAOKI 2014-01-26 11:21:45 MST
Hi,

We changed the following configuration value:

JobRequeue=0

After changing slurm.conf, we applied the new configuration with
scontrol reconfigure. Unfortunately, slurmd went down on some nodes with the
following error:

 

# scontrol: error: slurm_receive_msg: Socket timed out on send/recv operation
# slurm_reconfigure error: Socket timed out on send/recv operation


In what situations does this error occur?
I would like to understand why some nodes go down when running scontrol reconfigure.


Thanks in advance,
Naoki.
Comment 1 Moe Jette 2014-01-26 11:31:33 MST
I presume that you mean the slurmctld daemon is down, not the slurmd daemons.
This message:
# scontrol: error: slurm_receive_msg: Socket timed out on send/recv operation
# slurm_reconfigure error: Socket timed out on send/recv operation

indicates that your slurmctld is not responding. That might happen on a reconfiguration if you have a huge number of jobs, something seriously wrong in your configuration, or a really slow file system holding the state file. Another possibility is that your configuration is bad and slurmctld died. Please attach your slurmctld log file for this time period.
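
For anyone hitting the same message, the checks above could be gathered with something like the following. This is only a rough sketch, not an official procedure: the function names is_ctld_timeout and triage_reconfigure are hypothetical, and it assumes the Slurm client tools (scontrol, sinfo) are installed on the host where it runs.

```shell
#!/bin/sh
# Sketch: triage a timed-out "scontrol reconfigure".
# Assumption: run on a host with the Slurm client tools installed.

# Hypothetical helper: succeeds if the text on stdin contains the
# timeout error quoted in this ticket.
is_ctld_timeout() {
    grep -q 'Socket timed out on send/recv operation'
}

# Hypothetical helper: collect the basic facts about the controller.
triage_reconfigure() {
    if ! command -v scontrol >/dev/null 2>&1; then
        echo "scontrol not found; run this on a Slurm host" >&2
        return 1
    fi
    # Does the controller answer at all?
    scontrol ping
    # Where is the slurmctld log file (the file requested in this ticket)?
    scontrol show config | grep -i 'SlurmctldLogFile'
    # Which nodes are down or drained, and with what recorded reason?
    sinfo -R
}
```

One possible use is to save the output of the failed command to a file and run `is_ctld_timeout < saved_output.txt && triage_reconfigure`. If scontrol ping reports the controller as DOWN, the slurmctld log is the first place to look.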
Comment 2 Moe Jette 2014-01-31 03:33:57 MST
Still waiting for logs and configuration file.
Comment 3 Moe Jette 2014-02-24 02:11:47 MST
If this happens again, please open a ticket with the configuration file and logs.