Hi,

We changed the following configuration value in slurm.conf:

JobRequeue=0

After changing slurm.conf, we applied the new configuration with scontrol reconfigure. Unfortunately, slurmd is down on some nodes, with the following error:

# scontrol: error: slurm_receive_msg: Socket timed out on send/recv operation
# slurm_reconfigure error: Socket timed out on send/recv operation

In what situations does this error occur? I would like to understand why some nodes went down when we ran scontrol reconfigure.

Thanks in advance,
Naoki.
I presume that you mean the slurmctld daemon is down, not the slurmd daemons. This message:

# scontrol: error: slurm_receive_msg: Socket timed out on send/recv operation
# slurm_reconfigure error: Socket timed out on send/recv operation

indicates that your slurmctld is not responding. That might happen on a reconfiguration if you have a huge number of jobs, something really bad in your configuration, or a really slow file system holding the state file. Another possibility is that your configuration is bad and slurmctld died. Please attach your slurmctld log file for this time period.
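For reference, here is a rough sketch of how the requested information might be gathered. It is only an illustration, not an official procedure: the function names check_ctld and collect_ctld_errors are hypothetical helpers, and the log path in the usage note is an assumption (the real path is whatever SlurmctldLogFile is set to in your slurm.conf).

```shell
#!/bin/sh
# Hypothetical helper: ask the controller whether it responds and show the
# settings relevant to this report. Does nothing if scontrol is not installed.
check_ctld() {
    if ! command -v scontrol >/dev/null 2>&1; then
        echo "scontrol not found in PATH" >&2
        return 2
    fi
    scontrol ping    # reports whether the slurmctld daemon(s) respond
    sinfo -R         # lists down/drained nodes together with the recorded reason
    scontrol show config |
        grep -i -E 'JobRequeue|MessageTimeout|SlurmctldLogFile|StateSaveLocation'
}

# Hypothetical helper: print the most recent log lines that mention an error,
# a fatal condition, or a reconfigure.
#   $1 = path to the slurmctld log file (assumed; see SlurmctldLogFile)
#   $2 = maximum number of matching lines to print (default 200)
collect_ctld_errors() {
    ctld_log="$1"
    ctld_max="${2:-200}"
    if [ ! -r "$ctld_log" ]; then
        echo "cannot read log file: $ctld_log" >&2
        return 1
    fi
    grep -i -E 'error|fatal|reconfig' "$ctld_log" | tail -n "$ctld_max"
}
```

Usage might look like running check_ctld first and then collect_ctld_errors /var/log/slurmctld.log 100 > ctld_errors.txt, where /var/log/slurmctld.log is only a placeholder path. The filtered output is a convenience for a first look; the full log for the time period is still what should be attached.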
Still waiting for logs and configuration file.
If this happens again, please open a ticket with the configuration file and logs.