Ticket 15458

Summary: Jobs fail to complete and are requeued after completion; nodes frequently go to the DOWN state
Product: Slurm
Reporter: SAJIL NP <iamsajil7>
Component: slurmd
Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID
QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: -Other-
Attachments: slurmd.log

Description SAJIL NP 2022-11-17 23:01:42 MST
Created attachment 27832
slurmd.log

Hi SLURM Support,

Slurm Version: 18.08.8
OS: Centos 7.7

Slurm fails to update the state of completed jobs on one of the nodes I'm using. As a result, completed jobs are requeued and the node eventually goes to the DOWN state. I have to bring the node back up each time, and then the cycle repeats.

Munge and slurmd on the node, and slurmctld on the master node, are all active.
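
Because the log below complains about the path "/var/run/munge/munge.socket.2", one quick check is whether that socket actually exists on the node at the moment a job finishes. The following is only a minimal sketch: the helper name is_unix_socket is made up for illustration, and the path is simply the one quoted in the log messages.

```python
import os
import stat


def is_unix_socket(path):
    """Return True if `path` exists and is a UNIX domain socket.

    Hypothetical helper for illustration; it is not part of Slurm or MUNGE.
    """
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Covers "No such file or directory" as well as permission errors.
        return False


if __name__ == "__main__":
    # Path taken from the "Munge encode failed" messages in the slurmd log.
    path = "/var/run/munge/munge.socket.2"
    state = "present" if is_unix_socket(path) else "missing or not a socket"
    print(path, state)
```

Running this on the affected node while a job is completing would show whether the socket disappears at that point, even if the munge service is reported as active.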


When I check the slurmd log, I find a few errors like the following, which lead to the node going to the DOWN state:

1. slurm_connect failed: Connection refused
2. Error connecting slurm stream socket at 192.168.1.23:44184: Connection refused
3. Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
4. _send_srun_resp_msg: 5/5 failed to send msg type 6003: Network is unreachable
5. Error connecting slurm stream socket at 192.168.1.1:6817: Network is unreachable
6. Failed to contact primary controller: Network is unreachable
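
To see how often each of the errors listed above occurs, and for which job step, the slurmd log can be tallied with a short script. This is a rough sketch only: the function name tally_failures, the labels, and the regular expression are assumptions based on the line layout visible in the log excerpt below, not on any official Slurm log specification.

```python
import re
from collections import Counter

# Assumed layout, based on the excerpt in this ticket:
#   [timestamp] [jobid.step] level: message
# The "[jobid.step]" part is optional (slurmd's own lines omit it).
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?:\[(?P<step>[^\]]+)\]\s+)?(?P<msg>.*)$"
)

# Substrings copied from the errors listed above; the labels are arbitrary.
PATTERNS = {
    "munge socket missing": "Munge encode failed",
    "connection refused": "Connection refused",
    "network unreachable": "Network is unreachable",
}


def tally_failures(lines):
    """Count matching failure messages per (step, label) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue
        step = m.group("step") or "slurmd"
        for label, needle in PATTERNS.items():
            if needle in m.group("msg"):
                counts[(step, label)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py /path/to/slurmd.log
    with open(sys.argv[1], errors="replace") as fh:
        for (step, label), n in sorted(tally_failures(fh).items()):
            print(f"{step:>16}  {label:<22} {n}")
```

Grouping by step makes it easier to tell the batch step's Munge failures apart from the network errors reported by step 0.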


*********************-------------------****************************

slurmd log output:
[2022-11-18T08:06:05.051] [23859.batch] debug3: sending task exit msg for 1 tasks status 65280 oom 0
[2022-11-18T08:06:05.051] [23859.batch] debug2: Before call to spank_fini()
[2022-11-18T08:06:05.051] [23859.batch] debug2: After call to spank_fini()
[2022-11-18T08:06:05.051] [23859.batch] job 23859 completed with slurm_rc = 0, job_rc = 65280
[2022-11-18T08:06:05.051] [23859.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 65280
[2022-11-18T08:06:05.051] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.060] [23859.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '5764' for '/sys/fs/cgroup/freezer'
[2022-11-18T08:06:05.060] [23859.0] debug3: Took 0 checks before stepd pid 5764 was removed from the freezer job cgroup.
[2022-11-18T08:06:05.060] [23859.0] debug:  step_terminate_monitor_stop signaling condition
[2022-11-18T08:06:05.060] [23859.0] debug2: step_terminate_monitor is stopping
[2022-11-18T08:06:05.060] [23859.0] debug2: Sending SIGKILL to pgid 5764
[2022-11-18T08:06:05.060] [23859.0] debug:  Waiting for IO
[2022-11-18T08:06:05.060] [23859.0] debug:  Closing debug channel
[2022-11-18T08:06:05.060] [23859.0] debug4: Entering _task_read for obj 1fa5000
[2022-11-18T08:06:05.060] [23859.0] debug5:   got eof on task
[2022-11-18T08:06:05.060] [23859.0] debug5: ************************ 0 bytes read from task STDERR
[2022-11-18T08:06:05.060] [23859.0] debug4: Entering _send_eof_msg
[2022-11-18T08:06:05.060] [23859.0] debug5: ======================== Enqueued eof message
[2022-11-18T08:06:05.060] [23859.0] debug4: Leaving  _send_eof_msg
[2022-11-18T08:06:05.060] [23859.0] debug4: eio: handling events for 4 objects
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _task_writable
[2022-11-18T08:06:05.060] [23859.0] debug5:   false, fd == -1
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _task_readable, task 0, STDOUT
[2022-11-18T08:06:05.060] [23859.0] debug5:   false, eof message sent
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _task_readable, task 0, STDERR
[2022-11-18T08:06:05.060] [23859.0] debug5:   false, eof message sent
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _client_writable
[2022-11-18T08:06:05.060] [23859.0] debug5:   client->out.msg_queue queue length = 1
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _client_readable
[2022-11-18T08:06:05.060] [23859.0] debug5:   false, in_eof
[2022-11-18T08:06:05.060] [23859.0] debug4: Entering _client_write
[2022-11-18T08:06:05.060] [23859.0] debug5:   dequeue successful, client->out_msg->length = 10
[2022-11-18T08:06:05.060] [23859.0] debug5:   client->out_remaining = 10
[2022-11-18T08:06:05.060] [23859.0] debug4: eio: handling events for 4 objects
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _task_writable
[2022-11-18T08:06:05.060] [23859.0] debug5:   false, fd == -1
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _task_readable, task 0, STDOUT
[2022-11-18T08:06:05.060] [23859.0] debug5:   false, eof message sent
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _task_readable, task 0, STDERR
[2022-11-18T08:06:05.060] [23859.0] debug5:   false, eof message sent
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _client_writable
[2022-11-18T08:06:05.060] [23859.0] debug5:   false, out_eof
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _client_readable
[2022-11-18T08:06:05.060] [23859.0] debug5:   false, in_eof
[2022-11-18T08:06:05.060] [23859.0] debug:  IO handler exited, rc=0
[2022-11-18T08:06:05.060] [23859.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '5764' for '/sys/fs/cgroup/cpuset'
[2022-11-18T08:06:05.060] [23859.0] debug3: Took 0 checks before stepd pid 5764 was removed from the cpuset step cgroup.
[2022-11-18T08:06:05.062] [23859.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '5764' for '/sys/fs/cgroup/devices'
[2022-11-18T08:06:05.062] [23859.0] debug3: Took 0 checks before stepd pid 5764 was removed from the devices step cgroup.
[2022-11-18T08:06:05.062] [23859.0] debug2: Aggregated 1 task exit messages
[2022-11-18T08:06:05.062] [23859.0] debug3: sending task exit msg for 1 tasks status 1280 oom 0
[2022-11-18T08:06:05.062] [23859.0] debug2: slurm_connect failed: Connection refused
[2022-11-18T08:06:05.062] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.23:44184: Connection refused
[2022-11-18T08:06:05.062] [23859.0] debug:  _send_srun_resp_msg: 0/5 failed to send msg type 6003: Connection refused
[2022-11-18T08:06:05.151] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.162] [23859.0] debug2: slurm_connect failed: Connection refused
[2022-11-18T08:06:05.162] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.23:44184: Connection refused
[2022-11-18T08:06:05.162] [23859.0] debug:  _send_srun_resp_msg: 1/5 failed to send msg type 6003: Connection refused
[2022-11-18T08:06:05.251] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.352] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.363] [23859.0] debug2: slurm_connect failed: Connection refused
[2022-11-18T08:06:05.363] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.23:44184: Connection refused
[2022-11-18T08:06:05.363] [23859.0] debug:  _send_srun_resp_msg: 2/5 failed to send msg type 6003: Connection refused
[2022-11-18T08:06:05.452] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.552] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.653] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.753] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.764] [23859.0] debug2: slurm_connect failed: Connection refused
[2022-11-18T08:06:05.764] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.23:44184: Connection refused
[2022-11-18T08:06:05.764] [23859.0] debug:  _send_srun_resp_msg: 3/5 failed to send msg type 6003: Connection refused
[2022-11-18T08:06:05.853] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.953] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.053] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.153] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.254] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.354] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.454] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.554] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.564] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.23:44184: Network is unreachable
[2022-11-18T08:06:06.564] [23859.0] debug:  _send_srun_resp_msg: 4/5 failed to send msg type 6003: Network is unreachable
[2022-11-18T08:06:06.654] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.754] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.854] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.954] [23859.batch] debug:  Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:07.054] [23859.batch] error: If munged is up, restart with --num-threads=10
[2022-11-18T08:06:07.054] [23859.batch] error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
[2022-11-18T08:06:07.054] [23859.batch] error: authentication: Socket communication error
[2022-11-18T08:06:07.054] [23859.batch] Retrying job complete RPC for 23859.4294967294
[2022-11-18T08:06:07.364] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.23:44184: Network is unreachable
[2022-11-18T08:06:07.364] [23859.0] debug:  _send_srun_resp_msg: 5/5 failed to send msg type 6003: Network is unreachable
[2022-11-18T08:06:07.364] [23859.0] debug2: Before call to spank_fini()
[2022-11-18T08:06:07.364] [23859.0] debug2: After call to spank_fini()
[2022-11-18T08:06:07.364] [23859.0] debug2: Rank 0 has no children slurmstepd
[2022-11-18T08:06:07.364] [23859.0] debug2: _one_step_complete_msg: first=0, last=0
[2022-11-18T08:06:07.364] [23859.0] debug3: Rank 0 sending complete to slurmctld, range 0 to 0
[2022-11-18T08:06:07.365] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Network is unreachable
[2022-11-18T08:06:07.365] [23859.0] debug:  Failed to contact primary controller: Network is unreachable
[2022-11-18T09:25:10.442] debug:  Log file re-opened