| Summary: | Jobs fail to complete and are requeued after completion, and nodes frequently go to the down state | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | SAJIL NP <iamsajil7> |
| Component: | slurmd | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmd.log | ||
Created attachment 27832: slurmd.log

Hi SLURM Support,

Slurm version: 18.08.8
OS: CentOS 7.7

Slurm fails to record job completion on one of the nodes I'm using. As a result, completed jobs are requeued and the node eventually goes to the down state. I have to bring the node back up each time, and the cycle then repeats.

Munge and slurmd on the node, and slurmctld on the master node, are all active.

When I checked the slurmd log, I found a few errors like the following, which lead to the node going to the down state:

1. slurm_connect failed: Connection refused
2. Error connecting slurm stream socket at 192.168.1.23:44184: Connection refused
3. Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
4. _send_srun_resp_msg: 5/5 failed to send msg type 6003: Network is unreachable
5. Error connecting slurm stream socket at 192.168.1.1:6817: Network is unreachable
6. Failed to contact primary controller: Network is unreachable

slurmd log output:

[2022-11-18T08:06:05.051] [23859.batch] debug3: sending task exit msg for 1 tasks status 65280 oom 0
[2022-11-18T08:06:05.051] [23859.batch] debug2: Before call to spank_fini()
[2022-11-18T08:06:05.051] [23859.batch] debug2: After call to spank_fini()
[2022-11-18T08:06:05.051] [23859.batch] job 23859 completed with slurm_rc = 0, job_rc = 65280
[2022-11-18T08:06:05.051] [23859.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 65280
[2022-11-18T08:06:05.051] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.060] [23859.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '5764' for '/sys/fs/cgroup/freezer'
[2022-11-18T08:06:05.060] [23859.0] debug3: Took 0 checks before stepd pid 5764 was removed from the freezer job cgroup.
[2022-11-18T08:06:05.060] [23859.0] debug: step_terminate_monitor_stop signaling condition
[2022-11-18T08:06:05.060] [23859.0] debug2: step_terminate_monitor is stopping
[2022-11-18T08:06:05.060] [23859.0] debug2: Sending SIGKILL to pgid 5764
[2022-11-18T08:06:05.060] [23859.0] debug: Waiting for IO
[2022-11-18T08:06:05.060] [23859.0] debug: Closing debug channel
[2022-11-18T08:06:05.060] [23859.0] debug4: Entering _task_read for obj 1fa5000
[2022-11-18T08:06:05.060] [23859.0] debug5: got eof on task
[2022-11-18T08:06:05.060] [23859.0] debug5: ************************ 0 bytes read from task STDERR
[2022-11-18T08:06:05.060] [23859.0] debug4: Entering _send_eof_msg
[2022-11-18T08:06:05.060] [23859.0] debug5: ======================== Enqueued eof message
[2022-11-18T08:06:05.060] [23859.0] debug4: Leaving _send_eof_msg
[2022-11-18T08:06:05.060] [23859.0] debug4: eio: handling events for 4 objects
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _task_writable
[2022-11-18T08:06:05.060] [23859.0] debug5: false, fd == -1
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _task_readable, task 0, STDOUT
[2022-11-18T08:06:05.060] [23859.0] debug5: false, eof message sent
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _task_readable, task 0, STDERR
[2022-11-18T08:06:05.060] [23859.0] debug5: false, eof message sent
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _client_writable
[2022-11-18T08:06:05.060] [23859.0] debug5: client->out.msg_queue queue length = 1
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _client_readable
[2022-11-18T08:06:05.060] [23859.0] debug5: false, in_eof
[2022-11-18T08:06:05.060] [23859.0] debug4: Entering _client_write
[2022-11-18T08:06:05.060] [23859.0] debug5: dequeue successful, client->out_msg->length = 10
[2022-11-18T08:06:05.060] [23859.0] debug5: client->out_remaining = 10
[2022-11-18T08:06:05.060] [23859.0] debug4: eio: handling events for 4 objects
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _task_writable
[2022-11-18T08:06:05.060] [23859.0] debug5: false, fd == -1
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _task_readable, task 0, STDOUT
[2022-11-18T08:06:05.060] [23859.0] debug5: false, eof message sent
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _task_readable, task 0, STDERR
[2022-11-18T08:06:05.060] [23859.0] debug5: false, eof message sent
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _client_writable
[2022-11-18T08:06:05.060] [23859.0] debug5: false, out_eof
[2022-11-18T08:06:05.060] [23859.0] debug5: Called _client_readable
[2022-11-18T08:06:05.060] [23859.0] debug5: false, in_eof
[2022-11-18T08:06:05.060] [23859.0] debug: IO handler exited, rc=0
[2022-11-18T08:06:05.060] [23859.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '5764' for '/sys/fs/cgroup/cpuset'
[2022-11-18T08:06:05.060] [23859.0] debug3: Took 0 checks before stepd pid 5764 was removed from the cpuset step cgroup.
[2022-11-18T08:06:05.062] [23859.0] debug3: xcgroup_set_uint32_param: parameter 'cgroup.procs' set to '5764' for '/sys/fs/cgroup/devices'
[2022-11-18T08:06:05.062] [23859.0] debug3: Took 0 checks before stepd pid 5764 was removed from the devices step cgroup.
[2022-11-18T08:06:05.062] [23859.0] debug2: Aggregated 1 task exit messages
[2022-11-18T08:06:05.062] [23859.0] debug3: sending task exit msg for 1 tasks status 1280 oom 0
[2022-11-18T08:06:05.062] [23859.0] debug2: slurm_connect failed: Connection refused
[2022-11-18T08:06:05.062] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.23:44184: Connection refused
[2022-11-18T08:06:05.062] [23859.0] debug: _send_srun_resp_msg: 0/5 failed to send msg type 6003: Connection refused
[2022-11-18T08:06:05.151] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.162] [23859.0] debug2: slurm_connect failed: Connection refused
[2022-11-18T08:06:05.162] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.23:44184: Connection refused
[2022-11-18T08:06:05.162] [23859.0] debug: _send_srun_resp_msg: 1/5 failed to send msg type 6003: Connection refused
[2022-11-18T08:06:05.251] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.352] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.363] [23859.0] debug2: slurm_connect failed: Connection refused
[2022-11-18T08:06:05.363] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.23:44184: Connection refused
[2022-11-18T08:06:05.363] [23859.0] debug: _send_srun_resp_msg: 2/5 failed to send msg type 6003: Connection refused
[2022-11-18T08:06:05.452] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.552] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.653] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.753] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.764] [23859.0] debug2: slurm_connect failed: Connection refused
[2022-11-18T08:06:05.764] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.23:44184: Connection refused
[2022-11-18T08:06:05.764] [23859.0] debug: _send_srun_resp_msg: 3/5 failed to send msg type 6003: Connection refused
[2022-11-18T08:06:05.853] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:05.953] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.053] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.153] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.254] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.354] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.454] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.554] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.564] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.23:44184: Network is unreachable
[2022-11-18T08:06:06.564] [23859.0] debug: _send_srun_resp_msg: 4/5 failed to send msg type 6003: Network is unreachable
[2022-11-18T08:06:06.654] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.754] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.854] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:06.954] [23859.batch] debug: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2022-11-18T08:06:07.054] [23859.batch] error: If munged is up, restart with --num-threads=10
[2022-11-18T08:06:07.054] [23859.batch] error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
[2022-11-18T08:06:07.054] [23859.batch] error: authentication: Socket communication error
[2022-11-18T08:06:07.054] [23859.batch] Retrying job complete RPC for 23859.4294967294
[2022-11-18T08:06:07.364] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.23:44184: Network is unreachable
[2022-11-18T08:06:07.364] [23859.0] debug: _send_srun_resp_msg: 5/5 failed to send msg type 6003: Network is unreachable
[2022-11-18T08:06:07.364] [23859.0] debug2: Before call to spank_fini()
[2022-11-18T08:06:07.364] [23859.0] debug2: After call to spank_fini()
[2022-11-18T08:06:07.364] [23859.0] debug2: Rank 0 has no children slurmstepd
[2022-11-18T08:06:07.364] [23859.0] debug2: _one_step_complete_msg: first=0, last=0
[2022-11-18T08:06:07.364] [23859.0] debug3: Rank 0 sending complete to slurmctld, range 0 to 0
[2022-11-18T08:06:07.365] [23859.0] debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Network is unreachable
[2022-11-18T08:06:07.365] [23859.0] debug: Failed to contact primary controller: Network is unreachable
[2022-11-18T09:25:10.442] debug: Log file re-opened
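
To see how often each kind of failure recurs in a log like the one above, the entries can be tallied with a short script. The following is only a rough sketch, not a Slurm tool: it assumes the timestamp layout seen in this excerpt, and the function names and category labels (`munge_socket_missing`, `connection_refused`, `network_unreachable`) are made up for illustration. It splits on timestamps rather than on newlines, so it also works when several entries have been pasted onto one line.

```python
import re
from collections import Counter

# Timestamp layout as it appears in the excerpt, e.g. [2022-11-18T08:06:05.062]
TIMESTAMP = r"\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}\]"

# One entry = a timestamp plus everything up to the next timestamp (or end of text).
ENTRY_RE = re.compile(rf"({TIMESTAMP})\s*(.*?)\s*(?={TIMESTAMP}|\Z)", re.DOTALL)

# Hypothetical category labels mapped to substrings taken from the log above.
FAILURE_PATTERNS = {
    "munge_socket_missing": "Munge encode failed",
    "connection_refused": "Connection refused",
    "network_unreachable": "Network is unreachable",
}


def split_entries(text):
    """Return (timestamp, message) pairs, one per log entry."""
    return [
        (ts.strip("[]"), " ".join(body.split()))  # collapse wrapped whitespace
        for ts, body in ENTRY_RE.findall(text)
    ]


def summarize_failures(text):
    """Count how many entries fall into each failure category."""
    counts = Counter()
    for _, body in split_entries(text):
        for label, needle in FAILURE_PATTERNS.items():
            if needle in body:
                counts[label] += 1
    return counts


# Example use (the path is an assumption; adjust it to the local SlurmdLogFile):
#     with open("/var/log/slurmd.log") as fh:
#         print(summarize_failures(fh.read()))
```

A large `munge_socket_missing` count alongside `network_unreachable` entries in the same time window would suggest checking the munged service and the node's network together, rather than treating each message as a separate problem.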