Ticket 2420

Summary: slurmd hanging/locking up
Product: Slurm Reporter: Will French <will>
Component: slurmd Assignee: Alejandro Sanchez <alex>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Priority: --- CC: alex, charles.johnson, davide.vanzo, sean, tim
Version: 15.08.7   
Hardware: Linux   
OS: Linux   
Site: Vanderbilt
Version Fixed: 15.08.8 16.05.0-pre1
Attachments: slurm.conf
slurmdbd.conf
slurmdbd.log
Output from gdb session attached to hung slurmd process

Description Will French 2016-02-06 00:42:34 MST
I have already brought this up in ticket 2395, but this issue does not appear to be related to the original subject of ticket 2395 so I'm opening a new one here.

Since our upgrade from 14.11.9 to 15.08.7 we have been seeing slurmd lock up on about 3-8 of our compute nodes (out of ~620 total) per day, resulting in NODE_FAIL for any jobs running on those nodes. slurmctld is unable to communicate with these nodes and they are marked down* by Slurm. Here are the nodes from last night (logs shown below for each of these systems):

08:34:35-frenchwr@vmps12:~$ sinfoprobs | grep vmp5
vmp505       down*      eight                2016-02-06T00:55:37  slurm        Not responding                                      
vmp560       down*      eight                2016-02-06T04:26:38  slurm        Not responding                                      
vmp564       down*      eight                2016-02-06T00:33:57  slurm        Not responding 

Interestingly, slurmd will still be running on these nodes, and the nodes are accessible via ssh and respond to ping. Load is not high on the nodes. A "service slurmd restart" does not restart the slurmd process; it appears to be hung. We end up needing to kill -9 the original slurmd process, at which point the node immediately begins talking to slurmctld again and is available for scheduling (see the sketch below).
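
A minimal sketch of that recovery sequence (PID illustrative):

pgrep -f sbin/slurmd         # find the hung slurmd's PID
kill -9 <PID>                # SIGKILL; a plain "service slurmd restart" has no effect
service slurmd start         # the node immediately resumes talking to slurmctld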

Before leaving yesterday we bumped up the SlurmdDebug from 3 to 7. Here's logging info on nodes where slurmd froze up last night:

------------------

Feb  5 17:32:21 vmp564 slurmstepd[14651]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Feb  5 17:32:21 vmp564 slurmstepd[14651]: done with job
Feb  5 17:32:52 vmp564 slurmd[1315]: _run_prolog: run job script took usec=437
Feb  5 17:32:52 vmp564 slurmd[1315]: _run_prolog: prolog with lock for job 7046384 ran for 0 seconds
Feb  5 17:32:52 vmp564 slurmd[1315]: Launching batch job 7046384 for UID 60813
Feb  5 17:32:52 vmp564 slurmstepd[5483]: task/cgroup: /slurm/uid_60813/job_7046384: alloc=4000MB mem.limit=4000MB memsw.limit=4400MB
Feb  5 17:32:52 vmp564 slurmstepd[5483]: task/cgroup: /slurm/uid_60813/job_7046384/step_batch: alloc=4000MB mem.limit=4000MB memsw.limit=4400MB
Feb  5 17:44:57 vmp564 slurmd[1315]: error: Timeout waiting for completion of 172 threads
Feb  5 17:44:57 vmp564 slurmd[1315]: Node configuration differs from hardware: CPUs=8:16(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw) CoresPerSocket=4:4(hw) ThreadsPerCore=2:2(hw)
Feb  5 17:44:57 vmp564 slurmd[1315]: Message aggregation disabled
Feb  5 17:44:57 vmp564 slurmd[1315]: CPU frequency setting not configured for this node
Feb  5 17:44:57 vmp564 slurmd[1315]: Resource spec: Reserved system memory limit not configured for this node
Feb  6 00:27:22 vmp564 slurmd[1315]: active_threads == MAX_THREADS(256)
Feb  6 00:34:03 vmp564 slurmstepd[3341]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256
Feb  6 00:34:06 vmp564 slurmstepd[3341]: done with job
Feb  6 00:34:06 vmp564 slurmd[1315]: error: gathering job accounting: 0
Feb  6 00:34:06 vmp564 slurmd[1315]: error: gathering job accounting: 0
Feb  6 00:34:06 vmp564 slurmd[1315]: error: gathering job accounting: 0
Feb  6 00:34:06 vmp564 slurmd[1315]: error: gathering job accounting: 0
Feb  6 00:34:06 vmp564 slurmd[1315]: error: gathering job accounting: 0
Feb  6 00:34:06 vmp564 slurmd[1315]: error: gathering job accounting: 0
Feb  6 00:34:06 vmp564 slurmd[1315]: error: gathering job accounting: 0
Feb  6 00:34:06 vmp564 slurmd[1315]: error: gathering job accounting: 0
Feb  6 00:34:06 vmp564 slurmd[1315]: error: gathering job accounting: 0

-------------------

Feb  5 22:54:53 vmp560 slurmstepd[18611]: debug level = 2
Feb  5 22:54:53 vmp560 slurmstepd[18611]: Called _msg_socket_readable
Feb  5 22:54:53 vmp560 slurmstepd[18611]: Called _msg_socket_readable
Feb  5 22:54:53 vmp560 slurmstepd[18611]: Called _msg_socket_readable
Feb  5 22:54:53 vmp560 slurmstepd[18611]: task 0 (18615) started 2016-02-05T22:54:53
Feb  5 22:54:53 vmp560 slurmstepd[18611]: Called _msg_socket_readable
Feb  5 22:54:53 vmp560 slurmstepd[18611]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/job_container_none.so
Feb  5 22:54:53 vmp560 slurmstepd[18611]: Success.
Feb  5 22:54:53 vmp560 slurmstepd[18611]: Unblocking 7048085.4294967294 task 0, writefd = 17
Feb  5 22:54:53 vmp560 slurmstepd[18615]: Can't propagate RLIMIT_NPROC of 514844 from submit host: Operation not permitted
Feb  6 00:55:44 vmp560 slurmstepd[30423]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Feb  6 00:55:47 vmp560 slurmstepd[30423]: done with job
Feb  6 00:55:47 vmp560 slurmd[28396]: error: gathering job accounting: 0
Feb  6 00:55:47 vmp560 slurmd[28396]: error: gathering job accounting: 0
Feb  6 00:55:47 vmp560 slurmd[28396]: error: gathering job accounting: 0
Feb  6 00:55:47 vmp560 slurmd[28396]: error: gathering job accounting: 0
Feb  6 00:55:47 vmp560 slurmd[28396]: error: gathering job accounting: 0
.
.
.
Feb  6 00:59:44 vmp560 slurmd[28396]: launch task 7047718.1 request from 103978.58698@10.0.22.6 (port 52871)
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/switch_none.so
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Success.
Feb  6 00:59:44 vmp560 slurmstepd[6582]: slurmstepd rank 21, parent address = 10.0.22.57, port = 41498
Feb  6 00:59:44 vmp560 slurmstepd[6582]: entering stepd_step_rec_create
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/acct_gather_profile_none.so
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Success.
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/acct_gather_energy_none.so
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Success.
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/acct_gather_infiniband_none.so
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Success.
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/acct_gather_filesystem_none.so
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Success.
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/jobacct_gather_none.so
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Success.
Feb  6 00:59:44 vmp560 slurmstepd[6582]: jobacct_gather dynamic logging enabled
Feb  6 00:59:44 vmp560 slurmstepd[6582]: acct_gather_profile_startpoll dynamic logging enabled
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Entered job_manager for 7047718.1 pid=6582
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/core_spec_none.so
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Success.
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/task_cgroup.so
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Success.
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/proctrack_cgroup.so
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Success.
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/checkpoint_none.so
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Success.
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/mpi_none.so
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Success.
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '1' for '/cgroup/freezer/slurm'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '1' for '/cgroup/freezer/slurm/uid_103978'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '1' for '/cgroup/freezer/slurm/uid_103978/job_7047718'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '1' for '/cgroup/freezer/slurm/uid_103978/job_7047718/step_1'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '0' for '/cgroup/freezer/slurm/uid_103978/job_7047718/step_1'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '0' for '/cgroup/cpuset/slurm'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: slurm cgroup /slurm successfully created for ns cpuset: File exists
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '1' for '/cgroup/cpuset/slurm/uid_103978'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'cpuset.cpus' set to '0-15' for '/cgroup/cpuset/slurm/uid_103978'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'cpuset.mems' set to '0-1' for '/cgroup/cpuset/slurm/uid_103978'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'cpuset.cpus' set to '4-5,12-13' for '/cgroup/cpuset/slurm/uid_103978'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '1' for '/cgroup/cpuset/slurm/uid_103978/job_7047718'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'cpuset.cpus' set to '4-5,12-13' for '/cgroup/cpuset/slurm/uid_103978/job_7047718'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'cpuset.mems' set to '0-1' for '/cgroup/cpuset/slurm/uid_103978/job_7047718'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'cpuset.cpus' set to '4-5,12-13' for '/cgroup/cpuset/slurm/uid_103978/job_7047718'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '1' for '/cgroup/cpuset/slurm/uid_103978/job_7047718/step_1'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'cpuset.cpus' set to '4-5,12-13' for '/cgroup/cpuset/slurm/uid_103978/job_7047718/step_1'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'cpuset.mems' set to '0-1' for '/cgroup/cpuset/slurm/uid_103978/job_7047718/step_1'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'cpuset.cpus' set to '4-5,12-13' for '/cgroup/cpuset/slurm/uid_103978/job_7047718/step_1'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '0' for '/cgroup/memory/slurm'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: slurm cgroup /slurm successfully created for ns memory: File exists
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '1' for '/cgroup/memory/slurm/uid_103978'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory/slurm/uid_103978'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '0' for '/cgroup/memory/slurm/uid_103978/job_7047718'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory/slurm/uid_103978/job_7047718'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'memory.limit_in_bytes' set to '2097152000' for '/cgroup/memory/slurm/uid_103978/job_7047718'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'memory.memsw.limit_in_bytes' set to '2306867200' for '/cgroup/memory/slurm/uid_103978/job_7047718'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: task/cgroup: /slurm/uid_103978/job_7047718: alloc=2000MB mem.limit=2000MB memsw.limit=2200MB
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'notify_on_release' set to '0' for '/cgroup/memory/slurm/uid_103978/job_7047718/step_1'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory/slurm/uid_103978/job_7047718/step_1'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'memory.limit_in_bytes' set to '2097152000' for '/cgroup/memory/slurm/uid_103978/job_7047718/step_1'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: parameter 'memory.memsw.limit_in_bytes' set to '2306867200' for '/cgroup/memory/slurm/uid_103978/job_7047718/step_1'
Feb  6 00:59:44 vmp560 slurmstepd[6582]: task/cgroup: /slurm/uid_103978/job_7047718/step_1: alloc=2000MB mem.limit=2000MB memsw.limit=2200MB
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: debug level = 2
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: task 21 (6587) started 2016-02-06T00:59:44
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Trying to load plugin /usr/scheduler/slurm/lib/slurm/job_container_none.so
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Success.
Feb  6 00:59:44 vmp560 slurmstepd[6582]: Unblocking 7047718.1 task 0, writefd = 25
Feb  6 00:59:44 vmp560 slurmstepd[6582]: slurm_send_only_node_msg: sent 0
Feb  6 00:59:45 vmp560 slurmstepd[6582]: Called _msg_socket_accept
Feb  6 00:59:45 vmp560 slurmstepd[6582]: Leaving _msg_socket_accept
Feb  6 00:59:45 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 00:59:45 vmp560 slurmstepd[6582]: Entering _handle_accept (new thread)
Feb  6 00:59:45 vmp560 slurmstepd[6582]:   Identity: uid=0, gid=0
Feb  6 00:59:45 vmp560 slurmstepd[6582]: Entering _handle_request
Feb  6 00:59:45 vmp560 slurmstepd[6582]: Got request 20
Feb  6 00:59:45 vmp560 slurmstepd[6582]: Leaving  _handle_request: SLURM_SUCCESS
Feb  6 00:59:45 vmp560 slurmstepd[6582]: Entering _handle_request
Feb  6 00:59:45 vmp560 slurmstepd[6582]: Got request 4
Feb  6 00:59:45 vmp560 slurmstepd[6582]: Leaving  _handle_request: SLURM_SUCCESS
Feb  6 00:59:45 vmp560 slurmstepd[6582]: Entering _handle_request
Feb  6 00:59:45 vmp560 slurmstepd[6582]: Leaving  _handle_accept
Feb  6 01:00:10 vmp560 slurmstepd[6582]: task 21 (6587) exited with exit code 0.
Feb  6 01:00:10 vmp560 slurmstepd[6582]: parameter 'freezer.state' set to 'THAWED' for '/cgroup/freezer/slurm/uid_103978/job_7047718/step_1'
Feb  6 01:00:10 vmp560 slurmstepd[6582]: parameter 'cgroup.procs' set to '6582' for '/cgroup/freezer'
Feb  6 01:00:10 vmp560 slurmstepd[6582]: sending task exit msg for 1 tasks status 0
Feb  6 01:00:10 vmp560 slurmstepd[6582]: delaying 250ms
Feb  6 01:00:10 vmp560 slurmstepd[6582]: slurm_send_only_node_msg: sent 0
Feb  6 01:00:10 vmp560 slurmstepd[6582]: Rank 21 sending complete to rank 18, range 21 to 21
Feb  6 01:00:10 vmp560 slurmstepd[6582]: Called _msg_socket_readable
Feb  6 01:00:10 vmp560 slurmstepd[6582]: done with job
Feb  6 04:20:03 vmp560 slurmd[28396]: active_threads == MAX_THREADS(256)
Feb  6 04:26:44 vmp560 slurmstepd[27912]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256
Feb  6 04:26:44 vmp560 slurmstepd[27912]: done with job

-----------------


Feb  4 22:56:34 vmp505 slurmstepd[28117]: error: Exceeded step memory limit at some point.
Feb  4 22:56:34 vmp505 slurmstepd[28117]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Feb  4 22:56:34 vmp505 slurmstepd[28117]: done with job
Feb  4 23:30:57 vmp505 slurmd[2417]: _run_prolog: run job script took usec=19
Feb  4 23:30:57 vmp505 slurmd[2417]: _run_prolog: prolog with lock for job 7045007 ran for 0 seconds
Feb  4 23:30:57 vmp505 slurmd[2417]: Launching batch job 7045007 for UID 156369
Feb  4 23:30:57 vmp505 slurmstepd[25136]: task/cgroup: /slurm/uid_156369/job_7045007: alloc=5120MB mem.limit=5120MB memsw.limit=5632MB
Feb  4 23:30:57 vmp505 slurmstepd[25136]: task/cgroup: /slurm/uid_156369/job_7045007/step_batch: alloc=5120MB mem.limit=5120MB memsw.limit=5632MB
Feb  5 04:00:10 vmp505 slurmstepd[24016]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Feb  5 04:00:14 vmp505 slurmstepd[24016]: done with job
Feb  5 04:00:14 vmp505 slurmd[2417]: error: gathering job accounting: 0
Feb  5 04:00:14 vmp505 slurmd[2417]: error: gathering job accounting: 0
Feb  5 04:00:14 vmp505 slurmd[2417]: error: gathering job accounting: 0
Feb  5 04:00:14 vmp505 slurmd[2417]: error: gathering job accounting: 0
Feb  5 04:00:14 vmp505 slurmd[2417]: error: gathering job accounting: 0
Feb  5 04:00:14 vmp505 slurmd[2417]: error: gathering job accounting: 0
.
.
.
Feb  5 04:14:17 vmp505 slurmd[2417]: _run_prolog: run job script took usec=505
Feb  5 04:14:17 vmp505 slurmd[2417]: _run_prolog: prolog with lock for job 7045555 ran for 0 seconds
Feb  5 04:14:17 vmp505 slurmd[2417]: Launching batch job 7045555 for UID 108679
Feb  5 04:14:17 vmp505 slurmstepd[18447]: task/cgroup: /slurm/uid_108679/job_7045555: alloc=5120MB mem.limit=5120MB memsw.limit=5632MB
Feb  5 04:14:17 vmp505 slurmstepd[18447]: task/cgroup: /slurm/uid_108679/job_7045555/step_batch: alloc=5120MB mem.limit=5120MB memsw.limit=5632MB
Feb  5 04:14:17 vmp505 slurmd[2417]: launch task 7045555.0 request from 108679.20450@10.0.22.5 (port 22426)
Feb  5 04:14:17 vmp505 slurmstepd[18486]: task/cgroup: /slurm/uid_108679/job_7045555: alloc=5120MB mem.limit=5120MB memsw.limit=5632MB
Feb  5 04:14:17 vmp505 slurmstepd[18486]: task/cgroup: /slurm/uid_108679/job_7045555/step_0: alloc=5120MB mem.limit=5120MB memsw.limit=5632MB
Feb  5 04:14:17 vmp505 slurmstepd[18486]: done with job
Feb  5 04:14:18 vmp505 slurmd[2417]: launch task 7045555.1 request from 108679.20450@10.0.22.5 (port 24474)
Feb  5 04:14:18 vmp505 slurmstepd[18516]: task/cgroup: /slurm/uid_108679/job_7045555: alloc=5120MB mem.limit=5120MB memsw.limit=5632MB
Feb  5 04:14:18 vmp505 slurmstepd[18516]: task/cgroup: /slurm/uid_108679/job_7045555/step_1: alloc=5120MB mem.limit=5120MB memsw.limit=5632MB
Feb  5 11:00:21 vmp505 slurmd[2417]: _run_prolog: run job script took usec=26
Feb  5 11:00:21 vmp505 slurmd[2417]: _run_prolog: prolog with lock for job 7047602 ran for 0 seconds
Feb  5 11:00:21 vmp505 slurmd[2417]: Launching batch job 7047602 for UID 79124
Feb  5 11:00:21 vmp505 slurmstepd[22861]: task/cgroup: /slurm/uid_79124/job_7047602: alloc=1024MB mem.limit=1024MB memsw.limit=1126MB
Feb  5 11:00:21 vmp505 slurmstepd[22861]: task/cgroup: /slurm/uid_79124/job_7047602/step_batch: alloc=1024MB mem.limit=1024MB memsw.limit=1126MB
Feb  5 11:00:21 vmp505 slurmstepd[22861]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Feb  5 11:00:22 vmp505 slurmstepd[22861]: done with job
Feb  5 11:10:25 vmp505 slurmd[2417]: launch task 7047718.0 request from 103978.58698@10.0.20.201 (port 5782)
Feb  5 11:10:25 vmp505 slurmd[2417]: _run_prolog: run job script took usec=21
Feb  5 11:10:25 vmp505 slurmd[2417]: _run_prolog: prolog with lock for job 7047718 ran for 0 seconds
Feb  5 11:10:25 vmp505 slurmstepd[23510]: task/cgroup: /slurm/uid_103978/job_7047718: alloc=1000MB mem.limit=1000MB memsw.limit=1100MB
Feb  5 11:10:25 vmp505 slurmstepd[23510]: task/cgroup: /slurm/uid_103978/job_7047718/step_0: alloc=1000MB mem.limit=1000MB memsw.limit=1100MB
Feb  5 17:44:57 vmp505 slurmd[2417]: error: Timeout waiting for completion of 167 threads
Feb  5 17:44:57 vmp505 slurmd[2417]: Node configuration differs from hardware: CPUs=8:16(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw) CoresPerSocket=4:4(hw) ThreadsPerCore=2:2(hw)
Feb  5 17:44:57 vmp505 slurmd[2417]: Message aggregation disabled
Feb  5 17:44:57 vmp505 slurmd[2417]: CPU frequency setting not configured for this node
Feb  5 17:44:57 vmp505 slurmd[2417]: Resource spec: Reserved system memory limit not configured for this node
Feb  6 00:49:02 vmp505 slurmd[2417]: active_threads == MAX_THREADS(256)
Feb  6 00:55:42 vmp505 slurmstepd[18447]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256
Feb  6 00:55:43 vmp505 slurmstepd[23510]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Feb  6 00:55:43 vmp505 slurmstepd[23510]: done with job
Feb  6 00:55:45 vmp505 slurmstepd[18447]: done with job
Feb  6 00:55:45 vmp505 slurmd[2417]: error: gathering job accounting: 0
Feb  6 00:55:45 vmp505 slurmd[2417]: error: gathering job accounting: 0
Feb  6 00:55:45 vmp505 slurmd[2417]: error: gathering job accounting: 0
Feb  6 00:55:45 vmp505 slurmd[2417]: error: gathering job accounting: 0
.
.
.

--------------

Note that I've trimmed the lines that read "error: gathering job accounting: 0"; there are many more such lines in our actual logs. The nodes this occurs on seem somewhat random (different racks, different server types, different processor types, etc.). There have been a few repeat offenders, but for the most part it's different nodes each day. The "active_threads == MAX_THREADS(256)" message jumps out at me. Messages like "Feb  5 17:44:57 vmp564 slurmd[1315]: error: Timeout waiting for completion of 172 threads" are also troubling.

We're bumping our SlurmdDebug back down to 3 for now. I'll go ahead and attach our current slurm.conf and slurmdbd.conf in case those are needed.
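
For reference, the knob in question in slurm.conf (a sketch, assuming slurmd re-reads it on an "scontrol reconfigure"):

SlurmdDebug=3   # temporarily raised to 7 to capture the logs above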
Comment 1 Will French 2016-02-06 00:44:43 MST
Created attachment 2692 [details]
slurm.conf
Comment 2 Will French 2016-02-06 00:45:12 MST
Created attachment 2693 [details]
slurmdbd.conf
Comment 3 Alejandro Sanchez 2016-02-07 20:22:19 MST
Hi Will,

could you please attach the slurmdbd.log file?
Comment 6 Alejandro Sanchez 2016-02-08 00:29:06 MST
Will, it would be nice if you could also attach gdb to vmp560's slurmd and copy the output of:

(gdb) set pagination off
(gdb) thread apply all bt full
Comment 7 Will French 2016-02-08 02:27:31 MST
Created attachment 2696 [details]
slurmdbd.log

This goes back to mid-January. We performed the upgrade to 15.08.7 on the morning of January 27.
Comment 8 Will French 2016-02-08 04:29:43 MST
(In reply to Alejandro Sanchez from comment #6)
> Will, it would be nice if you could also attach gdb to the vmp560's slurmd
> and copy the output of:
> 
> (gdb) set pagination off
> (gdb) thread apply all bt full

Is this safe to do while there are jobs running on the node? If not, I'll need to drain the node first, which will probably take a few days. 

Would you recommend something like:

gdb -p PID

to attach to the currently running slurmd on vmp560?
Comment 11 Alejandro Sanchez 2016-02-08 04:50:39 MST
This should be safe, but slurmd will be unresponsive while under gdb, so get in and get out quickly and it should be fine.

Anyhow, weren't the nodes DOWN and locked?
Comment 12 Will French 2016-02-08 05:04:50 MST
> Anyhow, weren't the nodes DOWN and locked?

Since there hasn't been much pattern to nodes locking up and going down, we've been returning affected nodes to service.
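
The "return to service" step is just the usual resume (a sketch, node name illustrative):

scontrol update NodeName=vmp560 State=RESUME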

Debugger output shown below:

[root@vmp560 ~]# gdb -p 22792
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 22792
Reading symbols from /gpfs22/scheduler/centos6/slurm-15.08.7/sbin/slurmd...done.
Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /lib64/libpci.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpci.so.3
Reading symbols from /usr/lib64/libxml2.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libxml2.so.2
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/scheduler/slurm/lib/slurm/select_cons_res.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/select_cons_res.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/topology_none.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/topology_none.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/route_default.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/route_default.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/proctrack_cgroup.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/proctrack_cgroup.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/task_cgroup.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/task_cgroup.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/auth_munge.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/scheduler/slurm/lib/slurm/crypto_munge.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/crypto_munge.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/jobacct_gather_none.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/jobacct_gather_none.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/job_container_none.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/job_container_none.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/core_spec_none.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/core_spec_none.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/switch_none.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/switch_none.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/acct_gather_energy_none.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/acct_gather_energy_none.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/acct_gather_profile_none.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/acct_gather_infiniband_none.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/acct_gather_infiniband_none.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/acct_gather_filesystem_none.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/acct_gather_filesystem_none.so
0x00007f3a04ebeadd in accept () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6_7.1.x86_64 numactl-2.0.9-2.el6.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) set pagination off
(gdb) thread apply all bt full

Thread 1 (Thread 0x7f3a056fd700 (LWP 22792)):
#0  0x00007f3a04ebeadd in accept () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00000000004e3a89 in slurm_accept_msg_conn (fd=4, addr=0x1f61580) at ../../../src/common/slurm_protocol_socket_implementation.c:477
        len = 16
#2  0x0000000000429559 in _msg_engine () at ../../../../src/slurmd/slurmd/slurmd.c:458
        cli = 0x1f61580
        sock = 7
#3  0x00000000004292af in main (argc=1, argv=0x7ffea0fc34b8) at ../../../../src/slurmd/slurmd/slurmd.c:370
        i = 4096
        pidfd = 5
        blocked_signals = {13, 0}
        cc = 4096
        oom_value = 0x0
        slurmd_uid = 0
        curr_uid = 0
        time_stamp = "Sat, 06 Feb 2016 08:38:58 -0600\000\210\241q\005:\177\000\000\240\361o\005:\177\000\000.N=\366\000\000\000\000\n/P\005:\177\000\000\000\000\000\000\000\000\000\000\240\361o\005:\177\000\000\001", '\000' <repeats 15 times>, "\001\000\000\000:\177\000\000\210\241q\005:\177\000\000ΘΆ ", '\000' <repeats 13 times>"\270, v\211\004:\177\000\000\240\232\363\001", '\000' <repeats 12 times>"\340, \244q\005:\177\000\000\240\063\374\240\376\177\000\000\000\000\000\000\000\000\000\000\270\063\374\240\376\177\000\000\020\336h\004\001\000\000\000\001\000\000\000\000\000\000\000\364jA", '\000' <repeats 13 times>, "\017\000\000\000\000\000\000\000`\230\363\001\000\000\000\000\365\330P\005:\177\000\000\001\000\000\000\000\000\000\000"...
        lopts = {stderr_level = LOG_LEVEL_INFO, syslog_level = LOG_LEVEL_INFO, logfile_level = LOG_LEVEL_INFO, prefix_level = 1, buffered = 0}
(gdb) quit
A debugging session is active.

	Inferior 1 [process 22792] will be detached.

Quit anyway? (y or n) y
Detaching from program: /gpfs22/scheduler/centos6/slurm-15.08.7/sbin/slurmd, process 22792
Comment 13 Tim Wickberg 2016-02-08 05:31:58 MST
Was that node stuck at the point you ran gdb, or was that working normally then?

Judging by the logs you've sent, we'd expect to see 256+ threads active on a stuck node. With only one thread, it looks like the node is probably healthy right now?

If it was stuck at that point, can you print this:

(gdb) print active_threads, active_mutex, active_cond

The next stuck node you have, I'd like you to save a core dump from it so you can get the node back to service and pull values for us at a later time. The gdb command `generate-core-file` (alias `gcore`) should work once attached.
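
A sketch of that workflow (core file path illustrative):

gdb -p <slurmd_pid>
(gdb) set pagination off
(gdb) generate-core-file /tmp/slurmd-vmp560.core
(gdb) detach
(gdb) quit

The saved core can be examined later against the matching binary, e.g. gdb /gpfs22/scheduler/centos6/slurm-15.08.7/sbin/slurmd /tmp/slurmd-vmp560.core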

I also noticed your cpu count for some of these nodes is half that of the machine itself. If you want to ignore hyperthreads on the node, I'd suggest setting ThreadsPerCore=1 instead. There's also a CR_ONE_THREAD_PER_CORE flag for SchedParams that may be of use, although there's a bug in the current release and you'll want to wait for 15.08.8 before looking into using that.

(In reply to Will French from comment #12)
> > Anyhow, weren't the nodes DOWN and locked?
> 
> Since there hasn't been much pattern to nodes locking up and going down,
> we've been returning affected nodes to service.
> 
> Debugger output shown below:
>
Comment 14 Will French 2016-02-08 06:28:29 MST
(In reply to Tim Wickberg from comment #13)
> Was that node stuck at the point you ran gdb, or was that working normally
> then?

The node was healthy at this point.


> 
> If it was stuck at that point, can you print this:
> 
> (gdb) print active_threads, active_mutex, active_cond
> 
> The next stuck node you have, I'd like you to save a core dump from it so
> you can get the node back to service, and pull values for us at a later
> time. The gdb command `generate-core-file` (alias `gcore`) should work once attached.


Will do. I should be able to send this tomorrow morning when we have a few nodes that are locked up.


> I also noticed your cpu count for some of these nodes is half that of the
> machine itself. If you want to ignore hyperthreads on the node, I'd suggest
> setting ThreadsPerCore=1 instead. There's also a CR_ONE_THREAD_PER_CORE flag
> for SchedParams that may be of use, although there's a bug in the current
> release and you'll want to wait for 15.08.8 before looking into using that. 

We do not want to disable hyperthreading completely, but we prefer for users to see physical rather than logical cores. We're following the advice that we were given in an earlier thread: http://bugs.schedmd.com/show_bug.cgi?id=1328#c16
Comment 15 Tim Wickberg 2016-02-08 07:44:47 MST
> The node was healthy at this point.

That explains it. Generally slurmd shouldn't be holding threads open indefinitely. The single thread is normal operating conditions, but the logs point towards something holding threads active until the max thread count is exceeded, at which point slurmd deadlocks.

Seeing what threads are active will hopefully point us towards the culprit.

Out of curiosity, what OS and kernel is running on the cluster?

> Will do. I should be able to send this tomorrow morning when we have a few
> nodes that are locked up.

Just to verify - we just want the info, please don't attach the core file itself. The core files are only useful if you have the binaries that produced them, so it's usually simplest to have you pull some critical values out of the core file instead.

> > I also noticed your cpu count for some of these nodes is half that of the
> > machine itself. If you want to ignore hyperthreads on the node, I'd suggest
> > setting ThreadsPerCore=1 instead. There's also a CR_ONE_THREAD_PER_CORE flag
> > for SchedParams that may be of use, although there's a bug in the current
> > release and you'll want to wait for 15.08.8 before looking into using that. 
> 
> We do not want to disable hyperthreading completely, but we prefer for users
> to see physical rather than logical cores. We're following the advice that
> we were given in an earlier thread:
> http://bugs.schedmd.com/show_bug.cgi?id=1328#c16

This is something we've gone back and forth on - hyperthreading is... problematic.

Lowering the ThreadsPerCore count is our preferred approach if CR_ONE_TASK_PER_CORE doesn't work for you. (I mixed this up on my last response, CR_ONE_TASK_PER_CORE is the correct name.) 

Task layout on the individual node should be better under either of those than under-specifying the CPU count, although the performance difference is likely negligible in your case.
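
To make that concrete, a sketch of the two node-definition styles being discussed (node name illustrative; counts taken from the "Node configuration differs from hardware" log lines above):

NodeName=vmp564 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=2  # current: CPUs under-specified relative to the 16 hardware threads
NodeName=vmp564 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1         # suggested: same 8 usable CPUs, hyperthreads ignored explicitly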
Comment 16 Will French 2016-02-08 10:45:28 MST
> Out of curiosity, what OS and kernel is running on the cluster?

18:42:29-frenchwr@vmp560:~$ lsb_release -a
LSB Version:	:base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID:	CentOS
Description:	CentOS release 6.7 (Final)
Release:	6.7
Codename:	Final
18:42:36-frenchwr@vmp560:~$ uname -a
Linux vmp560 2.6.32-573.12.1.el6.x86_64 #1 SMP Tue Dec 15 21:19:08 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux


> Just to verify - we just want the info, please don't attach the core file
> itself. The core files are only useful if you have the binaries that
> produced them, so it's usually simplest to have you pull some critical
> values out of the core file instead.

Understood.

> This is something we've gone back and forth on - hyperthreading is...
> problematic.
> 
> Lowering the ThreadsPerCore count is our preferred approach if
> CR_ONE_TASK_PER_CORE doesn't work for you. (I mixed this up on my last
> response, CR_ONE_TASK_PER_CORE is the correct name.) 
> 
> Task layout on the individual node should be better under either of those
> than under-specifying the CPU count, although the performance difference is
> likely negligible in your case.

Okay, that's good to know. In what scenario(s) should we expect our current configuration to negatively impact performance?
Comment 17 Will French 2016-02-09 01:52:28 MST
We have another hung node (vmp226), below is the relevant information. I've generated a core dump file so let me know if you'd like me to do some additional analysis on that file. I'm also attaching the full output from the gdb session since it's pretty lengthy. 

[root@vmp226 ~]# scontrol show node vmp226
NodeName=vmp226 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=4.37 Features=eight
   Gres=(null)
   NodeAddr=vmp226 NodeHostName=vmp226 Version=15.08
   OS=Linux RealMemory=123000 AllocMem=0 FreeMem=51950 Sockets=2 Boards=1
   State=DOWN* ThreadsPerCore=2 TmpDisk=0 Weight=51 Owner=N/A
   BootTime=2016-01-06T08:43:06 SlurmdStartTime=2016-01-27T07:13:42
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2016-02-09T06:47:41]

[root@vmp226 ~]# ps -ef | grep slurmd
root      9256     1  0 Jan27 ?        00:00:20 /usr/scheduler/slurm/sbin/slurmd
root     25536 24514  0 09:49 pts/3    00:00:00 grep slurmd

[root@vmp226 ~]# scontrol show config | grep SlurmdDebug
SlurmdDebug             = info

[root@vmp226 ~]# tail -1000 /var/log/messages | grep slurm
Feb  8 20:25:35 vmp226 slurmd[9256]: _run_prolog: run job script took usec=764
Feb  8 20:25:35 vmp226 slurmd[9256]: _run_prolog: prolog with lock for job 7084354 ran for 0 seconds
Feb  8 20:25:35 vmp226 slurmd[9256]: Launching batch job 7084354 for UID 114701
Feb  8 20:25:36 vmp226 slurmstepd[12518]: task/cgroup: /slurm/uid_114701/job_7084354: alloc=8000MB mem.limit=8000MB memsw.limit=8800MB
Feb  8 20:25:36 vmp226 slurmstepd[12518]: task/cgroup: /slurm/uid_114701/job_7084354/step_batch: alloc=8000MB mem.limit=8000MB memsw.limit=8800MB
Feb  8 21:02:23 vmp226 slurmstepd[6211]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Feb  8 21:02:23 vmp226 slurmstepd[6211]: done with job
Feb  8 21:03:47 vmp226 slurmd[9256]: _run_prolog: run job script took usec=21
Feb  8 21:03:47 vmp226 slurmd[9256]: _run_prolog: prolog with lock for job 7085185 ran for 0 seconds
Feb  8 21:03:47 vmp226 slurmd[9256]: Launching batch job 7085185 for UID 5007
Feb  8 21:03:47 vmp226 slurmstepd[16179]: task/cgroup: /slurm/uid_5007/job_7085185: alloc=9216MB mem.limit=9216MB memsw.limit=10137MB
Feb  8 21:03:47 vmp226 slurmstepd[16179]: task/cgroup: /slurm/uid_5007/job_7085185/step_batch: alloc=9216MB mem.limit=9216MB memsw.limit=10137MB
Feb  9 05:37:40 vmp226 slurmstepd[3226]: error: Exceeded step memory limit at some point.
Feb  9 05:37:41 vmp226 slurmstepd[3226]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Feb  9 05:37:41 vmp226 slurmstepd[3226]: done with job
Feb  9 05:44:12 vmp226 slurmd[9256]: _run_prolog: run job script took usec=30
Feb  9 05:44:12 vmp226 slurmd[9256]: _run_prolog: prolog with lock for job 7086991 ran for 0 seconds
Feb  9 05:44:12 vmp226 slurmd[9256]: Launching batch job 7086991 for UID 156369
Feb  9 05:44:12 vmp226 slurmstepd[3882]: task/cgroup: /slurm/uid_156369/job_7086991: alloc=10240MB mem.limit=10240MB memsw.limit=11264MB
Feb  9 05:44:12 vmp226 slurmstepd[3882]: task/cgroup: /slurm/uid_156369/job_7086991/step_batch: alloc=10240MB mem.limit=10240MB memsw.limit=11264MB
Feb  9 06:06:21 vmp226 slurmd[9256]: _run_prolog: run job script took usec=397
Feb  9 06:06:21 vmp226 slurmd[9256]: _run_prolog: prolog with lock for job 7087054 ran for 0 seconds
Feb  9 06:06:21 vmp226 slurmd[9256]: Launching batch job 7087054 for UID 59223
Feb  9 06:06:21 vmp226 slurmstepd[5693]: task/cgroup: /slurm/uid_59223/job_7087054: alloc=8192MB mem.limit=8192MB memsw.limit=9011MB
Feb  9 06:06:21 vmp226 slurmstepd[5693]: task/cgroup: /slurm/uid_59223/job_7087054/step_batch: alloc=8192MB mem.limit=8192MB memsw.limit=9011MB
Feb  9 06:38:16 vmp226 slurmd[9256]: active_threads == MAX_THREADS(256)



top - 09:48:24 up 34 days,  1:05,  1 user,  load average: 4.60, 4.30, 4.23
Tasks: 389 total,   5 running, 384 sleeping,   0 stopped,   0 zombie
Cpu(s): 30.9%us,  0.6%sy,  0.0%ni, 68.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132145380k total, 79472444k used, 52672936k free,   285412k buffers
Swap:  4194300k total,    12112k used,  4182188k free, 66030228k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                        
 3887 sivleyrm  20   0  446m  56m  12m R 100.0  0.0 243:32.85 python2.7                                                                                                                      
12527 trippej1  20   0 1094m 342m  10m R 100.0  0.3 802:32.55 python                                                                                                                         
16219 masispid  20   0 3785m 749m 9.9m S 100.2  0.6 689:06.94 MATLAB                                                                                                                         
25156 root      20   0 90064  27m 3908 R 100.0  0.0   0:42.89 cf-agent                                                                                                                       
25140 vuiiscci  20   0 45788  21m 8788 R 99.9  0.0   2:53.38 ANTS   
.
.
.


(gdb) print active_threads, active_mutex, active_cond
$1 = {__data = {__lock = 0, __futex = 15, __total_seq = 8, __wakeup_seq = 7, __woken_seq = 7, __mutex = 0x7e7aa0, __nwaiters = 2, __broadcast_seq = 0}, __size = "\000\000\000\000\017\000\000\000\b\000\000\000\000\000\000\000\a\000\000\000\000\000\000\000\a\000\000\000\000\000\000\000\240z~\000\000\000\000\000\002\000\000\000\000\000\000", __align = 64424509440}
Comment 18 Will French 2016-02-09 01:53:52 MST
Created attachment 2700 [details]
Output from gdb session attached to hung slurmd process
Comment 19 Alejandro Sanchez 2016-02-09 02:17:52 MST
Will,

could you please print again active_threads, active_mutex, active_cond, but in separate prints, like this?

(gdb) print active_threads
(gdb) print active_mutex
(gdb) print active_cond

Thanks.
Comment 20 Will French 2016-02-09 02:34:19 MST
(In reply to Alejandro Sanchez from comment #19)
> Will,
> 
> could you please print again active_threads, active_mutex, active_cond, but
> in separate prints, like this?
> 
> (gdb) print active_threads
> (gdb) print active_mutex
> (gdb) print active_cond
> 
> Thanks.

(gdb) set pagination off
(gdb) print active_threads
$1 = 256
(gdb) print active_mutex
$2 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0}
(gdb) print active_cond
$3 = {__data = {__lock = 0, __futex = 15, __total_seq = 8, __wakeup_seq = 7, __woken_seq = 7, __mutex = 0x7e7aa0, __nwaiters = 2, __broadcast_seq = 0}, __size = "\000\000\000\000\017\000\000\000\b\000\000\000\000\000\000\000\a\000\000\000\000\000\000\000\a\000\000\000\000\000\000\000\240z~\000\000\000\000\000\002\000\000\000\000\000\000", __align = 64424509440}
(gdb) quit
A debugging session is active.

	Inferior 1 [process 9256] will be detached.

Quit anyway? (y or n) y
Detaching from program: /gpfs22/scheduler/centos6/slurm-15.08.7/sbin/slurmd, process 9256
Comment 22 Tim Wickberg 2016-02-09 05:15:09 MST
Thanks for all the info you've provided; I believe we've finally been able to identify the issue.

Jobs that exceed their memory limits currently cause slurmd to make a last-ditch attempt to read in accounting data before terminating the job; that attempt gets stuck blocking on a read() call that will never complete, since you aren't using jobacct.

The offending job continues to run, so slurmd spawns another thread to kill it, which gets stuck in the same place. This continues until the thread limit is exceeded. Once exceeded, slurmd can no longer create new threads to handle any RPC messages and is completely unresponsive. The normal shutdown logic within slurmd waits until all threads complete before exiting - this will never happen, as those accounting threads are still blocked on I/O - which is why you need to `kill -9` slurmd to get it to stop.

As this only happens when a job exceeds its memory limit, this explains the intermittent behavior you're seeing, and why you're only losing a few nodes a day. I should have a patch for you for slurmd later this afternoon; I'm verifying the behavior in a few additional locations before committing it. A minimal sketch of the failure mode is below.
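
A standalone sketch of that failure mode (illustrative names only, not the actual slurmd source; compile with gcc -pthread):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_THREADS 256

static int active_threads;
static pthread_mutex_t active_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t active_cond = PTHREAD_COND_INITIALIZER;

/* stands in for the per-job thread that tries to gather accounting
 * data before killing the job */
static void *kill_job_thread(void *arg)
{
	int fd = *(int *) arg;
	char buf[64];

	/* with no accounting plugin writing to this fd, the read() blocks
	 * forever and the thread never gives its slot back */
	read(fd, buf, sizeof(buf));

	pthread_mutex_lock(&active_mutex);
	active_threads--;	/* never reached */
	pthread_cond_signal(&active_cond);
	pthread_mutex_unlock(&active_mutex);
	return NULL;
}

int main(void)
{
	int fds[2];
	pthread_t tid;

	pipe(fds);	/* the read end never receives any data */
	for (;;) {
		pthread_mutex_lock(&active_mutex);
		while (active_threads >= MAX_THREADS) {
			/* once 256 threads are stuck in read(), every new
			 * request parks here forever -- the logged
			 * "active_threads == MAX_THREADS(256)" state */
			fprintf(stderr, "active_threads == MAX_THREADS(%d)\n",
				MAX_THREADS);
			pthread_cond_wait(&active_cond, &active_mutex);
		}
		active_threads++;
		pthread_mutex_unlock(&active_mutex);
		pthread_create(&tid, NULL, kill_job_thread, &fds[0]);
	}
}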

- Tim
Comment 24 Will French 2016-02-09 05:24:48 MST
> Jobs that exceed their memory limits were currently causing slurmd to make a
> last-ditch attempt to read in accounting data before terminating the job,
> and getting stuck blocking on a read() call that would never complete since
> you aren't using jobacct.

That makes sense! And now that you mention it I do recall often seeing "exceeded memory" errors in the logs immediately prior to the thread count explosion. I just never put two and two together. 


> The offending job continues to run, then slurmd will spawn another thread to
> kill it which gets stuck in the same place. This continues until the thread
> limit is exceeded. Once exceeded, slurmd can no longer create new threads to
> handle any RPC messages, and is completely unresponsive. The normal shutdown
> logic within slurmd waits until all threads complete before exiting - this
> will never happen as those accounting threads are still blocked on I/O - and
> is why you need to `kill -9` slurmd to get it to stop.
> 
> As this only happens when a job exceeds the memory limit - this explains the
> intermittent behavior you're seeing, and why you're only loosing a few nodes
> a day. I should have a patch for you for slurmd later this afternoon, I'm
> verifying the behavior in a few additional locations before committing the
> patch.


Perfect, thank you. Will this be included in the next micro release (15.08.8) and, if so, when is it scheduled for release?
Comment 25 Tim Wickberg 2016-02-09 08:25:29 MST
(In reply to Will French from comment #24)
> Perfect, thank you. Will this be included in next micro release (15.08.8)
> and, if so, when is that scheduled for release?

Yes, definitely. We expect to release 15.08.8 within the next couple of weeks, although we don't have a date yet.

The patch is here if you'd like to apply it ahead of the release:
https://github.com/SchedMD/slurm/commit/aaf8bcf6.patch
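
A sketch of applying it by hand against the 15.08.7 source tree (paths illustrative; rebuild and redeploy slurmd afterwards):

cd slurm-15.08.7
curl -LO https://github.com/SchedMD/slurm/commit/aaf8bcf6.patch
patch -p1 < aaf8bcf6.patch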

If there are any further problems please re-open this bug.