| Summary: | slurmd hanging/locking up | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Will French <will> |
| Component: | slurmd | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | alex, charles.johnson, davide.vanzo, sean, tim |
| Version: | 15.08.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Vanderbilt | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 15.08.8, 16.05.0-pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, slurmdbd.conf, slurmdbd.log, Output from gdb session attached to hung slurmd process | | |
Description
Will French
2016-02-06 00:42:34 MST
Created attachment 2692 [details]
slurm.conf
Created attachment 2693 [details]
slurmdbd.conf
Hi Will, could you please attach the slurmdbd.log file?

Will, it would be nice if you could also attach gdb to the vmp560's slurmd and copy the output of:

(gdb) set pagination off
(gdb) thread apply all bt full

Created attachment 2696 [details]
slurmdbd.log
This goes back to mid-January. We performed the upgrade to 15.08.7 on the morning of January 27.
(In reply to Alejandro Sanchez from comment #6)
> Will, it would be nice if you could also attach gdb to the vmp560's slurmd
> and copy the output of:
>
> (gdb) set pagination off
> (gdb) thread apply all bt full

Is this safe to do while there are jobs running on the node? If not, I'll need to drain the node first, which will probably take a few days. Would you recommend something like `gdb -p PID` to attach to the currently running slurmd on vmp560?

This should be safe, but the slurmd would be unresponsive while under gdb so get in and get out and it should be fine. Anyhow, weren't the nodes DOWN and locked?

> Anyhow, weren't the nodes DOWN and locked?

Since there hasn't been much pattern to nodes locking up and going down, we've been returning affected nodes to service.

Debugger output shown below:

[root@vmp560 ~]# gdb -p 22792
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 22792
Reading symbols from /gpfs22/scheduler/centos6/slurm-15.08.7/sbin/slurmd...done.
Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /lib64/libpci.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpci.so.3
Reading symbols from /usr/lib64/libxml2.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libxml2.so.2
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/scheduler/slurm/lib/slurm/select_cons_res.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/select_cons_res.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/topology_none.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/topology_none.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/route_default.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/route_default.so
Reading symbols from /usr/scheduler/slurm/lib/slurm/proctrack_cgroup.so...done.
Loaded symbols for /usr/scheduler/slurm/lib/slurm/proctrack_cgroup.so Reading symbols from /usr/scheduler/slurm/lib/slurm/task_cgroup.so...done. Loaded symbols for /usr/scheduler/slurm/lib/slurm/task_cgroup.so Reading symbols from /usr/scheduler/slurm/lib/slurm/auth_munge.so...done. Loaded symbols for /usr/scheduler/slurm/lib/slurm/auth_munge.so Reading symbols from /usr/lib64/libmunge.so.2...done. Loaded symbols for /usr/lib64/libmunge.so.2 Reading symbols from /usr/scheduler/slurm/lib/slurm/crypto_munge.so...done. Loaded symbols for /usr/scheduler/slurm/lib/slurm/crypto_munge.so Reading symbols from /usr/scheduler/slurm/lib/slurm/jobacct_gather_none.so...done. Loaded symbols for /usr/scheduler/slurm/lib/slurm/jobacct_gather_none.so Reading symbols from /usr/scheduler/slurm/lib/slurm/job_container_none.so...done. Loaded symbols for /usr/scheduler/slurm/lib/slurm/job_container_none.so Reading symbols from /usr/scheduler/slurm/lib/slurm/core_spec_none.so...done. Loaded symbols for /usr/scheduler/slurm/lib/slurm/core_spec_none.so Reading symbols from /usr/scheduler/slurm/lib/slurm/switch_none.so...done. Loaded symbols for /usr/scheduler/slurm/lib/slurm/switch_none.so Reading symbols from /usr/scheduler/slurm/lib/slurm/acct_gather_energy_none.so...done. Loaded symbols for /usr/scheduler/slurm/lib/slurm/acct_gather_energy_none.so Reading symbols from /usr/scheduler/slurm/lib/slurm/acct_gather_profile_none.so...done. Loaded symbols for /usr/scheduler/slurm/lib/slurm/acct_gather_profile_none.so Reading symbols from /usr/scheduler/slurm/lib/slurm/acct_gather_infiniband_none.so...done. Loaded symbols for /usr/scheduler/slurm/lib/slurm/acct_gather_infiniband_none.so Reading symbols from /usr/scheduler/slurm/lib/slurm/acct_gather_filesystem_none.so...done. Loaded symbols for /usr/scheduler/slurm/lib/slurm/acct_gather_filesystem_none.so 0x00007f3a04ebeadd in accept () from /lib64/libpthread.so.0 Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6_7.1.x86_64 numactl-2.0.9-2.el6.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) set pagination off (gdb) thread apply all bt full Thread 1 (Thread 0x7f3a056fd700 (LWP 22792)): #0 0x00007f3a04ebeadd in accept () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00000000004e3a89 in slurm_accept_msg_conn (fd=4, addr=0x1f61580) at ../../../src/common/slurm_protocol_socket_implementation.c:477 len = 16 #2 0x0000000000429559 in _msg_engine () at ../../../../src/slurmd/slurmd/slurmd.c:458 cli = 0x1f61580 sock = 7 #3 0x00000000004292af in main (argc=1, argv=0x7ffea0fc34b8) at ../../../../src/slurmd/slurmd/slurmd.c:370 i = 4096 pidfd = 5 blocked_signals = {13, 0} cc = 4096 oom_value = 0x0 slurmd_uid = 0 curr_uid = 0 time_stamp = "Sat, 06 Feb 2016 08:38:58 -0600\000\210\241q\005:\177\000\000\240\361o\005:\177\000\000.N=\366\000\000\000\000\n/P\005:\177\000\000\000\000\000\000\000\000\000\000\240\361o\005:\177\000\000\001", '\000' <repeats 15 times>, "\001\000\000\000:\177\000\000\210\241q\005:\177\000\000ΘΆ ", '\000' <repeats 13 times>"\270, v\211\004:\177\000\000\240\232\363\001", '\000' <repeats 12 times>"\340, \244q\005:\177\000\000\240\063\374\240\376\177\000\000\000\000\000\000\000\000\000\000\270\063\374\240\376\177\000\000\020\336h\004\001\000\000\000\001\000\000\000\000\000\000\000\364jA", '\000' <repeats 13 times>, "\017\000\000\000\000\000\000\000`\230\363\001\000\000\000\000\365\330P\005:\177\000\000\001\000\000\000\000\000\000\000"... 
lopts = {stderr_level = LOG_LEVEL_INFO, syslog_level = LOG_LEVEL_INFO, logfile_level = LOG_LEVEL_INFO, prefix_level = 1, buffered = 0}
(gdb) quit
A debugging session is active.
Inferior 1 [process 22792] will be detached.
Quit anyway? (y or n) y
Detaching from program: /gpfs22/scheduler/centos6/slurm-15.08.7/sbin/slurmd, process 22792

Was that node stuck at the point you ran gdb, or was that working normally then? Judging by the logs you've sent, we're expecting to see 256+ threads active. With only one thread, it looks like the node is probably healthy right now?

If it was stuck at that point, can you print this:

(gdb) print active_threads, active_mutex, active_cond

The next stuck node you have, I'd like you to save a core dump from it so you can get the node back to service, and pull values for us at a later time. The gdb command `generate-core-file` should work once attached.

I also noticed your cpu count for some of these nodes is half that of the machine itself. If you want to ignore hyperthreads on the node, I'd suggest setting ThreadsPerCore=1 instead. There's also a CR_ONE_THREAD_PER_CORE flag for SchedParams that may be of use, although there's a bug in the current release and you'll want to wait for 15.08.8 before looking into using that.

(In reply to Will French from comment #12)
> > Anyhow, weren't the nodes DOWN and locked?
>
> Since there hasn't been much pattern to nodes locking up and going down,
> we've been returning affected nodes to service.
>
> Debugger output shown below:

(In reply to Tim Wickberg from comment #13)
> Was that node stuck at the point you ran gdb, or was that working normally
> then?

The node was healthy at this point.

> If it was stuck at that point, can you print this:
>
> (gdb) print active_threads, active_mutex, active_cond
>
> The next stuck node you have, I'd like you to save a core dump from it so
> you can get the node back to service, and pull values for us at a later
> time. The gdb command `generate-core-file` should work once attached.

Will do. I should be able to send this tomorrow morning when we have a few nodes that are locked up.

> I also noticed your cpu count for some of these nodes is half that of the
> machine itself. If you want to ignore hyperthreads on the node, I'd suggest
> setting ThreadsPerCore=1 instead. There's also a CR_ONE_THREAD_PER_CORE flag
> for SchedParams that may be of use, although there's a bug in the current
> release and you'll want to wait for 15.08.8 before looking into using that.

We do not want to disable hyperthreading completely, but we prefer for users to see physical rather than logical cores. We're following the advice that we were given in an earlier thread: http://bugs.schedmd.com/show_bug.cgi?id=1328#c16

> The node was healthy at this point.

That explains it. Generally slurmd shouldn't be holding threads open indefinitely. A single thread is normal operating conditions, but the logs point towards something holding them active until the max thread count is exceeded, and slurmd then dead-locking. Seeing what threads are active will hopefully point us towards the culprit.

Out of curiosity, what OS and kernel is running on the cluster?

> Will do. I should be able to send this tomorrow morning when we have a few
> nodes that are locked up.

Just to verify - we just want the info, please don't attach the core file itself. The core files are only useful if you have the binaries that produced them, so it's usually simplest to have you pull some critical values out of the core file instead.
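For reference, this capture can also be scripted so slurmd is only paused for a moment. A minimal sketch, assuming gdb is installed on the compute node and slurmd is running under its usual name; the output file names here are only examples:

# Sketch only: grab a full per-thread backtrace and a core file from a
# running slurmd without an interactive gdb session. slurmd is unresponsive
# while gdb is attached, so keep this brief on a production node.
SLURMD_PID=$(pidof slurmd)

# Backtrace of every thread, saved to a file for attachment to the bug.
gdb -p "$SLURMD_PID" -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' > "slurmd-${SLURMD_PID}-bt.txt" 2>&1

# Core file for later inspection of active_threads, active_mutex, active_cond;
# gdb's command for this is generate-core-file (abbreviated gcore).
gdb -p "$SLURMD_PID" -batch \
    -ex "generate-core-file /tmp/slurmd-${SLURMD_PID}.core"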
> > I also noticed your cpu count for some of these nodes is half that of the
> > machine itself. If you want to ignore hyperthreads on the node, I'd suggest
> > setting ThreadsPerCore=1 instead. There's also a CR_ONE_THREAD_PER_CORE flag
> > for SchedParams that may be of use, although there's a bug in the current
> > release and you'll want to wait for 15.08.8 before looking into using that.
>
> We do not want to disable hyperthreading completely, but we prefer for users
> to see physical rather than logical cores. We're following the advice that
> we were given in an earlier thread:
> http://bugs.schedmd.com/show_bug.cgi?id=1328#c16

This is something we've gone back and forth on - hyperthreading is... problematic.

Lowering the ThreadsPerCore count is our preferred approach if CR_ONE_TASK_PER_CORE doesn't work for you. (I mixed this up on my last response, CR_ONE_TASK_PER_CORE is the correct name.)

Task layout on the individual node should be better under either of those than under-specifying the CPU count, although the performance difference is likely negligible in your case.

> Out of curiosity, what OS and kernel is running on the cluster?

18:42:29-frenchwr@vmp560:~$ lsb_release -a
LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID: CentOS
Description:    CentOS release 6.7 (Final)
Release:        6.7
Codename:       Final
18:42:36-frenchwr@vmp560:~$ uname -a
Linux vmp560 2.6.32-573.12.1.el6.x86_64 #1 SMP Tue Dec 15 21:19:08 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

> Just to verify - we just want the info, please don't attach the core file
> itself. The core files are only useful if you have the binaries that
> produced them, so it's usually simplest to have you pull some critical
> values out of the core file instead.

Understood.

> This is something we've gone back and forth on - hyperthreading is...
> problematic.
>
> Lowering the ThreadsPerCore count is our preferred approach if
> CR_ONE_TASK_PER_CORE doesn't work for you. (I mixed this up on my last
> response, CR_ONE_TASK_PER_CORE is the correct name.)
>
> Task layout on the individual node should be better under either of those
> than under-specifying the CPU count, although the performance difference is
> likely negligible in your case.

Okay, that's good to know. In what scenario(s) should we expect our current configuration to negatively impact performance?

We have another hung node (vmp226), below is the relevant information. I've generated a core dump file so let me know if you'd like me to do some additional analysis on that file. I'm also attaching the full output from the gdb session since it's pretty lengthy.
[root@vmp226 ~]# scontrol show node vmp226
NodeName=vmp226 Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=4.37 Features=eight
Gres=(null)
NodeAddr=vmp226 NodeHostName=vmp226 Version=15.08
OS=Linux RealMemory=123000 AllocMem=0 FreeMem=51950 Sockets=2 Boards=1
State=DOWN* ThreadsPerCore=2 TmpDisk=0 Weight=51 Owner=N/A
BootTime=2016-01-06T08:43:06 SlurmdStartTime=2016-01-27T07:13:42
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Not responding [slurm@2016-02-09T06:47:41]
[root@vmp226 ~]# ps -ef | grep slurmd
root 9256 1 0 Jan27 ? 00:00:20 /usr/scheduler/slurm/sbin/slurmd
root 25536 24514 0 09:49 pts/3 00:00:00 grep slurmd
[root@vmp226 ~]# scontrol show config | grep SlurmdDebug
SlurmdDebug = info
[root@vmp226 ~]# tail -1000 /var/log/messages | grep slurm
Feb 8 20:25:35 vmp226 slurmd[9256]: _run_prolog: run job script took usec=764
Feb 8 20:25:35 vmp226 slurmd[9256]: _run_prolog: prolog with lock for job 7084354 ran for 0 seconds
Feb 8 20:25:35 vmp226 slurmd[9256]: Launching batch job 7084354 for UID 114701
Feb 8 20:25:36 vmp226 slurmstepd[12518]: task/cgroup: /slurm/uid_114701/job_7084354: alloc=8000MB mem.limit=8000MB memsw.limit=8800MB
Feb 8 20:25:36 vmp226 slurmstepd[12518]: task/cgroup: /slurm/uid_114701/job_7084354/step_batch: alloc=8000MB mem.limit=8000MB memsw.limit=8800MB
Feb 8 21:02:23 vmp226 slurmstepd[6211]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Feb 8 21:02:23 vmp226 slurmstepd[6211]: done with job
Feb 8 21:03:47 vmp226 slurmd[9256]: _run_prolog: run job script took usec=21
Feb 8 21:03:47 vmp226 slurmd[9256]: _run_prolog: prolog with lock for job 7085185 ran for 0 seconds
Feb 8 21:03:47 vmp226 slurmd[9256]: Launching batch job 7085185 for UID 5007
Feb 8 21:03:47 vmp226 slurmstepd[16179]: task/cgroup: /slurm/uid_5007/job_7085185: alloc=9216MB mem.limit=9216MB memsw.limit=10137MB
Feb 8 21:03:47 vmp226 slurmstepd[16179]: task/cgroup: /slurm/uid_5007/job_7085185/step_batch: alloc=9216MB mem.limit=9216MB memsw.limit=10137MB
Feb 9 05:37:40 vmp226 slurmstepd[3226]: error: Exceeded step memory limit at some point.
Feb 9 05:37:41 vmp226 slurmstepd[3226]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Feb 9 05:37:41 vmp226 slurmstepd[3226]: done with job
Feb 9 05:44:12 vmp226 slurmd[9256]: _run_prolog: run job script took usec=30
Feb 9 05:44:12 vmp226 slurmd[9256]: _run_prolog: prolog with lock for job 7086991 ran for 0 seconds
Feb 9 05:44:12 vmp226 slurmd[9256]: Launching batch job 7086991 for UID 156369
Feb 9 05:44:12 vmp226 slurmstepd[3882]: task/cgroup: /slurm/uid_156369/job_7086991: alloc=10240MB mem.limit=10240MB memsw.limit=11264MB
Feb 9 05:44:12 vmp226 slurmstepd[3882]: task/cgroup: /slurm/uid_156369/job_7086991/step_batch: alloc=10240MB mem.limit=10240MB memsw.limit=11264MB
Feb 9 06:06:21 vmp226 slurmd[9256]: _run_prolog: run job script took usec=397
Feb 9 06:06:21 vmp226 slurmd[9256]: _run_prolog: prolog with lock for job 7087054 ran for 0 seconds
Feb 9 06:06:21 vmp226 slurmd[9256]: Launching batch job 7087054 for UID 59223
Feb 9 06:06:21 vmp226 slurmstepd[5693]: task/cgroup: /slurm/uid_59223/job_7087054: alloc=8192MB mem.limit=8192MB memsw.limit=9011MB
Feb 9 06:06:21 vmp226 slurmstepd[5693]: task/cgroup: /slurm/uid_59223/job_7087054/step_batch: alloc=8192MB mem.limit=8192MB memsw.limit=9011MB
Feb 9 06:38:16 vmp226 slurmd[9256]: active_threads == MAX_THREADS(256)
top - 09:48:24 up 34 days, 1:05, 1 user, load average: 4.60, 4.30, 4.23
Tasks: 389 total, 5 running, 384 sleeping, 0 stopped, 0 zombie
Cpu(s): 30.9%us, 0.6%sy, 0.0%ni, 68.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132145380k total, 79472444k used, 52672936k free, 285412k buffers
Swap: 4194300k total, 12112k used, 4182188k free, 66030228k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3887 sivleyrm 20 0 446m 56m 12m R 100.0 0.0 243:32.85 python2.7
12527 trippej1 20 0 1094m 342m 10m R 100.0 0.3 802:32.55 python
16219 masispid 20 0 3785m 749m 9.9m S 100.2 0.6 689:06.94 MATLAB
25156 root 20 0 90064 27m 3908 R 100.0 0.0 0:42.89 cf-agent
25140 vuiiscci 20 0 45788 21m 8788 R 99.9 0.0 2:53.38 ANTS
.
.
.
(gdb) print active_threads, active_mutex, active_cond
$1 = {__data = {__lock = 0, __futex = 15, __total_seq = 8, __wakeup_seq = 7, __woken_seq = 7, __mutex = 0x7e7aa0, __nwaiters = 2, __broadcast_seq = 0}, __size = "\000\000\000\000\017\000\000\000\b\000\000\000\000\000\000\000\a\000\000\000\000\000\000\000\a\000\000\000\000\000\000\000\240z~\000\000\000\000\000\002\000\000\000\000\000\000", __align = 64424509440}
Created attachment 2700 [details]
Output from gdb session attached to hung slurmd process
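As an aside, the `active_threads == MAX_THREADS(256)` message in the syslog above turns out to be a reliable fingerprint for this hang. A rough sketch for spotting affected nodes before they are marked DOWN, assuming passwordless ssh to the compute nodes and syslog-based slurmd logging as shown earlier; the node list and log path are illustrative only:

# Sketch: check a set of nodes for the thread-exhaustion fingerprint.
for node in vmp226 vmp560; do
    ssh "$node" "grep -q 'active_threads == MAX_THREADS' /var/log/messages" \
        && echo "$node: slurmd thread pool exhausted"
done

# Nodes already marked DOWN/not responding can be listed from the controller:
sinfo -R | grep -i 'not responding'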
Will, could you please print again active_threads, active_mutex, active_cond, but in separate prints, like this?

(gdb) print active_threads
(gdb) print active_mutex
(gdb) print active_cond

Thanks.

(In reply to Alejandro Sanchez from comment #19)
> Will,
>
> could you please print again active_threads, active_mutex, active_cond, but
> in separate prints, like this?
>
> (gdb) print active_threads
> (gdb) print active_mutex
> (gdb) print active_cond
>
> Thanks.

(gdb) set pagination off
(gdb) print active_threads
$1 = 256
(gdb) print active_mutex
$2 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 12 times>, "\001", '\000' <repeats 26 times>, __align = 0}
(gdb) print active_cond
$3 = {__data = {__lock = 0, __futex = 15, __total_seq = 8, __wakeup_seq = 7, __woken_seq = 7, __mutex = 0x7e7aa0, __nwaiters = 2, __broadcast_seq = 0}, __size = "\000\000\000\000\017\000\000\000\b\000\000\000\000\000\000\000\a\000\000\000\000\000\000\000\a\000\000\000\000\000\000\000\240z~\000\000\000\000\000\002\000\000\000\000\000\000", __align = 64424509440}
(gdb) quit
A debugging session is active.
Inferior 1 [process 9256] will be detached.
Quit anyway? (y or n) y
Detaching from program: /gpfs22/scheduler/centos6/slurm-15.08.7/sbin/slurmd, process 9256

Thanks for all the info you've provided, I believe we've finally been able to identify the issue.

Jobs that exceed their memory limits are currently causing slurmd to make a last-ditch attempt to read in accounting data before terminating the job, and getting stuck blocking on a read() call that would never complete since you aren't using jobacct.

The offending job continues to run, then slurmd will spawn another thread to kill it which gets stuck in the same place. This continues until the thread limit is exceeded. Once exceeded, slurmd can no longer create new threads to handle any RPC messages, and is completely unresponsive. The normal shutdown logic within slurmd waits until all threads complete before exiting - this will never happen as those accounting threads are still blocked on I/O - and is why you need to `kill -9` slurmd to get it to stop.

As this only happens when a job exceeds the memory limit - this explains the intermittent behavior you're seeing, and why you're only losing a few nodes a day. I should have a patch for you for slurmd later this afternoon; I'm verifying the behavior in a few additional locations before committing the patch.

- Tim

> Jobs that exceed their memory limits are currently causing slurmd to make a
> last-ditch attempt to read in accounting data before terminating the job,
> and getting stuck blocking on a read() call that would never complete since
> you aren't using jobacct.

That makes sense! And now that you mention it I do recall often seeing "exceeded memory" errors in the logs immediately prior to the thread count explosion. I just never put two and two together.

> The offending job continues to run, then slurmd will spawn another thread to
> kill it which gets stuck in the same place. This continues until the thread
> limit is exceeded. Once exceeded, slurmd can no longer create new threads to
> handle any RPC messages, and is completely unresponsive. The normal shutdown
> logic within slurmd waits until all threads complete before exiting - this
> will never happen as those accounting threads are still blocked on I/O - and
> is why you need to `kill -9` slurmd to get it to stop.
>
> As this only happens when a job exceeds the memory limit - this explains the
> intermittent behavior you're seeing, and why you're only losing a few nodes
> a day. I should have a patch for you for slurmd later this afternoon; I'm
> verifying the behavior in a few additional locations before committing the
> patch.

Perfect, thank you. Will this be included in the next micro release (15.08.8) and, if so, when is that scheduled for release?

(In reply to Will French from comment #24)
> Perfect, thank you. Will this be included in the next micro release (15.08.8)
> and, if so, when is that scheduled for release?

Yes, definitely. We expect to release 15.08.8 within the next couple of weeks, although we don't have a date yet. The patch is here if you'd like to apply it ahead of the release:

https://github.com/SchedMD/slurm/commit/aaf8bcf6.patch

If there are any further problems please re-open this bug.
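For anyone who wants the fix before 15.08.8 ships, the referenced commit can be applied to a 15.08.7 source tree. A minimal sketch, assuming you build from the release tarball; the install prefix is illustrative and should match the existing installation:

# Sketch: apply the fix ahead of the 15.08.8 release, then rebuild slurmd and
# restart the daemon on the affected nodes.
cd slurm-15.08.7
wget https://github.com/SchedMD/slurm/commit/aaf8bcf6.patch
patch -p1 < aaf8bcf6.patch
./configure --prefix=/usr/scheduler/slurm   # use the same prefix as the current install
make && make install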