Ticket 5111

Summary: Agent Queue Size bursts and no cleanup
Product: Slurm
Reporter: Adam <asa188>
Component: slurmctld
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: siegert
Version: 17.11.3
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5147
Site: Simon Fraser University
Attachments: slurm.conf during time of issue
slurmctld.log
lsof on slurmctld
sdiag output
jobs/partitions/dependencies count
gdb output
cdr683 slurmd log
gdb output of slurmd on cdr1560
gdb output of slurmstepd on cdr1560

Description Adam 2018-04-27 14:52:58 MDT
Created attachment 6708 [details]
slurm.conf during time of issue

We're having issues where we possibly get a burst of jobs submitted and then a burst cancelled, which causes the scheduler's 'Agent Queue Size' to skyrocket to around 180,000.  Previously it would reach 40-80K and then slowly clean up within the hour.  After adding the new nodes (another 35K cores), it now reaches 180K and never fully cleans up unless I restart slurmctld.
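
For tracking this over time, here is a minimal sketch that pulls the agent queue size out of captured `sdiag` text. It assumes the value is reported on a line of the form `Agent queue size: <n>`; the function names and the 80000 threshold (taken from the upper end of the earlier 40-80K range) are illustrative assumptions, not part of Slurm.

```python
import re

# Assumed sdiag line shape: "Agent queue size: <n>" at the start of a line.
# Anchoring at line start avoids matching other lines that merely contain
# the same phrase (for example a "DBD Agent queue size" line, if present).
AGENT_QUEUE = re.compile(r"^[ \t]*Agent queue size:[ \t]*(\d+)", re.MULTILINE)


def agent_queue_size(sdiag_text):
    """Return the agent queue size found in sdiag text, or None if absent."""
    match = AGENT_QUEUE.search(sdiag_text)
    return int(match.group(1)) if match else None


def queue_is_backed_up(sdiag_text, threshold=80000):
    """True when the reported size exceeds the (hypothetical) threshold."""
    size = agent_queue_size(sdiag_text)
    return size is not None and size > threshold
```

Feeding it the text of periodic `sdiag` captures would show whether the queue drains on its own or only after a slurmctld restart.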

I've run `lsof` on slurmctld and checked its socket connections, and I don't see many attempts to talk to nodes, even though the logs show constant attempts to terminate many jobs on many separate nodes. It seems to cascade without ever being successful or even talking to the agents.

Sometimes these are jobs that just got scheduled and are in the Prolog stage, and they haven't even reached the logs on the specific matching nodes yet.  Those nodes have no knowledge of the job that slurmctld.log says it's trying to cancel.

I'm not sure what to target here. It's not due to overtaxed slurmd: the daemons run at NICE 0 and the jobs are set to PropagatePrioProcess=2, so the jobs run at NICE 1.  The ARP settings allow 3084 for gc_thresh1 and such, so it's not an ARP issue.

I'll attach slurm.conf, slurmctld.log, sdiag output, jobs/partitions/dependency counts, and a gdb output of `gdb -batch -ex "thread apply all bt full" -p 15428`

Thanks,

Adam
Comment 1 Adam 2018-04-27 14:53:33 MDT
Created attachment 6709 [details]
slurmctld.log
Comment 2 Adam 2018-04-27 14:54:30 MDT
Created attachment 6710 [details]
lsof on slurmctld
Comment 3 Adam 2018-04-27 14:54:53 MDT
Created attachment 6711 [details]
sdiag output
Comment 4 Adam 2018-04-27 14:55:24 MDT
Created attachment 6712 [details]
jobs/partitions/dependencies count
Comment 5 Adam 2018-04-27 14:56:44 MDT
Created attachment 6713 [details]
gdb output
Comment 6 Dominik Bartkiewicz 2018-05-01 04:16:08 MDT
Hi

In the sdiag output, I noticed that slurmctld has to handle a huge number of RPCs from user tools (~6 RPC/s). Some of them, like REQUEST_JOB_INFO, require locking of the most internal data structures. Some users generated more than 10k RPCs. Is it possible to reduce the number of these requests?
I'm still trying to find the reason why TERMINATE_JOB must be sent so many times.
Could you send me the corresponding slurmd log from cdr683?
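
To pick out the heaviest RPC sources, a minimal sketch like the following could be run over saved sdiag text. It assumes the per-user section is headed `Remote Procedure Call statistics by user` and that each entry looks like `<user> ( <uid>) count:<n> ave_time:<n> total_time:<n>`; the function names and the 10000 cutoff are illustrative.

```python
import re

# Assumed entry shape: "<user> ( <uid>) count:<n> ave_time:<n> total_time:<n>"
USER_LINE = re.compile(
    r"^\s*(\S+)\s+\(\s*(\d+)\)\s+count:(\d+)\s+ave_time:(\d+)\s+total_time:(\d+)"
)
SECTION_HEADER = "Remote Procedure Call statistics by user"


def rpc_counts_by_user(sdiag_text):
    """Map user name -> RPC count from the per-user section of sdiag text."""
    counts = {}
    in_user_section = False
    for line in sdiag_text.splitlines():
        if line.startswith(SECTION_HEADER):
            in_user_section = True
            continue
        if not in_user_section:
            continue
        match = USER_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(3))
        elif line.strip() and not line[0].isspace():
            # A new unindented header ends the per-user section.
            in_user_section = False
    return counts


def heavy_users(counts, threshold=10000):
    """Users above the threshold, sorted by descending RPC count."""
    heavy = [(user, n) for user, n in counts.items() if n > threshold]
    return sorted(heavy, key=lambda item: -item[1])
```

The resulting list gives a starting point for deciding which users or automated tools to ask to reduce their query rate.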

Dominik
Comment 7 Adam 2018-05-01 16:17:01 MDT
Created attachment 6743 [details]
cdr683 slurmd log

Hi Dominik,

I've attached the slurmd logs for cdr683.  The NHC checks appear because I haven't yet upgraded the BIOS version on this host to match what I've stated in NHC, but I don't have NHC set to offline the host automatically, so they're of no current concern.

Certain users, siegert and asasfu, are admin accounts running queries, many of them automated to produce stats. The siegert ones can and likely will be reduced soon, since we've put pam_slurm_adopt in place and won't have to run squeue as often to combat node abuse outside of jobs.

Some users tend to run `watch -n1 squeue -u ...` or just `watch squeue -u ...`, and we've often tried to instruct them that this is excessive and bad for the system.  Snoprod is, I believe, a group account for a meta-scheduler; I may be able to get that reduced a bit if needed.

I don't imagine there are ways to restrict users on the Slurm side, or to cache responses for specific sets of users, if the requests are costly?
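
As an illustration of what caching could look like outside of Slurm, here is a minimal sketch of a site-side wrapper that reuses a recent `squeue -u <user>` result instead of querying slurmctld every time. This is not a Slurm feature: the `CachedSqueue` class, its 60-second TTL, and the injectable `runner` and `clock` parameters are all hypothetical.

```python
import subprocess
import time


class CachedSqueue:
    """Serve repeated per-user squeue queries from a short-lived cache."""

    def __init__(self, ttl_seconds=60.0, runner=None, clock=time.monotonic):
        self.ttl = ttl_seconds
        # runner(user) -> str; defaults to calling the real squeue binary.
        self.runner = runner or self._run_squeue
        self.clock = clock
        self._cache = {}  # user -> (timestamp, output)

    @staticmethod
    def _run_squeue(user):
        result = subprocess.run(
            ["squeue", "-u", user], capture_output=True, text=True, check=True
        )
        return result.stdout

    def query(self, user):
        now = self.clock()
        hit = self._cache.get(user)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # fresh enough: no RPC to slurmctld
        output = self.runner(user)
        self._cache[user] = (now, output)
        return output
```

With something like this in front of squeue, a `watch -n1` loop would reach slurmctld about once per TTL per user rather than once per second.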
Comment 8 Dominik Bartkiewicz 2018-05-02 06:01:01 MDT
Hi

Could you check whether a slurmstepd process exists for a completing job?
If so, could you generate a backtrace?
Which distribution do you use?
And which version of glibc do you currently have?

Dominik
Comment 9 Martin Siegert 2018-05-02 10:53:42 MDT
Yes, here is an example node with a job in the comp state that still has slurmstepd running:

S USER       PID  PPID  NI   RSS  VSZ STIME %CPU     TIME COMMAND
S root      89711      1   0  9628 10062940 Apr25 0.0 00:00:02 /opt/software/slurm/sbin/slurmd
S root     153043      1   0 21660 800948 Apr30  0.7 00:17:27 slurmstepd: [7187196.0]

-bash-4.2# gdb -p 153043
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 153043
Reading symbols from /opt/software/slurm-17.11.3-3/sbin/slurmstepd...done.
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/libslurmfull.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/libslurmfull.so
Reading symbols from /usr/lib64/libhwloc.so.5...Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /usr/lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libdl.so.2
Reading symbols from /usr/lib64/libpam.so.0...Reading symbols from /usr/lib64/libpam.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam.so.0
Reading symbols from /usr/lib64/libpam_misc.so.0...Reading symbols from /usr/lib64/libpam_misc.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam_misc.so.0
Reading symbols from /usr/lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libutil.so.1
Reading symbols from /usr/lib64/libssh2.so.1...Reading symbols from /usr/lib64/libssh2.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libssh2.so.1
Reading symbols from /usr/lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 167216]
[New LWP 167213]
[New LWP 167089]
...
[New LWP 153048]
[New LWP 153047]
[New LWP 153046]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /usr/lib64/libpthread.so.0
Reading symbols from /usr/lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libc.so.6
Reading symbols from /usr/lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /usr/lib64/libltdl.so.7...Reading symbols from /usr/lib64/libltdl.so.7...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libltdl.so.7
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libaudit.so.1...Reading symbols from /usr/lib64/libaudit.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libaudit.so.1
Reading symbols from /usr/lib64/libssl.so.10...Reading symbols from /usr/lib64/libssl.so.10...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libssl.so.10
Reading symbols from /usr/lib64/libcrypto.so.10...Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /usr/lib64/libz.so.1...Reading symbols from /usr/lib64/libz.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libz.so.1
Reading symbols from /usr/lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgcc_s.so.1
Reading symbols from /usr/lib64/libcap-ng.so.0...Reading symbols from /usr/lib64/libcap-ng.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcap-ng.so.0
Reading symbols from /usr/lib64/libgssapi_krb5.so.2...Reading symbols from /usr/lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgssapi_krb5.so.2
Reading symbols from /usr/lib64/libkrb5.so.3...Reading symbols from /usr/lib64/libkrb5.so.3...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libkrb5.so.3
Reading symbols from /usr/lib64/libcom_err.so.2...Reading symbols from /usr/lib64/libcom_err.so.2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcom_err.so.2
Reading symbols from /usr/lib64/libk5crypto.so.3...Reading symbols from /usr/lib64/libk5crypto.so.3...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libk5crypto.so.3
Reading symbols from /usr/lib64/libkrb5support.so.0...Reading symbols from /usr/lib64/libkrb5support.so.0...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libkrb5support.so.0
Reading symbols from /usr/lib64/libkeyutils.so.1...Reading symbols from /usr/lib64/libkeyutils.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libkeyutils.so.1
Reading symbols from /usr/lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libresolv.so.2
Reading symbols from /usr/lib64/libselinux.so.1...Reading symbols from /usr/lib64/libselinux.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libselinux.so.1
Reading symbols from /usr/lib64/libpcre.so.1...Reading symbols from /usr/lib64/libpcre.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpcre.so.1
Reading symbols from /usr/lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_files.so.2
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/select_cons_res.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/select_cons_res.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/auth_munge.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/auth_munge.so
Reading symbols from /opt/software/munge-0.5.12/lib/libmunge.so.2...done.
Loaded symbols for /opt/software/munge-0.5.12/lib/libmunge.so.2
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/switch_none.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/switch_none.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/gres_gpu.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/gres_gpu.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/core_spec_none.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/core_spec_none.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/task_cgroup.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/task_cgroup.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/proctrack_cgroup.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/proctrack_cgroup.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/checkpoint_none.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/checkpoint_none.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/crypto_munge.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/crypto_munge.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/job_container_none.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/job_container_none.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/mpi_pmix.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/mpi_pmix.so
Reading symbols from /opt/software/pmix/lib/libpmix.so...(no debugging symbols found)...done.
Loaded symbols for /opt/software/pmix/lib/libpmix.so
Reading symbols from /usr/lib64/libevent-2.0.so.5...Reading symbols from /usr/lib64/libevent-2.0.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libevent-2.0.so.5
Reading symbols from /usr/lib64/libevent_pthreads-2.0.so.5...Reading symbols from /usr/lib64/libevent_pthreads-2.0.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libevent_pthreads-2.0.so.5
Reading symbols from /usr/lib64/libtinfo.so.5...Reading symbols from /usr/lib64/libtinfo.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libtinfo.so.5
Reading symbols from /opt/software/slurm/lib/slurm/cc-tmpfs_mounts.so...(no debugging symbols found)...done.
Loaded symbols for /opt/software/slurm/lib/slurm/cc-tmpfs_mounts.so
0x00002b84a328bf57 in pthread_join () from /usr/lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install audit-libs-2.7.6-3.el7.x86_64 glibc-2.17-196.el7_4.2.x86_64 hwloc-libs-1.11.2-2.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-16.el7_4.1.x86_64 libselinux-2.5-11.el7.x86_64 libssh2-1.4.3-10.el7_2.1.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pam-1.1.8-18.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00002b84a328bf57 in pthread_join () from /usr/lib64/libpthread.so.0
#1  0x00000000004107ff in _wait_for_io (job=0x682bc0) at mgr.c:2219
#2  job_manager (job=job@entry=0x682bc0) at mgr.c:1397
#3  0x000000000040ca67 in main (argc=1, argv=0x7ffe8bb56558)
    at slurmstepd.c:172
(gdb)

This is CentOS Linux release 7.4.1708 (Core) and glibc-2.17-196.el7_4.2

- Martin
Comment 10 Dominik Bartkiewicz 2018-05-03 09:56:11 MDT
Hi

I suspect this is a duplicate of bug 5103.
To be sure, could you generate a slurmd backtrace from all threads, similar to what you did for slurmctld?

Dominik
Comment 11 Martin Siegert 2018-05-03 10:25:20 MDT
Here is the backtrace of slurmd on the same node (cdr1560):

# gdb -p 89711
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 89711
Reading symbols from /opt/software/slurm-17.11.3-3/sbin/slurmd...done.
Reading symbols from /lib64/libnuma.so.1...Reading symbols from /lib64/libnuma.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libnuma.so.1
Reading symbols from /lib64/libz.so.1...Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/libslurmfull.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/libslurmfull.so
Reading symbols from /opt/software/hwloc/lib/libhwloc.so.5...done.
Loaded symbols for /opt/software/hwloc/lib/libhwloc.so.5
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 176425]
[New LWP 176378]
[New LWP 176374]
[New LWP 176350]
[New LWP 176219]
[New LWP 175645]
[New LWP 174196]
[New LWP 174172]
[New LWP 173993]
[New LWP 173986]
[New LWP 173962]
[New LWP 173832]
[New LWP 173807]
[New LWP 173579]
[New LWP 173503]
[New LWP 173467]
[New LWP 173444]
[New LWP 173157]
[New LWP 173134]
[New LWP 173120]
[New LWP 173097]
[New LWP 173075]
[New LWP 173050]
[New LWP 173037]
[New LWP 173012]
[New LWP 172998]
[New LWP 172974]
[New LWP 172818]
[New LWP 172795]
[New LWP 172689]
[New LWP 172665]
[New LWP 172637]
[New LWP 172614]
[New LWP 172605]
[New LWP 172601]
[New LWP 172578]
[New LWP 172562]
[New LWP 172538]
[New LWP 172380]
[New LWP 172356]
[New LWP 172177]
[New LWP 172154]
[New LWP 172142]
[New LWP 172118]
[New LWP 171950]
[New LWP 171927]
[New LWP 171821]
[New LWP 171797]
[New LWP 171786]
[New LWP 171762]
[New LWP 171734]
[New LWP 171711]
[New LWP 171696]
[New LWP 171673]
[New LWP 171507]
[New LWP 171484]
[New LWP 171375]
[New LWP 171370]
[New LWP 171346]
[New LWP 171316]
[New LWP 171286]
[New LWP 171017]
[New LWP 171013]
[New LWP 170985]
[New LWP 169980]
[New LWP 169956]
[New LWP 169765]
[New LWP 169758]
[New LWP 169733]
[New LWP 169618]
[New LWP 169595]
[New LWP 169580]
[New LWP 169557]
[New LWP 169539]
[New LWP 169516]
[New LWP 169344]
[New LWP 169321]
[New LWP 169210]
[New LWP 169187]
[New LWP 168987]
[New LWP 168964]
[New LWP 168955]
[New LWP 168862]
[New LWP 168839]
[New LWP 168820]
[New LWP 168796]
[New LWP 168781]
[New LWP 168758]
[New LWP 167215]
[New LWP 167191]
[New LWP 167088]
[New LWP 167082]
[New LWP 167059]
[New LWP 167049]
[New LWP 167023]
[New LWP 167016]
[New LWP 166992]
[New LWP 166982]
[New LWP 166959]
[New LWP 166807]
[New LWP 166783]
[New LWP 166685]
[New LWP 166662]
[New LWP 166655]
[New LWP 166631]
[New LWP 166625]
[New LWP 166620]
[New LWP 166597]
[New LWP 166588]
[New LWP 166565]
[New LWP 166554]
[New LWP 166531]
[New LWP 166375]
[New LWP 166352]
[New LWP 166256]
[New LWP 166232]
[New LWP 166222]
[New LWP 166199]
[New LWP 166191]
[New LWP 166168]
[New LWP 166159]
[New LWP 166136]
[New LWP 166105]
[New LWP 166103]
[New LWP 165954]
[New LWP 165944]
[New LWP 165942]
[New LWP 165919]
[New LWP 165821]
[New LWP 165798]
[New LWP 165788]
[New LWP 165765]
[New LWP 165741]
[New LWP 165718]
[New LWP 165708]
[New LWP 165685]
[New LWP 165527]
[New LWP 165503]
[New LWP 165409]
[New LWP 165386]
[New LWP 165377]
[New LWP 165354]
[New LWP 165324]
[New LWP 165321]
[New LWP 165316]
[New LWP 165304]
[New LWP 165281]
[New LWP 165128]
[New LWP 165104]
[New LWP 165097]
[New LWP 164998]
[New LWP 164975]
[New LWP 164970]
[New LWP 164947]
[New LWP 164938]
[New LWP 164914]
[New LWP 164905]
[New LWP 164882]
[New LWP 164873]
[New LWP 164850]
[New LWP 164696]
[New LWP 164672]
[New LWP 164663]
[New LWP 164640]
[New LWP 164539]
[New LWP 164516]
[New LWP 164508]
[New LWP 164484]
[New LWP 164475]
[New LWP 164452]
[New LWP 164441]
[New LWP 164418]
[New LWP 164266]
[New LWP 164242]
[New LWP 164233]
[New LWP 164210]
[New LWP 164109]
[New LWP 164085]
[New LWP 164077]
[New LWP 164053]
[New LWP 164042]
[New LWP 164019]
[New LWP 164010]
[New LWP 163986]
[New LWP 163835]
[New LWP 163811]
[New LWP 163704]
[New LWP 163681]
[New LWP 163669]
[New LWP 163646]
[New LWP 163638]
[New LWP 163614]
[New LWP 163460]
[New LWP 163437]
[New LWP 163356]
[New LWP 163332]
[New LWP 163311]
[New LWP 163308]
[New LWP 163284]
[New LWP 163265]
[New LWP 163240]
[New LWP 163231]
[New LWP 163208]
[New LWP 163201]
[New LWP 163177]
[New LWP 163021]
[New LWP 162998]
[New LWP 162855]
[New LWP 162831]
[New LWP 162823]
[New LWP 162799]
[New LWP 161432]
[New LWP 161407]
[New LWP 161390]
[New LWP 161367]
[New LWP 161358]
[New LWP 161335]
[New LWP 159985]
[New LWP 159960]
[New LWP 159955]
[New LWP 159949]
[New LWP 159926]
[New LWP 159677]
[New LWP 159653]
[New LWP 159645]
[New LWP 159622]
[New LWP 159612]
[New LWP 159588]
[New LWP 159578]
[New LWP 159555]
[New LWP 158972]
[New LWP 158949]
[New LWP 158681]
[New LWP 158658]
[New LWP 158649]
[New LWP 158626]
[New LWP 158540]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libtinfo.so.5...Reading symbols from /lib64/libtinfo.so.5...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libtinfo.so.5
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/select_cons_res.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/select_cons_res.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/auth_munge.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/auth_munge.so
Reading symbols from /opt/software/munge-0.5.12/lib/libmunge.so.2...done.
Loaded symbols for /opt/software/munge-0.5.12/lib/libmunge.so.2
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/gres_gpu.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/gres_gpu.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/topology_tree.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/topology_tree.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/route_topology.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/route_topology.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/proctrack_cgroup.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/proctrack_cgroup.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/task_cgroup.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/task_cgroup.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/crypto_munge.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/crypto_munge.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/jobacct_gather_linux.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/jobacct_gather_linux.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/job_container_none.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/job_container_none.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/core_spec_none.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/core_spec_none.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/switch_none.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/switch_none.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/acct_gather_energy_none.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/acct_gather_energy_none.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/acct_gather_profile_none.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/acct_gather_interconnect_none.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/acct_gather_interconnect_none.so
Reading symbols from /opt/software/slurm-17.11.3-3/lib/slurm/acct_gather_filesystem_none.so...done.
Loaded symbols for /opt/software/slurm-17.11.3-3/lib/slurm/acct_gather_filesystem_none.so
Reading symbols from /lib64/libnss_sss.so.2...Reading symbols from /lib64/libnss_sss.so.2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_sss.so.2
0x00002b38f447798d in accept () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7_4.2.x86_64 libgcc-4.8.5-16.el7_4.1.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 sssd-client-1.15.2-50.el7_4.8.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00002b38f447798d in accept () from /lib64/libpthread.so.0
#1  0x00002b38f3d75d67 in slurm_accept_msg_conn (fd=<optimized out>, 
    addr=<optimized out>) at slurm_protocol_socket_implementation.c:454
#2  0x000000000040f1df in _msg_engine () at slurmd.c:437
#3  main (argc=1, argv=0x7ffd87665d18) at slurmd.c:373
(gdb) 

I am also attaching the output of `gdb -batch -ex "thread apply all bt full" -p 89711`

- Martin
Comment 12 Martin Siegert 2018-05-03 10:27:27 MDT
Created attachment 6761 [details]
gdb output of slurmd on cdr1560
Comment 13 Dominik Bartkiewicz 2018-05-03 10:38:20 MDT
Hi

Sorry, in comment 10 I meant to ask for the full slurmstepd backtrace, not the slurmd one. But even from the slurmd backtrace, I can confirm that this is the already known issue from bug 5103. Just to be 100% sure, could you send me the full backtrace of the slurmstepd for 7187196?

Dominik
Comment 14 Martin Siegert 2018-05-03 10:49:59 MDT
Created attachment 6762 [details]
gdb output of slurmstepd on cdr1560
Comment 15 Dominik Bartkiewicz 2018-05-03 13:18:31 MDT
Thanks,

I will close this bug as a duplicate of bug 5103.
In bug 5103 comment 20, Tim has described the current situation.

You will need to upgrade to 17.11.6 when it is available.

The workaround for now is to drain and reboot the affected node (or kill -9 the deadlocked slurmstepd).
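
A minimal sketch of scripting that workaround, assuming `ps` output laid out like the listing in comment 9 (PID in the third column, command shown as `slurmstepd: [<jobid>.<step>]`). The helper names are illustrative; the script only finds PIDs and builds the `scontrol` argument list, and leaves the actual drain, kill, or reboot to the admin.

```python
import re

# Matches the process title seen in comment 9, e.g. "slurmstepd: [7187196.0]".
STEPD = re.compile(r"slurmstepd: \[(\d+)\.([^\]]+)\]")


def stepd_pids(ps_text, job_id, pid_column=2):
    """Return PIDs of slurmstepd processes belonging to job_id.

    pid_column is the zero-based column holding the PID; 2 matches the
    "S USER PID PPID ..." layout assumed here.
    """
    pids = []
    for line in ps_text.splitlines():
        match = STEPD.search(line)
        if match and int(match.group(1)) == job_id:
            pids.append(int(line.split()[pid_column]))
    return pids


def drain_argv(node, reason):
    """Build the argument list for draining a node with scontrol."""
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN",
            f"Reason={reason}"]
```

For the node in comment 9, `stepd_pids(ps_text, 7187196)` would identify the slurmstepd to inspect or kill, and `drain_argv("cdr1560", "bug 5103")` gives the drain command to run before a reboot.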

Dominik

*** This ticket has been marked as a duplicate of ticket 5103 ***
Comment 16 Adam 2018-05-04 13:58:55 MDT
We put those patches into 17.11.5 and went live with it, but unfortunately it didn't seem to resolve the issue.  We applied them at 7:00pm PST, and the next morning at 7:47:54 PST we saw about 6,000 jobs begin launching and then a bunch cancelling, which started a cascade that left about 646 nodes in the COMPLETING state.  Restarting slurmctld once it reached 100K Agent Queue Size seemed to bring the COMPLETING nodes down to around 14.