Ticket 3159 - slurmctld becomes unresponsive, possibly related to accounting updates
Summary: slurmctld becomes unresponsive, possibly related to accounting updates
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.02.3
Hardware: Cray XC Linux
Severity: 3 - Medium Impact
Assignee: Danny Auble
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-10-11 15:43 MDT by Doug Jacobsen
Modified: 2017-07-21 15:59 MDT

See Also:
Site: NERSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.02.7, 17.11.0-pre2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld log (4.79 MB, application/x-gzip)
2016-10-11 15:44 MDT, Doug Jacobsen
Patch to serialize updates coming from the dbd. (1.58 KB, patch)
2017-07-20 14:53 MDT, Danny Auble

Description Doug Jacobsen 2016-10-11 15:43:29 MDT
Hello,

Twice today we've had an issue where cori's slurmctld needed to be sigkilled and restarted (a 5 minute process now while it caches all user/group data from sssd).

It appears to be concordant with messages in slurmctld.log like:

[2016-10-11T09:18:20.082] Warning: Note very large processing time from _slurm_rpc_accounting_update_msg: usec=176799810 began=09:15:23.282
[2016-10-11T09:20:45.051] Warning: Note very large processing time from _slurm_rpc_accounting_update_msg: usec=319753362 began=09:15:25.298
[2016-10-11T09:20:45.051] debug:  updating assoc 15411
[2016-10-11T09:20:45.052] error: assoc_mgr_update gave error: Unspecified error
[2016-10-11T09:21:15.964] debug:  aeld sync event
[2016-10-11T09:21:15.966] debug:  _start_session: Created aeld session fd 9
[2016-10-11T09:23:01.350] server_thread_count over limit (256), waiting

I'm attaching my log; it appears this may have happened last evening as well, but that may be mixed up with other activity related to bug 3153.

Today at 9:20ish and again at 14:16 slurm became unresponsive and I was forced to terminate it (sigkill).

In both instances there was heavy activity from sssd on the system, implying some kind of getpwuid/getpwnam activity was going on.

In this more recent occurrence, I did generate a core file prior to terminating slurmctld so I'll attach some backtraces as well.

Thanks,
Doug
Comment 1 Doug Jacobsen 2016-10-11 15:44:55 MDT
Created attachment 3583 [details]
slurmctld log
Comment 2 Doug Jacobsen 2016-10-11 15:49:39 MDT
(gdb) thread apply all bt

Thread 40 (Thread 0x2b2e29e7b800 (LWP 17468)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=NODE_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cde8 in lock_slurmctld (lock_levels=...) at locks.c:91
#3  0x00000000004476bb in _slurmctld_background (no_data=0x0) at controller.c:1758
#4  0x0000000000444b90 in main (argc=1, argv=0x7ffc207cd7f8) at controller.c:605

Thread 39 (Thread 0x2b2e42147700 (LWP 12627)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d1d1 in _wr_rdlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:218
#2  0x000000000048cd9f in lock_slurmctld (lock_levels=...) at locks.c:84
#3  0x00000000004b22d3 in _slurm_rpc_dump_jobs (msg=0x2b2fa8938290) at proc_req.c:1235
#4  0x00000000004af48a in slurmctld_req (msg=0x2b2fa8938290, arg=0x2b2e48015f10) at proc_req.c:289
#5  0x0000000000445dc5 in _service_connection (arg=0x2b2e48015f10) at controller.c:1121
#6  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#7  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 38 (Thread 0x2b2ecbc3b700 (LWP 12623)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=NODE_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cde8 in lock_slurmctld (lock_levels=...) at locks.c:91
#3  0x00000000004b3309 in _slurm_rpc_dump_nodes (msg=0x2b2fa89339d0) at proc_req.c:1513
#4  0x00000000004af516 in slurmctld_req (msg=0x2b2fa89339d0, arg=0x2b2e4800eef0) at proc_req.c:310
#5  0x0000000000445dc5 in _service_connection (arg=0x2b2e4800eef0) at controller.c:1121
#6  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#7  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 37 (Thread 0x2b2e41238700 (LWP 12597)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=NODE_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cde8 in lock_slurmctld (lock_levels=...) at locks.c:91
#3  0x00000000004b3309 in _slurm_rpc_dump_nodes (msg=0x2b2fa8938e90) at proc_req.c:1513
#4  0x00000000004af516 in slurmctld_req (msg=0x2b2fa8938e90, arg=0x2b2e48013b20) at proc_req.c:310
#5  0x0000000000445dc5 in _service_connection (arg=0x2b2e48013b20) at controller.c:1121
#6  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#7  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 36 (Thread 0x2b2eca827700 (LWP 12586)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cdb8 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x00000000004b4444 in _slurm_rpc_complete_job_allocation (msg=0x2b2faa0ef740) at proc_req.c:1893
#4  0x00000000004af593 in slurmctld_req (msg=0x2b2faa0ef740, arg=0x2b2e480174a0) at proc_req.c:326
#5  0x0000000000445dc5 in _service_connection (arg=0x2b2e480174a0) at controller.c:1121
#6  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#7  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 35 (Thread 0x2b2e4244a700 (LWP 12585)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cdb8 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x00000000004b7c1b in _slurm_rpc_step_complete (msg=0x2b2faa0ef6b0, running_composite=false) at proc_req.c:3124
#4  0x00000000004af8ea in slurmctld_req (msg=0x2b2faa0ef6b0, arg=0x2b2e48003e20) at proc_req.c:451
#5  0x0000000000445dc5 in _service_connection (arg=0x2b2e48003e20) at controller.c:1121
#6  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#7  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 34 (Thread 0x2b2e4193f700 (LWP 12578)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d1d1 in _wr_rdlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:218
#2  0x000000000048cd9f in lock_slurmctld (lock_levels=...) at locks.c:84
#3  0x00000000004b22d3 in _slurm_rpc_dump_jobs (msg=0x2b2faa0f09a0) at proc_req.c:1235
#4  0x00000000004af48a in slurmctld_req (msg=0x2b2faa0f09a0, arg=0x2b2e48013280) at proc_req.c:289
#5  0x0000000000445dc5 in _service_connection (arg=0x2b2e48013280) at controller.c:1121
#6  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#7  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 33 (Thread 0x2b2ec9a19700 (LWP 12562)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d1d1 in _wr_rdlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:218
#2  0x000000000048cd9f in lock_slurmctld (lock_levels=...) at locks.c:84
#3  0x00000000004b22d3 in _slurm_rpc_dump_jobs (msg=0x2b2faa0f1e20) at proc_req.c:1235
#4  0x00000000004af48a in slurmctld_req (msg=0x2b2faa0f1e20, arg=0x2b2e48001470) at proc_req.c:289
#5  0x0000000000445dc5 in _service_connection (arg=0x2b2e48001470) at controller.c:1121
#6  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#7  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 32 (Thread 0x2b2e4183e700 (LWP 12355)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d1d1 in _wr_rdlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:218
#2  0x000000000048cd9f in lock_slurmctld (lock_levels=...) at locks.c:84
#3  0x00000000004b22d3 in _slurm_rpc_dump_jobs (msg=0x2b2fa8938500) at proc_req.c:1235
#4  0x00000000004af48a in slurmctld_req (msg=0x2b2fa8938500, arg=0x2b2e4800be10) at proc_req.c:289
#5  0x0000000000445dc5 in _service_connection (arg=0x2b2e4800be10) at controller.c:1121
#6  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#7  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 31 (Thread 0x2b2ecb130700 (LWP 12343)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=NODE_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cde8 in lock_slurmctld (lock_levels=...) at locks.c:91
#3  0x00000000004b3309 in _slurm_rpc_dump_nodes (msg=0x2b2faa0b8cc0) at proc_req.c:1513
#4  0x00000000004af516 in slurmctld_req (msg=0x2b2faa0b8cc0, arg=0x2b2e48007d60) at proc_req.c:310
#5  0x0000000000445dc5 in _service_connection (arg=0x2b2e48007d60) at controller.c:1121
#6  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#7  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 30 (Thread 0x2b3014302700 (LWP 12333)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d1d1 in _wr_rdlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:218
#2  0x000000000048cd9f in lock_slurmctld (lock_levels=...) at locks.c:84
#3  0x00000000004b22d3 in _slurm_rpc_dump_jobs (msg=0x2b2faa0b95f0) at proc_req.c:1235
#4  0x00000000004af48a in slurmctld_req (msg=0x2b2faa0b95f0, arg=0x2b2e480061e0) at proc_req.c:289
#5  0x0000000000445dc5 in _service_connection (arg=0x2b2e480061e0) at controller.c:1121
#6  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#7  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 29 (Thread 0x2b2e42349700 (LWP 12125)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004f64a0 in _wr_wrlock (datatype=ASSOC_LOCK) at assoc_mgr.c:1897
#2  0x00000000004f6c49 in assoc_mgr_lock (locks=0x2b2e42348ce0) at assoc_mgr.c:2050
#3  0x00000000004fa488 in assoc_mgr_update_assocs (update=0x2b2faa0b95a0, locked=false) at assoc_mgr.c:3517
#4  0x00000000004fa330 in assoc_mgr_update (update_list=0x2b2fa851f1b0, locked=false) at assoc_mgr.c:3463
#5  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2faa0b9160) at proc_req.c:5232
#6  0x00000000004af9c6 in slurmctld_req (msg=0x2b2faa0b9160, arg=0x2b2e480049a0) at proc_req.c:484
#7  0x0000000000445dc5 in _service_connection (arg=0x2b2e480049a0) at controller.c:1121
#8  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#9  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 28 (Thread 0x2b2e41d43700 (LWP 12111)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=NODE_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cde8 in lock_slurmctld (lock_levels=...) at locks.c:91
#3  0x00000000004d58f0 in update_assocs_in_resvs () at reservation.c:5545
#4  0x00000000004fb40a in assoc_mgr_update_assocs (update=0x2b2fa8932e70, locked=false) at assoc_mgr.c:3950
#5  0x00000000004fa330 in assoc_mgr_update (update_list=0x3a211a0, locked=false) at assoc_mgr.c:3463
#6  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa85f4350) at proc_req.c:5232
#7  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa85f4350, arg=0x2b2e48011450) at proc_req.c:484
#8  0x0000000000445dc5 in _service_connection (arg=0x2b2e48011450) at controller.c:1121
#9  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#10 0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 27 (Thread 0x2b2e4254b700 (LWP 12089)):
#0  0x00002b2e2a56dbbd in poll () from /lib64/libc.so.6
#1  0x00002b2e2f929dca in ?? () from /lib64/libnss_sss.so.2
#2  0x00002b2e2f92a73d in ?? () from /lib64/libnss_sss.so.2
#3  0x00002b2e2f92b161 in _nss_sss_getpwnam_r () from /lib64/libnss_sss.so.2
#4  0x00002b2e2a546ead in getpwnam_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#5  0x00000000005c95a8 in _getpwnam_r (name=0x1639480 "spartak", pwd=0x2b2e4253abe0, buf=0x2b2e4253ac10 "crayadm", bufsiz=65536,
    result=0x2b2e4253abc8) at uid.c:73
#6  0x00000000005c9698 in uid_from_string (name=0x1639480 "spartak", uidp=0x2b2e4254ac54) at uid.c:111
#7  0x00000000004f3674 in _set_assoc_parent_and_user (assoc=0x16390c0, reset=0) at assoc_mgr.c:810
#8  0x00000000004fb17e in assoc_mgr_update_assocs (update=0x2b2fa99737a0, locked=false) at assoc_mgr.c:3878
#9  0x00000000004fa330 in assoc_mgr_update (update_list=0x2b2fa88e35d0, locked=false) at assoc_mgr.c:3463
#10 0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa99741d0) at proc_req.c:5232
#11 0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa99741d0, arg=0x2b2e48015da0) at proc_req.c:484
#12 0x0000000000445dc5 in _service_connection (arg=0x2b2e48015da0) at controller.c:1121
#13 0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#14 0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 26 (Thread 0x2b2ec9514700 (LWP 12075)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004f64a0 in _wr_wrlock (datatype=ASSOC_LOCK) at assoc_mgr.c:1897
#2  0x00000000004f6c49 in assoc_mgr_lock (locks=0x2b2ec9513ce0) at assoc_mgr.c:2050
#3  0x00000000004fa488 in assoc_mgr_update_assocs (update=0x2b2fa9973cf0, locked=false) at assoc_mgr.c:3517
#4  0x00000000004fa330 in assoc_mgr_update (update_list=0x2b2fa88e3670, locked=false) at assoc_mgr.c:3463
#5  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa9973810) at proc_req.c:5232
#6  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa9973810, arg=0x2b2e480050d0) at proc_req.c:484
#7  0x0000000000445dc5 in _service_connection (arg=0x2b2e480050d0) at controller.c:1121
#8  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#9  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x2b2e41f45700 (LWP 12067)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=NODE_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cde8 in lock_slurmctld (lock_levels=...) at locks.c:91
#3  0x00000000004d58f0 in update_assocs_in_resvs () at reservation.c:5545
#4  0x00000000004fb40a in assoc_mgr_update_assocs (update=0x2b2fa9079d60, locked=false) at assoc_mgr.c:3950
#5  0x00000000004fa330 in assoc_mgr_update (update_list=0x2b2fa88e0460, locked=false) at assoc_mgr.c:3463
#6  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa907a990) at proc_req.c:5232
#7  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa907a990, arg=0x2b2e48005ae0) at proc_req.c:484
#8  0x0000000000445dc5 in _service_connection (arg=0x2b2e48005ae0) at controller.c:1121
#9  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#10 0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 24 (Thread 0x2b2e4163c700 (LWP 11865)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cdb8 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000047c09a in job_hold_by_assoc_id (assoc_id=1464) at job_mgr.c:14776
#4  0x0000000000446414 in _remove_assoc (rec=0x1730e20) at controller.c:1288
#5  0x00000000004fb345 in assoc_mgr_update_assocs (update=0x2b2fa9079fd0, locked=false) at assoc_mgr.c:3936
#6  0x00000000004fa330 in assoc_mgr_update (update_list=0x2b2fa88e0370, locked=false) at assoc_mgr.c:3463
#7  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa907a770) at proc_req.c:5232
#8  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa907a770, arg=0x2b2e4800d7f0) at proc_req.c:484
#9  0x0000000000445dc5 in _service_connection (arg=0x2b2e4800d7f0) at controller.c:1121
#10 0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#11 0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 23 (Thread 0x2b2e41339700 (LWP 11850)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004f64a0 in _wr_wrlock (datatype=ASSOC_LOCK) at assoc_mgr.c:1897
#2  0x00000000004f6c49 in assoc_mgr_lock (locks=0x2b2e41338ce0) at assoc_mgr.c:2050
#3  0x00000000004fa488 in assoc_mgr_update_assocs (update=0x2b2fa907a290, locked=false) at assoc_mgr.c:3517
#4  0x00000000004fa330 in assoc_mgr_update (update_list=0x2b2fa88e3490, locked=false) at assoc_mgr.c:3463
#5  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa9079db0) at proc_req.c:5232
#6  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa9079db0, arg=0x2b2e480064f0) at proc_req.c:484
#7  0x0000000000445dc5 in _service_connection (arg=0x2b2e480064f0) at controller.c:1121
#8  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#9  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 22 (Thread 0x2b2e40f35700 (LWP 11843)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=NODE_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cde8 in lock_slurmctld (lock_levels=...) at locks.c:91
#3  0x00000000004d58f0 in update_assocs_in_resvs () at reservation.c:5545
#4  0x00000000004fb40a in assoc_mgr_update_assocs (update=0x2b2fa8938710, locked=false) at assoc_mgr.c:3950
#5  0x00000000004fa330 in assoc_mgr_update (update_list=0x2b2fa88e0500, locked=false) at assoc_mgr.c:3463
#6  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa8939100) at proc_req.c:5232
#7  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa8939100, arg=0x2b2e48010030) at proc_req.c:484
#8  0x0000000000445dc5 in _service_connection (arg=0x2b2e48010030) at controller.c:1121
#9  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#10 0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 21 (Thread 0x2b3014807700 (LWP 11828)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004f64a0 in _wr_wrlock (datatype=ASSOC_LOCK) at assoc_mgr.c:1897
#2  0x00000000004f6c49 in assoc_mgr_lock (locks=0x2b3014806ce0) at assoc_mgr.c:2050
#3  0x00000000004fa488 in assoc_mgr_update_assocs (update=0x2b2fa8938c20, locked=false) at assoc_mgr.c:3517
#4  0x00000000004fa330 in assoc_mgr_update (update_list=0x2b2fa88e32b0, locked=false) at assoc_mgr.c:3463
#5  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa8938780) at proc_req.c:5232
#6  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa8938780, arg=0x2b2e48009a20) at proc_req.c:484
#7  0x0000000000445dc5 in _service_connection (arg=0x2b2e48009a20) at controller.c:1121
#8  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#9  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 20 (Thread 0x2b2eca423700 (LWP 11821)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004f64a0 in _wr_wrlock (datatype=ASSOC_LOCK) at assoc_mgr.c:1897
#2  0x00000000004f6c49 in assoc_mgr_lock (locks=0x2b2eca422cf0) at assoc_mgr.c:2050
#3  0x00000000004fb861 in assoc_mgr_update_users (update=0x2b2fa8937b90, locked=false) at assoc_mgr.c:4090
#4  0x00000000004fa316 in assoc_mgr_update (update_list=0x2b2fa82d6da0, locked=false) at assoc_mgr.c:3457
#5  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa8938660) at proc_req.c:5232
#6  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa8938660, arg=0x2b2e48008490) at proc_req.c:484
#7  0x0000000000445dc5 in _service_connection (arg=0x2b2e48008490) at controller.c:1121
#8  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#9  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 19 (Thread 0x2b2e4173d700 (LWP 11792)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cdb8 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x0000000000446603 in _update_assoc (rec=0x2b2fa8a21440) at controller.c:1340
#4  0x00000000004fb3ac in assoc_mgr_update_assocs (update=0x2b2fa8937f30, locked=false) at assoc_mgr.c:3944
#5  0x00000000004fa330 in assoc_mgr_update (update_list=0x2b2fa82d4220, locked=false) at assoc_mgr.c:3463
#6  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa8933900) at proc_req.c:5232
#7  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa8933900, arg=0x2b2e480095d0) at proc_req.c:484
#8  0x0000000000445dc5 in _service_connection (arg=0x2b2e480095d0) at controller.c:1121
#9  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#10 0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x2b2ec8403700 (LWP 11784)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=NODE_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cde8 in lock_slurmctld (lock_levels=...) at locks.c:91
#3  0x00000000004d58f0 in update_assocs_in_resvs () at reservation.c:5545
#4  0x00000000004fb40a in assoc_mgr_update_assocs (update=0x2b2fa8932e40, locked=false) at assoc_mgr.c:3950
#5  0x00000000004fa330 in assoc_mgr_update (update_list=0x2b2fa82d6e40, locked=false) at assoc_mgr.c:3463
#6  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa8933650) at proc_req.c:5232
#7  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa8933650, arg=0x2b2e4800cde0) at proc_req.c:484
#8  0x0000000000445dc5 in _service_connection (arg=0x2b2e4800cde0) at controller.c:1121
#9  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#10 0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x2b2e4274d700 (LWP 11770)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004f64a0 in _wr_wrlock (datatype=ASSOC_LOCK) at assoc_mgr.c:1897
#2  0x00000000004f6c49 in assoc_mgr_lock (locks=0x2b2e4274cce0) at assoc_mgr.c:2050
#3  0x00000000004fa488 in assoc_mgr_update_assocs (update=0x2b2fa8933600, locked=false) at assoc_mgr.c:3517
#4  0x00000000004fa330 in assoc_mgr_update (update_list=0x2b2fa82d3ff0, locked=false) at assoc_mgr.c:3463
#5  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa89331b0) at proc_req.c:5232
#6  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa89331b0, arg=0x2b2e48009460) at proc_req.c:484
#7  0x0000000000445dc5 in _service_connection (arg=0x2b2e48009460) at controller.c:1121
#8  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#9  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x2b2eca322700 (LWP 11756)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004f64a0 in _wr_wrlock (datatype=ASSOC_LOCK) at assoc_mgr.c:1897
#2  0x00000000004f6c49 in assoc_mgr_lock (locks=0x2b2eca321cf0) at assoc_mgr.c:2050
#3  0x00000000004fb861 in assoc_mgr_update_users (update=0x2b2fa8737200, locked=false) at assoc_mgr.c:4090
#4  0x00000000004fa316 in assoc_mgr_update (update_list=0x2b2fa8378ad0, locked=false) at assoc_mgr.c:3457
#5  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa851ed50) at proc_req.c:5232
#6  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa851ed50, arg=0x2b2e4800c820) at proc_req.c:484
#7  0x0000000000445dc5 in _service_connection (arg=0x2b2e4800c820) at controller.c:1121
#8  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#9  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x2b2e41137700 (LWP 11741)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cdb8 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x0000000000446603 in _update_assoc (rec=0x2b2fa871d980) at controller.c:1340
#4  0x00000000004fb3ac in assoc_mgr_update_assocs (update=0x2b2fa803c850, locked=false) at assoc_mgr.c:3944
#5  0x00000000004fa330 in assoc_mgr_update (update_list=0x3a20340, locked=false) at assoc_mgr.c:3463
#6  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa85f7b00) at proc_req.c:5232
#7  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa85f7b00, arg=0x2b2e48009fe0) at proc_req.c:484
#8  0x0000000000445dc5 in _service_connection (arg=0x2b2e48009fe0) at controller.c:1121
#9  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#10 0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x2b2ecbf3e700 (LWP 11734)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004f622b in _wr_rdlock (datatype=ASSOC_LOCK) at assoc_mgr.c:1864
#2  0x00000000004f6c32 in assoc_mgr_lock (locks=0x2b2ecbf3db10) at assoc_mgr.c:2048
#3  0x00000000004c77f5 in _set_assoc_list (resv_ptr=0x45d5890) at reservation.c:650
#4  0x00000000004d5911 in update_assocs_in_resvs () at reservation.c:5549
#5  0x00000000004fb40a in assoc_mgr_update_assocs (update=0x2b2fa8043070, locked=false) at assoc_mgr.c:3950
#6  0x00000000004fa330 in assoc_mgr_update (update_list=0x2b2fa85203d0, locked=false) at assoc_mgr.c:3463
#7  0x00000000004bdd20 in _slurm_rpc_accounting_update_msg (msg=0x2b2fa804b770) at proc_req.c:5232
#8  0x00000000004af9c6 in slurmctld_req (msg=0x2b2fa804b770, arg=0x2b2e48015390) at proc_req.c:484
#9  0x0000000000445dc5 in _service_connection (arg=0x2b2e48015390) at controller.c:1121
#10 0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#11 0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x2b2e40c32700 (LWP 17609)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cdb8 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x00000000004ad0dc in _init_power_save (arg=0x0) at power_save.c:764
#4  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#5  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x2b2e40b31700 (LWP 17608)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004d83ba in slurmctld_state_save (no_data=0x0) at state_save.c:208
#2  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x2b2e40a30700 (LWP 17607)):
#0  0x00002b2e2a2835a9 in do_sigwait () from /lib64/libpthread.so.0
#1  0x00002b2e2a283633 in sigwait () from /lib64/libpthread.so.0
#2  0x0000000000445260 in _slurmctld_signal_hand (no_data=0x0) at controller.c:876
#3  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x2b2e4092f700 (LWP 17606)):
#0  0x00002b2e2a56f253 in select () from /lib64/libc.so.6
#1  0x0000000000445910 in _slurmctld_rpc_mgr (no_data=0x0) at controller.c:1014
#2  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x2b2e4062c700 (LWP 17605)):
#0  0x00002b2e2a27d4c2 in pthread_join () from /lib64/libpthread.so.0
#1  0x00002b2e40224d34 in _cleanup_thread (no_data=0x0) at priority_multifactor.c:1425
#2  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x2b2e4052b700 (LWP 17604)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004f64a0 in _wr_wrlock (datatype=ASSOC_LOCK) at assoc_mgr.c:1897
#2  0x00000000004f6c49 in assoc_mgr_lock (locks=0x2b2e4052af30) at assoc_mgr.c:2050
#3  0x00002b2e402248cd in _decay_thread (no_data=0x0) at priority_multifactor.c:1315
#4  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#5  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x2b2e33f7c700 (LWP 17579)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004f622b in _wr_rdlock (datatype=ASSOC_LOCK) at assoc_mgr.c:1864
#2  0x00000000004f6c32 in assoc_mgr_lock (locks=0x2b2e33f7bef0) at assoc_mgr.c:2048
#3  0x00002b2e40007d9a in _load_state (init_config=false) at burst_buffer_cray.c:1189
#4  0x00002b2e40005197 in _bb_agent (args=0x0) at burst_buffer_cray.c:417
#5  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#6  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x2b2e33e7b700 (LWP 17526)):
#0  0x00002b2e2a56dbbd in poll () from /lib64/libc.so.6
#1  0x00002b2e2b487355 in _aeld_event_loop (args=0x0) at select_cray.c:522
#2  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x2b2e31dd8700 (LWP 17519)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d417 in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x000000000048cdb8 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x00002b2e31ad0430 in backfill_agent (args=0x0) at backfill.c:819
#4  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#5  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x2b2e3106a700 (LWP 17473)):
#0  0x00002b2e2a2803e8 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005c1a68 in _agent (x=0x0) at slurmdbd_defs.c:2137
#2  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x2b2e30bde700 (LWP 17472)):
#0  0x00002b2e2a27d4c2 in pthread_join () from /lib64/libpthread.so.0
#1  0x00002b2e307d4dfc in _cleanup_thread (no_data=0x0) at accounting_storage_slurmdbd.c:443
#2  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x2b2e30add700 (LWP 17471)):
#0  0x00002b2e2a28003f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000048d1d1 in _wr_rdlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:218
#2  0x000000000048cd9f in lock_slurmctld (lock_levels=...) at locks.c:84
#3  0x00002b2e307d48ba in _set_db_inx_thread (no_data=0x0) at accounting_storage_slurmdbd.c:281
#4  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#5  0x00002b2e2a575fed in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b2e29f7d700 (LWP 17470)):
#0  0x00002b2e2a547d4d in nanosleep () from /lib64/libc.so.6
#1  0x00002b2e2a547be4 in sleep () from /lib64/libc.so.6
#2  0x00002b2e2fb355ef in _lease_extender (args=0x0) at cookies.c:350
#3  0x00002b2e2a27c0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b2e2a575fed in clone () from /lib64/libc.so.6
(gdb)
Comment 4 Doug Jacobsen 2016-10-11 15:56:20 MDT
I terminated slurmctld and ran the accounting update with just slurmdbd up and no slurmctld, and that went through. We typically see tens of accounting updates per day (new users, removed users, changes in QOS authorizations, changes in account authorizations, and so on), and I've typically only seen this behavior from slurmctld when the number of updates is well over 100 and slurmdbd sends too many concurrent updates.

I'm concerned that the increased maintenance of the larger Cori system is generating more lock contention with the accounting updates.  In this case only a few updates were trying to come through, as opposed to the earlier incidents where a very large number of updates tied slurmctld up.

I'm looking forward to your thoughts on this,
Doug
Comment 5 Tim Wickberg 2016-10-11 16:03:27 MDT
It does look like the accounting update is triggering a series of UID lookups that aren't being promptly answered for some reason.

Have you made any significant changes to the associations that would generate an unusually high number of lookups, or is there any indication on the controller of why sssd isn't responding fast enough?

The only thing I can think of trying here would be to defer the uid lookups (although I think that just moves the contention elsewhere and doesn't really solve the potential throughput issues) and/or start caching them internally, as is already done for group membership.
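The internal caching idea could look roughly like the following. This is a hedged sketch, not Slurm's actual implementation: the names (`cached_name_to_uid`, `uid_cache_ent`) are invented for illustration, and Slurm's real group-membership cache is more elaborate. The point is that each username hits NSS (and therefore sssd/LDAP) at most once, including negative results.

```c
#include <pthread.h>
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct uid_cache_ent {
	char name[64];
	uid_t uid;
	int found;                      /* cache negative results too */
	struct uid_cache_ent *next;
};

static struct uid_cache_ent *uid_cache;
static pthread_mutex_t uid_cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Return 0 and fill *uid on success, -1 if the name cannot be resolved.
 * Each name goes out to NSS (sssd/LDAP) at most once per process. */
int cached_name_to_uid(const char *name, uid_t *uid)
{
	struct uid_cache_ent *e;

	pthread_mutex_lock(&uid_cache_lock);
	for (e = uid_cache; e; e = e->next) {
		if (!strcmp(e->name, name)) {
			int found = e->found;
			if (found)
				*uid = e->uid;
			pthread_mutex_unlock(&uid_cache_lock);
			return found ? 0 : -1;
		}
	}
	pthread_mutex_unlock(&uid_cache_lock);

	/* miss: do the (potentially slow) lookup without holding the lock */
	struct passwd pw, *res = NULL;
	char buf[16384];
	getpwnam_r(name, &pw, buf, sizeof(buf), &res);

	e = calloc(1, sizeof(*e));
	snprintf(e->name, sizeof(e->name), "%s", name);
	e->found = (res != NULL);
	if (res)
		e->uid = res->pw_uid;

	pthread_mutex_lock(&uid_cache_lock);
	e->next = uid_cache;    /* a duplicate entry from a racing miss is
				 * harmless in this sketch */
	uid_cache = e;
	pthread_mutex_unlock(&uid_cache_lock);

	if (e->found) {
		*uid = e->uid;
		return 0;
	}
	return -1;
}
```

With a cache like this, the repeated lookups during an accounting update would only be expensive the first time a given (or missing) user is seen.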
Comment 6 Doug Jacobsen 2016-10-11 16:10:54 MDT
OK, I see one instance of getpwnam; is that serialized?  If so, would it
be possible to make the locks a bit more fine-grained to allow
non-conflicting account updates to run in parallel?

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


Comment 7 Doug Jacobsen 2016-10-11 16:11:43 MDT
I'll also look into why sssd is so very slow.

Comment 8 Tim Wickberg 2016-10-11 16:32:49 MDT
(In reply to Doug Jacobsen from comment #6)
> OK, I see one instance of getpwnam;  is that serialized?  If so, would it
> be possible to make the locks a bit more fine-grain to allow
> non-conflicting account updates to run in parallel?

It's sitting in a tight loop running through the updates... so, yes, serial.

Changing the core lock structures would be non-trivial (the association manager is already split up some, but the update call needs to hold ~ half the write locks to deal with various potential changes), and isn't something we're currently looking at.

And in this case, I don't think parallel calls to getpwnam would improve the situation. I think it may even exacerbate things if sssd/LDAP is having trouble responding quickly.
Comment 9 Doug Jacobsen 2016-10-11 16:42:07 MDT
(we are looking at ways to improve sssd performance, so that is happening
in parallel)

yes, perhaps it would just make things worse.  Given that, what is the
issue that is making slurmctld unresponsive while waiting on this?

Is this a service thread, which could, in principle, give up locks the
way the backfill scheduler does every so often?  Does this same service
thread handle all accounting updates, or does each accounting update
generate its own version of this?

Comment 10 Tim Wickberg 2016-10-11 17:44:21 MDT
(In reply to Doug Jacobsen from comment #9)
> (we are looking at ways to improve sssd performance, so that is happening
> in parallel)
> 
> yes, perhaps it would just make things worse.  Given that, what is the
> issue that is making slurmctld unresponsive while waiting on this?

Lock contention - you can see almost everything else is stuck on pthread_cond_wait trying to acquire various locks.

> Is this a service thread, which could, in principle give up locks like the
> backfill scheduler does every "so often"?  Does this same service thread
> handle all accounting updates?  or does each accounting update generate
> it's own version of this?

It's not a service thread; it's just another RPC message handler responding to a message sent from slurmdbd.

Each update generates a separate message; what I believe you're seeing here is that a larger bulk update of your association structure is being slowly worked through (I don't think it's permanently stuck here, and if you'd stopped it at some other point it would be working on a different account).

One important difference here is that the backfill scheduler can be stalled/reset/delayed indefinitely without issue (well, aside from system throughput), whereas these accounting updates always need to be processed eventually.

Admittedly, we've never seen a performance issue around this mechanism, so we haven't looked into ways of mitigating it. I'd be curious to find out whether the updates are getting stuck on any specific account, or whether it's just the successive set of calls that's getting into trouble - is there any hint from sssd as to what it's seeing on its side?
Comment 11 Doug Jacobsen 2016-10-14 06:59:46 MDT
Hello,

Sorry, I got pulled into other things.

Regarding what sssd is doing, that is an ongoing struggle - we're new to it.  I disabled enumeration in hopes that would help, but it didn't, really.  It still takes slurmctld 5 minutes to start up while having a conversation with sssd.

For this issue, there were 11 different sacctmgr commands in total that our accounting integration was attempting to run.  Once I downed slurmctld, let the update run with slurmctld offline, and then brought slurmctld back up, all was well.

This type of thing is the origin of another enhancement request I have, which is to allow sacctmgr to block until all slurmctlds have received the update, thus serializing the sacctmgr calls from the automated script.  However, I have usually run into this type of problem when there were hundreds or thousands of updates, not 11.  At present the script sleeps for 10s (or so; I don't remember exactly) after every 10 sacctmgr calls to (sort of) keep the accounting queue from overwhelming slurmctld's locking.

-Doug
Comment 12 Doug Jacobsen 2016-11-08 14:27:45 MST
Hi Tim,

I think I figured out the problem.  It turns out we had a little over four hundred users in the database for which there were no LDAP entries, owing to a difference of views in our upstream user/accounting data provider (which we sync Slurm to).  It seems that Slurm was looping over all of them in assoc_mgr_set_missing_uids(), which runs hourly.  We also run our accounting updates hourly, and the hourly cron job (with no explicit minute offset) and Slurm stayed synced up, resulting in a large number of concurrent updates and quite likely lock thrashing.

Anyway, I turned on sssd debugging and found that 98% of the requests were for accounts that didn't exist.  I found a way to ignore these coming in from the upstream accounting system and removed the uid-less accounts from slurmdbd.  As far as I can tell this helped quite a bit in stopping the hourly slurmctld hang (and occasional failure).

For my automated account updates that synchronize the NIM system (NERSC Information Management) and slurmdbd, I'm still interested in having a way to throttle updates so as not to overwhelm slurmctld (i.e., serialize sacctmgr updates).

We also found an earlier issue where sssd would kill the LDAP worker (sssd_be) and restart it (invalidating the cache).  This was fixed by increasing the acceptable timeout -- we have too many groups to be cached within the default timeout window.

So that is the major observation: users in slurmdbd without uids resolvable on the slurmctld-running system can lead to this issue, especially if there are a large number of them.
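As a rough illustration of the failure mode (this is not Slurm code, and `uid_resolvable` is a hypothetical helper): each unresolvable username costs a full NSS round trip through sssd/LDAP during the hourly pass, so hundreds of them add up to a long stretch spent under locks. A minimal pre-flight check over the usernames in accounting data could be sketched as:

```c
#include <pwd.h>
#include <stddef.h>

/* Return 1 if `name` resolves to a uid via NSS (files/sssd/LDAP) on
 * this host -- the same lookup path slurmctld uses -- else 0. */
int uid_resolvable(const char *name)
{
	struct passwd pw, *res = NULL;
	char buf[16384];

	if (getpwnam_r(name, &pw, buf, sizeof(buf), &res) != 0)
		return 0;       /* lookup error: treat as unresolved */
	return res != NULL;
}
```

Running such a check over the user list exported from slurmdbd on the slurmctld host would have flagged the ~400 uid-less accounts before the hourly assoc_mgr_set_missing_uids() pass hit them.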

-Doug

Comment 13 Tim Wickberg 2016-11-08 16:25:45 MST
(In reply to Doug Jacobsen from comment #12)
> Hi Tim,
> 
> I think I figured out the problem.  It turns out we had a little over 
> four hundred users in the database for which there were no ldap 
> entries.  This was owing to a difference of views in our upstream 
> user/accounting data provider (which we sync slurm to). Anyway, in the 
> it seems that slurm was looping over all of them in 
> assoc_mgr_set_missing_uids(), which runs hourly.  We also run our 
> accounting updates hourly, and it seems that the hourly cron job (no 
> explicit time) and slurm stayed synced up resulting in a large number of 
> concurrent updates and quite likely lock thrashing.
> 
> Anyway I turned on sssd debugging and found that 98% of the requests 
> were for accounts that didn't exist.  I found a way to ignore these 
> coming in from the upstream accounting system and removed the uid-less 
> acounts from slurmdbd.   As far as I can tell this helped quite a bit in 
> stopping the hourly slurmctld hang (and sometimes failure).
> 
> For my automated account updates to synchronize the NIM system (NERSC 
> Information Management) and slurmdbd, I'm still interested in having a 
> way to throttle updates so as not to overwhelm slurmctld (i.e., 
> serialize sacctmgr updates).

I believe this is bug 2331.

> We also found an earlier issue where sssd would kill the ldap worker 
> (sssd_be) and restart it (invalidating the cache.  This was fixed by 
> increasing the acceptable timeout -- we have too many groups to be 
> cached within the default timeout window.
> 
> Anyway, so that is the major observation, users in slurmdbd without uids 
> resolvable on the slurmctld-running system can lead to this issue, 
> especially if there are a large number of them.

I've been piecing together a few more of these interactions. The hourly refresh of the associations (and UID cache) attempts to fill in any missing UIDs, and that behavior is intentional on slurmctld's end. (A number of other sites may add an association ahead of the user being fully established in LDAP, and retrying periodically is the only way to get everything in sync.)

Should have pointed this out earlier - 'sacctmgr show problems' should be highlighting all of these UID-missing usernames.

If you weren't using the '-i' flag on the accounts, you'd at least have seen a warning about this directly:
tim@zoidberg:~$ sacctmgr
sacctmgr: add user asdfasdf account=root
There is no uid for user 'asdfasdf'
Are you sure you want to continue? (You have 30 seconds to decide)
(N/y): y
 Adding User(s)
  asdfasdf
 Associations =
  U = asdfasdf  A = root       C = zoidberg  
 Non Default Settings
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

With the -i flag that additional warning isn't printed; we're discussing whether it may be worth adding to 17.02, although it may inadvertently break a lot of sites' user-addition scripts, so the tradeoff is still being discussed.
Comment 14 Brian F Gilmer 2017-03-16 07:19:47 MDT
Cray Bug 849333 reports the same behavior.  They are running slurm 16.05.9.

In this case, 151,000 'updating assoc' messages appeared in the log directly after an HSN cable warmswap.

Chain of events:

- " xtwarmswap --remove"  2 HSN cables

> 2017-03-14 10:49:29.736966 29239 START xtwarmswap --remove-cable c1-1c2j31
> 2017-03-14 10:53:00.424124 26610 START xtwarmswap --remove-cable c1-1c2j36

- running " sinfo -R " yields these errors:

primary:~ # date; sinfo -R
Tue Mar 14 10:54:24 CET 2017
sinfo: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_partitions: Socket timed out on send/recv operation

Summary from the slurmctld.log
[2017-03-14T10:49:18.164] debug:  updating assoc 2338
   2753 10:49
   3259 10:51
   4659 10:52
   1833 10:53
   3568 10:54
   4283 10:55
   2747 10:56
   3817 10:57
   4545 10:58
   4332 10:59
   4372 11:00
   4171 11:01
   4057 11:02
   3969 11:03
   4118 11:04
   4162 11:05
   3898 11:06
   3996 11:07
   3929 11:08
   3362 11:09
   3186 11:10
   3048 11:11
   3097 11:12
   1893 11:13
   1475 11:14
   2674 11:15
   3432 11:16
   2981 11:17
   3048 11:18
   2992 11:19
   2938 11:20
   2635 11:21
    193 11:22
   3372 11:23
   2308 11:24
   3646 11:25
   3383 11:26
   1881 11:27
Comment 15 Tim Wickberg 2017-05-03 11:16:13 MDT
(In reply to Brian F Gilmer from comment #14)
> Cray Bug 849333 reports the same behavior.  They are running slurm 16.05.9.
> 
> In this case, 151,000 'updating assoc' messages appeared in the log directly
> after a HSN cable warmswap.
> 
> Chain of events:
> 
> - " xtwarmswap --remove"  2 HSN cables
> 
> > 2017-03-14 10:49:29.736966 29239 START xtwarmswap --remove-cable c1-1c2j31
> > 2017-03-14 10:53:00.424124 26610 START xtwarmswap --remove-cable c1-1c2j36
> 
> - doing " sinfo -R " yields this errors:
> 
> primary:~ # date; sinfo -R
> Tue Mar 14 10:54:24 CET 2017
> sinfo: error: slurm_receive_msg: Socket timed out on send/recv operation
> slurm_load_partitions: Socket timed out on send/recv operation
> 
> Summary from the slurmctld.log
> [2017-03-14T10:49:18.164] debug:  updating assoc 2338
>    2753 10:49
>    3259 10:51
>    4659 10:52
>    1833 10:53
>    3568 10:54
>    4283 10:55
>    2747 10:56

Brian -

Did you ask them to check for usernames without corresponding uids?

Doug -

Aside from potentially offering a flag to disable the assoc_mgr_set_missing_uids() lookup or change its frequency, I'm thinking this may be best closed as a duplicate of your long-standing slurmdbd transaction request?
Comment 16 Brian F Gilmer 2017-05-03 11:39:25 MDT
(In reply to Tim Wickberg from comment #15)
> (In reply to Brian F Gilmer from comment #14)

> Brian -
> 
> Did you ask them to check for usernames without corresponding uids?
> 

No, I will put that question to them.
Comment 17 Tim Wickberg 2017-05-17 10:41:51 MDT
> > Brian -
> > 
> > Did you ask them to check for usernames without corresponding uids?
> > 
> 
> No, I will put that question to them.

Brian - any updates?
Comment 18 Brian F Gilmer 2017-05-17 13:04:42 MDT
(In reply to Tim Wickberg from comment #17)
> > > Brian -
> > > 
> > > Did you ask them to check for usernames without corresponding uids?
> > > 
> > 
> > No, I will put that question to them.
> 
> Brian - any updates?

No response from the site.  I will poke them again.
Comment 19 Brian F Gilmer 2017-05-22 07:25:55 MDT
From the site:

No indication of the username/no uid condition.

However, after some additional investigation the customer feels that other actions being initiated against slurm at that time (loading a large amount of new account information into slurm) may have been the trigger for this event.

Customer has therefore asked that this particular bug not be pursued anymore, and be closed.
Comment 20 Doug Jacobsen 2017-05-22 08:07:18 MDT
This is my bug, please do not close it.

Comment 21 Brian F Gilmer 2017-05-22 08:23:35 MDT
(In reply to Doug Jacobsen from comment #20)
> This is my bug, please do not close it.
> 

Sorry Doug,

I have this bug associated with a Cray from CSCS.
Comment 23 Danny Auble 2017-06-26 13:51:26 MDT
Doug, this has been open for a while; I am updating the version of Slurm recorded on the ticket (please correct me if the one I picked isn't right).  Have you seen this after upgrading to 17.02?
Comment 24 Danny Auble 2017-07-20 14:53:23 MDT
Created attachment 4946 [details]
Patch to serialize updates coming from the dbd.

Doug, I think I finally have a handle on this.

Thread 27 locked assoc_mgr
Thread 14 has the node_write_lock and is waiting on the assoc_mgr lock to proceed.  This is causing the issue.

Thread 14 holds its locks because it is an update that has already finished its changes inside the assoc_mgr and is now applying those changes, outside the assoc_mgr lock, to various components that may need to take a slurmctld lock and then an assoc_mgr lock.

Now another update comes in (Thread 27), grabs the assoc_mgr lock, and takes forever in the getpwnam_r calls.  So now we have both the node_write_lock and the assoc_mgr lock held.  Only after Thread 27 finishes can anything continue.

In this patch I have introduced a lock around how these updates get processed, so only one happens at a time.  This should eliminate the scenario of one thread's locks holding up another's.  Since the response is sent before the throttle happens, it won't slow down the dbd either.

Please try this out and let me know how it works out.  In my testing it has shown very helpful.
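The actual change is in the attached patch; as a hedged sketch of the idea only (the type and function names here are invented stand-ins, not the real Slurm RPC plumbing), the handler replies to slurmdbd first and then serializes all update processing on one mutex:

```c
#include <pthread.h>

typedef struct { int id; } slurm_msg_t;   /* stand-in for the real type */

int updates_applied = 0;

static pthread_mutex_t acct_update_lock = PTHREAD_MUTEX_INITIALIZER;

/* hypothetical stand-ins for the real RPC and assoc_mgr machinery */
static void send_rpc_rc(slurm_msg_t *msg, int rc)
{
	(void)msg; (void)rc;    /* would answer slurmdbd here */
}

static void apply_accounting_update(slurm_msg_t *msg)
{
	(void)msg;              /* would take assoc_mgr/slurmctld locks and
				 * do the getpwnam_r lookups here */
	updates_applied++;
}

void handle_accounting_update(slurm_msg_t *msg)
{
	/* acknowledge first, so the throttle never slows slurmdbd down */
	send_rpc_rc(msg, 0);

	/* serialize: only one update at a time works through the
	 * assoc_mgr and per-component locks, so two updates can never
	 * hold complementary locks and stall each other */
	pthread_mutex_lock(&acct_update_lock);
	apply_accounting_update(msg);
	pthread_mutex_unlock(&acct_update_lock);
}
```

Replying before taking the mutex is what keeps the dbd's sender from backing up while a slow update (e.g. one stuck in getpwnam_r) drains.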
Comment 25 Doug Jacobsen 2017-07-21 05:36:10 MDT
Hi Danny,

This looks good to me.  I'll include it in my edison build of 17.02.6 I'm
prepping right now.

Thanks!
Doug

Comment 28 Danny Auble 2017-07-21 15:59:52 MDT
Doug, this has been committed as 24375cb8622d.  Please reopen if you still see issues here.