Ticket 16669

Summary: slurmctld crashes when doing HA tests
Product: Slurm Reporter: Javier Bartolome <javier.bartolome>
Component: slurmctld    Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: andrew.phillips2, bart, ejaco020, kilian, rjohnson, roshni.kp
Version: 22.05.8   
Hardware: Linux   
OS: Linux   
Site: BSC
Version Fixed: 23.02.3
Attachments: coredump of slurmctld
ib17be-001 slurmd bt full (non-zombie)

Description Javier Bartolome 2023-05-08 08:07:36 MDT
Created attachment 30163 [details]
coredump of slurmctld

Dear SchedMD support,

We have an HA setup, with 2 slurmctld + 2 slurmdbd and a Galera cluster underneath.
The slurmdbd connection to MySQL/Galera goes through localhost.

The way to reproduce the issue is to stop slurmctld on the primary controller; everything fails over to the backup controller after 120 seconds, but the backup slurmctld crashes and generates the attached coredump.

This happens when the whole cluster is not accessible.

#0  0x00001487e4709b21 in dbd_conn_close (pc=pc@entry=0x511298 <acct_db_conn>) at dbd_conn.c:214

#1  0x00001487e47043b3 in acct_storage_p_close_connection (db_conn=0x511298 <acct_db_conn>) at accounting_storage_slurmdbd.c:667

#2  0x00001487e5a105bc in acct_storage_g_close_connection (db_conn=db_conn@entry=0x511298 <acct_db_conn>) at slurm_accounting_storage.c:376

#3  0x00000000004320b1 in ctld_assoc_mgr_init () at controller.c:2374

#4  0x000000000042da9f in run_backup () at backup.c:249

#5  0x00000000004345e8 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:607

Just let us know if you need more info, such as slurmctld.conf or anything else.

Best Regards & thanks.
Comment 2 Dominik Bartkiewicz 2023-05-08 08:21:45 MDT
Hi

Could you send me the output from:

gdb -ex 't a a bt full' -batch /usr/sbin/slurmctld <core_file>

Dominik
Comment 3 Javier Bartolome 2023-05-08 08:38:25 MDT
Sure, here you go:

# gdb -ex 't a a bt full' -batch /usr/sbin/slurmctld /tmp/core.slurmctld.292.3c6ba13631c3465e8a3052617b106050.226518.1678184729000000
[New LWP 226518]
[New LWP 226520]
[New LWP 227195]
[New LWP 463373]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld -D -s'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00001487e4709b21 in dbd_conn_close (pc=pc@entry=0x511298 <acct_db_conn>) at dbd_conn.c:214
214	dbd_conn.c: No such file or directory.
[Current thread is 1 (Thread 0x1487e56282c0 (LWP 226518))]

Thread 4 (Thread 0x1487e4a2e640 (LWP 463373)):
#0  0x00001487e57b439a in __futex_abstimed_wait_common () from /lib64/libc.so.6
No symbol table info available.
#1  0x00001487e57b6ea4 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libc.so.6
No symbol table info available.
#2  0x00001487e4703cec in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:498
        err = <optimized out>
        job_ptr = <optimized out>
        itr = <optimized out>
        tvnow = {tv_sec = 1678184729, tv_usec = 648514}
        abs = {tv_sec = 1678184734, tv_nsec = 648514000}
        job_read_lock = {conf = NO_LOCK, job = READ_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
        job_write_lock = {conf = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
        local_job_list = 0x14828c000ac0
        __func__ = "_set_db_inx_thread"
#3  0x00001487e57b7802 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#4  0x00001487e5757450 in clone3 () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x1487e4b2f640 (LWP 227195)):
#0  0x00001487e57b439a in __futex_abstimed_wait_common () from /lib64/libc.so.6
No symbol table info available.
#1  0x00001487e57b6ea4 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libc.so.6
No symbol table info available.
#2  0x0000000000429fe7 in _agent_init (arg=<optimized out>) at agent.c:1422
        err = <optimized out>
        min_wait = <optimized out>
        mail_too = <optimized out>
        ts = {tv_sec = 1678184730, tv_nsec = 0}
        last_defer_attempt = 1678121162
        __func__ = "_agent_init"
#3  0x00001487e57b7802 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#4  0x00001487e5757450 in clone3 () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x1487e4f20640 (LWP 226520)):
#0  0x00001487e585a71f in poll () from /lib64/libc.so.6
No symbol table info available.
#1  0x00001487e59b8aed in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x1487e0000b70) at /usr/include/bits/poll2.h:39
No locals.
#2  _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x1487e0000b70) at eio.c:351
        n = <optimized out>
        timeout = -1
#3  eio_handle_mainloop (eio=<optimized out>) at eio.c:314
        retval = 0
        pollfds = 0x1487e0000b70
        map = 0x1487e0000ba0
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
        error = <optimized out>
#4  0x00000000004c246a in _slurmctld_listener_thread (x=<optimized out>) at slurmscriptd.c:988
        __func__ = "_slurmctld_listener_thread"
#5  0x00001487e57b7802 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#6  0x00001487e5757450 in clone3 () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x1487e56282c0 (LWP 226518)):
#0  0x00001487e4709b21 in dbd_conn_close (pc=pc@entry=0x511298 <acct_db_conn>) at dbd_conn.c:214
        rc = <optimized out>
        buffer = <optimized out>
        req = {close_conn = 0, commit = 0}
        __func__ = "dbd_conn_close"
#1  0x00001487e47043b3 in acct_storage_p_close_connection (db_conn=0x511298 <acct_db_conn>) at accounting_storage_slurmdbd.c:667
No locals.
#2  0x00001487e5a105bc in acct_storage_g_close_connection (db_conn=db_conn@entry=0x511298 <acct_db_conn>) at slurm_accounting_storage.c:376
No locals.
#3  0x00000000004320b1 in ctld_assoc_mgr_init () at controller.c:2374
        assoc_init_arg = {cache_level = 55, enforce = 11, running_cache = 0x510f68 <running_cache>, add_license_notify = 0x4800d9 <license_add_remote>, resize_qos_notify = 0x430e62 <_resize_qos>, remove_assoc_notify = 0x430de4 <_remove_assoc>, remove_license_notify = 0x48036d <license_remove_remote>, remove_qos_notify = 0x430c93 <_remove_qos>, state_save_location = 0x510ad8 <slurm_conf+1368>, sync_license_notify = 0x4804e7 <license_sync_remote>, update_assoc_notify = 0x430bbd <_update_assoc>, update_cluster_tres = 0x430a02 <_update_cluster_tres>, update_license_notify = 0x4801e0 <license_update_remote>, update_qos_notify = 0x430ae4 <_update_qos>, update_resvs = 0x4c04e1 <update_assocs_in_resvs>}
        num_jobs = 0
        job_read_lock = {conf = NO_LOCK, job = READ_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
        __func__ = "ctld_assoc_mgr_init"
#4  0x000000000042da9f in run_backup () at backup.c:249
        i = <optimized out>
        last_ping = 1678184729
        config_read_lock = {conf = READ_LOCK, job = NO_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
        config_write_lock = {conf = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = WRITE_LOCK, fed = NO_LOCK}
        __func__ = "run_backup"
#5  0x00000000004345e8 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:607
        cnt = <optimized out>
        error_code = <optimized out>
        i = 3
        start = {tv_sec = 1678120838, tv_usec = 154946}
        now = {tv_sec = 1678120838, tv_usec = 154981}
        stat_buf = {st_dev = 64768, st_ino = 537007778, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 29080, st_blksize = 4096, st_blocks = 64, st_atim = {tv_sec = 1678110902, tv_nsec = 516771000}, st_mtim = {tv_sec = 1653987570, tv_nsec = 0}, st_ctim = {tv_sec = 1678110522, tv_nsec = 105312364}, __glibc_reserved = {0, 0, 0}}
        rlim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        config_write_lock = {conf = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = WRITE_LOCK, fed = NO_LOCK}
        prep_callbacks = {prolog_slurmctld = 0x49c8e1 <prep_prolog_slurmctld_callback>, epilog_slurmctld = 0x49cbc9 <prep_epilog_slurmctld_callback>}
        create_clustername_file = <optimized out>
        conf_file = <optimized out>
        __func__ = "main"
Comment 4 Dominik Bartkiewicz 2023-05-10 07:12:10 MDT
Hi

I can reproduce this crash; I will let you know when the fix is ready.

Dominik
Comment 6 Andrew Phillips 2023-05-26 10:14:01 MDT
Created attachment 30475 [details]
ib17be-001 slurmd bt full (non-zombie)

Hi,

I don't think I can get gdb to backtrace zombies. I've attached a bt for the non-zombie slurmd, though.

root@ib17be-001:~# ps auxw|grep slurm
root     144230  0.0  0.0 211800  7800 ?        Sl   May25   0:01 slurmstepd: [110.extern]
root     144921  0.0  0.0      0     0 ?        Zs   May25   0:00 [slurmd] <defunct>
root     144922  0.0  0.0  11184  7648 ?        S    May25   0:00 slurmstepd: [110.0]
root     234003  0.0  0.0   6432   656 pts/0    S+   16:12   0:00 grep slurm
root     959375  0.0  0.0 280028 11188 ?        Ssl  May24   0:12 /usr/local/stow/slurm/sbin/slurmd -D -s

root@ib17be-001:~# gdb -ex 't a a bt full' -batch -p 144921
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
warning: process 144921 is a zombie - the process has already terminated
ptrace: Operation not permitted.


root@ib17be-001:~# id
uid=0(root) gid=0(root) groups=0(root)

root@ib17be-001:~# sysctl kernel.yama.ptrace_scope
kernel.yama.ptrace_scope = 0
Comment 7 Andrew Phillips 2023-05-26 10:15:16 MDT
Sorry, the attachment and comment I added were for a different ticket.
Comment 15 Javier Bartolome 2023-06-09 02:55:53 MDT
Dear schedmd support,

Do you have an estimate of when the patch solving this issue will be available?

thanks !!
Comment 16 Dominik Bartkiewicz 2023-06-09 03:09:08 MDT
Hi

These patches fix various cases where an HA configuration crashes or works improperly. All of them will be included in the 23.02.3 release and later.

https://github.com/SchedMD/slurm/commit/846820c4fc
https://github.com/SchedMD/slurm/commit/94e610201b
https://github.com/SchedMD/slurm/commit/fa0a269cfc
https://github.com/SchedMD/slurm/commit/45b4913225
https://github.com/SchedMD/slurm/commit/1eebe1130c

Let me know if we can close this issue or if you have any additional questions.

Dominik
Comment 17 Dominik Bartkiewicz 2023-06-16 09:00:13 MDT
Hi

I'll go ahead and mark this case as fixed.
If you have any questions, please reopen.

Dominik
Comment 18 Nate Rini 2023-07-14 09:28:14 MDT
*** Ticket 17205 has been marked as a duplicate of this ticket. ***
Comment 19 Jason Booth 2023-08-03 15:51:29 MDT
*** Ticket 17350 has been marked as a duplicate of this ticket. ***
Comment 20 Carlos Tripiana Montes 2023-08-08 23:28:46 MDT
*** Ticket 17329 has been marked as a duplicate of this ticket. ***