Created attachment 30163 [details]
coredump of slurmctld

Dear SchedMD support,

We have an HA setup with 2 slurmctld + 2 slurmdbd and a galera cluster underneath. slurmdbd connects to mysql/galera through localhost.

To reproduce the issue: stop slurmctld on the primary controller. Everything fails over to the backup controller after 120 seconds, but slurmctld crashes and generates the attached coredump. This happens when the whole database cluster is unreachable.

#0  0x00001487e4709b21 in dbd_conn_close (pc=pc@entry=0x511298 <acct_db_conn>) at dbd_conn.c:214
#1  0x00001487e47043b3 in acct_storage_p_close_connection (db_conn=0x511298 <acct_db_conn>) at accounting_storage_slurmdbd.c:667
#2  0x00001487e5a105bc in acct_storage_g_close_connection (db_conn=db_conn@entry=0x511298 <acct_db_conn>) at slurm_accounting_storage.c:376
#3  0x00000000004320b1 in ctld_assoc_mgr_init () at controller.c:2374
#4  0x000000000042da9f in run_backup () at backup.c:249
#5  0x00000000004345e8 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:607

Just let us know if you need more info, such as slurmctld.conf or anything else.

Best regards & thanks.
Hi,

Could you send me the output from:

gdb -ex 't a a bt full' -batch /usr/sbin/slurmctld <core_file>

Dominik
Sure, here you have:

# gdb -ex 't a a bt full' -batch /usr/sbin/slurmctld /tmp/core.slurmctld.292.3c6ba13631c3465e8a3052617b106050.226518.1678184729000000
[New LWP 226518]
[New LWP 226520]
[New LWP 227195]
[New LWP 463373]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld -D -s'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00001487e4709b21 in dbd_conn_close (pc=pc@entry=0x511298 <acct_db_conn>) at dbd_conn.c:214
214	dbd_conn.c: No such file or directory.
[Current thread is 1 (Thread 0x1487e56282c0 (LWP 226518))]

Thread 4 (Thread 0x1487e4a2e640 (LWP 463373)):
#0  0x00001487e57b439a in __futex_abstimed_wait_common () from /lib64/libc.so.6
No symbol table info available.
#1  0x00001487e57b6ea4 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libc.so.6
No symbol table info available.
#2  0x00001487e4703cec in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:498
        err = <optimized out>
        job_ptr = <optimized out>
        itr = <optimized out>
        tvnow = {tv_sec = 1678184729, tv_usec = 648514}
        abs = {tv_sec = 1678184734, tv_nsec = 648514000}
        job_read_lock = {conf = NO_LOCK, job = READ_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
        job_write_lock = {conf = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
        local_job_list = 0x14828c000ac0
        __func__ = "_set_db_inx_thread"
#3  0x00001487e57b7802 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#4  0x00001487e5757450 in clone3 () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x1487e4b2f640 (LWP 227195)):
#0  0x00001487e57b439a in __futex_abstimed_wait_common () from /lib64/libc.so.6
No symbol table info available.
#1  0x00001487e57b6ea4 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libc.so.6
No symbol table info available.
#2  0x0000000000429fe7 in _agent_init (arg=<optimized out>) at agent.c:1422
        err = <optimized out>
        min_wait = <optimized out>
        mail_too = <optimized out>
        ts = {tv_sec = 1678184730, tv_nsec = 0}
        last_defer_attempt = 1678121162
        __func__ = "_agent_init"
#3  0x00001487e57b7802 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#4  0x00001487e5757450 in clone3 () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x1487e4f20640 (LWP 226520)):
#0  0x00001487e585a71f in poll () from /lib64/libc.so.6
No symbol table info available.
#1  0x00001487e59b8aed in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x1487e0000b70) at /usr/include/bits/poll2.h:39
No locals.
#2  _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x1487e0000b70) at eio.c:351
        n = <optimized out>
        timeout = -1
#3  eio_handle_mainloop (eio=<optimized out>) at eio.c:314
        retval = 0
        pollfds = 0x1487e0000b70
        map = 0x1487e0000ba0
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
        error = <optimized out>
#4  0x00000000004c246a in _slurmctld_listener_thread (x=<optimized out>) at slurmscriptd.c:988
        __func__ = "_slurmctld_listener_thread"
#5  0x00001487e57b7802 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#6  0x00001487e5757450 in clone3 () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x1487e56282c0 (LWP 226518)):
#0  0x00001487e4709b21 in dbd_conn_close (pc=pc@entry=0x511298 <acct_db_conn>) at dbd_conn.c:214
        rc = <optimized out>
        buffer = <optimized out>
        req = {close_conn = 0, commit = 0}
        __func__ = "dbd_conn_close"
#1  0x00001487e47043b3 in acct_storage_p_close_connection (db_conn=0x511298 <acct_db_conn>) at accounting_storage_slurmdbd.c:667
No locals.
#2  0x00001487e5a105bc in acct_storage_g_close_connection (db_conn=db_conn@entry=0x511298 <acct_db_conn>) at slurm_accounting_storage.c:376
No locals.
#3  0x00000000004320b1 in ctld_assoc_mgr_init () at controller.c:2374
        assoc_init_arg = {cache_level = 55, enforce = 11, running_cache = 0x510f68 <running_cache>, add_license_notify = 0x4800d9 <license_add_remote>, resize_qos_notify = 0x430e62 <_resize_qos>, remove_assoc_notify = 0x430de4 <_remove_assoc>, remove_license_notify = 0x48036d <license_remove_remote>, remove_qos_notify = 0x430c93 <_remove_qos>, state_save_location = 0x510ad8 <slurm_conf+1368>, sync_license_notify = 0x4804e7 <license_sync_remote>, update_assoc_notify = 0x430bbd <_update_assoc>, update_cluster_tres = 0x430a02 <_update_cluster_tres>, update_license_notify = 0x4801e0 <license_update_remote>, update_qos_notify = 0x430ae4 <_update_qos>, update_resvs = 0x4c04e1 <update_assocs_in_resvs>}
        num_jobs = 0
        job_read_lock = {conf = NO_LOCK, job = READ_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
        __func__ = "ctld_assoc_mgr_init"
#4  0x000000000042da9f in run_backup () at backup.c:249
        i = <optimized out>
        last_ping = 1678184729
        config_read_lock = {conf = READ_LOCK, job = NO_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
        config_write_lock = {conf = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = WRITE_LOCK, fed = NO_LOCK}
        __func__ = "run_backup"
#5  0x00000000004345e8 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:607
        cnt = <optimized out>
        error_code = <optimized out>
        i = 3
        start = {tv_sec = 1678120838, tv_usec = 154946}
        now = {tv_sec = 1678120838, tv_usec = 154981}
        stat_buf = {st_dev = 64768, st_ino = 537007778, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 29080, st_blksize = 4096, st_blocks = 64, st_atim = {tv_sec = 1678110902, tv_nsec = 516771000}, st_mtim = {tv_sec = 1653987570, tv_nsec = 0}, st_ctim = {tv_sec = 1678110522, tv_nsec = 105312364}, __glibc_reserved = {0, 0, 0}}
        rlim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        config_write_lock = {conf = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = WRITE_LOCK, fed = NO_LOCK}
        prep_callbacks = {prolog_slurmctld = 0x49c8e1 <prep_prolog_slurmctld_callback>, epilog_slurmctld = 0x49cbc9 <prep_epilog_slurmctld_callback>}
        create_clustername_file = <optimized out>
        conf_file = <optimized out>
        __func__ = "main"
Hi,

I can reproduce this crash; I'll let you know when the fix is ready.

Dominik
Created attachment 30475 [details]
ib17be-001 slurmd bt full (non-zombie)

Hi,

I don't think I can get gdb to bt zombies. I've attached a bt for the non-zombie slurmd though.

root@ib17be-001:~# ps auxw|grep slurm
root      144230  0.0  0.0 211800  7800 ?  Sl   May25   0:01 slurmstepd: [110.extern]
root      144921  0.0  0.0      0     0 ?  Zs   May25   0:00 [slurmd] <defunct>
root      144922  0.0  0.0  11184  7648 ?  S    May25   0:00 slurmstepd: [110.0]
root      234003  0.0  0.0   6432   656 pts/0  S+  16:12   0:00 grep slurm
root      959375  0.0  0.0 280028 11188 ?  Ssl  May24   0:12 /usr/local/stow/slurm/sbin/slurmd -D -s
root@ib17be-001:~# gdb -ex 't a a bt full' -batch -p 144921
Could not attach to process.  If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
warning: process 144921 is a zombie - the process has already terminated
ptrace: Operation not permitted.
root@ib17be-001:~# id
uid=0(root) gid=0(root) groups=0(root)
root@ib17be-001:~# sysctl kernel.yama.ptrace_scope
kernel.yama.ptrace_scope = 0
Sorry, the attachment and comment I added were for a different ticket.
Dear SchedMD support,

Do you have an estimate of when the patch fixing this issue will be available?

Thanks!
Hi,

These patches fix several cases where an HA configuration crashes or behaves improperly. All of them will be included in the 23.02.3 release and later:

https://github.com/SchedMD/slurm/commit/846820c4fc
https://github.com/SchedMD/slurm/commit/94e610201b
https://github.com/SchedMD/slurm/commit/fa0a269cfc
https://github.com/SchedMD/slurm/commit/45b4913225
https://github.com/SchedMD/slurm/commit/1eebe1130c

Let me know if we can close this issue or if you have any additional questions.

Dominik
Hi,

I'll go ahead and mark the case as fixed. If you have any questions, please reopen.

Dominik
*** Ticket 17205 has been marked as a duplicate of this ticket. ***
*** Ticket 17350 has been marked as a duplicate of this ticket. ***
*** Ticket 17329 has been marked as a duplicate of this ticket. ***