| Summary: | slurmctld crashes when doing HA tests | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Javier Bartolome <javier.bartolome> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | Priority: | --- |
| Version: | 22.05.8 | Version Fixed: | 23.02.3 |
| Hardware: | Linux | OS: | Linux |
| Site: | BSC | CC: | andrew.phillips2, bart, ejaco020, kilian, rjohnson, roshni.kp |
| Attachments: | coredump of slurmctld; ib17be-001 slurmd bt full (non-zombie) | | |
Hi,
Could you send me the output from:
gdb -ex 't a a bt full' -batch /usr/sbin/slurmctld <core_file>
Dominik

Sure, here it is:
# gdb -ex 't a a bt full' -batch /usr/sbin/slurmctld /tmp/core.slurmctld.292.3c6ba13631c3465e8a3052617b106050.226518.1678184729000000
[New LWP 226518]
[New LWP 226520]
[New LWP 227195]
[New LWP 463373]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld -D -s'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00001487e4709b21 in dbd_conn_close (pc=pc@entry=0x511298 <acct_db_conn>) at dbd_conn.c:214
214 dbd_conn.c: No such file or directory.
[Current thread is 1 (Thread 0x1487e56282c0 (LWP 226518))]
Thread 4 (Thread 0x1487e4a2e640 (LWP 463373)):
#0 0x00001487e57b439a in __futex_abstimed_wait_common () from /lib64/libc.so.6
No symbol table info available.
#1 0x00001487e57b6ea4 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libc.so.6
No symbol table info available.
#2 0x00001487e4703cec in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:498
err = <optimized out>
job_ptr = <optimized out>
itr = <optimized out>
tvnow = {tv_sec = 1678184729, tv_usec = 648514}
abs = {tv_sec = 1678184734, tv_nsec = 648514000}
job_read_lock = {conf = NO_LOCK, job = READ_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
job_write_lock = {conf = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
local_job_list = 0x14828c000ac0
__func__ = "_set_db_inx_thread"
#3 0x00001487e57b7802 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#4 0x00001487e5757450 in clone3 () from /lib64/libc.so.6
No symbol table info available.
Thread 3 (Thread 0x1487e4b2f640 (LWP 227195)):
#0 0x00001487e57b439a in __futex_abstimed_wait_common () from /lib64/libc.so.6
No symbol table info available.
#1 0x00001487e57b6ea4 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libc.so.6
No symbol table info available.
#2 0x0000000000429fe7 in _agent_init (arg=<optimized out>) at agent.c:1422
err = <optimized out>
min_wait = <optimized out>
mail_too = <optimized out>
ts = {tv_sec = 1678184730, tv_nsec = 0}
last_defer_attempt = 1678121162
__func__ = "_agent_init"
#3 0x00001487e57b7802 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#4 0x00001487e5757450 in clone3 () from /lib64/libc.so.6
No symbol table info available.
Thread 2 (Thread 0x1487e4f20640 (LWP 226520)):
#0 0x00001487e585a71f in poll () from /lib64/libc.so.6
No symbol table info available.
#1 0x00001487e59b8aed in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x1487e0000b70) at /usr/include/bits/poll2.h:39
No locals.
#2 _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x1487e0000b70) at eio.c:351
n = <optimized out>
timeout = -1
#3 eio_handle_mainloop (eio=<optimized out>) at eio.c:314
retval = 0
pollfds = 0x1487e0000b70
map = 0x1487e0000ba0
maxnfds = 1
nfds = 2
n = <optimized out>
shutdown_time = <optimized out>
__func__ = "eio_handle_mainloop"
error = <optimized out>
#4 0x00000000004c246a in _slurmctld_listener_thread (x=<optimized out>) at slurmscriptd.c:988
__func__ = "_slurmctld_listener_thread"
#5 0x00001487e57b7802 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#6 0x00001487e5757450 in clone3 () from /lib64/libc.so.6
No symbol table info available.
Thread 1 (Thread 0x1487e56282c0 (LWP 226518)):
#0 0x00001487e4709b21 in dbd_conn_close (pc=pc@entry=0x511298 <acct_db_conn>) at dbd_conn.c:214
rc = <optimized out>
buffer = <optimized out>
req = {close_conn = 0, commit = 0}
__func__ = "dbd_conn_close"
#1 0x00001487e47043b3 in acct_storage_p_close_connection (db_conn=0x511298 <acct_db_conn>) at accounting_storage_slurmdbd.c:667
No locals.
#2 0x00001487e5a105bc in acct_storage_g_close_connection (db_conn=db_conn@entry=0x511298 <acct_db_conn>) at slurm_accounting_storage.c:376
No locals.
#3 0x00000000004320b1 in ctld_assoc_mgr_init () at controller.c:2374
assoc_init_arg = {cache_level = 55, enforce = 11, running_cache = 0x510f68 <running_cache>, add_license_notify = 0x4800d9 <license_add_remote>, resize_qos_notify = 0x430e62 <_resize_qos>, remove_assoc_notify = 0x430de4 <_remove_assoc>, remove_license_notify = 0x48036d <license_remove_remote>, remove_qos_notify = 0x430c93 <_remove_qos>, state_save_location = 0x510ad8 <slurm_conf+1368>, sync_license_notify = 0x4804e7 <license_sync_remote>, update_assoc_notify = 0x430bbd <_update_assoc>, update_cluster_tres = 0x430a02 <_update_cluster_tres>, update_license_notify = 0x4801e0 <license_update_remote>, update_qos_notify = 0x430ae4 <_update_qos>, update_resvs = 0x4c04e1 <update_assocs_in_resvs>}
num_jobs = 0
job_read_lock = {conf = NO_LOCK, job = READ_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
__func__ = "ctld_assoc_mgr_init"
#4 0x000000000042da9f in run_backup () at backup.c:249
i = <optimized out>
last_ping = 1678184729
config_read_lock = {conf = READ_LOCK, job = NO_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
config_write_lock = {conf = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = WRITE_LOCK, fed = NO_LOCK}
__func__ = "run_backup"
#5 0x00000000004345e8 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:607
cnt = <optimized out>
error_code = <optimized out>
i = 3
start = {tv_sec = 1678120838, tv_usec = 154946}
now = {tv_sec = 1678120838, tv_usec = 154981}
stat_buf = {st_dev = 64768, st_ino = 537007778, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 29080, st_blksize = 4096, st_blocks = 64, st_atim = {tv_sec = 1678110902, tv_nsec = 516771000}, st_mtim = {tv_sec = 1653987570, tv_nsec = 0}, st_ctim = {tv_sec = 1678110522, tv_nsec = 105312364}, __glibc_reserved = {0, 0, 0}}
rlim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
config_write_lock = {conf = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = WRITE_LOCK, fed = NO_LOCK}
prep_callbacks = {prolog_slurmctld = 0x49c8e1 <prep_prolog_slurmctld_callback>, epilog_slurmctld = 0x49cbc9 <prep_epilog_slurmctld_callback>}
create_clustername_file = <optimized out>
conf_file = <optimized out>
__func__ = "main"
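Threads 2 through 4 above are idle worker threads; threads 3 and 4 are parked in pthread_cond_timedwait(), and their tvnow/abs locals show the usual pattern of turning a gettimeofday() reading into an absolute deadline five seconds out. A minimal standalone sketch of that pattern follows (not Slurm's code; the 5-second interval is taken from the locals above):

```c
/* Minimal sketch of the wait pattern visible in threads 3 and 4:
 * convert "now" into an absolute deadline and block on a condition
 * variable until signalled or timed out. Not Slurm's actual code;
 * the 5-second interval mirrors the tvnow/abs locals above. */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

int main(void)
{
    struct timeval tvnow;
    struct timespec abs;

    gettimeofday(&tvnow, NULL);
    abs.tv_sec = tvnow.tv_sec + 5;       /* deadline 5 s from now */
    abs.tv_nsec = tvnow.tv_usec * 1000;  /* usec -> nsec */

    pthread_mutex_lock(&lock);
    /* Nobody signals `cond`, so this returns ETIMEDOUT after ~5 s. */
    int rc = pthread_cond_timedwait(&cond, &lock, &abs);
    pthread_mutex_unlock(&lock);

    printf("timedwait returned %s\n", rc == ETIMEDOUT ? "ETIMEDOUT" : "0");
    return 0;
}
```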
Hi,
I can reproduce this crash; I will let you know when the fix is ready.
Dominik

Created attachment 30475 [details]
ib17be-001 slurmd bt full (non-zombie)
Hi,
I don't think I can get gdb to backtrace the zombie processes. I've attached a bt for the non-zombie slurmd, though.
root@ib17be-001:~# ps auxw|grep slurm
root 144230 0.0 0.0 211800 7800 ? Sl May25 0:01 slurmstepd: [110.extern]
root 144921 0.0 0.0 0 0 ? Zs May25 0:00 [slurmd] <defunct>
root 144922 0.0 0.0 11184 7648 ? S May25 0:00 slurmstepd: [110.0]
root 234003 0.0 0.0 6432 656 pts/0 S+ 16:12 0:00 grep slurm
root 959375 0.0 0.0 280028 11188 ? Ssl May24 0:12 /usr/local/stow/slurm/sbin/slurmd -D -s
root@ib17be-001:~# gdb -ex 't a a bt full' -batch -p 144921
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
warning: process 144921 is a zombie - the process has already terminated
ptrace: Operation not permitted.
root@ib17be-001:~# id
uid=0(root) gid=0(root) groups=0(root)
root@ib17be-001:~# sysctl kernel.yama.ptrace_scope
kernel.yama.ptrace_scope = 0
Sorry, the attachment and comment I added were for a different ticket.

Dear SchedMD support,
Do you have an estimate of when the patch fixing this issue will be available? Thanks!

Hi,
These patches fix various cases where an HA configuration crashes or misbehaves. All of them will be included in the 23.02.3 release and later:
https://github.com/SchedMD/slurm/commit/846820c4fc
https://github.com/SchedMD/slurm/commit/94e610201b
https://github.com/SchedMD/slurm/commit/fa0a269cfc
https://github.com/SchedMD/slurm/commit/45b4913225
https://github.com/SchedMD/slurm/commit/1eebe1130c
Let me know if we can close this ticket or if you have any additional questions.
Dominik

Hi,
I'll go ahead and mark the case as fixed. If you have any questions, please reopen.
Dominik

*** Ticket 17205 has been marked as a duplicate of this ticket. ***
*** Ticket 17350 has been marked as a duplicate of this ticket. ***
*** Ticket 17329 has been marked as a duplicate of this ticket. ***
Created attachment 30163 [details]
coredump of slurmctld

Dear SchedMD support,
We have an HA setup with two slurmctld and two slurmdbd instances and a Galera cluster underneath. slurmdbd connects to MySQL/Galera through localhost. To reproduce the issue, stop slurmctld on the primary controller: everything fails over to the backup controller after 120 seconds, but the backup slurmctld crashes and generates the attached core dump. This happens when the whole cluster is unreachable.
#0 0x00001487e4709b21 in dbd_conn_close (pc=pc@entry=0x511298 <acct_db_conn>) at dbd_conn.c:214
#1 0x00001487e47043b3 in acct_storage_p_close_connection (db_conn=0x511298 <acct_db_conn>) at accounting_storage_slurmdbd.c:667
#2 0x00001487e5a105bc in acct_storage_g_close_connection (db_conn=db_conn@entry=0x511298 <acct_db_conn>) at slurm_accounting_storage.c:376
#3 0x00000000004320b1 in ctld_assoc_mgr_init () at controller.c:2374
#4 0x000000000042da9f in run_backup () at backup.c:249
#5 0x00000000004345e8 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:607
Just let us know if you need more information, such as slurm.conf or anything else.
Best regards & thanks.
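The crash is consistent with the backup controller's takeover path closing an accounting-storage connection that was never fully established while the DBD was unreachable. As a purely hypothetical illustration — invented names and bodies, not the actual Slurm code or the 23.02.3 patch — this is the kind of guard a connection-close path needs to survive that situation:

```c
/* Hypothetical sketch: a close path that tolerates a connection
 * object that is NULL or was never opened. The function name
 * loosely mirrors the backtrace (dbd_conn_close), but everything
 * here is invented for illustration; it is NOT the actual fix. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int fd;    /* socket to slurmdbd; -1 while unconnected */
} dbd_conn_t;

static int dbd_conn_close(dbd_conn_t **pc)
{
    /* Guard: a backup controller may reach this path before any
     * connection to the DBD was established. */
    if (!pc || !*pc) {
        fprintf(stderr, "dbd_conn_close: nothing to close\n");
        return -1;
    }
    if ((*pc)->fd >= 0) {
        /* Real code would send a shutdown request and close the
         * socket here; the req = {close_conn, commit} locals in the
         * backtrace correspond to building such a request. */
        (*pc)->fd = -1;
    }
    free(*pc);
    *pc = NULL;
    return 0;
}

int main(void)
{
    dbd_conn_t *conn = NULL;
    /* Closing a never-opened connection must be a safe no-op: */
    dbd_conn_close(&conn);
    return 0;
}
```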