| Summary: | Slurmctld crash | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | CSC sysadmins <csc-slurm-tickets> |
| Component: | slurmctld | Assignee: | David Bigagli <david> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | brian, da |
| Version: | 14.11.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | CSC - IT Center for Science | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | job submit plugin, config | | |
Description

CSC sysadmins 2015-05-21 23:50:32 MDT

rpms for el6.6, core, conf and full log file: https://kannu.csc.fi/index.php/s/HUcQ2wjqDOhdtAy

The core indicates memory corruption when processing the 'scontrol show job' command. We will try to reproduce by running the controller under valgrind. How many jobs did you have in the system at that time? Is this reproducible somehow, or just a random core dump?

David

> The core indicates memory corruption when processing the 'scontrol show job'
> command. We will try to reproduce running the controller under valgrind.
> How many jobs did you have in the system at that time? Is this reproducible
> somehow or just a random core dump?

This was the first occurrence since I upgraded Slurm a couple of days ago. The previous version (14.11.4) was stable for many months.

Currently there are a bit over 3000 jobs running and 1000 jobs pending.

Running a large job array workload under valgrind did not show any memory corruption so far. We tried several job updates and dependencies, but saw no errors. Do you have an idea of what triggers it and how to reproduce it?

David

New core file; the job count was a bit over 2300. I've no idea how to reproduce this. https://kannu.csc.fi/index.php/s/HUcQ2wjqDOhdtAy/download?path=%2F&files=core.16263.xz

[2015-06-01T17:32:51.680] job_complete: JobID=5827611_23(5827637) State=0x8005 NodeCnt=1 done
[2015-06-01T17:33:05.923] _slurm_rpc_submit_batch_job JobId=5827835 usec=1880
[2015-06-01T17:33:05.948] _slurm_rpc_submit_batch_job JobId=5827836 usec=1516
[2015-06-01T17:33:05.962] _slurm_rpc_submit_batch_job JobId=5827837 usec=1264
[2015-06-01T17:33:05.977] _slurm_rpc_submit_batch_job JobId=5827838 usec=1182
[2015-06-01T17:33:05.991] _slurm_rpc_submit_batch_job JobId=5827839 usec=1181
[2015-06-01T17:33:46.807] pidfile not locked, assuming no running daemon
[2015-06-01T17:33:46.823] slurmctld version 14.11.6 started on cluster csc

Hi, the core file is unfortunately not useful to us, as we have a different build environment. Could you please print the stack using gdb:

(gdb) thread apply all bt

so we can see if this has the same stack as the previous core dump.

David
(gdb) thread apply all bt
Thread 10 (Thread 0x7f93e44f4700 (LWP 2332)):
#0 0x00000035942aca3d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1 0x00000035942ac8b0 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2 0x00007f93e44fa839 in _decay_thread (no_data=0x0) at priority_multifactor.c:1340
#3 0x00000035946079d1 in start_thread (arg=0x7f93e44f4700) at pthread_create.c:301
#4 0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Thread 9 (Thread 0x7f93e5b93700 (LWP 2267)):
#0 0x00000035942aca3d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1 0x00000035942ac8b0 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2 0x00007f93e5b975f3 in _set_db_inx_thread (no_data=0x0) at accounting_storage_slurmdbd.c:420
#3 0x00000035946079d1 in start_thread (arg=0x7f93e5b93700) at pthread_create.c:301
#4 0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Thread 8 (Thread 0x7f93e578e700 (LWP 2278)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1 0x000000000056c1bf in _agent (x=0x0) at slurmdbd_defs.c:2115
#2 0x00000035946079d1 in start_thread (arg=0x7f93e578e700) at pthread_create.c:301
#3 0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Thread 7 (Thread 0x7f93e41f1700 (LWP 2343)):
#0 do_sigwait (set=<value optimized out>, sig=0x7f93e41f0eac) at ../sysdeps/unix/sysv/linux/sigwait.c:65
#1 __sigwait (set=<value optimized out>, sig=0x7f93e41f0eac) at ../sysdeps/unix/sysv/linux/sigwait.c:100
#2 0x0000000000434cf4 in _slurmctld_signal_hand (no_data=0x0) at controller.c:827
#3 0x00000035946079d1 in start_thread (arg=0x7f93e41f1700) at pthread_create.c:301
#4 0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Thread 6 (Thread 0x7f93e821e700 (LWP 2265)):
#0 0x00000035942aca3d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1 0x00000035942e1be4 in usleep (useconds=<value optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33
#2 0x0000000000436591 in _slurmctld_background (no_data=0x0) at controller.c:1448
#3 0x00000000004346ec in main (argc=1, argv=0x7fffc993e648) at controller.c:561
Thread 5 (Thread 0x7f93e5a92700 (LWP 2268)):
#0 0x000000359460822d in pthread_join (threadid=140273191040768, thread_return=0x0) at pthread_join.c:89
#1 0x00007f93e5b97618 in _cleanup_thread (no_data=0x0) at accounting_storage_slurmdbd.c:428
#2 0x00000035946079d1 in start_thread (arg=0x7f93e5a92700) at pthread_create.c:301
#3 0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Thread 4 (Thread 0x7f93d7efe700 (LWP 2344)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1 0x00000000004ae720 in slurmctld_state_save (no_data=0x0) at state_save.c:208
#2 0x00000035946079d1 in start_thread (arg=0x7f93d7efe700) at pthread_create.c:301
#3 0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Thread 3 (Thread 0x7f93e42f2700 (LWP 2342)):
#0 0x00000035942e1373 in select () at ../sysdeps/unix/syscall-template.S:82
#1 0x0000000000435383 in _slurmctld_rpc_mgr (no_data=0x0) at controller.c:965
#2 0x00000035946079d1 in start_thread (arg=0x7f93e42f2700) at pthread_create.c:301
#3 0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Thread 2 (Thread 0x7f93e43f3700 (LWP 2333)):
#0 0x000000359460822d in pthread_join (threadid=140273167320832, thread_return=0x0) at pthread_join.c:89
#1 0x00007f93e44fa97b in _cleanup_thread (no_data=0x0) at priority_multifactor.c:1394
#2 0x00000035946079d1 in start_thread (arg=0x7f93e43f3700) at pthread_create.c:301
#3 0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Thread 1 (Thread 0x7f93e4a01700 (LWP 2330)):
#0 0x0000003594232625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003594233e05 in abort () at abort.c:92
#2 0x0000003594270537 in __libc_message (do_abort=2, fmt=0x3594358780 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3 0x0000003594275e66 in malloc_printerr (action=3, str=0x3594358b88 "malloc(): smallbin double linked list corrupted", ptr=<value optimized out>) at malloc.c:6336
#4 0x0000003594279c3f in _int_malloc (av=0x7f93e0000020, bytes=<value optimized out>) at malloc.c:4340
#5 0x000000359427a346 in __libc_calloc (n=<value optimized out>, elem_size=<value optimized out>) at malloc.c:4096
#6 0x00000000004d3af7 in slurm_xmalloc (size=8, clear=true, file=0x5f6d90 "bitstring.c", line=166, func=0x5f6dbf "") at xmalloc.c:84
#7 0x00000000004e49b9 in bit_alloc (nbits=0) at bitstring.c:166
#8 0x00000000004e62cd in bit_copy (b=0x1cfc340) at bitstring.c:679
#9 0x00000000005a79dd in _node_state_dup (gres_data=0x1cfc2b0) at gres.c:2120
#10 0x00000000005a7eea in gres_plugin_node_state_dup (gres_list=0x1a68c70) at gres.c:2189
#11 0x00007f93e75cf61f in _dup_node_usage (orig_ptr=0x7f92ecb55520) at select_cons_res.c:376
#12 0x00007f93e75d32a4 in _will_run_test (job_ptr=0x1cd7520, bitmap=0x7f93e0479d30, min_nodes=1, max_nodes=1, req_nodes=1, job_node_req=1, preemptee_candidates=0x0,
preemptee_job_list=0x7f93e4a00ba0, exc_core_bitmap=0x0) at select_cons_res.c:1712
#13 0x00007f93e75d4177 in select_p_job_test (job_ptr=0x1cd7520, bitmap=0x7f93e0479d30, min_nodes=1, max_nodes=1, req_nodes=1, mode=2, preemptee_candidates=0x0,
preemptee_job_list=0x7f93e4a00ba0, exc_core_bitmap=0x0) at select_cons_res.c:2077
#14 0x00000000005002e2 in select_g_job_test (job_ptr=0x1cd7520, bitmap=0x7f93e0479d30, min_nodes=1, max_nodes=1, req_nodes=1, mode=2, preemptee_candidates=0x0,
preemptee_job_list=0x7f93e4a00ba0, exc_core_bitmap=0x0) at node_select.c:593
#15 0x00007f93e4a04f66 in _try_sched (job_ptr=0x1cd7520, avail_bitmap=0x7f93e4a00d68, min_nodes=1, max_nodes=1, req_nodes=1, exc_core_bitmap=0x0) at backfill.c:366
#16 0x00007f93e4a07649 in _attempt_backfill () at backfill.c:1203
#17 0x00007f93e4a05c35 in backfill_agent (args=0x0) at backfill.c:642
#18 0x00000035946079d1 in start_thread (arg=0x7f93e4a01700) at pthread_create.c:301
#19 0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Valgrind printed this during the startup phase, but nothing else (about two hours under production load):
==31862== Thread 11:
==31862== Syscall param socketcall.sendto(msg) points to uninitialised byte(s)
==31862== at 0x359460ECDC: send (send.c:33)
==31862== by 0x546CFA: _slurm_send (slurm_protocol_socket_implementation.c:700)
==31862== by 0x54623F: _slurm_send_timeout (slurm_protocol_socket_implementation.c:295)
==31862== by 0x545FF5: _slurm_msg_sendto_timeout (slurm_protocol_socket_implementation.c:217)
==31862== by 0x545F68: _slurm_msg_sendto (slurm_protocol_socket_implementation.c:193)
==31862== by 0x512A59: slurm_send_node_msg (slurm_protocol_api.c:3309)
==31862== by 0x49348B: _slurm_rpc_job_alloc_info_lite (proc_req.c:2627)
==31862== by 0x48C211: slurmctld_req (proc_req.c:348)
==31862== by 0x43581E: _service_connection (controller.c:1070)
==31862== by 0x35946079D0: start_thread (pthread_create.c:301)
==31862== by 0x12D796FF: ???
==31862== Address 0xbd35233 is 195 bytes inside a block of size 16,400 alloc'd
==31862== at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==31862== by 0x4D3B08: slurm_xmalloc (xmalloc.c:86)
==31862== by 0x4E83F1: init_buf (pack.c:154)
==31862== by 0x512969: slurm_send_node_msg (slurm_protocol_api.c:3283)
==31862== by 0x49348B: _slurm_rpc_job_alloc_info_lite (proc_req.c:2627)
==31862== by 0x48C211: slurmctld_req (proc_req.c:348)
==31862== by 0x43581E: _service_connection (controller.c:1070)
==31862== by 0x35946079D0: start_thread (pthread_create.c:301)
==31862== by 0x12D796FF: ???
Thanks, the stack is different, but the core dump happens in the memory allocation library just like the previous one. There appears to be memory corruption.

David

Do you have the entire valgrind log? Could you please attach your slurm.conf? Do you have any submission plugin? If you do, could we examine it? Thanks,

David

Created attachment 1934 [details]
job submit plugin

Created attachment 1935 [details]
config

Thanks for the data. The submit plugin does not appear to be using any dynamic memory, so I don't think it can corrupt any. This message seems to indicate that a memory pointer is pointing inside a memory block:

==31862== Address 0xbd35233 is 195 bytes inside a block of size 16,400 alloc'd

but it is not clear if this is an error. Are you still running with valgrind enabled?

David

> but it is not clear if this is an error. Are you still running with valgrind
> enabled?
I'd like to, but the response times of the Slurm s-commands are too slow when running under valgrind.
Hi, just checking how things are going lately. We still do not have any fix for the core dump.

David

Well, it seems that the problem has disappeared. No crashes during the last three weeks.

Let's close this for now, as we cannot reproduce it. There is some obscure memory corruption happening under unknown conditions.

David