We are running Slurm commit 7019cb6 from the 14.11 branch (almost 14.11.7). Slurmctld suddenly crashed. The job ID seen in the backtrace belongs to a normal-looking array job:

#SBATCH -t 48:00:00
#SBATCH --mem-per-cpu=4000
#SBATCH --array=1-444
#SBATCH -p serial
#SBATCH -n 1
#SBATCH -M csc
#SBATCH --cpus-per-task=4

Log file snippet just before the crash:

[2015-05-22T14:19:48.121] job_complete: JobID=5720701 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:48.121] job_complete: JobID=5720701 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:48.121] job_complete: JobID=5720663 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:48.121] job_complete: JobID=5720663 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:48.121] job_complete: JobID=5720665 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:48.121] job_complete: JobID=5720665 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:48.121] job_complete: JobID=5720688 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:48.121] job_complete: JobID=5720688 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:48.122] job_complete: JobID=5720637 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:48.122] job_complete: JobID=5720637 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:48.122] job_complete: JobID=5720687 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:48.122] job_complete: JobID=5720687 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:48.122] job_complete: JobID=5720691 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:48.122] job_complete: JobID=5720691 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:48.122] job_complete: JobID=5720690 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:48.122] job_complete: JobID=5720690 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:48.122] job_complete: JobID=5720689 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:48.122] job_complete: JobID=5720689 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:48.123] job_complete: JobID=5720661 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:48.123] job_complete: JobID=5720661 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:50.314] job_complete: JobID=5720667 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:50.314] job_complete: JobID=5720667 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:50.431] _slurm_rpc_submit_batch_job JobId=5720876 usec=1202
[2015-05-22T14:19:52.554] job_complete: JobID=5720653 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:52.554] job_complete: JobID=5720653 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:54.905] job_complete: JobID=5720639 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2015-05-22T14:19:54.905] job_complete: JobID=5720639 State=0x8003 NodeCnt=1 done
[2015-05-22T14:19:54.907] _slurm_rpc_submit_batch_job JobId=5720877 usec=1404
[2015-05-22T14:19:59.315] _slurm_rpc_submit_batch_job JobId=5720878 usec=1344
[2015-05-22T14:20:01.642] _slurm_rpc_submit_batch_job JobId=5720879 usec=1435
[2015-05-22T14:20:01.671] _slurm_rpc_submit_batch_job JobId=5720880 usec=1761
[2015-05-22T14:20:01.703] _slurm_rpc_submit_batch_job JobId=5720881 usec=1750
[2015-05-22T14:20:01.718] _slurm_rpc_submit_batch_job JobId=5720882 usec=1477
[2015-05-22T14:20:01.729] _slurm_rpc_submit_batch_job JobId=5720883 usec=1190
[2015-05-22T14:20:01.744] _slurm_rpc_submit_batch_job JobId=5720884 usec=2105
[2015-05-22T14:20:03.847] _slurm_rpc_submit_batch_job JobId=5720885 usec=1473
[2015-05-22T14:20:21.618] pidfile not locked, assuming no running daemon
[2015-05-22T14:20:21.643] slurmctld version 14.11.6 started on cluster csc
[2015-05-22T14:20:21.643] Munge cryptographic signature plugin loaded
[2015-05-22T14:20:21.656] Consumable Resources (CR) Node Selection plugin loaded with argument 4116
[2015-05-22T14:20:21.657] preempt/none loaded
[2015-05-22T14:20:21.691] job_submit.lua: initialized
[2015-05-22T14:20:21.691] ExtSensors NONE plugin loaded
[2015-05-22T14:20:21.703] Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
[2015-05-22T14:20:21.705] slurmdbd: recovered 0 pending RPCs

(gdb) thread apply all bt

Thread 13 (Thread 0x7ff0d1749700 (LWP 17183)):
#0  do_sigwait (set=<value optimized out>, sig=0x7ff0d1748eac) at ../sysdeps/unix/sysv/linux/sigwait.c:65
#1  __sigwait (set=<value optimized out>, sig=0x7ff0d1748eac) at ../sysdeps/unix/sysv/linux/sigwait.c:100
#2  0x0000000000434cf4 in _slurmctld_signal_hand (no_data=0x0) at controller.c:827
#3  0x00000035946079d1 in start_thread (arg=0x7ff0d1749700) at pthread_create.c:301
#4  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 12 (Thread 0x7ff0d1a4c700 (LWP 17180)):
#0  0x00000035942aca3d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000035942ac8b0 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007ff0d1a52839 in _decay_thread (no_data=0x0) at priority_multifactor.c:1340
#3  0x00000035946079d1 in start_thread (arg=0x7ff0d1a4c700) at pthread_create.c:301
#4  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 11 (Thread 0x7ff0d2654700 (LWP 17050)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1  0x00007ff0d26582b8 in _my_sleep (usec=500000) at backfill.c:441
#2  0x00007ff0d2658d76 in _yield_locks (usec=500000) at backfill.c:680
#3  0x00007ff0d2659f9c in _attempt_backfill () at backfill.c:1077
#4  0x00007ff0d2658c35 in backfill_agent (args=0x0) at backfill.c:642
#5  0x00000035946079d1 in start_thread (arg=0x7ff0d2654700) at pthread_create.c:301
#6  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 10 (Thread 0x7ff0d36e5700 (LWP 16272)):
#0  0x00000035942aca3d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000035942ac8b0 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007ff0d36e95f3 in _set_db_inx_thread (no_data=0x0) at accounting_storage_slurmdbd.c:420
#3  0x00000035946079d1 in start_thread (arg=0x7ff0d36e5700) at pthread_create.c:301
#4  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 9 (Thread 0x7ff0d34e3700 (LWP 16274)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1  0x000000000056c1bf in _agent (x=0x0) at slurmdbd_defs.c:2115
#2  0x00000035946079d1 in start_thread (arg=0x7ff0d34e3700) at pthread_create.c:301
#3  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 8 (Thread 0x7ff0d194b700 (LWP 17181)):
#0  0x000000359460822d in pthread_join (threadid=140672286115584, thread_return=0x0) at pthread_join.c:89
#1  0x00007ff0d1a5297b in _cleanup_thread (no_data=0x0) at priority_multifactor.c:1394
#2  0x00000035946079d1 in start_thread (arg=0x7ff0d194b700) at pthread_create.c:301
#3  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 7 (Thread 0x7ff0d1648700 (LWP 17184)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004ae734 in slurmctld_state_save (no_data=0x0) at state_save.c:211
#2  0x00000035946079d1 in start_thread (arg=0x7ff0d1648700) at pthread_create.c:301
#3  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 6 (Thread 0x7ff0d35e4700 (LWP 16273)):
#0  0x000000359460822d in pthread_join (threadid=140672316102400, thread_return=0x0) at pthread_join.c:89
#1  0x00007ff0d36e9618 in _cleanup_thread (no_data=0x0) at accounting_storage_slurmdbd.c:428
#2  0x00000035946079d1 in start_thread (arg=0x7ff0d35e4700) at pthread_create.c:301
#3  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 5 (Thread 0x7ff0d184a700 (LWP 17182)):
#0  0x00000035942e1373 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000000000435383 in _slurmctld_rpc_mgr (no_data=0x0) at controller.c:965
#2  0x00000035946079d1 in start_thread (arg=0x7ff0d184a700) at pthread_create.c:301
#3  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7ff0b14d4700 (LWP 49017)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004715d2 in _wr_rdlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:218
#2  0x00000000004711a7 in lock_slurmctld (lock_levels=...) at locks.c:84
#3  0x000000000048ed13 in _slurm_rpc_dump_jobs (msg=0x7fef4cd76940) at proc_req.c:1175
#4  0x000000000048bf06 in slurmctld_req (msg=0x7fef4cd76940, arg=0x7ff0bc00ca10) at proc_req.c:267
#5  0x000000000043581f in _service_connection (arg=0x7ff0bc00ca10) at controller.c:1070
#6  0x00000035946079d1 in start_thread (arg=0x7ff0b14d4700) at pthread_create.c:301
#7  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7ff0d5d70700 (LWP 16263)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x0000000000471809 in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x00000000004711c0 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000046802d in _schedule (job_limit=4294967295) at job_scheduler.c:1091
#4  0x000000000046741a in schedule (job_limit=4294967295) at job_scheduler.c:813
#5  0x0000000000436e84 in _slurmctld_background (no_data=0x0) at controller.c:1623
#6  0x00000000004346ec in main (argc=1, argv=0x7fffcd2b4078) at controller.c:561

Thread 2 (Thread 0x7ff0b39f9700 (LWP 49018)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004715d2 in _wr_rdlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:218
#2  0x00000000004711a7 in lock_slurmctld (lock_levels=...) at locks.c:84
#3  0x0000000000494a47 in _slurm_rpc_submit_batch_job (msg=0x7fef4cb12050) at proc_req.c:3107
#4  0x000000000048c2eb in slurmctld_req (msg=0x7fef4cb12050, arg=0x7ff0bc009950) at proc_req.c:380
#5  0x000000000043581f in _service_connection (arg=0x7ff0bc009950) at controller.c:1070
#6  0x00000035946079d1 in start_thread (arg=0x7ff0b39f9700) at pthread_create.c:301
#7  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7ff0b3bfb700 (LWP 49016)):
#0  0x0000003594232625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003594233e05 in abort () at abort.c:92
#2  0x0000003594270537 in __libc_message (do_abort=2, fmt=0x3594358780 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x0000003594275e66 in malloc_printerr (action=3, str=0x359435673d "corrupted double-linked list", ptr=<value optimized out>) at malloc.c:6336
#4  0x00000035942762ed in malloc_consolidate (av=0x7fef4c000020) at malloc.c:5214
#5  0x0000003594279125 in _int_malloc (av=0x7fef4c000020, bytes=<value optimized out>) at malloc.c:4406
#6  0x000000359427a6b1 in __libc_malloc (bytes=16400) at malloc.c:3664
#7  0x00000000004d3b09 in slurm_xmalloc (size=16384, clear=false, file=0x5f746a "pack.c", line=154, func=0x5f7469 "") at xmalloc.c:86
#8  0x00000000004e83f2 in init_buf (size=16384) at pack.c:154
#9  0x0000000000452b71 in pack_one_job (buffer_ptr=0x7ff0b3bfacc0, buffer_size=0x7ff0b3bfacdc, job_id=5707986, show_flags=0, uid=19395, protocol_version=7168) at job_mgr.c:7407
#10 0x000000000048f293 in _slurm_rpc_dump_job_single (msg=0x7fef4c415bb0) at proc_req.c:1265
#11 0x000000000048bf54 in slurmctld_req (msg=0x7fef4c415bb0, arg=0x7ff0bc0051e0) at proc_req.c:275
#12 0x000000000043581f in _service_connection (arg=0x7ff0bc0051e0) at controller.c:1070
#13 0x00000035946079d1 in start_thread (arg=0x7ff0b3bfb700) at pthread_create.c:301
#14 0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
RPMs for el6.6, the core file, the config, and the full log file: https://kannu.csc.fi/index.php/s/HUcQ2wjqDOhdtAy
The core indicates memory corruption when processing the 'scontrol show job' command. We will try to reproduce it by running the controller under valgrind.

How many jobs did you have in the system at that time? Is this reproducible somehow, or just a random core dump?

David
> The core indicates memory corruption when processing the 'scontrol show job'
> command. We will try to reproduce running the controller under valgrind.
> How many jobs did you have in the system at that time? Is this reproducible
> somehow or just a random core dump?

This was the first occurrence since I upgraded Slurm a couple of days ago. The previous version (14.11.4) was stable for many months. Currently there are a bit over 3000 jobs running and 1000 jobs pending.
Running a large job-array workload under valgrind has not shown any memory corruption so far. We also tried several job updates and dependencies, but no errors. Do you have any idea what triggers it and how to reproduce it?

David
New core file; the job count was a bit over 2300. I've no idea how to reproduce this.

https://kannu.csc.fi/index.php/s/HUcQ2wjqDOhdtAy/download?path=%2F&files=core.16263.xz

[2015-06-01T17:32:51.680] job_complete: JobID=5827611_23(5827637) State=0x8005 NodeCnt=1 done
[2015-06-01T17:33:05.923] _slurm_rpc_submit_batch_job JobId=5827835 usec=1880
[2015-06-01T17:33:05.948] _slurm_rpc_submit_batch_job JobId=5827836 usec=1516
[2015-06-01T17:33:05.962] _slurm_rpc_submit_batch_job JobId=5827837 usec=1264
[2015-06-01T17:33:05.977] _slurm_rpc_submit_batch_job JobId=5827838 usec=1182
[2015-06-01T17:33:05.991] _slurm_rpc_submit_batch_job JobId=5827839 usec=1181
[2015-06-01T17:33:46.807] pidfile not locked, assuming no running daemon
[2015-06-01T17:33:46.823] slurmctld version 14.11.6 started on cluster csc
Hi, unfortunately the core file is not useful to us, as we have a different build environment. Could you please print the stack using gdb?

(gdb) thread apply all bt

That way we can see if this has the same stack as the previous core dump.

David
(gdb) thread apply all bt

Thread 10 (Thread 0x7f93e44f4700 (LWP 2332)):
#0  0x00000035942aca3d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000035942ac8b0 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007f93e44fa839 in _decay_thread (no_data=0x0) at priority_multifactor.c:1340
#3  0x00000035946079d1 in start_thread (arg=0x7f93e44f4700) at pthread_create.c:301
#4  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 9 (Thread 0x7f93e5b93700 (LWP 2267)):
#0  0x00000035942aca3d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000035942ac8b0 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007f93e5b975f3 in _set_db_inx_thread (no_data=0x0) at accounting_storage_slurmdbd.c:420
#3  0x00000035946079d1 in start_thread (arg=0x7f93e5b93700) at pthread_create.c:301
#4  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 8 (Thread 0x7f93e578e700 (LWP 2278)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1  0x000000000056c1bf in _agent (x=0x0) at slurmdbd_defs.c:2115
#2  0x00000035946079d1 in start_thread (arg=0x7f93e578e700) at pthread_create.c:301
#3  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 7 (Thread 0x7f93e41f1700 (LWP 2343)):
#0  do_sigwait (set=<value optimized out>, sig=0x7f93e41f0eac) at ../sysdeps/unix/sysv/linux/sigwait.c:65
#1  __sigwait (set=<value optimized out>, sig=0x7f93e41f0eac) at ../sysdeps/unix/sysv/linux/sigwait.c:100
#2  0x0000000000434cf4 in _slurmctld_signal_hand (no_data=0x0) at controller.c:827
#3  0x00000035946079d1 in start_thread (arg=0x7f93e41f1700) at pthread_create.c:301
#4  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 6 (Thread 0x7f93e821e700 (LWP 2265)):
#0  0x00000035942aca3d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000035942e1be4 in usleep (useconds=<value optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x0000000000436591 in _slurmctld_background (no_data=0x0) at controller.c:1448
#3  0x00000000004346ec in main (argc=1, argv=0x7fffc993e648) at controller.c:561

Thread 5 (Thread 0x7f93e5a92700 (LWP 2268)):
#0  0x000000359460822d in pthread_join (threadid=140273191040768, thread_return=0x0) at pthread_join.c:89
#1  0x00007f93e5b97618 in _cleanup_thread (no_data=0x0) at accounting_storage_slurmdbd.c:428
#2  0x00000035946079d1 in start_thread (arg=0x7f93e5a92700) at pthread_create.c:301
#3  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7f93d7efe700 (LWP 2344)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1  0x00000000004ae720 in slurmctld_state_save (no_data=0x0) at state_save.c:208
#2  0x00000035946079d1 in start_thread (arg=0x7f93d7efe700) at pthread_create.c:301
#3  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7f93e42f2700 (LWP 2342)):
#0  0x00000035942e1373 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000000000435383 in _slurmctld_rpc_mgr (no_data=0x0) at controller.c:965
#2  0x00000035946079d1 in start_thread (arg=0x7f93e42f2700) at pthread_create.c:301
#3  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 2 (Thread 0x7f93e43f3700 (LWP 2333)):
#0  0x000000359460822d in pthread_join (threadid=140273167320832, thread_return=0x0) at pthread_join.c:89
#1  0x00007f93e44fa97b in _cleanup_thread (no_data=0x0) at priority_multifactor.c:1394
#2  0x00000035946079d1 in start_thread (arg=0x7f93e43f3700) at pthread_create.c:301
#3  0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7f93e4a01700 (LWP 2330)):
#0  0x0000003594232625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003594233e05 in abort () at abort.c:92
#2  0x0000003594270537 in __libc_message (do_abort=2, fmt=0x3594358780 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3  0x0000003594275e66 in malloc_printerr (action=3, str=0x3594358b88 "malloc(): smallbin double linked list corrupted", ptr=<value optimized out>) at malloc.c:6336
#4  0x0000003594279c3f in _int_malloc (av=0x7f93e0000020, bytes=<value optimized out>) at malloc.c:4340
#5  0x000000359427a346 in __libc_calloc (n=<value optimized out>, elem_size=<value optimized out>) at malloc.c:4096
#6  0x00000000004d3af7 in slurm_xmalloc (size=8, clear=true, file=0x5f6d90 "bitstring.c", line=166, func=0x5f6dbf "") at xmalloc.c:84
#7  0x00000000004e49b9 in bit_alloc (nbits=0) at bitstring.c:166
#8  0x00000000004e62cd in bit_copy (b=0x1cfc340) at bitstring.c:679
#9  0x00000000005a79dd in _node_state_dup (gres_data=0x1cfc2b0) at gres.c:2120
#10 0x00000000005a7eea in gres_plugin_node_state_dup (gres_list=0x1a68c70) at gres.c:2189
#11 0x00007f93e75cf61f in _dup_node_usage (orig_ptr=0x7f92ecb55520) at select_cons_res.c:376
#12 0x00007f93e75d32a4 in _will_run_test (job_ptr=0x1cd7520, bitmap=0x7f93e0479d30, min_nodes=1, max_nodes=1, req_nodes=1, job_node_req=1, preemptee_candidates=0x0, preemptee_job_list=0x7f93e4a00ba0, exc_core_bitmap=0x0) at select_cons_res.c:1712
#13 0x00007f93e75d4177 in select_p_job_test (job_ptr=0x1cd7520, bitmap=0x7f93e0479d30, min_nodes=1, max_nodes=1, req_nodes=1, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f93e4a00ba0, exc_core_bitmap=0x0) at select_cons_res.c:2077
#14 0x00000000005002e2 in select_g_job_test (job_ptr=0x1cd7520, bitmap=0x7f93e0479d30, min_nodes=1, max_nodes=1, req_nodes=1, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f93e4a00ba0, exc_core_bitmap=0x0) at node_select.c:593
#15 0x00007f93e4a04f66 in _try_sched (job_ptr=0x1cd7520, avail_bitmap=0x7f93e4a00d68, min_nodes=1, max_nodes=1, req_nodes=1, exc_core_bitmap=0x0) at backfill.c:366
#16 0x00007f93e4a07649 in _attempt_backfill () at backfill.c:1203
#17 0x00007f93e4a05c35 in backfill_agent (args=0x0) at backfill.c:642
#18 0x00000035946079d1 in start_thread (arg=0x7f93e4a01700) at pthread_create.c:301
#19 0x00000035942e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Valgrind printed this during the start-up phase, but nothing else (about two hours under production load):

==31862== Thread 11:
==31862== Syscall param socketcall.sendto(msg) points to uninitialised byte(s)
==31862==    at 0x359460ECDC: send (send.c:33)
==31862==    by 0x546CFA: _slurm_send (slurm_protocol_socket_implementation.c:700)
==31862==    by 0x54623F: _slurm_send_timeout (slurm_protocol_socket_implementation.c:295)
==31862==    by 0x545FF5: _slurm_msg_sendto_timeout (slurm_protocol_socket_implementation.c:217)
==31862==    by 0x545F68: _slurm_msg_sendto (slurm_protocol_socket_implementation.c:193)
==31862==    by 0x512A59: slurm_send_node_msg (slurm_protocol_api.c:3309)
==31862==    by 0x49348B: _slurm_rpc_job_alloc_info_lite (proc_req.c:2627)
==31862==    by 0x48C211: slurmctld_req (proc_req.c:348)
==31862==    by 0x43581E: _service_connection (controller.c:1070)
==31862==    by 0x35946079D0: start_thread (pthread_create.c:301)
==31862==    by 0x12D796FF: ???
==31862==  Address 0xbd35233 is 195 bytes inside a block of size 16,400 alloc'd
==31862==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==31862==    by 0x4D3B08: slurm_xmalloc (xmalloc.c:86)
==31862==    by 0x4E83F1: init_buf (pack.c:154)
==31862==    by 0x512969: slurm_send_node_msg (slurm_protocol_api.c:3283)
==31862==    by 0x49348B: _slurm_rpc_job_alloc_info_lite (proc_req.c:2627)
==31862==    by 0x48C211: slurmctld_req (proc_req.c:348)
==31862==    by 0x43581E: _service_connection (controller.c:1070)
==31862==    by 0x35946079D0: start_thread (pthread_create.c:301)
==31862==    by 0x12D796FF: ???
Thanks. The stack is different, but the core dump happens in the memory allocation library just like the previous one. There appears to be memory corruption.

David
Do you have the entire valgrind log?
Could you please attach your slurm.conf? Do you have a job submission plugin? If you do, could we examine it?

Thanks,
David
Created attachment 1934 [details] job submit plugin
Created attachment 1935 [details] config
Thanks for the data. The submit plugin does not appear to be using any dynamic memory, so I don't think it can corrupt any.

This message seems to indicate that a memory pointer is pointing inside a memory block:

==31862==  Address 0xbd35233 is 195 bytes inside a block of size 16,400 alloc'd

but it is not clear if this is an error. Are you still running with valgrind enabled?

David
> but it is not clear if this is an error. Are you still running with valgrind
> enabled?

I'd like to, but the response times of the Slurm commands are too slow when running under valgrind.
Hi, just checking how it has been going lately. We still do not have a fix for the core dumps.

David
Well, it seems the problem has disappeared. No crashes during the last three weeks.
Let's close this for now, as we cannot reproduce it. There is some obscure memory corruption happening under unknown conditions.

David