Ticket 1309 - Slurmctld crashed, invalid job_ptr
Summary: Slurmctld crashed, invalid job_ptr
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.11.1
Hardware: Linux
Severity: 2 - High Impact
Assignee: Moe Jette
QA Contact:
URL:
Duplicates: 1342
Depends on:
Blocks:
 
Reported: 2014-12-09 09:06 MST by CSC sysadmins
Modified: 2014-12-29 01:06 MST
CC List: 3 users

See Also:
Site: CSC - IT Center for Science
Version Fixed: 14.11.3


Attachments
Job submit plugin (3.35 KB, text/x-lua)
2014-12-09 09:18 MST, CSC sysadmins
Details
Slurmctld log file (50.32 KB, text/plain)
2014-12-09 09:39 MST, CSC sysadmins
Details

Description CSC sysadmins 2014-12-09 09:06:57 MST
Hi,

Slurmctld crashed earlier this evening; here is a backtrace of the crash:

Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000044a33c in find_job_record (job_id=3415818) at job_mgr.c:2626
2626                    if (job_ptr->job_id == job_id)
Missing separate debuginfos, use: debuginfo-install slurm-14.11.1-1.el6.x86_64
(gdb) bt full
#0  0x000000000044a33c in find_job_record (job_id=3415818) at job_mgr.c:2626
        job_ptr = 0x1
#1  0x000000000045db23 in _set_job_id (job_ptr=0x7fd880227cb0) at job_mgr.c:8423
        i = 0
        new_id = 3415818
        max_jobs = 4292671760
#2  0x0000000000453f51 in _copy_job_desc_to_job_record (job_desc=0x7fd8800c1710, job_rec_ptr=0x7fda576f5c00, req_bitmap=0x7fda576f5940, exc_bitmap=0x7fda576f5938)
    at job_mgr.c:6371
        error_code = 0
        detail_ptr = 0x7fd8800c1710
        job_ptr = 0x7fd880227cb0
        __func__ = "_copy_job_desc_to_job_record"
#3  0x0000000000451713 in _job_create (job_desc=0x7fd8800c1710, allocate=0, will_run=0, job_pptr=0x7fda576f5c00, submit_uid=34018, err_msg=0x7fda576f5cb0)
    at job_mgr.c:5458
        launch_type_poe = 0
        error_code = 0
        i = 0
        qos_error = 0
        part_ptr = 0x7fd88075bd10
        part_ptr_list = 0x0
        req_bitmap = 0x0
        exc_bitmap = 0x0
        job_ptr = 0x0
        assoc_rec = {accounting_list = 0x0, acct = 0x25d31a0 "csc", assoc_next = 0x0, assoc_next_id = 0x0, cluster = 0x2495370 "csc", def_qos_id = 0, 
          grp_cpu_mins = 4294967295, grp_cpu_run_mins = 4294967295, grp_cpus = 1024, grp_jobs = 4294967295, grp_mem = 4294967295, grp_nodes = 4294967295, 
          grp_submit_jobs = 4294967295, grp_wall = 4294967295, id = 1230, is_def = 1, lft = 6657, max_cpu_mins_pj = 4294967295, max_cpu_run_mins = 4294967295, 
          max_cpus_pj = 4294967295, max_jobs = 4294967295, max_nodes_pj = 4294967295, max_submit_jobs = 896, max_wall_pj = 4294967295, parent_acct = 0x0, 
          parent_id = 6, partition = 0x7fd8800bfc70 "serial", qos_list = 0x2780960, rgt = 6630, shares_raw = 1, uid = 34018, usage = 0x0, user = 0x24c1160 "xxxxxx"}
        assoc_ptr = 0x278d010
        license_list = 0x0
        valid = true
        qos_rec = {description = 0x24a1380 "Normal QOS default", id = 1, flags = 0, grace_time = 0, grp_cpu_mins = 4294967295, grp_cpu_run_mins = 4294967295, 
          grp_cpus = 4294967295, grp_jobs = 4294967295, grp_mem = 4294967295, grp_nodes = 4294967295, grp_submit_jobs = 4294967295, grp_wall = 4294967295, 
          max_cpu_mins_pj = 4294967295, max_cpu_run_mins_pu = 4294967295, max_cpus_pj = 4294967295, max_cpus_pu = 4294967295, max_jobs_pu = 4294967295, 
          max_nodes_pj = 4294967295, max_nodes_pu = 4294967295, max_submit_jobs_pu = 4294967295, max_wall_pj = 4294967295, min_cpus_pj = 1, name = 0x2495320 "normal", 
          preempt_bitstr = 0x0, preempt_list = 0x0, preempt_mode = 0, priority = 0, usage = 0x0, usage_factor = 1, usage_thres = 0}
        qos_ptr = 0x24a12c0
        user_submit_priority = 4294967294
        node_scaling = 1
        cpus_per_mp = 1
        acct_policy_limit_set = {max_cpus = 0, max_nodes = 0, min_cpus = 0, min_nodes = 0, pn_min_memory = 0, qos = 0, time = 0}
#4  0x000000000044d395 in job_allocate (job_specs=0x7fd8800c1710, immediate=0, will_run=0, resp=0x0, allocate=0, submit_uid=34018, job_pptr=0x7fda576f5cc0, 
    err_msg=0x7fda576f5cb0) at job_mgr.c:3758
        defer_sched = 1
        error_code = 0
        i = 0
        no_alloc = false
        top_prio = false
        test_only = 127
        too_fragmented = 218
        independent = 125
        job_ptr = 0x0
        now = 1418137557
        __func__ = "job_allocate"
#5  0x000000000049d119 in _slurm_rpc_submit_batch_job (msg=0x7fd8802e1e70) at proc_req.c:3189
        active_rpc_cnt = 1
        error_code = 0
        tv1 = {tv_sec = 1418137557, tv_usec = 4239}
        tv2 = {tv_sec = 140567837220096, tv_usec = 30787792493784}
        tv_str = '\000' <repeats 19 times>
        delta_t = 5208805
        step_id = 0
        job_ptr = 0x0
        response_msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, conn_fd = -1, 
          data = 0x0, data_size = 0, flags = 0, msg_type = 65534, protocol_version = 7168, forward = {cnt = 0, init = 65534, nodelist = 0x0, timeout = 0}, 
          forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        submit_msg = {job_id = 1466916080, step_id = 32730, error_code = 5308707}
        job_desc_msg = 0x7fd8800c1710
        job_read_lock = {config = READ_LOCK, job = READ_LOCK, node = READ_LOCK, partition = READ_LOCK}
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = READ_LOCK, partition = READ_LOCK}
        uid = 34018
        err_msg = 0x0
#6  0x00000000004945b3 in slurmctld_req (msg=0x7fd8802e1e70, arg=0x7fda74003ec0) at proc_req.c:380
        tv1 = {tv_sec = 1418137556, tv_usec = 615835}
        tv2 = {tv_sec = 4472544, tv_usec = 140567837220096}
        tv_str = '\000' <repeats 19 times>
        delta_t = 371696738
        i = 169
        rpc_type_index = 8
        rpc_user_index = 169
        rpc_uid = 34018
        __func__ = "slurmctld_req"
#7  0x00000000004357eb in _service_connection (arg=0x7fda74003ec0) at controller.c:1070
        conn = 0x7fda74003ec0
        return_code = 0x0
        msg = 0x7fd8802e1e70
        __func__ = "_service_connection"
#8  0x0000003b16a079d1 in start_thread (arg=0x7fda576f6700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7fda576f6700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140575746516736, 2604083960062317716, 140576390549008, 140575746517440, 0, 3, -2623559752364419948, 
                2618207461584049300}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#9  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
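For readers without the source handy, the crash site in frame #0 is a plain list scan. The following is a minimal, hypothetical sketch of the pattern (the struct and function here are illustrative, not Slurm's actual `job_record` or `List` API); it shows why a corrupted node pointer such as `0x1` faults on the `job_ptr->job_id` read:

```c
#include <stddef.h>

/* Simplified sketch of the scan in find_job_record() (job_mgr.c:2626).
 * Slurm's real job table uses its List type; a bare singly linked list
 * is enough to show the faulting access pattern. */
struct job_record {
    unsigned int job_id;
    struct job_record *next;
};

struct job_record *find_job(struct job_record *head, unsigned int job_id)
{
    for (struct job_record *job_ptr = head; job_ptr;
         job_ptr = job_ptr->next) {
        /* If list corruption leaves job_ptr == (struct job_record *)0x1,
         * this load of job_ptr->job_id dereferences an invalid address
         * and raises SIGSEGV, as in frame #0 of the core above. */
        if (job_ptr->job_id == job_id)
            return job_ptr;
    }
    return NULL;
}
```

The `job_ptr = 0x1` in the backtrace therefore points at list corruption that happened upstream of this function; the scan itself is innocent.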
Comment 1 Danny Auble 2014-12-09 09:12:45 MST
Could you send your slurm.conf?

Do you happen to be running with any job_submit plugins? If so, could you send those as well?

This appears to be memory corruption.

Can you easily reproduce this? If so, could you run slurmctld under valgrind and send the output?
Comment 2 CSC sysadmins 2014-12-09 09:18:59 MST
Created attachment 1500 [details]
Job submit plugin
Comment 3 CSC sysadmins 2014-12-09 09:23:47 MST
Here is our config. This is not easy to reproduce; we upgraded Slurm last Friday and this was the first crash. Slurmctld resumed after a restart and seems to work again.

Configuration data as of 2014-12-10T01:13:44
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits
AccountingStorageHost   = slurmdbip
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = YES
AcctGatherEnergyType    = acct_gather_energy/rapl
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq      = 30 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthInfo                = (null)
AuthType                = auth/munge
BackupAddr              = (null)
BackupController        = (null)
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2014-12-10T00:27:57
CacheGroups             = 0
CheckpointType          = checkpoint/none
ChosLoc                 = (null)
ClusterName             = csc
CompleteWait            = 12 sec
ControlAddr             = 10.10.0.5
ControlMachine          = service01,service02
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = OnDemand
CryptoType              = crypto/munge
DebugFlags              = (null)
DefMemPerCPU            = 512
DisableRootJobs         = NO
DynAllocPort            = 0
EnforcePartLimits       = YES
Epilog                  = /etc/slurm/epilog
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FastSchedule            = 2
FirstJobId              = 2230000
GetEnvTimeout           = 2 sec
GresTypes               = mic,gpu
GroupUpdateForce        = 0
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 1800 sec
HealthCheckNodeState    = IDLE
HealthCheckProgram      = /etc/slurm/health_check
InactiveLimit           = 1800 sec
JobAcctGatherFrequency  = energy=30,task=30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = lua
KeepAliveTime           = 60 sec
KillOnBadExit           = 1
KillWait                = 10 sec
LaunchType              = launch/slurm
Layouts                 = 
Licenses                = mdcs:256
LicensesUsed            = mdcs:0/256
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxJobCount             = 30000
MaxJobId                = 4294901760
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 128
MemLimitEnforce         = yes
MessageTimeout          = 99 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = ports=12000-12999
NEXT_JOB_ID             = 3416724
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PreemptMode             = OFF
PreemptType             = preempt/none
PriorityParameters      = (null)
PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = 0
PriorityFlags           = 
PriorityMaxAge          = 6-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 500
PriorityWeightFairShare = 1000
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS       = 0
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = /etc/slurm/prolog
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = (null)
PropagateResourceLimitsExcept = MEMLOCK,RLIMIT_AS,RLIMIT_CPU,RLIMIT_NPROC,RLIMIT_CORE,RLIMIT_DATA,RLIMIT_RSS,STACK
RebootProgram           = /sbin/reboot
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = (null)
SallocDefaultCommand    = (null)
SchedulerParameters     = bf_max_job_user=30,bf_continue,bf_interval=60,bf_resolution=180,max_job_bf=300,defer_rpc_cnt=10
SchedulerPort           = 7321
SchedulerRootFilter     = 1
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY,CR_CORE_DEFAULT_DIST_BLOCK,CR_LLN
SlurmUser               = slurm(88)
SlurmctldDebug          = verbose
SlurmctldLogFile        = /slurmdb/log/Slurmctld.log
SlurmctldPort           = 6817
SlurmctldTimeout        = 300 sec
SlurmdDebug             = info
SlurmdLogFile           = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPlugstack         = (null)
SlurmdPort              = 6818
SlurmdSpoolDir          = /slurmdb/tmp/slurmd
SlurmdTimeout           = 600 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 14.11.1
SrunEpilog              = (null)
SrunProlog              = (null)
StateSaveLocation       = /slurmdb/tmp
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TmpFS                   = /tmp
TopologyPlugin          = topology/tree
TrackWCKey              = 0
TreeWidth               = 50
UsePam                  = 1
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
Comment 4 Danny Auble 2014-12-09 09:25:27 MST
Thanks, could you send a snippet of the slurmctld log up to the segfault?
Comment 5 CSC sysadmins 2014-12-09 09:39:45 MST
Created attachment 1501 [details]
Slurmctld log file
Comment 6 Danny Auble 2014-12-09 09:45:23 MST
Tommi, do you have an idea how many jobs were in the queue when this happened?

Was job 3415818 part of an array?

While this isn't related, I find it interesting that there are warnings of large processing times from when this was happening.

On the core could you send the output of

thread apply all bt
Comment 7 CSC sysadmins 2014-12-09 10:07:50 MST
> Tommi, do you have an idea how many jobs were in the queue when this happened?

Usually we have a couple hundred jobs.

> Was job 3415818 part of an array?

I don't know. That job ID does not exist in the log file or the sacct DB.

> While this isn't related, I find it interesting there are Warnings of large processing time when this was happening.

My colleague filed a bug report about this in the past; it's a long-standing issue and should not be related:
http://bugs.schedmd.com/show_bug.cgi?id=1082

> On the core could you send the output of




Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000044a33c in find_job_record (job_id=3415818) at job_mgr.c:2626
2626                    if (job_ptr->job_id == job_id)
Missing separate debuginfos, use: debuginfo-install slurm-14.11.1-1.el6.x86_64
(gdb) thread apply all bt

Thread 20 (Thread 0x7fda7e930700 (LWP 39633)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1  0x00007fda7e934278 in _my_sleep (usec=500000) at backfill.c:438
#2  0x00007fda7e934cc0 in _yield_locks (usec=500000) at backfill.c:666
#3  0x00007fda7e935e31 in _attempt_backfill () at backfill.c:1035
#4  0x00007fda7e934b7f in backfill_agent (args=0x0) at backfill.c:628
#5  0x0000003b16a079d1 in start_thread (arg=0x7fda7e930700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 19 (Thread 0x7fda7df2b700 (LWP 39635)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162ac870 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007fda7df319ca in _decay_thread (no_data=0x0) at priority_multifactor.c:1335
#3  0x0000003b16a079d1 in start_thread (arg=0x7fda7df2b700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 18 (Thread 0x7fda56ded700 (LWP 7106)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162e1b54 in usleep (useconds=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x000000000042f8d3 in _wdog (args=0x7fd880325c50) at agent.c:573
#3  0x0000003b16a079d1 in start_thread (arg=0x7fda56ded700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 17 (Thread 0x7fda57cfc700 (LWP 7107)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004e2ceb in start_msg_tree (hl=0x7fd88021ab70, msg=0x7fda57cfbe20, timeout=0)
    at forward.c:647
#2  0x00000000005241ce in slurm_send_recv_msgs (nodelist=0x7fd880313100 "c436", 
    msg=0x7fda57cfbe20, timeout=0, quiet=true) at slurm_protocol_api.c:3987
#3  0x00000000004305c5 in _thread_per_group_rpc (args=0x7fd880047240) at agent.c:879
#4  0x0000003b16a079d1 in start_thread (arg=0x7fda57cfc700) at pthread_create.c:301
#5  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 16 (Thread 0x7fda578f8700 (LWP 7114)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162ac870 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00000000005242d8 in slurm_send_addr_recv_msgs (msg=0x7fda578f7e20, 
    name=0x7fd880212040 "c436", timeout=99000) at slurm_protocol_api.c:4021
#3  0x00000000004e1f27 in _fwd_tree_thread (arg=0x7fd880329410) at forward.c:363
#4  0x0000003b16a079d1 in start_thread (arg=0x7fda578f8700) at pthread_create.c:301
#5  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 15 (Thread 0x7fda7d21e700 (LWP 7111)):
#0  0x0000003b16a0822d in pthread_join (threadid=140575740200704, thread_return=0x0)
    at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fd88027c7e0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda7d21e700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 14 (Thread 0x7fda7f6bd700 (LWP 39621)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1  0x00000000005ae31f in _agent (x=0x0) at slurmdbd_defs.c:2113
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda7f6bd700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 13 (Thread 0x7fda7dd29700 (LWP 39637)):
#0  0x0000003b162e12e3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x000000000043534f in _slurmctld_rpc_mgr (no_data=0x0) at controller.c:965
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda7dd29700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 12 (Thread 0x7fda7dc28700 (LWP 39638)):
#0  do_sigwait (set=<value optimized out>, sig=0x7fda7dc27eac)
    at ../sysdeps/unix/sysv/linux/sigwait.c:65
#1  __sigwait (set=<value optimized out>, sig=0x7fda7dc27eac)
    at ../sysdeps/unix/sysv/linux/sigwait.c:100
#2  0x0000000000434cc0 in _slurmctld_signal_hand (no_data=0x0) at controller.c:827
#3  0x0000003b16a079d1 in start_thread (arg=0x7fda7dc28700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 11 (Thread 0x7fda574f4700 (LWP 7121)):
#0  0x00000000004deec8 in slurm_xmalloc (size=24, clear=false, file=0x64250a "pack.c", line=150, 
    func=0x642509 "") at xmalloc.c:86
#1  0x00000000004f6c7e in init_buf (size=16384) at pack.c:150
#2  0x00000000005230ea in slurm_send_node_msg (fd=11, msg=0x7fda574f3cd0)
    at slurm_protocol_api.c:3281
#3  0x00000000004973f2 in _slurm_rpc_dump_jobs_user (msg=0x7fd88075f330) at proc_req.c:1239
#4  0x00000000004941f5 in slurmctld_req (msg=0x7fd88075f330, arg=0x7fda74003ef0)
    at proc_req.c:271
#5  0x00000000004357eb in _service_connection (arg=0x7fda74003ef0) at controller.c:1070
#6  0x0000003b16a079d1 in start_thread (arg=0x7fda574f4700) at pthread_create.c:301
#7  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 10 (Thread 0x7fda566e6700 (LWP 7113)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004e2ceb in start_msg_tree (hl=0x7fd880329450, msg=0x7fda566e5e20, timeout=0)
    at forward.c:647
#2  0x00000000005241ce in slurm_send_recv_msgs (nodelist=0x7fd880380330 "c436", 
    msg=0x7fda566e5e20, timeout=0, quiet=true) at slurm_protocol_api.c:3987
#3  0x00000000004305c5 in _thread_per_group_rpc (args=0x7fd880230df0) at agent.c:879
#4  0x0000003b16a079d1 in start_thread (arg=0x7fda566e6700) at pthread_create.c:301
#5  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 9 (Thread 0x7fda7db27700 (LWP 39639)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004b7ac4 in slurmctld_state_save (no_data=0x0) at state_save.c:211
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda7db27700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 8 (Thread 0x7fda570f0700 (LWP 7112)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162e1b54 in usleep (useconds=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x000000000042f8d3 in _wdog (args=0x7fd88012ced0) at agent.c:573
#3  0x0000003b16a079d1 in start_thread (arg=0x7fda570f0700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 7 (Thread 0x7fda562e2700 (LWP 7108)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162ac870 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00000000005242d8 in slurm_send_addr_recv_msgs (msg=0x7fda562e1e20, 
    name=0x7fd88031e0d0 "c436", timeout=99000) at slurm_protocol_api.c:4021
#3  0x00000000004e1f27 in _fwd_tree_thread (arg=0x7fd8803293c0) at forward.c:363
#4  0x0000003b16a079d1 in start_thread (arg=0x7fda562e2700) at pthread_create.c:301
#5  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 6 (Thread 0x7fda57dfd700 (LWP 7105)):
#0  0x0000003b16a0822d in pthread_join (threadid=140575737042688, thread_return=0x0)
    at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fd88006dde0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda57dfd700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 5 (Thread 0x7fda7de2a700 (LWP 39636)):
#0  0x0000003b16a0822d in pthread_join (threadid=140576392656640, thread_return=0x0)
    at pthread_join.c:89
#1  0x00007fda7df31b0c in _cleanup_thread (no_data=0x0) at priority_multifactor.c:1389
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda7de2a700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7fda7f9c2700 (LWP 39618)):
#0  0x0000003b16a0822d in pthread_join (threadid=140576421590784, thread_return=0x0)
    at pthread_join.c:89
#1  0x00007fda7fac75c1 in _cleanup_thread (no_data=0x0) at accounting_storage_slurmdbd.c:423
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda7f9c2700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7fda821ca700 (LWP 39615)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162e1b54 in usleep (useconds=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x0000000000436552 in _slurmctld_background (no_data=0x0) at controller.c:1449
#3  0x00000000004346b8 in main (argc=1, argv=0x7ffff9075648) at controller.c:561

Thread 2 (Thread 0x7fda7fac3700 (LWP 39617)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162ac870 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007fda7fac759c in _set_db_inx_thread (no_data=0x0) at accounting_storage_slurmdbd.c:415
#3  0x0000003b16a079d1 in start_thread (arg=0x7fda7fac3700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7fda576f6700 (LWP 7120)):
#0  0x000000000044a33c in find_job_record (job_id=3415818) at job_mgr.c:2626
#1  0x000000000045db23 in _set_job_id (job_ptr=0x7fd880227cb0) at job_mgr.c:8423
#2  0x0000000000453f51 in _copy_job_desc_to_job_record (job_desc=0x7fd8800c1710, 
    job_rec_ptr=0x7fda576f5c00, req_bitmap=0x7fda576f5940, exc_bitmap=0x7fda576f5938)
    at job_mgr.c:6371
#3  0x0000000000451713 in _job_create (job_desc=0x7fd8800c1710, allocate=0, will_run=0, 
    job_pptr=0x7fda576f5c00, submit_uid=34018, err_msg=0x7fda576f5cb0) at job_mgr.c:5458
#4  0x000000000044d395 in job_allocate (job_specs=0x7fd8800c1710, immediate=0, will_run=0, 
    resp=0x0, allocate=0, submit_uid=34018, job_pptr=0x7fda576f5cc0, err_msg=0x7fda576f5cb0)
    at job_mgr.c:3758
#5  0x000000000049d119 in _slurm_rpc_submit_batch_job (msg=0x7fd8802e1e70) at proc_req.c:3189
#6  0x00000000004945b3 in slurmctld_req (msg=0x7fd8802e1e70, arg=0x7fda74003ec0)
    at proc_req.c:380
#7  0x00000000004357eb in _service_connection (arg=0x7fda74003ec0) at controller.c:1070
#8  0x0000003b16a079d1 in start_thread (arg=0x7fda576f6700) at pthread_create.c:301
#9  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Comment 8 CSC sysadmins 2014-12-10 01:44:40 MST
Our secondary controller started to crash too; here is a valgrind log from it. It looks completely different, but maybe it explains the memory corruption?

==00:03:00:33.169 42152== Rerun with --leak-check=full to see details of leaked memory
==00:03:00:33.169 42152== 
==00:03:00:33.169 42152== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 18 from 9)
==00:03:00:33.169 42152== 
==00:03:00:33.169 42152== 1 errors in context 1 of 1:
==00:03:00:33.169 42152== Invalid read of size 8
==00:03:00:33.169 42152==    at 0x4D42C5: _delete_assoc_hash (assoc_mgr.c:274)
==00:03:00:33.169 42152==    by 0x4DEC6A: assoc_mgr_set_missing_uids (assoc_mgr.c:4583)
==00:03:00:33.170 42152==    by 0x43720C: _slurmctld_background (controller.c:1706)
==00:03:00:33.170 42152==    by 0x4346B7: main (controller.c:561)
==00:03:00:33.170 42152==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
==00:03:00:33.170 42152== 
--00:03:00:33.170 42152-- 
--00:03:00:33.170 42152-- used_suppression:     18 dl-hack3-cond-1
==00:03:00:33.170 42152== 
==00:03:00:33.170 42152== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 18 from 9)
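An aside on reading this valgrind report: "Invalid read of size 8 ... Address 0x10" is the classic signature of a NULL base pointer plus a small struct-field offset. A hedged illustration (the struct layout below is made up for the arithmetic; it is not Slurm's actual assoc hash node):

```c
#include <stddef.h>

/* Hypothetical LP64 layout: two 8-byte fields push the third field to
 * offset 0x10. If a hash-table walk follows a NULL node pointer and
 * then reads that field, the faulting address is NULL + 0x10 = 0x10,
 * which valgrind reports as "Address 0x10 is not stack'd, malloc'd or
 * (recently) free'd". */
struct assoc_node {
    long lft;                 /* offset 0x00 */
    long rgt;                 /* offset 0x08 */
    struct assoc_node *next;  /* offset 0x10: the read that faults */
};

/* offsetof() shows the arithmetic behind the reported address. */
size_t faulting_offset(void)
{
    return offsetof(struct assoc_node, next);
}
```

This is consistent with `_delete_assoc_hash` dereferencing a NULL entry while walking the assoc hash.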
Comment 9 Danny Auble 2014-12-10 01:55:37 MST
This was fixed in commit https://github.com/SchedMD/slurm/commit/7f899cf080044c71d876e9918a53f4a789aed516

It does look similar though. 

Comment 10 Danny Auble 2014-12-10 07:48:33 MST
Tommi, the more we think about this, the more these two crashes look related. I am guessing you are adding and removing users from the system quite often, as that is the only time this kind of thing would happen, particularly when a user is added to Slurm before they have a uid on the slurmctld node(s).

Does this fit your scenario?  This is obviously what happened in your second crash, but was this happening around the time of the first crash as well?
Comment 11 CSC sysadmins 2014-12-10 22:53:09 MST
Hi,

We had a problem with the LDAP configuration on the secondary slurmctld server, which explains the missing-uid crash there. But we did not have that configuration error on the primary server. Our sssd/ldap setup has been a bit flaky, so it's possible that it happened on the primary server also.

I looked through the commit logs and decided to take the latest version of the 14.11 branch, because there were a bunch of other fixes as well. Time will tell if this problem reappears.

Regards,
Tommi Tervo
CSC
Comment 12 Danny Auble 2014-12-10 23:38:56 MST
We will probably tag 14.11.2 today, if that helps. But if you are having problems keeping users in sync, this patch is definitely something you want.

Comment 13 David Bigagli 2014-12-12 08:54:26 MST
Please reopen if necessary.

David
Comment 14 CSC sysadmins 2014-12-15 18:00:32 MST
Hi,

The problem reappeared.  Slurm is now built from the 14.11 branch, commit 64e5324d:

Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000044a3ec in find_job_record (job_id=3470037) at job_mgr.c:2626
2626                    if (job_ptr->job_id == job_id)
Missing separate debuginfos, use: debuginfo-install slurm-14.11.1-2.x86_64
(gdb) thread apply all bt

Thread 52 (Thread 0x7fcb45e6c700 (LWP 28170)):
#0  0x0000003b16a0822d in pthread_join (threadid=140511015225088, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb403741f0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb45e6c700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 51 (Thread 0x7fcb2eeee700 (LWP 28128)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9305d1cd0, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc9305d1cd0) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb2eeee700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 50 (Thread 0x7fcb45163700 (LWP 28151)):
#0  0x0000003b16a0822d in pthread_join (threadid=140511006803712, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb40753300) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb45163700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 49 (Thread 0x7fcb2fafa700 (LWP 28160)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930357950, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc930357950) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb2fafa700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 48 (Thread 0x7fcb45466700 (LWP 28164)):
#0  0x0000003b16a0822d in pthread_join (threadid=140511047309056, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb40335fe0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb45466700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 47 (Thread 0x7fcb2ebeb700 (LWP 28143)):
#0  0x0000003b16a0822d in pthread_join (threadid=140511007856384, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb401d3670) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb2ebeb700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 46 (Thread 0x7fcb44b5d700 (LWP 28144)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930174c80, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc930174c80) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb44b5d700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 45 (Thread 0x7fcb2fdfd700 (LWP 28139)):
#0  0x0000003b16a0822d in pthread_join (threadid=140511043098368, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb40cec5c0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb2fdfd700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 44 (Thread 0x7fcb45f6d700 (LWP 28155)):
#0  0x0000003b16a0822d in pthread_join (threadid=140511003645696, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb402fee80) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb45f6d700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 43 (Thread 0x7fcb44759700 (LWP 28156)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9305830f0, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc9305830f0) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb44759700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 42 (Thread 0x7fcb4606e700 (LWP 28159)):
#0  0x0000003b16a0822d in pthread_join (threadid=140510655129344, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb403c8010) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb4606e700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 41 (Thread 0x7fcb2e9e9700 (LWP 28136)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930031280, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc930031280) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb2e9e9700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 40 (Thread 0x7fcb46cf9700 (LWP 28140)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9300a4900, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc9300a4900) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb46cf9700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 39 (Thread 0x7fcb45264700 (LWP 28171)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9300029a0, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc9300029a0) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb45264700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 38 (Thread 0x7fcb44a5c700 (LWP 28152)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9301608b0, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc9301608b0) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb44a5c700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 37 (Thread 0x7fcb49d0b700 (LWP 5954)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162e1b54 in usleep (useconds=<value optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x0000000000436552 in _slurmctld_background (no_data=0x0) at controller.c:1449
#3  0x00000000004346b8 in main (argc=1, argv=0x7fff4beb2a48) at controller.c:561

Thread 36 (Thread 0x7fcb470fd700 (LWP 28165)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930364d60, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc930364d60) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb470fd700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 35 (Thread 0x7fcb46ffc700 (LWP 28148)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9303ec600, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc9303ec600) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb46ffc700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 34 (Thread 0x7fcb2eaea700 (LWP 28147)):
#0  0x0000003b16a0822d in pthread_join (threadid=140511046256384, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb4000dfc0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb2eaea700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 33 (Thread 0x7fcb44658700 (LWP 28127)):
#0  0x0000003b16a0822d in pthread_join (threadid=140510642497280, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb40df43e0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb44658700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 32 (Thread 0x7fcb47503700 (LWP 5957)):
#0  0x0000003b16a0822d in pthread_join (threadid=140511052580608, thread_return=0x0) at pthread_join.c:89
#1  0x00007fcb476085c1 in _cleanup_thread (no_data=0x0) at accounting_storage_slurmdbd.c:423
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb47503700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 31 (Thread 0x7fcb2ecec700 (LWP 28120)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc93013e370, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc93013e370) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb2ecec700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 30 (Thread 0x7fcb2f8f8700 (LWP 28135)):
#0  0x0000003b16a0822d in pthread_join (threadid=140510637233920, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb40faa550) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb2f8f8700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 29 (Thread 0x7fcb46dfa700 (LWP 28116)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9307c9450, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc9307c9450) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb46dfa700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 28 (Thread 0x7fcb44254700 (LWP 28124)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9300dbe40, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc9300dbe40) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb44254700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 27 (Thread 0x7fcb471fe700 (LWP 5960)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1  0x00000000005ae4a1 in _agent (x=0x0) at slurmdbd_defs.c:2115
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb471fe700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 26 (Thread 0x7fcb44153700 (LWP 28104)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9300326f0, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc9300326f0) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb44153700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 25 (Thread 0x7fcb2f5f5700 (LWP 28112)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930095820, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc930095820) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb2f5f5700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 24 (Thread 0x7fcb45062700 (LWP 28100)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930496060, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc930496060) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb45062700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 23 (Thread 0x7fcb2f9f9700 (LWP 28132)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc93007aad0, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc93007aad0) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb2f9f9700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 22 (Thread 0x7fcb44456700 (LWP 28108)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930132930, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc930132930) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb44456700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 21 (Thread 0x7fcb44355700 (LWP 28131)):
#0  0x0000003b16a0822d in pthread_join (threadid=140510654076672, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb40666d10) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb44355700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 20 (Thread 0x7fcb2fbfb700 (LWP 28096)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9301f5e50, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc9301f5e50) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb2fbfb700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 19 (Thread 0x7fcb47604700 (LWP 5956)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162ac870 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007fcb4760859c in _set_db_inx_thread (no_data=0x0) at accounting_storage_slurmdbd.c:415
#3  0x0000003b16a079d1 in start_thread (arg=0x7fcb47604700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 18 (Thread 0x7fcb2f4f4700 (LWP 28087)):
#0  0x0000003b16a0822d in pthread_join (threadid=140510596794112, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb402a97e0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb2f4f4700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 17 (Thread 0x7fcb4596b700 (LWP 5991)):
#0  0x0000003b16a0822d in pthread_join (threadid=140511023646464, thread_return=0x0) at pthread_join.c:89
#1  0x00007fcb45a72b0c in _cleanup_thread (no_data=0x0) at priority_multifactor.c:1389
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb4596b700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 16 (Thread 0x7fcb46bf8700 (LWP 28119)):
#0  0x0000003b16a0822d in pthread_join (threadid=140510640391936, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb4049bd70) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb46bf8700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 15 (Thread 0x7fcb45769700 (LWP 5993)):
#0  do_sigwait (set=<value optimized out>, sig=0x7fcb45768eac) at ../sysdeps/unix/sysv/linux/sigwait.c:65
#1  __sigwait (set=<value optimized out>, sig=0x7fcb45768eac) at ../sysdeps/unix/sysv/linux/sigwait.c:100
#2  0x0000000000434cc0 in _slurmctld_signal_hand (no_data=0x0) at controller.c:827
#3  0x0000003b16a079d1 in start_thread (arg=0x7fcb45769700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 14 (Thread 0x7fcb2eded700 (LWP 28123)):
#0  0x0000003b16a0822d in pthread_join (threadid=140510998382336, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb40e30a20) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb2eded700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 13 (Thread 0x7fcb4495b700 (LWP 28115)):
#0  0x0000003b16a0822d in pthread_join (threadid=140511044151040, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb40857390) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb4495b700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 12 (Thread 0x7fcb2d469700 (LWP 28111)):
#0  0x0000003b16a0822d in pthread_join (threadid=140510649865984, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb40427fa0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb2d469700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 11 (Thread 0x7fcb2e2ef700 (LWP 28103)):
#0  0x0000003b16a0822d in pthread_join (threadid=140510997329664, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb4015a190) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb2e2ef700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 10 (Thread 0x7fcb44c5e700 (LWP 28107)):
#0  0x0000003b16a0822d in pthread_join (threadid=140511000487680, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb4019b3a0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb44c5e700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 9 (Thread 0x7fcb2f6f6700 (LWP 28099)):
#0  0x0000003b16a0822d in pthread_join (threadid=140511013119744, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb404effb0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb2f6f6700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 8 (Thread 0x7fcb2f1f1700 (LWP 28092)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930111e00, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc930111e00) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb2f1f1700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 7 (Thread 0x7fcb44d5f700 (LWP 28095)):
#0  0x0000003b16a0822d in pthread_join (threadid=140510656182016, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb40874e10) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb44d5f700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 6 (Thread 0x7fcb45365700 (LWP 28091)):
#0  0x0000003b16a0822d in pthread_join (threadid=140510645655296, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fcb40bcd560) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb45365700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 5 (Thread 0x7fcb2c358700 (LWP 28088)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256
#2  0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86
#3  0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9306a2bd0, no_resp_cnt=0, retry_cnt=0) at agent.c:694
#4  0x000000000042fac6 in _wdog (args=0x7fc9306a2bd0) at agent.c:603
#5  0x0000003b16a079d1 in start_thread (arg=0x7fcb2c358700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7fcb4586a700 (LWP 5992)):
#0  0x0000003b162e12e3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x000000000043534f in _slurmctld_rpc_mgr (no_data=0x0) at controller.c:965
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb4586a700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7fcb45668700 (LWP 5994)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1  0x00000000004b7be4 in slurmctld_state_save (no_data=0x0) at state_save.c:208
#2  0x0000003b16a079d1 in start_thread (arg=0x7fcb45668700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 2 (Thread 0x7fcb45a6c700 (LWP 5990)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162ac870 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007fcb45a729ca in _decay_thread (no_data=0x0) at priority_multifactor.c:1335
#3  0x0000003b16a079d1 in start_thread (arg=0x7fcb45a6c700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7fcb46471700 (LWP 5980)):
#0  0x000000000044a3ec in find_job_record (job_id=3470037) at job_mgr.c:2626
#1  0x000000000045dbd3 in _set_job_id (job_ptr=0x7fcb4092d6d0) at job_mgr.c:8423
#2  0x000000000044c57c in _job_rec_copy (job_ptr=0x7fcb4092d6d0) at job_mgr.c:3464
#3  0x000000000046bc6e in job_array_post_sched (job_ptr=0x7fcb4092d6d0) at job_mgr.c:13941
#4  0x0000000000486a42 in select_nodes (job_ptr=0x7fcb4092d6d0, test_only=false, select_node_bitmap=0x0, err_msg=0x0) at node_scheduler.c:1829
#5  0x00007fcb4647805f in _start_job (job_ptr=0x7fcb4092d6d0, resv_bitmap=0x7fcb402fb770) at backfill.c:1401
#6  0x00007fcb464775cb in _attempt_backfill () at backfill.c:1182
#7  0x00007fcb46475b7f in backfill_agent (args=0x0) at backfill.c:628
#8  0x0000003b16a079d1 in start_thread (arg=0x7fcb46471700) at pthread_create.c:301
#9  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
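For context on Thread 1: the faulting line at job_mgr.c:2626 compares job_ptr->job_id while walking the controller's job list, and the core shows job_ptr = 0x1, i.e. the iterator picked up a garbage pointer.  The sketch below shows that traversal pattern with hypothetical types (not the real Slurm structures) to make the failure mode concrete.

```c
/* Minimal sketch of the pattern in the backtrace: find_job_record()
 * walks a list of job records comparing ids.  struct job_record here
 * is a hypothetical stand-in, not the Slurm definition. */
#include <stddef.h>
#include <stdint.h>

struct job_record {
    uint32_t job_id;
    struct job_record *next;
};

/* If another thread mutates the list without holding the job write
 * lock, an iterator can read a corrupt next pointer (0x1 in the core),
 * and the job_id dereference below segfaults. */
struct job_record *find_job(struct job_record *head, uint32_t job_id)
{
    for (struct job_record *p = head; p != NULL; p = p->next) {
        if (p->job_id == job_id)  /* the faulting compare */
            return p;
    }
    return NULL;
}
```

This is consistent with the rest of the dump: dozens of _wdog threads are queued on the JOB_LOCK write lock while the backfill thread, mid job-array copy in _job_rec_copy()/_set_job_id(), hits the corrupt entry.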
Comment 15 CSC sysadmins 2014-12-15 18:05:22 MST
Here is a log snippet from just before the crash:

[2014-12-15T22:17:01.603] Warning: Note very large processing time from _slurm_rpc_complete_batch_script: usec=1577162 began=22:17:00.026
[2014-12-15T22:17:01.603] Warning: Note very large processing time from _slurm_rpc_step_complete: usec=1275776 began=22:17:00.328
[2014-12-15T22:17:01.603] job_complete: JobID=3465317_385 (3468649) State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2014-12-15T22:17:01.604] job_complete: JobID=3465317_385 (3468649) State=0x8003 NodeCnt=1 done
[2014-12-15T22:17:01.604] Warning: Note very large processing time from _slurm_rpc_complete_batch_script: usec=1455278 began=22:17:00.148
[2014-12-15T22:17:01.604] Warning: Note very large processing time from _slurm_rpc_job_alloc_info_lite: usec=1173343 began=22:17:00.430
[2014-12-15T22:17:01.604] Warning: Note very large processing time from _slurm_rpc_dump_job_user: usec=1997165 began=22:16:59.607
[2014-12-15T22:17:04.201] job_complete: JobID=3465317_667 (3469323) State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2014-12-15T22:17:04.201] job_complete: JobID=3465317_667 (3469323) State=0x8003 NodeCnt=1 done
[2014-12-15T22:17:04.201] Warning: Note very large processing time from _slurmctld_background: usec=1597417 began=22:17:02.604
[2014-12-15T22:17:06.802] job_complete: JobID=3465317_333 (3468530) State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2014-12-15T22:17:06.802] job_complete: JobID=3465317_333 (3468530) State=0x8003 NodeCnt=1 done
[2014-12-15T22:17:06.802] Warning: Note very large processing time from _slurm_rpc_complete_batch_script: usec=1612015 began=22:17:05.190
[2014-12-15T22:17:06.802] Warning: Note very large processing time from _slurmctld_background: usec=1600491 began=22:17:05.202
[2014-12-15T22:17:06.802] job_complete: JobID=3465257_560 (3469953) State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2014-12-15T22:17:06.802] job_complete: JobID=3465257_560 (3469953) State=0x8003 NodeCnt=1 done
[2014-12-15T22:17:06.803] job_complete: JobID=3463243_425 (3467172) State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2014-12-15T22:17:06.803] job_complete: JobID=3463243_425 (3467172) State=0x8003 NodeCnt=1 done
[2014-12-15T22:17:09.364] Warning: Note very large processing time from _slurmctld_background: usec=1561567 began=22:17:07.803
[2014-12-15T22:17:09.397] job_complete: JobID=3465257_221 (3469273) State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2014-12-15T22:17:09.397] job_complete: JobID=3465257_221 (3469273) State=0x8003 NodeCnt=1 done
[2014-12-15T22:17:09.866] backfill: Started JobId=3465257_563 (3469985) on c94
[2014-12-15T22:17:09.868] backfill: Started JobId=3465257_564 (3469986) on c132
[2014-12-15T22:17:09.870] backfill: Started JobId=3465257_565 (3469987) on c389
[2014-12-15T22:17:09.871] backfill: Started JobId=3465257_566 (3469988) on c465
[2014-12-15T22:17:09.873] backfill: Started JobId=3465257_567 (3469989) on c94
[2014-12-15T22:17:09.875] backfill: Started JobId=3465257_568 (3469990) on c132
[2014-12-15T22:17:09.876] backfill: Started JobId=3465257_569 (3469991) on c204
[2014-12-15T22:17:09.878] backfill: Started JobId=3465257_570 (3469992) on c208
[2014-12-15T22:17:09.879] backfill: Started JobId=3465257_571 (3469993) on c270
[2014-12-15T22:17:09.881] backfill: Started JobId=3465257_572 (3469994) on c296
[2014-12-15T22:17:09.883] backfill: Started JobId=3465257_573 (3469995) on c297
[2014-12-15T22:17:09.884] backfill: Started JobId=3465257_574 (3469996) on c299
[2014-12-15T22:17:09.886] backfill: Started JobId=3465257_575 (3469997) on c365
[2014-12-15T22:17:09.887] backfill: Started JobId=3465257_576 (3469998) on c381
[2014-12-15T22:17:09.889] backfill: Started JobId=3465257_577 (3469999) on c389
[2014-12-15T22:17:09.890] backfill: Started JobId=3465257_578 (3470000) on c397
[2014-12-15T22:17:09.892] backfill: Started JobId=3465257_579 (3470001) on c419
[2014-12-15T22:17:09.894] backfill: Started JobId=3465257_580 (3470002) on c432
[2014-12-15T22:17:09.895] backfill: Started JobId=3465257_581 (3470003) on c434
[2014-12-15T22:17:09.897] backfill: Started JobId=3465257_582 (3470004) on c465
[2014-12-15T22:17:09.898] backfill: Started JobId=3465257_583 (3470005) on c472
[2014-12-15T22:17:09.900] backfill: Started JobId=3465257_584 (3470006) on c576
[2014-12-15T22:17:09.901] backfill: Started JobId=3465257_585 (3470007) on c16
[2014-12-15T22:17:09.903] backfill: Started JobId=3465257_586 (3470008) on c66
[2014-12-15T22:17:09.905] backfill: Started JobId=3465257_587 (3470009) on c90
[2014-12-15T22:17:09.906] backfill: Started JobId=3465257_588 (3470010) on c94
[2014-12-15T22:17:09.908] backfill: Started JobId=3465257_589 (3470011) on c132
[2014-12-15T22:17:09.909] backfill: Started JobId=3465257_590 (3470012) on c143
[2014-12-15T22:17:09.911] backfill: Started JobId=3465257_591 (3470013) on c145
[2014-12-15T22:17:09.912] backfill: Started JobId=3465318_302 (3470014) on c160
[2014-12-15T22:17:09.914] backfill: Started JobId=3465318_303 (3470015) on c184
[2014-12-15T22:17:09.916] backfill: Started JobId=3465318_304 (3470016) on c204
[2014-12-15T22:17:09.917] backfill: Started JobId=3465318_305 (3470017) on c208
[2014-12-15T22:17:09.919] backfill: Started JobId=3465318_306 (3470018) on c214
[2014-12-15T22:17:09.920] backfill: Started JobId=3465318_307 (3470019) on c216
[2014-12-15T22:17:09.922] backfill: Started JobId=3465318_308 (3470020) on c217
[2014-12-15T22:17:09.924] backfill: Started JobId=3465318_309 (3470021) on c218
[2014-12-15T22:17:09.925] backfill: Started JobId=3465318_310 (3470022) on c221
[2014-12-15T22:17:09.927] backfill: Started JobId=3465318_311 (3470023) on c222
[2014-12-15T22:17:09.928] backfill: Started JobId=3465318_312 (3470024) on c224
[2014-12-15T22:17:09.930] backfill: Started JobId=3465318_313 (3470025) on c238
[2014-12-15T22:17:09.931] backfill: Started JobId=3465318_314 (3470026) on c250
[2014-12-15T22:17:09.933] backfill: Started JobId=3465318_315 (3470027) on c261
[2014-12-15T22:17:09.933] WARNING: agent retry_list size is 22
[2014-12-15T22:17:09.933]    retry_list msg_type=4005,4005,4005,4005,4005
[2014-12-15T22:17:09.935] backfill: Started JobId=3465318_316 (3470028) on c266
[2014-12-15T22:17:09.936] backfill: Started JobId=3465318_317 (3470029) on c267
[2014-12-15T22:17:09.938] backfill: Started JobId=3465318_318 (3470030) on c270
[2014-12-15T22:17:09.939] backfill: Started JobId=3465318_319 (3470031) on c282
[2014-12-15T22:17:09.941] backfill: Started JobId=3465318_320 (3470032) on c286
[2014-12-15T22:17:09.942] backfill: Started JobId=3465318_321 (3470033) on c296
[2014-12-15T22:17:09.944] backfill: Started JobId=3465318_322 (3470034) on c297
[2014-12-15T22:17:09.946] backfill: Started JobId=3465318_323 (3470035) on c299
[2014-12-15T22:17:09.947] backfill: Started JobId=3465318_324 (3470036) on c304
[2014-12-15T22:17:38.255] pidfile not locked, assuming no running daemon
[2014-12-15T22:17:38.283] slurmctld version 14.11.1-2 started on cluster csc
Comment 16 Moe Jette 2014-12-16 07:49:31 MST
I was able to recreate a similar problem (a broken hash table) and fix it. I believe the commit below addresses the crash you are seeing:
https://github.com/SchedMD/slurm/commit/f293ce7ccdef10bfd4a0d0b92d40f59a81b3b13b
Comment 17 Brian Christiansen 2014-12-16 09:53:42 MST
I was able to reproduce the same segfault you are seeing. I've confirmed that Moe's patch fixes the problem.
Comment 18 CSC sysadmins 2014-12-16 20:51:03 MST
Thanks for the patch, I've upgraded our Slurm again.
Comment 19 Moe Jette 2014-12-17 06:36:47 MST
We are pretty confident that the patch fixes the problem. Please re-open if necessary.
Comment 20 John Hanks 2014-12-29 01:06:49 MST
*** Ticket 1342 has been marked as a duplicate of this ticket. ***