Hi, slurmctld crashed earlier this evening; here is a backtrace from the core:

Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000044a33c in find_job_record (job_id=3415818) at job_mgr.c:2626
2626            if (job_ptr->job_id == job_id)
Missing separate debuginfos, use: debuginfo-install slurm-14.11.1-1.el6.x86_64
(gdb) bt full
#0  0x000000000044a33c in find_job_record (job_id=3415818) at job_mgr.c:2626
        job_ptr = 0x1
#1  0x000000000045db23 in _set_job_id (job_ptr=0x7fd880227cb0) at job_mgr.c:8423
        i = 0
        new_id = 3415818
        max_jobs = 4292671760
#2  0x0000000000453f51 in _copy_job_desc_to_job_record (job_desc=0x7fd8800c1710, job_rec_ptr=0x7fda576f5c00, req_bitmap=0x7fda576f5940, exc_bitmap=0x7fda576f5938) at job_mgr.c:6371
        error_code = 0
        detail_ptr = 0x7fd8800c1710
        job_ptr = 0x7fd880227cb0
        __func__ = "_copy_job_desc_to_job_record"
#3  0x0000000000451713 in _job_create (job_desc=0x7fd8800c1710, allocate=0, will_run=0, job_pptr=0x7fda576f5c00, submit_uid=34018, err_msg=0x7fda576f5cb0) at job_mgr.c:5458
        launch_type_poe = 0
        error_code = 0
        i = 0
        qos_error = 0
        part_ptr = 0x7fd88075bd10
        part_ptr_list = 0x0
        req_bitmap = 0x0
        exc_bitmap = 0x0
        job_ptr = 0x0
        assoc_rec = {accounting_list = 0x0, acct = 0x25d31a0 "csc", assoc_next = 0x0, assoc_next_id = 0x0, cluster = 0x2495370 "csc", def_qos_id = 0, grp_cpu_mins = 4294967295, grp_cpu_run_mins = 4294967295, grp_cpus = 1024, grp_jobs = 4294967295, grp_mem = 4294967295, grp_nodes = 4294967295, grp_submit_jobs = 4294967295, grp_wall = 4294967295, id = 1230, is_def = 1, lft = 6657, max_cpu_mins_pj = 4294967295, max_cpu_run_mins = 4294967295, max_cpus_pj = 4294967295, max_jobs = 4294967295, max_nodes_pj = 4294967295, max_submit_jobs = 896, max_wall_pj = 4294967295, parent_acct = 0x0, parent_id = 6, partition = 0x7fd8800bfc70 "serial", qos_list = 0x2780960, rgt = 6630, shares_raw = 1, uid = 34018, usage = 0x0, user = 0x24c1160 "xxxxxx"}
        assoc_ptr = 0x278d010
        license_list = 0x0
        valid = true
        qos_rec = {description = 0x24a1380 "Normal QOS default", id = 1, flags = 0, grace_time = 0, grp_cpu_mins = 4294967295, grp_cpu_run_mins = 4294967295, grp_cpus = 4294967295, grp_jobs = 4294967295, grp_mem = 4294967295, grp_nodes = 4294967295, grp_submit_jobs = 4294967295, grp_wall = 4294967295, max_cpu_mins_pj = 4294967295, max_cpu_run_mins_pu = 4294967295, max_cpus_pj = 4294967295, max_cpus_pu = 4294967295, max_jobs_pu = 4294967295, max_nodes_pj = 4294967295, max_nodes_pu = 4294967295, max_submit_jobs_pu = 4294967295, max_wall_pj = 4294967295, min_cpus_pj = 1, name = 0x2495320 "normal", preempt_bitstr = 0x0, preempt_list = 0x0, preempt_mode = 0, priority = 0, usage = 0x0, usage_factor = 1, usage_thres = 0}
        qos_ptr = 0x24a12c0
        user_submit_priority = 4294967294
        node_scaling = 1
        cpus_per_mp = 1
        acct_policy_limit_set = {max_cpus = 0, max_nodes = 0, min_cpus = 0, min_nodes = 0, pn_min_memory = 0, qos = 0, time = 0}
#4  0x000000000044d395 in job_allocate (job_specs=0x7fd8800c1710, immediate=0, will_run=0, resp=0x0, allocate=0, submit_uid=34018, job_pptr=0x7fda576f5cc0, err_msg=0x7fda576f5cb0) at job_mgr.c:3758
        defer_sched = 1
        error_code = 0
        i = 0
        no_alloc = false
        top_prio = false
        test_only = 127
        too_fragmented = 218
        independent = 125
        job_ptr = 0x0
        now = 1418137557
        __func__ = "job_allocate"
#5  0x000000000049d119 in _slurm_rpc_submit_batch_job (msg=0x7fd8802e1e70) at proc_req.c:3189
        active_rpc_cnt = 1
        error_code = 0
        tv1 = {tv_sec = 1418137557, tv_usec = 4239}
        tv2 = {tv_sec = 140567837220096, tv_usec = 30787792493784}
        tv_str = '\000' <repeats 19 times>
        delta_t = 5208805
        step_id = 0
        job_ptr = 0x0
        response_msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, conn_fd = -1, data = 0x0, data_size = 0, flags = 0, msg_type = 65534, protocol_version = 7168, forward = {cnt = 0, init = 65534, nodelist = 0x0, timeout = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        submit_msg = {job_id = 1466916080, step_id = 32730, error_code = 5308707}
        job_desc_msg = 0x7fd8800c1710
        job_read_lock = {config = READ_LOCK, job = READ_LOCK, node = READ_LOCK, partition = READ_LOCK}
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = READ_LOCK, partition = READ_LOCK}
        uid = 34018
        err_msg = 0x0
#6  0x00000000004945b3 in slurmctld_req (msg=0x7fd8802e1e70, arg=0x7fda74003ec0) at proc_req.c:380
        tv1 = {tv_sec = 1418137556, tv_usec = 615835}
        tv2 = {tv_sec = 4472544, tv_usec = 140567837220096}
        tv_str = '\000' <repeats 19 times>
        delta_t = 371696738
        i = 169
        rpc_type_index = 8
        rpc_user_index = 169
        rpc_uid = 34018
        __func__ = "slurmctld_req"
#7  0x00000000004357eb in _service_connection (arg=0x7fda74003ec0) at controller.c:1070
        conn = 0x7fda74003ec0
        return_code = 0x0
        msg = 0x7fd8802e1e70
        __func__ = "_service_connection"
#8  0x0000003b16a079d1 in start_thread (arg=0x7fda576f6700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7fda576f6700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140575746516736, 2604083960062317716, 140576390549008, 140575746517440, 0, 3, -2623559752364419948, 2618207461584049300}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#9  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
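For context on frame #0: `job_ptr = 0x1` means the record pointer in the job lookup chain is corrupted, so the comparison at job_mgr.c:2626 dereferences garbage. The sketch below is a simplified, hypothetical model (the struct names, bucket count, and layout are my assumptions, not Slurm's actual code) of that kind of hash-bucket walk, showing exactly where a corrupted chain pointer would fault:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of a job hash table: each bucket holds
 * a singly linked chain of job records threaded through job_next. */
struct job_record {
    uint32_t job_id;
    struct job_record *job_next;
};

#define HASH_SIZE 1000
static struct job_record *job_hash[HASH_SIZE];

/* Walk the bucket chain, as find_job_record() does. If a chain pointer
 * is corrupted (the core shows job_ptr = 0x1), the job_ptr->job_id load
 * in the comparison below is where SIGSEGV fires. */
static struct job_record *find_job(uint32_t job_id)
{
    struct job_record *job_ptr = job_hash[job_id % HASH_SIZE];
    while (job_ptr) {
        if (job_ptr->job_id == job_id)  /* faults when job_ptr == 0x1 */
            return job_ptr;
        job_ptr = job_ptr->job_next;
    }
    return NULL;
}

/* Push a record onto the front of its bucket's chain. */
static void insert_job(struct job_record *rec)
{
    struct job_record **head = &job_hash[rec->job_id % HASH_SIZE];
    rec->job_next = *head;
    *head = rec;
}
```

The walk itself is correct; the crash implies something else wrote through a stale pointer into the chain, which is why this looks like memory corruption rather than a bug in `find_job_record` itself.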
Could you send your slurm.conf? Do you happen to be running any job_submit plugins? If so, could you send those as well? This appears to be memory corruption. Can you easily reproduce this? If so, could you attach valgrind and send the output?
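One way to capture that valgrind output is to stop the service and run slurmctld in the foreground under valgrind. This is a sketch of such a session; the service name, log path, and extra options are site assumptions, not something prescribed in this report:

```shell
# Stop the normally managed daemon first (init-script name is an assumption).
service slurm stop

# Run slurmctld in the foreground (-D) under memcheck, logging per-PID.
valgrind --tool=memcheck --leak-check=full --error-limit=no \
         --log-file=/tmp/slurmctld-valgrind.%p.log \
         /usr/sbin/slurmctld -D
```

Running in the foreground keeps valgrind attached to the controller process itself rather than a forked daemon; expect the controller to run noticeably slower under memcheck.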
Created attachment 1500 [details] Job submit plugin
Here is our config. This is not easy to reproduce: we upgraded Slurm last Friday and this was the first crash. slurmctld resumed after a restart and seems to work again.

Configuration data as of 2014-12-10T01:13:44
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits
AccountingStorageHost = slurmdbip
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = YES
AcctGatherEnergyType = acct_gather_energy/rapl
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 30 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 10 sec
BOOT_TIME = 2014-12-10T00:27:57
CacheGroups = 0
CheckpointType = checkpoint/none
ChosLoc = (null)
ClusterName = csc
CompleteWait = 12 sec
ControlAddr = 10.10.0.5
ControlMachine = service01,service02
CoreSpecPlugin = core_spec/none
CpuFreqDef = OnDemand
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerCPU = 512
DisableRootJobs = NO
DynAllocPort = 0
EnforcePartLimits = YES
Epilog = /etc/slurm/epilog
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FastSchedule = 2
FirstJobId = 2230000
GetEnvTimeout = 2 sec
GresTypes = mic,gpu
GroupUpdateForce = 0
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 1800 sec
HealthCheckNodeState = IDLE
HealthCheckProgram = /etc/slurm/health_check
InactiveLimit = 1800 sec
JobAcctGatherFrequency = energy=30,task=30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = lua
KeepAliveTime = 60 sec
KillOnBadExit = 1
KillWait = 10 sec
LaunchType = launch/slurm
Layouts =
Licenses = mdcs:256
LicensesUsed = mdcs:0/256
MailProg = /bin/mail
MaxArraySize = 1001
MaxJobCount = 30000
MaxJobId = 4294901760
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 128
MemLimitEnforce = yes
MessageTimeout = 99 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = ports=12000-12999
NEXT_JOB_ID = 3416724
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PreemptMode = OFF
PreemptType = preempt/none
PriorityParameters = (null)
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = 0
PriorityFlags =
PriorityMaxAge = 6-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 500
PriorityWeightFairShare = 1000
PriorityWeightJobSize = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS = 0
PrivateData = none
ProctrackType = proctrack/cgroup
Prolog = /etc/slurm/prolog
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = (null)
PropagateResourceLimitsExcept = MEMLOCK,RLIMIT_AS,RLIMIT_CPU,RLIMIT_NPROC,RLIMIT_CORE,RLIMIT_DATA,RLIMIT_RSS,STACK
RebootProgram = /sbin/reboot
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
RoutePlugin = (null)
SallocDefaultCommand = (null)
SchedulerParameters = bf_max_job_user=30,bf_continue,bf_interval=60,bf_resolution=180,max_job_bf=300,defer_rpc_cnt=10
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY,CR_CORE_DEFAULT_DIST_BLOCK,CR_LLN
SlurmUser = slurm(88)
SlurmctldDebug = verbose
SlurmctldLogFile = /slurmdb/log/Slurmctld.log
SlurmctldPort = 6817
SlurmctldTimeout = 300 sec
SlurmdDebug = info
SlurmdLogFile = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPlugstack = (null)
SlurmdPort = 6818
SlurmdSpoolDir = /slurmdb/tmp/slurmd
SlurmdTimeout = 600 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 14.11.1
SrunEpilog = (null)
SrunProlog = (null)
StateSaveLocation = /slurmdb/tmp
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/cgroup
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyPlugin = topology/tree
TrackWCKey = 0
TreeWidth = 50
UsePam = 1
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
Thanks. Could you send a snippet of the slurmctld log up to the seg fault?
Created attachment 1501 [details] Slurmctld log file
Tommi, do you have an idea how many jobs were in the queue when this happened? Was job 3415818 part of an array? While this isn't related, I find it interesting that there are warnings about large processing times while this was happening. On the core, could you send the output of "thread apply all bt"?
> Tommi, do you have an idea how many jobs were in the queue when this happened?

Usually we have a couple of hundred jobs.

> Was job 3415818 part of an array?

I don't know. That job id does not exist in the log file or the sacct db.

> While this isn't related, I find it interesting there are Warnings of large processing time when this was happening.

My colleague filed a bug report about this in the past; it's a long-standing issue and should not be related: http://bugs.schedmd.com/show_bug.cgi?id=1082

> On the core could you send the output of "thread apply all bt"

Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000044a33c in find_job_record (job_id=3415818) at job_mgr.c:2626
2626            if (job_ptr->job_id == job_id)
Missing separate debuginfos, use: debuginfo-install slurm-14.11.1-1.el6.x86_64
(gdb) thread apply all bt

Thread 20 (Thread 0x7fda7e930700 (LWP 39633)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1  0x00007fda7e934278 in _my_sleep (usec=500000) at backfill.c:438
#2  0x00007fda7e934cc0 in _yield_locks (usec=500000) at backfill.c:666
#3  0x00007fda7e935e31 in _attempt_backfill () at backfill.c:1035
#4  0x00007fda7e934b7f in backfill_agent (args=0x0) at backfill.c:628
#5  0x0000003b16a079d1 in start_thread (arg=0x7fda7e930700) at pthread_create.c:301
#6  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 19 (Thread 0x7fda7df2b700 (LWP 39635)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162ac870 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007fda7df319ca in _decay_thread (no_data=0x0) at priority_multifactor.c:1335
#3  0x0000003b16a079d1 in start_thread (arg=0x7fda7df2b700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 18 (Thread 0x7fda56ded700 (LWP 7106)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162e1b54 in usleep (useconds=<value optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x000000000042f8d3 in _wdog (args=0x7fd880325c50) at agent.c:573
#3  0x0000003b16a079d1 in start_thread (arg=0x7fda56ded700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 17 (Thread 0x7fda57cfc700 (LWP 7107)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004e2ceb in start_msg_tree (hl=0x7fd88021ab70, msg=0x7fda57cfbe20, timeout=0) at forward.c:647
#2  0x00000000005241ce in slurm_send_recv_msgs (nodelist=0x7fd880313100 "c436", msg=0x7fda57cfbe20, timeout=0, quiet=true) at slurm_protocol_api.c:3987
#3  0x00000000004305c5 in _thread_per_group_rpc (args=0x7fd880047240) at agent.c:879
#4  0x0000003b16a079d1 in start_thread (arg=0x7fda57cfc700) at pthread_create.c:301
#5  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 16 (Thread 0x7fda578f8700 (LWP 7114)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162ac870 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00000000005242d8 in slurm_send_addr_recv_msgs (msg=0x7fda578f7e20, name=0x7fd880212040 "c436", timeout=99000) at slurm_protocol_api.c:4021
#3  0x00000000004e1f27 in _fwd_tree_thread (arg=0x7fd880329410) at forward.c:363
#4  0x0000003b16a079d1 in start_thread (arg=0x7fda578f8700) at pthread_create.c:301
#5  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 15 (Thread 0x7fda7d21e700 (LWP 7111)):
#0  0x0000003b16a0822d in pthread_join (threadid=140575740200704, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fd88027c7e0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda7d21e700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 14 (Thread 0x7fda7f6bd700 (LWP 39621)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239
#1  0x00000000005ae31f in _agent (x=0x0) at slurmdbd_defs.c:2113
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda7f6bd700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 13 (Thread 0x7fda7dd29700 (LWP 39637)):
#0  0x0000003b162e12e3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x000000000043534f in _slurmctld_rpc_mgr (no_data=0x0) at controller.c:965
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda7dd29700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 12 (Thread 0x7fda7dc28700 (LWP 39638)):
#0  do_sigwait (set=<value optimized out>, sig=0x7fda7dc27eac) at ../sysdeps/unix/sysv/linux/sigwait.c:65
#1  __sigwait (set=<value optimized out>, sig=0x7fda7dc27eac) at ../sysdeps/unix/sysv/linux/sigwait.c:100
#2  0x0000000000434cc0 in _slurmctld_signal_hand (no_data=0x0) at controller.c:827
#3  0x0000003b16a079d1 in start_thread (arg=0x7fda7dc28700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 11 (Thread 0x7fda574f4700 (LWP 7121)):
#0  0x00000000004deec8 in slurm_xmalloc (size=24, clear=false, file=0x64250a "pack.c", line=150, func=0x642509 "") at xmalloc.c:86
#1  0x00000000004f6c7e in init_buf (size=16384) at pack.c:150
#2  0x00000000005230ea in slurm_send_node_msg (fd=11, msg=0x7fda574f3cd0) at slurm_protocol_api.c:3281
#3  0x00000000004973f2 in _slurm_rpc_dump_jobs_user (msg=0x7fd88075f330) at proc_req.c:1239
#4  0x00000000004941f5 in slurmctld_req (msg=0x7fd88075f330, arg=0x7fda74003ef0) at proc_req.c:271
#5  0x00000000004357eb in _service_connection (arg=0x7fda74003ef0) at controller.c:1070
#6  0x0000003b16a079d1 in start_thread (arg=0x7fda574f4700) at pthread_create.c:301
#7  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 10 (Thread 0x7fda566e6700 (LWP 7113)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004e2ceb in start_msg_tree (hl=0x7fd880329450, msg=0x7fda566e5e20, timeout=0) at forward.c:647
#2  0x00000000005241ce in slurm_send_recv_msgs (nodelist=0x7fd880380330 "c436", msg=0x7fda566e5e20, timeout=0, quiet=true) at slurm_protocol_api.c:3987
#3  0x00000000004305c5 in _thread_per_group_rpc (args=0x7fd880230df0) at agent.c:879
#4  0x0000003b16a079d1 in start_thread (arg=0x7fda566e6700) at pthread_create.c:301
#5  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 9 (Thread 0x7fda7db27700 (LWP 39639)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183
#1  0x00000000004b7ac4 in slurmctld_state_save (no_data=0x0) at state_save.c:211
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda7db27700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 8 (Thread 0x7fda570f0700 (LWP 7112)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162e1b54 in usleep (useconds=<value optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x000000000042f8d3 in _wdog (args=0x7fd88012ced0) at agent.c:573
#3  0x0000003b16a079d1 in start_thread (arg=0x7fda570f0700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 7 (Thread 0x7fda562e2700 (LWP 7108)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162ac870 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00000000005242d8 in slurm_send_addr_recv_msgs (msg=0x7fda562e1e20, name=0x7fd88031e0d0 "c436", timeout=99000) at slurm_protocol_api.c:4021
#3  0x00000000004e1f27 in _fwd_tree_thread (arg=0x7fd8803293c0) at forward.c:363
#4  0x0000003b16a079d1 in start_thread (arg=0x7fda562e2700) at pthread_create.c:301
#5  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 6 (Thread 0x7fda57dfd700 (LWP 7105)):
#0  0x0000003b16a0822d in pthread_join (threadid=140575737042688, thread_return=0x0) at pthread_join.c:89
#1  0x000000000042ee06 in agent (args=0x7fd88006dde0) at agent.c:332
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda57dfd700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 5 (Thread 0x7fda7de2a700 (LWP 39636)):
#0  0x0000003b16a0822d in pthread_join (threadid=140576392656640, thread_return=0x0) at pthread_join.c:89
#1  0x00007fda7df31b0c in _cleanup_thread (no_data=0x0) at priority_multifactor.c:1389
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda7de2a700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 4 (Thread 0x7fda7f9c2700 (LWP 39618)):
#0  0x0000003b16a0822d in pthread_join (threadid=140576421590784, thread_return=0x0) at pthread_join.c:89
#1  0x00007fda7fac75c1 in _cleanup_thread (no_data=0x0) at accounting_storage_slurmdbd.c:423
#2  0x0000003b16a079d1 in start_thread (arg=0x7fda7f9c2700) at pthread_create.c:301
#3  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3 (Thread 0x7fda821ca700 (LWP 39615)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162e1b54 in usleep (useconds=<value optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x0000000000436552 in _slurmctld_background (no_data=0x0) at controller.c:1449
#3  0x00000000004346b8 in main (argc=1, argv=0x7ffff9075648) at controller.c:561

Thread 2 (Thread 0x7fda7fac3700 (LWP 39617)):
#0  0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000003b162ac870 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#2  0x00007fda7fac759c in _set_db_inx_thread (no_data=0x0) at accounting_storage_slurmdbd.c:415
#3  0x0000003b16a079d1 in start_thread (arg=0x7fda7fac3700) at pthread_create.c:301
#4  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7fda576f6700 (LWP 7120)):
#0  0x000000000044a33c in find_job_record (job_id=3415818) at job_mgr.c:2626
#1  0x000000000045db23 in _set_job_id (job_ptr=0x7fd880227cb0) at job_mgr.c:8423
#2  0x0000000000453f51 in _copy_job_desc_to_job_record (job_desc=0x7fd8800c1710, job_rec_ptr=0x7fda576f5c00, req_bitmap=0x7fda576f5940, exc_bitmap=0x7fda576f5938) at job_mgr.c:6371
#3  0x0000000000451713 in _job_create (job_desc=0x7fd8800c1710, allocate=0, will_run=0, job_pptr=0x7fda576f5c00, submit_uid=34018, err_msg=0x7fda576f5cb0) at job_mgr.c:5458
#4  0x000000000044d395 in job_allocate (job_specs=0x7fd8800c1710, immediate=0, will_run=0, resp=0x0, allocate=0, submit_uid=34018, job_pptr=0x7fda576f5cc0, err_msg=0x7fda576f5cb0) at job_mgr.c:3758
#5  0x000000000049d119 in _slurm_rpc_submit_batch_job (msg=0x7fd8802e1e70) at proc_req.c:3189
#6  0x00000000004945b3 in slurmctld_req (msg=0x7fd8802e1e70, arg=0x7fda74003ec0) at proc_req.c:380
#7  0x00000000004357eb in _service_connection (arg=0x7fda74003ec0) at controller.c:1070
#8  0x0000003b16a079d1 in start_thread (arg=0x7fda576f6700) at pthread_create.c:301
#9  0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Our secondary controller started to crash too; here is a valgrind log from it. It looks completely different, but maybe it explains the memory corruption?

==00:03:00:33.169 42152== Rerun with --leak-check=full to see details of leaked memory
==00:03:00:33.169 42152==
==00:03:00:33.169 42152== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 18 from 9)
==00:03:00:33.169 42152==
==00:03:00:33.169 42152== 1 errors in context 1 of 1:
==00:03:00:33.169 42152== Invalid read of size 8
==00:03:00:33.169 42152==    at 0x4D42C5: _delete_assoc_hash (assoc_mgr.c:274)
==00:03:00:33.169 42152==    by 0x4DEC6A: assoc_mgr_set_missing_uids (assoc_mgr.c:4583)
==00:03:00:33.170 42152==    by 0x43720C: _slurmctld_background (controller.c:1706)
==00:03:00:33.170 42152==    by 0x4346B7: main (controller.c:561)
==00:03:00:33.170 42152==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
==00:03:00:33.170 42152==
--00:03:00:33.170 42152--
--00:03:00:33.170 42152-- used_suppression:     18 dl-hack3-cond-1
==00:03:00:33.170 42152==
==00:03:00:33.170 42152== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 18 from 9)
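An "invalid read of size 8" at address 0x10 is the classic signature of loading a pointer-sized member at a small offset through a NULL (or near-NULL) base pointer, here while walking the association hash. As an illustration only (hypothetical struct names, not Slurm's actual assoc_mgr code), this is the shape of a chain-unlink that stays safe by checking every link before dereferencing it:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a hash-bucket chain of association records.
 * If the walk ever followed a NULL or stale link and then read ->next,
 * valgrind would report exactly the kind of invalid 8-byte read seen
 * above (0x10 being a small member offset from a near-NULL base). */
struct assoc {
    int id;
    struct assoc *next;
};

/* Unlink the record with the given id from the chain and return the
 * (possibly new) head. The pointer-to-pointer walk dereferences a link
 * only after confirming it is non-NULL. */
static struct assoc *delete_assoc(struct assoc *head, int id)
{
    struct assoc **pp = &head;
    while (*pp) {                  /* never dereference a NULL link */
        if ((*pp)->id == id) {
            *pp = (*pp)->next;     /* unlink; caller owns the record */
            break;
        }
        pp = &(*pp)->next;
    }
    return head;
}
```

The valgrind hit therefore points at a record (or its link) that was freed or never initialized before the missing-uid rescan walked the chain, consistent with the memory-corruption theory.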
This was fixed in commit https://github.com/SchedMD/slurm/commit/7f899cf080044c71d876e9918a53f4a789aed516. It does look similar, though.
Tommi, the more we think about this, the more these two crashes look related. I am guessing you are adding and removing users from the system quite often, as this is the only time this kind of thing would happen; particularly when a user is added to Slurm before they have a uid on the slurmctld node(s). Does this fit your scenario? This is obviously what happened in your second crash, but was this happening around the time of the first crash as well?
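The failure mode being described boils down to a name-to-uid lookup that the controller cannot resolve yet because the user exists in the accounting database before the slurmctld node's NSS/LDAP can see them. A minimal sketch of that lookup (hypothetical helper, not Slurm's actual code; Slurm retries such records later via assoc_mgr_set_missing_uids):

```c
#include <assert.h>
#include <pwd.h>
#include <stdbool.h>
#include <sys/types.h>

/* Resolve a user name to a uid via NSS (which is where sssd/LDAP plugs
 * in). Returns false when the user is not (yet) resolvable on this
 * host, which is the window the crash analysis above is about. */
static bool lookup_uid(const char *user, uid_t *uid)
{
    struct passwd *pw = getpwnam(user);
    if (!pw)
        return false;   /* user added to Slurm before LDAP knows them */
    *uid = pw->pw_uid;
    return true;
}
```

When the lookup fails, the association record carries no valid uid until a later rescan; a bug in that rescan path is exactly where the secondary controller's valgrind error landed.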
Hi,

We had a problem with the LDAP configuration on the secondary slurmctld server, which explains that missing-uid crash. But we did not have that configuration error on the primary server. Our sssd/LDAP setup has been a bit flaky, so it's possible it happened on the primary server as well.

I looked through the commit logs and decided to take the latest version from the 14.11 branch, because there were a bunch of other fixes as well. Time will tell if this problem reappears.

Regards,
Tommi Tervo
CSC
We will probably tag 14.11.2 today if that helps. But if you are having problems keeping users in sync, this patch is definitely something you want.
Please reopen if necessary. David
Hi, Problem reappeared, slurm is from 14.11 branch: commit 64e5324d Core was generated by `/usr/sbin/slurmctld'. Program terminated with signal 11, Segmentation fault. #0 0x000000000044a3ec in find_job_record (job_id=3470037) at job_mgr.c:2626 2626 if (job_ptr->job_id == job_id) Missing separate debuginfos, use: debuginfo-install slurm-14.11.1-2.x86_64 (gdb) thread apply all bt Thread 52 (Thread 0x7fcb45e6c700 (LWP 28170)): #0 0x0000003b16a0822d in pthread_join (threadid=140511015225088, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb403741f0) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb45e6c700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 51 (Thread 0x7fcb2eeee700 (LWP 28128)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) 
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9305d1cd0, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc9305d1cd0) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb2eeee700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 50 (Thread 0x7fcb45163700 (LWP 28151)): #0 0x0000003b16a0822d in pthread_join (threadid=140511006803712, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb40753300) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb45163700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 49 (Thread 0x7fcb2fafa700 (LWP 28160)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) 
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930357950, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc930357950) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb2fafa700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 48 (Thread 0x7fcb45466700 (LWP 28164)): #0 0x0000003b16a0822d in pthread_join (threadid=140511047309056, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb40335fe0) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb45466700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 47 (Thread 0x7fcb2ebeb700 (LWP 28143)): #0 0x0000003b16a0822d in pthread_join (threadid=140511007856384, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb401d3670) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb2ebeb700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 46 (Thread 0x7fcb44b5d700 (LWP 28144)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) 
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930174c80, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc930174c80) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb44b5d700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 45 (Thread 0x7fcb2fdfd700 (LWP 28139)): #0 0x0000003b16a0822d in pthread_join (threadid=140511043098368, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb40cec5c0) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb2fdfd700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 44 (Thread 0x7fcb45f6d700 (LWP 28155)): #0 0x0000003b16a0822d in pthread_join (threadid=140511003645696, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb402fee80) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb45f6d700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 43 (Thread 0x7fcb44759700 (LWP 28156)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) 
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9305830f0, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc9305830f0) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb44759700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 42 (Thread 0x7fcb4606e700 (LWP 28159)): #0 0x0000003b16a0822d in pthread_join (threadid=140510655129344, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb403c8010) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb4606e700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 41 (Thread 0x7fcb2e9e9700 (LWP 28136)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930031280, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc930031280) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb2e9e9700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 40 (Thread 0x7fcb46cf9700 (LWP 28140)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...)
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9300a4900, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc9300a4900) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb46cf9700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 39 (Thread 0x7fcb45264700 (LWP 28171)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9300029a0, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc9300029a0) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb45264700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 38 (Thread 0x7fcb44a5c700 (LWP 28152)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) 
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9301608b0, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc9301608b0) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb44a5c700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 37 (Thread 0x7fcb49d0b700 (LWP 5954)): #0 0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82 #1 0x0000003b162e1b54 in usleep (useconds=<value optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33 #2 0x0000000000436552 in _slurmctld_background (no_data=0x0) at controller.c:1449 #3 0x00000000004346b8 in main (argc=1, argv=0x7fff4beb2a48) at controller.c:561 Thread 36 (Thread 0x7fcb470fd700 (LWP 28165)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930364d60, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc930364d60) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb470fd700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 35 (Thread 0x7fcb46ffc700 (LWP 28148)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) 
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9303ec600, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc9303ec600) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb46ffc700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 34 (Thread 0x7fcb2eaea700 (LWP 28147)): #0 0x0000003b16a0822d in pthread_join (threadid=140511046256384, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb4000dfc0) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb2eaea700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 33 (Thread 0x7fcb44658700 (LWP 28127)): #0 0x0000003b16a0822d in pthread_join (threadid=140510642497280, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb40df43e0) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb44658700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 32 (Thread 0x7fcb47503700 (LWP 5957)): #0 0x0000003b16a0822d in pthread_join (threadid=140511052580608, thread_return=0x0) at pthread_join.c:89 #1 0x00007fcb476085c1 in _cleanup_thread (no_data=0x0) at accounting_storage_slurmdbd.c:423 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb47503700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 31 (Thread 0x7fcb2ecec700 (LWP 28120)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...)
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc93013e370, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc93013e370) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb2ecec700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 30 (Thread 0x7fcb2f8f8700 (LWP 28135)): #0 0x0000003b16a0822d in pthread_join (threadid=140510637233920, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb40faa550) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb2f8f8700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 29 (Thread 0x7fcb46dfa700 (LWP 28116)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9307c9450, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc9307c9450) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb46dfa700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 28 (Thread 0x7fcb44254700 (LWP 28124)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) 
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9300dbe40, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc9300dbe40) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb44254700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 27 (Thread 0x7fcb471fe700 (LWP 5960)): #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239 #1 0x00000000005ae4a1 in _agent (x=0x0) at slurmdbd_defs.c:2115 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb471fe700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 26 (Thread 0x7fcb44153700 (LWP 28104)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9300326f0, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc9300326f0) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb44153700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 25 (Thread 0x7fcb2f5f5700 (LWP 28112)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) 
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930095820, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc930095820) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb2f5f5700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 24 (Thread 0x7fcb45062700 (LWP 28100)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930496060, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc930496060) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb45062700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 23 (Thread 0x7fcb2f9f9700 (LWP 28132)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc93007aad0, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc93007aad0) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb2f9f9700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 22 (Thread 0x7fcb44456700 (LWP 28108)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) 
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930132930, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc930132930) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb44456700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 21 (Thread 0x7fcb44355700 (LWP 28131)): #0 0x0000003b16a0822d in pthread_join (threadid=140510654076672, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb40666d10) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb44355700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 20 (Thread 0x7fcb2fbfb700 (LWP 28096)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...)
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9301f5e50, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc9301f5e50) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb2fbfb700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 19 (Thread 0x7fcb47604700 (LWP 5956)): #0 0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82 #1 0x0000003b162ac870 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138 #2 0x00007fcb4760859c in _set_db_inx_thread (no_data=0x0) at accounting_storage_slurmdbd.c:415 #3 0x0000003b16a079d1 in start_thread (arg=0x7fcb47604700) at pthread_create.c:301 #4 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 18 (Thread 0x7fcb2f4f4700 (LWP 28087)): #0 0x0000003b16a0822d in pthread_join (threadid=140510596794112, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb402a97e0) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb2f4f4700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 17 (Thread 0x7fcb4596b700 (LWP 5991)): #0 0x0000003b16a0822d in pthread_join (threadid=140511023646464, thread_return=0x0) at pthread_join.c:89 #1 0x00007fcb45a72b0c in _cleanup_thread (no_data=0x0) at priority_multifactor.c:1389 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb4596b700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 16 (Thread 0x7fcb46bf8700 (LWP 28119)): #0 0x0000003b16a0822d in pthread_join (threadid=140510640391936, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb4049bd70) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb46bf8700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 15 (Thread 0x7fcb45769700 (LWP 5993)): #0 do_sigwait (set=<value optimized out>, sig=0x7fcb45768eac) at ../sysdeps/unix/sysv/linux/sigwait.c:65 #1 __sigwait (set=<value optimized out>, sig=0x7fcb45768eac) at ../sysdeps/unix/sysv/linux/sigwait.c:100 #2 0x0000000000434cc0 in _slurmctld_signal_hand (no_data=0x0) at controller.c:827 #3 0x0000003b16a079d1 in start_thread (arg=0x7fcb45769700) at pthread_create.c:301 #4 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 14 (Thread 0x7fcb2eded700 (LWP 28123)): #0 0x0000003b16a0822d in pthread_join (threadid=140510998382336, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb40e30a20) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb2eded700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 13 (Thread 0x7fcb4495b700 (LWP 28115)): #0 0x0000003b16a0822d in pthread_join (threadid=140511044151040, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb40857390) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb4495b700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 12 (Thread 0x7fcb2d469700 (LWP 28111)): #0 0x0000003b16a0822d in pthread_join (threadid=140510649865984, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb40427fa0) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb2d469700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 11 (Thread 0x7fcb2e2ef700 (LWP 28103)): #0 0x0000003b16a0822d in pthread_join (threadid=140510997329664, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb4015a190) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb2e2ef700) at 
pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 10 (Thread 0x7fcb44c5e700 (LWP 28107)): #0 0x0000003b16a0822d in pthread_join (threadid=140511000487680, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb4019b3a0) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb44c5e700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 9 (Thread 0x7fcb2f6f6700 (LWP 28099)): #0 0x0000003b16a0822d in pthread_join (threadid=140511013119744, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb404effb0) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb2f6f6700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 8 (Thread 0x7fcb2f1f1700 (LWP 28092)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...)
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc930111e00, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc930111e00) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb2f1f1700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 7 (Thread 0x7fcb44d5f700 (LWP 28095)): #0 0x0000003b16a0822d in pthread_join (threadid=140510656182016, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb40874e10) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb44d5f700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 6 (Thread 0x7fcb45365700 (LWP 28091)): #0 0x0000003b16a0822d in pthread_join (threadid=140510645655296, thread_return=0x0) at pthread_join.c:89 #1 0x000000000042ee06 in agent (args=0x7fcb40bcd560) at agent.c:332 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb45365700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 5 (Thread 0x7fcb2c358700 (LWP 28088)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:183 #1 0x00000000004775ad in _wr_wrlock (datatype=JOB_LOCK, wait_lock=true) at locks.c:256 #2 0x0000000000476f64 in lock_slurmctld (lock_levels=...) 
at locks.c:86 #3 0x000000000042fe96 in _notify_slurmctld_nodes (agent_ptr=0x7fc9306a2bd0, no_resp_cnt=0, retry_cnt=0) at agent.c:694 #4 0x000000000042fac6 in _wdog (args=0x7fc9306a2bd0) at agent.c:603 #5 0x0000003b16a079d1 in start_thread (arg=0x7fcb2c358700) at pthread_create.c:301 #6 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 4 (Thread 0x7fcb4586a700 (LWP 5992)): #0 0x0000003b162e12e3 in select () at ../sysdeps/unix/syscall-template.S:82 #1 0x000000000043534f in _slurmctld_rpc_mgr (no_data=0x0) at controller.c:965 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb4586a700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 3 (Thread 0x7fcb45668700 (LWP 5994)): #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:239 #1 0x00000000004b7be4 in slurmctld_state_save (no_data=0x0) at state_save.c:208 #2 0x0000003b16a079d1 in start_thread (arg=0x7fcb45668700) at pthread_create.c:301 #3 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 2 (Thread 0x7fcb45a6c700 (LWP 5990)): #0 0x0000003b162ac9fd in nanosleep () at ../sysdeps/unix/syscall-template.S:82 #1 0x0000003b162ac870 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138 #2 0x00007fcb45a729ca in _decay_thread (no_data=0x0) at priority_multifactor.c:1335 #3 0x0000003b16a079d1 in start_thread (arg=0x7fcb45a6c700) at pthread_create.c:301 #4 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 1 (Thread 0x7fcb46471700 (LWP 5980)): #0 0x000000000044a3ec in find_job_record (job_id=3470037) at job_mgr.c:2626 #1 0x000000000045dbd3 in _set_job_id (job_ptr=0x7fcb4092d6d0) at job_mgr.c:8423 #2 0x000000000044c57c in _job_rec_copy (job_ptr=0x7fcb4092d6d0) at job_mgr.c:3464 #3 0x000000000046bc6e in job_array_post_sched (job_ptr=0x7fcb4092d6d0) at job_mgr.c:13941 #4 0x0000000000486a42 in 
select_nodes (job_ptr=0x7fcb4092d6d0, test_only=false, select_node_bitmap=0x0, err_msg=0x0) at node_scheduler.c:1829 #5 0x00007fcb4647805f in _start_job (job_ptr=0x7fcb4092d6d0, resv_bitmap=0x7fcb402fb770) at backfill.c:1401 #6 0x00007fcb464775cb in _attempt_backfill () at backfill.c:1182 #7 0x00007fcb46475b7f in backfill_agent (args=0x0) at backfill.c:628 #8 0x0000003b16a079d1 in start_thread (arg=0x7fcb46471700) at pthread_create.c:301 #9 0x0000003b162e886d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Here is a log snippet from just before the crash:
[2014-12-15T22:17:01.603] Warning: Note very large processing time from _slurm_rpc_complete_batch_script: usec=1577162 began=22:17:00.026
[2014-12-15T22:17:01.603] Warning: Note very large processing time from _slurm_rpc_step_complete: usec=1275776 began=22:17:00.328
[2014-12-15T22:17:01.603] job_complete: JobID=3465317_385 (3468649) State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2014-12-15T22:17:01.604] job_complete: JobID=3465317_385 (3468649) State=0x8003 NodeCnt=1 done
[2014-12-15T22:17:01.604] Warning: Note very large processing time from _slurm_rpc_complete_batch_script: usec=1455278 began=22:17:00.148
[2014-12-15T22:17:01.604] Warning: Note very large processing time from _slurm_rpc_job_alloc_info_lite: usec=1173343 began=22:17:00.430
[2014-12-15T22:17:01.604] Warning: Note very large processing time from _slurm_rpc_dump_job_user: usec=1997165 began=22:16:59.607
[2014-12-15T22:17:04.201] job_complete: JobID=3465317_667 (3469323) State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2014-12-15T22:17:04.201] job_complete: JobID=3465317_667 (3469323) State=0x8003 NodeCnt=1 done
[2014-12-15T22:17:04.201] Warning: Note very large processing time from _slurmctld_background: usec=1597417 began=22:17:02.604
[2014-12-15T22:17:06.802] job_complete: JobID=3465317_333 (3468530) State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2014-12-15T22:17:06.802] job_complete: JobID=3465317_333 (3468530) State=0x8003 NodeCnt=1 done
[2014-12-15T22:17:06.802] Warning: Note very large processing time from _slurm_rpc_complete_batch_script: usec=1612015 began=22:17:05.190
[2014-12-15T22:17:06.802] Warning: Note very large processing time from _slurmctld_background: usec=1600491 began=22:17:05.202
[2014-12-15T22:17:06.802] job_complete: JobID=3465257_560 (3469953) State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2014-12-15T22:17:06.802] job_complete: JobID=3465257_560 (3469953) State=0x8003 NodeCnt=1 done
[2014-12-15T22:17:06.803] job_complete: JobID=3463243_425 (3467172) State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2014-12-15T22:17:06.803] job_complete: JobID=3463243_425 (3467172) State=0x8003 NodeCnt=1 done
[2014-12-15T22:17:09.364] Warning: Note very large processing time from _slurmctld_background: usec=1561567 began=22:17:07.803
[2014-12-15T22:17:09.397] job_complete: JobID=3465257_221 (3469273) State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
[2014-12-15T22:17:09.397] job_complete: JobID=3465257_221 (3469273) State=0x8003 NodeCnt=1 done
[2014-12-15T22:17:09.866] backfill: Started JobId=3465257_563 (3469985) on c94
[2014-12-15T22:17:09.868] backfill: Started JobId=3465257_564 (3469986) on c132
[2014-12-15T22:17:09.870] backfill: Started JobId=3465257_565 (3469987) on c389
[2014-12-15T22:17:09.871] backfill: Started JobId=3465257_566 (3469988) on c465
[2014-12-15T22:17:09.873] backfill: Started JobId=3465257_567 (3469989) on c94
[2014-12-15T22:17:09.875] backfill: Started JobId=3465257_568 (3469990) on c132
[2014-12-15T22:17:09.876] backfill: Started JobId=3465257_569 (3469991) on c204
[2014-12-15T22:17:09.878] backfill: Started JobId=3465257_570 (3469992) on c208
[2014-12-15T22:17:09.879] backfill: Started JobId=3465257_571 (3469993) on c270
[2014-12-15T22:17:09.881] backfill: Started JobId=3465257_572 (3469994) on c296
[2014-12-15T22:17:09.883] backfill: Started JobId=3465257_573 (3469995) on c297
[2014-12-15T22:17:09.884] backfill: Started JobId=3465257_574 (3469996) on c299
[2014-12-15T22:17:09.886] backfill: Started JobId=3465257_575 (3469997) on c365
[2014-12-15T22:17:09.887] backfill: Started JobId=3465257_576 (3469998) on c381
[2014-12-15T22:17:09.889] backfill: Started JobId=3465257_577 (3469999) on c389
[2014-12-15T22:17:09.890] backfill: Started JobId=3465257_578 (3470000) on c397
[2014-12-15T22:17:09.892] backfill: Started JobId=3465257_579 (3470001) on c419
[2014-12-15T22:17:09.894] backfill: Started JobId=3465257_580 (3470002) on c432
[2014-12-15T22:17:09.895] backfill: Started JobId=3465257_581 (3470003) on c434
[2014-12-15T22:17:09.897] backfill: Started JobId=3465257_582 (3470004) on c465
[2014-12-15T22:17:09.898] backfill: Started JobId=3465257_583 (3470005) on c472
[2014-12-15T22:17:09.900] backfill: Started JobId=3465257_584 (3470006) on c576
[2014-12-15T22:17:09.901] backfill: Started JobId=3465257_585 (3470007) on c16
[2014-12-15T22:17:09.903] backfill: Started JobId=3465257_586 (3470008) on c66
[2014-12-15T22:17:09.905] backfill: Started JobId=3465257_587 (3470009) on c90
[2014-12-15T22:17:09.906] backfill: Started JobId=3465257_588 (3470010) on c94
[2014-12-15T22:17:09.908] backfill: Started JobId=3465257_589 (3470011) on c132
[2014-12-15T22:17:09.909] backfill: Started JobId=3465257_590 (3470012) on c143
[2014-12-15T22:17:09.911] backfill: Started JobId=3465257_591 (3470013) on c145
[2014-12-15T22:17:09.912] backfill: Started JobId=3465318_302 (3470014) on c160
[2014-12-15T22:17:09.914] backfill: Started JobId=3465318_303 (3470015) on c184
[2014-12-15T22:17:09.916] backfill: Started JobId=3465318_304 (3470016) on c204
[2014-12-15T22:17:09.917] backfill: Started JobId=3465318_305 (3470017) on c208
[2014-12-15T22:17:09.919] backfill: Started JobId=3465318_306 (3470018) on c214
[2014-12-15T22:17:09.920] backfill: Started JobId=3465318_307 (3470019) on c216
[2014-12-15T22:17:09.922] backfill: Started JobId=3465318_308 (3470020) on c217
[2014-12-15T22:17:09.924] backfill: Started JobId=3465318_309 (3470021) on c218
[2014-12-15T22:17:09.925] backfill: Started JobId=3465318_310 (3470022) on c221
[2014-12-15T22:17:09.927] backfill: Started JobId=3465318_311 (3470023) on c222
[2014-12-15T22:17:09.928] backfill: Started JobId=3465318_312 (3470024) on c224
[2014-12-15T22:17:09.930] backfill: Started JobId=3465318_313 (3470025) on c238
[2014-12-15T22:17:09.931] backfill: Started JobId=3465318_314 (3470026) on c250
[2014-12-15T22:17:09.933] backfill: Started JobId=3465318_315 (3470027) on c261
[2014-12-15T22:17:09.933] WARNING: agent retry_list size is 22
[2014-12-15T22:17:09.933] retry_list msg_type=4005,4005,4005,4005,4005
[2014-12-15T22:17:09.935] backfill: Started JobId=3465318_316 (3470028) on c266
[2014-12-15T22:17:09.936] backfill: Started JobId=3465318_317 (3470029) on c267
[2014-12-15T22:17:09.938] backfill: Started JobId=3465318_318 (3470030) on c270
[2014-12-15T22:17:09.939] backfill: Started JobId=3465318_319 (3470031) on c282
[2014-12-15T22:17:09.941] backfill: Started JobId=3465318_320 (3470032) on c286
[2014-12-15T22:17:09.942] backfill: Started JobId=3465318_321 (3470033) on c296
[2014-12-15T22:17:09.944] backfill: Started JobId=3465318_322 (3470034) on c297
[2014-12-15T22:17:09.946] backfill: Started JobId=3465318_323 (3470035) on c299
[2014-12-15T22:17:09.947] backfill: Started JobId=3465318_324 (3470036) on c304
[2014-12-15T22:17:38.255] pidfile not locked, assuming no running daemon
[2014-12-15T22:17:38.283] slurmctld version 14.11.1-2 started on cluster csc
I was able to recreate a similar problem (a broken job hash table) and fix it. I believe the commit below fixes the problem you are seeing: https://github.com/SchedMD/slurm/commit/f293ce7ccdef10bfd4a0d0b92d40f59a81b3b13b
I was able to reproduce the same segfault you are seeing. I've confirmed that Moe's patch fixes the problem.
Thanks for the patch; I've upgraded our Slurm again.
We are pretty confident that the patch fixes the problem. Please re-open if necessary.
*** Ticket 1342 has been marked as a duplicate of this ticket. ***