Created attachment 2957 [details] gdb session

slurmctld is occasionally crashing since we upgraded to .9 today. A cron job restarts it and it runs for a while before crashing again. It has happened several times. I'm attaching the backtrace as a file since it's much longer than usual due to the number of threads.
Created attachment 2958 [details] slurmctld.log.bz2
Hey Ryan, most likely not related to this bug, but are these errors expected?

[2016-03-31T16:55:02.799] error: find_node_record: lookup failure for m6stage-1-4
[2016-03-31T16:55:02.799] error: node_name2bitmap: invalid node specified m6stage-1-4
error: find_node_record: lookup failure for m6stage-1-5
[2016-03-31T16:55:02.832] error: node_name2bitmap: invalid node specified m6stage-1-5
Yes. They're installed off of our provisioning server but aren't in the Slurm config. Those are old staging nodes; they're probably just trying to run something like marking themselves offline due to hardware failure.
Ryan, in frame 2 of thread 1, could you print *ss[x].tmpjobs?
#0 0x0000000000509175 in bit_ffs (b=0x20312d3d656d6954) at bitstring.c:483
483             _assert_bitstr_valid(b);
(gdb) thread 1
[Switching to thread 1 (Thread 28662)]
#0 0x0000000000509175 in bit_ffs (b=0x20312d3d656d6954) at bitstring.c:483
483             _assert_bitstr_valid(b);
(gdb) frame 2
#2 0x00007f2a002a1851 in _rm_job_from_res (part_record_ptr=0x3d88ab0, node_usage=0x3d82fb0, job_ptr=0x38ba860, action=0) at select_cons_res.c:1255
1255            _build_row_bitmaps(p_ptr, job_ptr);
(gdb) print *ss[x].tmpjobs
No symbol "ss" in current context.
Also, you say it is crashing rather often. Did the other backtrace files look similar? Have you seen any similarities among the jobs that trigger this? Not much has changed in the cons_res plugin, which makes this seem like an older bug.
(In reply to Ryan Cox from comment #5) > #0 0x0000000000509175 in bit_ffs (b=0x20312d3d656d6954) at bitstring.c:483 > 483 _assert_bitstr_valid(b); > (gdb) thread 1 > [Switching to thread 1 (Thread 28662)]#0 0x0000000000509175 in bit_ffs > (b=0x20312d3d656d6954) at bitstring.c:483 > 483 _assert_bitstr_valid(b); > (gdb) frame 2 > #2 0x00007f2a002a1851 in _rm_job_from_res (part_record_ptr=0x3d88ab0, > node_usage=0x3d82fb0, job_ptr=0x38ba860, action=0) at select_cons_res.c:1255 > 1255 _build_row_bitmaps(p_ptr, job_ptr); > (gdb) print *ss[x].tmpjobs > No symbol "ss" in current context. Whoops frame 1, sorry.
> Whoops frame 1, sorry.

(gdb) frame 1
#1 0x00007f2a0029fa8d in _build_row_bitmaps (p_ptr=0x3d88a70, job_ptr=0x38ba860) at select_cons_res.c:646
646             ss[x].jstart = bit_ffs(ss[x].tmpjobs->node_bitmap);
(gdb) print *ss[x].tmpjobs
$1 = {core_bitmap = 0x725f657461657263, core_bitmap_used = 0x6d614e203a767365, cpu_array_cnt = 913128805, cpu_array_value = 0x68746c6165685f38, cpu_array_reps = 0x616d3a6b63656863, cpus = 0x636e616e65746e69, cpus_used = 0x5320353536312e65, cores_per_socket = 0x656d695474726174, memory_allocated = 0x33302d363130323d, memory_used = 0x353a36315431332d, nhosts = 926038581, node_bitmap = 0x20312d3d656d6954, node_req = 1634891076, nodes = 0x67616c462030313d <Address 0x67616c462030313d out of bounds>, ncpus = 1095581043, sock_core_rep_count = 0x2c50414c5245564f, sockets_per_node = 0x4a5f45524f4e4749, whole_node = 79 'O'}
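For context, bit_ffs() returns the index of the first set bit in a Slurm bit string, after asserting the pointer is valid. A minimal sketch of the idea, assuming a plain array of 64-bit words rather than Slurm's actual bitstr_t layout:

```c
#include <assert.h>
#include <stdint.h>

/* Illustration only: return the index of the first set bit in an array
 * of 64-bit words holding nbits bits, or -1 if no bit is set. Slurm's
 * real bit_ffs() operates on its bitstr_t type, whose header words hold
 * a magic cookie and the bit count, and it validates those first; that
 * validation is the assert firing here when node_bitmap is trash. */
static int find_first_set(const uint64_t *words, int nbits)
{
    for (int i = 0; i < nbits; i++) {
        if (words[i / 64] & (1ULL << (i % 64)))
            return i;
    }
    return -1;
}
```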
Created attachment 2959 [details] gdb sessions (first and second when -O2, etc were enabled) (In reply to Danny Auble from comment #6) > Also, you say it is crashing rather often. Did the other backtraces files > look similar? > > Have you seen any similarities from the jobs that trigger this? Nothing > much has changed in the cons_res plugin which makes this seem like an older > bug. I didn't think to copy the core files elsewhere in time so this is all I have left from them (attached). The first core file you saw in this ticket was with -O0 and all the default settings. The ones from earlier today that I'm attaching now were from slurmctld when it was compiled with -O2.
Hum, well that looks like trash memory. In the same frame could you print:

print *p_ptr
print p_ptr->row[i]
print x
print j
print i
(gdb) print *p_ptr
$2 = {next = 0x3d88a30, num_rows = 5, part_ptr = 0x28bcf80, row = 0x3eb62f0}
(gdb) print p_ptr->row[i]
$3 = {row_bitmap = 0x3ebae50, num_jobs = 41, job_list = 0x3eb8350, job_list_size = 48}
(gdb) print x
$4 = 4
(gdb) print j
$5 = 4
(gdb) print i
$6 = 0
Well, the other cores are in the same place; it seems like we are running off the end of the array somehow. Would you be willing to run this under valgrind and see if it turns up anything?
(In reply to Danny Auble from comment #12) > Well the other cores are in the same place, seems like we are going off the > array somehow. > > Would you be willing to put this on valgrind and see if it turns up anything? Sure, but I think I'll wait until the morning. I'll let this go overnight since I have cron restart it when needed, the backup controller seems to take over successfully every time, and the primary stays up for dozens of minutes at least before crashing. This should give us another core file to work with, I imagine. I'm taking off now and will get back to this first thing tomorrow.
No problem, thanks :).
Ryan, we think this might be related to sharing of resources. It appears you have shared=yes on your partitions. Do you happen to know if your users request the -s|--share option?

Could you, in frame 1:

print *job_ptr->details

Based on the numbers given, it looks like you are sharing resources, perhaps unknowingly? That should be fine, but perhaps with this config there may be a bug.
(In reply to Danny Auble from comment #15) > Ryan, we think this might be related to sharing of resources? > > It appears you have shared=yes on your partitions. > > Do you happen to know if your users request the -s|--share option? > > Could you, in frame 1 > > print *job_ptr->details > > Based on these numbers given it looks like you are sharing resources, > perhaps unknowingly? > > That should be fine, but perhaps with this config there may be a bug. All partitions are set to be shared by default. Some users request --exclusive but many jobs are shared.
Keep in mind that "shared" means sharing/oversubscribing CPUs, not having multiple jobs share a node; the latter happens by default. The term shared is very misleading, a holdover from the days of whole-node allocations, and is being changed in 16.05 to something clearer like "oversubscribe".
Everything is shared by default unless they request --exclusive.
(In reply to Danny Auble from comment #15)
> Ryan, we think this might be related to sharing of resources?
>
> It appears you have shared=yes on your partitions.
>
> Do you happen to know if your users request the -s|--share option?

I wasn't aware that it meant oversubscribe. I really doubt that anyone specifies that and I've never seen any evidence of it. Is there any easy way to check running/pending jobs for it? This is all I could come up with:

$ squeue -ho %h | sort | uniq -c
     94 no
   5902 unknwn

> Could you, in frame 1
>
> print *job_ptr->details
>
> Based on these numbers given it looks like you are sharing resources,
> perhaps unknowingly?
>
> That should be fine, but perhaps with this config there may be a bug.

(gdb) frame 1
#1 0x00007f2a0029fa8d in _build_row_bitmaps (p_ptr=0x3d88a70, job_ptr=0x38ba860) at select_cons_res.c:646
646             ss[x].jstart = bit_ffs(ss[x].tmpjobs->node_bitmap);
(gdb) print *job_ptr->details
$1 = {acctg_freq = 0x0, argc = 1, argv = 0x38bae10, begin_time = 1459459189, ckpt_dir = 0x38badb0 "/var/slurm/checkpoint", contiguous = 0, core_spec = 65534, cpu_bind = 0x0, cpu_bind_type = 65534, cpu_freq_min = 4294967294, cpu_freq_max = 4294967294, cpu_freq_gov = 4294967294, cpus_per_task = 1, depend_list = 0x0, dependency = 0x0, orig_dependency = 0x0, env_cnt = 0, env_sup = 0x0, exc_node_bitmap = 0x0, exc_nodes = 0x0, expanding_jobid = 0, feature_list = 0x0, features = 0x0, magic = 0, max_cpus = 4294967294, max_nodes = 0, mc_ptr = 0x38bade0, mem_bind = 0x0, mem_bind_type = 0, min_cpus = 16, min_nodes = 1, nice = 10000, ntasks_per_node = 0, num_tasks = 16, open_mode = 0 '\000', overcommit = 0 '\000', plane_size = 0, pn_min_cpus = 1, pn_min_memory = 2147484672, pn_min_tmp_disk = 0, prolog_running = 0 '\000', reserved_resources = 0, req_node_bitmap = 0x0, req_node_layout = 0x0, preempt_start_time = 0, req_nodes = 0x0, requeue = 0, restart_dir = 0x0, share_res = 254 '\376', std_err = 0x0, std_in = 0x38bad10 "/dev/null", std_out = 0x0, submit_time = 1459459189, task_dist = 8192, usable_nodes = 0, whole_node = 0 '\000', work_dir = 0x38bad40 "/panfs/pan.fsl.byu.edu/scr/usr/60/josephsl/graphener3/Zr_sv,Vc/5860/planarBE"}
Created attachment 2962 [details] gdb session for core-2016-03-31_22:08 Another core from last night
Created attachment 2963 [details] gdb session for core-2016-04-01_03:24 Here's another core. There were two other crashes that I didn't capture the core files for. I'm going to try looking for similarities.
Created attachment 2964 [details] slurmctld.log.1.gz Logs covering the time period that the crashes happened.
Created attachment 2967 [details] gdb session for core-2016-04-01_09:21 Here's the last gdb session I'll send you until I hear otherwise. This job was running so I also got this: root@sched1:/var/log/slurm# scontrol show job -dd 9919523 JobId=9919523 JobName=grid study fsl test UserId=thomas95(21131) GroupId=thomas95(21350) Priority=101490 Nice=0 Account=sn72 QOS=normal JobState=RUNNING Reason=None Dependency=(null) Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 DerivedExitCode=0:0 RunTime=13:42:25 TimeLimit=1-00:00:00 TimeMin=N/A SubmitTime=2016-03-31T19:49:38 EligibleTime=2016-03-31T19:49:38 StartTime=2016-03-31T19:49:51 EndTime=2016-04-01T19:49:51 PreemptTime=None SuspendTime=None SecsPreSuspend=0 Partition=m6n AllocNode:Sid=m8int01:100465 ReqNodeList=(null) ExcNodeList=(null) NodeList=m6-17-[5,13,15-16],m6-20-[2-8,10,12,14,16] BatchHost=m6-17-5 NumNodes=15 NumCPUs=69 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=69,mem=282624,node=15 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* Nodes=m6-17-5 CPU_IDs=0-1,6-7 Mem=16384 Nodes=m6-17-[13,15-16],m6-20-2 CPU_IDs=0-2,6-8 Mem=24576 Nodes=m6-20-3 CPU_IDs=1-3,6-7 Mem=20480 Nodes=m6-20-4 CPU_IDs=0-2,6-7 Mem=20480 Nodes=m6-20-5 CPU_IDs=0-1,6-7 Mem=16384 Nodes=m6-20-6 CPU_IDs=0-2,6-8 Mem=24576 Nodes=m6-20-7 CPU_IDs=0,4,8-9 Mem=16384 Nodes=m6-20-8 CPU_IDs=0 Mem=4096 Nodes=m6-20-10 CPU_IDs=0-1,6-7 Mem=16384 Nodes=m6-20-12 CPU_IDs=10 Mem=4096 Nodes=m6-20-14 CPU_IDs=1-3,6-7 Mem=20480 Nodes=m6-20-16 CPU_IDs=0-2,6-8 Mem=24576 MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0 Features=(null) Gres=(null) Reservation=(null) Shared=OK Contiguous=0 Licenses=(null) Network=(null) Command=/bluehome3/thomas95/FLOWlab/Projects/FlorisChangesCode/GridStudy/fsl_run_cases.sh WorkDir=/bluehome3/thomas95/FLOWlab/Projects/FlorisChangesCode/GridStudy StdErr=/bluehome3/thomas95/FLOWlab/Projects/FlorisChangesCode/GridStudy/slurm-9919523.out StdIn=/dev/null StdOut=/bluehome3/thomas95/FLOWlab/Projects/FlorisChangesCode/GridStudy/slurm-9919523.out 
BatchScript=
#!/bin/bash

#SBATCH --time=24:00:00   # walltime
#SBATCH --ntasks=69   # number of processor cores (i.e. tasks)
#SBATCH --mem-per-cpu=4096M   # memory per CPU core
#SBATCH -J "grid study fsl test"   # job name
#SBATCH --mail-user=jaredthomas@byu.net   # email address
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
for rows in 2 3 4 5 6 7 8 9 10; do
    for case in 1 2 3 4 5 6; do
        mpirun -np 69 python GridStudy.py $rows $case
    done
done
Power= SICP=0
In the core file core-2016-04-01_09:21, I noticed that there was a job testing whether to preempt another job. Here is more information: (gdb) thread 1 [Switching to thread 1 (Thread 24592)]#0 0x00007f4697c49f15 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 64 in ../nptl/sysdeps/unix/sysv/linux/raise.c (gdb) bt #0 0x00007f4697c49f15 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00007f4697c4cd20 in *__GI_abort () at abort.c:92 #2 0x00007f4697c430a1 in *__GI___assert_fail (assertion=0x63e190 "(b) != ((void *)0)", file=<value optimized out>, line=483, function=0x63e60d "bit_ffs") at assert.c:81 #3 0x0000000000509171 in bit_ffs (b=0x0) at bitstring.c:483 #4 0x00007f4696bbaa8d in _build_row_bitmaps (p_ptr=0x7f468c0838e0, job_ptr=0x22d6910) at select_cons_res.c:646 #5 0x00007f4696bbc851 in _rm_job_from_res (part_record_ptr=0x7f468c0bcf10, node_usage=0x7f468c3d0180, job_ptr=0x22d6910, action=0) at select_cons_res.c:1255 #6 0x00007f4696bbe09f in _will_run_test (job_ptr=0x340bf30, bitmap=0x7f468c1f61a0, min_nodes=1, max_nodes=500000, req_nodes=1, job_node_req=1, preemptee_candidates=0x0, preemptee_job_list=0x7f46940a1aa8, exc_core_bitmap=0x0) at select_cons_res.c:1859 #7 0x00007f4696bbede8 in select_p_job_test (job_ptr=0x340bf30, bitmap=0x7f468c1f61a0, min_nodes=1, max_nodes=500000, req_nodes=1, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f46940a1aa8, exc_core_bitmap=0x0) at select_cons_res.c:2196 #8 0x0000000000525d3b in select_g_job_test (job_ptr=0x340bf30, bitmap=0x7f468c1f61a0, min_nodes=1, max_nodes=500000, req_nodes=1, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f46940a1aa8, exc_core_bitmap=0x0) at node_select.c:587 #9 0x00007f46940a6a3b in _try_sched (job_ptr=0x340bf30, avail_bitmap=0x7f46940a1cd8, min_nodes=1, max_nodes=500000, req_nodes=1, exc_core_bitmap=0x0) at backfill.c:418 #10 0x00007f46940a9766 in _attempt_backfill () at backfill.c:1356 #11 
0x00007f46940a7780 in backfill_agent (args=0x0) at backfill.c:706 #12 0x00007f4697f8a8ca in start_thread (arg=<value optimized out>) at pthread_create.c:300 #13 0x00007f4697cf0abd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #14 0x0000000000000000 in ?? () (gdb) frame 8 #8 0x0000000000525d3b in select_g_job_test (job_ptr=0x340bf30, bitmap=0x7f468c1f61a0, min_nodes=1, max_nodes=500000, req_nodes=1, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f46940a1aa8, exc_core_bitmap=0x0) at node_select.c:587 587 return (*(ops[select_context_default].job_test)) (gdb) print *job_ptr $1 = {account = 0x11b0b20 "bch", alias_list = 0x0, alloc_node = 0x25cb070 "m8-9-4", alloc_resp_port = 0, alloc_sid = 92326, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1214, assoc_ptr = 0xe3d830, batch_flag = 1, batch_host = 0x0, bit_flags = 8, burst_buffer = 0x0, check_job = 0x28cd0f0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 0, billable_tres = 4294967294, cr_enabled = 1, db_index = 16940128, derived_ec = 0, details = 0x340c260, direct_set_prio = 0, end_time = 0, end_time_exp = 0, epilog_running = false, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x0, gres_req = 0x0, gres_used = 0x0, group_id = 21702, job_id = 9921532, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 0, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x29a07f0}, mail_type = 6, mail_user = 0x28c89f0 "paintmyeyesgreen@gmail.com", magic = 4038539564, name = 0x340c500 "lowY_sv,Vc3713", network = 0x0, next_step_id = 0, nodes = 0x0, node_addr = 0x0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 0, node_cnt_wag = 2, nodes_completing = 0x0, other_port = 0, partition = 0x340c480 "m6n,m6,m8,m7", part_ptr_list = 0x1e696a0, part_nodes_missing = false, part_ptr = 0x3118330, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, 
preempt_in_progress = false, priority = 101197, priority_array = 0x2600620, prio_factors = 0x340c3e0, profile = 0, qos_id = 1, qos_ptr = 0xdb4840, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x7f468c3bd360 "m6-18-[10,13]", select_jobinfo = 0x340c620, sicp_mode = 0 '\000', spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 7424, start_time = 0, state_desc = 0x0, state_reason = 1, state_reason_prev = 1, step_list = 0x2051560, suspend_time = 0, time_last_active = 1459522519, time_limit = 120, time_min = 0, tot_sus_time = 0, total_cpus = 16, total_nodes = 0, tres_req_cnt = 0x28e7e10, tres_req_str = 0x340ab70 "1=16,2=16384,4=1", tres_fmt_req_str = 0x340c530 "cpu=16,mem=16384,node=1", tres_alloc_cnt = 0x0, tres_alloc_str = 0x0, tres_fmt_alloc_str = 0x0, user_id = 21460, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0} (gdb) print *job_ptr->details $2 = {acctg_freq = 0x0, argc = 1, argv = 0x2099bf0, begin_time = 1459522519, ckpt_dir = 0x340c650 "/var/slurm/checkpoint", contiguous = 0, core_spec = 65534, cpu_bind = 0x0, cpu_bind_type = 65534, cpu_freq_min = 4294967294, cpu_freq_max = 4294967294, cpu_freq_gov = 4294967294, cpus_per_task = 1, depend_list = 0x0, dependency = 0x0, orig_dependency = 0x0, env_cnt = 0, env_sup = 0x0, exc_node_bitmap = 0x0, exc_nodes = 0x0, expanding_jobid = 0, feature_list = 0x0, features = 0x0, magic = 0, max_cpus = 4294967294, max_nodes = 0, mc_ptr = 0x340c680, mem_bind = 0x0, mem_bind_type = 0, min_cpus = 16, min_nodes = 1, nice = 10000, ntasks_per_node = 0, num_tasks = 16, open_mode = 0 '\000', overcommit = 0 '\000', plane_size = 0, pn_min_cpus = 1, pn_min_memory = 2147484672, pn_min_tmp_disk = 0, prolog_running = 0 '\000', reserved_resources = 0, req_node_bitmap = 0x0, req_node_layout = 0x0, preempt_start_time = 0, 
req_nodes = 0x0, requeue = 0, restart_dir = 0x0, share_res = 0 '\000', std_err = 0x0, std_in = 0x340c590 "/dev/null", std_out = 0x0, submit_time = 1459522519, task_dist = 8192, usable_nodes = 0, whole_node = 0 '\000', work_dir = 0x340c5c0 "/panfs/pan.fsl.byu.edu/scr/usr/60/josephsl/graphener3/Y_sv,Vc/3713"} (gdb) print *preemptee_job_list $3 = (List) 0x0 (gdb) quit root@sched1:/var/log/slurm# scontrol -dd show job 9921532 JobId=9921532 JobName=lowY_sv,Vc3713 UserId=josephsl(21460) GroupId=josephsl(21702) Priority=101197 Nice=0 Account=bch QOS=normal JobState=RUNNING Reason=None Dependency=(null) Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 DerivedExitCode=0:0 RunTime=00:01:55 TimeLimit=02:00:00 TimeMin=N/A SubmitTime=2016-04-01T08:55:19 EligibleTime=2016-04-01T08:55:19 StartTime=2016-04-01T09:34:33 EndTime=2016-04-01T11:34:33 PreemptTime=None SuspendTime=None SecsPreSuspend=0 Partition=m7 AllocNode:Sid=m8-9-4:92326 ReqNodeList=(null) ExcNodeList=(null) NodeList=m7-14-[5,11] BatchHost=m7-14-5 NumNodes=2 NumCPUs=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=16,mem=16384,node=2 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* Nodes=m7-14-5 CPU_IDs=0-7,9-15 Mem=15360 Nodes=m7-14-11 CPU_IDs=1 Mem=1024 MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0 Features=(null) Gres=(null) Reservation=(null) Shared=OK Contiguous=0 Licenses=(null) Network=(null) Command=/panfs/pan.fsl.byu.edu/scr/usr/60/josephsl/graphener3/Y_sv,Vc/3713/job WorkDir=/panfs/pan.fsl.byu.edu/scr/usr/60/josephsl/graphener3/Y_sv,Vc/3713 StdErr=/panfs/pan.fsl.byu.edu/scr/usr/60/josephsl/graphener3/Y_sv,Vc/3713/slurm-9921532.out StdIn=/dev/null StdOut=/panfs/pan.fsl.byu.edu/scr/usr/60/josephsl/graphener3/Y_sv,Vc/3713/slurm-9921532.out BatchScript= #!/bin/bash #SBATCH --time=2:0:00 #SBATCH --ntasks=16 #SBATCH --mem-per-cpu=1024M #SBATCH --mail-user=paintmyeyesgreen@gmail.com #SBATCH --mail-type=FAIL #SBATCH --mail-type=END #SBATCH --job-name=lowY_sv,Vc3713 module unload mpi module load 
mpi/openmpi-1.6.5_intel-13.0.1 module unload mkl module load mkl/11/2 module unload python module load python/2/7 mpiexec ./vasp.x > vasp.out Power= SICP=0
Can you tell me what version of Slurm you were using prior to the upgrade and the problems? The point where the failures occur and the data structure being manipulated are both local to the select/cons_res plugin, and there haven't been any substantive changes to that plugin since 15.08.8, so I'm wondering if I need to look at changes further back in time.
(In reply to Moe Jette from comment #25)
> Can you tell me what version of Slurm that you were using prior to the
> upgrade and problems?

We had been running 15.08.8 since February 22. Prior to that, we switched to .7 on January 22. Sometimes we let the client commands on compute/login nodes lag behind by a few minor versions, but they're at .8 currently.
Any ideas about what else to try? I could add debug statements in various places but don't know what all to look for. Many of the core dumps seem to be caused by SIGABRT and not SIGSEGV, by the way.
Here's an update: the SIGABRTs are due to the Slurm code testing that a valid bitmap value is passed to a search function before the operation. Specifically, it is checking for a non-zero pointer value and a valid "magic cookie". In your case it's dying due to a NULL pointer some of the time. Other times it's operating on a pointer that's out of bounds.

I've set up a system here that looks like yours and run under valgrind both our regression test and everything I could think of to cause a problem like this, and found nothing. I've reviewed all of the Slurm changes since version 15.08.8 and found nothing remotely related has changed. Danny did the same exercise independently. I've been studying everywhere the invalid data structure is manipulated and have found nothing so far, but am continuing that.

What might be helpful on your side is:

1. You can run with a DebugFlags value of "SelectType". It's going to be really verbose, so you might just turn that on temporarily if you get into a state where it's happening frequently.

2. In one of these core files, dump the full "p_ptr" data structure in the _build_row_bitmaps() function; in gdb syntax:

set $i = 0
while ($i < p_ptr->num_rows)
  print p_ptr->row[$i]
  set $j = 0
  while ($j < p_ptr->row[$i].num_jobs)
    print *p_ptr->row[$i].job_list[$j]
    set $j = $j + 1
  end
  set $i = $i + 1
end
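The pointer/magic-cookie check described above can be sketched as follows. This is a hand-written illustration; the function name, magic value, and header layout are assumptions for the sketch, not Slurm's actual _assert_bitstr_valid() internals:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t bitstr_t;

/* Hypothetical magic value, for illustration only. */
#define BITSTR_MAGIC_DEMO 0x42434442

/* A live bit string must be non-NULL and carry the expected magic
 * cookie in its header word. A NULL pointer (one failure mode in these
 * cores) or a trashed pointer like 0x20312d3d656d6954 (the other) fails
 * this kind of check, which is what turns the crash into a SIGABRT via
 * assert rather than a SIGSEGV. */
static int bitstr_looks_valid(const bitstr_t *b)
{
    return b != NULL && b[0] == BITSTR_MAGIC_DEMO;
}
```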
I might be grasping at straws, but one thing I just noticed is that Slurm's bitmap pointer was set in a couple of places to something that looks like it might be ASCII text rather than an integer value or memory address. It's nothing that I recognize from Slurm code, but perhaps not random either.

#0 0x0000000000509175 in bit_ffs (b=0x20312d3d656d6954) at bitstring.c:483
   20 31 2d 3d 65 6d 69 54  " 1-=emiT"
#0 0x00000000004d7094 in bit_ffs (b=0x34422f7a6d73536f) at bitstring.c:483
   "4B/zmsSo"
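To make the observation concrete: on a little-endian host (x86_64), the bytes of the pointer value read in memory order as printable text, which is why the hex shows a mirrored string. A small sketch (the helper name is mine, not from Slurm):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the 8 bytes of a 64-bit value into a NUL-terminated buffer.
 * On a little-endian machine the least-significant byte comes first,
 * so a pointer field overwritten by ASCII text appears "mirrored"
 * when printed as hex. */
static void ptr_bytes_to_ascii(uint64_t value, char out[9])
{
    memcpy(out, &value, sizeof(value));
    out[8] = '\0';
}
```

On little-endian hardware, ptr_bytes_to_ascii(0x20312d3d656d6954ULL, buf) yields "Time=-1 "; text like that landing in a node_bitmap pointer suggests string data being written over freed or out-of-bounds memory.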
(In reply to Moe Jette from comment #28) Thanks for all the work both of you have put into this. > What might be helpful on your side is: > 1. You can run with a DebugFlags value of "SelectType". It's going to be > really verbose, so you might just turn that on temporarily if you get into a > state where it's happening frequently. I've enabled this with scontrol. I'll disable it before I leave today if I think it's going to put too much in the logs. > 2. In one of this core files, dump the full "p_ptr" data structure in the > _build_row_bitmaps() function: > for (i = 0; i < p_ptr->num_rows; i++) { > print *p_ptr->row[i] > for (j = 0; j < p_ptr->row[i].num_jobs; j++) { > print *p_ptr->row[i].job_list[j] > } > } Is this what you need? Program terminated with signal 6, Aborted. #0 0x00007fd4cb66df15 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory. in ../nptl/sysdeps/unix/sysv/linux/raise.c (gdb) bt #0 0x00007fd4cb66df15 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00007fd4cb670d20 in *__GI_abort () at abort.c:92 #2 0x00007fd4cb6670a1 in *__GI___assert_fail (assertion=0x63e190 "(b) != ((void *)0)", file=<value optimized out>, line=483, function=0x63e60d "bit_ffs") at assert.c:81 #3 0x0000000000509171 in bit_ffs (b=0x0) at bitstring.c:483 #4 0x00007fd4ca5dea8d in _build_row_bitmaps (p_ptr=0x7fd4ac04d450, job_ptr=0x28430c0) at select_cons_res.c:646 #5 0x00007fd4ca5e0851 in _rm_job_from_res (part_record_ptr=0x7fd4ac071f30, node_usage=0x7fd4ac3adf90, job_ptr=0x28430c0, action=0) at select_cons_res.c:1255 #6 0x00007fd4ca5e16f5 in _run_now (job_ptr=0x7fd4a45878f0, bitmap=0x7fd4ac136bc0, min_nodes=2, max_nodes=2, req_nodes=2, job_node_req=1, preemptee_candidates=0x7fd4ac21a4b0, preemptee_job_list=0x7fff9a7b8588, exc_core_bitmap=0x0) at select_cons_res.c:1603 #7 0x00007fd4ca5e2e6a in select_p_job_test 
(job_ptr=0x7fd4a45878f0, bitmap=0x7fd4ac136bc0, min_nodes=2, max_nodes=2, req_nodes=2, mode=0, preemptee_candidates=0x7fd4ac21a4b0, preemptee_job_list=0x7fff9a7b8588, exc_core_bitmap=0x0) at select_cons_res.c:2204 #8 0x0000000000525d3b in select_g_job_test (job_ptr=0x7fd4a45878f0, bitmap=0x7fd4ac136bc0, min_nodes=2, max_nodes=2, req_nodes=2, mode=0, preemptee_candidates=0x7fd4ac21a4b0, preemptee_job_list=0x7fff9a7b8588, exc_core_bitmap=0x0) at node_select.c:587 #9 0x0000000000492290 in _pick_best_nodes (node_set_ptr=0x7fd4ac1369c0, node_set_size=2, select_bitmap=0x7fff9a7b85a0, job_ptr=0x7fd4a45878f0, part_ptr=0x1934690, min_nodes=2, max_nodes=2, req_nodes=2, test_only=false, preemptee_candidates=0x7fd4ac21a4b0, preemptee_job_list=0x7fff9a7b8588, has_xand=false, exc_core_bitmap=0x0, resv_overlap=false) at node_scheduler.c:1523 #10 0x000000000049100e in _get_req_features (node_set_ptr=0x7fd4ac1369c0, node_set_size=2, select_bitmap=0x7fff9a7b85a0, job_ptr=0x7fd4a45878f0, part_ptr=0x1934690, min_nodes=2, max_nodes=2, req_nodes=2, test_only=false, preemptee_job_list=0x7fff9a7b8588) at node_scheduler.c:1011 #11 0x000000000049327f in select_nodes (job_ptr=0x7fd4a45878f0, test_only=false, select_node_bitmap=0x0, unavail_node_str=0x7fd4ac110210 "m6-17-4,m6-18-[11-12],m6-19-13,m6-20-[11,13],m6-26-[8,14],m6-28-14,m6-29-[5,8,12,14,16],m6-30-[8-10],m6-31-14,m6-32-[1,8],m6-33-10,m7-6-[7,10],m7-8-13,m7-10-5,m7-15-15,m8-1-9,m8-2-12,m8-3-[11,13],m8-4"..., err_msg=0x0) at node_scheduler.c:1938 #12 0x000000000047bd34 in _schedule (job_limit=4294967295) at job_scheduler.c:1570 #13 0x0000000000479e86 in schedule (job_limit=4294967295) at job_scheduler.c:903 #14 0x0000000000445d7c in _slurmctld_background (no_data=0x0) at controller.c:1875 #15 0x0000000000442cc7 in main (argc=1, argv=0x7fff9a7b8eb8) at controller.c:583 (gdb) frame 4 #4 0x00007fd4ca5dea8d in _build_row_bitmaps (p_ptr=0x7fd4ac04d450, job_ptr=0x28430c0) at select_cons_res.c:646 646 ss[x].jstart = 
bit_ffs(ss[x].tmpjobs->node_bitmap);
(gdb) print p_ptr->row[i]
$1 = {row_bitmap = 0x7fd4ac13e450, num_jobs = 2060, job_list = 0x7fd4ac0c9690, job_list_size = 2136}
(gdb) print p_ptr->row[i].job_list[j]
$2 = (struct job_resources *) 0x0
(gdb) print i
$3 = 0
(gdb) print j
$4 = 382
(In reply to Ryan Cox from comment #30)
> Is this what you need?
>
> (gdb) frame 4
> #4 0x00007fd4ca5dea8d in _build_row_bitmaps (p_ptr=0x7fd4ac04d450,
> job_ptr=0x28430c0) at select_cons_res.c:646
> 646 ss[x].jstart = bit_ffs(ss[x].tmpjobs->node_bitmap);
> (gdb) print p_ptr->row[i]
> $1 = {row_bitmap = 0x7fd4ac13e450, num_jobs = 2060, job_list =
> 0x7fd4ac0c9690, job_list_size = 2136}
> (gdb) print p_ptr->row[i].job_list[j]
> $2 = (struct job_resources *) 0x0
> (gdb) print i
> $3 = 0
> (gdb) print j
> $4 = 382

It would be nice to get some of the other values around here too.

Did you have a chance to run slurmctld under valgrind yet?

> #0 0x0000000000509175 in bit_ffs (b=0x20312d3d656d6954) at bitstring.c:483
> 20 31 2d 3d 65 6d 69 54
> " 1-=emiT"

In case you didn't notice, the mirror image of this string is "Time=-1 ", which isn't part of Slurm, but perhaps part of a user script or might be in your Lua or ??
(In reply to Moe Jette from comment #31) > (In reply to Ryan Cox from comment #30) > > Is this what you need? > > > > (gdb) frame 4 > > #4 0x00007fd4ca5dea8d in _build_row_bitmaps (p_ptr=0x7fd4ac04d450, > > job_ptr=0x28430c0) at select_cons_res.c:646 > > 646 ss[x].jstart = bit_ffs(ss[x].tmpjobs->node_bitmap); > > (gdb) print p_ptr->row[i] > > $1 = {row_bitmap = 0x7fd4ac13e450, num_jobs = 2060, job_list = > > 0x7fd4ac0c9690, job_list_size = 2136} > > (gdb) print p_ptr->row[i].job_list[j] > > $2 = (struct job_resources *) 0x0 > > (gdb) print i > > $3 = 0 > > (gdb) print j > > $4 = 382 > > It would be nice to get some of the other values around here too. > > > Did you have a chance to run slurmctld under valgrind yet? > Only for a few minutes so far. sdiag says it completed two backfill cycles. I'll attach the log. > > > #0 0x0000000000509175 in bit_ffs (b=0x20312d3d656d6954) at bitstring.c:483 > > 20 31 2d 3d 65 6d 69 54 > > " 1-=emiT" > > In case you didn't notice the mirror image of this string is "Time=-1 ", > which isn't part of Slurm, but perhaps part of a user Script or might be in > your Lua or ?? I can't find any instances of it in /etc/slurm (includes lua) or in scontrol -dd show job (as root so I see the scripts). I also checked the state directory and couldn't find 'Time=-1' there either. It may have been there previously in a job but isn't there anymore.
Created attachment 2968 [details] valgrind.log Here's the valgrind output. I'm taking off now for the weekend.
The valgrind output did locate a memory leak with respect to the checkpoint/blcr plugin, which I fixed here:
https://github.com/SchedMD/slurm/commit/08d520db583c20c52b14490c03b2dcb3da388960

On Saturday I made a third pass over all of the changes in Slurm v15.08.9, and over all uses of the relevant data structures (not just recent changes), without finding anything of interest beyond changing a comment for greater clarity, committed here:
https://github.com/SchedMD/slurm/commit/3f51a7889880997f4aee30bc27872ae1f4206eca

I thought of one more idea yesterday that I want to pursue more today, but I'm grasping at straws... I'm sorry this is taking so long. How frequently is slurmctld crashing on you?
> I'm sorry this is taking so long. How frequently is slurmctld crashing on
> you?

10 times between Friday night and this morning.

Do you think I should apply the memory leak fix? The only reason I ask is that I don't know that anyone is actually using the Slurm blcr support, and recompiling might mess up the ability to look at old core dumps. It's probably a small enough change not to affect most gdb work, however.

I did find something interesting (but possibly unrelated) in one core file. One job was trying to preempt another. The standby job (9930598) ended at 14:52 on Saturday with a reason of FAILED (likely preempted). That was the same time the core file was generated. I can find no record of the job that initiated the preemption (9933390) in sacct. The only entries for 9933390 in the logs were:

[2016-04-02T14:51:59.225] _slurm_rpc_submit_batch_job JobId=9933390 usec=7020
[2016-04-02T14:52:27.478] Purged files for defunct batch job 9933390

Almost immediately after it was submitted, slurmctld crashed and was restarted at 14:52:20 by cron (there's a sleep 20 before the check and restart... long story).

From slurmctld (grep 9930598, doesn't show surrounding entries):

[2016-04-02T12:53:41.780] _slurm_rpc_submit_batch_job JobId=9930598 usec=4012
[2016-04-02T14:49:02.124] sched: Allocate JobID=9930598 NodeList=m6-18-[7,9] #CPUs=16 Partition=m6n
[2016-04-02T14:52:27.203] recovered job step 9930598.0
[2016-04-02T14:52:27.203] Recovered JobID=9930598 State=0x1 NodeCnt=0 Assoc=118
[2016-04-02T14:52:42.393] job_complete: JobID=9930598 State=0x1 NodeCnt=2 WEXITSTATUS 1
[2016-04-02T14:52:42.393] email msg to theemailaddressredacted@byryan SLURM Job_id=9930598 Name=sBE.Sc_sv,H254 Failed, Run time 00:03:40, FAILED, ExitCode 1
[2016-04-02T14:52:42.394] job_complete: JobID=9930598 State=0x8005 NodeCnt=2 done

slurmd logs from node 0 (grep 9930598, doesn't show surrounding entries from other jobs):

[2016-04-02T14:49:02.129] _run_prolog: prolog with lock for job 9930598 ran for 0 seconds
[2016-04-02T14:49:02.134] Launching batch job 9930598 for UID 6181
[2016-04-02T14:49:02.154] [9930598] checkpoint/blcr init
[2016-04-02T14:49:02.178] [9930598] task/cgroup: /slurm/uid_6181/job_9930598: alloc=4096MB mem.limit=4096MB memsw.limit=4300MB
[2016-04-02T14:49:02.178] [9930598] task/cgroup: /slurm/uid_6181/job_9930598/step_batch: alloc=4096MB mem.limit=4096MB memsw.limit=4300MB
[2016-04-02T14:49:02.222] [9930598.4294967295] checkpoint/blcr init
[2016-04-02T14:49:02.236] [9930598.4294967295] task/cgroup: /slurm/uid_6181/job_9930598: alloc=4096MB mem.limit=4096MB memsw.limit=4300MB
[2016-04-02T14:49:02.236] [9930598.4294967295] task/cgroup: /slurm/uid_6181/job_9930598/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=4300MB
[2016-04-02T14:52:36.988] [9930598.4294967295] *** STEP 9930598.4294967295 ON m6-18-7 CANCELLED AT 2016-04-02T14:52:36 ***
[2016-04-02T14:52:38.175] [9930598.4294967295] done with job
[2016-04-02T14:52:42.369] [9930598] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256
[2016-04-02T14:52:42.402] [9930598] done with job

It's very likely that another job could have preempted it once the slurmctld was alive again.
Created attachment 2969 [details]
gdb session for core-2016-04-02_14:52 (preemption)

Some gdb info from the core mentioned in comment 35.
Looking back, it does appear that we had a core dump on March 11 that we didn't report. I think we were busy that day and decided to report it if it happened again. That was on .8.
I would not worry about the memory leak patch. I doubt that you would notice any difference.

I'm going to widen the search for changes back to version 15.08.7, based upon the hypothesis that the problem was introduced earlier but is happening far more frequently now for unknown reasons.
(In reply to Moe Jette from comment #38)
> I would not worry about the memory leak patch. I doubt that you would notice
> any difference.
>
> I'm going to widen the search for changes back to version 15.08.7 based upon
> the hypothesis the problem was introduce earlier, but is happening far more
> frequently now due to unknown reasons.

Sounds good.

I briefly looked at all core dumps and who submitted the jobs in question. By far the most common style of problem from the core dumps was:

#0  0x0000000000509175 in bit_ffs (b=0x62a) at bitstring.c:483
#1  0x00007ff401811a8d in _build_row_bitmaps (p_ptr=0x24881f0, job_ptr=0x2b69e60) at select_cons_res.c:646
#2  0x00007ff401813851 in _rm_job_from_res (part_record_ptr=0x2488230, node_usage=0x24991e0, job_ptr=0x2b69e60, action=0) at select_cons_res.c:1255
#3  0x00007ff4018160e6 in select_p_job_fini (job_ptr=0x2b69e60) at select_cons_res.c:2296
#4  0x0000000000525fda in select_g_job_fini (job_ptr=0x2b69e60) at node_select.c:691
#5  0x000000000048faf9 in deallocate_nodes (job_ptr=0x2b69e60, timeout=false, suspended=false, preempted=false) at node_scheduler.c:484
#6  0x000000000045b9bc in job_complete (job_id=9942351, uid=0, requeue=false, node_fail=false, job_return_code=0) at job_mgr.c:4961
#7  0x00000000004a804a in _slurm_rpc_complete_batch_script (msg=0x1af9ee0, locked=false) at proc_req.c:2219
#8  0x00000000004a284c in slurmctld_req (msg=0x1af9ee0, arg=0x7ff3d8221260) at proc_req.c:346
#9  0x0000000000443d0b in _service_connection (arg=0x7ff3d8221260) at controller.c:1096
#10 0x00007ff402be18ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#11 0x00007ff402947abd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()

There is no pattern between submitters, though the ones with many short jobs are more likely to show up in the core dumps (no surprise there). Some jobs are preemptable, some aren't.
Would it help to have access to one of the core files and a way to run gdb on it? I can give you a CDE tarball (http://www.pgbovine.net/cde/manual/basic-usage.html) that you can run directly. If you're not familiar with it, it tracks files and libraries that a binary opens during the course of a run. I used it when I ran gdb on one of the core files. It's about 22M when compressed. I tested it on a different system and it seemed to work well, just like running it from the system running the ctld.
(In reply to Ryan Cox from comment #40)
> Would it help to have access to one of the core files and a way to run gdb
> on it? I can give you a CDE tarball
> (http://www.pgbovine.net/cde/manual/basic-usage.html) that you can run
> directly. If you're not familiar with it, it tracks files and libraries
> that a binary opens during the course of a run. I used it when I ran gdb on
> one of the core files. It's about 22M when compressed. I tested it on a
> different system and it seemed to work well, just like running it from the
> system running the ctld.

I've been reviewing the changes back to version 15.08.7 and haven't found anything suspect. I'm making a second pass now.

Yes, a CDE tarball might be helpful. I don't have any better ideas. Given the size, perhaps you can email it to me directly rather than attaching it.
Since many of these die due to a failed assertion, are there any debugging statements we can add prior to the assertion? Basically, test the same condition as the assertion somewhere before the assert, print out lots of debugging information, then allow the assert to happen as normal?
(In reply to Ryan Cox from comment #42)
> Since many of these die due to a failed assertion, are there any debugging
> statements we can add prior to the assertion? Basically test the same thing
> as the assertion somewhere prior to assert, print out lots of debugging
> information, then allow assert to happen as normal?

That's actually what I'm working on now. I'll send you a patch before the end of today.
Created attachment 2972 [details]
additional debugging

This patch adds a magic cookie to the data structure that I am most concerned with, and tests for that magic cookie pretty much whenever the data structure is touched.

Of particular note, it checks for that magic cookie right before trying to use the data structure where you keep getting the core dumps.

I also added a check right before the job data structure is purged to look for any vestigial records for the job in the select plugin.

You didn't perchance start making changes related to partitions? Changing the partition associated with a running job, for example, could cause this problem if the select plugin isn't modified to handle that.
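The magic-cookie technique in the patch is a common C debugging pattern: write a sentinel value into the structure at allocation time and assert on it at every point of use. A rough Python analogue of the idea (all names here are hypothetical, not the actual patch, which is C code in select/cons_res):

```python
# Sketch of magic-cookie validation: catch use of a freed or overwritten
# record at the point of use, rather than crashing later on a garbage pointer.
SELECT_MAGIC = 0x92A3B4C5  # arbitrary sentinel written at allocation time

class SelectJobRecord:
    def __init__(self, job_id):
        self.magic = SELECT_MAGIC  # set when the record is created
        self.job_id = job_id

    def invalidate(self):
        self.magic = 0  # cleared when the record is freed

def check_magic(rec):
    # Analogue of xassert(rec->magic == SELECT_MAGIC) before each use.
    if rec.magic != SELECT_MAGIC:
        raise AssertionError("corrupt or stale select record for job %s"
                             % rec.job_id)

rec = SelectJobRecord(9942351)
check_magic(rec)    # passes on a live record
rec.invalidate()
# check_magic(rec)  # would now raise AssertionError
```

The payoff is that memory corruption is detected where the structure is touched, producing a useful log message instead of a SEGV deep inside bit_ffs().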
(In reply to Moe Jette from comment #44)
> Created attachment 2972 [details]
> additional debugging
>
> This patch adds a magic cookie to the data structure that I am mostly
> concerned with and tests for that magic cookie pretty much whenever the data
> structure is touched.
>
> Of particular note, it checks for that magic cookie right before trying to
> use the data structure where you keep getting the core dumps.
>
> I also added a check right before the job data structure is purged to look
> for any vestigial records for the job in the select plugin.

I just deployed this, so hopefully it will give us some good information when I get to work tomorrow.

> You didn't perchance start making changes related to partitions?
> Changing the partition associated with a running job for example could cause
> this problem if the select plugin isn't modified to handle that.

We split a small partition into two on March 23. All Slurm daemons of every kind have been restarted since then. That was after the March 11 segfault, and it was a really long time before that (mid-2015) that we last made a change to partitions.
Please upload the slurmctld log file when you have a chance.
Oddly enough we haven't had another crash since I installed the patch yesterday. I would have expected at least two crashes by now. Would you still like the log file or should I wait until there's a crash?
(In reply to Ryan Cox from comment #47)
> Oddly enough we haven't had another crash since I installed the patch
> yesterday. I would have expected at least two crashes by now. Would you
> still like the log file or should I wait until there's a crash?

Yes please. The changes in the patch are intended not only to produce logs that should help me identify the root problem, but also to prevent the crash in most cases.
Created attachment 2974 [details]
slurmctld.log.bz2 after debugging patch

I can send you more of the logs from yesterday, but they're 80M compressed (I ran with SelectType debugging for a while yesterday but it was going to fill the file system, so I stopped it). Hopefully this is enough of the logs for you. If not, I can try to trim yesterday's logs or send them to you another way.
I think that I should have sufficient information here. Thanks!
Great. Let me know if you want us to add another patch or anything.
Here's what I know now.

When a job is started, the select/cons_res plugin creates an entry in a data structure for it. That entry holds a record of the job allocation and a partition pointer (copied from the job data structure).

Later, when the job ends, the select/cons_res plugin tries to remove the job record. At that point the job's partition pointer is valid, but DIFFERENT from its value when the job started. This mismatch prevents the job record in select/cons_res from being cleared. Later still, when the job is purged from slurmctld and the select/cons_res data structure is manipulated, there are bad pointers causing the abort (they pointed into the job record, which has since been purged).

My patch prevents these aborts and logs the events, but I have been unable to identify any mechanism that changes the partition pointer of a running job, other than an "scontrol reconfig", which synchronizes the changes with the select/cons_res plugin. There are many such job records in the log, all as a PAIR of back-to-back messages like these:

[2016-04-05T08:25:47.762] error: Found vestigial select record for job 9949087
[2016-04-05T08:25:47.762] error: Partition pointer mis-match
[2016-04-05T08:26:47.198] error: Found vestigial select record for job 9949060
[2016-04-05T08:26:47.198] error: Partition pointer mis-match
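The failure mode described here — a removal keyed on a pointer that has since changed — can be mimicked in a few lines. This is a toy model with hypothetical structures, not Slurm code:

```python
# Toy model of the vestigial-record problem: the select plugin files per-job
# records under the partition object known at job start, and removal looks
# under the job's *current* partition pointer.
class Partition:
    def __init__(self, name):
        self.name = name

part_at_start = Partition("m6n")
part_swapped = Partition("m6n")  # same name, but a different object/pointer

# Job starts: record filed under the partition pointer at start time.
records = {id(part_at_start): ["job 9949087"]}

job_part_ptr = part_at_start  # pointer stored in the job structure
job_part_ptr = part_swapped   # silently replaced while the job runs

# Job ends: lookup under the new pointer finds nothing, so the record filed
# under the old pointer is never cleared -- a vestigial record remains.
leftover = records.get(id(job_part_ptr), [])
print(leftover)  # -> []
```

Once the job itself is purged, that vestigial record still holds pointers into freed memory, which matches the bit_ffs() aborts in the backtraces.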
That's really interesting. I've been trying to look for patterns with the jobs that cause the aborts but haven't found anything remotely resembling a pattern. I tested several theories but none panned out. I skimmed the patch but didn't look too closely. So the patch should prevent aborts in addition to doing the logging?
(In reply to Ryan Cox from comment #53)
> I skimmed the patch but didn't look too closely. So the patch should prevent
> aborts in addition to doing the logging?

Yes
(In reply to Ryan Cox from comment #53)
> That's really interesting. I've been trying to look for patterns with the
> jobs that cause the aborts but haven't found anything remotely resembling a
> pattern. I tested several theories but none panned out.

Here are some of the things that I suspect are factors:
"Shared=yes" in the partitions
Jobs being submitted to multiple partitions
There may very well be a timing component too
Fixed a backfill scheduler race condition that could cause an invalid pointer in the select/cons_res plugin. Bug introduced in 15.08.9, commit efd9d35.

The scenario is as follows:
1. The backfill scheduler is running, then releases its locks.
2. The main scheduling loop starts a job "A".
3. The backfill scheduler resumes, finds job "A" in its queue, and resets its partition pointer.
4. Job "A" completes and tries to remove its resource allocation record from the select/cons_res data structure, but fails to find it because it is looking in the table for the wrong partition.
5. Job "A"'s record gets purged from slurmctld.
6. The select/cons_res plugin attempts to operate on the resource allocation data structure, finds a pointer into the now-purged data structure of job "A", and aborts or gets a SEGV.

Commit here:
https://github.com/SchedMD/slurm/commit/d8b18ff8dc6f2237db3729f4ab085b003e54c89e

We'll likely tag version 15.08.10 this week with the fix.
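One plausible shape for closing this race is to make step 3 a no-op for jobs that are no longer pending. I have not reproduced the actual commit here; the names and the pending-state guard below are assumptions, sketched for illustration only:

```python
# Sketch of the race fix described above: the backfill scheduler must not
# rewrite the partition pointer of a job that the main scheduler already
# started while backfill was sleeping. Illustrative names, not Slurm code.
PENDING, RUNNING = "PENDING", "RUNNING"

class Job:
    def __init__(self, job_id, part_ptr):
        self.job_id = job_id
        self.state = PENDING
        self.part_ptr = part_ptr

def backfill_resume(job, queued_part_ptr):
    # Guard: only a still-pending job may have its partition pointer reset
    # from the (possibly stale) backfill queue entry.
    if job.state != PENDING:
        return
    job.part_ptr = queued_part_ptr

job = Job(9933390, part_ptr="partA")
job.state = RUNNING            # step 2: main scheduler started the job
backfill_resume(job, "partB")  # step 3 now leaves the pointer alone
print(job.part_ptr)  # -> partA
```

With the pointer left intact, the removal in step 4 finds the record under the correct partition, so steps 5 and 6 never see stale pointers.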
Excellent. Thanks for all the hard work in getting this fixed. I went ahead and just reinstalled with everything up through that commit since it all looked pretty safe.