Ticket 2603 - slurmctld segfault
Summary: slurmctld segfault
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 15.08.9
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-03-31 10:03 MDT by Ryan Cox
Modified: 2024-04-29 04:50 MDT

See Also:
Site: BYU - Brigham Young University
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 15.08.10
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
gdb session (42.20 KB, text/plain)
2016-03-31 10:03 MDT, Ryan Cox
Details
slurmctld.log.bz2 (910.06 KB, application/x-bzip)
2016-03-31 10:05 MDT, Ryan Cox
Details
gdb sessions (first and second when -O2, etc were enabled) (9.09 KB, text/plain)
2016-03-31 10:21 MDT, Ryan Cox
Details
gdb session for core-2016-03-31_22:08 (18.64 KB, text/plain)
2016-04-01 01:21 MDT, Ryan Cox
Details
gdb session for core-2016-04-01_03:24 (47.26 KB, text/plain)
2016-04-01 01:27 MDT, Ryan Cox
Details
slurmctld.log.1.gz (1.92 MB, application/octet-stream)
2016-04-01 01:30 MDT, Ryan Cox
Details
gdb session for core-2016-04-01_09:21 (13.34 KB, text/plain)
2016-04-01 02:33 MDT, Ryan Cox
Details
valgrind.log (337.83 KB, text/plain)
2016-04-01 10:06 MDT, Ryan Cox
Details
gdb session for core-2016-04-02_14:52 (preemption) (19.25 KB, text/plain)
2016-04-04 03:02 MDT, Ryan Cox
Details
additional debugging (15.53 KB, patch)
2016-04-04 10:49 MDT, Moe Jette
Details | Diff
slurmctld.log.bz2 after debugging patch (288.24 KB, application/x-bzip)
2016-04-05 03:37 MDT, Ryan Cox
Details

Description Ryan Cox 2016-03-31 10:03:18 MDT
Created attachment 2957 [details]
gdb session

slurmctld has been crashing occasionally since we upgraded to .9 today.  A cron job restarts it and it runs for a while before crashing again.  It has happened several times.  I'm attaching the backtrace as a file since it's much longer than usual due to the number of threads.
Comment 1 Ryan Cox 2016-03-31 10:05:39 MDT
Created attachment 2958 [details]
slurmctld.log.bz2
Comment 2 Danny Auble 2016-03-31 10:07:50 MDT
Hey Ryan,

Most likely not related to this bug, but are these errors expected?

[2016-03-31T16:55:02.799] error: find_node_record: lookup failure for m6stage-1-4
[2016-03-31T16:55:02.799] error: node_name2bitmap: invalid node specified m6stage-1-4

error: find_node_record: lookup failure for m6stage-1-5
[2016-03-31T16:55:02.832] error: node_name2bitmap: invalid node specified m6stage-1-5
Comment 3 Ryan Cox 2016-03-31 10:11:13 MDT
Yes.  They're installed from our provisioning server but aren't in the Slurm config.  They're old staging nodes, so they're probably just trying to do something like mark themselves offline due to hardware failure.
Comment 4 Danny Auble 2016-03-31 10:12:38 MDT
Ryan, in that frame 2 of thread 1 could you 

print *ss[x].tmpjobs
Comment 5 Ryan Cox 2016-03-31 10:14:31 MDT
#0  0x0000000000509175 in bit_ffs (b=0x20312d3d656d6954) at bitstring.c:483
483		_assert_bitstr_valid(b);
(gdb) thread 1
[Switching to thread 1 (Thread 28662)]#0  0x0000000000509175 in bit_ffs (b=0x20312d3d656d6954) at bitstring.c:483
483		_assert_bitstr_valid(b);
(gdb) frame 2
#2  0x00007f2a002a1851 in _rm_job_from_res (part_record_ptr=0x3d88ab0, node_usage=0x3d82fb0, job_ptr=0x38ba860, action=0) at select_cons_res.c:1255
1255				_build_row_bitmaps(p_ptr, job_ptr);
(gdb) print *ss[x].tmpjobs
No symbol "ss" in current context.
Comment 6 Danny Auble 2016-03-31 10:16:47 MDT
Also, you say it is crashing rather often.  Did the other backtrace files look similar?

Have you seen any similarities among the jobs that trigger this?  Not much has changed in the cons_res plugin, which makes this seem like an older bug.
Comment 7 Danny Auble 2016-03-31 10:18:00 MDT
(In reply to Ryan Cox from comment #5)
> #0  0x0000000000509175 in bit_ffs (b=0x20312d3d656d6954) at bitstring.c:483
> 483		_assert_bitstr_valid(b);
> (gdb) thread 1
> [Switching to thread 1 (Thread 28662)]#0  0x0000000000509175 in bit_ffs
> (b=0x20312d3d656d6954) at bitstring.c:483
> 483		_assert_bitstr_valid(b);
> (gdb) frame 2
> #2  0x00007f2a002a1851 in _rm_job_from_res (part_record_ptr=0x3d88ab0,
> node_usage=0x3d82fb0, job_ptr=0x38ba860, action=0) at select_cons_res.c:1255
> 1255				_build_row_bitmaps(p_ptr, job_ptr);
> (gdb) print *ss[x].tmpjobs
> No symbol "ss" in current context.

Whoops frame 1, sorry.
Comment 8 Ryan Cox 2016-03-31 10:18:41 MDT
> Whoops frame 1, sorry.

(gdb) frame 1
#1  0x00007f2a0029fa8d in _build_row_bitmaps (p_ptr=0x3d88a70, job_ptr=0x38ba860) at select_cons_res.c:646
646				ss[x].jstart = bit_ffs(ss[x].tmpjobs->node_bitmap);
(gdb) print *ss[x].tmpjobs
$1 = {core_bitmap = 0x725f657461657263, core_bitmap_used = 0x6d614e203a767365, cpu_array_cnt = 913128805, cpu_array_value = 0x68746c6165685f38, cpu_array_reps = 0x616d3a6b63656863, cpus = 0x636e616e65746e69, cpus_used = 0x5320353536312e65, cores_per_socket = 0x656d695474726174, 
  memory_allocated = 0x33302d363130323d, memory_used = 0x353a36315431332d, nhosts = 926038581, node_bitmap = 0x20312d3d656d6954, node_req = 1634891076, nodes = 0x67616c462030313d <Address 0x67616c462030313d out of bounds>, ncpus = 1095581043, sock_core_rep_count = 0x2c50414c5245564f, 
  sockets_per_node = 0x4a5f45524f4e4749, whole_node = 79 'O'}
Comment 9 Ryan Cox 2016-03-31 10:21:25 MDT
Created attachment 2959 [details]
gdb sessions (first and second when -O2, etc were enabled)

(In reply to Danny Auble from comment #6)
> Also, you say it is crashing rather often.  Did the other backtraces files
> look similar?
> 
> Have you seen any similarities from the jobs that trigger this?  Nothing
> much has changed in the cons_res plugin which makes this seem like an older
> bug.

I didn't think to copy the core files elsewhere in time so this is all I have left from them (attached).  The first core file you saw in this ticket was with -O0 and all the default settings.  The ones from earlier today that I'm attaching now were from slurmctld when it was compiled with -O2.
Comment 10 Danny Auble 2016-03-31 10:22:39 MDT
Hmm, well that looks like trashed memory.

In the same frame could you

print *p_ptr
print p_ptr->row[i]
print x
print j
print i
Comment 11 Ryan Cox 2016-03-31 10:24:30 MDT
(gdb) print *p_ptr
$2 = {next = 0x3d88a30, num_rows = 5, part_ptr = 0x28bcf80, row = 0x3eb62f0}
(gdb) print p_ptr->row[i]
$3 = {row_bitmap = 0x3ebae50, num_jobs = 41, job_list = 0x3eb8350, job_list_size = 48}
(gdb) print x
$4 = 4
(gdb) print j
$5 = 4
(gdb) print i
$6 = 0
Comment 12 Danny Auble 2016-03-31 10:26:51 MDT
Well, the other cores are in the same place; it seems like we are running off the end of the array somehow.

Would you be willing to put this on valgrind and see if it turns up anything?
Comment 13 Ryan Cox 2016-03-31 10:31:02 MDT
(In reply to Danny Auble from comment #12)
> Well the other cores are in the same place, seems like we are going off the
> array somehow.
> 
> Would you be willing to put this on valgrind and see if it turns up anything?

Sure, but I think I'll wait until the morning.  I'll let this go overnight since I have cron restart it when needed, the backup controller seems to take over successfully every time, and the primary stays up for dozens of minutes at least before crashing.  This should give us another core file to work with, I imagine.

I'm taking off now and will get back to this first thing tomorrow.
Comment 14 Danny Auble 2016-03-31 10:31:29 MDT
No problem, thanks :).
Comment 15 Danny Auble 2016-03-31 10:47:57 MDT
Ryan, we think this might be related to sharing of resources?

It appears you have shared=yes on your partitions.

Do you happen to know if your users request the -s|--share option?

Could you, in frame 1

print *job_ptr->details

Based on these numbers given it looks like you are sharing resources, perhaps unknowingly?

That should be fine, but perhaps with this config there may be a bug.
Comment 16 Ryan Cox 2016-03-31 11:06:32 MDT
(In reply to Danny Auble from comment #15)
> Ryan, we think this might be related to sharing of resources?
> 
> It appears you have shared=yes on your partitions.
> 
> Do you happen to know if your users request the -s|--share option?
> 
> Could you, in frame 1
> 
> print *job_ptr->details
> 
> Based on these numbers given it looks like you are sharing resources,
> perhaps unknowingly?
> 
> That should be fine, but perhaps with this config there may be a bug.

All partitions are set to be shared by default. Some users request --exclusive but many jobs are shared.
Comment 17 Danny Auble 2016-03-31 11:11:25 MDT
Keep in mind that shared means sharing/oversubscribing CPUs, not having multiple jobs share a node; the latter happens by default. The term "shared" is very misleading: it comes from the days of whole-node allocations and is being changed in 16.05 to something clearer, like "oversubscribe".

Comment 18 Ryan Cox 2016-03-31 13:03:39 MDT
Everything is shared by default unless they request --exclusive.

Comment 19 Ryan Cox 2016-04-01 01:10:34 MDT
(In reply to Danny Auble from comment #15)
> Ryan, we think this might be related to sharing of resources?
> 
> It appears you have shared=yes on your partitions.
> 
> Do you happen to know if your users request the -s|--share option?


I wasn't aware that it meant oversubscribe.  I really doubt that anyone specifies that, and I've never seen any evidence of it.  Is there any easy way to check running/pending jobs for it?  This is all I could come up with:
$ squeue -ho %h | sort | uniq -c
     94 no
   5902 unknwn



> Could you, in frame 1
> 
> print *job_ptr->details
> 
> Based on these numbers given it looks like you are sharing resources,
> perhaps unknowingly?
> 
> That should be fine, but perhaps with this config there may be a bug.

(gdb) frame 1
#1  0x00007f2a0029fa8d in _build_row_bitmaps (p_ptr=0x3d88a70, job_ptr=0x38ba860) at select_cons_res.c:646
646				ss[x].jstart = bit_ffs(ss[x].tmpjobs->node_bitmap);
(gdb) print *job_ptr->details
$1 = {acctg_freq = 0x0, argc = 1, argv = 0x38bae10, begin_time = 1459459189, ckpt_dir = 0x38badb0 "/var/slurm/checkpoint", contiguous = 0, core_spec = 65534, cpu_bind = 0x0, cpu_bind_type = 65534, cpu_freq_min = 4294967294, cpu_freq_max = 4294967294, cpu_freq_gov = 4294967294, cpus_per_task = 1, 
  depend_list = 0x0, dependency = 0x0, orig_dependency = 0x0, env_cnt = 0, env_sup = 0x0, exc_node_bitmap = 0x0, exc_nodes = 0x0, expanding_jobid = 0, feature_list = 0x0, features = 0x0, magic = 0, max_cpus = 4294967294, max_nodes = 0, mc_ptr = 0x38bade0, mem_bind = 0x0, mem_bind_type = 0, min_cpus = 16, 
  min_nodes = 1, nice = 10000, ntasks_per_node = 0, num_tasks = 16, open_mode = 0 '\000', overcommit = 0 '\000', plane_size = 0, pn_min_cpus = 1, pn_min_memory = 2147484672, pn_min_tmp_disk = 0, prolog_running = 0 '\000', reserved_resources = 0, req_node_bitmap = 0x0, req_node_layout = 0x0, 
  preempt_start_time = 0, req_nodes = 0x0, requeue = 0, restart_dir = 0x0, share_res = 254 '\376', std_err = 0x0, std_in = 0x38bad10 "/dev/null", std_out = 0x0, submit_time = 1459459189, task_dist = 8192, usable_nodes = 0, whole_node = 0 '\000', 
  work_dir = 0x38bad40 "/panfs/pan.fsl.byu.edu/scr/usr/60/josephsl/graphener3/Zr_sv,Vc/5860/planarBE"}
Comment 20 Ryan Cox 2016-04-01 01:21:23 MDT
Created attachment 2962 [details]
gdb session for core-2016-03-31_22:08

Another core from last night
Comment 21 Ryan Cox 2016-04-01 01:27:35 MDT
Created attachment 2963 [details]
gdb session for core-2016-04-01_03:24

Here's another core.  There were two other crashes that I didn't capture the core files for.  I'm going to try looking for similarities.
Comment 22 Ryan Cox 2016-04-01 01:30:34 MDT
Created attachment 2964 [details]
slurmctld.log.1.gz

Logs covering the time period that the crashes happened.
Comment 23 Ryan Cox 2016-04-01 02:33:09 MDT
Created attachment 2967 [details]
gdb session for core-2016-04-01_09:21

Here's the last gdb session I'll send you until I hear otherwise.  This job was running so I also got this:

root@sched1:/var/log/slurm# scontrol show job -dd 9919523
JobId=9919523 JobName=grid study fsl test
   UserId=thomas95(21131) GroupId=thomas95(21350)
   Priority=101490 Nice=0 Account=sn72 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=13:42:25 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2016-03-31T19:49:38 EligibleTime=2016-03-31T19:49:38
   StartTime=2016-03-31T19:49:51 EndTime=2016-04-01T19:49:51
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=m6n AllocNode:Sid=m8int01:100465
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=m6-17-[5,13,15-16],m6-20-[2-8,10,12,14,16]
   BatchHost=m6-17-5
   NumNodes=15 NumCPUs=69 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=69,mem=282624,node=15
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=m6-17-5 CPU_IDs=0-1,6-7 Mem=16384
     Nodes=m6-17-[13,15-16],m6-20-2 CPU_IDs=0-2,6-8 Mem=24576
     Nodes=m6-20-3 CPU_IDs=1-3,6-7 Mem=20480
     Nodes=m6-20-4 CPU_IDs=0-2,6-7 Mem=20480
     Nodes=m6-20-5 CPU_IDs=0-1,6-7 Mem=16384
     Nodes=m6-20-6 CPU_IDs=0-2,6-8 Mem=24576
     Nodes=m6-20-7 CPU_IDs=0,4,8-9 Mem=16384
     Nodes=m6-20-8 CPU_IDs=0 Mem=4096
     Nodes=m6-20-10 CPU_IDs=0-1,6-7 Mem=16384
     Nodes=m6-20-12 CPU_IDs=10 Mem=4096
     Nodes=m6-20-14 CPU_IDs=1-3,6-7 Mem=20480
     Nodes=m6-20-16 CPU_IDs=0-2,6-8 Mem=24576
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bluehome3/thomas95/FLOWlab/Projects/FlorisChangesCode/GridStudy/fsl_run_cases.sh
   WorkDir=/bluehome3/thomas95/FLOWlab/Projects/FlorisChangesCode/GridStudy
   StdErr=/bluehome3/thomas95/FLOWlab/Projects/FlorisChangesCode/GridStudy/slurm-9919523.out
   StdIn=/dev/null
   StdOut=/bluehome3/thomas95/FLOWlab/Projects/FlorisChangesCode/GridStudy/slurm-9919523.out
   BatchScript=
#!/bin/bash

#SBATCH --time=24:00:00   # walltime
#SBATCH --ntasks=69   # number of processor cores (i.e. tasks)
#SBATCH --mem-per-cpu=4096M   # memory per CPU core
#SBATCH -J "grid study fsl test"   # job name
#SBATCH --mail-user=jaredthomas@byu.net   # email address
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
for rows in 2 3 4 5 6 7 8 9 10; do
    for case in 1 2 3 4 5 6; do
        mpirun -np 69 python GridStudy.py $rows $case
    done
done

   Power= SICP=0
Comment 24 Ryan Cox 2016-04-01 02:38:10 MDT
In the core file core-2016-04-01_09:21, I noticed that there was a job testing whether to preempt another job.  Here is more information:

(gdb) thread 1
[Switching to thread 1 (Thread 24592)]#0  0x00007f4697c49f15 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64	in ../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0  0x00007f4697c49f15 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f4697c4cd20 in *__GI_abort () at abort.c:92
#2  0x00007f4697c430a1 in *__GI___assert_fail (assertion=0x63e190 "(b) != ((void *)0)", file=<value optimized out>, line=483, function=0x63e60d "bit_ffs") at assert.c:81
#3  0x0000000000509171 in bit_ffs (b=0x0) at bitstring.c:483
#4  0x00007f4696bbaa8d in _build_row_bitmaps (p_ptr=0x7f468c0838e0, job_ptr=0x22d6910) at select_cons_res.c:646
#5  0x00007f4696bbc851 in _rm_job_from_res (part_record_ptr=0x7f468c0bcf10, node_usage=0x7f468c3d0180, job_ptr=0x22d6910, action=0) at select_cons_res.c:1255
#6  0x00007f4696bbe09f in _will_run_test (job_ptr=0x340bf30, bitmap=0x7f468c1f61a0, min_nodes=1, max_nodes=500000, req_nodes=1, job_node_req=1, preemptee_candidates=0x0, preemptee_job_list=0x7f46940a1aa8, exc_core_bitmap=0x0) at select_cons_res.c:1859
#7  0x00007f4696bbede8 in select_p_job_test (job_ptr=0x340bf30, bitmap=0x7f468c1f61a0, min_nodes=1, max_nodes=500000, req_nodes=1, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f46940a1aa8, exc_core_bitmap=0x0) at select_cons_res.c:2196
#8  0x0000000000525d3b in select_g_job_test (job_ptr=0x340bf30, bitmap=0x7f468c1f61a0, min_nodes=1, max_nodes=500000, req_nodes=1, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f46940a1aa8, exc_core_bitmap=0x0) at node_select.c:587
#9  0x00007f46940a6a3b in _try_sched (job_ptr=0x340bf30, avail_bitmap=0x7f46940a1cd8, min_nodes=1, max_nodes=500000, req_nodes=1, exc_core_bitmap=0x0) at backfill.c:418
#10 0x00007f46940a9766 in _attempt_backfill () at backfill.c:1356
#11 0x00007f46940a7780 in backfill_agent (args=0x0) at backfill.c:706
#12 0x00007f4697f8a8ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#13 0x00007f4697cf0abd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()
(gdb) frame 8
#8  0x0000000000525d3b in select_g_job_test (job_ptr=0x340bf30, bitmap=0x7f468c1f61a0, min_nodes=1, max_nodes=500000, req_nodes=1, mode=2, preemptee_candidates=0x0, preemptee_job_list=0x7f46940a1aa8, exc_core_bitmap=0x0) at node_select.c:587
587		return (*(ops[select_context_default].job_test))
(gdb) print *job_ptr
$1 = {account = 0x11b0b20 "bch", alias_list = 0x0, alloc_node = 0x25cb070 "m8-9-4", alloc_resp_port = 0, alloc_sid = 92326, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1214, assoc_ptr = 0xe3d830, batch_flag = 1, batch_host = 0x0, bit_flags = 8, burst_buffer = 0x0, 
  check_job = 0x28cd0f0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 0, billable_tres = 4294967294, cr_enabled = 1, db_index = 16940128, derived_ec = 0, details = 0x340c260, direct_set_prio = 0, end_time = 0, end_time_exp = 0, epilog_running = false, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, 
  gres_list = 0x0, gres_alloc = 0x0, gres_req = 0x0, gres_used = 0x0, group_id = 21702, job_id = 9921532, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 0, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x29a07f0}, 
  mail_type = 6, mail_user = 0x28c89f0 "paintmyeyesgreen@gmail.com", magic = 4038539564, name = 0x340c500 "lowY_sv,Vc3713", network = 0x0, next_step_id = 0, nodes = 0x0, node_addr = 0x0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 0, node_cnt_wag = 2, nodes_completing = 0x0, other_port = 0, 
  partition = 0x340c480 "m6n,m6,m8,m7", part_ptr_list = 0x1e696a0, part_nodes_missing = false, part_ptr = 0x3118330, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 101197, priority_array = 0x2600620, prio_factors = 0x340c3e0, profile = 0, qos_id = 1, 
  qos_ptr = 0xdb4840, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x7f468c3bd360 "m6-18-[10,13]", select_jobinfo = 0x340c620, sicp_mode = 0 '\000', spank_job_env = 0x0, spank_job_env_size = 0, 
  start_protocol_ver = 7424, start_time = 0, state_desc = 0x0, state_reason = 1, state_reason_prev = 1, step_list = 0x2051560, suspend_time = 0, time_last_active = 1459522519, time_limit = 120, time_min = 0, tot_sus_time = 0, total_cpus = 16, total_nodes = 0, tres_req_cnt = 0x28e7e10, 
  tres_req_str = 0x340ab70 "1=16,2=16384,4=1", tres_fmt_req_str = 0x340c530 "cpu=16,mem=16384,node=1", tres_alloc_cnt = 0x0, tres_alloc_str = 0x0, tres_fmt_alloc_str = 0x0, user_id = 21460, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, 
  best_switch = true, wait4switch_start = 0}
(gdb) print *job_ptr->details
$2 = {acctg_freq = 0x0, argc = 1, argv = 0x2099bf0, begin_time = 1459522519, ckpt_dir = 0x340c650 "/var/slurm/checkpoint", contiguous = 0, core_spec = 65534, cpu_bind = 0x0, cpu_bind_type = 65534, cpu_freq_min = 4294967294, cpu_freq_max = 4294967294, cpu_freq_gov = 4294967294, cpus_per_task = 1, 
  depend_list = 0x0, dependency = 0x0, orig_dependency = 0x0, env_cnt = 0, env_sup = 0x0, exc_node_bitmap = 0x0, exc_nodes = 0x0, expanding_jobid = 0, feature_list = 0x0, features = 0x0, magic = 0, max_cpus = 4294967294, max_nodes = 0, mc_ptr = 0x340c680, mem_bind = 0x0, mem_bind_type = 0, min_cpus = 16, 
  min_nodes = 1, nice = 10000, ntasks_per_node = 0, num_tasks = 16, open_mode = 0 '\000', overcommit = 0 '\000', plane_size = 0, pn_min_cpus = 1, pn_min_memory = 2147484672, pn_min_tmp_disk = 0, prolog_running = 0 '\000', reserved_resources = 0, req_node_bitmap = 0x0, req_node_layout = 0x0, 
  preempt_start_time = 0, req_nodes = 0x0, requeue = 0, restart_dir = 0x0, share_res = 0 '\000', std_err = 0x0, std_in = 0x340c590 "/dev/null", std_out = 0x0, submit_time = 1459522519, task_dist = 8192, usable_nodes = 0, whole_node = 0 '\000', 
  work_dir = 0x340c5c0 "/panfs/pan.fsl.byu.edu/scr/usr/60/josephsl/graphener3/Y_sv,Vc/3713"}
(gdb) print *preemptee_job_list
$3 = (List) 0x0
(gdb) quit
root@sched1:/var/log/slurm# scontrol -dd show job 9921532
JobId=9921532 JobName=lowY_sv,Vc3713
   UserId=josephsl(21460) GroupId=josephsl(21702)
   Priority=101197 Nice=0 Account=bch QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:01:55 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2016-04-01T08:55:19 EligibleTime=2016-04-01T08:55:19
   StartTime=2016-04-01T09:34:33 EndTime=2016-04-01T11:34:33
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=m7 AllocNode:Sid=m8-9-4:92326
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=m7-14-[5,11]
   BatchHost=m7-14-5
   NumNodes=2 NumCPUs=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=16384,node=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=m7-14-5 CPU_IDs=0-7,9-15 Mem=15360
     Nodes=m7-14-11 CPU_IDs=1 Mem=1024
   MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/panfs/pan.fsl.byu.edu/scr/usr/60/josephsl/graphener3/Y_sv,Vc/3713/job
   WorkDir=/panfs/pan.fsl.byu.edu/scr/usr/60/josephsl/graphener3/Y_sv,Vc/3713
   StdErr=/panfs/pan.fsl.byu.edu/scr/usr/60/josephsl/graphener3/Y_sv,Vc/3713/slurm-9921532.out
   StdIn=/dev/null
   StdOut=/panfs/pan.fsl.byu.edu/scr/usr/60/josephsl/graphener3/Y_sv,Vc/3713/slurm-9921532.out
   BatchScript=
#!/bin/bash

#SBATCH --time=2:0:00
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=1024M
#SBATCH --mail-user=paintmyeyesgreen@gmail.com
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=END
#SBATCH --job-name=lowY_sv,Vc3713
module unload mpi
module load mpi/openmpi-1.6.5_intel-13.0.1
module unload mkl
module load mkl/11/2
module unload python
module load python/2/7
mpiexec ./vasp.x > vasp.out

   Power= SICP=0
Comment 25 Moe Jette 2016-04-01 05:06:13 MDT
Can you tell me what version of Slurm that you were using prior to the upgrade and problems?

The point where the failures occur and the data structure being manipulated are both local to the select/cons_res plugin, and there haven't been any substantive changes to that plugin since 15.08.8, so I'm wondering if I need to look at changes further back in time.
Comment 26 Ryan Cox 2016-04-01 05:10:33 MDT
(In reply to Moe Jette from comment #25)
> Can you tell me what version of Slurm that you were using prior to the
> upgrade and problems?

We were running 15.08.8 since February 22.  Prior to that, we switched to .7 on January 22.

Sometimes we let the client commands lag behind on compute/login nodes by a few minor versions but they're at .8 currently.
Comment 27 Ryan Cox 2016-04-01 08:37:29 MDT
Any ideas about what else to try?  I could add debug statements in various places but don't know what to look for.  Many of the core dumps seem to be caused by SIGABRT rather than SIGSEGV, by the way.
Comment 28 Moe Jette 2016-04-01 09:09:17 MDT
Here's an update:

The SIGABRTs are due to Slurm code testing that a valid bitmap value is passed to a search function before the operation. Specifically, it checks for a non-zero pointer value and a valid "magic cookie". In your case it's dying due to a NULL pointer some of the time; other times it's operating on a pointer that's out of bounds.

I've set up a system here that looks like yours and, under valgrind, run both our regression tests and everything else I could think of to cause a problem like this, and found nothing.

I've reviewed all of the Slurm changes since version 15.08.8 and found that nothing remotely related has changed. Danny did the same exercise independently.

I've been studying everywhere the invalid data structure is manipulated and found nothing so far, but am continuing that.

What might be helpful on your side is:
1. You can run with a DebugFlags value of "SelectType". It's going to be really verbose, so you might only turn it on temporarily if you get into a state where the crashes are happening frequently.

2. In one of the core files, dump the full "p_ptr" data structure in the _build_row_bitmaps() function:
for (i = 0; i < p_ptr->num_rows; i++) {
  print *p_ptr->row[i]
  for (j = 0; j < p_ptr->row[i].num_jobs; j++) {
    print *p_ptr->row[i].job_list[j]
  }
}
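The loop above is C-style pseudocode to run by hand; an equivalent gdb command sequence using convenience variables might look like the sketch below (a sketch only, not from the ticket; note that a NULL job_list entry, as turns up later in this ticket, would stop the loop with a memory-access error):

```gdb
set $i = 0
while ($i < p_ptr->num_rows)
  print *p_ptr->row[$i]
  set $j = 0
  while ($j < p_ptr->row[$i].num_jobs)
    print *p_ptr->row[$i].job_list[$j]
    set $j = $j + 1
  end
  set $i = $i + 1
end
```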
Comment 29 Moe Jette 2016-04-01 09:23:11 MDT
I might be grasping at straws, but one thing I just noticed is that Slurm's bitmap pointer was set in a couple of places to something that looks like it might be ASCII text rather than an integer value or memory address. It's nothing that I recognize from Slurm code, but perhaps not random either.

#0  0x0000000000509175 in bit_ffs (b=0x20312d3d656d6954) at bitstring.c:483
20 31 2d 3d 65 6d 69 54
" 1-=emiT"

#0  0x00000000004d7094 in bit_ffs (b=0x34422f7a6d73536f) at bitstring.c:483
"4B/zmsSo"
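The decode is easy to verify: reading the bogus pointer values as little-endian byte strings reproduces the text quoted above (a quick Python check, not part of the ticket):

```python
# Interpret the bogus bitmap pointer values from the backtraces as
# little-endian byte strings; x86-64 stores the low byte first, so the
# "mirror image" effect is just the in-memory byte order.
for ptr in (0x20312d3d656d6954, 0x34422f7a6d73536f):
    print(ptr.to_bytes(8, "little"))
# -> b'Time=-1 '
# -> b'oSsmz/B4'
```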
Comment 30 Ryan Cox 2016-04-01 09:26:52 MDT
(In reply to Moe Jette from comment #28)

Thanks for all the work both of you have put into this.

> What might be helpful on your side is:
> 1. You can run with a DebugFlags value of "SelectType". It's going to be
> really verbose, so you might just turn that on temporarily if you get into a
> state where it's happening frequently.
 
I've enabled this with scontrol.  I'll disable it before I leave today if I think it's going to put too much in the logs.

> 2. In one of this core files, dump the full "p_ptr" data structure in the
> _build_row_bitmaps() function:
> for (i = 0; i < p_ptr->num_rows; i++) {
>   print *p_ptr->row[i]
>   for (j = 0; j < p_ptr->row[i].num_jobs; j++) {
>     print *p_ptr->row[i].job_list[j]
>   }
> }


Is this what you need?

Program terminated with signal 6, Aborted.
#0  0x00007fd4cb66df15 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64	../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
	in ../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0  0x00007fd4cb66df15 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007fd4cb670d20 in *__GI_abort () at abort.c:92
#2  0x00007fd4cb6670a1 in *__GI___assert_fail (assertion=0x63e190 "(b) != ((void *)0)", file=<value optimized out>, line=483, function=0x63e60d "bit_ffs") at assert.c:81
#3  0x0000000000509171 in bit_ffs (b=0x0) at bitstring.c:483
#4  0x00007fd4ca5dea8d in _build_row_bitmaps (p_ptr=0x7fd4ac04d450, job_ptr=0x28430c0) at select_cons_res.c:646
#5  0x00007fd4ca5e0851 in _rm_job_from_res (part_record_ptr=0x7fd4ac071f30, node_usage=0x7fd4ac3adf90, job_ptr=0x28430c0, action=0) at select_cons_res.c:1255
#6  0x00007fd4ca5e16f5 in _run_now (job_ptr=0x7fd4a45878f0, bitmap=0x7fd4ac136bc0, min_nodes=2, max_nodes=2, req_nodes=2, job_node_req=1, preemptee_candidates=0x7fd4ac21a4b0, preemptee_job_list=0x7fff9a7b8588, exc_core_bitmap=0x0) at select_cons_res.c:1603
#7  0x00007fd4ca5e2e6a in select_p_job_test (job_ptr=0x7fd4a45878f0, bitmap=0x7fd4ac136bc0, min_nodes=2, max_nodes=2, req_nodes=2, mode=0, preemptee_candidates=0x7fd4ac21a4b0, preemptee_job_list=0x7fff9a7b8588, exc_core_bitmap=0x0) at select_cons_res.c:2204
#8  0x0000000000525d3b in select_g_job_test (job_ptr=0x7fd4a45878f0, bitmap=0x7fd4ac136bc0, min_nodes=2, max_nodes=2, req_nodes=2, mode=0, preemptee_candidates=0x7fd4ac21a4b0, preemptee_job_list=0x7fff9a7b8588, exc_core_bitmap=0x0) at node_select.c:587
#9  0x0000000000492290 in _pick_best_nodes (node_set_ptr=0x7fd4ac1369c0, node_set_size=2, select_bitmap=0x7fff9a7b85a0, job_ptr=0x7fd4a45878f0, part_ptr=0x1934690, min_nodes=2, max_nodes=2, req_nodes=2, test_only=false, preemptee_candidates=0x7fd4ac21a4b0, preemptee_job_list=0x7fff9a7b8588, has_xand=false, 
    exc_core_bitmap=0x0, resv_overlap=false) at node_scheduler.c:1523
#10 0x000000000049100e in _get_req_features (node_set_ptr=0x7fd4ac1369c0, node_set_size=2, select_bitmap=0x7fff9a7b85a0, job_ptr=0x7fd4a45878f0, part_ptr=0x1934690, min_nodes=2, max_nodes=2, req_nodes=2, test_only=false, preemptee_job_list=0x7fff9a7b8588) at node_scheduler.c:1011
#11 0x000000000049327f in select_nodes (job_ptr=0x7fd4a45878f0, test_only=false, select_node_bitmap=0x0, 
    unavail_node_str=0x7fd4ac110210 "m6-17-4,m6-18-[11-12],m6-19-13,m6-20-[11,13],m6-26-[8,14],m6-28-14,m6-29-[5,8,12,14,16],m6-30-[8-10],m6-31-14,m6-32-[1,8],m6-33-10,m7-6-[7,10],m7-8-13,m7-10-5,m7-15-15,m8-1-9,m8-2-12,m8-3-[11,13],m8-4"..., err_msg=0x0) at node_scheduler.c:1938
#12 0x000000000047bd34 in _schedule (job_limit=4294967295) at job_scheduler.c:1570
#13 0x0000000000479e86 in schedule (job_limit=4294967295) at job_scheduler.c:903
#14 0x0000000000445d7c in _slurmctld_background (no_data=0x0) at controller.c:1875
#15 0x0000000000442cc7 in main (argc=1, argv=0x7fff9a7b8eb8) at controller.c:583
(gdb) frame 4
#4  0x00007fd4ca5dea8d in _build_row_bitmaps (p_ptr=0x7fd4ac04d450, job_ptr=0x28430c0) at select_cons_res.c:646
646				ss[x].jstart = bit_ffs(ss[x].tmpjobs->node_bitmap);
(gdb) print p_ptr->row[i]
$1 = {row_bitmap = 0x7fd4ac13e450, num_jobs = 2060, job_list = 0x7fd4ac0c9690, job_list_size = 2136}
(gdb) print p_ptr->row[i].job_list[j]
$2 = (struct job_resources *) 0x0
(gdb) print i
$3 = 0
(gdb) print j
$4 = 382
Comment 31 Moe Jette 2016-04-01 09:38:36 MDT
(In reply to Ryan Cox from comment #30)
> Is this what you need?
>
> (gdb) frame 4
> #4  0x00007fd4ca5dea8d in _build_row_bitmaps (p_ptr=0x7fd4ac04d450,
> job_ptr=0x28430c0) at select_cons_res.c:646
> 646				ss[x].jstart = bit_ffs(ss[x].tmpjobs->node_bitmap);
> (gdb) print p_ptr->row[i]
> $1 = {row_bitmap = 0x7fd4ac13e450, num_jobs = 2060, job_list =
> 0x7fd4ac0c9690, job_list_size = 2136}
> (gdb) print p_ptr->row[i].job_list[j]
> $2 = (struct job_resources *) 0x0
> (gdb) print i
> $3 = 0
> (gdb) print j
> $4 = 382

It would be nice to get some of the other values around here too.


Did you have a chance to run slurmctld under valgrind yet?


> #0  0x0000000000509175 in bit_ffs (b=0x20312d3d656d6954) at bitstring.c:483
> 20 31 2d 3d 65 6d 69 54
> " 1-=emiT"

In case you didn't notice, the mirror image of this string is "Time=-1 ", which isn't part of Slurm, but perhaps comes from a user script, or might be in your Lua, or ??
Comment 32 Ryan Cox 2016-04-01 10:05:36 MDT
(In reply to Moe Jette from comment #31)
> (In reply to Ryan Cox from comment #30)
> > Is this what you need?
> >
> > (gdb) frame 4
> > #4  0x00007fd4ca5dea8d in _build_row_bitmaps (p_ptr=0x7fd4ac04d450,
> > job_ptr=0x28430c0) at select_cons_res.c:646
> > 646				ss[x].jstart = bit_ffs(ss[x].tmpjobs->node_bitmap);
> > (gdb) print p_ptr->row[i]
> > $1 = {row_bitmap = 0x7fd4ac13e450, num_jobs = 2060, job_list =
> > 0x7fd4ac0c9690, job_list_size = 2136}
> > (gdb) print p_ptr->row[i].job_list[j]
> > $2 = (struct job_resources *) 0x0
> > (gdb) print i
> > $3 = 0
> > (gdb) print j
> > $4 = 382
> 
> It would be nice to get some of the other values around here too.
> 
> 
> Did you have a chance to run slurmctld under valgrind yet?
> 

Only for a few minutes so far.  sdiag says it completed two backfill cycles.  I'll attach the log.

> 
> > #0  0x0000000000509175 in bit_ffs (b=0x20312d3d656d6954) at bitstring.c:483
> > 20 31 2d 3d 65 6d 69 54
> > " 1-=emiT"
> 
> In case you didn't notice the mirror image of this string is "Time=-1 ",
> which isn't part of Slurm, but perhaps part of a user Script or might be in
> your Lua or ??

I can't find any instances of it in /etc/slurm (includes lua) or in scontrol -dd show job (as root so I see the scripts).  I also checked the state directory and couldn't find 'Time=-1' there either.  It may have been there previously in a job but isn't there anymore.
Comment 33 Ryan Cox 2016-04-01 10:06:30 MDT
Created attachment 2968 [details]
valgrind.log

Here's the valgrind output.  I'm taking off now for the weekend.
Comment 34 Moe Jette 2016-04-04 01:42:23 MDT
The valgrind output did locate a memory leak with respect to the checkpoint/blcr plugin, which I fixed here:
https://github.com/SchedMD/slurm/commit/08d520db583c20c52b14490c03b2dcb3da388960

On Saturday I made a third pass over all of the changes in Slurm v15.08.9, and over all uses of the relevant data structures (not just recent changes), without finding anything of interest beyond a comment I changed for greater clarity, committed here:
https://github.com/SchedMD/slurm/commit/3f51a7889880997f4aee30bc27872ae1f4206eca

I thought of one more idea yesterday that I want to pursue more today, but I'm grasping at straws...

I'm sorry this is taking so long. How frequently is slurmctld crashing on you?
Comment 35 Ryan Cox 2016-04-04 02:59:43 MDT
> I'm sorry this is taking so long. How frequently is slurmctld crashing on
> you?

10 times between Friday night and this morning.  Do you think I should apply the memory leak fix?  The only reason I ask is because I don't know that anyone is actually using the Slurm blcr support and recompiling might mess up the ability to look at old core dumps.  It's probably a small enough change to not affect most gdb work, however.

I did find something interesting (but possibly unrelated) from one core file.  One job was trying to preempt another.  The standby job (9930598) ended at 14:52 on Saturday but had a reason of FAILED (likely preempted).  That was the same time the core file was generated.  I can find no record of the job that initiated the preemption (9933390) in sacct.

The only entries for 9933390 in the logs were:
[2016-04-02T14:51:59.225] _slurm_rpc_submit_batch_job JobId=9933390 usec=7020
[2016-04-02T14:52:27.478] Purged files for defunct batch job 9933390

Almost immediately after it was submitted, slurmctld crashed and was restarted at 14:52:20 due to cron (there's a sleep 20 before the check and restart... long story)

From slurmctld (grep 9930598, doesn't show surrounding entries):
[2016-04-02T12:53:41.780] _slurm_rpc_submit_batch_job JobId=9930598 usec=4012
[2016-04-02T14:49:02.124] sched: Allocate JobID=9930598 NodeList=m6-18-[7,9] #CPUs=16 Partition=m6n
[2016-04-02T14:52:27.203] recovered job step 9930598.0
[2016-04-02T14:52:27.203] Recovered JobID=9930598 State=0x1 NodeCnt=0 Assoc=118
[2016-04-02T14:52:42.393] job_complete: JobID=9930598 State=0x1 NodeCnt=2 WEXITSTATUS 1
[2016-04-02T14:52:42.393] email msg to theemailaddressredacted@byryan SLURM Job_id=9930598 Name=sBE.Sc_sv,H254 Failed, Run time 00:03:40, FAILED, ExitCode 1
[2016-04-02T14:52:42.394] job_complete: JobID=9930598 State=0x8005 NodeCnt=2 done

slurmd logs from node 0 (grep 9930598, doesn't show surrounding entries from other jobs):
[2016-04-02T14:49:02.129] _run_prolog: prolog with lock for job 9930598 ran for 0 seconds
[2016-04-02T14:49:02.134] Launching batch job 9930598 for UID 6181
[2016-04-02T14:49:02.154] [9930598] checkpoint/blcr init
[2016-04-02T14:49:02.178] [9930598] task/cgroup: /slurm/uid_6181/job_9930598: alloc=4096MB mem.limit=4096MB memsw.limit=4300MB
[2016-04-02T14:49:02.178] [9930598] task/cgroup: /slurm/uid_6181/job_9930598/step_batch: alloc=4096MB mem.limit=4096MB memsw.limit=4300MB
[2016-04-02T14:49:02.222] [9930598.4294967295] checkpoint/blcr init
[2016-04-02T14:49:02.236] [9930598.4294967295] task/cgroup: /slurm/uid_6181/job_9930598: alloc=4096MB mem.limit=4096MB memsw.limit=4300MB
[2016-04-02T14:49:02.236] [9930598.4294967295] task/cgroup: /slurm/uid_6181/job_9930598/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=4300MB
[2016-04-02T14:52:36.988] [9930598.4294967295] *** STEP 9930598.4294967295 ON m6-18-7 CANCELLED AT 2016-04-02T14:52:36 ***
[2016-04-02T14:52:38.175] [9930598.4294967295] done with job
[2016-04-02T14:52:42.369] [9930598] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256
[2016-04-02T14:52:42.402] [9930598] done with job

It's very likely that another job could have preempted it once the slurmctld was alive again.
Comment 36 Ryan Cox 2016-04-04 03:02:06 MDT
Created attachment 2969 [details]
gdb session for core-2016-04-02_14:52 (preemption)

Some gdb info from the core mentioned in comment 35.
Comment 37 Ryan Cox 2016-04-04 03:06:13 MDT
Looking back, it does appear that we had a core dump on March 11 that we didn't report.  I think we were busy that day and decided to report it if it happened again.  That was on .8.
Comment 38 Moe Jette 2016-04-04 03:11:56 MDT
I would not worry about the memory leak patch. I doubt that you would notice any difference.

I'm going to widen the search for changes back to version 15.08.7, based upon the hypothesis that the problem was introduced earlier but is happening far more frequently now for unknown reasons.
Comment 39 Ryan Cox 2016-04-04 03:17:48 MDT
(In reply to Moe Jette from comment #38)
> I would not worry about the memory leak patch. I doubt that you would notice
> any difference.
> 
> I'm going to widen the search for changes back to version 15.08.7 based upon
> the hypothesis the problem was introduce earlier, but is happening far more
> frequently now due to unknown reasons.

Sounds good.  

I briefly looked at all core dumps and who submitted the jobs in question.

By far the most common style of problem from the core dumps was:
#0  0x0000000000509175 in bit_ffs (b=0x62a) at bitstring.c:483
#1  0x00007ff401811a8d in _build_row_bitmaps (p_ptr=0x24881f0, job_ptr=0x2b69e60) at select_cons_res.c:646
#2  0x00007ff401813851 in _rm_job_from_res (part_record_ptr=0x2488230, node_usage=0x24991e0, job_ptr=0x2b69e60, action=0) at select_cons_res.c:1255
#3  0x00007ff4018160e6 in select_p_job_fini (job_ptr=0x2b69e60) at select_cons_res.c:2296
#4  0x0000000000525fda in select_g_job_fini (job_ptr=0x2b69e60) at node_select.c:691
#5  0x000000000048faf9 in deallocate_nodes (job_ptr=0x2b69e60, timeout=false, suspended=false, preempted=false) at node_scheduler.c:484
#6  0x000000000045b9bc in job_complete (job_id=9942351, uid=0, requeue=false, node_fail=false, job_return_code=0) at job_mgr.c:4961
#7  0x00000000004a804a in _slurm_rpc_complete_batch_script (msg=0x1af9ee0, locked=false) at proc_req.c:2219
#8  0x00000000004a284c in slurmctld_req (msg=0x1af9ee0, arg=0x7ff3d8221260) at proc_req.c:346
#9  0x0000000000443d0b in _service_connection (arg=0x7ff3d8221260) at controller.c:1096
#10 0x00007ff402be18ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#11 0x00007ff402947abd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#12 0x0000000000000000 in ?? ()


There is no pattern between submitters, though the ones with many short jobs are more likely to show up in the core dumps (no surprise there).  Some jobs are preemptable, some aren't.
Comment 40 Ryan Cox 2016-04-04 04:48:54 MDT
Would it help to have access to one of the core files and a way to run gdb on it?  I can give you a CDE tarball (http://www.pgbovine.net/cde/manual/basic-usage.html) that you can run directly.  If you're not familiar with it, it tracks files and libraries that a binary opens during the course of a run.  I used it when I ran gdb on one of the core files.  It's about 22M when compressed.  I tested it on a different system and it seemed to work well, just like running it from the system running the ctld.
Comment 41 Moe Jette 2016-04-04 05:01:05 MDT
(In reply to Ryan Cox from comment #40)
> Would it help to have access to one of the core files and a way to run gdb
> on it?  I can give you a CDE tarball
> (http://www.pgbovine.net/cde/manual/basic-usage.html) that you can run
> directly.  If you're not familiar with it, it tracks files and libraries
> that a binary opens during the course of a run.  I used it when I ran gdb on
> one of the core files.  It's about 22M when compressed.  I tested it on a
> different system and it seemed to work well, just like running it from the
> system running the ctld.

I've been reviewing the changes back to version 15.08.7 and haven't found anything suspect. I'm making a second pass now.

Yes, a CDE tarball might be helpful. I don't have any better ideas. Given the size, perhaps you can email it to me directly rather than attaching it.
Comment 42 Ryan Cox 2016-04-04 09:35:38 MDT
Since many of these die due to a failed assertion, are there any debugging statements we can add prior to the assertion?  Basically test the same thing as the assertion somewhere prior to assert, print out lots of debugging information, then allow assert to happen as normal?
Comment 43 Moe Jette 2016-04-04 09:41:31 MDT
(In reply to Ryan Cox from comment #42)
> Since many of these die due to a failed assertion, are there any debugging
> statements we can add prior to the assertion?  Basically test the same thing
> as the assertion somewhere prior to assert, print out lots of debugging
> information, then allow assert to happen as normal?

That's actually what I'm working on now. I'll send you a patch before the end of today.
Comment 44 Moe Jette 2016-04-04 10:49:41 MDT
Created attachment 2972 [details]
additional debugging

This patch adds a magic cookie to the data structure that I am mostly concerned with and tests for that magic cookie pretty much whenever the data structure is touched.

Of particular note, it checks for that magic cookie right before trying to use the data structure where you keep getting the core dumps.

I also added a check right before the job data structure is purged to look for any vestigial records for the job in the select plugin.

You didn't perchance start making changes related to partitions?
Changing the partition associated with a running job for example could cause this problem if the select plugin isn't modified to handle that.
Comment 45 Ryan Cox 2016-04-04 11:39:42 MDT
(In reply to Moe Jette from comment #44)
> Created attachment 2972 [details]
> additional debugging
> 
> This patch adds a magic cookie to the data structure that I am mostly
> concerned with and tests for that magic cookie pretty much whenever the data
> structure is touched.
> 
> Of particular note, it checks for that magic cookie right before trying to
> use the data structure where you keep getting the core dumps.
> 
> I also added a check right before the job data structure is purged to look
> for any vestigial records for the job in the select plugin.

I just deployed this so hopefully it will give us some good information when I get to work tomorrow.

> You didn't perchance start making changes related to partitions?
> Changing the partition associated with a running job for example could cause
> this problem if the select plugin isn't modified to handle that.

We split a small partition into two on March 23.  All slurm daemons of every kind have been restarted since then.  That was after the March 11 segfault and it was a really long time before that (mid-2015) that we made a change to partitions.
Comment 46 Moe Jette 2016-04-05 02:18:07 MDT
Please upload the slurmctld log file when you have a chance.
Comment 47 Ryan Cox 2016-04-05 02:32:31 MDT
Oddly enough we haven't had another crash since I installed the patch yesterday.  I would have expected at least two crashes by now.  Would you still like the log file or should I wait until there's a crash?
Comment 48 Moe Jette 2016-04-05 02:42:45 MDT
(In reply to Ryan Cox from comment #47)
> Oddly enough we haven't had another crash since I installed the patch
> yesterday.  I would have expected at least two crashes by now.  Would you
> still like the log file or should I wait until there's a crash?

Yes please. The changes in the patch are intended not only to produce logs that should help me identify the root problem, but also to prevent the crash in most cases.
Comment 49 Ryan Cox 2016-04-05 03:37:41 MDT
Created attachment 2974 [details]
slurmctld.log.bz2 after debugging patch

I can send you more of the logs from yesterday but they're 80M compressed (I ran with SelectType debugging for a while yesterday but it was going to fill the file system so I stopped it).  Hopefully this is enough of the logs for you.  If not I can try to trim yesterday's logs or send them to you another way.
Comment 50 Moe Jette 2016-04-05 04:05:01 MDT
I think that I should have sufficient information here. Thanks!
Comment 51 Ryan Cox 2016-04-05 04:32:30 MDT
Great.  Let me know if you want us to add another patch or anything.
Comment 52 Moe Jette 2016-04-05 04:54:49 MDT
Here's what I know now.

When a job is started, the select/cons_res plugin creates an entry in a data structure for it. In this data structure there is a record of the job allocation and a partition pointer (copied from the job data structure).

Later, when the job ends, the select/cons_res plugin tries to remove the job record. At that later time the job's partition pointer is valid, but is DIFFERENT from the value when the job started. This mis-match prevents the job record in select/cons_res from being cleared. Later, when the job is purged from slurmctld and the select/cons_res data structure is manipulated, there are bad pointers causing the abort (they pointed into the job record, which has since been purged).

My patch prevents these aborts and logs the events, but I have been unable to identify any mechanism that changes the partition pointer of a running job, other than an "scontrol reconfig", which synchronizes the changes with the select/cons_res plugin.

There are many such job records in the log, each appearing as a PAIR of back-to-back messages like these:
[2016-04-05T08:25:47.762] error: Found vestigial select record for job 9949087
[2016-04-05T08:25:47.762] error: Partition pointer mis-match

[2016-04-05T08:26:47.198] error: Found vestigial select record for job 9949060
[2016-04-05T08:26:47.198] error: Partition pointer mis-match
Comment 53 Ryan Cox 2016-04-05 05:07:04 MDT
That's really interesting. I've been trying to look for patterns with the jobs that cause the aborts but haven't found anything remotely resembling a pattern. I tested several theories but none panned out.

I skimmed the patch but didn't look too closely. So the patch should prevent aborts in addition to doing the logging?
Comment 54 Moe Jette 2016-04-05 05:21:47 MDT
(In reply to Ryan Cox from comment #53)
> I skimmed the patch but didn't look too closely. So the patch should prevent
> aborts in addition to doing the logging?

Yes
Comment 55 Moe Jette 2016-04-05 05:32:13 MDT
(In reply to Ryan Cox from comment #53)
> That's really interesting. I've been trying to look for patterns with the
> jobs that cause the aborts but haven't found anything remotely resembling a
> pattern. I tested several theories but none panned out.

Here are some of the things that I suspect are factors:
"Shared=yes" in the partitions
Jobs being submitted to multiple partitions
There may very well be a timing component too
Comment 56 Moe Jette 2016-04-05 07:36:04 MDT
Fixed backfill scheduler race condition that could cause invalid pointer in
    select/cons_res plugin. Bug introduced in 15.08.9, commit:
    efd9d35

The scenario is as follows:
1. Backfill scheduler is running, then releases locks
2. Main scheduling loop starts a job "A"
3. Backfill scheduler resumes, finds job "A" in its queue and
   resets its partition pointer.
4. Job "A" completes and tries to remove resource allocation record
   from select/cons_res data structure, but fails to find it because
   it is looking in the table for the wrong partition.
5. Job "A" record gets purged from slurmctld
6. Select/cons_res plugin attempts to operate on resource allocation
   data structure, finds pointer into the now purged data structure
   of job "A" and aborts or gets SEGV

Commit here:
https://github.com/SchedMD/slurm/commit/d8b18ff8dc6f2237db3729f4ab085b003e54c89e

We'll likely tag version 15.08.10 this week with the fix.
Comment 57 Ryan Cox 2016-04-05 09:40:55 MDT
Excellent.  Thanks for all the hard work in getting this fixed.  I went ahead and just reinstalled with everything up through that commit since it all looked pretty safe.