Ticket 5276

Summary: Slurmctld crash on no job_resrcs
Product: Slurm
Reporter: Paul Edmon <pedmon>
Component: slurmctld
Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
CC: alex, felip.moll
Version: 17.11.7
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5438
Site: Harvard University
Version Fixed: 17.11.9 18.08.0pre2
Attachments: Fuller GDB Log
Log from the first crash
Current slurm.conf
bypass crash for missing job_resrsc struct
Current slurm.spec
bypass crash for missing job_resrsc struct
slurmctl log for job 45416412
sacct info for jobs with missing structs
Prevent job_resrcs from being overwritten for multi-partition job submissions

Description Paul Edmon 2018-06-07 10:20:18 MDT
Our slurmctld started crashing about 10 minutes ago with the following error:

slurmctld: debug3: cons_res: adding job 45416606 to part holymeissner row 0
slurmctld: debug3: cons_res: _add_job_to_res: job 45416608 act 0 
slurmctld: debug3: cons_res: adding job 45416608 to part holymeissner row 0
slurmctld: debug3: cons_res: _add_job_to_res: job 45416609 act 0 
slurmctld: debug3: cons_res: adding job 45416609 to part holymeissner row 0
slurmctld: error: _add_job_to_res: job 45416611 has no job_resrcs info
slurmctld: Warning: Note very large processing time from read_slurm_conf: usec=53732228 began=12:09:55.868
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
slurmctld: debug:  backfill: beginning
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: Running as primary controller
slurmctld: debug:  No BackupController, not launching heartbeat.
slurmctld: Registering slurmctld at port 6820 with slurmdbd.
slurmctld: debug3: Tree sending to holy2a18107
slurmctld: debug4: orig_timeout was 100000 we have 0 steps and a timeout of 100000
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: prolog_slurmctld job 45416611 prolog completed
slurmctld: debug:  backfill: 12288 jobs to backfill
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
slurmctld: debug2: node_did_resp holy2a18107
slurmctld: prolog_running_decr: Configuration for JobID=45416611 is complete
slurmctld: Extending job 45416611 time limit by 350 secs for configuration
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
Segmentation fault (core dumped)

I tried tracing back 45416611 to see if I could remove it, but no luck: scancel is not quick enough, and there is no job record in the spool/hash.1 to delete.  The scheduler is dead in the water.  I will attach the gdb dump of the core file in a bit.
Comment 1 Paul Edmon 2018-06-07 10:25:05 MDT
[root@holy-slurm02 spool]# gdb /usr/sbin/slurmctld core.138997
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 139033]
[New LWP 138998]
[New LWP 139000]
[New LWP 138997]
[New LWP 139003]
[New LWP 139036]
[New LWP 139037]
[New LWP 139040]
[New LWP 138999]
[New LWP 139041]
[New LWP 139002]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000471ae5 in launch_prolog (job_ptr=job_ptr@entry=0x785e720) at node_scheduler.c:2853
2853	node_scheduler.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-17.11.7-1fasrc01.el7.centos.x86_64
(gdb) bt full
#0  0x0000000000471ae5 in launch_prolog (job_ptr=job_ptr@entry=0x785e720) at node_scheduler.c:2853
        prolog_msg_ptr = 0x7f6fb0000e70
        agent_arg_ptr = <optimized out>
        job_resrcs_ptr = 0x0
        cred_arg = {jobid = 45416611, stepid = 4294967295, uid = 11608, gid = 403048, user_name = 0x0, ngids = 0, gids = 0x0, cores_per_socket = 0x0, sockets_per_node = 0x0, 
          sock_core_rep_count = 0x0, job_constraints = 0x0, job_core_bitmap = 0x0, job_core_spec = 65534, job_hostlist = 0x0, job_mem_limit = 0, job_nhosts = 0, 
          job_gres_list = 0x0, x11 = 0, step_core_bitmap = 0x0, step_hostlist = 0x0, step_mem_limit = 0, step_gres_list = 0x0}
        i = <optimized out>
        __func__ = "launch_prolog"
#1  0x000000000043ff48 in job_config_fini (job_ptr=job_ptr@entry=0x785e720) at job_mgr.c:8492
        now = 1528388077
#2  0x000000000046084b in prolog_running_decr (job_ptr=job_ptr@entry=0x785e720) at job_scheduler.c:4406
        job_id_buf = "JobID=45416611\000g\033\037\002\000\000\000\000\000\376\377\000\000\000\000\000\000 \347\205\a\000\000\000\000\300\f\000\260o\177\000\000\243\000\265\002\000\000\000\000\315\300l\301o\177\000\000\b\000\000\000\060\000\000\000`\236*\273o\177\000\000\240\235*\273o\177\000\000\000R0\310\324.\037g\000\000\000\000\000\000\000\000\243\000\265\002", '\000' <repeats 12 times>, "p\377\377\377\377\377\377\377\000\000\000\000\000\000\000\000\001", '\000' <repeats 31 times>, "\214\002\"\301o\177\000\000\060\370\347\003\000\000\000\000P\021\301\301o\177\000\000\227\001\000\000\000\000\000\000"...
        __func__ = "prolog_running_decr"
#3  0x0000000000464645 in _run_prolog (arg=0x785e720) at job_scheduler.c:4358
        job_ptr = 0x785e720
        node_ptr = <optimized out>
        job_id = 45416611
        cpid = <optimized out>
        i = <optimized out>
        rc = <optimized out>
        status = 0
        wait_rc = <optimized out>
        argv = {0x0, 0x0}
        my_env = 0x0
        config_read_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = READ_LOCK}
        node_bitmap = 0x7f6fb0000cc0
        now = 1528388076
        resume_timeout = <optimized out>
        tm = 60112
        __func__ = "_run_prolog"
#4  0x00007f6fc1218e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x00007f6fc0f42bad in clone () from /lib64/libc.so.6
No symbol table info available.
Comment 2 Paul Edmon 2018-06-07 10:30:12 MDT
Created attachment 7027 [details]
Fuller GDB Log

Added a trace of all the threads.
Comment 3 Tim Wickberg 2018-06-07 10:30:21 MDT
Logs from the first crash would be nice to have - something's definitely gone wrong with this job dispatch, and I'll get you a patch in a couple of minutes to bypass this issue for the moment, but we'd obviously like to nail down how this job got into such a state.
Comment 4 Paul Edmon 2018-06-07 10:31:01 MDT
Sure.  Give me a sec.

-Paul Edmon-


Comment 5 Paul Edmon 2018-06-07 10:36:18 MDT
This is the first crash of the day:

[root@holy-slurm02 spool]# gdb /usr/sbin/slurmctld core.69971
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 124204]
[New LWP 124218]
[New LWP 69974]
[New LWP 124224]
[New LWP 124225]
[New LWP 69978]
[New LWP 70030]
[New LWP 70036]
[New LWP 69976]
[New LWP 124186]
[New LWP 70032]
[New LWP 69971]
[New LWP 69982]
[New LWP 70029]
[New LWP 69975]
[New LWP 70035]
[New LWP 70033]
[New LWP 70038]
[New LWP 70034]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000471ae5 in launch_prolog 
(job_ptr=job_ptr@entry=0x7f83507d57e0) at node_scheduler.c:2853
2853    node_scheduler.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install 
slurm-slurmctld-17.11.7-1fasrc01.el7.centos.x86_64
(gdb) bt full
#0  0x0000000000471ae5 in launch_prolog 
(job_ptr=job_ptr@entry=0x7f83507d57e0) at node_scheduler.c:2853
         prolog_msg_ptr = 0x7f84e8008580
         agent_arg_ptr = <optimized out>
         job_resrcs_ptr = 0x0
         cred_arg = {jobid = 45416412, stepid = 4294967295, uid = 11608, 
gid = 403048, user_name = 0x0, ngids = 0, gids = 0x0, cores_per_socket = 
0x0, sockets_per_node = 0x0,
           sock_core_rep_count = 0x0, job_constraints = 0x0, 
job_core_bitmap = 0x0, job_core_spec = 65534, job_hostlist = 0x0, 
job_mem_limit = 0, job_nhosts = 0,
           job_gres_list = 0x0, x11 = 0, step_core_bitmap = 0x0, 
step_hostlist = 0x0, step_mem_limit = 0, step_gres_list = 0x0}
         i = <optimized out>
         __func__ = "launch_prolog"
#1  0x000000000043ff48 in job_config_fini 
(job_ptr=job_ptr@entry=0x7f83507d57e0) at job_mgr.c:8492
         now = 1528387082
#2  0x000000000046084b in prolog_running_decr 
(job_ptr=job_ptr@entry=0x7f83507d57e0) at job_scheduler.c:4406
         job_id_buf = 
"JobID=45416412\000\000㮶f\207\177\000\000\376\377\000\000\000\000\000\000\340W}P\203\177\000\000\000\000\000\000\000\000\000\000`\035n", 
'\000' <repeats 13 times>, 
"\376\256\266f\207\177\000\000`\276\317s\206\177\000\000\240\275\317s\206\177\000\000\000$f\267\357\275$\302\000\020\020\000\000\000\000\000\334\377\264\002", 
'\000' <repeats 12 times>, 
"p\377\377\377\377\377\377\377\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\350εf\207\177\000\000%\272\266f\207\177\000\000\000\000\000\000\214\020\002\000 
\035n\000\000\000\000\000`\035n\000\000\000\000\000K\003\017\000\000\000\000\000\060\370\347\003\000\000\000\000\346mF"...
         __func__ = "prolog_running_decr"
#3  0x0000000000464645 in _run_prolog (arg=0x7f83507d57e0) at 
job_scheduler.c:4358
         job_ptr = 0x7f83507d57e0
         node_ptr = <optimized out>
         job_id = 45416412
         cpid = <optimized out>
         i = <optimized out>
         rc = <optimized out>
         status = 0
         wait_rc = <optimized out>
         argv = {0x0, 0x0}
         my_env = 0x0
         config_read_lock = {config = READ_LOCK, job = WRITE_LOCK, node 
= WRITE_LOCK, partition = NO_LOCK, federation = READ_LOCK}
         node_bitmap = 0x0
         now = 1528387082
         resume_timeout = <optimized out>
         tm = 6992
         __func__ = "_run_prolog"
#4  0x00007f8766b67e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x00007f8766891bad in clone () from /lib64/libc.so.6
No symbol table info available.


Comment 6 Tim Wickberg 2018-06-07 10:37:28 MDT
Can you grab "p *job_ptr" from the first crash, thread 1 frame 1?
Comment 7 Paul Edmon 2018-06-07 10:37:58 MDT
Created attachment 7028 [details]
Log from the first crash
Comment 8 Paul Edmon 2018-06-07 10:40:58 MDT
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f8673cfc700 (LWP 124204))]
#3  0x0000000000464645 in _run_prolog (arg=0x7f83507d57e0) at 
job_scheduler.c:4358
4358    job_scheduler.c: No such file or directory.
(gdb) frame 1
#1  0x000000000043ff48 in job_config_fini 
(job_ptr=job_ptr@entry=0x7f83507d57e0) at job_mgr.c:8492
8492    job_mgr.c: No such file or directory.
(gdb) p *job_ptr
$2 = {account = 0x7f8350fc7bc0 "conroy_lab", admin_comment = 0x0, 
alias_list = 0x0, alloc_node = 0x7f8350fccd60 "rclogin15", 
alloc_resp_port = 8920, alloc_sid = 8991,
   array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, 
assoc_id = 7965, assoc_ptr = 0x1b84a90, batch_flag = 0, batch_host = 
0x7f83507d1950 "holy2a14207",
   billable_tres = 47.625, bit_flags = 0, burst_buffer = 0x0, 
burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time 
= 0, clusters = 0x0, comment = 0x0,
   cpu_cnt = 32, cr_enabled = 1, db_index = 0, deadline = 0, delay_boot 
= 0, derived_ec = 0, details = 0x7f8350001b50, direct_set_prio = 0, 
end_time = 1528415882,
   end_time_exp = 1528415882, epilog_running = false, exit_code = 0, 
fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, 
gres_list = 0x0, gres_alloc = 0x7f8350fcf240 "",
   gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7f8350fd3ed0 
"", gres_used = 0x0, group_id = 403048, job_id = 45416412, job_next = 
0x0, job_array_next_j = 0x0,
   job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, 
kill_on_node_fail = 1, last_sched_eval = 1528387082, licenses = 0x0, 
license_list = 0x0, limit_set = {qos = 0, time = 0,
     tres = 0x7f83507f06c0}, mail_type = 0, mail_user = 0x0, magic = 
4038539564, mcs_label = 0x0, name = 0x7f8350000de0 "bash", network = 
0x0, next_step_id = 0, ngids = 0,
   nodes = 0x7f8350154bd0 "holy2a14207", node_addr = 0x7f8350154ba0, 
node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 1, 
nodes_completing = 0x0,
   origin_cluster = 0x7f835053bb50 "odyssey", other_port = 8919, 
pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, 
pack_job_list = 0x0,
   partition = 0x7f83500785b0 "test,conroy-intel,conroy,itc_cluster", 
part_ptr_list = 0x8aa93c0, part_nodes_missing = false, part_ptr = 
0x1feb0c0, power_flags = 0 '\000',
   pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, 
priority = 678396, priority_array = 0x7f835047ba70, prio_factors = 
0x7f8350fd3320, profile = 0, qos_id = 1,
   qos_ptr = 0x1aadc70, qos_blocking_ptr = 0x0, reboot = 0 '\000', 
restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr 
= 0x0, requid = 4294967295,
   resp_host = 0x7f8350fc7bf0 "10.242.104.144", sched_nodes = 0x0, 
select_jobinfo = 0x7f8350fc7c50, spank_job_env = 0x7f8350402750, 
spank_job_env_size = 1,
   start_protocol_ver = 8192, start_time = 1528387082, state_desc = 0x0, 
state_reason = 35, state_reason_prev = 0, step_list = 0x5493fd0, 
suspend_time = 0,
   time_last_active = 1528387082, time_limit = 480, time_min = 0, 
tot_sus_time = 0, total_cpus = 32, total_nodes = 1, tres_req_cnt = 
0x7f835040fa40,
   tres_req_str = 0x7f8350001aa0 "1=32,2=128000,4=1", tres_fmt_req_str = 
0x7f83507d5b90 "cpu=32,mem=125G,node=1", tres_alloc_cnt = 0x7f8350fd32e0,
   tres_alloc_str = 0x7f8350f9dd10 
"1=32,2=128000,3=18446744073709551614,4=1,5=47", tres_fmt_alloc_str = 
0x7f83503f46e0 "cpu=32,mem=125G,node=1,billing=47", user_id = 11608,
   user_name = 0x7f84e838c720 "rpn3", wait_all_nodes = 1, warn_flags = 
0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, 
wait4switch = 0, best_switch = true,
   wait4switch_start = 0}


Comment 9 Paul Edmon 2018-06-07 10:42:54 MDT
This is the log from slurmctld just before:

[2018-06-07T11:58:00.023] _slurm_rpc_submit_batch_job: JobId=45416411 
InitPrio=5239455 usec=11238
[2018-06-07T11:58:00.883] sched: Allocate JobID=45416407 
NodeList=holygpu15 #CPUs=2 Partition=gpgpu_requeue
[2018-06-07T11:58:01.573] prolog_running_decr: Configuration for 
JobID=45416407 is complete
[2018-06-07T11:58:01.573] Extending job 45416407 time limit by 1 secs 
for configuration
[2018-06-07T11:58:02.223] error: find_preemptable_jobs: job 45416412 not 
pending
[2018-06-07T11:58:02.224] error: find_preemptable_jobs: job 45416412 not 
pending
[2018-06-07T11:58:02.224] error: find_preemptable_jobs: job 45416412 not 
pending
[2018-06-07T11:58:02.224] sched: _slurm_rpc_allocate_resources 
JobId=45416412 NodeList=holy2a14207 usec=2044571
[2018-06-07T11:58:02.224] _job_complete: JobID=45412944 State=0x1 
NodeCnt=1 WEXITSTATUS 0
[2018-06-07T11:58:02.225] _job_complete: JobID=45412944 State=0x8003 
NodeCnt=1 done
[2018-06-07T11:58:02.246] _slurm_rpc_submit_batch_job: JobId=45416413 
InitPrio=4144998 usec=21429
[2018-06-07T11:58:02.372] error: _will_run_test: Job 45416412 has NULL 
node_bitmap
[2018-06-07T11:58:02.393] error: _will_run_test: Job 45416412 has NULL 
node_bitmap
[2018-06-07T11:58:02.419] _slurm_rpc_submit_batch_job: JobId=45416414 
InitPrio=5239455 usec=15508
[2018-06-07T11:58:02.422] _job_complete: JobID=45098977 State=0x1 
NodeCnt=1 WEXITSTATUS 0
[2018-06-07T11:58:02.422] _job_complete: JobID=45098977 State=0x8003 
NodeCnt=1 done
[2018-06-07T11:58:02.434] _job_complete: JobID=45380117_2599(45383309) 
State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-06-07T11:58:02.435] _job_complete: JobID=45380117_2599(45383309) 
State=0x8003 NodeCnt=1 done
[2018-06-07T11:58:02.781] _slurm_rpc_submit_batch_job: JobId=45416415 
InitPrio=5239455 usec=14985
[2018-06-07T11:58:02.781] prolog_running_decr: Configuration for 
JobID=45416412 is complete
Jun  7 11:58:02 holy-slurm02 kernel: [1352031.736258] srvcn[124204]: 
segfault at 50 ip 0000000000471ae5 sp 00007f8673cfbc40 error 4 in 
slurmctld[400000+df000]
Jun  7 11:58:02 holy-slurm02 abrt-hook-ccpp[124226]: Process 69971 
(slurmctld) of user 57812 killed by SIGSEGV - dumping core
Jun  7 11:58:40 holy-slurm02 systemd[1]: Stopping Slurm controller daemon...

-Paul Edmon-


Comment 10 Paul Edmon 2018-06-07 10:43:55 MDT
For reference, after the first crash I tried restarting the scheduler.  
Then I tried canceling 45416412, which did succeed. However, it then 
locked up on the job that the latest crash is on.

-Paul Edmon-


Comment 11 Paul Edmon 2018-06-07 10:44:32 MDT
I also noticed that both of those jobs purportedly were going to the 
same node, so I shot that node as well just in case it was the node being 
in a bad state that caused the crash.

-Paul Edmon-


Comment 12 Paul Edmon 2018-06-07 10:48:03 MDT
Created attachment 7029 [details]
Current slurm.conf
Comment 13 Tim Wickberg 2018-06-07 10:53:26 MDT
Created attachment 7030 [details]
bypass crash for missing job_resrsc struct

I'm attaching a patch that should bypass this specific crash, although I can't promise you won't see other issues somewhere else.

We'll need to keep looking into how you got into this state - if you can set the core files aside (along with a copy of the current binaries), we may ask for some additional details from those once we have a better theory as to what happened here.

Just to check - are you running any custom patches on top of 17.11.7, or is this stock currently?
Comment 14 Paul Edmon 2018-06-07 10:55:25 MDT
Thanks.  This is stock.  I will attach our spec file so you can see.

-Paul Edmon-


Comment 15 Paul Edmon 2018-06-07 10:56:00 MDT
Created attachment 7031 [details]
Current slurm.spec
Comment 17 Tim Wickberg 2018-06-07 11:00:59 MDT
Created attachment 7032 [details]
bypass crash for missing job_resrsc struct

A slightly revised version of the patch is attached here; sorry for any confusion. This adds the 'return;' I'd missed in the first version, which would otherwise just crash as before.
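
For readers following along without the attachment, the workaround essentially amounts to a NULL check on job_ptr->job_resrcs before launch_prolog() dereferences it, followed by an early return. The code below is only a minimal, self-contained sketch under that assumption, with drastically simplified stand-in types (job_record_t and job_resources_t here are not the real slurmctld structures, and the error text is illustrative); the actual attached patch is the authoritative change.

#include <stdio.h>

/* Simplified stand-ins for slurmctld's job_record / job_resources
 * structures; the real ones carry many more fields. */
typedef struct {
	char *nodes;
	unsigned int nhosts;
} job_resources_t;

typedef struct {
	unsigned int job_id;
	job_resources_t *job_resrcs;	/* NULL for the broken jobs in this ticket */
} job_record_t;

/* Sketch of the guard near the top of launch_prolog() (node_scheduler.c):
 * log the problem and return instead of dereferencing job_ptr->job_resrcs,
 * which is what segfaulted at lines 2853/2878 in the backtraces above. */
static void launch_prolog(job_record_t *job_ptr)
{
	if (!job_ptr->job_resrcs) {
		fprintf(stderr, "error: launch_prolog: job %u has no job_resrcs info\n",
			job_ptr->job_id);
		return;	/* the 'return;' that was missing from the first patch version */
	}
	printf("job %u: launching prolog on %s (%u hosts)\n",
	       job_ptr->job_id, job_ptr->job_resrcs->nodes,
	       job_ptr->job_resrcs->nhosts);
}

int main(void)
{
	job_record_t broken = { .job_id = 45416611, .job_resrcs = NULL };
	launch_prolog(&broken);	/* previously a SIGSEGV; now logged and skipped */
	return 0;
}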
Comment 19 Paul Edmon 2018-06-07 11:18:29 MDT
That patch worked.  The scheduler is back in action.  I'll keep you posted if anything changes.
Comment 20 Tim Wickberg 2018-06-07 12:37:06 MDT
Handing this off to Marshall to look into further, and dropping the severity now that you're running safely for now.

Of course, please let us know if you notice any new issues, but assuming that was the only afflicted job, it should be gone by now.

- Tim
Comment 21 Felip Moll 2018-06-07 12:37:21 MDT
Hi Paul,

Would you mind also showing us:

"p *job_ptr" from the first crash, thread 1 frame 3

Thanks
Comment 22 Paul Edmon 2018-06-07 12:57:01 MDT
Sure.

[root@holy-slurm02 spool]# gdb /usr/sbin/slurmctld core.69971
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.

warning: exec file is newer than core file.
[New LWP 124204]
[New LWP 124218]
[New LWP 69974]
[New LWP 124224]
[New LWP 124225]
[New LWP 69978]
[New LWP 70030]
[New LWP 70036]
[New LWP 69976]
[New LWP 124186]
[New LWP 70032]
[New LWP 69971]
[New LWP 69982]
[New LWP 70029]
[New LWP 69975]
[New LWP 70035]
[New LWP 70033]
[New LWP 70038]
[New LWP 70034]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878
2878        cred_arg.job_hostlist    = job_ptr->job_resrcs->nodes;
Missing separate debuginfos, use: debuginfo-install 
slurm-slurmctld-17.11.7-1fasrc02.el7.x86_64
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f8673cfc700 (LWP 124204))]
#0  launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878
2878        cred_arg.job_hostlist    = job_ptr->job_resrcs->nodes;
(gdb) frame 3
#3  0x0000000000000000 in ?? ()
(gdb) p *job_ptr
No symbol "job_ptr" in current context.

-Paul Edmon-


Comment 23 Marshall Garey 2018-06-07 13:45:07 MDT
That's weird, it didn't even find a frame 3 that time. What's the backtrace of thread 1?

I'm interested in "p *job_ptr" from frame 3 of the backtrace in comment 5 - if you can get that, it would be very helpful.

Also, as Tim mentioned, a slurmctld log file from the time of the crash would also be very helpful.

Thanks for your help on this.

- Marshall
Comment 24 Paul Edmon 2018-06-07 13:53:15 MDT
First crash, thread 1

[root@holy-slurm02 spool]# gdb /usr/sbin/slurmctld core.69971
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.

warning: exec file is newer than core file.
[New LWP 124204]
[New LWP 124218]
[New LWP 69974]
[New LWP 124224]
[New LWP 124225]
[New LWP 69978]
[New LWP 70030]
[New LWP 70036]
[New LWP 69976]
[New LWP 124186]
[New LWP 70032]
[New LWP 69971]
[New LWP 69982]
[New LWP 70029]
[New LWP 69975]
[New LWP 70035]
[New LWP 70033]
[New LWP 70038]
[New LWP 70034]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878
2878        cred_arg.job_hostlist    = job_ptr->job_resrcs->nodes;
Missing separate debuginfos, use: debuginfo-install 
slurm-slurmctld-17.11.7-1fasrc02.el7.x86_64
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f8673cfc700 (LWP 124204))]
#0  launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878
2878        cred_arg.job_hostlist    = job_ptr->job_resrcs->nodes;
(gdb) bt full
#0  launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878
         prolog_msg_ptr = 0x7f84e8008580
         agent_arg_ptr = <optimized out>
         job_resrcs_ptr = 0x0
         cred_arg = {jobid = 45416412, stepid = 4294967295, uid = 11608, 
gid = 403048, user_name = 0x0, ngids = 0, gids = 0x0, cores_per_socket = 
0x0, sockets_per_node = 0x0,
           sock_core_rep_count = 0x0, job_constraints = 0x0, 
job_core_bitmap = 0x0, job_core_spec = 65534, job_hostlist = 0x0, 
job_mem_limit = 0, job_nhosts = 0,
           job_gres_list = 0x0, x11 = 0, step_core_bitmap = 0x0, 
step_hostlist = 0x0, step_mem_limit = 0, step_gres_list = 0x0}
         i = <optimized out>
         __func__ = "launch_prolog"
#1  0x000000000043ff48 in pack_spec_jobs (buffer_ptr=0x2d58, 
buffer_size=0x7f83500127b0, job_ids=0x0, show_flags=0, uid=3, 
filter_uid=61, protocol_version=0) at job_mgr.c:9524
         jobs_packed = 0
         tmp_offset = <optimized out>
         pack_info = {buffer = 0x35343d4449626f4a, filter_uid = 
875966772, jobs_packed = 0x7f8766b6aee3 <_L_unlock_569+15>, 
protocol_version = 65534, show_flags = 0, uid = 0}
         buffer = <optimized out>
#2  0x0000000002b4ffdc in ?? ()
No symbol table info available.
#3  0x0000000000000000 in ?? ()
No symbol table info available.

frame 3

[root@holy-slurm02 spool]# gdb /usr/sbin/slurmctld core.69971
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.

warning: exec file is newer than core file.
[New LWP 124204]
[New LWP 124218]
[New LWP 69974]
[New LWP 124224]
[New LWP 124225]
[New LWP 69978]
[New LWP 70030]
[New LWP 70036]
[New LWP 69976]
[New LWP 124186]
[New LWP 70032]
[New LWP 69971]
[New LWP 69982]
[New LWP 70029]
[New LWP 69975]
[New LWP 70035]
[New LWP 70033]
[New LWP 70038]
[New LWP 70034]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878
2878        cred_arg.job_hostlist    = job_ptr->job_resrcs->nodes;
Missing separate debuginfos, use: debuginfo-install 
slurm-slurmctld-17.11.7-1fasrc02.el7.x86_64
(gdb) frame 3
#3  0x0000000000000000 in ?? ()
(gdb) bt full
#0  launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878
         prolog_msg_ptr = 0x7f84e8008580
         agent_arg_ptr = <optimized out>
         job_resrcs_ptr = 0x0
         cred_arg = {jobid = 45416412, stepid = 4294967295, uid = 11608, 
gid = 403048, user_name = 0x0, ngids = 0, gids = 0x0, cores_per_socket = 
0x0, sockets_per_node = 0x0,
           sock_core_rep_count = 0x0, job_constraints = 0x0, 
job_core_bitmap = 0x0, job_core_spec = 65534, job_hostlist = 0x0, 
job_mem_limit = 0, job_nhosts = 0,
           job_gres_list = 0x0, x11 = 0, step_core_bitmap = 0x0, 
step_hostlist = 0x0, step_mem_limit = 0, step_gres_list = 0x0}
         i = <optimized out>
         __func__ = "launch_prolog"
#1  0x000000000043ff48 in pack_spec_jobs (buffer_ptr=0x2d58, 
buffer_size=0x7f83500127b0, job_ids=0x0, show_flags=0, uid=3, 
filter_uid=61, protocol_version=0) at job_mgr.c:9524
         jobs_packed = 0
         tmp_offset = <optimized out>
         pack_info = {buffer = 0x35343d4449626f4a, filter_uid = 
875966772, jobs_packed = 0x7f8766b6aee3 <_L_unlock_569+15>, 
protocol_version = 65534, show_flags = 0, uid = 0}
         buffer = <optimized out>
#2  0x0000000002b4ffdc in ?? ()
No symbol table info available.
#3  0x0000000000000000 in ?? ()
No symbol table info available.

If you look at Comment 9, it shows the slurmctld.log from the time of the 
error.  Let me know if you need any of the log before that point, as it's 
only from the minute of the crash.

-Paul Edmon-


Comment 25 Marshall Garey 2018-06-07 14:01:50 MDT
Thanks, I forgot about the log in comment 9. I'll let you know if we need more than that.

Perhaps part of the problem here with gdb is that the binary is newer than when the crash happened (since you applied the patch). Did you happen to make a copy of the slurmctld binary before applying Tim's patch?

We'd hoped to get a snapshot of job_ptr inside the function _run_prolog(). It appears that isn't working, probably because of the newer slurmctld binary file.
Comment 26 Paul Edmon 2018-06-07 14:10:13 MDT
Ah okay, we have a test system that is still on the vanilla 17.11.7.  
Here is the gdb info:

[root@slurm-test ~]# gdb /usr/sbin/slurmctld core.69971
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 124204]
[New LWP 124218]
[New LWP 69974]
[New LWP 124224]
[New LWP 124225]
[New LWP 69978]
[New LWP 70030]
[New LWP 70036]
[New LWP 69976]
[New LWP 124186]
[New LWP 70032]
[New LWP 69971]
[New LWP 69982]
[New LWP 70029]
[New LWP 69975]
[New LWP 70035]
[New LWP 70033]
[New LWP 70038]
[New LWP 70034]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000471ae5 in launch_prolog 
(job_ptr=job_ptr@entry=0x7f83507d57e0) at node_scheduler.c:2853
2853        cred_arg.job_nhosts          = job_ptr->job_resrcs->nhosts;
Missing separate debuginfos, use: debuginfo-install 
slurm-slurmctld-17.11.7-1fasrc01.el7.centos.x86_64
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f8673cfc700 (LWP 124204))]
#0  0x0000000000471ae5 in launch_prolog 
(job_ptr=job_ptr@entry=0x7f83507d57e0) at node_scheduler.c:2853
2853        cred_arg.job_nhosts          = job_ptr->job_resrcs->nhosts;
(gdb) p *job_ptr
$1 = {account = 0x7f8350fc7bc0 "conroy_lab", admin_comment = 0x0, 
alias_list = 0x0, alloc_node = 0x7f8350fccd60 "rclogin15", 
alloc_resp_port = 8920,
   alloc_sid = 8991, array_job_id = 0, array_task_id = 4294967294, 
array_recs = 0x0, assoc_id = 7965, assoc_ptr = 0x1b84a90, batch_flag = 0,
   batch_host = 0x7f83507d1950 "holy2a14207", billable_tres = 47.625, 
bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 
0x0,
   ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, 
cpu_cnt = 32, cr_enabled = 1, db_index = 0, deadline = 0, delay_boot = 
0, derived_ec = 0,
   details = 0x7f8350001b50, direct_set_prio = 0, end_time = 1528415882, 
end_time_exp = 1528415882, epilog_running = false, exit_code = 0, 
fed_details = 0x0,
   front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, 
gres_alloc = 0x7f8350fcf240 "", gres_detail_cnt = 0, gres_detail_str = 0x0,
   gres_req = 0x7f8350fd3ed0 "", gres_used = 0x0, group_id = 403048, 
job_id = 45416412, job_next = 0x0, job_array_next_j = 0x0, 
job_array_next_t = 0x0,
   job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, 
last_sched_eval = 1528387082, licenses = 0x0, license_list = 0x0, 
limit_set = {qos = 0, time = 0,
     tres = 0x7f83507f06c0}, mail_type = 0, mail_user = 0x0, magic = 
4038539564, mcs_label = 0x0, name = 0x7f8350000de0 "bash", network = 0x0,
   next_step_id = 0, ngids = 0, nodes = 0x7f8350154bd0 "holy2a14207", 
node_addr = 0x7f8350154ba0, node_bitmap = 0x0, node_bitmap_cg = 0x0, 
node_cnt = 1,
   node_cnt_wag = 1, nodes_completing = 0x0, origin_cluster = 
0x7f835053bb50 "odyssey", other_port = 8919, pack_job_id = 0, 
pack_job_id_set = 0x0,
   pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7f83500785b0 
"test,conroy-intel,conroy,itc_cluster", part_ptr_list = 0x8aa93c0,
   part_nodes_missing = false, part_ptr = 0x1feb0c0, power_flags = 0 
'\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false,
   priority = 678396, priority_array = 0x7f835047ba70, prio_factors = 
0x7f8350fd3320, profile = 0, qos_id = 1, qos_ptr = 0x1aadc70, 
qos_blocking_ptr = 0x0,
   reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, 
resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295,
   resp_host = 0x7f8350fc7bf0 "10.242.104.144", sched_nodes = 0x0, 
select_jobinfo = 0x7f8350fc7c50, spank_job_env = 0x7f8350402750, 
spank_job_env_size = 1,
   start_protocol_ver = 8192, start_time = 1528387082, state_desc = 0x0, 
state_reason = 35, state_reason_prev = 0, step_list = 0x5493fd0, 
suspend_time = 0,
   time_last_active = 1528387082, time_limit = 480, time_min = 0, 
tot_sus_time = 0, total_cpus = 32, total_nodes = 1, tres_req_cnt = 
0x7f835040fa40,
   tres_req_str = 0x7f8350001aa0 "1=32,2=128000,4=1", tres_fmt_req_str = 
0x7f83507d5b90 "cpu=32,mem=125G,node=1", tres_alloc_cnt = 0x7f8350fd32e0,
   tres_alloc_str = 0x7f8350f9dd10 
"1=32,2=128000,3=18446744073709551614,4=1,5=47", tres_fmt_alloc_str = 
0x7f83503f46e0 "cpu=32,mem=125G,node=1,billing=47",
   user_id = 11608, user_name = 0x7f84e838c720 "rpn3", wait_all_nodes = 
1, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, 
req_switch = 0,
   wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb) quit
[root@slurm-test ~]# gdb /usr/sbin/slurmctld core.69971
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 124204]
[New LWP 124218]
[New LWP 69974]
[New LWP 124224]
[New LWP 124225]
[New LWP 69978]
[New LWP 70030]
[New LWP 70036]
[New LWP 69976]
[New LWP 124186]
[New LWP 70032]
[New LWP 69971]
[New LWP 69982]
[New LWP 70029]
[New LWP 69975]
[New LWP 70035]
[New LWP 70033]
[New LWP 70038]
[New LWP 70034]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000471ae5 in launch_prolog 
(job_ptr=job_ptr@entry=0x7f83507d57e0) at node_scheduler.c:2853
2853        cred_arg.job_nhosts          = job_ptr->job_resrcs->nhosts;
Missing separate debuginfos, use: debuginfo-install 
slurm-slurmctld-17.11.7-1fasrc01.el7.centos.x86_64
(gdb) frame 3
#3  0x0000000000464645 in _run_prolog (arg=0x7f83507d57e0) at 
job_scheduler.c:4358
4358        prolog_running_decr(job_ptr);
(gdb) p *job_ptr
$1 = {account = 0x7f8350fc7bc0 "conroy_lab", admin_comment = 0x0, 
alias_list = 0x0, alloc_node = 0x7f8350fccd60 "rclogin15", 
alloc_resp_port = 8920,
   alloc_sid = 8991, array_job_id = 0, array_task_id = 4294967294, 
array_recs = 0x0, assoc_id = 7965, assoc_ptr = 0x1b84a90, batch_flag = 0,
   batch_host = 0x7f83507d1950 "holy2a14207", billable_tres = 47.625, 
bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 
0x0,
   ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, 
cpu_cnt = 32, cr_enabled = 1, db_index = 0, deadline = 0, delay_boot = 
0, derived_ec = 0,
   details = 0x7f8350001b50, direct_set_prio = 0, end_time = 1528415882, 
end_time_exp = 1528415882, epilog_running = false, exit_code = 0, 
fed_details = 0x0,
   front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, 
gres_alloc = 0x7f8350fcf240 "", gres_detail_cnt = 0, gres_detail_str = 0x0,
   gres_req = 0x7f8350fd3ed0 "", gres_used = 0x0, group_id = 403048, 
job_id = 45416412, job_next = 0x0, job_array_next_j = 0x0, 
job_array_next_t = 0x0,
   job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, 
last_sched_eval = 1528387082, licenses = 0x0, license_list = 0x0, 
limit_set = {qos = 0, time = 0,
     tres = 0x7f83507f06c0}, mail_type = 0, mail_user = 0x0, magic = 
4038539564, mcs_label = 0x0, name = 0x7f8350000de0 "bash", network = 0x0,
   next_step_id = 0, ngids = 0, nodes = 0x7f8350154bd0 "holy2a14207", 
node_addr = 0x7f8350154ba0, node_bitmap = 0x0, node_bitmap_cg = 0x0, 
node_cnt = 1,
   node_cnt_wag = 1, nodes_completing = 0x0, origin_cluster = 
0x7f835053bb50 "odyssey", other_port = 8919, pack_job_id = 0, 
pack_job_id_set = 0x0,
   pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7f83500785b0 
"test,conroy-intel,conroy,itc_cluster", part_ptr_list = 0x8aa93c0,
   part_nodes_missing = false, part_ptr = 0x1feb0c0, power_flags = 0 
'\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false,
   priority = 678396, priority_array = 0x7f835047ba70, prio_factors = 
0x7f8350fd3320, profile = 0, qos_id = 1, qos_ptr = 0x1aadc70, 
qos_blocking_ptr = 0x0,
   reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, 
resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295,
   resp_host = 0x7f8350fc7bf0 "10.242.104.144", sched_nodes = 0x0, 
select_jobinfo = 0x7f8350fc7c50, spank_job_env = 0x7f8350402750, 
spank_job_env_size = 1,
   start_protocol_ver = 8192, start_time = 1528387082, state_desc = 0x0, 
state_reason = 35, state_reason_prev = 0, step_list = 0x5493fd0, 
suspend_time = 0,
   time_last_active = 1528387082, time_limit = 480, time_min = 0, 
tot_sus_time = 0, total_cpus = 32, total_nodes = 1, tres_req_cnt = 
0x7f835040fa40,
   tres_req_str = 0x7f8350001aa0 "1=32,2=128000,4=1", tres_fmt_req_str = 
0x7f83507d5b90 "cpu=32,mem=125G,node=1", tres_alloc_cnt = 0x7f8350fd32e0,
   tres_alloc_str = 0x7f8350f9dd10 
"1=32,2=128000,3=18446744073709551614,4=1,5=47", tres_fmt_alloc_str = 
0x7f83503f46e0 "cpu=32,mem=125G,node=1,billing=47",
   user_id = 11608, user_name = 0x7f84e838c720 "rpn3", wait_all_nodes = 
1, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, 
req_switch = 0,
   wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb) quit


Comment 27 Marshall Garey 2018-06-07 14:24:38 MDT
Excellent, thank you. That's very helpful. Keep a copy of that binary somewhere so you can use it if we need further info.
Comment 31 Marshall Garey 2018-06-08 09:12:03 MDT
Hi Paul, we're continuing to look into this. Can we get a more complete log file from around the time of the crash? We'd like everything starting from when job 45416611 was initially submitted up to the crash.

Thanks

- Marshall
Comment 32 Paul Edmon 2018-06-08 09:27:46 MDT
So for 45416611 I don't have a record of when it was submitted. These are 
all the references I have to it in the log:
messages:[2018-06-07T12:04:59.266] error: find_preemptable_jobs: job 
45416611 not pending
messages:[2018-06-07T12:04:59.267] error: find_preemptable_jobs: job 
45416611 not pending
messages:[2018-06-07T12:04:59.267] error: find_preemptable_jobs: job 
45416611 not pending
messages:[2018-06-07T12:04:59.268] sched: _slurm_rpc_allocate_resources 
JobId=45416611 NodeList=holy2a14207 usec=1090628
messages:[2018-06-07T12:04:59.447] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.485] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.519] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.546] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.572] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.591] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.610] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.627] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.645] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.662] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.681] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.697] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.716] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.733] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.752] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.768] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.787] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.804] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.823] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.840] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.859] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.877] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.895] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.913] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.932] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.949] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:04:59.969] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.059] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.085] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.098] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.113] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.125] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.139] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.144] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.151] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.158] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.168] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.175] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.185] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.189] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.196] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.201] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.208] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.214] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.221] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.227] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.234] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.240] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.248] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.253] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.261] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.267] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.276] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.282] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.290] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.296] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.303] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.309] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.316] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.322] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.330] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.336] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.344] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.350] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.358] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.364] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.372] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.378] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.386] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.392] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.400] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.406] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.414] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.420] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.429] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.435] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.443] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.450] error: _will_run_test: Job 45416611 
has NULL node_bitmap
messages:[2018-06-07T12:05:00.458] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.465] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.473] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.480] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.488] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.495] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.504] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.511] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.519] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.526] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.536] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.543] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.552] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.559] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.567] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.574] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.583] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.590] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.598] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.605] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.614] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.621] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.630] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.637] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.646] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.652] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.661] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.668] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.676] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.683] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.692] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.699] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.708] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.715] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.723] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.730] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.739] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.747] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.757] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.765] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.775] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.783] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.794] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.802] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.812] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.820] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.830] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.838] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.848] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.856] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.866] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.871] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.877] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.885] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.895] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.903] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.913] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.921] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.931] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.939] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.948] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.956] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.966] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.974] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.984] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.992] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.002] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.011] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.021] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.030] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.041] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.050] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.061] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.070] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.081] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.089] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.100] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.109] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.120] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.128] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.139] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.148] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.159] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.168] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.219] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:05:01.219] Extending job 45416611 time limit by 2 secs for configuration
messages:[2018-06-07T12:06:54.398] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:06:54.666] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:06:54.666] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:07:05.634] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:07:05.806] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:07:05.806] Extending job 45416611 time limit by 126 secs for configuration
messages:[2018-06-07T12:07:53.553] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:07:53.828] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:07:53.828] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:07:54.549] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:07:54.587] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:07:54.587] Extending job 45416611 time limit by 175 secs for configuration
messages:[2018-06-07T12:08:35.501] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:08:35.763] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:08:35.763] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:08:36.351] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:08:36.389] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:08:36.389] Extending job 45416611 time limit by 217 secs for configuration
messages:[2018-06-07T12:10:19.205] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:10:19.205] debug:  starting job 45416611 in accounting
messages:[2018-06-07T12:10:19.851] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:10:19.851] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos test grp_used_tres_run_secs(cpu) is 3031200
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos test grp_used_tres_run_secs(mem) is 22319424000
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos test grp_used_tres_run_secs(node) is 770400
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos test grp_used_tres_run_secs(billing) is 7578000
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos normal grp_used_tres_run_secs(cpu) is 276895569240
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos normal grp_used_tres_run_secs(mem) is 1249854241365600
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos normal grp_used_tres_run_secs(node) is 13865803260
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos normal grp_used_tres_run_secs(billing) is 453091777260
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 7965(conroy_lab/rpn3/(null)) grp_used_tres_run_secs(cpu) is 921600
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 7965(conroy_lab/rpn3/(null)) grp_used_tres_run_secs(mem) is 3686400000
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 7965(conroy_lab/rpn3/(null)) grp_used_tres_run_secs(node) is 28800
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 7965(conroy_lab/rpn3/(null)) grp_used_tres_run_secs(billing) is 1353600
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 5041(conroy_lab/(null)/(null)) grp_used_tres_run_secs(cpu) is 749448000
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 5041(conroy_lab/(null)/(null)) grp_used_tres_run_secs(mem) is 2942496000000
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 5041(conroy_lab/(null)/(null)) grp_used_tres_run_secs(node) is 615844800
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 5041(conroy_lab/(null)/(null)) grp_used_tres_run_secs(billing) is 172641600
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(cpu) is 276976785240
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(mem) is 1250218310229600
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(node) is 13871419260
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(billing) is 453213601260
messages:[2018-06-07T12:10:49.600] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:10:49.635] debug2: prolog_slurmctld job 45416611 prolog completed
messages:[2018-06-07T12:10:49.762] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:10:49.762] Extending job 45416611 time limit by 350 secs for configuration
messages:[2018-06-07T12:12:40.139] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:12:40.418] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:12:40.418] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:12:52.000] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:12:52.106] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:12:52.106] Extending job 45416611 time limit by 473 secs for configuration
messages:[2018-06-07T12:13:34.531] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:13:34.799] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:13:34.799] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:13:35.513] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:13:35.551] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:13:35.551] Extending job 45416611 time limit by 516 secs for configuration
messages:[2018-06-07T12:14:36.337] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:14:36.604] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:14:36.604] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:14:37.192] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:14:37.232] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:14:37.232] Extending job 45416611 time limit by 578 secs for configuration
messages:[2018-06-07T13:15:13.718] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T13:15:13.986] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T13:15:13.986] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T13:15:25.929] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T13:15:26.006] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T13:15:26.006] Extending job 45416611 time limit by 4227 secs for configuration
messages:[2018-06-07T13:15:26.006] error: launch_prolog: missing job resources struct for job 45416611, setting to failed


As for 45416412, the job that caused the initial crash, we have:
messages:[2018-06-07T11:58:02.223] error: find_preemptable_jobs: job 45416412 not pending
messages:[2018-06-07T11:58:02.224] error: find_preemptable_jobs: job 45416412 not pending
messages:[2018-06-07T11:58:02.224] error: find_preemptable_jobs: job 45416412 not pending
messages:[2018-06-07T11:58:02.224] sched: _slurm_rpc_allocate_resources JobId=45416412 NodeList=holy2a14207 usec=2044571
messages:[2018-06-07T11:58:02.372] error: _will_run_test: Job 45416412 has NULL node_bitmap
messages:[2018-06-07T11:58:02.393] error: _will_run_test: Job 45416412 has NULL node_bitmap
messages:[2018-06-07T11:58:02.781] prolog_running_decr: Configuration for JobID=45416412 is complete
messages:[2018-06-07T11:59:54.153] _slurm_rpc_submit_batch_job: JobId=45416412 InitPrio=17861812 usec=10822
messages:[2018-06-07T11:59:57.222] sched: Allocate JobID=45416412 NodeList=holyhoekstra04 #CPUs=8 Partition=serial_requeue
messages:[2018-06-07T11:59:57.320] prolog_running_decr: Configuration for JobID=45416412 is complete
messages:[2018-06-07T12:02:39.833] recovered job step 45416412.4294967295
messages:[2018-06-07T12:02:39.833] Recovered JobID=45416412 State=0x1 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:02:42.065] _job_complete: JobID=45416412 State=0x1 NodeCnt=1 WEXITSTATUS 1
messages:[2018-06-07T12:02:42.066] _job_complete: JobID=45416412 State=0x8005 NodeCnt=1 done
messages:[2018-06-07T12:04:02.947] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:06:54.393] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:07:53.548] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:08:35.496] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:10:19.189] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:10:20.022] debug3: Found batch directory for job_id 45416412
messages:[2018-06-07T12:12:40.135] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:13:34.526] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:14:36.332] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T13:15:13.713] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902

Note that I grepped out everything not related to these two jobs. So if 
you want other details let me know.

-Paul Edmon-


Comment 33 Marshall Garey 2018-06-08 09:36:14 MDT
Thanks - we actually wanted to know whether this job was preempted (and about any other interactions with it), so this helps a lot. We'll keep you posted.
Comment 35 Marshall Garey 2018-06-08 16:52:16 MDT
Can you attach the entire log from the earliest mention of job 45416412, which caused the crash (from comment 32 that appears to be 2018-06-07T11:58:02.223), up to the time of the crash?

What we see in comment 32 is interesting, but there are certain log messages I'm interested in that don't include the job id, and there may be other contextual information that will help.

Thanks.

- Marshall
Comment 37 Paul Edmon 2018-06-08 18:39:09 MDT
Created attachment 7056 [details]
slurmctl log for job 45416412

Let me know if you need more than this. I don't want to bloat the file size too much.
Comment 40 Marshall Garey 2018-06-25 17:14:00 MDT
Just an update - your log file has been helpful; thank you for uploading it. We're still investigating. We haven't been able to find any obvious code path where job_resrcs is NULL but still gets dereferenced, so it may be a subtle race condition somewhere - that's what I've been looking for.
Comment 42 Marshall Garey 2018-07-13 16:27:39 MDT
Hi Paul,

We haven't been able to reproduce this yet or determine how those jobs got into that state. I have identified a few locations that are missing the config read lock, and one that's missing the partition read lock, but none of those affect these jobs. We'll get those fixed anyway.

Just to check: I assume you're still running with the patch that Tim provided to prevent the segfault? His patch also added an error message:

error("%s: missing job resources struct for job %u, setting to failed",
			      __func__, job_ptr->job_id);

Can you grep for "missing job resources struct for job" - have you seen this error message since then?
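
For example, something along these lines should turn them up (the /var/log path is only a guess based on the "messages" file names in your earlier excerpts - adjust to wherever syslog writes the slurmctld output):

grep -h "missing job resources struct for job" /var/log/messages*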

I'll keep investigating this one.

- Marshall
Comment 43 Paul Edmon 2018-07-16 08:27:07 MDT
Sure.  Here are the errors I saw:

messages-20180624:[2018-06-21T17:57:45.205] error: launch_prolog: missing job resources struct for job 46831986, setting to failed
messages-20180624:[2018-06-22T12:16:30.355] error: launch_prolog: missing job resources struct for job 46884696, setting to failed
messages-20180624:[2018-06-22T12:16:39.201] error: launch_prolog: missing job resources struct for job 46884705, setting to failed
messages-20180624:[2018-06-22T12:17:31.002] error: launch_prolog: missing job resources struct for job 46884744, setting to failed
messages-20180624:[2018-06-22T12:17:42.178] error: launch_prolog: missing job resources struct for job 46884764, setting to failed
messages-20180624:[2018-06-22T12:18:23.223] error: launch_prolog: missing job resources struct for job 46884792, setting to failed
messages-20180701:[2018-06-26T13:04:44.788] error: launch_prolog: missing job resources struct for job 47097214, setting to failed
messages-20180708:[2018-07-02T11:49:22.785] error: launch_prolog: missing job resources struct for job 47541075, setting to failed
messages-20180708:[2018-07-02T22:52:32.409] error: launch_prolog: missing job resources struct for job 47576114, setting to failed
messages-20180708:[2018-07-03T16:20:04.786] error: launch_prolog: missing job resources struct for job 47622547, setting to failed
messages-20180708:[2018-07-03T16:20:08.549] error: launch_prolog: missing job resources struct for job 47622549, setting to failed
messages-20180708:[2018-07-03T16:20:12.144] error: launch_prolog: missing job resources struct for job 47622550, setting to failed
messages-20180708:[2018-07-03T16:20:18.817] error: launch_prolog: missing job resources struct for job 47622572, setting to failed
messages-20180708:[2018-07-03T16:20:38.994] error: launch_prolog: missing job resources struct for job 47622602, setting to failed
messages-20180708:[2018-07-05T10:12:24.281] error: launch_prolog: missing job resources struct for job 47766679, setting to failed
messages-20180715:[2018-07-11T15:43:34.312] error: launch_prolog: missing job resources struct for job 48119966, setting to failed
messages-20180715:[2018-07-13T11:06:40.887] error: launch_prolog: missing job resources struct for job 48248498, setting to failed
messages-20180715:[2018-07-13T11:07:43.233] error: launch_prolog: missing job resources struct for job 48248517, setting to failed

Let me know if you want me to try to track back these jobs.

-Paul Edmon-


Comment 44 Marshall Garey 2018-07-17 16:03:30 MDT
Interesting. It would be helpful to know if there's a pattern - something in common between all those jobs. I think that would help us track down how they get into this state.

- Did they all get preempted and requeued?
- Were they all part of an array job, or were they all individual jobs, or some of both?
- Any other patterns?

For now, can you get the output of this sacct command? If you could attach it as a file, that would help me - it's easier for me to parse and keep track of things in a text editor with different files than in the web browser.

sacct -D -j<list of all those job id's> --format=jobid,jobidraw,jobname,partition,start,end,submit,eligible,state,exitcode,derivedexitcode,account,user,reqtres%50,alloctres%50

I may ask for slurmctld logs later.
Comment 45 Paul Edmon 2018-07-18 08:34:57 MDT
Created attachment 7337 [details]
sacct info for jobs with missing structs

I got this info by doing the following:

[root@holy-slurm02 log]# grep "missing job resources struct for job" *
messages:[2018-07-16T14:33:38.985] error: launch_prolog: missing job resources struct for job 48502865, setting to failed
messages:[2018-07-16T14:33:47.900] error: launch_prolog: missing job resources struct for job 48502867, setting to failed
messages-20180624:[2018-06-21T17:57:45.205] error: launch_prolog: missing job resources struct for job 46831986, setting to failed
messages-20180624:[2018-06-22T12:16:30.355] error: launch_prolog: missing job resources struct for job 46884696, setting to failed
messages-20180624:[2018-06-22T12:16:39.201] error: launch_prolog: missing job resources struct for job 46884705, setting to failed
messages-20180624:[2018-06-22T12:17:31.002] error: launch_prolog: missing job resources struct for job 46884744, setting to failed
messages-20180624:[2018-06-22T12:17:42.178] error: launch_prolog: missing job resources struct for job 46884764, setting to failed
messages-20180624:[2018-06-22T12:18:23.223] error: launch_prolog: missing job resources struct for job 46884792, setting to failed
messages-20180701:[2018-06-26T13:04:44.788] error: launch_prolog: missing job resources struct for job 47097214, setting to failed
messages-20180708:[2018-07-02T11:49:22.785] error: launch_prolog: missing job resources struct for job 47541075, setting to failed
messages-20180708:[2018-07-02T22:52:32.409] error: launch_prolog: missing job resources struct for job 47576114, setting to failed
messages-20180708:[2018-07-03T16:20:04.786] error: launch_prolog: missing job resources struct for job 47622547, setting to failed
messages-20180708:[2018-07-03T16:20:08.549] error: launch_prolog: missing job resources struct for job 47622549, setting to failed
messages-20180708:[2018-07-03T16:20:12.144] error: launch_prolog: missing job resources struct for job 47622550, setting to failed
messages-20180708:[2018-07-03T16:20:18.817] error: launch_prolog: missing job resources struct for job 47622572, setting to failed
messages-20180708:[2018-07-03T16:20:38.994] error: launch_prolog: missing job resources struct for job 47622602, setting to failed
messages-20180708:[2018-07-05T10:12:24.281] error: launch_prolog: missing job resources struct for job 47766679, setting to failed
messages-20180715:[2018-07-11T15:43:34.312] error: launch_prolog: missing job resources struct for job 48119966, setting to failed
messages-20180715:[2018-07-13T11:06:40.887] error: launch_prolog: missing job resources struct for job 48248498, setting to failed
messages-20180715:[2018-07-13T11:07:43.233] error: launch_prolog: missing job resources struct for job 48248517, setting to failed

[root@holy-slurm02 ~]# sacct -D -j 48502867,48502865,48248517,48248498,48119966,47766679,47622602,47622572,47622550,47622549,47622547,47576114,47541075,47097214,46884792,46884764,46884744,46884705,46884696,46831986 --format=jobid,jobidraw,jobname,partition,start,end,submit,eligible,state,exitcode,derivedexitcode,account,user,reqtres%50,alloctres%5 > missing-struct.txt
Comment 46 Paul Edmon 2018-07-23 08:52:01 MDT
I saw that 17.11.8 was released.  I'm guessing it does not contain a fix for this bug.  What do you recommend we do?  Should I stay on 17.11.7 with the patch we have until we figure this out?
Comment 47 Marshall Garey 2018-07-23 15:37:12 MDT
Correct - 17.11.8 doesn't contain any patches for this. We didn't want to include the workaround we gave you in the release, though it seems harmless.

For now, if you want to upgrade, feel free to move to 17.11.8, but apply Tim's patch locally on top of it.
Comment 48 Marshall Garey 2018-07-26 12:02:45 MDT
We've had a few other cases reported which are very similar to this one - bugs 5438, 5447, and 5452. They all crashed in the same place as each other, though in a different place than this one. We can reliably reproduce that crash with:

1. EnforcePartLimits=ALL
2. Submitting a lot of jobs to multiple partitions
3. The job running on a partition other than the last one in the list it was submitted to, and starting at submit time (scheduled immediately).
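
As a rough illustration of that reproducer (the partition names and command are placeholders, not taken from your site):

sbatch --partition=general,serial_requeue --wrap="sleep 600"

submitted in bulk while the first partition still has free nodes, so the job is scheduled immediately on a partition that isn't the last one in its list.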

1. You have EnforcePartLimits=ALL.

2. To check whether the jobs that are crashing were submitted to multiple partitions:

Can you set your slurmctld debug level to debug and see if any of those errors reoccur? If they do, can you grep for "has more than one partition" and the job id in the slurmctld log file? You should see a message like this:

slurmctld: debug:  Job 544012 has more than one partition (debug)(1167)
slurmctld: debug:  Job 544012 has more than one partition (canpreempt)(1167)
slurmctld: debug:  Job 544012 has more than one partition (canpreempt2)(1167)

If you do, feel free to turn the log level back down.
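
If it's easier, the debug level can be toggled at runtime with scontrol rather than editing slurm.conf and restarting (the grep path is illustrative - use wherever your slurmctld messages end up):

scontrol setdebug debug
grep "has more than one partition" /var/log/messages
scontrol setdebug info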

3. From the sacct output you uploaded in comment 45, I've verified that the submit and start times are the same for all those jobs.


We're working on a fix for those other bugs; I'm hopeful that it will fix it for you, too.
Comment 49 Paul Edmon 2018-07-26 12:15:12 MDT
Yeah, EnforcePartLimits=ALL has never really worked as advertised: I see people submit jobs to multiple partitions that should have been rejected because of one of the partitions' limits. For instance, a person will submit a job asking for 250G to two partitions, where one partition has no nodes with 250G but the other does. If I submit to each partition individually, the one with the lower-memory nodes rejects the job, but if I multi-submit it is accepted and then generates an error like:

[2018-07-26T13:22:14.611] _build_node_list: No nodes satisfy job 49276726 requirements in partition shared

In those cases I go in and manually reset the partitions. What it should do is reject the submission outright, because one of the two partitions can't run the job.
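
A submission along these lines is the sort of thing I mean (partition names and the script are placeholders - one of the two partitions has no 250G nodes):

sbatch -p bigmem,shared --mem=250G batch_job.sh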

Anyway, I will turn on that debug level and, assuming it doesn't bloat my logs, leave it running until I see that error again. It seems to happen at least once a week.

-Paul Edmon-


Comment 50 Marshall Garey 2018-07-27 08:54:46 MDT
Created attachment 7436 [details]
Prevent job_resrcs from being overwritten for multi-partition job submissions

(In reply to Paul Edmon from comment #49)
> Yeah the EnforcePartLimits=ALL has never really worked as advertised as 
> I will see people submit jobs to multiple partitions that should have 
> been rejected due to one of the partition limits.  For instance a person 
> will submit a job asking for 250G to two partitions but one partition 
> doesn't have any nodes that have 250G but the other does.  If I submit 
> individually the partition with the lower memory nodes will reject it, 
> but if I multisubmit it will submit but then generate an error like:
> 
> [2018-07-26T13:22:14.611] _build_node_list: No nodes satisfy job 
> 49276726 requirements in partition shared
> 
> In which case I go in and manually reset the partitions.  However what 
> it should do is reject the submission outright because one of the two 
> partitions can't run the job.

Yeah, the issue is that if a job can be started right away, it will start before it even finishes going through all the partitions. If it can't start right away, then it does go through all the partitions and is rejected appropriately.

We're aware of this and are seeing what can be done about that. It's not trivial, though.


> Anyways I will turn on that debug flag and assuming it doesn't bloat my 
> logs I can leave it running until I see that error again.  It seems to 
> happen at least once a week.

I suppose another thing you could do is just change that debug message to verbose and recompile/restart slurmctld.

$ git diff
diff --git a/src/plugins/priority/multifactor/priority_multifactor.c b/src/plugins/priority/multifactor/priority_multifactor.c
index 2396e60e24..5666b9b41a 100644
--- a/src/plugins/priority/multifactor/priority_multifactor.c
+++ b/src/plugins/priority/multifactor/priority_multifactor.c
@@ -619,7 +619,7 @@ static uint32_t _get_priority_internal(time_t start_time,
                                job_ptr->priority_array[i] =
                                        (uint32_t) priority_part;
                        }
-                       debug("Job %u has more than one partition (%s)(%u)",
+                       verbose("Job %u has more than one partition (%s)(%u)",
                              job_ptr->job_id, part_ptr->name,
                              job_ptr->priority_array[i]);
                        i++;


This patch fixes the other segfaults and, if I'm right, should fix your segfault as well. Feel free to apply it and remove the local bypass that Tim provided, to verify that the problem is indeed fixed. Or, if you'd rather, wait and see whether the "has more than one partition" message shows up for jobs that would have segfaulted before applying the patch. It hasn't been committed yet, but likely will be soon.
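
If you go that route, applying the attachment on top of your 17.11 source tree before rebuilding is the usual routine (the patch file name here is illustrative):

cd slurm-17.11.8
patch -p1 < bug5276_job_resrcs.patch

and then rebuild your RPMs or make install as you normally would.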
Comment 51 Paul Edmon 2018-07-27 08:59:06 MDT
Since I already have the debug logging on, I'm inclined to let that run for a week and see if we catch any in the act. If we do, we can then look at what happened and confirm that it's the same problem.

-Paul Edmon-


Comment 52 Paul Edmon 2018-08-01 08:30:48 MDT
So it looks like we have a winner:

messages:[2018-07-30T14:12:33.640] debug:  Job 49634964 has more than one partition (huce_intel)(9915484)
messages:[2018-07-30T14:12:33.640] debug:  Job 49634964 has more than one partition (test)(9915484)
messages:[2018-07-30T14:12:33.661] error: find_preemptable_jobs: job 49634964 not pending
messages:[2018-07-30T14:12:33.661] sched: _slurm_rpc_allocate_resources JobId=49634964 NodeList=holy2c18102 usec=21897
messages:[2018-07-30T14:12:34.180] prolog_running_decr: Configuration for JobID=49634964 is complete
messages:[2018-07-30T14:12:34.180] Extending job 49634964 time limit by 1 secs for configuration
messages:[2018-07-30T14:12:34.180] error: launch_prolog: missing job resources struct for job 49634964, setting to failed
messages:[2018-07-30T15:03:05.036] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 49634964 uid 11423
messages:[2018-07-30T15:03:05.169] job_str_signal(3): invalid job id 49634964
messages:[2018-07-30T15:03:05.169] _slurm_rpc_kill_job: job_str_signal() job 49634964 sig 9 returned Invalid job id specified

messages:[2018-07-30T21:37:18.732] debug:  Job 49659045 has more than one partition (test)(54990)
messages:[2018-07-30T21:37:18.732] debug:  Job 49659045 has more than one partition (xlin-lab)(54990)
messages:[2018-07-30T21:37:18.732] debug:  Job 49659045 has more than one partition (shared)(54990)
messages:[2018-07-30T21:37:18.752] error: find_preemptable_jobs: job 49659045 not pending
messages:[2018-07-30T21:37:18.752] error: find_preemptable_jobs: job 49659045 not pending
messages:[2018-07-30T21:37:18.754] sched: _slurm_rpc_allocate_resources JobId=49659045 NodeList=holy7c19312 usec=22213
messages:[2018-07-30T21:37:18.967] prolog_running_decr: Configuration for JobID=49659045 is complete
messages:[2018-07-30T21:37:18.967] error: launch_prolog: missing job resources struct for job 49659045, setting to failed

So it looks like it is the multiple-partition issue, and I'm guessing the fixes you have will address it. Do you need any other info from me? I'm anticipating these fixes will be in 17.11.9 - do you have a patch for 17.11.8 that applies them? If so, I can roll up to that.

-Paul Edmon-


Comment 53 Marshall Garey 2018-08-01 09:09:15 MDT
Excellent. The patch in comment 50 is the fix. It has already been committed and will be in 17.11.9 -

https://github.com/SchedMD/slurm/commit/fef07a40972

You can apply this directly to 17.11.8.
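
For example, assuming the standard release tag naming (resolve any conflicts in the usual way, or download the commit as a patch from GitHub and apply it with patch -p1 before rebuilding):

git checkout slurm-17-11-8-1
git cherry-pick fef07a40972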
Comment 54 Marshall Garey 2018-08-08 13:57:52 MDT
Have you seen the segfault happen again? Would you like to keep the bug open for a little while longer, or is it okay to close?
Comment 55 Paul Edmon 2018-08-13 09:29:44 MDT
We haven't upgraded to 17.11.9 yet. I would say close this, as we are presuming the upgrade will fix it. If it doesn't, I will either reopen this ticket or open a new one.
Comment 56 Marshall Garey 2018-08-13 09:32:54 MDT
Thanks. Closing as resolved/fixed


https://github.com/SchedMD/slurm/commit/fef07a40972