Our slurmctld started crashing about 10 minutes ago with the following error:

slurmctld: debug3: cons_res: adding job 45416606 to part holymeissner row 0
slurmctld: debug3: cons_res: _add_job_to_res: job 45416608 act 0
slurmctld: debug3: cons_res: adding job 45416608 to part holymeissner row 0
slurmctld: debug3: cons_res: _add_job_to_res: job 45416609 act 0
slurmctld: debug3: cons_res: adding job 45416609 to part holymeissner row 0
slurmctld: error: _add_job_to_res: job 45416611 has no job_resrcs info
slurmctld: Warning: Note very large processing time from read_slurm_conf: usec=53732228 began=12:09:55.868
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
slurmctld: debug: backfill: beginning
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: Running as primary controller
slurmctld: debug: No BackupController, not launching heartbeat.
slurmctld: Registering slurmctld at port 6820 with slurmdbd.
slurmctld: debug3: Tree sending to holy2a18107
slurmctld: debug4: orig_timeout was 100000 we have 0 steps and a timeout of 100000
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: prolog_slurmctld job 45416611 prolog completed
slurmctld: debug: backfill: 12288 jobs to backfill
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
slurmctld: debug2: node_did_resp holy2a18107
slurmctld: prolog_running_decr: Configuration for JobID=45416611 is complete
slurmctld: Extending job 45416611 time limit by 350 secs for configuration
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
Segmentation fault (core dumped)

I tried tracing back job 45416611 to see if I could remove it, but no luck: scancel is not quick enough, and there is no job record under spool/hash.1 to delete. The scheduler is dead in the water. I will attach the gdb dump of the core file in a bit.
[root@holy-slurm02 spool]# gdb /usr/sbin/slurmctld core.138997 GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/slurmctld...done. [New LWP 139033] [New LWP 138998] [New LWP 139000] [New LWP 138997] [New LWP 139003] [New LWP 139036] [New LWP 139037] [New LWP 139040] [New LWP 138999] [New LWP 139041] [New LWP 139002] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/usr/sbin/slurmctld'. Program terminated with signal 11, Segmentation fault. #0 0x0000000000471ae5 in launch_prolog (job_ptr=job_ptr@entry=0x785e720) at node_scheduler.c:2853 2853 node_scheduler.c: No such file or directory. Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-17.11.7-1fasrc01.el7.centos.x86_64 (gdb) bt full #0 0x0000000000471ae5 in launch_prolog (job_ptr=job_ptr@entry=0x785e720) at node_scheduler.c:2853 prolog_msg_ptr = 0x7f6fb0000e70 agent_arg_ptr = <optimized out> job_resrcs_ptr = 0x0 cred_arg = {jobid = 45416611, stepid = 4294967295, uid = 11608, gid = 403048, user_name = 0x0, ngids = 0, gids = 0x0, cores_per_socket = 0x0, sockets_per_node = 0x0, sock_core_rep_count = 0x0, job_constraints = 0x0, job_core_bitmap = 0x0, job_core_spec = 65534, job_hostlist = 0x0, job_mem_limit = 0, job_nhosts = 0, job_gres_list = 0x0, x11 = 0, step_core_bitmap = 0x0, step_hostlist = 0x0, step_mem_limit = 0, step_gres_list = 0x0} i = <optimized out> __func__ = "launch_prolog" #1 0x000000000043ff48 in job_config_fini (job_ptr=job_ptr@entry=0x785e720) at job_mgr.c:8492 now = 1528388077 #2 0x000000000046084b in prolog_running_decr (job_ptr=job_ptr@entry=0x785e720) at job_scheduler.c:4406 job_id_buf = "JobID=45416611\000g\033\037\002\000\000\000\000\000\376\377\000\000\000\000\000\000 \347\205\a\000\000\000\000\300\f\000\260o\177\000\000\243\000\265\002\000\000\000\000\315\300l\301o\177\000\000\b\000\000\000\060\000\000\000`\236*\273o\177\000\000\240\235*\273o\177\000\000\000R0\310\324.\037g\000\000\000\000\000\000\000\000\243\000\265\002", '\000' <repeats 12 times>, "p\377\377\377\377\377\377\377\000\000\000\000\000\000\000\000\001", '\000' <repeats 31 times>, "\214\002\"\301o\177\000\000\060\370\347\003\000\000\000\000P\021\301\301o\177\000\000\227\001\000\000\000\000\000\000"... __func__ = "prolog_running_decr" #3 0x0000000000464645 in _run_prolog (arg=0x785e720) at job_scheduler.c:4358 job_ptr = 0x785e720 node_ptr = <optimized out> job_id = 45416611 cpid = <optimized out> i = <optimized out> rc = <optimized out> status = 0 wait_rc = <optimized out> argv = {0x0, 0x0} my_env = 0x0 config_read_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = READ_LOCK} node_bitmap = 0x7f6fb0000cc0 now = 1528388076 resume_timeout = <optimized out> tm = 60112 __func__ = "_run_prolog" #4 0x00007f6fc1218e25 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00007f6fc0f42bad in clone () from /lib64/libc.so.6 No symbol table info available.
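(For readers following the backtrace: the fault is in launch_prolog(), at the point where the credential handed to the prolog RPC is filled in from the job's allocated resources, and the trace shows job_resrcs_ptr = 0x0 for this job. Below is a minimal, hedged sketch of that pattern - the struct and field names are simplified stand-ins inferred from the backtrace, not the actual Slurm definitions - just to illustrate why a job record with no job_resrcs takes the daemon down at this point.)

#include <stdint.h>

/* Simplified stand-ins for the Slurm structures visible in the backtrace;
 * field names here are assumptions for illustration only. */
struct job_resources {
    char     *nodes;   /* allocated hostlist, e.g. "holy2a14207" */
    uint32_t  nhosts;  /* number of allocated hosts */
};

struct job_record {
    uint32_t              job_id;
    struct job_resources *job_resrcs;  /* 0x0 for the broken jobs in this report */
};

struct prolog_cred {
    char     *job_hostlist;
    uint32_t  job_nhosts;
};

/* Mirrors the crash site (node_scheduler.c:2853/2878): with
 * job->job_resrcs == NULL, each assignment below dereferences a NULL
 * pointer and slurmctld dies with SIGSEGV, as in the core dump above. */
static void fill_cred(struct prolog_cred *cred, struct job_record *job)
{
    cred->job_nhosts   = job->job_resrcs->nhosts;
    cred->job_hostlist = job->job_resrcs->nodes;
}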
Created attachment 7027 [details] Fuller GDB Log Added a trace of all the threads.
Logs from the first crash would be nice to have - something has definitely gone wrong with this job dispatch. I'll get you a patch to bypass this issue, at least for the moment, in a couple of minutes, but we'd obviously like to nail down how this job got into such a state.
Sure. Give me a sec.

-Paul Edmon-
This is the first crash of the day: [root@holy-slurm02 spool]# gdb /usr/sbin/slurmctld core.69971 GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/slurmctld...done. [New LWP 124204] [New LWP 124218] [New LWP 69974] [New LWP 124224] [New LWP 124225] [New LWP 69978] [New LWP 70030] [New LWP 70036] [New LWP 69976] [New LWP 124186] [New LWP 70032] [New LWP 69971] [New LWP 69982] [New LWP 70029] [New LWP 69975] [New LWP 70035] [New LWP 70033] [New LWP 70038] [New LWP 70034] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/usr/sbin/slurmctld'. Program terminated with signal 11, Segmentation fault. #0 0x0000000000471ae5 in launch_prolog (job_ptr=job_ptr@entry=0x7f83507d57e0) at node_scheduler.c:2853 2853 node_scheduler.c: No such file or directory. Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-17.11.7-1fasrc01.el7.centos.x86_64 (gdb) bt full #0 0x0000000000471ae5 in launch_prolog (job_ptr=job_ptr@entry=0x7f83507d57e0) at node_scheduler.c:2853 prolog_msg_ptr = 0x7f84e8008580 agent_arg_ptr = <optimized out> job_resrcs_ptr = 0x0 cred_arg = {jobid = 45416412, stepid = 4294967295, uid = 11608, gid = 403048, user_name = 0x0, ngids = 0, gids = 0x0, cores_per_socket = 0x0, sockets_per_node = 0x0, sock_core_rep_count = 0x0, job_constraints = 0x0, job_core_bitmap = 0x0, job_core_spec = 65534, job_hostlist = 0x0, job_mem_limit = 0, job_nhosts = 0, job_gres_list = 0x0, x11 = 0, step_core_bitmap = 0x0, step_hostlist = 0x0, step_mem_limit = 0, step_gres_list = 0x0} i = <optimized out> __func__ = "launch_prolog" #1 0x000000000043ff48 in job_config_fini (job_ptr=job_ptr@entry=0x7f83507d57e0) at job_mgr.c:8492 now = 1528387082 #2 0x000000000046084b in prolog_running_decr (job_ptr=job_ptr@entry=0x7f83507d57e0) at job_scheduler.c:4406 job_id_buf = "JobID=45416412\000\000㮶f\207\177\000\000\376\377\000\000\000\000\000\000\340W}P\203\177\000\000\000\000\000\000\000\000\000\000`\035n", '\000' <repeats 13 times>, "\376\256\266f\207\177\000\000`\276\317s\206\177\000\000\240\275\317s\206\177\000\000\000$f\267\357\275$\302\000\020\020\000\000\000\000\000\334\377\264\002", '\000' <repeats 12 times>, "p\377\377\377\377\377\377\377\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\350εf\207\177\000\000%\272\266f\207\177\000\000\000\000\000\000\214\020\002\000 \035n\000\000\000\000\000`\035n\000\000\000\000\000K\003\017\000\000\000\000\000\060\370\347\003\000\000\000\000\346mF"... 
__func__ = "prolog_running_decr" #3 0x0000000000464645 in _run_prolog (arg=0x7f83507d57e0) at job_scheduler.c:4358 job_ptr = 0x7f83507d57e0 node_ptr = <optimized out> job_id = 45416412 cpid = <optimized out> i = <optimized out> rc = <optimized out> status = 0 wait_rc = <optimized out> argv = {0x0, 0x0} my_env = 0x0 config_read_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = READ_LOCK} node_bitmap = 0x0 now = 1528387082 resume_timeout = <optimized out> tm = 6992 __func__ = "_run_prolog" #4 0x00007f8766b67e25 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00007f8766891bad in clone () from /lib64/libc.so.6 No symbol table info available. On 06/07/2018 12:30 PM, bugs@schedmd.com wrote: > > *Comment # 3 <https://bugs.schedmd.com/show_bug.cgi?id=5276#c3> on bug > 5276 <https://bugs.schedmd.com/show_bug.cgi?id=5276> from Tim Wickberg > <mailto:tim@schedmd.com> * > Logs from the first crash would be nice to have - something's definitely gone > wrong with this job dispatch, and I'll get you a patch to bypass this issue at > least for the moment in a couple of minutes, but we'd obviously like to nail > down however this job got into such as state. > ------------------------------------------------------------------------ > You are receiving this mail because: > > * You reported the bug. >
Can you grab "p *job_ptr" from the first crash, thread 1 frame 1?
Created attachment 7028 [details] Log from the first crash
(gdb) thread 1 [Switching to thread 1 (Thread 0x7f8673cfc700 (LWP 124204))] #3 0x0000000000464645 in _run_prolog (arg=0x7f83507d57e0) at job_scheduler.c:4358 4358 job_scheduler.c: No such file or directory. (gdb) frame 1 #1 0x000000000043ff48 in job_config_fini (job_ptr=job_ptr@entry=0x7f83507d57e0) at job_mgr.c:8492 8492 job_mgr.c: No such file or directory. (gdb) p *job_ptr $2 = {account = 0x7f8350fc7bc0 "conroy_lab", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x7f8350fccd60 "rclogin15", alloc_resp_port = 8920, alloc_sid = 8991, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 7965, assoc_ptr = 0x1b84a90, batch_flag = 0, batch_host = 0x7f83507d1950 "holy2a14207", billable_tres = 47.625, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 32, cr_enabled = 1, db_index = 0, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x7f8350001b50, direct_set_prio = 0, end_time = 1528415882, end_time_exp = 1528415882, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7f8350fcf240 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7f8350fd3ed0 "", gres_used = 0x0, group_id = 403048, job_id = 45416412, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, last_sched_eval = 1528387082, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7f83507f06c0}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x7f8350000de0 "bash", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f8350154bd0 "holy2a14207", node_addr = 0x7f8350154ba0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 1, nodes_completing = 0x0, origin_cluster = 0x7f835053bb50 "odyssey", other_port = 8919, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7f83500785b0 "test,conroy-intel,conroy,itc_cluster", part_ptr_list = 0x8aa93c0, part_nodes_missing = false, part_ptr = 0x1feb0c0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 678396, priority_array = 0x7f835047ba70, prio_factors = 0x7f8350fd3320, profile = 0, qos_id = 1, qos_ptr = 0x1aadc70, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7f8350fc7bf0 "10.242.104.144", sched_nodes = 0x0, select_jobinfo = 0x7f8350fc7c50, spank_job_env = 0x7f8350402750, spank_job_env_size = 1, start_protocol_ver = 8192, start_time = 1528387082, state_desc = 0x0, state_reason = 35, state_reason_prev = 0, step_list = 0x5493fd0, suspend_time = 0, time_last_active = 1528387082, time_limit = 480, time_min = 0, tot_sus_time = 0, total_cpus = 32, total_nodes = 1, tres_req_cnt = 0x7f835040fa40, tres_req_str = 0x7f8350001aa0 "1=32,2=128000,4=1", tres_fmt_req_str = 0x7f83507d5b90 "cpu=32,mem=125G,node=1", tres_alloc_cnt = 0x7f8350fd32e0, tres_alloc_str = 0x7f8350f9dd10 "1=32,2=128000,3=18446744073709551614,4=1,5=47", tres_fmt_alloc_str = 0x7f83503f46e0 "cpu=32,mem=125G,node=1,billing=47", user_id = 11608, user_name = 0x7f84e838c720 "rpn3", wait_all_nodes = 1, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0} On 06/07/2018 12:37 PM, bugs@schedmd.com wrote: > > 
*Comment # 6 <https://bugs.schedmd.com/show_bug.cgi?id=5276#c6> on bug > 5276 <https://bugs.schedmd.com/show_bug.cgi?id=5276> from Tim Wickberg > <mailto:tim@schedmd.com> * > Can you grab "p *job_ptr" from the first crash, thread 1 frame 1? > ------------------------------------------------------------------------ > You are receiving this mail because: > > * You reported the bug. >
This is the log from slurmctld just before:

[2018-06-07T11:58:00.023] _slurm_rpc_submit_batch_job: JobId=45416411 InitPrio=5239455 usec=11238
[2018-06-07T11:58:00.883] sched: Allocate JobID=45416407 NodeList=holygpu15 #CPUs=2 Partition=gpgpu_requeue
[2018-06-07T11:58:01.573] prolog_running_decr: Configuration for JobID=45416407 is complete
[2018-06-07T11:58:01.573] Extending job 45416407 time limit by 1 secs for configuration
[2018-06-07T11:58:02.223] error: find_preemptable_jobs: job 45416412 not pending
[2018-06-07T11:58:02.224] error: find_preemptable_jobs: job 45416412 not pending
[2018-06-07T11:58:02.224] error: find_preemptable_jobs: job 45416412 not pending
[2018-06-07T11:58:02.224] sched: _slurm_rpc_allocate_resources JobId=45416412 NodeList=holy2a14207 usec=2044571
[2018-06-07T11:58:02.224] _job_complete: JobID=45412944 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-06-07T11:58:02.225] _job_complete: JobID=45412944 State=0x8003 NodeCnt=1 done
[2018-06-07T11:58:02.246] _slurm_rpc_submit_batch_job: JobId=45416413 InitPrio=4144998 usec=21429
[2018-06-07T11:58:02.372] error: _will_run_test: Job 45416412 has NULL node_bitmap
[2018-06-07T11:58:02.393] error: _will_run_test: Job 45416412 has NULL node_bitmap
[2018-06-07T11:58:02.419] _slurm_rpc_submit_batch_job: JobId=45416414 InitPrio=5239455 usec=15508
[2018-06-07T11:58:02.422] _job_complete: JobID=45098977 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-06-07T11:58:02.422] _job_complete: JobID=45098977 State=0x8003 NodeCnt=1 done
[2018-06-07T11:58:02.434] _job_complete: JobID=45380117_2599(45383309) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2018-06-07T11:58:02.435] _job_complete: JobID=45380117_2599(45383309) State=0x8003 NodeCnt=1 done
[2018-06-07T11:58:02.781] _slurm_rpc_submit_batch_job: JobId=45416415 InitPrio=5239455 usec=14985
[2018-06-07T11:58:02.781] prolog_running_decr: Configuration for JobID=45416412 is complete
Jun 7 11:58:02 holy-slurm02 kernel: [1352031.736258] srvcn[124204]: segfault at 50 ip 0000000000471ae5 sp 00007f8673cfbc40 error 4 in slurmctld[400000+df000]
Jun 7 11:58:02 holy-slurm02 abrt-hook-ccpp[124226]: Process 69971 (slurmctld) of user 57812 killed by SIGSEGV - dumping core
Jun 7 11:58:40 holy-slurm02 systemd[1]: Stopping Slurm controller daemon...

-Paul Edmon-
For reference, after the first crash I tried restarting the scheduler. Then I tried canceling 45416412, which did succeed. However, it then locked up on the job that the latest crash is on.

-Paul Edmon-
I also noticed that both of those jobs were purportedly going to the same node, so I shot that node as well, just in case it was the node being in a bad state that caused the crash.

-Paul Edmon-
Created attachment 7029 [details] Current slurm.conf
Created attachment 7030 [details] bypass crash for missing job_resrsc struct I'm attaching a patch that should bypass this specific crash, although I can't promise you won't see other issues somewhere else. We'll need to keep looking for how you got into this - if you can set the core files aside (along with a copy of the current binaries) we may ask for some additional details out of those once we have a better theory as to what had happened here. Just to check - are you running any custom patches on top of 17.11.7, or is this stock currently?
Thanks. This is stock. I will attach our spec file so you can see.

-Paul Edmon-
Created attachment 7031 [details] Current slurm.spec
Created attachment 7032 [details] bypass crash for missing job_resrsc struct Slightly revised version of the patch is attached here, sorry for any confusion. This adds the 'return;' I'd missed adding in here - the first version would just crash as before.
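(To sketch what the bypass presumably does: putting the attachment title together with the error that later appears in the logs - "launch_prolog: missing job resources struct for job 45416611, setting to failed" - the guard likely looks roughly like the following. This is only an illustration under those assumptions; the error call and the job-failure handling are placeholders, not the code from the attached patch.)

#include <stdio.h>
#include <stdint.h>

struct job_resources { char *nodes; uint32_t nhosts; };
struct job_record    { uint32_t job_id; struct job_resources *job_resrcs; };

/* Hedged sketch only, not the attached patch: bail out of the prolog
 * launch when the job has no resource allocation attached, instead of
 * dereferencing a NULL pointer. The first patch revision reportedly
 * lacked the early return. */
static void launch_prolog_sketch(struct job_record *job_ptr)
{
    if (job_ptr->job_resrcs == NULL) {
        fprintf(stderr,
                "error: launch_prolog: missing job resources struct "
                "for job %u, setting to failed\n", job_ptr->job_id);
        /* ... placeholder: the real patch presumably also marks the
         * job as failed before returning ... */
        return;
    }

    /* ... normal path continues, e.g.
     * cred_arg.job_nhosts   = job_ptr->job_resrcs->nhosts;
     * cred_arg.job_hostlist = job_ptr->job_resrcs->nodes;
     */
}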
That patch worked. The scheduler is back in action. I'll keep you posted if anything changes.
Handing this off to Marshall to look into further, and dropping the severity now that you're running safely again. Of course, please let us know if you notice any new issues, but assuming that was the only afflicted job, it should be gone by now. - Tim
Hi Paul,

Would you mind also showing us "p *job_ptr" from the first crash, thread 1 frame 3?

Thanks
Sure. [root@holy-slurm02 spool]# gdb /usr/sbin/slurmctld core.69971 GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/slurmctld...done. warning: exec file is newer than core file. [New LWP 124204] [New LWP 124218] [New LWP 69974] [New LWP 124224] [New LWP 124225] [New LWP 69978] [New LWP 70030] [New LWP 70036] [New LWP 69976] [New LWP 124186] [New LWP 70032] [New LWP 69971] [New LWP 69982] [New LWP 70029] [New LWP 69975] [New LWP 70035] [New LWP 70033] [New LWP 70038] [New LWP 70034] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/usr/sbin/slurmctld'. Program terminated with signal 11, Segmentation fault. #0 launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878 2878 cred_arg.job_hostlist = job_ptr->job_resrcs->nodes; Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-17.11.7-1fasrc02.el7.x86_64 (gdb) thread 1 [Switching to thread 1 (Thread 0x7f8673cfc700 (LWP 124204))] #0 launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878 2878 cred_arg.job_hostlist = job_ptr->job_resrcs->nodes; (gdb) frame 3 #3 0x0000000000000000 in ?? () (gdb) p *job_ptr No symbol "job_ptr" in current context. -Paul Edmon- On 06/07/2018 02:37 PM, bugs@schedmd.com wrote: > Felip Moll <mailto:felip.moll@schedmd.com> changed bug 5276 > <https://bugs.schedmd.com/show_bug.cgi?id=5276> > What Removed Added > Assignee marshall@schedmd.com support@schedmd.com > Severity 3 - Medium Impact 1 - System not usable > > *Comment # 21 <https://bugs.schedmd.com/show_bug.cgi?id=5276#c21> on > bug 5276 <https://bugs.schedmd.com/show_bug.cgi?id=5276> from Felip > Moll <mailto:felip.moll@schedmd.com> * > Hi Paul, > > Would you mind to show us also: > > "p *job_ptr" from the first crash, thread 1 frame 3 > > Thanks > ------------------------------------------------------------------------ > You are receiving this mail because: > > * You reported the bug. >
That's weird, it didn't even find a frame 3 that time. What's the backtrace of thread 1? I'm interested in "p *job_ptr" from frame 3 of the backtrace in comment 5 - if you can get that, it would be very helpful. Also, as Tim mentioned, a slurmctld log file from the time of the crash would also be very helpful. Thanks for your help on this. - Marshall
First crash, thread 1 [root@holy-slurm02 spool]# gdb /usr/sbin/slurmctld core.69971 GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/slurmctld...done. warning: exec file is newer than core file. [New LWP 124204] [New LWP 124218] [New LWP 69974] [New LWP 124224] [New LWP 124225] [New LWP 69978] [New LWP 70030] [New LWP 70036] [New LWP 69976] [New LWP 124186] [New LWP 70032] [New LWP 69971] [New LWP 69982] [New LWP 70029] [New LWP 69975] [New LWP 70035] [New LWP 70033] [New LWP 70038] [New LWP 70034] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/usr/sbin/slurmctld'. Program terminated with signal 11, Segmentation fault. #0 launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878 2878 cred_arg.job_hostlist = job_ptr->job_resrcs->nodes; Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-17.11.7-1fasrc02.el7.x86_64 (gdb) thread 1 [Switching to thread 1 (Thread 0x7f8673cfc700 (LWP 124204))] #0 launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878 2878 cred_arg.job_hostlist = job_ptr->job_resrcs->nodes; (gdb) bt full #0 launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878 prolog_msg_ptr = 0x7f84e8008580 agent_arg_ptr = <optimized out> job_resrcs_ptr = 0x0 cred_arg = {jobid = 45416412, stepid = 4294967295, uid = 11608, gid = 403048, user_name = 0x0, ngids = 0, gids = 0x0, cores_per_socket = 0x0, sockets_per_node = 0x0, sock_core_rep_count = 0x0, job_constraints = 0x0, job_core_bitmap = 0x0, job_core_spec = 65534, job_hostlist = 0x0, job_mem_limit = 0, job_nhosts = 0, job_gres_list = 0x0, x11 = 0, step_core_bitmap = 0x0, step_hostlist = 0x0, step_mem_limit = 0, step_gres_list = 0x0} i = <optimized out> __func__ = "launch_prolog" #1 0x000000000043ff48 in pack_spec_jobs (buffer_ptr=0x2d58, buffer_size=0x7f83500127b0, job_ids=0x0, show_flags=0, uid=3, filter_uid=61, protocol_version=0) at job_mgr.c:9524 jobs_packed = 0 tmp_offset = <optimized out> pack_info = {buffer = 0x35343d4449626f4a, filter_uid = 875966772, jobs_packed = 0x7f8766b6aee3 <_L_unlock_569+15>, protocol_version = 65534, show_flags = 0, uid = 0} buffer = <optimized out> #2 0x0000000002b4ffdc in ?? () No symbol table info available. #3 0x0000000000000000 in ?? () No symbol table info available. frame 3 [root@holy-slurm02 spool]# gdb /usr/sbin/slurmctld core.69971 GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/slurmctld...done. warning: exec file is newer than core file. 
[New LWP 124204] [New LWP 124218] [New LWP 69974] [New LWP 124224] [New LWP 124225] [New LWP 69978] [New LWP 70030] [New LWP 70036] [New LWP 69976] [New LWP 124186] [New LWP 70032] [New LWP 69971] [New LWP 69982] [New LWP 70029] [New LWP 69975] [New LWP 70035] [New LWP 70033] [New LWP 70038] [New LWP 70034] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/usr/sbin/slurmctld'. Program terminated with signal 11, Segmentation fault. #0 launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878 2878 cred_arg.job_hostlist = job_ptr->job_resrcs->nodes; Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-17.11.7-1fasrc02.el7.x86_64 (gdb) frame 3 #3 0x0000000000000000 in ?? () (gdb) bt full #0 launch_prolog (job_ptr=0x7f83507d57e0) at node_scheduler.c:2878 prolog_msg_ptr = 0x7f84e8008580 agent_arg_ptr = <optimized out> job_resrcs_ptr = 0x0 cred_arg = {jobid = 45416412, stepid = 4294967295, uid = 11608, gid = 403048, user_name = 0x0, ngids = 0, gids = 0x0, cores_per_socket = 0x0, sockets_per_node = 0x0, sock_core_rep_count = 0x0, job_constraints = 0x0, job_core_bitmap = 0x0, job_core_spec = 65534, job_hostlist = 0x0, job_mem_limit = 0, job_nhosts = 0, job_gres_list = 0x0, x11 = 0, step_core_bitmap = 0x0, step_hostlist = 0x0, step_mem_limit = 0, step_gres_list = 0x0} i = <optimized out> __func__ = "launch_prolog" #1 0x000000000043ff48 in pack_spec_jobs (buffer_ptr=0x2d58, buffer_size=0x7f83500127b0, job_ids=0x0, show_flags=0, uid=3, filter_uid=61, protocol_version=0) at job_mgr.c:9524 jobs_packed = 0 tmp_offset = <optimized out> pack_info = {buffer = 0x35343d4449626f4a, filter_uid = 875966772, jobs_packed = 0x7f8766b6aee3 <_L_unlock_569+15>, protocol_version = 65534, show_flags = 0, uid = 0} buffer = <optimized out> #2 0x0000000002b4ffdc in ?? () No symbol table info available. #3 0x0000000000000000 in ?? () No symbol table info available. If you look at Comment 9 it shows the slurmctld.log from the time of the error. Let me know if you need any of the log before that point as its only for the minute of the crash. -Paul Edmon- On 06/07/2018 03:45 PM, bugs@schedmd.com wrote: > > *Comment # 23 <https://bugs.schedmd.com/show_bug.cgi?id=5276#c23> on > bug 5276 <https://bugs.schedmd.com/show_bug.cgi?id=5276> from Marshall > Garey <mailto:marshall@schedmd.com> * > That's weird, it didn't even find a frame 3 that time. What's the backtrace of > thread 1? > > I'm interested in "p *job_ptr" from frame 3 of the backtrace incomment 5 <show_bug.cgi?id=5276#c5> - if > you can get that, it would be very helpful. > > Also, as Tim mentioned, a slurmctld log file from the time of the crash would > also be very helpful. > > Thanks for your help on this. > > - Marshall > ------------------------------------------------------------------------ > You are receiving this mail because: > > * You reported the bug. >
Thanks, I forgot about the log in comment 9. I'll let you know if we need more than that.

Perhaps part of the problem here with gdb is that the binary is newer than the core file (since you applied the patch). Did you happen to make a copy of the slurmctld binary before applying Tim's patch?

We'd hoped to get a snapshot of job_ptr inside the function _run_prolog(). It appears that isn't working, probably because of the newer slurmctld binary.
Ah okay, We have a test system that is still on the vanilla 17.11.7. Here are the gdb info: [root@slurm-test ~]# gdb /usr/sbin/slurmctld core.69971 GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/slurmctld...done. [New LWP 124204] [New LWP 124218] [New LWP 69974] [New LWP 124224] [New LWP 124225] [New LWP 69978] [New LWP 70030] [New LWP 70036] [New LWP 69976] [New LWP 124186] [New LWP 70032] [New LWP 69971] [New LWP 69982] [New LWP 70029] [New LWP 69975] [New LWP 70035] [New LWP 70033] [New LWP 70038] [New LWP 70034] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/usr/sbin/slurmctld'. Program terminated with signal 11, Segmentation fault. #0 0x0000000000471ae5 in launch_prolog (job_ptr=job_ptr@entry=0x7f83507d57e0) at node_scheduler.c:2853 2853 cred_arg.job_nhosts = job_ptr->job_resrcs->nhosts; Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-17.11.7-1fasrc01.el7.centos.x86_64 (gdb) thread 1 [Switching to thread 1 (Thread 0x7f8673cfc700 (LWP 124204))] #0 0x0000000000471ae5 in launch_prolog (job_ptr=job_ptr@entry=0x7f83507d57e0) at node_scheduler.c:2853 2853 cred_arg.job_nhosts = job_ptr->job_resrcs->nhosts; (gdb) p *job_ptr $1 = {account = 0x7f8350fc7bc0 "conroy_lab", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x7f8350fccd60 "rclogin15", alloc_resp_port = 8920, alloc_sid = 8991, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 7965, assoc_ptr = 0x1b84a90, batch_flag = 0, batch_host = 0x7f83507d1950 "holy2a14207", billable_tres = 47.625, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 32, cr_enabled = 1, db_index = 0, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x7f8350001b50, direct_set_prio = 0, end_time = 1528415882, end_time_exp = 1528415882, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7f8350fcf240 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7f8350fd3ed0 "", gres_used = 0x0, group_id = 403048, job_id = 45416412, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, last_sched_eval = 1528387082, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7f83507f06c0}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x7f8350000de0 "bash", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f8350154bd0 "holy2a14207", node_addr = 0x7f8350154ba0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 1, nodes_completing = 0x0, origin_cluster = 0x7f835053bb50 "odyssey", other_port = 8919, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7f83500785b0 "test,conroy-intel,conroy,itc_cluster", part_ptr_list = 0x8aa93c0, part_nodes_missing = false, part_ptr = 0x1feb0c0, power_flags = 0 '\000', 
pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 678396, priority_array = 0x7f835047ba70, prio_factors = 0x7f8350fd3320, profile = 0, qos_id = 1, qos_ptr = 0x1aadc70, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7f8350fc7bf0 "10.242.104.144", sched_nodes = 0x0, select_jobinfo = 0x7f8350fc7c50, spank_job_env = 0x7f8350402750, spank_job_env_size = 1, start_protocol_ver = 8192, start_time = 1528387082, state_desc = 0x0, state_reason = 35, state_reason_prev = 0, step_list = 0x5493fd0, suspend_time = 0, time_last_active = 1528387082, time_limit = 480, time_min = 0, tot_sus_time = 0, total_cpus = 32, total_nodes = 1, tres_req_cnt = 0x7f835040fa40, tres_req_str = 0x7f8350001aa0 "1=32,2=128000,4=1", tres_fmt_req_str = 0x7f83507d5b90 "cpu=32,mem=125G,node=1", tres_alloc_cnt = 0x7f8350fd32e0, tres_alloc_str = 0x7f8350f9dd10 "1=32,2=128000,3=18446744073709551614,4=1,5=47", tres_fmt_alloc_str = 0x7f83503f46e0 "cpu=32,mem=125G,node=1,billing=47", user_id = 11608, user_name = 0x7f84e838c720 "rpn3", wait_all_nodes = 1, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0} (gdb) quit [root@slurm-test ~]# gdb /usr/sbin/slurmctld core.69971 GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/sbin/slurmctld...done. [New LWP 124204] [New LWP 124218] [New LWP 69974] [New LWP 124224] [New LWP 124225] [New LWP 69978] [New LWP 70030] [New LWP 70036] [New LWP 69976] [New LWP 124186] [New LWP 70032] [New LWP 69971] [New LWP 69982] [New LWP 70029] [New LWP 69975] [New LWP 70035] [New LWP 70033] [New LWP 70038] [New LWP 70034] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/usr/sbin/slurmctld'. Program terminated with signal 11, Segmentation fault. 
#0 0x0000000000471ae5 in launch_prolog (job_ptr=job_ptr@entry=0x7f83507d57e0) at node_scheduler.c:2853 2853 cred_arg.job_nhosts = job_ptr->job_resrcs->nhosts; Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-17.11.7-1fasrc01.el7.centos.x86_64 (gdb) frame 3 #3 0x0000000000464645 in _run_prolog (arg=0x7f83507d57e0) at job_scheduler.c:4358 4358 prolog_running_decr(job_ptr); (gdb) p *job_ptr $1 = {account = 0x7f8350fc7bc0 "conroy_lab", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x7f8350fccd60 "rclogin15", alloc_resp_port = 8920, alloc_sid = 8991, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 7965, assoc_ptr = 0x1b84a90, batch_flag = 0, batch_host = 0x7f83507d1950 "holy2a14207", billable_tres = 47.625, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 32, cr_enabled = 1, db_index = 0, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x7f8350001b50, direct_set_prio = 0, end_time = 1528415882, end_time_exp = 1528415882, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x7f8350fcf240 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x7f8350fd3ed0 "", gres_used = 0x0, group_id = 403048, job_id = 45416412, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, last_sched_eval = 1528387082, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7f83507f06c0}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x7f8350000de0 "bash", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f8350154bd0 "holy2a14207", node_addr = 0x7f8350154ba0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 1, nodes_completing = 0x0, origin_cluster = 0x7f835053bb50 "odyssey", other_port = 8919, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7f83500785b0 "test,conroy-intel,conroy,itc_cluster", part_ptr_list = 0x8aa93c0, part_nodes_missing = false, part_ptr = 0x1feb0c0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 678396, priority_array = 0x7f835047ba70, prio_factors = 0x7f8350fd3320, profile = 0, qos_id = 1, qos_ptr = 0x1aadc70, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7f8350fc7bf0 "10.242.104.144", sched_nodes = 0x0, select_jobinfo = 0x7f8350fc7c50, spank_job_env = 0x7f8350402750, spank_job_env_size = 1, start_protocol_ver = 8192, start_time = 1528387082, state_desc = 0x0, state_reason = 35, state_reason_prev = 0, step_list = 0x5493fd0, suspend_time = 0, time_last_active = 1528387082, time_limit = 480, time_min = 0, tot_sus_time = 0, total_cpus = 32, total_nodes = 1, tres_req_cnt = 0x7f835040fa40, tres_req_str = 0x7f8350001aa0 "1=32,2=128000,4=1", tres_fmt_req_str = 0x7f83507d5b90 "cpu=32,mem=125G,node=1", tres_alloc_cnt = 0x7f8350fd32e0, tres_alloc_str = 0x7f8350f9dd10 "1=32,2=128000,3=18446744073709551614,4=1,5=47", tres_fmt_alloc_str = 0x7f83503f46e0 "cpu=32,mem=125G,node=1,billing=47", user_id = 11608, user_name = 0x7f84e838c720 "rpn3", wait_all_nodes = 1, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0} (gdb) quit On 
06/07/2018 04:01 PM, bugs@schedmd.com wrote: > > *Comment # 25 <https://bugs.schedmd.com/show_bug.cgi?id=5276#c25> on > bug 5276 <https://bugs.schedmd.com/show_bug.cgi?id=5276> from Marshall > Garey <mailto:marshall@schedmd.com> * > Thanks, I forgot about the log incomment 9 <show_bug.cgi?id=5276#c9>. I'll let you know if we need more > than that. > > Perhaps part of the problem here with gdb is that the binary is newer than when > the crash happened (since you applied the patch). Did you happen to make a copy > of the slurmctld binary before applying Tim's patch? > > We'd hoped to get a snapshot of job_ptr inside the function _run_prolog(). It > appears that isn't working, probably because of the newer slurmctld binary > file. > ------------------------------------------------------------------------ > You are receiving this mail because: > > * You reported the bug. >
Excellent, thank you. That's very helpful. Keep a copy of that binary somewhere so you can use it if we need further info.
Hi Paul, we're continuing to look into this. Can we get a more complete log file from around the time of the crash? We'd like everything starting from when job 45416611 was initially submitted up to the crash. Thanks - Marshall
So for 45416611 I don't have a record for when it was submitted. This is all the references I have to it in the log: messages:[2018-06-07T12:04:59.266] error: find_preemptable_jobs: job 45416611 not pending messages:[2018-06-07T12:04:59.267] error: find_preemptable_jobs: job 45416611 not pending messages:[2018-06-07T12:04:59.267] error: find_preemptable_jobs: job 45416611 not pending messages:[2018-06-07T12:04:59.268] sched: _slurm_rpc_allocate_resources JobId=45416611 NodeList=holy2a14207 usec=1090628 messages:[2018-06-07T12:04:59.447] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.485] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.519] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.546] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.572] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.591] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.610] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.627] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.645] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.662] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.681] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.697] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.716] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.733] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.752] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.768] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.787] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.804] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.823] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.840] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.859] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.877] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.895] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.913] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.932] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.949] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:04:59.969] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.059] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.085] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.098] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.113] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.125] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.139] error: _will_run_test: Job 45416611 has NULL node_bitmap 
messages:[2018-06-07T12:05:00.144] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.151] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.158] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.168] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.175] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.185] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.189] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.196] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.201] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.208] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.214] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.221] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.227] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.234] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.240] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.248] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.253] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.261] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.267] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.276] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.282] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.290] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.296] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.303] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.309] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.316] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.322] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.330] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.336] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.344] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.350] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.358] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.364] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.372] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.378] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.386] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.392] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.400] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.406] error: _will_run_test: 
Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.414] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.420] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.429] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.435] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.443] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.450] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.458] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.465] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.473] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.480] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.488] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.495] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.504] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.511] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.519] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.526] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.536] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.543] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.552] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.559] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.567] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.574] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.583] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.590] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.598] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.605] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.614] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.621] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.630] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.637] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.646] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.652] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.661] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.668] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.676] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.683] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.692] error: _will_run_test: Job 45416611 has NULL node_bitmap messages:[2018-06-07T12:05:00.699] error: _will_run_test: Job 45416611 has NULL node_bitmap 
messages:[2018-06-07T12:05:00.708] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.715] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.723] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.730] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.739] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.747] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.757] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.765] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.775] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.783] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.794] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.802] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.812] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.820] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.830] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.838] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.848] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.856] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.866] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.871] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.877] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.885] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.895] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.903] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.913] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.921] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.931] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.939] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.948] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.956] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.966] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.974] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.984] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:00.992] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.002] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.011] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.021] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.030] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.041] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.050] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.061] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.070] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.081] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.089] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.100] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.109] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.120] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.128] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.139] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.148] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.159] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.168] error: _will_run_test: Job 45416611 has NULL node_bitmap
messages:[2018-06-07T12:05:01.219] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:05:01.219] Extending job 45416611 time limit by 2 secs for configuration
messages:[2018-06-07T12:06:54.398] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:06:54.666] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:06:54.666] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:07:05.634] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:07:05.806] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:07:05.806] Extending job 45416611 time limit by 126 secs for configuration
messages:[2018-06-07T12:07:53.553] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:07:53.828] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:07:53.828] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:07:54.549] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:07:54.587] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:07:54.587] Extending job 45416611 time limit by 175 secs for configuration
messages:[2018-06-07T12:08:35.501] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:08:35.763] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:08:35.763] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:08:36.351] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:08:36.389] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:08:36.389] Extending job 45416611 time limit by 217 secs for configuration
messages:[2018-06-07T12:10:19.205] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:10:19.205] debug: starting job 45416611 in accounting
messages:[2018-06-07T12:10:19.851] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:10:19.851] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos test grp_used_tres_run_secs(cpu) is 3031200
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos test grp_used_tres_run_secs(mem) is 22319424000
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos test grp_used_tres_run_secs(node) is 770400
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos test grp_used_tres_run_secs(billing) is 7578000
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos normal grp_used_tres_run_secs(cpu) is 276895569240
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos normal grp_used_tres_run_secs(mem) is 1249854241365600
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos normal grp_used_tres_run_secs(node) is 13865803260
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, qos normal grp_used_tres_run_secs(billing) is 453091777260
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 7965(conroy_lab/rpn3/(null)) grp_used_tres_run_secs(cpu) is 921600
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 7965(conroy_lab/rpn3/(null)) grp_used_tres_run_secs(mem) is 3686400000
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 7965(conroy_lab/rpn3/(null)) grp_used_tres_run_secs(node) is 28800
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 7965(conroy_lab/rpn3/(null)) grp_used_tres_run_secs(billing) is 1353600
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 5041(conroy_lab/(null)/(null)) grp_used_tres_run_secs(cpu) is 749448000
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 5041(conroy_lab/(null)/(null)) grp_used_tres_run_secs(mem) is 2942496000000
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 5041(conroy_lab/(null)/(null)) grp_used_tres_run_secs(node) is 615844800
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 5041(conroy_lab/(null)/(null)) grp_used_tres_run_secs(billing) is 172641600
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(cpu) is 276976785240
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(mem) is 1250218310229600
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(node) is 13871419260
messages:[2018-06-07T12:10:49.234] debug2: acct_policy_job_begin: after adding job 45416611, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(billing) is 453213601260
messages:[2018-06-07T12:10:49.600] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:10:49.635] debug2: prolog_slurmctld job 45416611 prolog completed
messages:[2018-06-07T12:10:49.762] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:10:49.762] Extending job 45416611 time limit by 350 secs for configuration
messages:[2018-06-07T12:12:40.139] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:12:40.418] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:12:40.418] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:12:52.000] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:12:52.106] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:12:52.106] Extending job 45416611 time limit by 473 secs for configuration
messages:[2018-06-07T12:13:34.531] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:13:34.799] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:13:34.799] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:13:35.513] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:13:35.551] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:13:35.551] Extending job 45416611 time limit by 516 secs for configuration
messages:[2018-06-07T12:14:36.337] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T12:14:36.604] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:14:36.604] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T12:14:37.192] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T12:14:37.232] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T12:14:37.232] Extending job 45416611 time limit by 578 secs for configuration
messages:[2018-06-07T13:15:13.718] Recovered JobID=45416611 State=0x4001 NodeCnt=0 Assoc=7965
messages:[2018-06-07T13:15:13.986] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T13:15:13.986] error: select_g_select_nodeinfo_set(45416611): No such file or directory
messages:[2018-06-07T13:15:25.929] error: _add_job_to_res: job 45416611 has no job_resrcs info
messages:[2018-06-07T13:15:26.006] prolog_running_decr: Configuration for JobID=45416611 is complete
messages:[2018-06-07T13:15:26.006] Extending job 45416611 time limit by 4227 secs for configuration
messages:[2018-06-07T13:15:26.006] error: launch_prolog: missing job resources struct for job 45416611, setting to failed

As for 45416412, which was the job that caused the initial crash we have:

messages:[2018-06-07T11:58:02.223] error: find_preemptable_jobs: job 45416412 not pending
messages:[2018-06-07T11:58:02.224] error: find_preemptable_jobs: job 45416412 not pending
messages:[2018-06-07T11:58:02.224] error: find_preemptable_jobs: job 45416412 not pending
messages:[2018-06-07T11:58:02.224] sched: _slurm_rpc_allocate_resources JobId=45416412 NodeList=holy2a14207 usec=2044571
messages:[2018-06-07T11:58:02.372] error: _will_run_test: Job 45416412 has NULL node_bitmap
messages:[2018-06-07T11:58:02.393] error: _will_run_test: Job 45416412 has NULL node_bitmap
messages:[2018-06-07T11:58:02.781] prolog_running_decr: Configuration for JobID=45416412 is complete
messages:[2018-06-07T11:59:54.153] _slurm_rpc_submit_batch_job: JobId=45416412 InitPrio=17861812 usec=10822
messages:[2018-06-07T11:59:57.222] sched: Allocate JobID=45416412 NodeList=holyhoekstra04 #CPUs=8 Partition=serial_requeue
messages:[2018-06-07T11:59:57.320] prolog_running_decr: Configuration for JobID=45416412 is complete
messages:[2018-06-07T12:02:39.833] recovered job step 45416412.4294967295
messages:[2018-06-07T12:02:39.833] Recovered JobID=45416412 State=0x1 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:02:42.065] _job_complete: JobID=45416412 State=0x1 NodeCnt=1 WEXITSTATUS 1
messages:[2018-06-07T12:02:42.066] _job_complete: JobID=45416412 State=0x8005 NodeCnt=1 done
messages:[2018-06-07T12:04:02.947] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:06:54.393] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:07:53.548] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:08:35.496] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:10:19.189] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:10:20.022] debug3: Found batch directory for job_id 45416412
messages:[2018-06-07T12:12:40.135] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:13:34.526] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T12:14:36.332] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902
messages:[2018-06-07T13:15:13.713] Recovered JobID=45416412 State=0x5 NodeCnt=0 Assoc=4902

Note that I grepped out everything not related to these two jobs. So if you want other details let me know.

-Paul Edmon-

On 06/08/2018 11:12 AM, bugs@schedmd.com wrote:
> Comment #31 on bug 5276 from Marshall Garey <https://bugs.schedmd.com/show_bug.cgi?id=5276#c31>:
>
> Hi Paul, we're continuing to look into this. Can we get a more complete log
> file from around the time of the crash? We'd like everything starting from
> when job 45416611 was initially submitted up to the crash.
>
> Thanks
>
> - Marshall
Thanks, we actually wanted to know if this job was preempted (and any other interactions), so this helps a lot. We'll keep you posted.
Can you attach the entire log from the earliest mention of job 45416412, which caused the crash (from comment 32 that appears to be 2018-06-07T11:58:02.223), to the time of the crash? Just what we see in comment 32 is interesting, but there are certain log messages I'm interested in that don't include the job id, and maybe other contextual information that will help.

Thanks.

- Marshall
Created attachment 7056 [details]
slurmctld log for job 45416412

Let me know if you need more than this. I don't want to bloat the file size too much.
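In case it is useful, here is roughly how a window like that can be sliced out of the slurmctld log without attaching the whole thing (the log path and the closing timestamp are illustrative, not exactly what I ran):

# Pull everything from the first mention of job 45416412 up to shortly after
# the crash. Assumes the slurmctld log is /var/log/slurm/messages and that
# entries start with an ISO-8601 timestamp like the ones quoted above.
sed -n '/^\[2018-06-07T11:58:02/,/^\[2018-06-07T12:16:/p' \
    /var/log/slurm/messages > job-45416412-window.log

# Check the size before uploading.
wc -l job-45416412-window.log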
Just an update - your log file has been helpful; thank you for uploading it. We're still investigating this. We haven't been able to find any obvious holes where job_resrcs is NULL but continues to get dereferenced. It might be a subtle race condition somewhere - that's what I've been looking for.
Hi Paul,

We haven't been able to reproduce this yet or determine how those jobs got into that state. I have identified a few locations that are missing the config read lock, and one that's missing the partition read lock, but none of those affect the jobs. We'll get those fixed anyway.

Just to check, I assume you're still running with the patch that Tim provided to prevent the segfault? His patch also added in an error message:

error("%s: missing job resources struct for job %u, setting to failed",
      __func__, job_ptr->job_id);

Can you grep for "missing job resources struct for job" - have you seen this error message since then?

I'll keep investigating this one.

- Marshall
Sure. Here are the errors I saw:

messages-20180624:[2018-06-21T17:57:45.205] error: launch_prolog: missing job resources struct for job 46831986, setting to failed
messages-20180624:[2018-06-22T12:16:30.355] error: launch_prolog: missing job resources struct for job 46884696, setting to failed
messages-20180624:[2018-06-22T12:16:39.201] error: launch_prolog: missing job resources struct for job 46884705, setting to failed
messages-20180624:[2018-06-22T12:17:31.002] error: launch_prolog: missing job resources struct for job 46884744, setting to failed
messages-20180624:[2018-06-22T12:17:42.178] error: launch_prolog: missing job resources struct for job 46884764, setting to failed
messages-20180624:[2018-06-22T12:18:23.223] error: launch_prolog: missing job resources struct for job 46884792, setting to failed
messages-20180701:[2018-06-26T13:04:44.788] error: launch_prolog: missing job resources struct for job 47097214, setting to failed
messages-20180708:[2018-07-02T11:49:22.785] error: launch_prolog: missing job resources struct for job 47541075, setting to failed
messages-20180708:[2018-07-02T22:52:32.409] error: launch_prolog: missing job resources struct for job 47576114, setting to failed
messages-20180708:[2018-07-03T16:20:04.786] error: launch_prolog: missing job resources struct for job 47622547, setting to failed
messages-20180708:[2018-07-03T16:20:08.549] error: launch_prolog: missing job resources struct for job 47622549, setting to failed
messages-20180708:[2018-07-03T16:20:12.144] error: launch_prolog: missing job resources struct for job 47622550, setting to failed
messages-20180708:[2018-07-03T16:20:18.817] error: launch_prolog: missing job resources struct for job 47622572, setting to failed
messages-20180708:[2018-07-03T16:20:38.994] error: launch_prolog: missing job resources struct for job 47622602, setting to failed
messages-20180708:[2018-07-05T10:12:24.281] error: launch_prolog: missing job resources struct for job 47766679, setting to failed
messages-20180715:[2018-07-11T15:43:34.312] error: launch_prolog: missing job resources struct for job 48119966, setting to failed
messages-20180715:[2018-07-13T11:06:40.887] error: launch_prolog: missing job resources struct for job 48248498, setting to failed
messages-20180715:[2018-07-13T11:07:43.233] error: launch_prolog: missing job resources struct for job 48248517, setting to failed

Let me know if you want me to try to track back these jobs.

-Paul Edmon-
Interesting. It would be helpful to know if there's a pattern - something in common between all those jobs. I think that would help us track down how they get into this state.

- Did they all get preempted and requeued?
- Were they all part of an array job, or were they all individual jobs, or some of both?
- Any other patterns?

For now, can you get the output of this sacct command? If you could attach it as a file, that would help me - it's easier for me to parse and keep track of things in a text editor with different files than in the web browser.

sacct -D -j<list of all those job id's> --format=jobid,jobidraw,jobname,partition,start,end,submit,eligible,state,exitcode,derivedexitcode,account,user,reqtres%50,alloctres%50

I may ask for slurmctld logs later.
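If it saves some hand-editing, the id list for -j can be pulled straight out of the logs; a rough one-liner along these lines (run in the slurmctld log directory, assuming the rotated files all match messages*) should print the comma-separated list:

# Grab every job id that hit the "missing job resources struct" error,
# de-duplicate them, and join with commas so they can be pasted after sacct -j.
grep -ho 'missing job resources struct for job [0-9]*' messages* \
    | awk '{print $NF}' | sort -un | paste -sd, -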
Created attachment 7337 [details]
sacct info for jobs with missing structs

I got this info doing the following:

[root@holy-slurm02 log]# grep "missing job resources struct for job" *
messages:[2018-07-16T14:33:38.985] error: launch_prolog: missing job resources struct for job 48502865, setting to failed
messages:[2018-07-16T14:33:47.900] error: launch_prolog: missing job resources struct for job 48502867, setting to failed
messages-20180624:[2018-06-21T17:57:45.205] error: launch_prolog: missing job resources struct for job 46831986, setting to failed
messages-20180624:[2018-06-22T12:16:30.355] error: launch_prolog: missing job resources struct for job 46884696, setting to failed
messages-20180624:[2018-06-22T12:16:39.201] error: launch_prolog: missing job resources struct for job 46884705, setting to failed
messages-20180624:[2018-06-22T12:17:31.002] error: launch_prolog: missing job resources struct for job 46884744, setting to failed
messages-20180624:[2018-06-22T12:17:42.178] error: launch_prolog: missing job resources struct for job 46884764, setting to failed
messages-20180624:[2018-06-22T12:18:23.223] error: launch_prolog: missing job resources struct for job 46884792, setting to failed
messages-20180701:[2018-06-26T13:04:44.788] error: launch_prolog: missing job resources struct for job 47097214, setting to failed
messages-20180708:[2018-07-02T11:49:22.785] error: launch_prolog: missing job resources struct for job 47541075, setting to failed
messages-20180708:[2018-07-02T22:52:32.409] error: launch_prolog: missing job resources struct for job 47576114, setting to failed
messages-20180708:[2018-07-03T16:20:04.786] error: launch_prolog: missing job resources struct for job 47622547, setting to failed
messages-20180708:[2018-07-03T16:20:08.549] error: launch_prolog: missing job resources struct for job 47622549, setting to failed
messages-20180708:[2018-07-03T16:20:12.144] error: launch_prolog: missing job resources struct for job 47622550, setting to failed
messages-20180708:[2018-07-03T16:20:18.817] error: launch_prolog: missing job resources struct for job 47622572, setting to failed
messages-20180708:[2018-07-03T16:20:38.994] error: launch_prolog: missing job resources struct for job 47622602, setting to failed
messages-20180708:[2018-07-05T10:12:24.281] error: launch_prolog: missing job resources struct for job 47766679, setting to failed
messages-20180715:[2018-07-11T15:43:34.312] error: launch_prolog: missing job resources struct for job 48119966, setting to failed
messages-20180715:[2018-07-13T11:06:40.887] error: launch_prolog: missing job resources struct for job 48248498, setting to failed
messages-20180715:[2018-07-13T11:07:43.233] error: launch_prolog: missing job resources struct for job 48248517, setting to failed

[root@holy-slurm02 ~]# sacct -D -j 48502867,48502865,48248517,48248498,48119966,47766679,47622602,47622572,47622550,47622549,47622547,47576114,47541075,47097214,46884792,46884764,46884744,46884705,46884696,46831986 --format=jobid,jobidraw,jobname,partition,start,end,submit,eligible,state,exitcode,derivedexitcode,account,user,reqtres%50,alloctres%5 > missing-struct.txt
I saw that 17.11.8 was released. I'm guessing it does not contain a fix for this bug. What do you recommend we do? Should I stay on 17.11.7 with the patch we have until we figure this out?
Correct, 17.11.8 doesn't contain any patches for this. We didn't want to include the workaround we gave you, though it seems harmless. For now, if you want to upgrade, feel free to upgrade to 17.11.8, but apply Tim's patch locally on top of 17.11.8.
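Roughly, assuming you build from the release tarball, that is just the usual patch-before-build step (the patch file name below is only a placeholder for however you saved Tim's diff):

# Unpack 17.11.8, apply the local workaround, then build/package as usual.
tar xjf slurm-17.11.8.tar.bz2
cd slurm-17.11.8
patch -p1 < ../job-resrcs-workaround.patch   # -p1 strips the a/ b/ prefixes of a git-style diff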
We've had a few other cases reported which are very similar to this one - 5438, 5447, and 5452 - though they all crashed in the same place as each other, which is different from where this one crashed. We can reliably reproduce that crash:

1. EnforcePartLimits=ALL
2. Submitting a lot of jobs to multiple partitions
3. The job runs on a partition other than the last one in the list it was submitted to, and runs at submit time (scheduled immediately).

1. You have EnforcePartLimits=ALL.

2. To check whether the jobs that are crashing were submitted to multiple partitions: can you set your slurmctld debug level to debug (a sketch of the scontrol commands is below) and see if any of those errors recur? If they do, can you grep for "has more than one partition" and the job id in the slurmctld log file? You should see messages like this:

slurmctld: debug: Job 544012 has more than one partition (debug)(1167)
slurmctld: debug: Job 544012 has more than one partition (canpreempt)(1167)
slurmctld: debug: Job 544012 has more than one partition (canpreempt2)(1167)

If you do, feel free to turn the log level back down.

3. From the sacct output you uploaded in comment 45, I've verified that the submit and start times are the same for all those jobs.

We're working on a fix for those other bugs; I'm hopeful that it will fix it for you, too.
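For point 2, the log level can be raised and lowered on the fly with scontrol, so no slurmctld restart is needed; something like this (the log path is whatever SlurmctldLogFile points at on your system, and the job id is just an example):

# Raise slurmctld logging to debug without restarting the daemon.
scontrol setdebug debug

# After one of the launch_prolog errors reoccurs, check whether that job
# was submitted to more than one partition.
jobid=45416611
grep "has more than one partition" /var/log/slurm/messages | grep "$jobid"

# Turn the verbosity back down once you have what you need.
scontrol setdebug info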
Yeah the EnforcePartLimits=ALL has never really worked as advertised as I will see people submit jobs to multiple partitions that should have been rejected due to one of the partition limits. For instance a person will submit a job asking for 250G to two partitions but one partition doesn't have any nodes that have 250G but the other does. If I submit individually the partition with the lower memory nodes will reject it, but if I multisubmit it will submit but then generate an error like:

[2018-07-26T13:22:14.611] _build_node_list: No nodes satisfy job 49276726 requirements in partition shared

In which case I go in and manually reset the partitions. However what it should do is reject the submission outright because one of the two partitions can't run the job.

Anyways I will turn on that debug flag and assuming it doesn't bloat my logs I can leave it running until I see that error again. It seems to happen at least once a week.

-Paul Edmon-
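P.S. The kind of submission I mean looks roughly like this (partition names and the wrapped script are placeholders, not an actual user job):

# Multi-partition submission: with EnforcePartLimits=ALL this should be
# rejected outright if either partition has no 250G nodes, but if the job
# can start immediately it slips through on the partition that does.
sbatch --partition=bigmem,shared --mem=250G --wrap="./run_analysis.sh"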
Created attachment 7436 [details]
Prevent job_resrcs from being overwritten for multi-partition job submissions

(In reply to Paul Edmon from comment #49)
> Yeah the EnforcePartLimits=ALL has never really worked as advertised as
> I will see people submit jobs to multiple partitions that should have
> been rejected due to one of the partition limits. For instance a person
> will submit a job asking for 250G to two partitions but one partition
> doesn't have any nodes that have 250G but the other does. If I submit
> individually the partition with the lower memory nodes will reject it,
> but if I multisubmit it will submit but then generate an error like:
>
> [2018-07-26T13:22:14.611] _build_node_list: No nodes satisfy job
> 49276726 requirements in partition shared
>
> In which case I go in and manually reset the partitions. However what
> it should do is reject the submission outright because one of the two
> partitions can't run the job.

Yeah, the issue is that if a job can be started right away, it will, before it even finishes going through all the partitions. If it can't start right away, then it does go through all the partitions and is rejected appropriately.

We're aware of this and are seeing what can be done about that. It's not trivial, though.

> Anyways I will turn on that debug flag and assuming it doesn't bloat my
> logs I can leave it running until I see that error again. It seems to
> happen at least once a week.

I suppose another thing you could do is just change that debug message to verbose and recompile/restart slurmctld.

$ git diff
diff --git a/src/plugins/priority/multifactor/priority_multifactor.c b/src/plugins/priority/multifactor/priority_multifactor.c
index 2396e60e24..5666b9b41a 100644
--- a/src/plugins/priority/multifactor/priority_multifactor.c
+++ b/src/plugins/priority/multifactor/priority_multifactor.c
@@ -619,7 +619,7 @@ static uint32_t _get_priority_internal(time_t start_time,
 			job_ptr->priority_array[i] =
 				(uint32_t) priority_part;
 		}
-		debug("Job %u has more than one partition (%s)(%u)",
+		verbose("Job %u has more than one partition (%s)(%u)",
 		      job_ptr->job_id, part_ptr->name,
 		      job_ptr->priority_array[i]);
 		i++;

This patch fixes the other segfaults, and if I'm right, should fix your segfault as well. Feel free to try applying this patch and removing the local bypass that Tim provided to verify that the problem is indeed fixed. Or, if you'd rather wait to see if you see that "has more than one partition" message comes up with the jobs that would have segfaulted before you apply this patch, you can do that. This patch hasn't been committed yet, but will likely be soon.
Since I already have the debug logging on I'm inclined to let that go for a week and see if we catch any in the act. If we do we can then look at what happened and confirm that it is the same problem.

-Paul Edmon-
So looks like we have a winner:

messages:[2018-07-30T14:12:33.640] debug: Job 49634964 has more than one partition (huce_intel)(9915484)
messages:[2018-07-30T14:12:33.640] debug: Job 49634964 has more than one partition (test)(9915484)
messages:[2018-07-30T14:12:33.661] error: find_preemptable_jobs: job 49634964 not pending
messages:[2018-07-30T14:12:33.661] sched: _slurm_rpc_allocate_resources JobId=49634964 NodeList=holy2c18102 usec=21897
messages:[2018-07-30T14:12:34.180] prolog_running_decr: Configuration for JobID=49634964 is complete
messages:[2018-07-30T14:12:34.180] Extending job 49634964 time limit by 1 secs for configuration
messages:[2018-07-30T14:12:34.180] error: launch_prolog: missing job resources struct for job 49634964, setting to failed
messages:[2018-07-30T15:03:05.036] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 49634964 uid 11423
messages:[2018-07-30T15:03:05.169] job_str_signal(3): invalid job id 49634964
messages:[2018-07-30T15:03:05.169] _slurm_rpc_kill_job: job_str_signal() job 49634964 sig 9 returned Invalid job id specified
messages:[2018-07-30T21:37:18.732] debug: Job 49659045 has more than one partition (test)(54990)
messages:[2018-07-30T21:37:18.732] debug: Job 49659045 has more than one partition (xlin-lab)(54990)
messages:[2018-07-30T21:37:18.732] debug: Job 49659045 has more than one partition (shared)(54990)
messages:[2018-07-30T21:37:18.752] error: find_preemptable_jobs: job 49659045 not pending
messages:[2018-07-30T21:37:18.752] error: find_preemptable_jobs: job 49659045 not pending
messages:[2018-07-30T21:37:18.754] sched: _slurm_rpc_allocate_resources JobId=49659045 NodeList=holy7c19312 usec=22213
messages:[2018-07-30T21:37:18.967] prolog_running_decr: Configuration for JobID=49659045 is complete
messages:[2018-07-30T21:37:18.967] error: launch_prolog: missing job resources struct for job 49659045, setting to failed

So it looks like it is the multiple partition issue. I'm guessing then that the fixes you have will fix this. Do you need any other info from me? I'm anticipating these fixes will be in 17.11.9? Do you have a patch for 17.11.8 that applies these fixes? If so I can roll up to that.

-Paul Edmon-
Excellent. Comment 50 is the patch that fixes it. It has already been committed and will be in 17.11.9 - https://github.com/SchedMD/slurm/commit/fef07a40972 You can apply this directly to 17.11.8.
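If it's easier than re-applying the attachment, the committed fix can also be pulled from GitHub as a patch and applied to an unpacked 17.11.8 source tree, something like:

# Fetch the commit as a patch and apply it before building.
curl -L -o fef07a40972.patch https://github.com/SchedMD/slurm/commit/fef07a40972.patch
cd slurm-17.11.8
patch -p1 < ../fef07a40972.patch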
Have you seen the segfault happen again? Would you like to keep the bug open for a little while longer, or is it okay to close?
We haven't upgraded to 17.11.9 yet. I would say close this, as we are presuming that the upgrade will fix it. If it doesn't, I will either reopen this ticket or open a new one.
Thanks. Closing as resolved/fixed https://github.com/SchedMD/slurm/commit/fef07a40972