Hello,

We noticed that slurmctld dies when an invalid request is submitted while gres bindings are enforced. On our dual-socket machine, the following invalid request normally just sits in the queue:

#!/bin/sh
#SBATCH -n 22
#SBATCH --ntasks-per-socket=9
#SBATCH --gres=gpu:16
#SBATCH --partition=normal
srun hostname

-bash-4.2$ sbatch test2.sbatch
Submitted batch job 314
-bash-4.2$ sbatch --gres-flags=enforce-binding test2.sbatch
sbatch: error: Batch job submission failed: Zero Bytes were transmitted or received
-bash-4.2$

When --gres-flags=enforce-binding is added, slurmctld immediately crashes on a failed assertion. I'm guessing this is because s_p_n=3, which is higher than the number of sockets available on the node being tested.

(gdb) bt
#0  0x00007ffff76301d7 in raise () from /lib64/libc.so.6
#1  0x00007ffff76318c8 in abort () from /lib64/libc.so.6
#2  0x00007ffff7629146 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ffff76291f2 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000051b282 in bit_or (b1=0x7fffc00078e0, b2=0x0) at bitstring.c:665
#5  0x00007ffff5fd2ca8 in _gres_sock_job_test (job_gres_list=0x9e6090, node_gres_list=0x9e6720, use_total_gres=true, core_bitmap=0x7fffc0007880, core_start_bit=0, core_end_bit=23, job_id=315, node_name=0x8dc210 "keschcn-0001", node_i=0, s_p_n=3) at job_test.c:1102
#6  0x00007ffff5fd1986 in _can_job_run_on_node (job_ptr=0x7fffc0006c40, core_map=0x7fffc0007880, node_i=0, s_p_n=3, node_usage=0xa3a7d0, cr_type=4, test_only=true, part_core_map=0x0) at job_test.c:610
#7  0x00007ffff5fd305e in _get_res_usage (job_ptr=0x7fffc0006c40, node_map=0x7fffc0007790, core_map=0x7fffc0007880, cr_node_cnt=17, node_usage=0xa3a7d0, cr_type=4, cpu_cnt_ptr=0x7fffedca10d8, test_only=true, part_core_map=0x0) at job_test.c:1188
#8  0x00007ffff5fd83ee in _select_nodes (job_ptr=0x7fffc0006c40, min_nodes=1, max_nodes=500000, req_nodes=1, node_map=0x7fffc0007790, cr_node_cnt=17, core_map=0x7fffc0007880, node_usage=0xa3a7d0, cr_type=4, test_only=true, part_core_map=0x0, prefer_alloc_nodes=false) at job_test.c:3006
#9  0x00007ffff5fd8b69 in cr_job_test (job_ptr=0x7fffc0006c40, node_bitmap=0x7fffc0007790, min_nodes=1, max_nodes=500000, req_nodes=1, mode=1, cr_type=4, job_node_req=NODE_CR_ONE_ROW, cr_node_cnt=17, cr_part_ptr=0xa3aba0, node_usage=0xa3a7d0, exc_core_bitmap=0x0, prefer_alloc_nodes=false, qos_preemptor=false, preempt_mode=false) at job_test.c:3195
#10 0x00007ffff5fc89c9 in _test_only (job_ptr=0x7fffc0006c40, bitmap=0x7fffc0007790, min_nodes=1, max_nodes=500000, req_nodes=1, job_node_req=1) at select_cons_res.c:1523
#11 0x00007ffff5fca942 in select_p_job_test (job_ptr=0x7fffc0006c40, bitmap=0x7fffc0007790, min_nodes=1, max_nodes=500000, req_nodes=1, mode=1, preemptee_candidates=0x0, preemptee_job_list=0x7fffedca1760, exc_core_bitmap=0x0) at select_cons_res.c:2305
#12 0x0000000000537f14 in select_g_job_test (job_ptr=0x7fffc0006c40, bitmap=0x7fffc0007790, min_nodes=1, max_nodes=500000, req_nodes=1, mode=1, preemptee_candidates=0x0, preemptee_job_list=0x7fffedca1760, exc_core_bitmap=0x0) at node_select.c:576
#13 0x000000000049e725 in _pick_best_nodes (node_set_ptr=0x7fffc0007510, node_set_size=1, select_bitmap=0x7fffedca1778, job_ptr=0x7fffc0006c40, part_ptr=0x9f47a0, min_nodes=1, max_nodes=500000, req_nodes=1, test_only=true, preemptee_candidates=0x0, preemptee_job_list=0x7fffedca1760, has_xand=false, exc_core_bitmap=0x0, resv_overlap=false) at node_scheduler.c:1854
#14 0x000000000049d216 in _get_req_features (node_set_ptr=0x7fffc0007510, node_set_size=1, select_bitmap=0x7fffedca1778, job_ptr=0x7fffc0006c40, part_ptr=0x9f47a0, min_nodes=1, max_nodes=500000, req_nodes=1, test_only=true, preemptee_job_list=0x7fffedca1760, can_reboot=true) at node_scheduler.c:1301
#15 0x000000000049fb7d in select_nodes (job_ptr=0x7fffc0006c40, test_only=true, select_node_bitmap=0x0, unavail_node_str=0x0, err_msg=0x7fffedca1bf8) at node_scheduler.c:2361
#16 0x000000000045ee08 in _select_nodes_parts (job_ptr=0x7fffc0006c40, test_only=true, select_node_bitmap=0x0, err_msg=0x7fffedca1bf8) at job_mgr.c:4197
#17 0x000000000045f591 in job_allocate (job_specs=0x7fffc00011d0, immediate=0, will_run=0, resp=0x0, allocate=0, submit_uid=23086, job_pptr=0x7fffedca1cd8, err_msg=0x7fffedca1bf8, protocol_version=7936) at job_mgr.c:4401
#18 0x00000000004ba452 in _slurm_rpc_submit_batch_job (msg=0x7fffedca1e80) at proc_req.c:3629
#19 0x00000000004b10a5 in slurmctld_req (msg=0x7fffedca1e80, arg=0x7fffe40008f0) at proc_req.c:431
#20 0x000000000044202b in _service_connection (arg=0x7fffe40008f0) at controller.c:1133
#21 0x00007ffff79c3dc5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007ffff76f276d in clone () from /lib64/libc.so.6
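For illustration only (this is not Slurm source): frames #0-#4 are consistent with bit_or() being handed a NULL second bitmap (b2=0x0) and failing an internal assertion, which calls abort() and takes down the whole daemon. The stand-alone C sketch below, using a hypothetical simplified bitstring type, shows that failure mode.

/* Stand-alone demo (assumed behaviour, not Slurm code): an assert on a
 * NULL argument inside a bit_or()-style helper aborts the whole process,
 * mirroring raise -> abort -> __assert_fail -> bit_or(b2=0x0) in the trace.
 * Build and run with: cc demo.c -o demo && ./demo
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    size_t    nwords;
    uint64_t *words;
} demo_bitstr_t;                    /* hypothetical simplified bitstring */

static void demo_bit_or(demo_bitstr_t *b1, demo_bitstr_t *b2)
{
    assert(b1 != NULL);
    assert(b2 != NULL);             /* b2 == NULL fails here and abort()s */
    for (size_t i = 0; i < b1->nwords; i++)
        b1->words[i] |= b2->words[i];
}

int main(void)
{
    uint64_t w = 0x5;
    demo_bitstr_t a = { 1, &w };

    demo_bit_or(&a, NULL);          /* aborts here, just as slurmctld did */
    printf("never reached\n");
    return 0;
}

Because the assertion fires inside the controller process itself, a single bad submission is enough to kill slurmctld rather than just rejecting the job.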
I'm looking into this. Will you attach your configuration files (slurm.conf, gres.conf, cgroup.conf, etc.)?

Thanks,
Brian
Created attachment 4942 [details]
slurm.conf, gres.conf, cgroup.conf

Requested confs attached.
We've seen this on the main system, but I used the TDS to reproduce it. The TDS only contains a single node right now, keschcn-0001.
I'm able to reproduce the crash as well. I'm investigating a fix for it.
Created attachment 4943 [details]
Prevent slurmctld abort with --gres-flags=enforce-binding job option

We were able to reproduce the slurmctld abort and have a fix for you. See the attachment.
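The attached patch is the authoritative change. Purely as a hypothetical sketch of the shape such a fix typically takes (simplified stand-in types, not the attached Slurm patch): skip the OR when a per-socket bitmap was never built, for example because the requested tasks-per-socket implies more sockets than the node actually has.

/* Hypothetical guard pattern, NOT the attached patch: only OR a per-socket
 * core bitmap into the accumulated map when it actually exists. */
#include <stdio.h>

struct demo_bitmap { unsigned bits; };      /* stand-in for Slurm's bitstr_t */

static void demo_bit_or(struct demo_bitmap *dst, const struct demo_bitmap *src)
{
    dst->bits |= src->bits;                 /* the real bit_or() asserts src != NULL */
}

int main(void)
{
    struct demo_bitmap core_map = { 0 };
    struct demo_bitmap sock0 = { 0x3f };
    /* On a dual-socket node only two per-socket bitmaps can exist; a third
     * slot stays NULL, and OR-ing it blindly is what triggered the abort. */
    struct demo_bitmap *per_socket[3] = { &sock0, NULL, NULL };

    for (int s = 0; s < 3; s++) {
        if (per_socket[s] == NULL)
            continue;                       /* guard: nothing to OR for this socket */
        demo_bit_or(&core_map, per_socket[s]);
    }
    printf("core_map = 0x%x\n", core_map.bits);
    return 0;
}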
Thank you! The fix is working for us.
The fix will be in our next release of Slurm. The commit is here: https://github.com/SchedMD/slurm/commit/c850ccf4d033a9e404b5d9c52fe3eeb07d1dd187