Ticket 4008 - gres-flags=enforce-binding crashes slurmctld with invalid requests
Summary: gres-flags=enforce-binding crashes slurmctld with invalid requests
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.02.5
Hardware: Linux
Severity: 2 - High Impact
Assignee: Moe Jette
 
Reported: 2017-07-18 12:51 MDT by Mark Klein
Modified: 2017-07-19 09:35 MDT

See Also:
Site: CSCS - Swiss National Supercomputing Centre
Version Fixed: 17.02.7


Attachments
slurm.conf, gres.conf, cgroup.conf (2.23 KB, application/gzip)
2017-07-18 13:15 MDT, Mark Klein
Prevent slurmctld abort with --gres-flags=enforce-binding job option (1.24 KB, patch)
2017-07-18 15:51 MDT, Moe Jette

Description Mark Klein 2017-07-18 12:51:19 MDT
Hello,

We noticed that slurmctld dies when an invalid request is sent while GRES bindings are enforced.

We have a dual-socket machine; the following invalid request normally just sits in the queue.


#!/bin/sh
#SBATCH -n 22
#SBATCH --ntasks-per-socket=9
#SBATCH --gres=gpu:16
#SBATCH --partition=normal
srun hostname



-bash-4.2$ sbatch test2.sbatch 
Submitted batch job 314
-bash-4.2$ sbatch --gres-flags=enforce-binding test2.sbatch 
sbatch: error: Batch job submission failed: Zero Bytes were transmitted or received
-bash-4.2$ 



When --gres-flags=enforce-binding is added, slurmctld immediately aborts on an assertion failure. I'm guessing this is because s_p_n=3, which is higher than the number of sockets available on the node being tested.



(gdb) bt
#0  0x00007ffff76301d7 in raise () from /lib64/libc.so.6
#1  0x00007ffff76318c8 in abort () from /lib64/libc.so.6
#2  0x00007ffff7629146 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ffff76291f2 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000051b282 in bit_or (b1=0x7fffc00078e0, b2=0x0) at bitstring.c:665
#5  0x00007ffff5fd2ca8 in _gres_sock_job_test (job_gres_list=0x9e6090, node_gres_list=0x9e6720, use_total_gres=true, core_bitmap=0x7fffc0007880, core_start_bit=0, 
    core_end_bit=23, job_id=315, node_name=0x8dc210 "keschcn-0001", node_i=0, s_p_n=3) at job_test.c:1102
#6  0x00007ffff5fd1986 in _can_job_run_on_node (job_ptr=0x7fffc0006c40, core_map=0x7fffc0007880, node_i=0, s_p_n=3, node_usage=0xa3a7d0, cr_type=4, test_only=true, 
    part_core_map=0x0) at job_test.c:610
#7  0x00007ffff5fd305e in _get_res_usage (job_ptr=0x7fffc0006c40, node_map=0x7fffc0007790, core_map=0x7fffc0007880, cr_node_cnt=17, node_usage=0xa3a7d0, cr_type=4, 
    cpu_cnt_ptr=0x7fffedca10d8, test_only=true, part_core_map=0x0) at job_test.c:1188
#8  0x00007ffff5fd83ee in _select_nodes (job_ptr=0x7fffc0006c40, min_nodes=1, max_nodes=500000, req_nodes=1, node_map=0x7fffc0007790, cr_node_cnt=17, core_map=0x7fffc0007880, 
    node_usage=0xa3a7d0, cr_type=4, test_only=true, part_core_map=0x0, prefer_alloc_nodes=false) at job_test.c:3006
#9  0x00007ffff5fd8b69 in cr_job_test (job_ptr=0x7fffc0006c40, node_bitmap=0x7fffc0007790, min_nodes=1, max_nodes=500000, req_nodes=1, mode=1, cr_type=4, 
    job_node_req=NODE_CR_ONE_ROW, cr_node_cnt=17, cr_part_ptr=0xa3aba0, node_usage=0xa3a7d0, exc_core_bitmap=0x0, prefer_alloc_nodes=false, qos_preemptor=false, 
    preempt_mode=false) at job_test.c:3195
#10 0x00007ffff5fc89c9 in _test_only (job_ptr=0x7fffc0006c40, bitmap=0x7fffc0007790, min_nodes=1, max_nodes=500000, req_nodes=1, job_node_req=1) at select_cons_res.c:1523
#11 0x00007ffff5fca942 in select_p_job_test (job_ptr=0x7fffc0006c40, bitmap=0x7fffc0007790, min_nodes=1, max_nodes=500000, req_nodes=1, mode=1, preemptee_candidates=0x0, 
    preemptee_job_list=0x7fffedca1760, exc_core_bitmap=0x0) at select_cons_res.c:2305
#12 0x0000000000537f14 in select_g_job_test (job_ptr=0x7fffc0006c40, bitmap=0x7fffc0007790, min_nodes=1, max_nodes=500000, req_nodes=1, mode=1, preemptee_candidates=0x0, 
    preemptee_job_list=0x7fffedca1760, exc_core_bitmap=0x0) at node_select.c:576
#13 0x000000000049e725 in _pick_best_nodes (node_set_ptr=0x7fffc0007510, node_set_size=1, select_bitmap=0x7fffedca1778, job_ptr=0x7fffc0006c40, part_ptr=0x9f47a0, 
    min_nodes=1, max_nodes=500000, req_nodes=1, test_only=true, preemptee_candidates=0x0, preemptee_job_list=0x7fffedca1760, has_xand=false, exc_core_bitmap=0x0, 
    resv_overlap=false) at node_scheduler.c:1854
#14 0x000000000049d216 in _get_req_features (node_set_ptr=0x7fffc0007510, node_set_size=1, select_bitmap=0x7fffedca1778, job_ptr=0x7fffc0006c40, part_ptr=0x9f47a0, 
    min_nodes=1, max_nodes=500000, req_nodes=1, test_only=true, preemptee_job_list=0x7fffedca1760, can_reboot=true) at node_scheduler.c:1301
#15 0x000000000049fb7d in select_nodes (job_ptr=0x7fffc0006c40, test_only=true, select_node_bitmap=0x0, unavail_node_str=0x0, err_msg=0x7fffedca1bf8) at node_scheduler.c:2361
#16 0x000000000045ee08 in _select_nodes_parts (job_ptr=0x7fffc0006c40, test_only=true, select_node_bitmap=0x0, err_msg=0x7fffedca1bf8) at job_mgr.c:4197
#17 0x000000000045f591 in job_allocate (job_specs=0x7fffc00011d0, immediate=0, will_run=0, resp=0x0, allocate=0, submit_uid=23086, job_pptr=0x7fffedca1cd8, 
    err_msg=0x7fffedca1bf8, protocol_version=7936) at job_mgr.c:4401
#18 0x00000000004ba452 in _slurm_rpc_submit_batch_job (msg=0x7fffedca1e80) at proc_req.c:3629
#19 0x00000000004b10a5 in slurmctld_req (msg=0x7fffedca1e80, arg=0x7fffe40008f0) at proc_req.c:431
#20 0x000000000044202b in _service_connection (arg=0x7fffe40008f0) at controller.c:1133
#21 0x00007ffff79c3dc5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007ffff76f276d in clone () from /lib64/libc.so.6
Comment 1 Brian Christiansen 2017-07-18 12:59:50 MDT
I'm looking into this. Will you attach your configuration files (slurm.conf, gres.conf, cgroup.conf, etc.)?

Thanks,
Brian
Comment 2 Mark Klein 2017-07-18 13:15:05 MDT
Created attachment 4942 [details]
slurm.conf, gres.conf, cgroup.conf

Requested confs
Comment 3 Mark Klein 2017-07-18 13:15:54 MDT
We've seen this on the main system, but I used the TDS to reproduce it. The TDS only contains a single node right now: keschcn-0001.
Comment 4 Brian Christiansen 2017-07-18 13:40:22 MDT
I'm able to reproduce the crash as well. I'm investigating a fix for it.
Comment 5 Moe Jette 2017-07-18 15:51:31 MDT
Created attachment 4943 [details]
Prevent slurmctld abort with --gres-flags=enforce-binding job option

We were able to reproduce the slurmctld abort and have a fix for you. See attachment.
Comment 6 Mark Klein 2017-07-18 17:09:08 MDT
Thank you!

Fix is working for us.
Comment 7 Moe Jette 2017-07-19 09:35:34 MDT
The fix will be in our next release of Slurm.
The commit is here:
https://github.com/SchedMD/slurm/commit/c850ccf4d033a9e404b5d9c52fe3eeb07d1dd187