Hello,

We noticed that slurmctld dies when an invalid request is submitted while gres bindings are enforced. On our dual-socket machine, the following invalid request normally just sits in the queue:

#!/bin/sh
#SBATCH -n 22
#SBATCH --ntasks-per-socket=9
#SBATCH --gres=gpu:16
#SBATCH --partition=normal
srun hostname

-bash-4.2$ sbatch test2.sbatch
Submitted batch job 314
-bash-4.2$ sbatch --gres-flags=enforce-binding test2.sbatch
sbatch: error: Batch job submission failed: Zero Bytes were transmitted or received
-bash-4.2$

When --gres-flags=enforce-binding is added, slurmctld immediately crashes on a failed assertion. I'm guessing this is because s_p_n=3, which is higher than the number of sockets available on the node being tested.

(gdb) bt
#0  0x00007ffff76301d7 in raise () from /lib64/libc.so.6
#1  0x00007ffff76318c8 in abort () from /lib64/libc.so.6
#2  0x00007ffff7629146 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ffff76291f2 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000051b282 in bit_or (b1=0x7fffc00078e0, b2=0x0) at bitstring.c:665
#5  0x00007ffff5fd2ca8 in _gres_sock_job_test (job_gres_list=0x9e6090, node_gres_list=0x9e6720, use_total_gres=true, core_bitmap=0x7fffc0007880, core_start_bit=0, core_end_bit=23, job_id=315, node_name=0x8dc210 "keschcn-0001", node_i=0, s_p_n=3) at job_test.c:1102
#6  0x00007ffff5fd1986 in _can_job_run_on_node (job_ptr=0x7fffc0006c40, core_map=0x7fffc0007880, node_i=0, s_p_n=3, node_usage=0xa3a7d0, cr_type=4, test_only=true, part_core_map=0x0) at job_test.c:610
#7  0x00007ffff5fd305e in _get_res_usage (job_ptr=0x7fffc0006c40, node_map=0x7fffc0007790, core_map=0x7fffc0007880, cr_node_cnt=17, node_usage=0xa3a7d0, cr_type=4, cpu_cnt_ptr=0x7fffedca10d8, test_only=true, part_core_map=0x0) at job_test.c:1188
#8  0x00007ffff5fd83ee in _select_nodes (job_ptr=0x7fffc0006c40, min_nodes=1, max_nodes=500000, req_nodes=1, node_map=0x7fffc0007790, cr_node_cnt=17, core_map=0x7fffc0007880, node_usage=0xa3a7d0, cr_type=4, test_only=true, part_core_map=0x0, prefer_alloc_nodes=false) at job_test.c:3006
#9  0x00007ffff5fd8b69 in cr_job_test (job_ptr=0x7fffc0006c40, node_bitmap=0x7fffc0007790, min_nodes=1, max_nodes=500000, req_nodes=1, mode=1, cr_type=4, job_node_req=NODE_CR_ONE_ROW, cr_node_cnt=17, cr_part_ptr=0xa3aba0, node_usage=0xa3a7d0, exc_core_bitmap=0x0, prefer_alloc_nodes=false, qos_preemptor=false, preempt_mode=false) at job_test.c:3195
#10 0x00007ffff5fc89c9 in _test_only (job_ptr=0x7fffc0006c40, bitmap=0x7fffc0007790, min_nodes=1, max_nodes=500000, req_nodes=1, job_node_req=1) at select_cons_res.c:1523
#11 0x00007ffff5fca942 in select_p_job_test (job_ptr=0x7fffc0006c40, bitmap=0x7fffc0007790, min_nodes=1, max_nodes=500000, req_nodes=1, mode=1, preemptee_candidates=0x0, preemptee_job_list=0x7fffedca1760, exc_core_bitmap=0x0) at select_cons_res.c:2305
#12 0x0000000000537f14 in select_g_job_test (job_ptr=0x7fffc0006c40, bitmap=0x7fffc0007790, min_nodes=1, max_nodes=500000, req_nodes=1, mode=1, preemptee_candidates=0x0, preemptee_job_list=0x7fffedca1760, exc_core_bitmap=0x0) at node_select.c:576
#13 0x000000000049e725 in _pick_best_nodes (node_set_ptr=0x7fffc0007510, node_set_size=1, select_bitmap=0x7fffedca1778, job_ptr=0x7fffc0006c40, part_ptr=0x9f47a0, min_nodes=1, max_nodes=500000, req_nodes=1, test_only=true, preemptee_candidates=0x0, preemptee_job_list=0x7fffedca1760, has_xand=false, exc_core_bitmap=0x0, resv_overlap=false) at node_scheduler.c:1854
#14 0x000000000049d216 in _get_req_features (node_set_ptr=0x7fffc0007510, node_set_size=1, select_bitmap=0x7fffedca1778, job_ptr=0x7fffc0006c40, part_ptr=0x9f47a0, min_nodes=1, max_nodes=500000, req_nodes=1, test_only=true, preemptee_job_list=0x7fffedca1760, can_reboot=true) at node_scheduler.c:1301
#15 0x000000000049fb7d in select_nodes (job_ptr=0x7fffc0006c40, test_only=true, select_node_bitmap=0x0, unavail_node_str=0x0, err_msg=0x7fffedca1bf8) at node_scheduler.c:2361
#16 0x000000000045ee08 in _select_nodes_parts (job_ptr=0x7fffc0006c40, test_only=true, select_node_bitmap=0x0, err_msg=0x7fffedca1bf8) at job_mgr.c:4197
#17 0x000000000045f591 in job_allocate (job_specs=0x7fffc00011d0, immediate=0, will_run=0, resp=0x0, allocate=0, submit_uid=23086, job_pptr=0x7fffedca1cd8, err_msg=0x7fffedca1bf8, protocol_version=7936) at job_mgr.c:4401
#18 0x00000000004ba452 in _slurm_rpc_submit_batch_job (msg=0x7fffedca1e80) at proc_req.c:3629
#19 0x00000000004b10a5 in slurmctld_req (msg=0x7fffedca1e80, arg=0x7fffe40008f0) at proc_req.c:431
#20 0x000000000044202b in _service_connection (arg=0x7fffe40008f0) at controller.c:1133
#21 0x00007ffff79c3dc5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007ffff76f276d in clone () from /lib64/libc.so.6
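For illustration only (this is not Slurm source): frames #0-#4 are consistent with bit_or() being handed a NULL second bitmap (b2=0x0) and failing an internal assertion, which calls abort() and takes down the whole daemon. The stand-alone C sketch below, using a hypothetical simplified bitstring type, shows that failure mode.

/* Stand-alone demo (assumed behaviour, not Slurm code): an assert on a
 * NULL argument inside a bit_or()-style helper aborts the whole process,
 * mirroring raise -> abort -> __assert_fail -> bit_or(b2=0x0) in the trace.
 * Build and run with: cc demo.c -o demo && ./demo
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    size_t    nwords;
    uint64_t *words;
} demo_bitstr_t;                    /* hypothetical simplified bitstring */

static void demo_bit_or(demo_bitstr_t *b1, demo_bitstr_t *b2)
{
    assert(b1 != NULL);
    assert(b2 != NULL);             /* b2 == NULL fails here and abort()s */
    for (size_t i = 0; i < b1->nwords; i++)
        b1->words[i] |= b2->words[i];
}

int main(void)
{
    uint64_t w = 0x5;
    demo_bitstr_t a = { 1, &w };

    demo_bit_or(&a, NULL);          /* aborts here, just as slurmctld did */
    printf("never reached\n");
    return 0;
}

Because the assertion fires inside the controller process itself, a single bad submission is enough to kill slurmctld rather than just rejecting the job.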
I'm looking into this. Will you attach your configuration files (slurm.conf, gres.conf, cgroup.conf, etc.)?

Thanks,
Brian
Created attachment 4942 [details]
slurm.conf, gres.conf, cgroup.conf

Requested confs attached.
We've seen this on the main system, but I used the TDS to reproduce it. The TDS only contains a single node right now, keschcn-0001.
I'm able to reproduce the crash as well. I'm investigating a fix for it.
Created attachment 4943 [details]
Prevent slurmctld abort with --gres-flags=enforce-binding job option

We were able to reproduce the slurmctld abort and have a fix for you. See the attachment.
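The attached patch is the authoritative change. Purely as a hypothetical sketch of the shape such a fix typically takes (simplified stand-in types, not the attached Slurm patch): skip the OR when a per-socket bitmap was never built, for example because the requested tasks-per-socket implies more sockets than the node actually has.

/* Hypothetical guard pattern, NOT the attached patch: only OR a per-socket
 * core bitmap into the accumulated map when it actually exists. */
#include <stdio.h>

struct demo_bitmap { unsigned bits; };      /* stand-in for Slurm's bitstr_t */

static void demo_bit_or(struct demo_bitmap *dst, const struct demo_bitmap *src)
{
    dst->bits |= src->bits;                 /* the real bit_or() asserts src != NULL */
}

int main(void)
{
    struct demo_bitmap core_map = { 0 };
    struct demo_bitmap sock0 = { 0x3f };
    /* On a dual-socket node only two per-socket bitmaps can exist; a third
     * slot stays NULL, and OR-ing it blindly is what triggered the abort. */
    struct demo_bitmap *per_socket[3] = { &sock0, NULL, NULL };

    for (int s = 0; s < 3; s++) {
        if (per_socket[s] == NULL)
            continue;                       /* guard: nothing to OR for this socket */
        demo_bit_or(&core_map, per_socket[s]);
    }
    printf("core_map = 0x%x\n", core_map.bits);
    return 0;
}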
Thank you! The fix is working for us.
The fix will be in our next release of Slurm. The commit is here: https://github.com/SchedMD/slurm/commit/c850ccf4d033a9e404b5d9c52fe3eeb07d1dd187