Ticket 6879

Summary: slurmctld abrt (slurm_xmalloc/hostlist_ranged_string_xmalloc_dims)
Product: Slurm
Reporter: Kilian Cavalotti <kilian>
Component: slurmctld
Assignee: Marshall Garey <marshall>
Status: RESOLVED CANNOTREPRODUCE
Severity: 4 - Minor Issue
Priority: ---
CC: bart, regine.gaudin
Version: 18.08.7   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6659
Site: Stanford
Machine Name: Sherlock
Attachments: "thread apply all bt full" output
slurmctld log

Description Kilian Cavalotti 2019-04-17 21:45:55 MDT
Our Slurm controller just aborted with this backtrace:

(gdb) bt
#0  0x00007f1604229207 in raise () from /lib64/libc.so.6
#1  0x00007f160422a8f8 in abort () from /lib64/libc.so.6
#2  0x00007f160426bd27 in __libc_message () from /lib64/libc.so.6
#3  0x00007f16042725d4 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f1604275de4 in _int_malloc () from /lib64/libc.so.6
#5  0x00007f16042781ac in malloc () from /lib64/libc.so.6
#6  0x00007f1604b3e79d in slurm_xmalloc (size=size@entry=8192, clear=clear@entry=false, file=file@entry=0x7f1604b53f32 "hostlist.c", line=line@entry=3153, func=func@entry=0x7f1604b54360 <__func__.11375> "hostlist_ranged_string_xmalloc_dims") at xmalloc.c:89
#7  0x00007f1604a6faa5 in hostlist_ranged_string_xmalloc_dims (hl=hl@entry=0x7f15a0746ed0, dims=1, brackets=brackets@entry=1) at hostlist.c:3153
#8  0x00007f1604a6fb35 in hostlist_ranged_string_xmalloc (hl=hl@entry=0x7f15a0746ed0) at hostlist.c:3165
#9  0x00007f1604a829ed in bitmap2node_name_sortable (bitmap=bitmap@entry=0x7f15a0863b90, sort=sort@entry=true) at node_conf.c:408
#10 0x00007f1604a82a10 in bitmap2node_name (bitmap=bitmap@entry=0x7f15a0863b90) at node_conf.c:424
#11 0x00007f16039d900e in cr_job_test (job_ptr=job_ptr@entry=0x2e86980, node_bitmap=node_bitmap@entry=0x7f15a0863b90, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=500000, req_nodes=req_nodes@entry=1, mode=mode@entry=0, cr_type=cr_type@entry=20,
    job_node_req=job_node_req@entry=NODE_CR_ONE_ROW, cr_node_cnt=1335, cr_part_ptr=0x7f140c261630, node_usage=0x7f140c951130, exc_core_bitmap=<optimized out>, exc_core_bitmap@entry=0x0, prefer_alloc_nodes=prefer_alloc_nodes@entry=false,
    qos_preemptor=qos_preemptor@entry=false, preempt_mode=preempt_mode@entry=false) at job_test.c:3729
#12 0x00007f16039cc65a in _run_now (exc_core_bitmap=0x0, preemptee_job_list=0x7f1563bfa580, preemptee_candidates=0x0, job_node_req=<optimized out>, req_nodes=1, max_nodes=500000, min_nodes=1, bitmap=0x7f15a0863b90, job_ptr=0x2e86980) at select_cons_res.c:1576
#13 select_p_job_test (job_ptr=0x2e86980, bitmap=0x7f15a0863b90, min_nodes=1, max_nodes=500000, req_nodes=1, mode=<optimized out>, preemptee_candidates=0x0, preemptee_job_list=0x7f1563bfa580, exc_core_bitmap=0x0) at select_cons_res.c:2334
#14 0x00007f1604a86a74 in select_g_job_test (job_ptr=job_ptr@entry=0x2e86980, bitmap=bitmap@entry=0x7f15a0863b90, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=500000, req_nodes=req_nodes@entry=1, mode=mode@entry=0, preemptee_candidates=0x0,
    preemptee_job_list=preemptee_job_list@entry=0x7f1563bfa580, exc_core_bitmap=exc_core_bitmap@entry=0x0) at node_select.c:524
#15 0x0000000000475ffb in _pick_best_nodes (node_set_ptr=node_set_ptr@entry=0x7f15a04ccb70, node_set_size=node_set_size@entry=2, select_bitmap=select_bitmap@entry=0x7f1563bfa570, job_ptr=job_ptr@entry=0x2e86980, part_ptr=part_ptr@entry=0x7f140c534780,
    min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=500000, req_nodes=req_nodes@entry=1, test_only=test_only@entry=false, preemptee_candidates=preemptee_candidates@entry=0x0, preemptee_job_list=preemptee_job_list@entry=0x7f1563bfa580,
    has_xand=has_xand@entry=false, exc_core_bitmap=0x0, resv_overlap=false) at node_scheduler.c:2114
#16 0x00000000004780c1 in _get_req_features (node_set_ptr=0x7f15a04ccb70, node_set_size=node_set_size@entry=2, select_bitmap=select_bitmap@entry=0x7f1563bfa570, job_ptr=job_ptr@entry=0x2e86980, part_ptr=part_ptr@entry=0x7f140c534780, min_nodes=1, max_nodes=500000,
    req_nodes=1, test_only=false, preemptee_job_list=preemptee_job_list@entry=0x7f1563bfa580, can_reboot=true, submission=false) at node_scheduler.c:1494
#17 0x000000000047b02a in select_nodes (job_ptr=<optimized out>, test_only=test_only@entry=false, select_node_bitmap=select_node_bitmap@entry=0x0, err_msg=err_msg@entry=0x0, submission=submission@entry=false) at node_scheduler.c:2688
#18 0x00000000004694b2 in _schedule (job_limit=2000) at job_scheduler.c:1909
#19 schedule (job_limit=<optimized out>, job_limit@entry=0) at job_scheduler.c:1068
#20 0x000000000048df2c in _slurm_rpc_epilog_complete (msg=<optimized out>, run_scheduler=0x7f1563bfaaf4, running_composite=<optimized out>) at proc_req.c:2369
#21 0x0000000000494aef in slurmctld_req (msg=msg@entry=0x7f1563bfae50, arg=arg@entry=0x7f15d004cfb0) at proc_req.c:374
#22 0x0000000000428a61 in _service_connection (arg=0x7f15d004cfb0) at controller.c:1282
#23 0x00007f16045c7dd5 in start_thread () from /lib64/libpthread.so.0
#24 0x00007f16042f0ead in clone () from /lib64/libc.so.6


It restarted normally and is running fine again.
Comment 1 Kilian Cavalotti 2019-04-17 21:47:33 MDT
Created attachment 9943 [details]
"thread apply all bt full" output

Here's the output of "thread apply all bt full"
Comment 2 Marshall Garey 2019-04-18 13:17:03 MDT
Thanks for the bug report. I'm looking into it. Can you let us know if it aborts again?
Comment 3 Kilian Cavalotti 2019-04-18 13:36:03 MDT
(In reply to Marshall Garey from comment #2)
> Thanks for the bug report. I'm looking into it. Can you let us know if it
> aborts again?

Will do, and thanks! It has not aborted again so far.

Cheers,
-- 
Kilian
Comment 4 Marshall Garey 2019-04-26 10:39:09 MDT
Can you upload the slurmctld log file leading up to and including that abort? Because the abort happens inside of a malloc() call, I'd like to get some context about what was happening. Maybe there was heap corruption somewhere?
Comment 5 Kilian Cavalotti 2019-04-26 10:51:32 MDT
Created attachment 10037 [details]
slurmctld log

Sure! Here it is. The crash happened on Apr 17 20:36:18

Cheers,
-- 
Kilian
Comment 6 Marshall Garey 2019-04-26 11:31:10 MDT
Probably unrelated error messages:

There are a lot of these. Internal bug 6659 - we haven't fixed it yet, but know about it (assigned to me).
Apr 17 20:35:59 sh-sl01 slurmctld[13973]: error: _remove_accrue_time_internal: QOS owner user 312359 accrue_cnt underflow


I'll mention it to Kilian, but most likely he's already on top of this:
Apr 18 03:20:50 sh-sl01 slurmctld[174836]: error: Node sh-106-30 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.


There are a lot of these. Internal bug 6769 - also haven't fixed it yet, but we're aware of it (also assigned to me).
Apr 16 19:13:33 sh-sl01 slurmctld[13973]: error: select/cons_res: node sh-102-27 memory is under-allocated (63036-65536) for JobId=40525305_394(40674029)

I'll get those fixed so we can clean up your slurmctld log while I also look for whatever is causing this abort in malloc - Googling for sigabrt in malloc indicates that it's most likely heap corruption somewhere else in the program.
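
For illustration, here is a minimal hypothetical sketch (not Slurm code) of that failure mode: an out-of-bounds write clobbers glibc's heap metadata, and the damage only surfaces later, inside an unrelated malloc() or free(), which then aborts the process much like frames #2-#6 of the backtrace above.

/* Hypothetical example of heap corruption detected late by glibc malloc. */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *a = malloc(16);
    char *b = malloc(16);

    /* Bug: writes 32 bytes into a 16-byte buffer, overwriting the heap
     * metadata of the neighboring chunk. Nothing fails at this point. */
    memset(a, 'x', 32);

    /* The corruption may only be detected on a later allocator call,
     * which can then print "malloc(): memory corruption" (or similar)
     * and end in SIGABRT via malloc_printerr -> __libc_message -> abort. */
    char *c = malloc(8192);

    free(c);
    free(b);
    free(a);
    return 0;
}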
Comment 7 Kilian Cavalotti 2019-04-26 11:49:11 MDT
Hi Marshall, 

(In reply to Marshall Garey from comment #6)
> There are a lot of these. Internal bug 6659 - we haven't fixed it yet, but
> know about it (assigned to me).
> Apr 17 20:35:59 sh-sl01 slurmctld[13973]: error:
> _remove_accrue_time_internal: QOS owner user 312359 accrue_cnt underflow

Great to know!

> I'll mention it to Kilian, but most likely he's already on top of this:
> Apr 18 03:20:50 sh-sl01 slurmctld[174836]: error: Node sh-106-30 appears to
> have a different slurm.conf than the slurmctld.  This could cause issues
> with communication and functionality.  Please review both files and make
> sure they are the same.  If this is expected ignore, and set
> DebugFlags=NO_CONF_HASH in your slurm.conf.

Yes, it can happen at times that individual compute nodes get a slightly out-of-sync config file when we change things. It usually doesn't last.

> There are a lot of these. Internal bug 6769 - also haven't fixed it yet, but
> we're aware of it (also assigned to me).
> Apr 16 19:13:33 sh-sl01 slurmctld[13973]: error: select/cons_res: node
> sh-102-27 memory is under-allocated (63036-65536) for
> JobId=40525305_394(40674029)
> 
> I'll get those fixed so we can clean up your slurmctld log 

Great, that would be awesome, thanks!

> while I also look
> for whatever is causing this abort in malloc - Googling for sigabrt in
> malloc indicates that it's most likely heap corruption somewhere else in the
> program.

Got it. And aborts are pretty rare too, so that would make sense.

Cheers,
-- 
Kilian
Comment 8 Marshall Garey 2019-05-30 09:14:02 MDT
Hi Kilian,

Have there been any more aborts of this nature?

I haven't made any progress on this bug specifically. However, there are proposed fixes for both issues I mentioned (accrue_cnt underflow and memory under-allocated errors).

- Marshall
Comment 9 Kilian Cavalotti 2019-05-30 10:12:23 MDT
Hi Marshall,

(In reply to Marshall Garey from comment #8)
> Have there been any more aborts of this nature?

Not in a while, no.

> I haven't made any progress on this bug specifically. However, there are
> proposed fixes for both issues I mentioned (accrue_cnt underflow and memory
> under-allocated errors).

Nice, thanks for the update!

Cheers,
-- 
Kilian
Comment 10 Regine Gaudin 2019-06-04 03:22:57 MDT
Hi

I'm updating this bug because CEA is also encountering the memory under-allocated errors you mentioned (bug 6769), which are filling slurmctld.log:
error: select/cons_res: node machine1234 memory is under-allocated (0-188800) for JobID=XXXXXX

As you wrote that "there are proposed fixes for both issues I mentioned (accrue_cnt underflow and memory under-allocated errors)", I'd like to let you know that CEA would also be interested in those proposed fixes. Our Slurm controller is 18.08.06 and the clients are on 17.11.6, but they will be upgraded to 18.08.06 soon.

Thanks

Regine
Comment 11 Marshall Garey 2019-06-10 10:22:12 MDT
Regine - the patches for both bugs are pending internal QA/review. They'll both definitely be in 19.05, and probably will both be in 18.08. Although I hope they'll both be in the next tag, I can't promise that. If you'd like patches provided before they're in the public repo, can you create a new ticket for that?

Kilian - we've been looking into adding address sanitizer into our QA toolbox. I'm hopeful it can help me find possible heap corruption bugs, although I think it will be hard to tell if whatever I find/fix (assuming I do find something) is the exact same bug that you encountered. I'd like to keep this bug open for a bit longer to give me time to look into it more before I close it. Let me know if the slurmctld hits an abort like this again.
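
As a hedged aside on what AddressSanitizer buys in this situation: instead of the corruption lingering until some later malloc() aborts, ASan reports the faulting write itself, with the allocation and access stacks. A minimal sketch, assuming a gcc or clang toolchain (for an autotools project the same flags would typically be passed through CFLAGS/LDFLAGS at configure time):

/* asan_demo.c - hypothetical demo, not Slurm code.
 * Assumed build: gcc -g -O1 -fsanitize=address -fno-omit-frame-pointer asan_demo.c -o asan_demo
 */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = malloc(16);

    /* Under ASan this write is reported immediately as a
     * heap-buffer-overflow at this line, rather than being detected
     * (if at all) by a later allocator call. */
    memset(buf, 'x', 32);

    free(buf);
    return 0;
}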


Tangentially, we are taking tangible steps toward improving our QA. It is, unfortunately, a long process. ;)
Comment 12 Kilian Cavalotti 2019-06-10 10:29:00 MDT
Hi Marshall, 

(In reply to Marshall Garey from comment #11)
> Kilian - we've been looking into adding address sanitizer into our QA
> toolbox. I'm hopeful it can help me find possible heap corruption bugs,
> although I think it will be hard to tell if whatever I find/fix (assuming I
> do find something) is the exact same bug that you encountered. I'd like to
> keep this bug open for a bit longer to give me time to look into it more
> before I close it. Let me know if the slurmctld hits an abort like this
> again.

I will!

> Tangentially, we are taking tangible steps toward improving our QA. It is,
> unfortunately, a long process. ;)

This is much much much appreciated, thanks!

Cheers,
-- 
Kilian
Comment 13 Marshall Garey 2019-08-16 09:08:37 MDT
Hi Kilian,

I haven't managed to reproduce this at all, and I haven't heard about it happening again, so I'm going to close it for now as cannot reproduce. Please re-open it if you see this abort again.

In other news, bug 6769 (memory under-allocated errors) is closed with patches upstream; the patches in bug 6659 (accrue errors) are still pending review.

- Marshall
Comment 14 Kilian Cavalotti 2019-08-16 09:31:23 MDT
Hi Marshall, 

(In reply to Marshall Garey from comment #13)
> I haven't managed to reproduce this at all, and I haven't heard about it
> happening again, so I'm going to close it for now as cannot reproduce.
> Please re-open it if you see this abort again.

Will do, no problem. Thanks for looking into it anyway!

> 
> In other news, bug 6769 (memory under-allocated errors) is closed with
> patches upstream; the patches in bug 6659 (accrue errors) are still pending
> review.

Thanks for the update, I appreciate the follow up.

Cheers,
-- 
Kilian