| Summary: | slurmctld crashes in _rm_job_from_res after upgrade from 14.11 to 15.08 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Gianluca Castellani <gianluca.castellani> |
| Component: | slurmctld | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | alex |
| Version: | 15.08.9 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | KAUST | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Gianluca Castellani
2016-04-12 23:45:43 MDT
Hi Gianluca,

We think this was fixed in 15.08.10:

> Fixed backfill scheduler race condition that could cause invalid pointer in
> select/cons_res plugin. Bug introduced in 15.08.9, commit efd9d35.
The scenario is as follows:

1. The backfill scheduler is running, then releases its locks.
2. The main scheduling loop starts a job "A".
3. The backfill scheduler resumes, finds job "A" in its queue, and resets its partition pointer.
4. Job "A" completes and tries to remove its resource allocation record from the select/cons_res data structures, but fails to find it because it is looking in the table for the wrong partition.
5. Job "A"'s record gets purged from slurmctld.
6. The select/cons_res plugin attempts to operate on the resource allocation data structure, follows a pointer into the now-purged record of job "A", and aborts or gets a SEGV.
Commit here:
https://github.com/SchedMD/slurm/commit/d8b18ff8dc6f2237db3729f4ab085b003e54c89e
Could you please try to upgrade to 15.08.10? Thanks.
(In reply to Gianluca Castellani from comment #0)
> Hello, we did a Slurm upgrade and now slurmctld crashes from time to time.

Note: this problem is a duplicate of bug 2603.

> I am also getting a flood of:
> error: slurm_receive_msg: Zero Bytes were transmitted or received
>
> Do you think that these two are related?

These are probably not related. Are the "Zero Bytes" error messages spread out through time, or do they all arrive at the same time? Do they happen immediately after restarting the slurmctld daemon?

Hey Moe, the RPC errors arrived continuously (even after a slurmctld restart). Yesterday we were able to trace these messages to several nodes where munge was down (this was a bit surprising and I had overlooked it). After we had munge running on all the nodes and restarted Slurm, we no longer saw any "error: slurm_receive_msg: Zero Bytes were transmitted or received" messages. We have now upgraded to 15.08.10 on the master and slave node (thanks Alejandro). So far so good.

Gianluca, you might consider configuring a node health check program. Among other things, there are implementations that take care of services such as munge being available on the nodes and take action accordingly, such as setting the node state to DOWN if the service is not running. Please let us know if we can close this bug or if you have any more questions.

Gianluca, closing this bug. Please reopen if needed. Thanks.

I have been a bit busy this week. Anyway, no slurmctld crashes. Best.
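For reference, the node health check suggestion maps onto the `HealthCheckProgram` and `HealthCheckInterval` parameters in slurm.conf. A minimal sketch, assuming the LBNL Node Health Check (NHC) tool is installed at the path shown (the path and the nhc.conf rule are assumptions, not from this thread):

```
# slurm.conf (fragment): run the health check script every 5 minutes
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300

# /etc/nhc/nhc.conf (fragment): mark the node bad if munged is not running
# * || check_ps_service -S -u munge munged
```

With a setup like this, slurmd invokes the check on each node and NHC can drain or down a node where munge has died, instead of the controller surfacing "Zero Bytes were transmitted or received" errors.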