Ticket 13609

Summary: SEGV of slurmstepd in pmixp_agent during _poll_handle_event
Product: Slurm Reporter: Regine Gaudin <regine.gaudin>
Component: slurmstepdAssignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: bart
Version: 20.11.8   
Hardware: Linux   
OS: Linux   
Site: CEA Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 22.05.0pre1 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Regine Gaudin 2022-03-14 04:59:24 MDT
Hi

We are currently encountering SEGV in slurmstepd. Core analysis shows that _poll_handle_event called by pmix_agent.c is trying to handle a obj->ops to NULL.


The NULL obj might be  due to an unjustified eio_remove_obj(obj, obs) in _serv_read(eio_obj_t *obj, List objs) if (!pmixp_conn_is_alive).
 The remove show be already done by obj->shutdown = true later.

We are still in slurm19.09.05-1 but will upgrade to 20.11.8 at the end of the month. However looking at the slurm.20.11.8 source the bug is still there.

I can not send you the slurmstepd core file as it happens on a confidential machine.


I'll try to give you the main analysis:

gdb slurmstepd core.slurmstepd.0
(gdb) bt
#0 _poll_handle_event (objList=0x1614100, obj=0x16d3ec0, revents=8197) at eio.c:493
#_poll_dispatch  (obj (objList=0x1614100,map=0x18c7fe...) at at eio.c:437
#3 in _agent_thread at pmixp_agent.c:253
#4 _start_thread
#5 clone()


pmixp_agent.c:253 is calling eio_handle_mainloop(_io_handle)

SEGV happens in _poll_handle_event:eio.c 493 while POLLOUT in obj->ops->handle_write 
 
 _poll_handle_event
        if (revents & POLLIN) {
                if (obj->ops->handle_read) {
                        if (!read_called) {
                                (*obj->ops->handle_read ) (obj, objList);
                        }
                } else {
                        debug("No handler for POLLIN");
                        obj->shutdown = true;
                }
        }

        if (revents & POLLOUT) {
                if (obj->ops->handle_write) { ====> SEGV
                        if (!write_called) {
                                (*obj->ops->handle_write) (obj, objList);
                        }
                } else {
                        debug("No handler for POLLOUT");
                        obj->shutdown = true;
                }
        }

But obj->ops = NULL:
(gdb)p *obj
$2 = 'fd = 39, arg = obj=0x16d3e70, ops=0x0, shutdown=true}


In_serv_read(eio_obj_t *obj, List objs) if (!pmixp_conn_is_alive(conn))
 eio_remove_obj(obj, objs) is called:

 _serv_read(eio_obj_t *obj, List objs)
{
        /* sanity check */
        xassert(NULL != obj );
        if (obj->shutdown) {
                /* corresponding connection will be
                 * cleaned up during plugin finalize
                 */
                return 0;
        }

        pmixp_conn_t *conn = (pmixp_conn_t *)obj->arg;
        bool proceed = true;

        /* debug stub */
        pmixp_debug_hang(0);

        /* Read and process all received messages */
        while (proceed) {
                if (!pmixp_conn_progress_rcv(conn)) {
                        proceed = 0;
                }
                if (!pmixp_conn_is_alive(conn)) {
                        obj->shutdown = true;
                        PMIXP_DEBUG("Connection closed fd = %d", obj->fd);
                        /* cleanup after this connection */
                        eio_remove_obj(obj, objs ); ======>  clean obj
                        pmixp_conn_return(conn);
                        proceed = 0;
                }
        }


The hypothesis is that as the call to _poll_handle_event(obj->ops->handle_read)
is made before   _poll_handle_event(obj->ops->handle_write)  eio_remove_obj(obj, objs ) has removed obj which leads to SEGV when poll_handle_event(obj->ops->handle_write) try to handle it.


An idea of patch is to remove  eio_remove_obj(obj, objs );  in _serv_read
as obj->shutdown should do it later. Same thing for _serv_write for coherency
Comment 2 Regine Gaudin 2022-03-21 02:34:35 MDT
Hello
Any update on this ? Thanks
Comment 3 Dominik Bartkiewicz 2022-03-21 02:54:07 MDT
Hi

Sorry for the late response.
Thanks for reporting and analyzing this issue.
As you mentioned in the initial commit we will remove eio_remove_obj() from handlers. I also made a few code optimizations and I am still testing how big an impact on memory footprint this will takes.

Dominik
Comment 10 Dominik Bartkiewicz 2022-03-28 03:36:15 MDT
Hi

As you suggested we removed eio_remove_obj() from handlers. In addition, we changed list routines to make eio faster.

All these changes are in the master branch.
https://github.com/SchedMD/slurm/commit/eb64ef254d
https://github.com/SchedMD/slurm/commit/3fb17f807b
https://github.com/SchedMD/slurm/commit/7710a8437b
https://github.com/SchedMD/slurm/commit/2a2b35acac

Let me know if you have any other questions or if we could close this ticket.

Dominik
Comment 11 Regine Gaudin 2022-03-29 00:47:26 MDT
Hello, thanks
In which version of slurm will they be integrated ?
Comment 12 Dominik Bartkiewicz 2022-03-29 03:48:38 MDT
Hi

It will be included in Slurm 22.05.

Dominik
Comment 13 Dominik Bartkiewicz 2022-04-07 05:01:35 MDT
Hi

I'm gonna go ahead and close the bug. Please, reopen if there are further
questions or concerns.

Dominik
Comment 14 Regine Gaudin 2022-04-11 00:45:18 MDT
Yes thanks. I have integrated the patches in our slurm20.