| Summary: | SEGV of slurmstepd in pmixp_agent during _poll_handle_event | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Regine Gaudin <regine.gaudin> |
| Component: | slurmstepd | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | bart |
| Version: | 20.11.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CEA | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 22.05.0pre1 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
Hello Any update on this ? Thanks Hi Sorry for the late response. Thanks for reporting and analyzing this issue. As you mentioned in the initial commit we will remove eio_remove_obj() from handlers. I also made a few code optimizations and I am still testing how big an impact on memory footprint this will takes. Dominik Hi As you suggested we removed eio_remove_obj() from handlers. In addition, we changed list routines to make eio faster. All these changes are in the master branch. https://github.com/SchedMD/slurm/commit/eb64ef254d https://github.com/SchedMD/slurm/commit/3fb17f807b https://github.com/SchedMD/slurm/commit/7710a8437b https://github.com/SchedMD/slurm/commit/2a2b35acac Let me know if you have any other questions or if we could close this ticket. Dominik Hello, thanks In which version of slurm will they be integrated ? Hi It will be included in Slurm 22.05. Dominik Hi I'm gonna go ahead and close the bug. Please, reopen if there are further questions or concerns. Dominik Yes thanks. I have integrated the patches in our slurm20. |
Hi We are currently encountering SEGV in slurmstepd. Core analysis shows that _poll_handle_event called by pmix_agent.c is trying to handle a obj->ops to NULL. The NULL obj might be due to an unjustified eio_remove_obj(obj, obs) in _serv_read(eio_obj_t *obj, List objs) if (!pmixp_conn_is_alive). The remove show be already done by obj->shutdown = true later. We are still in slurm19.09.05-1 but will upgrade to 20.11.8 at the end of the month. However looking at the slurm.20.11.8 source the bug is still there. I can not send you the slurmstepd core file as it happens on a confidential machine. I'll try to give you the main analysis: gdb slurmstepd core.slurmstepd.0 (gdb) bt #0 _poll_handle_event (objList=0x1614100, obj=0x16d3ec0, revents=8197) at eio.c:493 #_poll_dispatch (obj (objList=0x1614100,map=0x18c7fe...) at at eio.c:437 #3 in _agent_thread at pmixp_agent.c:253 #4 _start_thread #5 clone() pmixp_agent.c:253 is calling eio_handle_mainloop(_io_handle) SEGV happens in _poll_handle_event:eio.c 493 while POLLOUT in obj->ops->handle_write _poll_handle_event if (revents & POLLIN) { if (obj->ops->handle_read) { if (!read_called) { (*obj->ops->handle_read ) (obj, objList); } } else { debug("No handler for POLLIN"); obj->shutdown = true; } } if (revents & POLLOUT) { if (obj->ops->handle_write) { ====> SEGV if (!write_called) { (*obj->ops->handle_write) (obj, objList); } } else { debug("No handler for POLLOUT"); obj->shutdown = true; } } But obj->ops = NULL: (gdb)p *obj $2 = 'fd = 39, arg = obj=0x16d3e70, ops=0x0, shutdown=true} In_serv_read(eio_obj_t *obj, List objs) if (!pmixp_conn_is_alive(conn)) eio_remove_obj(obj, objs) is called: _serv_read(eio_obj_t *obj, List objs) { /* sanity check */ xassert(NULL != obj ); if (obj->shutdown) { /* corresponding connection will be * cleaned up during plugin finalize */ return 0; } pmixp_conn_t *conn = (pmixp_conn_t *)obj->arg; bool proceed = true; /* debug stub */ pmixp_debug_hang(0); /* Read and process all received messages */ while (proceed) { if (!pmixp_conn_progress_rcv(conn)) { proceed = 0; } if (!pmixp_conn_is_alive(conn)) { obj->shutdown = true; PMIXP_DEBUG("Connection closed fd = %d", obj->fd); /* cleanup after this connection */ eio_remove_obj(obj, objs ); ======> clean obj pmixp_conn_return(conn); proceed = 0; } } The hypothesis is that as the call to _poll_handle_event(obj->ops->handle_read) is made before _poll_handle_event(obj->ops->handle_write) eio_remove_obj(obj, objs ) has removed obj which leads to SEGV when poll_handle_event(obj->ops->handle_write) try to handle it. An idea of patch is to remove eio_remove_obj(obj, objs ); in _serv_read as obj->shutdown should do it later. Same thing for _serv_write for coherency