| Summary: | Slurmctld SEGV in find_node_record() processing | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Bruno Faccini <bruno.faccini> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | bruno.faccini, da |
| Version: | 2.3.x | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | --- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Bruno Faccini
2012-02-20 20:08:54 MST
What is the configuration and/or execute line that can cause the SEGV? Regarding pack_all_nodes, the locking for that will occur in proc_req.c when the RPC is received, not in the underlying functions.

Oops, sorry, my mistake! Yes, you are right: the thread running pack_all_nodes() is using locking; it is the one running find_node_record() that doesn't. Here are the two respective backtraces found in the core file.

The crashing thread:

```
#0  0x00000036d3d24676 in __strcmp_sse42 () from /lib64/libc.so.6
#1  0x000000000050a8e8 in find_node_record (name=0x2b066c19de58 "curie1983") at node_conf.c:766
#2  0x000000000044cb59 in is_node_resp (name=0x2b066c19de58 "curie1983") at node_mgr.c:2400
#3  0x0000000000428c39 in _comm_err (args=0x2b066c1b9558) at agent.c:753
#4  _thread_per_group_rpc (args=0x2b066c1b9558) at agent.c:969
#5  0x00000036d40077e1 in start_thread () from /lib64/libpthread.so.0
#6  0x00000036d3ce573d in clone () from /lib64/libc.so.6
```

And the thread that had just executed pack_all_nodes():

```
#0  0x00000036d400ea5c in send () from /lib64/libpthread.so.0
#1  0x00000000004ca164 in _slurm_send_timeout (fd=7, buf=0x2b066f267b88 "\027", size=798367, flags=0, timeout=90000) at slurm_protocol_socket_implementation.c:310
#2  0x00000000004ca3c6 in _slurm_msg_sendto_timeout (fd=7, buffer=0x2b066f267b88 "\027", size=798367, flags=<value optimized out>, timeout=90000) at slurm_protocol_socket_implementation.c:232
#3  0x00000000004afff5 in slurm_send_node_msg (fd=7, msg=0x2b0371b19880) at slurm_protocol_api.c:2728
#4  0x000000000045b85a in _slurm_rpc_dump_nodes (msg=0x2b066c005cf8) at proc_req.c:1202
#5  0x00000000004618a6 in slurmctld_req (msg=0x2b066c005cf8) at proc_req.c:212
#6  0x000000000042b298 in _service_connection (arg=0x2b035400e9a8) at controller.c:1002
#7  0x00000036d40077e1 in start_thread () from /lib64/libpthread.so.0
#8  0x00000036d3ce573d in clone () from /lib64/libc.so.6
```

This should fix it:
https://github.com/SchedMD/slurm/commit/0a06f4e62a1e28d0332bdf092b33406e38a6dca4.patch

```diff
diff --git a/src/slurmctld/agent.c b/src/slurmctld/agent.c
index dbfb6b8..647e3f9 100644
--- a/src/slurmctld/agent.c
+++ b/src/slurmctld/agent.c
@@ -806,6 +806,9 @@ static void *_thread_per_group_rpc(void *args)
 	/* Locks: Write job, write node */
 	slurmctld_lock_t job_write_lock = {
 		NO_LOCK, WRITE_LOCK, WRITE_LOCK, NO_LOCK };
+	/* Lock: Read node */
+	slurmctld_lock_t node_read_lock = {
+		NO_LOCK, NO_LOCK, READ_LOCK, NO_LOCK };
 
 	xassert(args != NULL);
 	xsignal(SIGUSR1, _sig_handler);
@@ -875,8 +878,11 @@ static void *_thread_per_group_rpc(void *args)
 		if (slurm_send_only_node_msg(&msg) == SLURM_SUCCESS) {
 			thread_state = DSH_DONE;
 		} else {
-			if (!srun_agent)
+			if (!srun_agent) {
+				lock_slurmctld(node_read_lock);
 				_comm_err(thread_ptr->nodelist, msg_type);
+				unlock_slurmctld(node_read_lock);
+			}
 		}
 		goto cleanup;
 	}
@@ -966,8 +972,10 @@ static void *_thread_per_group_rpc(void *args)
 				errno = ret_data_info->err;
 			else
 				errno = rc;
+			lock_slurmctld(node_read_lock);
 			rc = _comm_err(ret_data_info->node_name,
 				       msg_type);
+			unlock_slurmctld(node_read_lock);
 		}
 		if (srun_agent)
 			thread_state = DSH_FAILED;
```