Ticket 17

Summary: Slurmctld SEGV in find_node_record() processing
Product: Slurm Reporter: Bruno Faccini <bruno.faccini>
Component: slurmctld    Assignee: Moe Jette <jette>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: bruno.faccini, da
Version: 2.3.x   
Hardware: Linux   
OS: Linux   

Description Bruno Faccini 2012-02-20 20:08:54 MST
Slurmctld went down with a SEGV in the strcmp() routine called directly from find_node_record(): the node_hash_table[] entry being scanned had an unexpected NULL .name field, even though the node_record itself looks otherwise valid in the core file!

BTW, there was another thread in the pack_all_nodes() path which momentarily sets the .name of "hidden" nodes (including the one in question!) to NULL, without any protection against concurrent access.

So it is likely we are facing a race when accessing node_hash_table[] elements, which needs to be fixed by protecting/locking each node_record element!
Comment 1 Moe Jette 2012-02-21 02:43:47 MST
What is the configuration and/or execute line that can cause the SEGV?

Regarding pack_all_nodes, the locking for that will occur in proc_req.c when the RPC is received and not in the underlying functions.
Comment 2 Bruno Faccini 2012-02-21 03:53:53 MST
Oops, sorry, my mistake!

Yes, you are right: the thread running pack_all_nodes() does use locking; it is the one running find_node_record() that doesn't!

Here are the two threads' respective backtraces found in the core file:
==============================================================================================
#0  0x00000036d3d24676 in __strcmp_sse42 () from /lib64/libc.so.6
#1  0x000000000050a8e8 in find_node_record (name=0x2b066c19de58 "curie1983") at node_conf.c:766
#2  0x000000000044cb59 in is_node_resp (name=0x2b066c19de58 "curie1983") at node_mgr.c:2400
#3  0x0000000000428c39 in _comm_err (args=0x2b066c1b9558) at agent.c:753
#4  _thread_per_group_rpc (args=0x2b066c1b9558) at agent.c:969
#5  0x00000036d40077e1 in start_thread () from /lib64/libpthread.so.0
#6  0x00000036d3ce573d in clone () from /lib64/libc.so.6

and the thread that had just executed pack_all_nodes():
===========================================================
#0  0x00000036d400ea5c in send () from /lib64/libpthread.so.0
#1  0x00000000004ca164 in _slurm_send_timeout (fd=7, buf=0x2b066f267b88 "\027", size=798367, flags=0, timeout=90000) at slurm_protocol_socket_implementation.c:310
#2  0x00000000004ca3c6 in _slurm_msg_sendto_timeout (fd=7, buffer=0x2b066f267b88 "\027", size=798367, flags=<value optimized out>, timeout=90000) at slurm_protocol_socket_implementation.c:232
#3  0x00000000004afff5 in slurm_send_node_msg (fd=7, msg=0x2b0371b19880) at slurm_protocol_api.c:2728
#4  0x000000000045b85a in _slurm_rpc_dump_nodes (msg=0x2b066c005cf8) at proc_req.c:1202
#5  0x00000000004618a6 in slurmctld_req (msg=0x2b066c005cf8) at proc_req.c:212
#6  0x000000000042b298 in _service_connection (arg=0x2b035400e9a8) at controller.c:1002
#7  0x00000036d40077e1 in start_thread () from /lib64/libpthread.so.0
#8  0x00000036d3ce573d in clone () from /lib64/libc.so.6


==============================================================================================
Comment 3 Moe Jette 2012-02-24 08:14:45 MST
This should fix it:
https://github.com/SchedMD/slurm/commit/0a06f4e62a1e28d0332bdf092b33406e38a6dca4.patch


diff --git a/src/slurmctld/agent.c b/src/slurmctld/agent.c
index dbfb6b8..647e3f9 100644
--- a/src/slurmctld/agent.c
+++ b/src/slurmctld/agent.c
@@ -806,6 +806,9 @@ static void *_thread_per_group_rpc(void *args)
 	/* Locks: Write job, write node */
 	slurmctld_lock_t job_write_lock = {
 		NO_LOCK, WRITE_LOCK, WRITE_LOCK, NO_LOCK };
+	/* Lock: Read node */
+	slurmctld_lock_t node_read_lock = {
+		NO_LOCK, NO_LOCK, READ_LOCK, NO_LOCK };
 
 	xassert(args != NULL);
 	xsignal(SIGUSR1, _sig_handler);
@@ -875,8 +878,11 @@ static void *_thread_per_group_rpc(void *args)
 		if (slurm_send_only_node_msg(&msg) == SLURM_SUCCESS) {
 			thread_state = DSH_DONE;
 		} else {
-			if (!srun_agent)
+			if (!srun_agent) {
+				lock_slurmctld(node_read_lock);
 				_comm_err(thread_ptr->nodelist, msg_type);
+				unlock_slurmctld(node_read_lock);
+			}
 		}
 		goto cleanup;
 	}
@@ -966,8 +972,10 @@ static void *_thread_per_group_rpc(void *args)
 					errno = ret_data_info->err;
 				else
 					errno = rc;
+				lock_slurmctld(node_read_lock);
 				rc = _comm_err(ret_data_info->node_name,
 					       msg_type);
+				unlock_slurmctld(node_read_lock);
 			}
 			if (srun_agent)
 				thread_state = DSH_FAILED;