Ticket 17

Summary: Slurmctld SEGV in find_node_record() processing
Product: Slurm Reporter: Bruno Faccini <bruno.faccini>
Component: slurmctld    Assignee: Moe Jette <jette>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: bruno.faccini, da
Version: 2.3.x   
Hardware: Linux   
OS: Linux   

Description Bruno Faccini 2012-02-20 20:08:54 MST
Slurmctld went down with a SEGV in the strcmp() routine called directly from find_node_record(): the node_hash_table[] entry being scanned had an unexpected NULL .name field, even though the node_record itself looks otherwise valid in the core file!

BTW, there was another thread in the pack_all_nodes() path which momentarily sets the .name of "hidden" nodes (including the one in question!) to NULL, without any protection against concurrent access.

So it is likely we are facing a race when accessing node_hash_table[] elements, which needs to be fixed by protecting/locking each node_record element!
Comment 1 Moe Jette 2012-02-21 02:43:47 MST
What is the configuration and/or execute line that can cause the SEGV?

Regarding pack_all_nodes, the locking for that will occur in proc_req.c when the RPC is received and not in the underlying functions.
Comment 2 Bruno Faccini 2012-02-21 03:53:53 MST
Oops, sorry, my mistake!

Yes, you are right: the thread running pack_all_nodes() does use locking; it is the one running find_node_record() that doesn't!

Here are the two threads' respective backtraces found in the core file:
==============================================================================================
#0  0x00000036d3d24676 in __strcmp_sse42 () from /lib64/libc.so.6
#1  0x000000000050a8e8 in find_node_record (name=0x2b066c19de58 "curie1983") at node_conf.c:766
#2  0x000000000044cb59 in is_node_resp (name=0x2b066c19de58 "curie1983") at node_mgr.c:2400
#3  0x0000000000428c39 in _comm_err (args=0x2b066c1b9558) at agent.c:753
#4  _thread_per_group_rpc (args=0x2b066c1b9558) at agent.c:969
#5  0x00000036d40077e1 in start_thread () from /lib64/libpthread.so.0
#6  0x00000036d3ce573d in clone () from /lib64/libc.so.6

and the thread that had just executed pack_all_nodes():
===========================================================
#0  0x00000036d400ea5c in send () from /lib64/libpthread.so.0
#1  0x00000000004ca164 in _slurm_send_timeout (fd=7, buf=0x2b066f267b88 "\027", size=798367, flags=0, timeout=90000) at slurm_protocol_socket_implementation.c:310
#2  0x00000000004ca3c6 in _slurm_msg_sendto_timeout (fd=7, buffer=0x2b066f267b88 "\027", size=798367, flags=<value optimized out>, timeout=90000) at slurm_protocol_socket_implementation.c:232
#3  0x00000000004afff5 in slurm_send_node_msg (fd=7, msg=0x2b0371b19880) at slurm_protocol_api.c:2728
#4  0x000000000045b85a in _slurm_rpc_dump_nodes (msg=0x2b066c005cf8) at proc_req.c:1202
#5  0x00000000004618a6 in slurmctld_req (msg=0x2b066c005cf8) at proc_req.c:212
#6  0x000000000042b298 in _service_connection (arg=0x2b035400e9a8) at controller.c:1002
#7  0x00000036d40077e1 in start_thread () from /lib64/libpthread.so.0
#8  0x00000036d3ce573d in clone () from /lib64/libc.so.6


==============================================================================================
Comment 3 Moe Jette 2012-02-24 08:14:45 MST
This should fix it:
https://github.com/SchedMD/slurm/commit/0a06f4e62a1e28d0332bdf092b33406e38a6dca4.patch


diff --git a/src/slurmctld/agent.c b/src/slurmctld/agent.c
index dbfb6b8..647e3f9 100644
--- a/src/slurmctld/agent.c
+++ b/src/slurmctld/agent.c
@@ -806,6 +806,9 @@ static void *_thread_per_group_rpc(void *args)
 	/* Locks: Write job, write node */
 	slurmctld_lock_t job_write_lock = {
 		NO_LOCK, WRITE_LOCK, WRITE_LOCK, NO_LOCK };
+	/* Lock: Read node */
+	slurmctld_lock_t node_read_lock = {
+		NO_LOCK, NO_LOCK, READ_LOCK, NO_LOCK };
 
 	xassert(args != NULL);
 	xsignal(SIGUSR1, _sig_handler);
@@ -875,8 +878,11 @@ static void *_thread_per_group_rpc(void *args)
 		if (slurm_send_only_node_msg(&msg) == SLURM_SUCCESS) {
 			thread_state = DSH_DONE;
 		} else {
-			if (!srun_agent)
+			if (!srun_agent) {
+				lock_slurmctld(node_read_lock);
 				_comm_err(thread_ptr->nodelist, msg_type);
+				unlock_slurmctld(node_read_lock);
+			}
 		}
 		goto cleanup;
 	}
@@ -966,8 +972,10 @@ static void *_thread_per_group_rpc(void *args)
 					errno = ret_data_info->err;
 				else
 					errno = rc;
+				lock_slurmctld(node_read_lock);
 				rc = _comm_err(ret_data_info->node_name,
 					       msg_type);
+				unlock_slurmctld(node_read_lock);
 			}
 			if (srun_agent)
 				thread_state = DSH_FAILED;