Ticket 10507

Summary:	Node cannot go offline when the NHC script runs for more than 60 seconds
Product:	Slurm	Reporter:	menglong <meng_long_21>
Component:	Other	Assignee:	Tim Wickberg <tim>
Status:	OPEN ---	QA Contact:
Severity:	C - Contributions
Priority:	---
Version:	21.08.x
Hardware:	Linux
OS:	Linux
Site:	-Other-	Slinky Site:	---
Attachments:	modify nhc 60 hard code

Description menglong 2020-12-24 22:55:39 MST

Created attachment 17270
modify nhc 60 hard code

Dear all,
We rely on NHC to check the status of our cluster nodes, and our NHC script contains many check items. Under normal circumstances, the script takes about 20 s to run. We found that if its execution time exceeds 60 s (for example, 70-80 s) because the node is in an abnormal state, the node cannot go offline.
We tested and found that changing the hard-coded 60 (in run_script_health_check in slurmd.c) improves the situation:
from
run_script("health_check", conf->health_check_program, 0, 60, env, 0);
to
run_script("health_check", slurm_conf.health_check_program,
	   0, conf->health_check_interval, env, 0);
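
To illustrate the behavior described above, here is a minimal standalone sketch of a script runner with a timeout. It is not Slurm's run_script(); run_with_timeout() and health_check_timeout() are hypothetical helpers written for this example. One possible explanation, which this report does not confirm, is that a health check killed at the 60 s limit never reaches the step that takes the node offline. The fallback to 60 s when the interval is 0 is also an assumption, not part of the attached patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in seconds. */
static double now_sec(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Hypothetical helper (NOT Slurm's run_script()): run `sh -c cmd` and
 * kill it if it has not exited within timeout_sec seconds.
 * Returns the script's exit status, -1 on error or abnormal exit,
 * and -2 if the script was killed because it hit the timeout.
 */
static int run_with_timeout(const char *cmd, int timeout_sec)
{
	pid_t pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
		_exit(127);	/* exec failed */
	}

	double deadline = now_sec() + (double)timeout_sec;
	for (;;) {
		int status;
		pid_t done = waitpid(pid, &status, WNOHANG);
		if (done == pid)
			return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
		if (done < 0)
			return -1;
		if (now_sec() >= deadline) {
			/* Script overran: anything it had left to do is lost. */
			kill(pid, SIGKILL);
			waitpid(pid, &status, 0);
			return -2;
		}
		/* Poll every 50 ms. */
		struct timespec nap = { 0, 50 * 1000 * 1000 };
		nanosleep(&nap, NULL);
	}
}

/*
 * Assumed policy for this sketch: use the configured health check
 * interval as the timeout, and keep the old 60 s limit when the
 * interval is 0.
 */
static int health_check_timeout(int interval_sec)
{
	return interval_sec > 0 ? interval_sec : 60;
}
```

With this sketch, run_with_timeout("sleep 3", 1) returns -2 because the script is killed before it finishes, while run_with_timeout("sleep 3", health_check_timeout(300)) lets it complete and returns 0.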