Ticket 10507

Summary: The node cannot go offline when the nhc script is executed for more than 60 seconds
Product: Slurm Reporter: menglong <meng_long_21>
Component: Other Assignee: Tim Wickberg <tim>
Status: OPEN --- QA Contact:
Severity: C - Contributions    
Priority: ---    
Version: 21.08.x   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: modify nhc 60 hard code

Description menglong 2020-12-24 22:55:39 MST
Created attachment 17270
modify nhc 60 hard code

Dear all,
We rely on NHC to check the status of our cluster nodes, and our NHC script contains many check items. Under normal circumstances, the script takes 20 s to run. We found that if its execution time exceeds 60 s (70-80 s) because the node is in an abnormal state, the node cannot go offline.
In our tests, replacing the hard-coded 60 (in run_script_health_check in slurmd.c) improved the situation:
from
run_script("health_check", conf->health_check_program, 0, 60, env, 0);
to
run_script("health_check", slurm_conf.health_check_program,
				0, conf->health_check_interval, env, 0);
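
To illustrate the idea behind the change, here is a minimal standalone sketch of choosing the timeout from the configured interval instead of a fixed 60. It is not Slurm code: the helper health_check_timeout_secs and its parameters are hypothetical, and treating an interval of 0 as "not configured" is an assumption made only for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of Slurm: pick the timeout (in seconds)
 * to pass to whatever runs the health check script.
 *
 * interval_secs stands in for the configured health check interval.
 * fallback_secs is used when the interval is 0, which this sketch
 * treats as "not configured" (an assumption).
 */
static int health_check_timeout_secs(uint16_t interval_secs, int fallback_secs)
{
	/* The fallback must be a usable positive timeout. */
	assert(fallback_secs > 0);

	if (interval_secs == 0)
		return fallback_secs;

	return (int) interval_secs;
}
```

With an interval of 300 and a fallback of 60, the helper returns 300, so a script that needs 70-80 s would finish inside the timeout. With an interval of 0 it returns 60, which matches the value currently hard-coded.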