Ticket 10507 - The node cannot go offline when the nhc script is executed for more than 60 seconds
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 21.08.x
Hardware: Linux Linux
Severity: C - Contributions
Assignee: Tim Wickberg
Reported: 2020-12-24 22:55 MST by menglong
Modified: 2020-12-24 22:55 MST
Site: -Other-


Attachments
modify nhc 60 hard code (414 bytes, text/plain)
2020-12-24 22:55 MST, menglong

Description menglong 2020-12-24 22:55:39 MST
Created attachment 17270
modify nhc 60 hard code

Dear all,
We rely on nhc to check the status of our cluster nodes, and our nhc script contains many checks. Under normal circumstances the script takes 20 s to run. We found that if its execution time exceeds 60 s (70-80 s in our case) because the node is in an abnormal state, the node cannot go offline.
We tested and found that replacing the hard-coded 60 in run_script_health_check (slurmd.c) improves the situation:
from
run_script("health_check", conf->health_check_program, 0, 60, env, 0);
to
run_script("health_check", slurm_conf.health_check_program,
           0, conf->health_check_interval, env, 0);
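
For readers unfamiliar with this kind of limit, the behaviour described above can be illustrated with a minimal sketch. It assumes that the fourth argument of run_script is a maximum wait in seconds after which the script is killed; that reading is an assumption based on the report, and the code below is not Slurm code. The helper name run_with_timeout and its return convention are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Slurm): run a program and wait at most
 * max_wait seconds for it. Returns the program's exit status, or -1 if it
 * had to be killed at the deadline, ended on a signal, or could not start.
 */
int run_with_timeout(const char *path, char *const argv[], int max_wait)
{
    assert(path != NULL && argv != NULL && max_wait >= 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(path, argv);
        _exit(127);             /* exec failed */
    }

    const time_t deadline = time(NULL) + max_wait;
    const struct timespec pause = { 0, 100 * 1000 * 1000 };  /* 100 ms */
    int status = 0;

    for (;;) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r < 0)
            return -1;
        if (time(NULL) >= deadline) {
            /* Limit reached: the script is killed before it can finish,
             * so any final action it would have taken never happens. */
            kill(pid, SIGKILL);
            waitpid(pid, &status, 0);
            return -1;
        }
        nanosleep(&pause, NULL);
    }
}
```

With a fixed limit, a script that normally finishes in time returns its exit status, while one that overruns is killed part-way through. If the same thing happens to an nhc run that needs 70-80 s under a 60 s limit, the step that takes the node offline would never be reached, which would match the symptom reported here. Taking the limit from configuration, as the proposed change does, lets sites with long check lists raise it.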