| Summary: | The node cannot go offline when the nhc script is executed for more than 60 seconds | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | menglong <meng_long_21> |
| Component: | Other | Assignee: | Tim Wickberg <tim> |
| Status: | OPEN --- | QA Contact: | |
| Severity: | C - Contributions | ||
| Priority: | --- | ||
| Version: | 21.08.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | modify nhc 60 hard code | ||
Created attachment 17270
modify nhc 60 hard code

Dear all,

We rely on NHC to check the status of our cluster nodes, and our NHC script contains many checks. Under normal circumstances the script takes about 20 s to run. We found that when a node is in an abnormal state and the NHC run takes longer than 60 s (70-80 s in our case), the node cannot go offline.

We tested and found that replacing the hard-coded 60 in run_script_health_check (slurmd.c) improves the situation. We changed

run_script("health_check", conf->health_check_program, 0, 60, env, 0);

to

run_script("health_check", slurm_conf.health_check_program, 0, conf->health_check_interval, env, 0);
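
For readers unfamiliar with the pattern, here is a minimal sketch of why a fixed limit might produce this behaviour. It assumes that the fourth argument of run_script is a maximum wait in seconds and that the script is killed once the limit is reached; this is inferred from the behaviour described above, not confirmed against the Slurm source. The helper run_with_timeout is hypothetical and is not Slurm code.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not part of Slurm): run `cmd` through /bin/sh and
 * wait at most timeout_sec seconds for it to finish.
 * Returns the script's exit code, or -1 on timeout or any other failure. */
int run_with_timeout(const char *cmd, int timeout_sec)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: own process group, so the whole script tree can be killed. */
        setpgid(0, 0);
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127); /* exec failed */
    }

    /* Parent: repeat setpgid to close the race with the child; an error
     * here (child already exec'd) is harmless. */
    setpgid(pid, pid);

    time_t deadline = time(NULL) + timeout_sec;
    struct timespec pause = { 0, 100 * 1000 * 1000 }; /* 100 ms poll */
    int status;

    for (;;) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r < 0)
            return -1;
        if (time(NULL) >= deadline) {
            /* Limit reached: kill the script before it can finish, so any
             * action at the end of the script never runs. */
            kill(-pid, SIGKILL);
            waitpid(pid, &status, 0);
            return -1;
        }
        nanosleep(&pause, NULL);
    }
}
```

With a helper like this, run_with_timeout("sleep 3", 1) is cut off and reports failure, while the same command with a limit of 5 completes normally. Under the stated assumption, a health check that needs 70-80 s would be cut off in the same way by a 60 s limit, which would explain why a configurable limit helps.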