Ticket 12004

Summary: HealthCheckProgram behavior
Product: Slurm
Reporter: Matt Ezell <ezellma>
Component: slurmd
Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN ---
QA Contact:
Severity: 5 - Enhancement
Priority: ---
CC: alex, cinek, lyeager, mcoyne, sts
Version: 20.11.6   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=14721
Site: ORNL-OLCF
Attachments: patch0001, patch0002

Description Matt Ezell 2021-07-08 19:47:57 MDT
I'm working to integrate a health check program, and I would like to propose some minor changes for 21.08.

It makes sense to hardcode a 60-second limit for the periodic runs, but on boot we would like to run some longer-running diagnostics before the node registers. Would you consider a patch that allows a longer timeout (or no timeout at all) for the first run?

The documentation for HealthCheckProgram says: "This program will also be executed when the slurmd daemon is first started and before it registers with the slurmctld daemon." This is only true if HealthCheckInterval is non-zero. There might be situations where an admin wants to configure the program to run on startup but not periodically. Would you consider a patch to run on startup even if HealthCheckInterval is unset?
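
To make the two requests concrete, here is a minimal sketch in C of the decision logic being asked for. It is not taken from the attached patches or from the slurmd source. Only health_check_interval corresponds to a real slurm.conf parameter (HealthCheckInterval); the run_on_startup and startup_timeout_sec fields, the hc_conf_t type, and both function names are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* The 60-second limit for periodic runs described in this ticket. */
#define PERIODIC_TIMEOUT_SEC 60

/* Illustrative settings. Only health_check_interval mirrors a real
 * slurm.conf parameter; the other two fields are hypothetical. */
typedef struct {
	uint16_t health_check_interval; /* seconds; 0 disables periodic runs */
	bool run_on_startup;            /* hypothetical opt-in flag */
	uint32_t startup_timeout_sec;   /* hypothetical; 0 means no timeout */
} hc_conf_t;

/* Decide whether the health check program should run for this invocation.
 * Without the opt-in flag, a startup run happens only when the interval is
 * non-zero, which is the behavior reported above. */
static bool hc_should_run(const hc_conf_t *conf, bool is_startup)
{
	if (is_startup)
		return conf->run_on_startup || conf->health_check_interval != 0;
	return conf->health_check_interval != 0;
}

/* Timeout in seconds for this invocation; 0 means wait indefinitely.
 * Only an opted-in startup run gets the longer (or unlimited) timeout. */
static uint32_t hc_timeout_sec(const hc_conf_t *conf, bool is_startup)
{
	if (is_startup && conf->run_on_startup)
		return conf->startup_timeout_sec;
	return PERIODIC_TIMEOUT_SEC;
}
```

With both hypothetical fields left at zero/false, the sketch behaves as described above: no startup run unless the interval is non-zero, and a 60-second limit on every run.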
Comment 3 Matt Ezell 2021-07-09 10:50:36 MDT
Created attachment 20321: patch0001
Comment 4 Matt Ezell 2021-07-09 10:51:03 MDT
Created attachment 20322: patch0002
Comment 9 Marcin Stolarek 2021-07-14 02:23:40 MDT
Matt,

We are no longer accepting contributions for 21.08: we're close to the release and all new features are frozen.
The patch you shared hardcodes the behavioral change, which is not something we want, since it would force the new behavior on other sites. We generally prefer an approach with a switch that lets sites opt in to new behavior.

We do see room for improvement in the HealthCheck interface. That may happen in 22.05, and it would more likely be a more complete set of changes.

I'll go ahead and close the ticket now. Should you have more questions, please reopen it.

cheers,
Marcin
Comment 11 Marcin Stolarek 2021-07-14 11:24:15 MDT
Matt,

I reopened the case since we don't have other places where we can follow up on potential changes in this area.
Sorry for the communication noise.

cheers,
Marcin