I'm working to integrate a health check program, and I would like to propose some minor changes for 21.08.

It makes sense to hardcode a 60-second limit for the periodic runs, but on boot we would like to run some longer-running diagnostics before the node registers. Would you consider a patch that allows a longer timeout (or no timeout at all) for the first run?

The documentation for HealthCheckProgram says: "This program will also be executed when the slurmd daemon is first started and before it registers with the slurmctld daemon." This is only true if HealthCheckInterval is non-zero. There might be situations when an admin wants to configure the program to run on startup but not regularly. Would you consider a patch to run on startup even if HealthCheckInterval is unset?
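To make the discrepancy concrete, here is a minimal sketch of the two startup conditions as plain predicates. It only models the behavior described in this comment; the function names are hypothetical and are not taken from the Slurm source.

```c
#include <stdbool.h>

/* Hypothetical model only: these names do not come from the Slurm source. */

/* Behavior reported in this ticket: the startup run happens only when
 * HealthCheckProgram is set AND HealthCheckInterval is non-zero. */
bool runs_at_startup_reported(bool program_set, unsigned interval_secs)
{
    return program_set && interval_secs != 0;
}

/* Behavior as the documentation reads: the startup run depends only on
 * HealthCheckProgram being configured. */
bool runs_at_startup_documented(bool program_set, unsigned interval_secs)
{
    (void) interval_secs; /* the interval plays no role in this reading */
    return program_set;
}
```

The two predicates differ only when the program is configured and the interval is zero, which is exactly the "run on startup but not regularly" case.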
Created attachment 20321 [details] patch0001
Created attachment 20322 [details] patch0002
Matt,

We no longer accept contributions to 21.08: we're close to the release and all new features are frozen.

The patch you shared hardcodes the behavioral change, which is not something we want, since it would force that change on other sites. We generally prefer an approach with a switch that lets sites opt in to the new behavior.

We do see room for improvement in the HealthCheck interface, but that may happen in 22.05 and would rather be a more complete set of changes.

I'll go ahead and close the ticket now. Should you have more questions, please reopen.

cheers,
Marcin
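As an illustration of the opt-in approach, the sketch below gates both requested changes behind settings that default to the existing behavior. This is an assumption-laden model, not Slurm code: the struct, its fields, and the function names are hypothetical, and only the 60-second default comes from this ticket.

```c
#include <stdbool.h>

#define DEFAULT_TIMEOUT_SECS 60 /* the 60-second limit mentioned in this ticket */

/* Hypothetical opt-in settings; neither field is an existing Slurm option. */
struct hc_opts {
    bool startup_without_interval; /* opt-in: run at startup even if interval is 0 */
    unsigned startup_timeout_secs; /* opt-in: 0 means "keep the default limit" */
};

/* Startup run: unchanged unless the site opts in. */
bool hc_runs_at_startup(const struct hc_opts *o, unsigned interval_secs)
{
    return interval_secs != 0 || o->startup_without_interval;
}

/* Timeout: only the first run may use the opt-in value;
 * periodic runs always keep the default. */
unsigned hc_timeout_secs(const struct hc_opts *o, bool first_run)
{
    if (first_run && o->startup_timeout_secs != 0)
        return o->startup_timeout_secs;
    return DEFAULT_TIMEOUT_SECS;
}
```

With a zero-initialized `struct hc_opts`, both functions reproduce the behavior described in this ticket, so sites that do not set the switch see no change.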
Matt,

I reopened the case since we don't have other places where we can follow up on potential changes in this area. Sorry for the communication noise.

cheers,
Marcin