Ticket 14721

Summary: node health check configuration to avoid administrative actions in user jobs
Product: Slurm
Reporter: mike coyne <mcoyne>
Component: Configuration
Assignee: Tim Wickberg <tim>
Status: OPEN
Severity: 5 - Enhancement
CC: bsantos, cinek, kauffman, mej, nate, randall.white, sts, tim, tsskinner, vhafener
Version: 22.05.2
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=13477
          https://bugs.schedmd.com/show_bug.cgi?id=12004
Site: LANL
Linux Distro: SUSE
Machine Name: chicoma, trinity (Cray XC and EX/Shasta)
Attachments: node health check time out patch

Description mike coyne 2022-08-10 09:09:20 MDT
I would like to execute the health check program between jobs; currently it is run via the epilog on our Cray machines. If I set:
HealthCheckInterval=3600
HealthCheckNodeState=CYCLE,IDLE
HealthCheckProgram=/usr/bin/not_so_lightweight_node_check

will this execute it between one job's epilog and the next one's prolog, once per hour? And if I have a job that lasts, say, 16 hours, will it run after the completion of the long 16-hour job but before the next one?

I need assistance with configuring this; please note the issues we are seeing in https://bugs.schedmd.com/show_bug.cgi?id=13477.
...

HealthCheckInterval
    The interval in seconds between executions of HealthCheckProgram. The default value is zero, which disables execution. 

HealthCheckNodeState
    Identify what node states should execute the HealthCheckProgram. Multiple state values may be specified with a comma separator. The default value is ANY to execute on nodes in any state. 

        ALLOC
            Run on nodes in the ALLOC state (all CPUs allocated). 

        ANY
            Run on nodes in any state. 

        CYCLE
            Rather than running the health check program on all nodes at the same time, cycle through running on all compute nodes through the course of the HealthCheckInterval. May be combined with the various node state options. 

        IDLE
            Run on nodes in the IDLE state. 

        MIXED
            Run on nodes in the MIXED state (some CPUs idle and other CPUs allocated). 

HealthCheckProgram
    Fully qualified pathname of a script to execute as user root periodically on all compute nodes that are not in the NOT_RESPONDING state. This program may be used to verify the node is fully operational and DRAIN the node or send email if a problem is detected. Any action to be taken must be explicitly performed by the program (e.g. execute "scontrol update NodeName=foo State=drain Reason=tmp_file_system_full" to drain a node). The execution interval is controlled using the HealthCheckInterval parameter. Note that the HealthCheckProgram will be executed at the same time on all nodes to minimize its impact upon parallel programs. This program will be killed if it does not terminate normally within 60 seconds. This program will also be executed when the slurmd daemon is first started and before it registers with the slurmctld daemon. By default, no program will be executed.
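
For illustration, a minimal HealthCheckProgram along these lines might look like the sketch below (the 95% threshold, the drain reason, and the use of df/hostname are illustrative assumptions, not a recommended check):

```shell
#!/bin/sh
# Hypothetical health check sketch: drain the node when /tmp crosses a usage
# threshold, mirroring the "tmp_file_system_full" example above.
usage=$(df --output=pcent /tmp | tail -n 1 | tr -dc '0-9')
if [ "${usage:-0}" -ge 95 ]; then
    # Any action must be taken explicitly by the program:
    scontrol update NodeName="$(hostname -s)" State=DRAIN \
        Reason="tmp_file_system_full"
fi
```

Keep such a script well under the 60-second limit noted above, since slurmd kills a HealthCheckProgram that does not terminate within 60 seconds.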
Comment 1 Nate Rini 2022-08-10 10:13:31 MDT
Is this happening on a machine where you can share slurmd/slurmctld logs?
Comment 2 mike coyne 2022-08-10 10:23:54 MDT
(In reply to Nate Rini from comment #1)
> Is this happening on a machine where you can share slurmd/slurmctld logs?

Nate, we have seen a couple of instances that we are trying to chase down on our Shasta system in the open, so we might be able to share some of that info. I don't have logs to send you yet, though; I am in a holding pattern waiting for it to happen again on chicoma, if it is going to.

Question: will setting up the node health check script work like this? I do have a patch to increase its timeout value.

We had another incident on our secure XC machine, but it seems linked to a power issue that ended up requiring a full reboot of the cluster, so I am writing that one off.

I am trying to be proactive and see if I can move the administrative tasks not directly related to the job out of the prolog and epilog.
Comment 4 Nate Rini 2022-08-10 10:45:50 MDT
(In reply to mike coyne from comment #2)
> Question: will setting up the node health check script work like this? I do
> have a patch to increase its timeout value.
Please attach the patch so I can test with it.
 
> I am trying to be proactive and see if I can move the administrative tasks
> not directly related to the job out of the prolog and epilog
understood

(In reply to mike coyne from comment #0)
> will this execute it between one job's epilog and the next one's prolog,
> once per hour? And if I have a job that lasts, say, 16 hours, will it run
> after the completion of the long 16-hour job but before the next one?
The health_check is called by the cron-like _slurmctld_background() thread in slurmctld, which filters the node list down to nodes that are not DOWN. There is currently no logic to ensure that it is executed between jobs; it may happen to run between jobs, but nothing guarantees it. Depending on how busy slurmctld is at the time, it is even less likely to execute between jobs.
Comment 5 mike coyne 2022-08-10 10:50:07 MDT
Created attachment 26263
node health check time out patch

This is from 21.08.6+; I have not ported it to 22.05 yet.
Comment 10 Nate Rini 2022-08-11 10:11:19 MDT
(In reply to Nate Rini from comment #4)

Please note that bug#12004 is currently open to improve the functionality of health_check. I suggest taking a look.

If the goal is to ensure that the health check runs after every job, then calling the health check script via the Epilog is the suggested route. Since the health check script is already required to request that the node be closed itself when it finds a problem, it should run there without any issues.
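
As a sketch of that route, an Epilog wrapper might bound the health check with an explicit timeout so a hung check downs only that node instead of wedging the epilog (the /usr/sbin/nhc path and the 300-second bound are illustrative assumptions):

```shell
#!/bin/sh
# Hypothetical Epilog sketch: run the health check after every job, bounded by
# an explicit timeout so a hung check cannot stall job teardown indefinitely.
NHC=/usr/sbin/nhc
if [ -x "$NHC" ]; then
    # The health check is expected to drain the node itself on failure,
    # so the epilog only records the outcome.
    timeout 300 "$NHC" || logger -t slurm-epilog \
        "health check failed or timed out on $(hostname -s)"
fi
```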
Comment 11 mike coyne 2022-08-11 10:23:23 MDT
(In reply to Nate Rini from comment #10)

We have users that are running very short jobs; the node health check runs longer than the jobs do. Also, I am trying to remove/isolate admin tasks that can, and on occasion do, hang, out of the prolog and epilog, so we don't get jobs stuck in epilog or prolog for an extended time. I am not as good as I would like to be at finding jobs that are stuck in completing so that I can run an scontrol reconfigure when the issue is a communication/munge cert timeout similar to the prolog hangs. I will take a look.

I will also look at that other bug you mentioned.
Comment 12 mike coyne 2022-08-11 10:38:03 MDT
(In reply to mike coyne from comment #11)

It would be nice to have a configurable option for the health check, and for its timeout, when called from both the node health check and from the resume-nodes script; that would be easier than carrying a patch. I have not tried to port the change to 22.05 with your new run_command functions quite yet, as I was not sure the health check and administrative commands (such as ansible runs) would function as we need. The goal is to avoid running during a job, to minimize jitter to running MPI jobs, but in a separate script (the node health check?), so that if it times out or hangs it will down the node rather than hang up sequences of job launches on what may be functional hardware.
Mike
Comment 15 Tim Wickberg 2022-08-25 15:19:06 MDT
Hey Mike -

I think I get where you're coming from here - you need something periodically handling health-checks on the node, but Epilog runs are too frequent, and NHC may interrupt work inadvertently due to the current de-coupling between job launches and the NHC runs.

I do think that's worth addressing, but I don't want to tackle it in isolation. It's become clear from discussions with a few labs that the current HealthCheck interfaces are inadequate. Bug 12004 has some other discussion related to this.

I'm unfortunately out of development time in the 23.02 release cycle, but would like to get something in place for 23.11 with the various interested parties. Are you / LANL willing to help review some design aspects of a new NHC integration (with some corresponding changes to the state management to avoid races such as the one described here) in a few months?

thanks,
- Tim
Comment 16 mike coyne 2022-08-25 15:26:54 MDT
(In reply to Tim Wickberg from comment #15)

Tim, I will do whatever I can to help; I will forward this on to my supervisor and see if I can get more traction.
Mike
Comment 17 Michael Jennings 2022-08-29 17:48:51 MDT
While I agree with Mike (and Matt) that the health check configuration in Slurm could benefit from some additional capabilities, I'll also note a few NHC features that can be helpful in these situations:

- Detached Mode (https://github.com/mej/nhc#detached-mode):  This allows NHC to fork() and return immediately to Slurm, avoiding any timeout issues.  Results from the backgrounded tests are returned on the *next* execution of NHC.  This is particularly useful when NHC is run at a (relatively) short interval but should *not* be used in Prolog/Epilog scripts (for obvious reasons). :-)

- Contexts:  If NHC is invoked with a unique context name (either on the command line, like "nhc -n nhc-slurm," or executed as a different name via a symlink or similar, e.g. "ln -sf nhc /usr/sbin/nhc-slurm" and "/usr/sbin/nhc-slurm"), an entirely different config file is read by default.  This context's config file (e.g., "/etc/nhc/nhc-slurm.conf") can have its own separate settings and everything.  (Note to @mcoyne:  I can certainly provide examples of this that I've used, but maybe the best example at this point is Graham's "CHC" work -- It's just NHC with a specific context!)

- Short-circuiting:  For situations where using Detached Mode is untenable for whatever reason, NHC's execution can be short-circuited inside its own config file; something like "check_cmd_status -r 1 pgrep -ax slurmstepd" as the 1st check will cause NHC to "fail" quickly if any slurmstepd processes are running.  Similar tricks can be done using "check_ps_unauth_users syslog die" (https://github.com/mej/nhc#check_ps_unauth_users) with careful configuration....
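
For illustration, the detached-mode and short-circuit ideas above might be wired up roughly like this (file paths, the DETACHED_MODE variable, and the check line are sketched from memory of the NHC README; verify against the documentation before use):

```
# /etc/sysconfig/nhc -- detached-mode sketch
DETACHED_MODE=1     # fork, return to Slurm immediately, report on the next run

# /etc/nhc/nhc.conf -- short-circuit sketch, placed as the very first check.
# pgrep exits 1 when no slurmstepd is running, so the check passes on an idle
# node and fails fast whenever job steps are active:
* || check_cmd_status -r 1 pgrep -ax slurmstepd
```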

@mcoyne:  I'm happy to discuss options with you for temporary use until SchedMD is able to provide a more featureful health check interface. :-)
Comment 19 mike coyne 2023-03-06 16:35:28 MST
Any update on this effort?
Comment 20 Tim Wickberg 2023-03-07 12:54:01 MST
(In reply to mike coyne from comment #19)
> Any update on this effort?

Hey Mike -

Nope, unfortunately. I do have some time now that 23.02 is out, and would love to pick this back up with some changes targeted for 23.11.

Let me see if I can get an overview written up of what I'd expect to change with some evolution of the NHC integration, and attach it here ~ next week for discussion.

@MEJ - has development for NHC picked back up in any capacity?

- Tim
Comment 21 Michael Jennings 2023-03-09 13:06:14 MST
(In reply to Tim Wickberg from comment #20)
> @MEJ - has development for NHC picked back up in any capacity?

Yes, substantially.  Now that I've (finally!) made it past all the legal nonsense and have full approval for myself and my teammates to contribute publicly, I'm doing quite a bit of work each week to get NHC moving forward again *and* to get the internal LANL development on NHC (it's been ongoing this whole time) merged in, genericized/tidied up, and pushed up to GitHub.  I'm also turning all my personal/private TODO tasks into GitHub Issues to make the roadmap more public as well.

Obviously, since I'm not 100% full-time dedicated to NHC development, it will still take some time to get caught up.  But I have total support from both my team and group leaders as well as the leadership of the Systems group.  Right now, as you know, pretty much all of HPC is swamped with Crossroads & Friends; however, I've already provided training sessions for several Platforms Team folks (and will continue to do so), and a number of new Shasta-focused checks have been assigned to other folks on the team (Alden, Steve Johnson, Travis Cotton, Graham, et al.) to develop and test.

I've also been asked by my superiors to come up with ways to promote NHC and "spread the word" that it's up and going again, which is quite promising in and of itself. :-)

So lots of good news on the NHC front!  And if there's anything I can do to help with or contribute to this effort, I'd be happy to!  I'm confident I'd have no trouble getting the go-ahead from both Mike's management and my own to help out with this.
Comment 22 mike coyne 2024-02-15 08:03:12 MST
Is there any word on this?
Comment 23 mike coyne 2024-04-29 16:21:16 MDT
(In reply to mike coyne from comment #22)
> Is there any word on this?

Still waiting?
Comment 24 Tim Wickberg 2024-04-29 16:24:11 MDT
Hey Mike -

I unfortunately haven't had any time to sit down and work through this. I do think there's a lot of room for improvement, but I have been putting out fires elsewhere for the past few releases.

I'm open to suggestions. If there's a clear set of requests for a modern interface from MEJ or others, that's likely to be a bit easier for us to work through.

- Tim