Ticket 16637

Summary: Nodes drained by healthcheck on boot have state reset on slurmd registration
Product: Slurm Reporter: Chris Samuel (NERSC) <csamuel>
Component: slurmctld Assignee: Ben Glines <ben.glines>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: dmjacobsen, kilian
Version: 23.02.1   
Hardware: Linux   
OS: Linux   
Site: NERSC

Description Chris Samuel (NERSC) 2023-05-02 22:40:54 MDT
Hi there,

Our healthcheck script runs a number of checks and drains nodes when it finds failures. However, if it finds a problem on boot it does successfully drain the node, but that drain appears to get overwritten/reset when slurmd registers.

For instance, here's a node with a dodgy GPU: it boots up, drains itself like it's supposed to, then gets reset to allow jobs onto it, gets a job running on it, and the next time the health check runs it drains the node again.
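For reference, the drain that our healthcheck issues looks something like this (the node name and reason string below are illustrative, matching the log excerpt):

```shell
# Drain the node so no new jobs are scheduled on it; the reason string
# records which health check failed.
scontrol update NodeName=nid002377 State=DRAIN \
    Reason="health:nvidia:PRB0040594:remapped_row_failure:GPU-0"
```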

slurmctld:/ # fgrep nid002377 /var/log/slurmctld.log
[....]
[2023-05-03T04:27:05.376] sched/backfill: _handle_planned: BACKFILL: set: nid002377 state is REBOOT^
[2023-05-03T04:27:28.308] sched/backfill: _handle_planned: BACKFILL: cleared: nid002377 state is REBOOT^
[2023-05-03T04:27:58.399] sched/backfill: _handle_planned: BACKFILL: set: nid002377 state is REBOOT^
[2023-05-03T04:28:05.553] sched/backfill: _handle_planned: BACKFILL: cleared: nid002377 state is REBOOT^
[2023-05-03T04:28:08.972] update_node: node nid002377 reason set to: health:nvidia:PRB0040594:remapped_row_failure:GPU-0
[2023-05-03T04:28:08.972] update_node: node nid002377 state set to REBOOT^
[2023-05-03T04:28:13.434] Node nid002377 rebooted 700 secs ago
[2023-05-03T04:28:13.435] Node nid002377 now responding
[2023-05-03T04:28:13.435] node nid002377 returned to service
[2023-05-03T04:28:35.647] sched/backfill: _handle_planned: BACKFILL: set: nid002377 state is IDLE
[2023-05-03T04:28:55.289] sched/backfill: _dump_job_sched: JobId=8221873 to start at 2023-05-06T13:00:00, end at 2023-05-07T13:00:00 on nodes nid002377 in partition gpu_ss11
[2023-05-03T04:28:55.896] sched/backfill: _handle_planned: BACKFILL: cleared: nid002377 state is PLANNED
[2023-05-03T04:29:25.969] sched/backfill: _handle_planned: BACKFILL: set: nid002377 state is IDLE
[2023-05-03T04:29:50.579] sched/backfill: _dump_job_sched: JobId=8221873 to start at 2023-05-06T13:00:00, end at 2023-05-07T13:00:00 on nodes nid002377 in partition gpu_ss11
[2023-05-03T04:29:51.403] sched/backfill: _handle_planned: BACKFILL: cleared: nid002377 state is PLANNED
[2023-05-03T04:30:21.496] sched/backfill: _handle_planned: BACKFILL: set: nid002377 state is IDLE
[2023-05-03T04:30:38.280] sched/backfill: _dump_job_sched: JobId=7773118 to start at 2023-05-06T13:00:00, end at 2023-05-07T13:00:00 on nodes nid002377 in partition gpu_ss11
[2023-05-03T04:30:38.316] sched/backfill: _handle_planned: BACKFILL: cleared: nid002377 state is PLANNED
[2023-05-03T04:31:08.435] sched/backfill: _handle_planned: BACKFILL: set: nid002377 state is IDLE
[2023-05-03T04:31:22.561] sched/backfill: _handle_planned: BACKFILL: cleared: nid002377 state is ALLOCATED
[2023-05-03T04:32:08.979] sched/backfill: _handle_planned: BACKFILL: cleared: nid002377 state is ALLOCATED
[2023-05-03T04:32:32.924] update_node: node nid002377 reason set to: health:nvidia:PRB0040594:remapped_row_failure:GPU-0
[2023-05-03T04:32:32.924] update_node: node nid002377 state set to DRAINING

Is there a way to make this initial drain by the healthcheck stick please?

All the best,
Chris
Comment 1 Kilian Cavalotti 2023-05-03 14:44:09 MDT
Hi!

Just wanted to chime in to add that we're seeing the same thing.

> Is there a way to make this initial drain by the healthcheck stick please?

Or even better, would it be possible to make slurmd (optionally) run the HealthCheckProgram when it starts, before registering and bringing the node back online? Maybe via an additional INIT state in HealthCheckNodeState?
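For context, a minimal sketch of the relevant slurm.conf settings as they exist today (values illustrative); HealthCheckNodeState currently only selects which already-registered node states get checked, with nothing that fires before registration:

```
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300
HealthCheckNodeState=ANY
```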

Cheers,
--
Kilian
Comment 2 Chris Samuel (NERSC) 2023-05-03 17:00:20 MDT
(In reply to Kilian Cavalotti from comment #1)

> Hi!
> 
> Just wanted to chime in to add that we're seeing the same thing.

Hey Kilian!

> > Is there a way to make this initial drain by the healthcheck stick please?
> 
> Or even better, would it be possible to make slurmd (optionally) run the
> HealthCheckProgram when it starts, before registering and putting back
> online? Maybe via an additional INIT state in HealthCheckNodeState?

I believe this is what's happening at the moment: we run configless, so there's no way for scontrol and friends to work before slurmd starts and caches the config locally. The first run of the healthcheck happens before registration, but then the registration occurs and blats the node's status (presumably because the node was rebooted with scontrol reboot ASAP nextstate=resume, which I've just realised I failed to mention in my initial report, mea culpa).
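Concretely, the reboot was issued with something along these lines (node list and reason illustrative):

```shell
# ASAP drains the node until it can reboot; nextstate=resume returns it
# to service on registration, clearing any DRAIN set in the meantime.
scontrol reboot ASAP nextstate=resume reason="rolling reboot" nid002377
```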

All the best,
Chris
Comment 3 Ben Glines 2023-05-12 09:25:40 MDT
Hi Chris,

Using "ASAP" or "nextstate=resume" will overwrite any node state updates made by your HealthCheckProgram.

From documentation (https://slurm.schedmd.com/scontrol.html#OPT_reboot):
> The node's "DRAIN" state flag will be cleared if the reboot was "ASAP", nextstate=resume or down.
This state clearing will overwrite anything the HealthCheckProgram does.

If you are already draining nodes before they are rebooted, then ASAP is not necessary: you are already preventing new jobs from being scheduled on the node, so the reboot will effectively happen as soon as possible anyway.

If you'd like the HealthCheckProgram to determine the node state after the reboot, then don't use nextstate=resume either. Using nextstate=resume will set the node to idle, no matter what your HealthCheckProgram sets it to.

Try just using `scontrol reboot` without these options and let me know if that works for you or if you're still seeing problems with this.
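For example (node list and reason illustrative):

```shell
# Without ASAP or nextstate, the reboot does not clear the node's DRAIN
# flag, so a drain set by the HealthCheckProgram should stick.
scontrol reboot reason="scheduled reboot" nid002377
```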
Comment 4 Ben Glines 2023-05-22 08:51:06 MDT
Hi Chris,

Do you have any questions about my last reply?
Comment 5 Ben Glines 2023-06-08 11:06:36 MDT
Closing this now. Feel free to reopen if you have further questions.