| Summary: | Nodes drained by healthcheck on boot have state reset on slurmd registration | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Chris Samuel (NERSC) <csamuel> |
| Component: | slurmctld | Assignee: | Ben Glines <ben.glines> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | dmjacobsen, kilian |
| Version: | 23.02.1 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NERSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Chris Samuel (NERSC)
2023-05-02 22:40:54 MDT
Hi!
Just wanted to chime in to add that we're seeing the same thing.
> Is there a way to make this initial drain by the healthcheck stick please?
Or even better, would it be possible to make slurmd (optionally) run the HealthCheckProgram when it starts, before registering and putting back online? Maybe via an additional INIT state in HealthCheckNodeState?
Cheers,
--
Kilian
(In reply to Kilian Cavalotti from comment #1)
> Hi!
>
> Just wanted to chime in to add that we're seeing the same thing.

Hey Kilian!

> > Is there a way to make this initial drain by the healthcheck stick please?
>
> Or even better, would it be possible to make slurmd (optionally) run the
> HealthCheckProgram when it starts, before registering and putting back
> online? Maybe via an additional INIT state in HealthCheckNodeState?

I believe this is what's happening at the moment, as we run configless, so there's no way for scontrol and friends to work before slurmd starts and caches the config locally. The first run of the healthcheck happens before registration, but then the registration occurs and blats the node's status (presumably because it was rebooted with scontrol reboot ASAP nextstate=resume, which I just realised I failed to mention in my initial report, mea culpa).

All the best,
Chris

Hi Chris,

Using "ASAP" or "nextstate=resume" will overwrite any node state updates made by your HealthCheckProgram.

From the documentation (https://slurm.schedmd.com/scontrol.html#OPT_reboot):
> The node's "DRAIN" state flag will be cleared if the reboot was "ASAP", nextstate=resume or down.

This state clearing will overwrite anything the HealthCheckProgram does.

If you are already draining nodes before they are rebooted, then ASAP is not necessary: you are already preventing new jobs from being scheduled on the node, which effectively reboots it as soon as possible.

If you'd like the HealthCheckProgram to determine the node state after the reboot, then don't use nextstate=resume either. Using nextstate=resume will set the node to idle, no matter what your HealthCheckProgram sets it to.

Try just using `scontrol reboot` without these options and let me know if that works for you or if you're still seeing problems with this.

Hi Chris,

Do you have any questions about my last reply?

Closing this now. Feel free to reopen if you have further questions.
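
For reference, the two reboot invocations contrasted in this thread can be sketched as follows. This is only an illustration: `reboot_cmd` is a hypothetical helper that prints the command rather than running it, and `nid001234` is a placeholder node name, not one taken from this ticket.

```shell
#!/bin/sh
# Sketch only: prints the scontrol commands discussed in this ticket
# instead of executing them.

NODE="nid001234"   # placeholder node name (assumption)

# Hypothetical helper: assemble a reboot command line.
# $1 = node list; any remaining arguments are reboot options
# such as ASAP or nextstate=resume, which precede the node list.
reboot_cmd() {
    nodes="$1"
    shift
    if [ "$#" -gt 0 ]; then
        printf 'scontrol reboot %s %s\n' "$*" "$nodes"
    else
        printf 'scontrol reboot %s\n' "$nodes"
    fi
}

# As reported: the DRAIN flag is cleared after the reboot, so a drain
# set by the HealthCheckProgram on boot does not stick.
reboot_cmd "$NODE" ASAP nextstate=resume

# As suggested: no ASAP and no nextstate, leaving the node state set by
# the HealthCheckProgram in place after the reboot.
reboot_cmd "$NODE"
```

Piping the printed line to a shell on a host with Slurm admin rights would run it; whether the plain form is sufficient depends on the nodes already being drained before the reboot, as noted in the reply.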