Ticket 12806

Summary: Cloud machines being removed from "DRAIN" state when state is set to power_down
Product: Slurm Reporter: Marshall <marshall.adrian>
Component: Cloud Assignee: Skyler Malinowski <skyler>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: nick
Version: 20.11.8   
Hardware: Linux   
OS: Linux   
Site: SiFive Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Showing the machine in drain, and then no longer being in drain state after command is issued
slurm.conf
gres.conf
acct_gather.conf
plugstack.conf
cgroup.conf

Description Marshall 2021-11-02 12:47:05 MDT
Created attachment 22083 [details]
Showing the machine in drain, and then no longer being in drain state after command is issued

We have noticed that when we drain our cloud machines for maintenance they will eventually have jobs start running on them. After doing some testing I can see that as soon as a power_down state is issued they will no longer be set in drain.

Is this expected behavior? If so, how can we prevent this from happening?

Thanks!
Comment 1 Jason Booth 2021-11-02 13:23:37 MDT
Can you attach your *.conf files from this cluster? 

Based on your description, it sounds like you have static nodes?


> We have noticed that when we drain our cloud machines for maintenance they will eventually have jobs start running on them.

If these are static nodes then why not just drain the nodes for maintenance?
Comment 3 Jason Booth 2021-11-02 13:32:30 MDT
I wanted to address your question directly while I wait on your config to better understand your use case.


Currently, when you power_down a node, the power down won't happen until the node is idle. This is similar to how rebooting nodes works (a node won't reboot until jobs are off it). Jobs can continue to be scheduled on the node and push the reboot request out. You can use "reboot ASAP" so that the node reboots after the currently running jobs are done.

In 21.08 we landed several changes here to improve this behavior. 

bug#11538, specifically bug#11538 comment#4

Please see that bug for details. I am including a snippet of the change below so that you see the relevant bits of information.


In 21.08:

 -- Added power_down_asap and power_down_force power-down states for scontrol,
e.g. scontrol update nodename=<> state=power_down_asap

power_down - queue up the node to be powered down when the node is free. Jobs can
             continue to land on the node until it powers down.

power_down_asap - queue up the node to be powered down and put the node in a drain
                  state. This makes it so no more jobs are scheduled on the node, and
                  the node will power down after the currently running jobs are done.

power_down_force - cancel jobs, requeue if possible, and power down the node. This state can also be
                   used to cancel a powering-up node and reset it back to powered down.
Comment 4 Marshall 2021-11-02 13:55:03 MDT
Created attachment 22090 [details]
slurm.conf
Comment 5 Marshall 2021-11-02 14:00:02 MDT
Created attachment 22091 [details]
gres.conf
Comment 6 Marshall 2021-11-02 14:00:26 MDT
Created attachment 22092 [details]
acct_gather.conf
Comment 7 Marshall 2021-11-02 14:01:08 MDT
Created attachment 22093 [details]
plugstack.conf
Comment 8 Marshall 2021-11-02 14:04:14 MDT
Created attachment 22094 [details]
cgroup.conf
Comment 9 Marshall 2021-11-02 14:07:22 MDT
Hi I have attached the conf files.

Yes, that is correct, we have static nodes. We are trying to drain them for maintenance, but they are put back in the "running/resume" state when a power_down command is issued.
Comment 10 Marshall 2021-11-02 14:09:39 MDT
So these nodes are brought up via a power_up command so they will be spun up, but after the time limit expires Slurm issues a power_down command, which takes them out of the drain state, and jobs are then launched on them.
Comment 11 Marshall 2021-11-05 13:40:26 MDT
Any update on this one?
Comment 12 Marshall 2021-11-05 13:42:17 MDT
This problem created a massive job failure situation: a machine was not set up correctly, and when we tried to drain it, it kept getting put back into the pool, and any job landing on it failed. So this is not a minor issue...
Comment 13 Skyler Malinowski 2021-11-05 15:31:23 MDT
I am unable to reproduce the issue. In my testing, the node stayed drained even after a power_down (state=drain~) and was not scheduled on. Perhaps there is something else going on in your environment.

Would you please provide the slurmctld.log capturing the event? Additionally, could you provide the SuspendProgram, or otherwise inspect it for any commands that would alter the node state (e.g. `scontrol update nodename`)?

-Skyler
Comment 14 Marshall 2021-11-05 16:53:49 MDT
Thanks, yes. I have updated the SuspendProgram and ResumeProgram to be /bin/true to rule those scripts out as the source of the issue. Even with that, I still see the node moving from drain~ to idle~...

Here is the controller log...

[2021-11-05T15:50:09.499] update_node: node slurmtest-node00 reason set to: test
[2021-11-05T15:50:09.499] update_node: node slurmtest-node00 state set to DRAINED%
[2021-11-05T15:50:10.397] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2021-11-05T15:50:10.397] debug:  sched: Running job scheduler
[2021-11-05T15:50:20.213] power down request repeating for node slurmtest-node00
[2021-11-05T15:50:20.395] power_save: pid 20994 suspending nodes slurmtest-node00
[2021-11-05T15:50:20.403] debug:  sched: Running job scheduler
[2021-11-05T15:50:34.237] debug:  sched/backfill: _attempt_backfill: beginning
[2021-11-05T15:50:34.237] debug:  sched/backfill: _attempt_backfill: no jobs to backfill
Comment 16 Skyler Malinowski 2021-11-08 15:12:24 MST
Thank you for ruling out the suspend and resume programs. I am now able to reproduce the issue, albeit only in a cloud environment against 20.11.7. I will be working towards reproducing this in a local environment as I look into a solution/patch.

I also noticed that the example node, theta165, is not in the suspend exclusion list. So a workaround in the meantime is to temporarily add nodes undergoing maintenance to that list. This should prevent the SuspendProgram from erroneously running against them while they are DRAINING and setting them to IDLE~.
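As a rough sketch of that workaround (the node name is taken from the example above; adjust for your cluster), the exclusion is set in slurm.conf:

```
# slurm.conf (sketch): exclude the node under maintenance from suspension
SuspendExcNodes=theta165
```

After editing, `scontrol reconfigure` should pick up the change without a controller restart.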
Comment 17 Marshall 2021-11-08 15:50:57 MST
Great! Thanks for the hint and will await a fix!
Comment 18 Skyler Malinowski 2021-11-09 09:37:35 MST
After looking at the code path, your issue is related to the intended behavior of `SlurmctldParameters=idle_on_node_suspend`. The rationale for that parameter is that cloud nodes are ephemeral and disposable. In the cloud, it is cheaper to create a new machine than to maintain one manually via conventional means, so a drained cloud node is as good as an idle one in terms of doing work. Such nodes should be suspended when idle long enough, and that parameter helps return them to a state that Slurm is allowed to schedule to.

You have a few options:
1. Change your process/practice for maintaining cloud nodes. Depending on the type of maintenance being done, deploying new images or using network mounts could help.
2. Temporarily add nodes to SuspendExcNodes when draining cloud nodes for manual maintenance.
3. Stop using `SlurmctldParameters=idle_on_node_suspend`.
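For illustration, option 3 amounts to dropping that flag from the SlurmctldParameters line in slurm.conf (the `cloud_dns` value here is a hypothetical placeholder for whatever other parameters your line carries):

```
# before
SlurmctldParameters=idle_on_node_suspend,cloud_dns
# after: drained cloud nodes keep their DRAIN flag across suspend
SlurmctldParameters=cloud_dns
```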

I will be lowering the severity of this ticket as there are multiple ways to work around this issue.
Comment 19 Skyler Malinowski 2021-11-10 13:28:40 MST
Updating ticket status.

If you have any further questions, please do not hesitate to ask.

Best,
Skyler