| Summary: | Resume and Suspend scripts not working | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Abhimanyu Saurot <abhimanyu.saurot> |
| Component: | Other | Assignee: | Broderick Gardner <broderick> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 22.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | ASML | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Abhimanyu Saurot 2022-06-21 09:20:35 MDT

Broderick Gardner:
Do you get any logs in your power_save log file? Does the SlurmUser (defined in slurm.conf, often 'slurm') have passwordless sudo? More of a long shot, but also try logging the environment variables, or at least the PATH, to make sure scontrol can be found. Also, sudo does not forward the caller's environment variables, so if node_shutdown is in a non-standard place, it may not be found in the PATH. Thanks

Abhimanyu Saurot:
Hi Team, that was just an example, and it is not working. In short, even a simple script (performing only redirection) is not working. I tried the below script as well:

#!/bin/bash
# Example ResumeProgram
echo "$(date) Resume invoked $0 $*" >> /var/log/power_save.log
hosts=$(scontrol show hostnames "$1")
for host in $hosts
do
    echo "In for loop for $host" >> /var/log/power_save.log
done

Below is the error we see in slurmctld.log:

[2022-06-21T17:15:49.122] error: power_save: program exit status of 1

Regards,
Abhimanyu Saurot

Broderick Gardner:
Does the slurm user have permission to create /var/log/power_save.log? Try running the script yourself as slurm:

sudo -u slurm /etc/slurm/SuspendProg

Abhimanyu Saurot:
Yes, slurm has permission to create the file. I ran the script on the controller node as below:

[root@node1 ~]# sudo -u slurm /etc/slurm/SuspendProg
scontrol: error: host list is empty

Could you help with what the input to scontrol should be in this case?

Regards,
Abhimanyu Saurot

Broderick Gardner:
Oh right, I should have mentioned that SuspendProg expects a node name as an argument.

Abhimanyu Saurot:
Should Slurm detect that automatically? Or should we use an scontrol command to find the idle host and feed it to the suspend program?

Regards,
Abhimanyu Saurot

Broderick Gardner:
Right: when the slurmctld calls the resume and suspend programs, it passes a hostlist of the nodes to be resumed or suspended. Your test program was correct; the test run as slurm was wrong because it didn't pass a node to the script.

Abhimanyu Saurot:
Sounds logical. However, Slurm fails to resume the node when I try to run a job after Slurm suspends it:
[root@node1 ~]# tail -5 /var/log/slurmctld.log
[2022-06-22T09:21:10.610] sched/backfill: _start_job: Started JobId=32 in debug on node2
[2022-06-22T09:23:11.648] node node2 not resumed by ResumeTimeout(120) - marking down and power_save
[2022-06-22T09:23:11.648] Killing JobId=32 on failed node node2
[root@node1 ~]# tail -100 /var/log/power_save.log
Wed Jun 22 09:20:15 CEST 2022 Suspend invoked /etc/slurm/SuspendProg node2
In the for loop node2
Wed Jun 22 09:21:10 CEST 2022 Resume invoked /etc/slurm/ResumeProg node2
In for loop for node2
[root@node1 ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
32 debug uanme root CF 1:17 1 node2
[root@node1 ~]# srun uanme -a
srun: Required node not available (down, drained or reserved)
srun: job 32 queued and waiting for resources
srun: job 32 has been allocated resources
srun: error: Node failure on node2
srun: error: Nodes node2 are still not ready
srun: error: Something is wrong with the boot of the nodes.
Eventually Slurm marks the server as powered down:
[root@node1 ~]# scontrol show node node2
NodeName=node2 Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUEfctv=20 CPUTot=20 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node2 NodeHostName=node2 Version=22.05.2
OS=Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020
RealMemory=128832 AllocMem=0 FreeMem=127077 Sockets=2 Boards=1
State=DOWN+POWERED_DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2022-06-21T13:08:43 SlurmdStartTime=2022-06-22T09:19:44
LastBusyTime=Unknown
CfgTRES=cpu=20,mem=128832M,billing=20
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=ResumeTimeout reached [slurm@2022-06-22T09:23:11]
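
The ResumeTimeout(120) in the log and in the node's Reason field comes from slurm.conf. A minimal sketch of the power-save settings involved; only the program paths and ResumeTimeout=120 are confirmed by this ticket, the other timer values are illustrative:

```
# slurm.conf power-save sketch (timer values illustrative, not the site's actual config)
SuspendProgram=/etc/slurm/SuspendProg
ResumeProgram=/etc/slurm/ResumeProg
SuspendTime=300       # idle seconds before SuspendProgram is invoked
SuspendTimeout=30     # seconds allowed for a node to finish powering down
ResumeTimeout=120     # slurmd must check in within this many seconds of ResumeProgram
```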
Broderick Gardner:
Okay, so your SuspendProg and ResumeProg are running from Slurm, unless that output in /var/log/power_save.log is from running them manually. This line:

[2022-06-22T09:23:11.648] node node2 not resumed by ResumeTimeout(120) - marking down and power_save

means that the node never resumed. That is not a Slurm problem; Slurm ran the ResumeProgram. The slurmd on the resumed node must check in to the slurmctld before the configured timeout (ResumeTimeout) has elapsed. Have you verified your script for starting up a node?

Abhimanyu Saurot (auto-reply):
Hi, thanks for your email. I am out of office and may not respond to your emails.

Regards,
Abhimanyu Saurot

Abhimanyu Saurot:
Hi Broderick, the suspend and resume programs are nothing but simple scripts that write to a log file. I am not sure why the slurmd on the node is unable to communicate with the slurmctld.

[root@node1 ~]# cat /etc/slurm/ResumeProg
#!/bin/bash
# Example ResumeProgram
echo "$(date) Resume invoked $0 $*" >> /var/log/power_save.log
hosts=$(scontrol show hostnames "$1")
for host in $hosts
do
    echo "In for loop for $host" >> /var/log/power_save.log
done

[root@node1 ~]# cat /etc/slurm/SuspendProg
#!/bin/bash
# Example SuspendProgram
echo "$(date) Suspend invoked $0 $*" >> /var/log/power_save.log
hosts=$(scontrol show hostnames "$1")
for host in $hosts
do
    echo "In the for loop $host" >> /var/log/power_save.log
done

Am I missing something in the flow of power saving mode? Maybe we can have a call to close this faster?

Regards,
Abhimanyu Saurot

Broderick Gardner:
What are you trying to accomplish, exactly? If the resume script doesn't actually start up the node, how could it start up? You can do it manually if you want. The slurmd node must reboot between suspend and resume; the controller will ignore the slurmd otherwise. The suspend and resume programs are how you implement power save mode; Slurm can't do it itself. It runs the suspend script and expects the node to disappear. That node is now "powered down".
If that node is scheduled, Slurm runs the resume program and waits for the slurmd to check in. When it checks in, the node's last boot time must be later than the time the resume program was run.

Broderick Gardner:
Have you been able to resolve your problem? Do you have any other questions?

Broderick Gardner:
Timing out and resolving. Please reopen or create a new ticket if you have more questions. Thanks
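
The logging-only scripts in this ticket never power anything on, which is why node2 hits ResumeTimeout. A minimal sketch of the shape a working ResumeProgram could take; `resume_nodes` is a hypothetical helper, and the ipmitool invocation in the comment is only one possible power-on mechanism (ssh to a management host or a cloud API call are equally valid), so treat everything here as an assumption to adapt per site:

```shell
#!/bin/bash
# Sketch of a ResumeProgram that actually starts nodes. resume_nodes takes
# the hostlist-expansion command and the power-on command as parameters so
# the power-on mechanism can be swapped out per site.
resume_nodes() {
    local expand=$1 power_on=$2
    # Expand the hostlist Slurm passes (e.g. "node[2-3]") into single hosts.
    for host in $($expand); do
        # A real script would run something like:
        #   ipmitool -I lanplus -H "${host}-bmc" -U admin -P pass chassis power on
        $power_on "$host" || echo "power-on failed for $host" >&2
    done
}

# Dry-run demo with stand-ins; a real ResumeProgram would call it as:
#   resume_nodes "scontrol show hostnames $1" my_power_on_wrapper
resume_nodes 'echo node2 node3' 'echo POWER ON'
```

The key point from the thread still applies: whatever the power-on command is, the node's slurmd must boot and check in to the slurmctld within ResumeTimeout, or the node is marked down.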