| Summary: | Resume and Suspend scripts not working | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Abhimanyu Saurot <abhimanyu.saurot> |
| Component: | Other | Assignee: | Broderick Gardner <broderick> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 22.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | ASML | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Abhimanyu Saurot 2022-06-21 09:20:35 MDT

Broderick Gardner:
Do you get any logs in your power_save log file? Does the SlurmUser (defined in slurm.conf, often 'slurm') have passwordless sudo? More of a long shot, but also try logging the environment variables, or at least the PATH, to make sure scontrol can be found. Also, sudo does not forward the caller's environment variables, so if node_shutdown is in a non-standard place, it may not be found in the PATH. Thanks

Abhimanyu Saurot:
Hi Team, that was just an example, and it is not working. In short, even a simple script (performing only redirection) is not working. I tried the below script as well:

#!/bin/bash
# Example ResumeProgram
echo "$(date) Resume invoked $0 $*" >> /var/log/power_save.log
hosts=$(scontrol show hostnames "$1")
for host in $hosts
do
    echo "In for loop for $host" >> /var/log/power_save.log
done

Below is the error we see in slurmctld.log:

[2022-06-21T17:15:49.122] error: power_save: program exit status of 1

Regards,
Abhimanyu Saurot

Broderick Gardner:
Does the slurm user have permission to create /var/log/power_save.log? Try running the script yourself as slurm:

sudo -u slurm /etc/slurm/SuspendProg

Abhimanyu Saurot:
Yes, slurm has permission to create the file. I ran the script on the controller node as below:

[root@node1 ~]# sudo -u slurm /etc/slurm/SuspendProg
scontrol: error: host list is empty

Could you help with what the input to scontrol should be in this case?

Regards,
Abhimanyu Saurot

Broderick Gardner:
Oh right, I should have mentioned that SuspendProg expects a node name as an argument.

Abhimanyu Saurot:
Should Slurm detect that automatically? Or should we use an scontrol command to find the idle host and feed it to the suspend program?

Regards,
Abhimanyu Saurot

Broderick Gardner:
Right: when the slurmctld calls the resume and suspend programs, it passes a hostlist of the nodes to be resumed or suspended. Your test program was correct; the test run as slurm was wrong because it didn't pass a node to the script.

Abhimanyu Saurot:
Sounds logical. However, Slurm fails to resume the node when I try to run a job after Slurm suspends it:
[root@node1 ~]# tail -5 /var/log/slurmctld.log
[2022-06-22T09:21:10.610] sched/backfill: _start_job: Started JobId=32 in debug on node2
[2022-06-22T09:23:11.648] node node2 not resumed by ResumeTimeout(120) - marking down and power_save
[2022-06-22T09:23:11.648] Killing JobId=32 on failed node node2
[root@node1 ~]# tail -100 /var/log/power_save.log
Wed Jun 22 09:20:15 CEST 2022 Suspend invoked /etc/slurm/SuspendProg node2
In the for loop node2
Wed Jun 22 09:21:10 CEST 2022 Resume invoked /etc/slurm/ResumeProg node2
In for loop for node2
[root@node1 ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
32 debug uanme root CF 1:17 1 node2
[root@node1 ~]# srun uanme -a
srun: Required node not available (down, drained or reserved)
srun: job 32 queued and waiting for resources
srun: job 32 has been allocated resources
srun: error: Node failure on node2
srun: error: Nodes node2 are still not ready
srun: error: Something is wrong with the boot of the nodes.
Eventually Slurm marks the server as powered down:
[root@node1 ~]# scontrol show node node2
NodeName=node2 Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUEfctv=20 CPUTot=20 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node2 NodeHostName=node2 Version=22.05.2
OS=Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020
RealMemory=128832 AllocMem=0 FreeMem=127077 Sockets=2 Boards=1
State=DOWN+POWERED_DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2022-06-21T13:08:43 SlurmdStartTime=2022-06-22T09:19:44
LastBusyTime=Unknown
CfgTRES=cpu=20,mem=128832M,billing=20
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=ResumeTimeout reached [slurm@2022-06-22T09:23:11]
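
The ResumeTimeout(120) in the log and in the node's Reason field comes from slurm.conf. A minimal sketch of the power-save settings involved; only the program paths and ResumeTimeout=120 are confirmed by this ticket, the other timer values are illustrative:

```
# slurm.conf power-save sketch (timer values illustrative, not the site's actual config)
SuspendProgram=/etc/slurm/SuspendProg
ResumeProgram=/etc/slurm/ResumeProg
SuspendTime=300       # idle seconds before SuspendProgram is invoked
SuspendTimeout=30     # seconds allowed for a node to finish powering down
ResumeTimeout=120     # slurmd must check in within this many seconds of ResumeProgram
```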
Broderick Gardner:
Okay, so your SuspendProg and ResumeProg are running from Slurm, unless that output in /var/log/power_save.log is from running them manually. This line:

[2022-06-22T09:23:11.648] node node2 not resumed by ResumeTimeout(120) - marking down and power_save

means that the node never resumed. That is not a Slurm problem; Slurm ran the ResumeProgram. The slurmd on the resumed node must check in to the slurmctld before the configured timeout (ResumeTimeout) has elapsed. Have you verified your script for starting up a node?

Abhimanyu Saurot (auto-reply):
Hi, thanks for your email. I am out of office and may not respond to your emails.

Regards,
Abhimanyu Saurot

Abhimanyu Saurot:
Hi Broderick, the suspend and resume programs are nothing but simple scripts that write to a log file. I am not sure why the slurmd on the node is unable to communicate with the slurmctld.

[root@node1 ~]# cat /etc/slurm/ResumeProg
#!/bin/bash
# Example ResumeProgram
echo "$(date) Resume invoked $0 $*" >> /var/log/power_save.log
hosts=$(scontrol show hostnames "$1")
for host in $hosts
do
    echo "In for loop for $host" >> /var/log/power_save.log
done

[root@node1 ~]# cat /etc/slurm/SuspendProg
#!/bin/bash
# Example SuspendProgram
echo "$(date) Suspend invoked $0 $*" >> /var/log/power_save.log
hosts=$(scontrol show hostnames "$1")
for host in $hosts
do
    echo "In the for loop $host" >> /var/log/power_save.log
done

Am I missing something in the flow of power saving mode? Maybe we can have a call to close this faster?

Regards,
Abhimanyu Saurot

Broderick Gardner:
What are you trying to accomplish, exactly? If the resume script doesn't actually start up the node, how could it start up? You can do it manually if you want. The slurmd node must reboot between suspend and resume; the controller will ignore the slurmd otherwise. The suspend and resume programs are how you implement power save mode; Slurm can't do it itself. It runs the suspend script and expects the node to disappear. That node is now "powered down".
If that node is scheduled, Slurm runs the resume program and waits for the slurmd to check in. When it checks in, the node's last boot time must be later than the time the resume program was run.

Broderick Gardner:
Have you been able to resolve your problem? Do you have any other questions?

Broderick Gardner:
Timing out and resolving. Please reopen or create a new ticket if you have more questions. Thanks
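
The logging-only scripts in this ticket never power anything on, which is why node2 hits ResumeTimeout. A minimal sketch of the shape a working ResumeProgram could take; `resume_nodes` is a hypothetical helper, and the ipmitool invocation in the comment is only one possible power-on mechanism (ssh to a management host or a cloud API call are equally valid), so treat everything here as an assumption to adapt per site:

```shell
#!/bin/bash
# Sketch of a ResumeProgram that actually starts nodes. resume_nodes takes
# the hostlist-expansion command and the power-on command as parameters so
# the power-on mechanism can be swapped out per site.
resume_nodes() {
    local expand=$1 power_on=$2
    # Expand the hostlist Slurm passes (e.g. "node[2-3]") into single hosts.
    for host in $($expand); do
        # A real script would run something like:
        #   ipmitool -I lanplus -H "${host}-bmc" -U admin -P pass chassis power on
        $power_on "$host" || echo "power-on failed for $host" >&2
    done
}

# Dry-run demo with stand-ins; a real ResumeProgram would call it as:
#   resume_nodes "scontrol show hostnames $1" my_power_on_wrapper
resume_nodes 'echo node2 node3' 'echo POWER ON'
```

The key point from the thread still applies: whatever the power-on command is, the node's slurmd must boot and check in to the slurmctld within ResumeTimeout, or the node is marked down.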