| Summary: | Slurm 20.11: Powered up CLOUD nodes are mistakenly marked as unexpectedly rebooted and put to DOWN | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Francesco De Martino <fdm> |
| Component: | Cloud | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 6 - No support contract | | |
| Priority: | --- | | |
| Version: | 20.11.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=18372 | | |
| Site: | -Other- | Linux Distro: | --- |
| Version Fixed: | 20.11.1 21.08.0pre1 | Target Release: | --- |
| Attachments: | logs + config | | |
Thanks Francesco! This is fixed in 20.11.1: https://github.com/SchedMD/slurm/commit/466c2878ba6ef3724f97a9a7c3c775c9992fb9dc |
Created attachment 16845 [details]
logs + config

In our cluster we use the Cloud Scheduling logic to provision EC2 capacity based on the scheduler workload. After upgrading from Slurm 20.02.5 to Slurm 20.11.0, we identified a regression that breaks the cloud bursting plugin: when a given CLOUD node is powered up a second time (after it has already gone through a full POWER_UP/POWER_DOWN cycle), the scheduler sees it as unexpectedly rebooted and marks it DOWN. Below are the steps to reproduce and the relevant log entries.

Cloud-related config settings from slurm.conf (full config is attached):

```
# CLOUD CONFIGS OPTIONS
SlurmctldParameters=idle_on_node_suspend,cloud_dns
CommunicationParameters=NoAddrCache
SuspendProgram=/opt/parallelcluster/scripts/slurm/slurm_suspend
ResumeProgram=/opt/parallelcluster/scripts/slurm/slurm_resume
ResumeFailProgram=/opt/parallelcluster/scripts/slurm/slurm_suspend
SuspendTimeout=120
ResumeTimeout=3600
PrivateData=cloud
ResumeRate=0
SuspendRate=0
SuspendTime=600
...
NodeName=ondemand-st-c5xlarge-[1-1] CPUs=2 State=CLOUD Feature=static,c5.xlarge,ondemand_i1
NodeName=ondemand-dy-c5xlarge-[1-9] CPUs=2 State=CLOUD Feature=dynamic,c5.xlarge,ondemand_i1
NodeName=ondemand-dy-g38xlarge-[1-10] CPUs=16 State=CLOUD Feature=dynamic,g3.8xlarge,ondemand_i2,gpu Gres=gpu:tesla:2
NodeSet=ondemand_nodes Nodes=ondemand-st-c5xlarge-[1-1],ondemand-dy-c5xlarge-[1-9],ondemand-dy-g38xlarge-[1-10]
PartitionName=ondemand Nodes=ondemand_nodes MaxTime=INFINITE State=UP Default=YES SuspendExcNodes=ondemand-st-c5xlarge-[1-1]
```

How to reproduce:

```
[ec2-user@ip-10-0-0-114 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
ondemand*    up   infinite     19  idle~ ondemand-dy-c5xlarge-[1-9],ondemand-dy-g38xlarge-[1-10]
ondemand*    up   infinite      1   idle ondemand-st-c5xlarge-1

[ec2-user@ip-10-0-0-114 ~]$ sbatch -w ondemand-dy-c5xlarge-3 --wrap "hostname"
Submitted batch job 12

# ResumeProgram launches an instance and updates the NodeAddr and the NodeHostName
[ec2-user@ip-10-0-0-114 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
ondemand*    up   infinite      1   mix# ondemand-dy-c5xlarge-3
ondemand*    up   infinite      1  idle# ondemand-dy-c5xlarge-5
ondemand*    up   infinite     17  idle~ ondemand-dy-c5xlarge-[1-2,4,6-9],ondemand-dy-g38xlarge-[1-10]
ondemand*    up   infinite      1   idle ondemand-st-c5xlarge-1

[ec2-user@ip-10-0-0-114 ~]$ scontrol show nodes ondemand-dy-c5xlarge-3
NodeName=ondemand-dy-c5xlarge-3 CoresPerSocket=1
   CPUAlloc=1 CPUTot=2 CPULoad=N/A
   AvailableFeatures=dynamic,c5.xlarge,ondemand_i1
   ActiveFeatures=dynamic,c5.xlarge,ondemand_i1
   Gres=(null)
   NodeAddr=10.0.27.43 NodeHostName=ondemand-dy-c5xlarge-3
   RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=MIXED#+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ondemand
   BootTime=None SlurmdStartTime=None
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=cpu=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

# Job completes successfully
[ec2-user@ip-10-0-0-114 ~]$ scontrol show jobs 12
JobId=12 JobName=wrap
   UserId=ec2-user(1000) GroupId=ec2-user(1000) MCS_label=N/A
   Priority=4294901757 Nice=0 Account=(null) QOS=(null)
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=365-00:00:00 TimeMin=N/A
   SubmitTime=2020-11-26T10:49:01 EligibleTime=2020-11-26T10:49:01
   AccrueTime=2020-11-26T10:49:01
   StartTime=2020-11-26T10:50:57 EndTime=2020-11-26T10:50:58 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-26T10:49:01
   Partition=ondemand AllocNode:Sid=ip-10-0-0-114:27294
   ReqNodeList=ondemand-dy-c5xlarge-3 ExcNodeList=(null)
   NodeList=ondemand-dy-c5xlarge-3
   BatchHost=ondemand-dy-c5xlarge-3
   NumNodes=1 NumCPUs=1 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/ec2-user
   StdErr=/home/ec2-user/slurm-12.out
   StdIn=/dev/null
   StdOut=/home/ec2-user/slurm-12.out
   Power=
   NtasksPerTRES:0

# Wait for the SuspendTime and for the node to be powered down
[ec2-user@ip-10-0-0-114 ~]$ scontrol show nodes ondemand-dy-c5xlarge-3
NodeName=ondemand-dy-c5xlarge-3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=2 CPULoad=0.04
   AvailableFeatures=dynamic,c5.xlarge,ondemand_i1
   ActiveFeatures=dynamic,c5.xlarge,ondemand_i1
   Gres=(null)
   NodeAddr=ondemand-dy-c5xlarge-3 NodeHostName=ondemand-dy-c5xlarge-3 Version=20.11.0
   OS=Linux 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020
   RealMemory=1 AllocMem=0 FreeMem=6484 Sockets=2 Boards=1
   State=IDLE+CLOUD+POWERING_DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ondemand
   BootTime=2020-11-26T10:49:16 SlurmdStartTime=2020-11-26T10:50:45
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

[ec2-user@ip-10-0-0-114 ~]$ scontrol show nodes ondemand-dy-c5xlarge-3
NodeName=ondemand-dy-c5xlarge-3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=2 CPULoad=0.04
   AvailableFeatures=dynamic,c5.xlarge,ondemand_i1
   ActiveFeatures=dynamic,c5.xlarge,ondemand_i1
   Gres=(null)
   NodeAddr=ondemand-dy-c5xlarge-3 NodeHostName=ondemand-dy-c5xlarge-3 Version=20.11.0
   OS=Linux 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020
   RealMemory=1 AllocMem=0 FreeMem=6484 Sockets=2 Boards=1
   State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ondemand
   BootTime=2020-11-26T10:49:16 SlurmdStartTime=2020-11-26T10:50:45
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

# Submit a new job to the same node
[ec2-user@ip-10-0-0-114 ~]$ sbatch -w ondemand-dy-c5xlarge-3 --wrap "hostname"
Submitted batch job 13

[ec2-user@ip-10-0-0-114 ~]$ scontrol show nodes ondemand-dy-c5xlarge-3
NodeName=ondemand-dy-c5xlarge-3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=1 CPUTot=2 CPULoad=0.04
   AvailableFeatures=dynamic,c5.xlarge,ondemand_i1
   ActiveFeatures=dynamic,c5.xlarge,ondemand_i1
   Gres=(null)
   NodeAddr=10.0.18.102 NodeHostName=ondemand-dy-c5xlarge-3 Version=20.11.0
   OS=Linux 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020
   RealMemory=1 AllocMem=0 FreeMem=6484 Sockets=2 Boards=1
   State=MIXED#+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ondemand
   BootTime=2020-11-26T10:49:16 SlurmdStartTime=2020-11-26T10:50:45
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=cpu=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

# Node is set to DOWN because marked as unexpectedly rebooted
[ec2-user@ip-10-0-0-114 ~]$ scontrol show nodes ondemand-dy-c5xlarge-3
NodeName=ondemand-dy-c5xlarge-3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=2 CPULoad=0.59
   AvailableFeatures=dynamic,c5.xlarge,ondemand_i1
   ActiveFeatures=dynamic,c5.xlarge,ondemand_i1
   Gres=(null)
   NodeAddr=10.0.18.102 NodeHostName=ondemand-dy-c5xlarge-3 Version=20.11.0
   OS=Linux 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020
   RealMemory=1 AllocMem=0 FreeMem=6394 Sockets=2 Boards=1
   State=DOWN+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ondemand
   BootTime=2020-11-26T11:05:25 SlurmdStartTime=2020-11-26T11:07:15
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2020-11-26T11:07:15]
   Comment=(null)
```

Here are the relevant entries from the slurmctld log file (full logs attached):

```
[2020-11-26T11:07:15.047] Node ondemand-dy-c5xlarge-3 rebooted 110 secs ago
[2020-11-26T11:07:15.047] debug3: resetting job_count on node ondemand-dy-c5xlarge-3 from 0 to 1
[2020-11-26T11:07:15.047] Node ondemand-dy-c5xlarge-3 now responding
[2020-11-26T11:07:15.047] validate_node_specs: Node ondemand-dy-c5xlarge-3 unexpectedly rebooted boot_time=1606388725 last response=1606388447
[2020-11-26T11:07:15.047] requeue job JobId=13 due to failure of node ondemand-dy-c5xlarge-3
```
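The `validate_node_specs` log line shows the shape of the failure: the controller compares the boot time reported by the freshly started slurmd against the last time it heard from the node, and a newer boot time is treated as an unexpected reboot. A minimal sketch of that comparison (hypothetical pseudologic for illustration, not the actual Slurm source; the real behavior lives in `validate_node_specs`):

```python
# Hypothetical sketch of the "unexpectedly rebooted" check implied by the
# slurmctld log above -- NOT the actual Slurm source. Timestamps are Unix
# epoch seconds, as printed in the log.

def unexpectedly_rebooted(boot_time: int, last_response: int,
                          just_powered_up: bool) -> bool:
    """A boot time newer than the node's last response normally means the
    node rebooted behind the controller's back. For a CLOUD node that was
    just powered up on a fresh instance this is expected and should be
    skipped -- the 20.11.0 regression is that it was not skipped."""
    if just_powered_up:
        return False  # fresh cloud power-up: a new boot time is normal
    return boot_time > last_response

# Values from the attached log:
#   boot_time=1606388725 last response=1606388447
print(unexpectedly_rebooted(1606388725, 1606388447, just_powered_up=False))  # True -> node marked DOWN
print(unexpectedly_rebooted(1606388725, 1606388447, just_powered_up=True))   # False -> expected power-up
```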
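For reference, the epoch timestamps in that log line decode to the times shown by `scontrol` (a quick sanity check, assuming the log timestamps are UTC):

```python
from datetime import datetime, timezone

def iso(ts: int) -> str:
    # Render an epoch timestamp the way scontrol/slurmctld print times.
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")

boot_time, last_response = 1606388725, 1606388447
print(iso(boot_time))             # 2020-11-26T11:05:25 -- matches BootTime on the DOWN node
print(iso(last_response))         # 2020-11-26T11:00:47 -- last response before the power cycle
print(boot_time - last_response)  # 278 seconds between last response and the new boot
```

So the second power-up's boot time is 278 seconds newer than the last response from the previous incarnation of the node, which is exactly the condition the controller misreads as an in-place reboot.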