Ticket 10298

Summary: Slurm 20.11: Powered up CLOUD nodes are mistakenly marked as unexpectedly rebooted and put to DOWN
Product: Slurm
Component: Cloud
Reporter: Francesco De Martino <fdm>
Assignee: Brian Christiansen <brian>
Status: RESOLVED FIXED
Severity: 6 - No support contract
Version: 20.11.0
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=18372
Site: -Other-
Version Fixed: 20.11.1 21.08.0pre1
Attachments: logs + config

Description Francesco De Martino 2020-11-26 06:50:00 MST
Created attachment 16845 [details]
logs + config

In our cluster we use the cloud scheduling logic to provision EC2 capacity based on the scheduler workload. After upgrading from Slurm 20.02.5 to Slurm 20.11.0 we identified a regression that breaks cloud bursting: when a CLOUD node is powered up a second time (after it has already gone through a full POWER_UP/POWER_DOWN cycle), the scheduler sees it as unexpectedly rebooted and marks it DOWN.

Here are some details on how to reproduce and the relevant log entries:

Cloud related config settings from slurm.conf (full config is attached)

```
# CLOUD CONFIGS OPTIONS
SlurmctldParameters=idle_on_node_suspend,cloud_dns
CommunicationParameters=NoAddrCache
SuspendProgram=/opt/parallelcluster/scripts/slurm/slurm_suspend
ResumeProgram=/opt/parallelcluster/scripts/slurm/slurm_resume
ResumeFailProgram=/opt/parallelcluster/scripts/slurm/slurm_suspend
SuspendTimeout=120
ResumeTimeout=3600
PrivateData=cloud
ResumeRate=0
SuspendRate=0
SuspendTime=600

...

NodeName=ondemand-st-c5xlarge-[1-1] CPUs=2 State=CLOUD Feature=static,c5.xlarge,ondemand_i1
NodeName=ondemand-dy-c5xlarge-[1-9] CPUs=2 State=CLOUD Feature=dynamic,c5.xlarge,ondemand_i1
NodeName=ondemand-dy-g38xlarge-[1-10] CPUs=16 State=CLOUD Feature=dynamic,g3.8xlarge,ondemand_i2,gpu Gres=gpu:tesla:2

NodeSet=ondemand_nodes Nodes=ondemand-st-c5xlarge-[1-1],ondemand-dy-c5xlarge-[1-9],ondemand-dy-g38xlarge-[1-10]
PartitionName=ondemand Nodes=ondemand_nodes MaxTime=INFINITE State=UP Default=YES
SuspendExcNodes=ondemand-st-c5xlarge-[1-1]
```
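For context, the ResumeProgram in this setup launches the backing instance and then points Slurm at it by updating NodeAddr and NodeHostName. The real scripts live under /opt/parallelcluster and are not shown here; the sketch below is a hypothetical stand-in (the `launch_instance` helper and the dry-run switch are invented for illustration), just to show the `scontrol update` contract a cloud resume script follows:

```shell
#!/bin/sh
# Hypothetical sketch of a cloud ResumeProgram (not the ParallelCluster script).
# For each node Slurm hands us, provision an instance and update
# NodeAddr/NodeHostName so slurmctld can reach the freshly booted slurmd.

# With SLURM_RESUME_DRY_RUN set, print the scontrol command instead of
# executing it, so the logic can be inspected without a live cluster.
run() {
    if [ -n "$SLURM_RESUME_DRY_RUN" ]; then echo "$@"; else "$@"; fi
}

launch_instance() {
    # Placeholder for the real EC2 provisioning call; prints the new
    # instance's private IP address.
    echo "10.0.27.43"
}

resume_nodes() {
    # A real script would receive a hostlist like "ondemand-dy-c5xlarge-[1-3]"
    # and expand it with `scontrol show hostnames`; here we take plain names.
    for node in "$@"; do
        addr=$(launch_instance "$node")
        run scontrol update NodeName="$node" NodeAddr="$addr" NodeHostName="$node"
    done
}

SLURM_RESUME_DRY_RUN=1
resume_nodes ondemand-dy-c5xlarge-3
```

This is what produces the NodeAddr change visible in the `scontrol show nodes` output below (NodeAddr=10.0.27.43 while the node is up, reverting to the node name once it is powered down).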

How to reproduce:

```
[ec2-user@ip-10-0-0-114 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
ondemand*    up   infinite     19  idle~ ondemand-dy-c5xlarge-[1-9],ondemand-dy-g38xlarge-[1-10]
ondemand*    up   infinite      1   idle ondemand-st-c5xlarge-1

[ec2-user@ip-10-0-0-114 ~]$ sbatch -w ondemand-dy-c5xlarge-3 --wrap "hostname"
Submitted batch job 12

# ResumeProgram launches an instance and updates the NodeAddr and the NodeHostName

[ec2-user@ip-10-0-0-114 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
ondemand*    up   infinite      1   mix# ondemand-dy-c5xlarge-3
ondemand*    up   infinite      1  idle# ondemand-dy-c5xlarge-5
ondemand*    up   infinite     17  idle~ ondemand-dy-c5xlarge-[1-2,4,6-9],ondemand-dy-g38xlarge-[1-10]
ondemand*    up   infinite      1   idle ondemand-st-c5xlarge-1
[ec2-user@ip-10-0-0-114 ~]$ scontrol show nodes ondemand-dy-c5xlarge-3
NodeName=ondemand-dy-c5xlarge-3 CoresPerSocket=1
   CPUAlloc=1 CPUTot=2 CPULoad=N/A
   AvailableFeatures=dynamic,c5.xlarge,ondemand_i1
   ActiveFeatures=dynamic,c5.xlarge,ondemand_i1
   Gres=(null)
   NodeAddr=10.0.27.43 NodeHostName=ondemand-dy-c5xlarge-3
   RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=MIXED#+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ondemand
   BootTime=None SlurmdStartTime=None
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=cpu=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

# Job completes successfully

[ec2-user@ip-10-0-0-114 ~]$ scontrol show jobs 12
JobId=12 JobName=wrap
   UserId=ec2-user(1000) GroupId=ec2-user(1000) MCS_label=N/A
   Priority=4294901757 Nice=0 Account=(null) QOS=(null)
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=365-00:00:00 TimeMin=N/A
   SubmitTime=2020-11-26T10:49:01 EligibleTime=2020-11-26T10:49:01
   AccrueTime=2020-11-26T10:49:01
   StartTime=2020-11-26T10:50:57 EndTime=2020-11-26T10:50:58 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-26T10:49:01
   Partition=ondemand AllocNode:Sid=ip-10-0-0-114:27294
   ReqNodeList=ondemand-dy-c5xlarge-3 ExcNodeList=(null)
   NodeList=ondemand-dy-c5xlarge-3
   BatchHost=ondemand-dy-c5xlarge-3
   NumNodes=1 NumCPUs=1 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/ec2-user
   StdErr=/home/ec2-user/slurm-12.out
   StdIn=/dev/null
   StdOut=/home/ec2-user/slurm-12.out
   Power=
   NtasksPerTRES:0
   
# Wait for the SuspendTime and for the node to be powered down

[ec2-user@ip-10-0-0-114 ~]$ scontrol show nodes ondemand-dy-c5xlarge-3
NodeName=ondemand-dy-c5xlarge-3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=2 CPULoad=0.04
   AvailableFeatures=dynamic,c5.xlarge,ondemand_i1
   ActiveFeatures=dynamic,c5.xlarge,ondemand_i1
   Gres=(null)
   NodeAddr=ondemand-dy-c5xlarge-3 NodeHostName=ondemand-dy-c5xlarge-3 Version=20.11.0
   OS=Linux 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020
   RealMemory=1 AllocMem=0 FreeMem=6484 Sockets=2 Boards=1
   State=IDLE+CLOUD+POWERING_DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ondemand
   BootTime=2020-11-26T10:49:16 SlurmdStartTime=2020-11-26T10:50:45
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
[ec2-user@ip-10-0-0-114 ~]$ scontrol show nodes ondemand-dy-c5xlarge-3
NodeName=ondemand-dy-c5xlarge-3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=2 CPULoad=0.04
   AvailableFeatures=dynamic,c5.xlarge,ondemand_i1
   ActiveFeatures=dynamic,c5.xlarge,ondemand_i1
   Gres=(null)
   NodeAddr=ondemand-dy-c5xlarge-3 NodeHostName=ondemand-dy-c5xlarge-3 Version=20.11.0
   OS=Linux 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020
   RealMemory=1 AllocMem=0 FreeMem=6484 Sockets=2 Boards=1
   State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ondemand
   BootTime=2020-11-26T10:49:16 SlurmdStartTime=2020-11-26T10:50:45
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
   
# Submit a new job to the same node
[ec2-user@ip-10-0-0-114 ~]$ sbatch -w ondemand-dy-c5xlarge-3 --wrap "hostname"
Submitted batch job 13
[ec2-user@ip-10-0-0-114 ~]$ scontrol show nodes ondemand-dy-c5xlarge-3
NodeName=ondemand-dy-c5xlarge-3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=1 CPUTot=2 CPULoad=0.04
   AvailableFeatures=dynamic,c5.xlarge,ondemand_i1
   ActiveFeatures=dynamic,c5.xlarge,ondemand_i1
   Gres=(null)
   NodeAddr=10.0.18.102 NodeHostName=ondemand-dy-c5xlarge-3 Version=20.11.0
   OS=Linux 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020
   RealMemory=1 AllocMem=0 FreeMem=6484 Sockets=2 Boards=1
   State=MIXED#+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ondemand
   BootTime=2020-11-26T10:49:16 SlurmdStartTime=2020-11-26T10:50:45
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=cpu=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

# Node is set to DOWN because marked as unexpectedly rebooted
[ec2-user@ip-10-0-0-114 ~]$ scontrol show nodes ondemand-dy-c5xlarge-3
NodeName=ondemand-dy-c5xlarge-3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=2 CPULoad=0.59
   AvailableFeatures=dynamic,c5.xlarge,ondemand_i1
   ActiveFeatures=dynamic,c5.xlarge,ondemand_i1
   Gres=(null)
   NodeAddr=10.0.18.102 NodeHostName=ondemand-dy-c5xlarge-3 Version=20.11.0
   OS=Linux 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020
   RealMemory=1 AllocMem=0 FreeMem=6394 Sockets=2 Boards=1
   State=DOWN+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=ondemand
   BootTime=2020-11-26T11:05:25 SlurmdStartTime=2020-11-26T11:07:15
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2020-11-26T11:07:15]
   Comment=(null)

```

Here are the relevant entries from the slurmctld log file (full logs attached):

```
[2020-11-26T11:07:15.047] Node ondemand-dy-c5xlarge-3 rebooted 110 secs ago
[2020-11-26T11:07:15.047] debug3: resetting job_count on node ondemand-dy-c5xlarge-3 from 0 to 1
[2020-11-26T11:07:15.047] Node ondemand-dy-c5xlarge-3 now responding
[2020-11-26T11:07:15.047] validate_node_specs: Node ondemand-dy-c5xlarge-3 unexpectedly rebooted boot_time=1606388725 last response=1606388447
[2020-11-26T11:07:15.047] requeue job JobId=13 due to failure of node ondemand-dy-c5xlarge-3
```
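The `validate_node_specs` line shows the heuristic at work: the node's reported boot_time (1606388725) is newer than the controller's last recorded response from it (1606388447), so the node is treated as having rebooted behind Slurm's back. For a CLOUD node that has completed a power-down, that condition holds on every subsequent power-up, which is why the check misfires here. A minimal sketch of the comparison (illustrative only, not Slurm's actual implementation):

```shell
#!/bin/sh
# Sketch of the check implied by the log line above: a node is flagged as
# unexpectedly rebooted when its reported boot time is newer than the last
# time slurmctld heard from it. Both arguments are Unix timestamps.

unexpectedly_rebooted() {
    boot_time=$1
    last_response=$2
    [ "$boot_time" -gt "$last_response" ]
}

# Values from the log entry above.
if unexpectedly_rebooted 1606388725 1606388447; then
    echo "unexpectedly rebooted"
fi
```

On a second power-up of a cloud node the fresh boot time always postdates the pre-power-down response, so 20.11.0 downs the node even though the reboot was expected; earlier releases (and the fix below) exempt powered-up cloud nodes from this check.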
Comment 1 Brian Christiansen 2020-12-02 17:05:46 MST
Thanks Francesco! This is fixed in 20.11.1:

https://github.com/SchedMD/slurm/commit/466c2878ba6ef3724f97a9a7c3c775c9992fb9dc