Ticket 11830

Summary: Compute node remains in allocated state
Product: Slurm Reporter: aptivhpcsupport
Component: Cloud Assignee: Oriol Vilarrubi <jvilarru>
Status: RESOLVED CANNOTREPRODUCE QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 20.11.5   
Hardware: Linux   
OS: Linux   
Site: Aptiv Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description aptivhpcsupport 2021-06-15 02:25:53 MDT
Hello,

We've encountered a problem with one of the VMs in our cloud cluster. After a job submission the VM remains in the allocated state even though there are no jobs running on it:

sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
sil-prep      up   infinite      2  idle~ cn-com[001-002]
def*          up   infinite     20  idle~ cn-com[003-022]
def*          up   infinite      1   idle cn-com121
hil-prep      up   infinite     96  idle~ cn-com[023-097,100-120]
hil-prep      up   infinite      2  alloc cn-com[098-099]
hil-mid       up   infinite      9   idle cn-hilgw[001-009]
hil-remote    up   infinite      9   idle cn-hilgw-remote[001-009]

[root@cn-log01 kuba]# squeue -w cn-com098
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


NodeName=cn-com098 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=4 CPULoad=0.18
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=172.16.24.148 NodeHostName=cn-com098 Version=20.11.5
   OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
   RealMemory=15884 AllocMem=0 FreeMem=15219 Sockets=1 Boards=1
   State=ALLOCATED+CLOUD ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=hil-prep
   BootTime=2021-06-15T08:12:05 SlurmdStartTime=2021-06-15T08:12:43
   CfgTRES=cpu=4,mem=15884M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
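The node record above is internally inconsistent: State=ALLOCATED+CLOUD, yet CPUAlloc=0 and AllocTRES is empty. A minimal sketch of how such a record could be flagged mechanically, parsing the key=value format that `scontrol show node` prints (illustration only, not a Slurm API; the abbreviated sample is taken from this ticket):

```python
import re

# Abbreviated from the `scontrol show node cn-com098` output in this ticket
# (only the fields the check needs).
SAMPLE = """NodeName=cn-com098 CPUAlloc=0 CPUTot=4
   State=ALLOCATED+CLOUD
   AllocTRES="""

def parse_node(output):
    """Parse `scontrol show node` key=value pairs into a dict."""
    return dict(re.findall(r"(\w+)=(\S*)", output))

def is_stuck_allocated(output):
    """True if the node reports ALLOCATED but carries no allocation."""
    f = parse_node(output)
    return ("ALLOCATED" in f.get("State", "")
            and f.get("CPUAlloc") == "0"
            and f.get("AllocTRES", "") == "")

print(is_stuck_allocated(SAMPLE))  # True for the record in this ticket
```

A healthy allocated node would report a nonzero CPUAlloc and a populated AllocTRES, so the same check returns False for it.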


slurmd.log:
[root@cn-com098 kuba]# cat /var/log/slurmdcn-com098.log
[2021-06-15T08:12:42.344] error: Node configuration differs from hardware: CPUs=4:4(hw) Boards=1:1(hw) SocketsPerBoard=1:1(hw) CoresPerSocket=4:2(hw) ThreadsPerCore=2:2(hw)
[2021-06-15T08:12:42.352] CPU frequency setting not configured for this node
[2021-06-15T08:12:42.364] slurmd version 20.11.5 started
[2021-06-15T08:12:42.381] slurmd started on Tue, 15 Jun 2021 08:12:42 +0000
[2021-06-15T08:12:43.656] CPUs=4 Boards=1 Sockets=1 Cores=4 Threads=2 Memory=15884 TmpDisk=29703 Uptime=38 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-06-15T08:12:44.823] launch task StepId=430684.0 request from UID:0 GID:0 HOST:172.16.24.5 PORT:50418
[2021-06-15T08:12:44.824] task/affinity: lllp_distribution: JobId=430684 implicit auto binding: sockets,one_thread, dist 1
[2021-06-15T08:12:44.824] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-06-15T08:12:44.824] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [430684]: mask_cpu,one_thread, 0xF
[2021-06-15T08:12:52.125] [430684.0] in _window_manager
[2021-06-15T08:13:26.035] [430684.0] done with job


The VM is not excluded from suspend:
ResumeProgram=/etc/slurm/resume.sh
ResumeFailProgram=/etc/slurm/resume_fail.sh
ResumeTimeout=600
SuspendProgram=/etc/slurm/suspend.sh
SuspendTime=500
BatchStartTimeout=300
SuspendExcNodes=cn-hilgw[001-009],cn-hilgw-remote[001-009]
SuspendExcParts=hil-mid,hil-remote
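cn-com098 indeed matches neither SuspendExcNodes nor its partition (hil-prep) SuspendExcParts. As a sketch, the node-exclusion check amounts to expanding the bracketed hostlist expression and testing membership (Slurm does this internally; this standalone expansion handles only the single-range syntax used above):

```python
import re

def expand_hostlist(expr):
    """Expand a Slurm-style hostlist such as 'cn-hilgw[001-009],gw01'
    into individual node names (single numeric range per entry)."""
    names = []
    # Split on commas that are not inside a [...] range.
    for part in re.split(r",(?![^\[]*\])", expr):
        m = re.fullmatch(r"(.+)\[(\d+)-(\d+)\]", part)
        if m:
            prefix, lo, hi = m.groups()
            width = len(lo)  # preserve zero-padding, e.g. 001
            names += [f"{prefix}{i:0{width}d}"
                      for i in range(int(lo), int(hi) + 1)]
        else:
            names.append(part)
    return names

# SuspendExcNodes value from the slurm.conf excerpt above.
excluded = set(expand_hostlist("cn-hilgw[001-009],cn-hilgw-remote[001-009]"))
print("cn-com098" in excluded)    # False: the node is eligible for suspend
print("cn-hilgw005" in excluded)  # True: excluded from suspend
```

So with SuspendTime=500 and no exclusion, the controller should have suspended cn-com098 once it went idle, which makes the lingering ALLOCATED state look like stale controller-side bookkeeping.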

All other VMs in the partition behave as expected.

Regards,
Jakub Rodak
Comment 1 Oriol Vilarrubi 2021-06-22 08:02:06 MDT
Hello, are you able to reproduce the error? If so, can you attach your slurm.conf so that I can try to reproduce the issue?
Comment 2 aptivhpcsupport 2021-06-25 07:34:27 MDT
Hello,

I cannot replicate the issue. After restarting slurmctld the problem disappeared. The ticket can be closed.

Regards,
Jakub Rodak
Comment 3 Oriol Vilarrubi 2021-06-29 00:25:42 MDT
Hello Jakub,

If the issue happens again you can reopen this ticket simply by replying to this comment.

Regards.