| Summary: | Compute node remains in allocated state | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | aptivhpcsupport |
| Component: | Cloud | Assignee: | Oriol Vilarrubi <jvilarru> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Aptiv | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Hello, are you able to reproduce the error? If so, can you attach your slurm.conf so that I am able to reproduce the issue?

Hello, I cannot replicate the issue. After restarting slurmctld the problem disappeared. The ticket can be closed. Regards, Jakub Rodak

Hello Jakub, if the issue happens again you can reopen this ticket simply by replying to this comment. Regards.
Hello,

We've encountered a problem with one of the VMs in our cloud cluster. After job submission the VM remains in the allocated state even though there are no jobs running on it:

```
sinfo
PARTITION  AVAIL TIMELIMIT NODES STATE  NODELIST
sil-prep   up    infinite  2     idle~  cn-com[001-002]
def*       up    infinite  20    idle~  cn-com[003-022]
def*       up    infinite  1     idle   cn-com121
hil-prep   up    infinite  96    idle~  cn-com[023-097,100-120]
hil-prep   up    infinite  2     alloc  cn-com[098-099]
hil-mid    up    infinite  9     idle   cn-hilgw[001-009]
hil-remote up    infinite  9     idle   cn-hilgw-remote[001-009]

[root@cn-log01 kuba]# squeue -w cn-com098
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
```

```
NodeName=cn-com098 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=4 CPULoad=0.18
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=172.16.24.148 NodeHostName=cn-com098 Version=20.11.5
   OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
   RealMemory=15884 AllocMem=0 FreeMem=15219 Sockets=1 Boards=1
   State=ALLOCATED+CLOUD ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=hil-prep
   BootTime=2021-06-15T08:12:05 SlurmdStartTime=2021-06-15T08:12:43
   CfgTRES=cpu=4,mem=15884M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
```

slurmd.log:

```
[root@cn-com098 kuba]# cat /var/log/slurmdcn-com098.log
[2021-06-15T08:12:42.344] error: Node configuration differs from hardware: CPUs=4:4(hw) Boards=1:1(hw) SocketsPerBoard=1:1(hw) CoresPerSocket=4:2(hw) ThreadsPerCore=2:2(hw)
[2021-06-15T08:12:42.352] CPU frequency setting not configured for this node
[2021-06-15T08:12:42.364] slurmd version 20.11.5 started
[2021-06-15T08:12:42.381] slurmd started on Tue, 15 Jun 2021 08:12:42 +0000
[2021-06-15T08:12:43.656] CPUs=4 Boards=1 Sockets=1 Cores=4 Threads=2 Memory=15884 TmpDisk=29703 Uptime=38 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-06-15T08:12:44.823] launch task StepId=430684.0 request from UID:0 GID:0 HOST:172.16.24.5 PORT:50418
[2021-06-15T08:12:44.824] task/affinity: lllp_distribution: JobId=430684 implicit auto binding: sockets,one_thread, dist 1
[2021-06-15T08:12:44.824] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-06-15T08:12:44.824] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [430684]: mask_cpu,one_thread, 0xF
[2021-06-15T08:12:52.125] [430684.0] in _window_manager
[2021-06-15T08:13:26.035] [430684.0] done with job
```

The VM is not excluded from suspend:

```
ResumeProgram=/etc/slurm/resume.sh
ResumeFailProgram=/etc/slurm/resume_fail.sh
ResumeTimeout=600
SuspendProgram=/etc/slurm/suspend.sh
SuspendTime=500
BatchStartTimeout=300
SuspendExcNodes=cn-hilgw[001-009],cn-hilgw-remote[001-009]
SuspendExcParts=hil-mid,hil-remote
```

All other VMs in the partition behave as expected.

Regards,
Jakub Rodak
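The telling symptom in the output above is `State=ALLOCATED+CLOUD` combined with `CPUAlloc=0` and an empty `AllocTRES=` while `squeue` shows no jobs on the node. One way to spot such stuck nodes across a cluster is to parse `scontrol show node` output and flag that combination. The sketch below is an illustration only, not part of Slurm; both the parsing and the "stuck" heuristic are assumptions of this example:

```python
# Rough sketch: flag nodes that `scontrol show node` reports as ALLOCATED
# even though nothing is allocated on them (CPUAlloc=0, empty AllocTRES).
# This heuristic is an assumption of the example, not an official Slurm check.
import re

def parse_node_records(scontrol_output: str) -> list[dict]:
    """Split `scontrol show node` output into one key=value dict per node."""
    records = []
    for block in re.split(r"\n\s*\n", scontrol_output.strip()):
        # key=value pairs; values may contain '=' and ',' (e.g. CfgTRES).
        fields = dict(re.findall(r"(\w+)=(\S*)", block))
        if "NodeName" in fields:
            records.append(fields)
    return records

def stuck_allocated(records: list[dict]) -> list[str]:
    """Nodes in an ALLOCATED state that hold no CPUs and no TRES."""
    return [
        r["NodeName"]
        for r in records
        if "ALLOCATED" in r.get("State", "")
        and r.get("CPUAlloc") == "0"
        and r.get("AllocTRES", "") == ""
    ]

# Sample record, abridged from the ticket output for cn-com098.
sample = """\
NodeName=cn-com098 Arch=x86_64 CoresPerSocket=4
   CPUAlloc=0 CPUTot=4 CPULoad=0.18
   State=ALLOCATED+CLOUD ThreadsPerCore=2
   CfgTRES=cpu=4,mem=15884M,billing=4
   AllocTRES=
"""

print(stuck_allocated(parse_node_records(sample)))  # ['cn-com098']
```

In practice you would feed it the live output, e.g. `subprocess.run(["scontrol", "show", "node"], capture_output=True, text=True).stdout`.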