Ticket 7866 - Logging too fast "memory is under-allocated"
Summary: Logging too fast "memory is under-allocated"
Status: RESOLVED DUPLICATE of ticket 6769
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 19.05.2
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-10-04 09:22 MDT by hpc-admin
Modified: 2019-10-04 09:58 MDT
CC List: 0 users

See Also:
Site: Ghent
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description hpc-admin 2019-10-04 09:22:37 MDT
Hi,


We've had 7G of log messages of the following form over the course of a few hours:

~~~~
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3100.skitty.os memory is under-allocated (163840-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3105.skitty.os memory is under-allocated (167936-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3117.skitty.os memory is under-allocated (133120-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3118.skitty.os memory is under-allocated (122880-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3119.skitty.os memory is under-allocated (174080-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3137.skitty.os memory is under-allocated (0-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3154.skitty.os memory is under-allocated (163840-184320) for JobId=462342
~~~~

This effectively blocks users from submitting jobs and is filling up /var/ (where the log files reside).
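For what it's worth, here is a rough Python sketch of how we tally which nodes and jobs these messages refer to (the log path is just an example for our setup):

~~~~
#!/usr/bin/env python3
"""Tally 'memory is under-allocated' errors per node and per job (rough sketch)."""
import re
from collections import Counter

LOG = "/var/log/slurm/slurmctld.log"  # example path; adjust to your site

# Matches e.g.:
# ... rm_job_res: node node3100.skitty.os memory is under-allocated (163840-184320) for JobId=462342
PATTERN = re.compile(
    r"rm_job_res: node (?P<node>\S+) memory is under-allocated "
    r"\((?P<alloc>\d+)-(?P<req>\d+)\) for JobId=(?P<job>\d+)"
)

nodes, jobs = Counter(), Counter()
with open(LOG, errors="replace") as fh:
    for line in fh:
        m = PATTERN.search(line)
        if m:
            nodes[m.group("node")] += 1
            jobs[m.group("job")] += 1

print("messages per job: ", jobs.most_common(10))
print("messages per node:", nodes.most_common(10))
~~~~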

The job had the following info according to scontrol:

~~~~
JobId=462342 JobName=LES_TimeStatistics_D1.3_3baro
  UserId=vsc41854(2541854) GroupId=vsc41854(2541854) MCS_label=N/A
  Priority=21576 Nice=0 Account=gvo00010 QOS=normal
  JobState=COMPLETING Reason=None Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
  RunTime=00:27:19 TimeLimit=2-00:00:00 TimeMin=N/A
  SubmitTime=2019-09-27T17:47:50 EligibleTime=2019-09-27T17:47:50
  AccrueTime=2019-09-27T17:47:50
  StartTime=2019-09-27T20:25:49 EndTime=2019-09-27T20:53:08 Deadline=N/A
  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-09-27T20:25:49
  Partition=skitty AllocNode:Sid=gligar05.gastly.os:22265
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=node3137.skitty.os
  BatchHost=node3100.skitty.os
  NumNodes=1 NumCPUs=360 NumTasks=360 CPUs/Task=N/A ReqB:S:C:T=0:0::
  TRES=cpu=360,mem=1800G,node=10,billing=360
  Socks/Node= NtasksPerN:b:S:C=0:0:: CoreSpec=
  MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
  Features=(null) DelayBoot=00:00:00
  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=(null)
  WorkDir=/user/gent/418/vsc41854
  Comment=stdout=/kyukon/data/gent/vo/000/gvo00010/vsc41854/Fluent/Final/D_1.3mm/3baro/LES_Time_Statistics/%x.o%A
  StdErr=/kyukon/data/gent/vo/000/gvo00010/vsc41854/Fluent/Final/D_1.3mm/3baro/LES_Time_Statistics/LES_TimeStatistics_D1.3_3baro.e462342
  StdIn=/dev/null
  StdOut=/kyukon/data/gent/vo/000/gvo00010/vsc41854/Fluent/Final/D_1.3mm/3baro/LES_Time_Statistics/LES_TimeStatistics_D1.3_3baro.o462342
  Power=
~~~~
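The interesting bit is JobState=COMPLETING. A rough sketch (using squeue's standard state filter) of how we check whether other jobs are stuck in the same state:

~~~~
#!/usr/bin/env python3
"""List jobs in COMPLETING state with their node lists (rough sketch)."""
import subprocess

# Columns: job id, job name, elapsed time, node list
out = subprocess.run(
    ["squeue", "--states=COMPLETING", "--noheader", "--format=%i %j %M %N"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    print(line)
~~~~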

Is there any reason why:

- the logging is this frequent?
- the nodes are not put into drain, or something similar? (the sketch below shows the manual drain we would otherwise fall back to)
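
If it comes to that, manually draining the affected nodes is the fallback we have in mind; a minimal sketch (node name and reason are placeholders):

~~~~
#!/usr/bin/env python3
"""Drain a node via scontrol (minimal sketch; node name and reason are placeholders)."""
import subprocess

def drain(node: str, reason: str) -> None:
    # Equivalent to: scontrol update NodeName=<node> State=DRAIN Reason="<reason>"
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"],
        check=True,
    )

drain("node3137.skitty.os", "under-allocated memory errors (ticket 7866)")
~~~~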

Kind regards,
-- Andy
Comment 1 Jason Booth 2019-10-04 09:58:32 MDT
Hi Andy - Please have a look at bug#6769, which should resolve this via the 19.05.3 release.

-- Fix select plugins' will-run test under-allocating node usage for completing jobs.
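
After upgrading, the controller version can be confirmed with something like the following (a minimal sketch; SLURM_VERSION is one of the fields reported by scontrol show config):

~~~~
#!/usr/bin/env python3
"""Print the Slurm version reported by the controller (minimal sketch)."""
import subprocess

config = subprocess.run(
    ["scontrol", "show", "config"],
    capture_output=True, text=True, check=True,
).stdout

for line in config.splitlines():
    if line.startswith("SLURM_VERSION"):
        print(line.strip())  # e.g. SLURM_VERSION = 19.05.3
~~~~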

*** This ticket has been marked as a duplicate of ticket 6769 ***