Ticket 7866

Summary: Logging too fast "memory is under-allocated"
Product: Slurm
Reporter: hpc-admin
Component: slurmctld
Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
Version: 19.05.2
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6769
Site: Ghent

Description hpc-admin 2019-10-04 09:22:37 MDT
Hi,


Over the course of a few hours we've had 7G of log messages of the form

~~~~
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3100.skitty.os memory is under-allocated (163840-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3105.skitty.os memory is under-allocated (167936-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3117.skitty.os memory is under-allocated (133120-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3118.skitty.os memory is under-allocated (122880-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3119.skitty.os memory is under-allocated (174080-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3137.skitty.os memory is under-allocated (0-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3154.skitty.os memory is under-allocated (163840-184320) for JobId=462342
~~~~

effectively blocking users from submitting jobs, and filling up /var/ (where the log files reside).
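
To gauge how much of a log is taken up by these messages, and which jobs and nodes they refer to, a small script along the following lines may help. It is only a sketch: the regular expression is modelled on the sample lines quoted above, and the group names `first` and `second` are illustrative labels for the two numbers in parentheses, not names taken from Slurm.

```python
import re
from collections import Counter

# Pattern modelled on the sample log lines quoted in this ticket.
# "first" and "second" are illustrative names for the two numbers in
# parentheses; they are not identifiers from Slurm itself.
UNDER_ALLOC = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: select/cons_tres: rm_job_res: "
    r"node (?P<node>\S+) memory is under-allocated "
    r"\((?P<first>\d+)-(?P<second>\d+)\) for JobId=(?P<job>\d+)"
)


def summarize(lines):
    """Count under-allocation messages per job and per (job, node) pair."""
    per_job = Counter()
    per_node = Counter()
    for line in lines:
        m = UNDER_ALLOC.match(line)
        if m:
            per_job[m["job"]] += 1
            per_node[(m["job"], m["node"])] += 1
    return per_job, per_node


# Two of the lines quoted above, used here as sample input.
sample = [
    "[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: "
    "node node3100.skitty.os memory is under-allocated (163840-184320) "
    "for JobId=462342",
    "[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: "
    "node node3137.skitty.os memory is under-allocated (0-184320) "
    "for JobId=462342",
]
per_job, per_node = summarize(sample)
print(per_job.most_common(5))
```

For a real log, an open file object can be passed to `summarize` instead of `sample`; the location of the slurmctld log is site-specific (see `SlurmctldLogFile` in slurm.conf).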

The job had the following info according to scontrol:

~~~~
JobId=462342 JobName=LES_TimeStatistics_D1.3_3baro
  UserId=vsc41854(2541854) GroupId=vsc41854(2541854) MCS_label=N/A
  Priority=21576 Nice=0 Account=gvo00010 QOS=normal
  JobState=COMPLETING Reason=None Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
  RunTime=00:27:19 TimeLimit=2-00:00:00 TimeMin=N/A
  SubmitTime=2019-09-27T17:47:50 EligibleTime=2019-09-27T17:47:50
  AccrueTime=2019-09-27T17:47:50
  StartTime=2019-09-27T20:25:49 EndTime=2019-09-27T20:53:08 Deadline=N/A
  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-09-27T20:25:49
  Partition=skitty AllocNode:Sid=gligar05.gastly.os:22265
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=node3137.skitty.os
  BatchHost=node3100.skitty.os
  NumNodes=1 NumCPUs=360 NumTasks=360 CPUs/Task=N/A ReqB:S:C:T=0:0::
  TRES=cpu=360,mem=1800G,node=10,billing=360
  Socks/Node= NtasksPerN:b:S:C=0:0:: CoreSpec=
  MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
  Features=(null) DelayBoot=00:00:00
  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=(null)
  WorkDir=/user/gent/418/vsc41854
  Comment=stdout=/kyukon/data/gent/vo/000/gvo00010/vsc41854/Fluent/Final/D_1.3mm/3baro/LES_Time_Statistics/%x.o%A
  StdErr=/kyukon/data/gent/vo/000/gvo00010/vsc41854/Fluent/Final/D_1.3mm/3baro/LES_Time_Statistics/LES_TimeStatistics_D1.3_3baro.e462342
  StdIn=/dev/null
  StdOut=/kyukon/data/gent/vo/000/gvo00010/vsc41854/Fluent/Final/D_1.3mm/3baro/LES_Time_Statistics/LES_TimeStatistics_D1.3_3baro.o462342
  Power=
~~~~
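
As a quick consistency check, the second number in the log messages (184320) lines up with the job's memory request in the scontrol output. The sketch below assumes that the logged values are in megabytes and that the job's memory is split evenly over the 10 nodes in the TRES line; neither assumption is stated in the output itself.

```python
# Values copied from the scontrol output and the log lines in this ticket.
num_cpus = 360            # NumCPUs=360
min_mem_per_cpu_g = 5     # MinMemoryCPU=5G
tres_nodes = 10           # node=10 in the TRES line
logged_value = 184320     # second number in e.g. "(163840-184320)"

# Total request: 360 CPUs * 5G per CPU, matching mem=1800G in the TRES line.
total_g = num_cpus * min_mem_per_cpu_g

# Per-node share in megabytes, ASSUMING an even split across the nodes
# and that the log reports megabytes.
per_node_mb = total_g * 1024 // tres_nodes

print(total_g, per_node_mb, per_node_mb == logged_value)
```

Under those assumptions the per-node share is 184320, the same value that appears as the second number in every quoted log line.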

Is there any reason why

- the logging is this frequent
- the nodes are not put to drain or some such?

Kind regards,
-- Andy

Comment 1 Jason Booth 2019-10-04 09:58:32 MDT
Hi Andy - Please have a look at bug 6769. Its fix is part of the 19.05.3 release and should resolve this.

-- Fix select plugins' will run test under-allocating nodes usage for completing jobs.

*** This ticket has been marked as a duplicate of ticket 6769 ***