Ticket 7866 - Logging too fast "memory is under-allocated"
Summary: Logging too fast "memory is under-allocated"
Status: RESOLVED DUPLICATE of ticket 6769
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 19.05.2
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-10-04 09:22 MDT by hpc-admin
Modified: 2019-10-04 09:58 MDT
CC List: 0 users

See Also:
Site: Ghent
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description hpc-admin 2019-10-04 09:22:37 MDT
Hi,


We've had 7G of log messages of the following form over the course of a few hours:

~~~~
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3100.skitty.os memory is under-allocated (163840-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3105.skitty.os memory is under-allocated (167936-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3117.skitty.os memory is under-allocated (133120-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3118.skitty.os memory is under-allocated (122880-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3119.skitty.os memory is under-allocated (174080-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3137.skitty.os memory is under-allocated (0-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3154.skitty.os memory is under-allocated (163840-184320) for JobId=462342
~~~~

This effectively blocks users from submitting jobs and is filling up /var/ (where the log files reside).
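For what it's worth, here is a rough Python sketch of how we tally which nodes and jobs these messages refer to (the log path is just an example for our setup):

~~~~
#!/usr/bin/env python3
"""Tally 'memory is under-allocated' errors per node and per job (rough sketch)."""
import re
from collections import Counter

LOG = "/var/log/slurm/slurmctld.log"  # example path; adjust to your site

# Matches e.g.:
# ... rm_job_res: node node3100.skitty.os memory is under-allocated (163840-184320) for JobId=462342
PATTERN = re.compile(
    r"rm_job_res: node (?P<node>\S+) memory is under-allocated "
    r"\((?P<alloc>\d+)-(?P<req>\d+)\) for JobId=(?P<job>\d+)"
)

nodes, jobs = Counter(), Counter()
with open(LOG, errors="replace") as fh:
    for line in fh:
        m = PATTERN.search(line)
        if m:
            nodes[m.group("node")] += 1
            jobs[m.group("job")] += 1

print("messages per job: ", jobs.most_common(10))
print("messages per node:", nodes.most_common(10))
~~~~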

The job had the following info according to scontrol:

~~~~
JobId=462342 JobName=LES_TimeStatistics_D1.3_3baro
  UserId=vsc41854(2541854) GroupId=vsc41854(2541854) MCS_label=N/A
  Priority=21576 Nice=0 Account=gvo00010 QOS=normal
  JobState=COMPLETING Reason=None Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
  RunTime=00:27:19 TimeLimit=2-00:00:00 TimeMin=N/A
  SubmitTime=2019-09-27T17:47:50 EligibleTime=2019-09-27T17:47:50
  AccrueTime=2019-09-27T17:47:50
  StartTime=2019-09-27T20:25:49 EndTime=2019-09-27T20:53:08 Deadline=N/A
  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-09-27T20:25:49
  Partition=skitty AllocNode:Sid=gligar05.gastly.os:22265
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=node3137.skitty.os
  BatchHost=node3100.skitty.os
  NumNodes=1 NumCPUs=360 NumTasks=360 CPUs/Task=N/A ReqB:S:C:T=0:0::
  TRES=cpu=360,mem=1800G,node=10,billing=360
  Socks/Node= NtasksPerN:b:S:C=0:0:: CoreSpec=
  MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
  Features=(null) DelayBoot=00:00:00
  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=(null)
  WorkDir=/user/gent/418/vsc41854
  Comment=stdout=/kyukon/data/gent/vo/000/gvo00010/vsc41854/Fluent/Final/D_1.3mm/3baro/LES_Time_Statistics/%x.o%A
  StdErr=/kyukon/data/gent/vo/000/gvo00010/vsc41854/Fluent/Final/D_1.3mm/3baro/LES_Time_Statistics/LES_TimeStatistics_D1.3_3baro.e462342
  StdIn=/dev/null
  StdOut=/kyukon/data/gent/vo/000/gvo00010/vsc41854/Fluent/Final/D_1.3mm/3baro/LES_Time_Statistics/LES_TimeStatistics_D1.3_3baro.o462342
  Power=
~~~~
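The interesting bit is JobState=COMPLETING. A rough sketch (using squeue's standard state filter) of how we check whether other jobs are stuck in the same state:

~~~~
#!/usr/bin/env python3
"""List jobs in COMPLETING state with their node lists (rough sketch)."""
import subprocess

# Columns: job id, job name, elapsed time, node list
out = subprocess.run(
    ["squeue", "--states=COMPLETING", "--noheader", "--format=%i %j %M %N"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    print(line)
~~~~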

Is there any reason why:

- the logging is this frequent?
- the nodes are not put into drain, or something similar? (the sketch below shows the manual drain we would otherwise fall back to)
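
If it comes to that, manually draining the affected nodes is the fallback we have in mind; a minimal sketch (node name and reason are placeholders):

~~~~
#!/usr/bin/env python3
"""Drain a node via scontrol (minimal sketch; node name and reason are placeholders)."""
import subprocess

def drain(node: str, reason: str) -> None:
    # Equivalent to: scontrol update NodeName=<node> State=DRAIN Reason="<reason>"
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"],
        check=True,
    )

drain("node3137.skitty.os", "under-allocated memory errors (ticket 7866)")
~~~~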

Kind regards,
-- Andy
Comment 1 Jason Booth 2019-10-04 09:58:32 MDT
Hi Andy - Please have a look at bug#6769, which should resolve this via the 19.05.3 release.

-- Fix select plugins' will-run test under-allocating node usage for completing jobs.
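
After upgrading, the controller version can be confirmed with something like the following (a minimal sketch; SLURM_VERSION is one of the fields reported by scontrol show config):

~~~~
#!/usr/bin/env python3
"""Print the Slurm version reported by the controller (minimal sketch)."""
import subprocess

config = subprocess.run(
    ["scontrol", "show", "config"],
    capture_output=True, text=True, check=True,
).stdout

for line in config.splitlines():
    if line.startswith("SLURM_VERSION"):
        print(line.strip())  # e.g. SLURM_VERSION = 19.05.3
~~~~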

*** This ticket has been marked as a duplicate of ticket 6769 ***