Hi,

We've had roughly 7 GB of log messages of the following form over the course of a few hours:

~~~~
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3100.skitty.os memory is under-allocated (163840-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3105.skitty.os memory is under-allocated (167936-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3117.skitty.os memory is under-allocated (133120-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3118.skitty.os memory is under-allocated (122880-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3119.skitty.os memory is under-allocated (174080-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3137.skitty.os memory is under-allocated (0-184320) for JobId=462342
[2019-10-04T16:57:51.044] error: select/cons_tres: rm_job_res: node node3154.skitty.os memory is under-allocated (163840-184320) for JobId=462342
~~~~

This flood is effectively blocking users from submitting jobs and is filling up /var/ (where the log files reside).

The job had the following info according to scontrol:

~~~~
JobId=462342 JobName=LES_TimeStatistics_D1.3_3baro
   UserId=vsc41854(2541854) GroupId=vsc41854(2541854) MCS_label=N/A
   Priority=21576 Nice=0 Account=gvo00010 QOS=normal
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:27:19 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2019-09-27T17:47:50 EligibleTime=2019-09-27T17:47:50
   AccrueTime=2019-09-27T17:47:50
   StartTime=2019-09-27T20:25:49 EndTime=2019-09-27T20:53:08 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-09-27T20:25:49
   Partition=skitty AllocNode:Sid=gligar05.gastly.os:22265
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node3137.skitty.os
   BatchHost=node3100.skitty.os
   NumNodes=1 NumCPUs=360 NumTasks=360 CPUs/Task=N/A ReqB:S:C:T=0:0::
   TRES=cpu=360,mem=1800G,node=10,billing=360
   Socks/Node= NtasksPerN:b:S:C=0:0:: CoreSpec=
   MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/user/gent/418/vsc41854
   Comment=stdout=/kyukon/data/gent/vo/000/gvo00010/vsc41854/Fluent/Final/D_1.3mm/3baro/LES_Time_Statistics/%x.o%A
   StdErr=/kyukon/data/gent/vo/000/gvo00010/vsc41854/Fluent/Final/D_1.3mm/3baro/LES_Time_Statistics/LES_TimeStatistics_D1.3_3baro.e462342
   StdIn=/dev/null
   StdOut=/kyukon/data/gent/vo/000/gvo00010/vsc41854/Fluent/Final/D_1.3mm/3baro/LES_Time_Statistics/LES_TimeStatistics_D1.3_3baro.o462342
   Power=
~~~~

Is there any reason why:

- the logging is this frequent?
- the nodes are not put to drain or some such?

Kind regards,
--
Andy
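PS: a stopgap to keep the flood from filling /var/ could look like the following. This is only a minimal sketch: it assumes the controller logs to /var/log/slurm/slurmctld.log (the real location is whatever SlurmctldLogFile in slurm.conf points at) and the 500M cap is an arbitrary choice.

~~~~
# Gauge how fast these messages are accumulating (path is an assumption;
# check SlurmctldLogFile in slurm.conf for the actual location).
grep -c 'memory is under-allocated' /var/log/slurm/slurmctld.log

# Stopgap logrotate policy so the flood cannot fill /var/ between rotations.
# copytruncate avoids having to signal slurmctld to reopen its log file.
cat > /etc/logrotate.d/slurmctld <<'EOF'
/var/log/slurm/slurmctld.log {
    size 500M
    rotate 4
    compress
    missingok
    copytruncate
}
EOF
~~~~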
Hi Andy,

Please have a look at bug#6769, which should resolve this via the 19.05.3 release. The relevant changelog entry:

-- Fix select plugins' will run test under-allocating nodes usage for completing jobs.

*** This ticket has been marked as a duplicate of ticket 6769 ***
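In the meantime, a quick way to confirm whether your controller is already on a release containing the fix (anything older than 19.05.3 would still show this behaviour):

~~~~
# Version of the locally installed Slurm tools.
sinfo -V

# Version reported by the running slurmctld (look for SLURM_VERSION).
scontrol show config | grep -i version
~~~~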