Summary: | reported memory usage is incorrect | ||
---|---|---|---|
Product: | Slurm | Reporter: | Martin Siegert <siegert> |
Component: | Accounting | Assignee: | Nate Rini <nate> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | asa188, kaizaad, nate |
Version: | 20.11.0 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Simon Fraser University | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | ---
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | | Version Fixed: |
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- | ||
Attachments: | slurmd log for node with debug4 of JobAcctGatherType=jobacct_gather/cgroup and linux |
Description
Martin Siegert
2021-01-07 13:50:40 MST
Please provide the following:
> sacct -j 58829413 -o all -p
and your slurm.conf (& friends).
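As an aside, a narrower sacct query is often enough to spot the suspicious memory figures; a minimal sketch, using the job ID from this report and standard sacct output fields:

# sacct -j 58829413 -o JobID,JobName,State,ReqMem,MaxRSS,MaxRSSNode,AveRSS -p

MaxRSS here is the peak resident set size recorded per step by the jobacct_gather plugin, which is the value being compared against the node's physical memory in this report.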
Created attachment 17386 [details]
sacct -j 58829413 -o all -p output
Created attachment 17387 [details]
slurm.conf
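As an aside, when there is any doubt about which settings a running cluster actually has (a question that comes up later in this thread), the live configuration can be queried instead of reading slurm.conf from disk; a minimal sketch, where the grep pattern is only illustrative:

# scontrol show config | grep -Ei 'proctracktype|jobacctgather'

This prints the ProctrackType, JobAcctGatherType, JobAcctGatherFrequency and JobAcctGatherParams values that the running slurmctld is using, which helps confirm that a configuration change has actually taken effect after a restart.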
I see that your slurm.conf has:
> ProctrackType=proctrack/pgid
> JobAcctGatherType=jobacct_gather/linux
Why not use the following, so that process tracking matches the accounting plugin?
> ProctrackType=proctrack/linux
Based on the logs provided, Slurm is doubling the usage, which could easily be due to xhpl calling fork(); proctrack/pgid can have issues handling fork trees.

We have ProctrackType=proctrack/cgroup, which I thought is the recommended setting - isn't it?

(In reply to Martin Siegert from comment #6)
> We have
> ProctrackType=proctrack/cgroup
> which I thought is the recommended setting - isn't it?
Oops, I must have looked at the wrong slurm.conf (locally). Yes, that is the suggested setting. Please disable
> JobAcctGatherParams=UsePSS
proctrack/cgroup and UsePSS are incompatible (bug#10587). Please make sure to restart all daemons after disabling UsePSS.

I commented out the JobAcctGatherParams=UsePSS line in slurm.conf and then restarted slurmctld and all slurmd. It has no effect:
# sacct -j 58853154 -o all -p
Account|AdminComment|AllocCPUS|AllocNodes|AllocTRES|AssocID|AveCPU|AveCPUFreq|AveDiskRead|AveDiskWrite|AvePages|AveRSS|AveVMSize|BlockID|Cluster|Comment|Constraints|ConsumedEnergy|ConsumedEnergyRaw|CPUTime|CPUTimeRAW|DBIndex|DerivedExitCode|Elapsed|ElapsedRaw|Eligible|End|ExitCode|Flags|GID|Group|JobID|JobIDRaw|JobName|Layout|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|MaxPages|MaxPagesNode|MaxPagesTask|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|McsLabel|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOS|QOSRAW|Reason|ReqCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqCPUS|ReqMem|ReqNodes|ReqTRES|Reservation|ReservationId|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|SystemComment|Timelimit|TimelimitRaw|TotalCPU|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutAve|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutMin|TRESUsageOutMinNode|TRESUsageOutMinTask|TRESUsageOutTot|UID|User|UserCPU|WCKey|WCKeyID|WorkDir|
def-siegert-ab_cpu||32|1|billing=32,cpu=32,mem=125G,node=1|21181|||||||||cedar||broadwell|0|0|10:53:52|39232|551646502|0:0|00:20:26|1226|2021-01-07T16:39:57|2021-01-07T17:18:58|0:0|SchedBackfill|3000123|siegert|58853154|58853154|hpl-memtest.slrm|||||||||||||||||||||32|1|cdr980||1117375|c12hbackfill|normal|1|None|Unknown|Unknown|Unknown|Unknown|32|125Gn|1|billing=32,cpu=32,mem=125G,node=1|||00:18:35|09:54:40|35680|2021-01-07T16:58:32|COMPLETED|2021-01-07T16:39:57|00:00:00|02:20.767||02:00:00|120|09:53:33|||||||||||||||||3000123|siegert|09:51:12||0|/project/6001524/siegert/benchmarks/hpl/cpu-only/32|
def-siegert-ab_cpu||32|1|cpu=32,mem=125G,node=1|21181|00:00:00|2.10G|6.96M|0.03M|12|22628K|880852K||cedar|||0|0|10:53:52|39232|551646502||00:20:26|1226|2021-01-07T16:58:32|2021-01-07T17:18:58|0:0||||58853154.batch|58853154.batch|batch|Unknown|6.96M|cdr980|0|0.03M|cdr980|0|12|cdr980|0|22628K|cdr980|0|880852K|cdr980|0||00:00:00|cdr980|0|32|1|cdr980|1||||||0|0|0|0|32|125Gn|1|||||||2021-01-07T16:58:32|COMPLETED|2021-01-07T16:58:32|00:00:00|00:00.333||||00:01.365|cpu=00:00:00,energy=0,fs/disk=7296460,mem=22628K,pages=12,vmem=880852K|cpu=00:00:00,energy=0,fs/disk=7296460,mem=22628K,pages=12,vmem=880852K|cpu=cdr980,energy=cdr980,fs/disk=cdr980,mem=cdr980,pages=cdr980,vmem=cdr980|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=7296460,mem=22628K,pages=12,vmem=880852K|cpu=cdr980,energy=cdr980,fs/disk=cdr980,mem=cdr980,pages=cdr980,vmem=cdr980|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=7296460,mem=22628K,pages=12,vmem=880852K|energy=0,fs/disk=27460|energy=0,fs/disk=27460|energy=cdr980,fs/disk=cdr980|fs/disk=0|energy=0,fs/disk=27460|energy=cdr980,fs/disk=cdr980|fs/disk=0|energy=0,fs/disk=27460|||00:01.032|||| def-siegert-ab_cpu||32|1|billing=32,cpu=32,mem=125G,node=1|21181|00:00:00|1.20G|0.00M|0|0|928K|144572K||cedar|||0|0|10:56:00|39360|551646502||00:20:30|1230|2021-01-07T16:58:32|2021-01-07T17:19:02|0:0||||58853154.extern|58853154.extern|extern|Unknown|0.00M|cdr980|0|0|cdr980|0|0|cdr980|0|928K|cdr980|0|144572K|cdr980|0||00:00:00|cdr980|0|32|1|cdr980|1||||||0|0|0|0|32|125Gn|1|||||||2021-01-07T16:58:32|COMPLETED|2021-01-07T16:58:32|00:00:00|00:00.001||||00:00.002|cpu=00:00:00,energy=0,fs/disk=2012,mem=928K,pages=0,vmem=144572K|cpu=00:00:00,energy=0,fs/disk=2012,mem=928K,pages=0,vmem=144572K|cpu=cdr980,energy=cdr980,fs/disk=cdr980,mem=cdr980,pages=cdr980,vmem=cdr980|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=928K,pages=0,vmem=144572K|cpu=cdr980,energy=cdr980,fs/disk=cdr980,mem=cdr980,pages=cdr980,vmem=cdr980|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=928K,pages=0,vmem=144572K|energy=0,fs/disk=0|energy=0,fs/disk=0|energy=cdr980,fs/disk=cdr980|fs/disk=0|energy=0,fs/disk=0|energy=cdr980,fs/disk=cdr980|fs/disk=0|energy=0,fs/disk=0|||00:00.001|||| def-siegert-ab_cpu||32|1|cpu=32,mem=125G,node=1|21181|09:53:16|2K|5.55M|0.09M|182|213112848K|213453872K||cedar|||0|0|10:46:24|38784|551646502||00:20:12|1212|2021-01-07T16:58:46|2021-01-07T17:18:58|0:0||||58853154.0|58853154.0|run_xhpl_prv|Cyclic|5.55M|cdr980|0|0.09M|cdr980|0|182|cdr980|0|213112848K|cdr980|0|213453872K|cdr980|0||09:53:16|cdr980|0|32|1|cdr980|1||||||Unknown|Unknown|Unknown|Unknown|32|125Gn|1|||||||2021-01-07T16:58:46|COMPLETED|2021-01-07T16:58:46|00:00:00|02:20.433||||09:53:32|cpu=09:53:16,energy=0,fs/disk=5820751,mem=213112848K,pages=182,vmem=213453872K|cpu=09:53:16,energy=0,fs/disk=5820751,mem=213112848K,pages=182,vmem=213453872K|cpu=cdr980,energy=cdr980,fs/disk=cdr980,mem=cdr980,pages=cdr980,vmem=cdr980|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=09:53:16,energy=0,fs/disk=5820751,mem=213112848K,pages=182,vmem=213453872K|cpu=cdr980,energy=cdr980,fs/disk=cdr980,mem=cdr980,pages=cdr980,vmem=cdr980|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=09:53:16,energy=0,fs/disk=5820751,mem=213112848K,pages=182,vmem=213453872K|energy=0,fs/disk=91402|energy=0,fs/disk=91402|energy=cdr980,fs/disk=cdr980|fs/disk=0|energy=0,fs/disk=91402|energy=cdr980,fs/disk=cdr980|fs/disk=0|energy=0,fs/disk=91402|||09:51:11|||| Slurm still claims that I used 203GB of memory on a 
node that only has 128GB.

(In reply to Martin Siegert from comment #8)
> Slurm still claims that I used 203GB of memory on a node that only has 128GB.
Is this from a new job run? Any previously recorded numbers will not change due to this config change.

This is a new job run after the change. A colleague of mine suggested using
> JobAcctGatherType=jobacct_gather/cgroup
which sounds reasonable to me. Is that worth a try?

(In reply to Martin Siegert from comment #10)
> This is a new job run after the change.
> A colleague of mine suggested to use
> JobAcctGatherType=jobacct_gather/cgroup
> which sounds reasonable to me.
> Is that worth a try?
Yes; however, it would be nice to confirm the issue and fix the code if needed. If possible, please set the following in slurm.conf:
> SlurmdDebug=debug4
Restart slurmd on the test nodes, then upload the slurmd logs from a test job run. Please revert the change when done, as it can cause Slurm to run slower.

Created attachment 17400 [details]
slurmd log for node with debug4 of JobAcctGatherType=jobacct_gather/cgroup and linux
I've attached the log for this issue.
Using JobAcctGatherType=jobacct_gather/cgroup seems to report the correct memory, while jobacct_gather/linux does not. This is a new issue since our upgrade to 20.11.0, as we had been using jobacct_gather/linux before the upgrade.
Thanks,
Adam
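For reference, a minimal sketch of the cgroup-based settings tried above, assuming cgroups are already configured on the compute nodes (cgroup.conf and the related plugin setup are site-specific and not shown):

# slurm.conf (excerpt)
# Track processes and gather accounting via cgroups:
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
# UsePSS stays disabled; it is incompatible with proctrack/cgroup (bug#10587):
#JobAcctGatherParams=UsePSS
# Temporary, only while collecting the verbose slurmd logs requested above; revert afterwards:
SlurmdDebug=debug4

After changing these values, slurmctld and the slurmd on the test nodes need to be restarted (for example with systemctl restart slurmd on each node), as was done for the runs above.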
Adam,

Reducing this to sev4 since this issue no longer actively affects your site and is now a research bug. Please reply if you have any objections.

Thanks,
--Nate

Adam,

We have determined this is a duplicate of bug#10538, which can be corrected by reverting commit d86339dc27ed1b95 in the source. We will continue debugging the issue in bug#10538.

Thanks,
--Nate

*** This ticket has been marked as a duplicate of ticket 10538 ***

Please note that we cannot view bug #10538.
- Martin

Martin - I will have Nate see if we can open the associated bug to the public. We can also keep this open and communicate the progress through this bug to you, if that is something you desire.

You should now be able to view 10538. We did mark confidential data private, but the bulk of the important bits should still be in that bug.

*** This ticket has been marked as a duplicate of ticket 10538 ***

Martin,

The fix for bug#10538 is now upstream:

(In reply to Felip Moll from comment #68)
> We committed a fix which will be in 20.11.3+, commits
> 1ba2875649272..8b68aff3dcb7d.
>
> The issue was due to a change where we modified the way we store the list of
> process statistics for a step. Now we don't remove the precs anymore because
> we need some values for each pid at the end of the job in order to
> accumulate syscpu and user cpu times. The issue was that memory was also
> incorrectly summed for all dead pids, which along with the rest of TRES
> required a different treatment than consumed cpu seconds.
>
> We invalidate the TRES values (including memory) for past precs.
>
> This release (20.11.3) should be coming out soon (around next week).
>
> Thank you for reporting.
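Once a cluster is running the fixed release, a quick way to verify is to re-run the same kind of memory-heavy test job and check the recorded peak against the node's physical RAM; a minimal sketch, where <jobid> is a placeholder for the new job:

# sinfo --version
# sacct -j <jobid> -o JobID,JobName,MaxRSS,MaxRSSNode,ReqMem,State -p

With jobacct_gather/linux on 20.11.3 or later, MaxRSS for the work step is expected to stay within the node's memory (here, under the 128GB of cdr980) rather than the roughly 203GB reported above.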