Summary: | reported memory usage is incorrect | ||
---|---|---|---|
Product: | Slurm | Reporter: | Martin Siegert <siegert> |
Component: | Accounting | Assignee: | Nate Rini <nate> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | asa188, kaizaad, nate |
Version: | 20.11.0 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Simon Fraser University | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | ---
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | | Version Fixed: |
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- | ||
Attachments: | slurmd log for node with debug4 of JobAcctGatherType=jobacct_gather/cgroup and linux |
Description
Martin Siegert
2021-01-07 13:50:40 MST
Please provide the following:
> sacct -j 58829413 -o all -p
and your slurm.conf (& friends).
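As an aside, a narrower sacct query is often enough to spot the suspicious memory figures; a minimal sketch, using the job ID from this report and standard sacct output fields:

# sacct -j 58829413 -o JobID,JobName,State,ReqMem,MaxRSS,MaxRSSNode,AveRSS -p

MaxRSS here is the peak resident set size recorded per step by the jobacct_gather plugin, which is the value being compared against the node's physical memory in this report.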
Created attachment 17386 [details]
sacct -j 58829413 -o all -p output
Created attachment 17387 [details]
slurm.conf
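As an aside, when there is any doubt about which settings a running cluster actually has (a question that comes up later in this thread), the live configuration can be queried instead of reading slurm.conf from disk; a minimal sketch, where the grep pattern is only illustrative:

# scontrol show config | grep -Ei 'proctracktype|jobacctgather'

This prints the ProctrackType, JobAcctGatherType, JobAcctGatherFrequency and JobAcctGatherParams values that the running slurmctld is using, which helps confirm that a configuration change has actually taken effect after a restart.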
I see that your slurm.conf has:
> ProctrackType=proctrack/pgid
> JobAcctGatherType=jobacct_gather/linux
Why not use the following, so that process tracking matches the accounting plugin?
> ProctrackType=proctrack/linux
Based on the logs provided, Slurm is doubling the usage, which could easily be due to xhpl calling fork(); proctrack/pgid can have issues handling fork trees.

We have ProctrackType=proctrack/cgroup, which I thought is the recommended setting - isn't it?

(In reply to Martin Siegert from comment #6)
> We have
> ProctrackType=proctrack/cgroup
> which I thought is the recommended setting - isn't it?
Oops, I must have looked at the wrong slurm.conf (locally). Yes, that is the suggested setting. Please disable
> JobAcctGatherParams=UsePSS
proctrack/cgroup and UsePSS are incompatible (bug#10587). Please make sure to restart all daemons after disabling UsePSS.

I commented out the JobAcctGatherParams=UsePSS line in slurm.conf and then restarted slurmctld and all slurmd. It has no effect:
# sacct -j 58853154 -o all -p
Account|AdminComment|AllocCPUS|AllocNodes|AllocTRES|AssocID|AveCPU|AveCPUFreq|AveDiskRead|AveDiskWrite|AvePages|AveRSS|AveVMSize|BlockID|Cluster|Comment|Constraints|ConsumedEnergy|ConsumedEnergyRaw|CPUTime|CPUTimeRAW|DBIndex|DerivedExitCode|Elapsed|ElapsedRaw|Eligible|End|ExitCode|Flags|GID|Group|JobID|JobIDRaw|JobName|Layout|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|MaxPages|MaxPagesNode|MaxPagesTask|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|McsLabel|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOS|QOSRAW|Reason|ReqCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqCPUS|ReqMem|ReqNodes|ReqTRES|Reservation|ReservationId|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|SystemComment|Timelimit|TimelimitRaw|TotalCPU|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutAve|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutMin|TRESUsageOutMinNode|TRESUsageOutMinTask|TRESUsageOutTot|UID|User|UserCPU|WCKey|WCKeyID|WorkDir|
def-siegert-ab_cpu||32|1|billing=32,cpu=32,mem=125G,node=1|21181|||||||||cedar||broadwell|0|0|10:53:52|39232|551646502|0:0|00:20:26|1226|2021-01-07T16:39:57|2021-01-07T17:18:58|0:0|SchedBackfill|3000123|siegert|58853154|58853154|hpl-memtest.slrm|||||||||||||||||||||32|1|cdr980||1117375|c12hbackfill|normal|1|None|Unknown|Unknown|Unknown|Unknown|32|125Gn|1|billing=32,cpu=32,mem=125G,node=1|||00:18:35|09:54:40|35680|2021-01-07T16:58:32|COMPLETED|2021-01-07T16:39:57|00:00:00|02:20.767||02:00:00|120|09:53:33|||||||||||||||||3000123|siegert|09:51:12||0|/project/6001524/siegert/benchmarks/hpl/cpu-only/32|
def-siegert-ab_cpu||32|1|cpu=32,mem=125G,node=1|21181|00:00:00|2.10G|6.96M|0.03M|12|22628K|880852K||cedar|||0|0|10:53:52|39232|551646502||00:20:26|1226|2021-01-07T16:58:32|2021-01-07T17:18:58|0:0||||58853154.batch|58853154.batch|batch|Unknown|6.96M|cdr980|0|0.03M|cdr980|0|12|cdr980|0|22628K|cdr980|0|880852K|cdr980|0||00:00:00|cdr980|0|32|1|cdr980|1||||||0|0|0|0|32|125Gn|1|||||||2021-01-07T16:58:32|COMPLETED|2021-01-07T16:58:32|00:00:00|00:00.333||||00:01.365|cpu=00:00:00,energy=0,fs/disk=7296460,mem=22628K,pages=12,vmem=880852K|cpu=00:00:00,energy=0,fs/disk=7296460,mem=22628K,pages=12,vmem=880852K|cpu=cdr980,energy=cdr980,fs/disk=cdr980,mem=cdr980,pages=cdr980,vmem=cdr980|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=7296460,mem=22628K,pages=12,vmem=880852K|cpu=cdr980,energy=cdr980,fs/disk=cdr980,mem=cdr980,pages=cdr980,vmem=cdr980|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=7296460,mem=22628K,pages=12,vmem=880852K|energy=0,fs/disk=27460|energy=0,fs/disk=27460|energy=cdr980,fs/disk=cdr980|fs/disk=0|energy=0,fs/disk=27460|energy=cdr980,fs/disk=cdr980|fs/disk=0|energy=0,fs/disk=27460|||00:01.032|||| def-siegert-ab_cpu||32|1|billing=32,cpu=32,mem=125G,node=1|21181|00:00:00|1.20G|0.00M|0|0|928K|144572K||cedar|||0|0|10:56:00|39360|551646502||00:20:30|1230|2021-01-07T16:58:32|2021-01-07T17:19:02|0:0||||58853154.extern|58853154.extern|extern|Unknown|0.00M|cdr980|0|0|cdr980|0|0|cdr980|0|928K|cdr980|0|144572K|cdr980|0||00:00:00|cdr980|0|32|1|cdr980|1||||||0|0|0|0|32|125Gn|1|||||||2021-01-07T16:58:32|COMPLETED|2021-01-07T16:58:32|00:00:00|00:00.001||||00:00.002|cpu=00:00:00,energy=0,fs/disk=2012,mem=928K,pages=0,vmem=144572K|cpu=00:00:00,energy=0,fs/disk=2012,mem=928K,pages=0,vmem=144572K|cpu=cdr980,energy=cdr980,fs/disk=cdr980,mem=cdr980,pages=cdr980,vmem=cdr980|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=928K,pages=0,vmem=144572K|cpu=cdr980,energy=cdr980,fs/disk=cdr980,mem=cdr980,pages=cdr980,vmem=cdr980|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=928K,pages=0,vmem=144572K|energy=0,fs/disk=0|energy=0,fs/disk=0|energy=cdr980,fs/disk=cdr980|fs/disk=0|energy=0,fs/disk=0|energy=cdr980,fs/disk=cdr980|fs/disk=0|energy=0,fs/disk=0|||00:00.001|||| def-siegert-ab_cpu||32|1|cpu=32,mem=125G,node=1|21181|09:53:16|2K|5.55M|0.09M|182|213112848K|213453872K||cedar|||0|0|10:46:24|38784|551646502||00:20:12|1212|2021-01-07T16:58:46|2021-01-07T17:18:58|0:0||||58853154.0|58853154.0|run_xhpl_prv|Cyclic|5.55M|cdr980|0|0.09M|cdr980|0|182|cdr980|0|213112848K|cdr980|0|213453872K|cdr980|0||09:53:16|cdr980|0|32|1|cdr980|1||||||Unknown|Unknown|Unknown|Unknown|32|125Gn|1|||||||2021-01-07T16:58:46|COMPLETED|2021-01-07T16:58:46|00:00:00|02:20.433||||09:53:32|cpu=09:53:16,energy=0,fs/disk=5820751,mem=213112848K,pages=182,vmem=213453872K|cpu=09:53:16,energy=0,fs/disk=5820751,mem=213112848K,pages=182,vmem=213453872K|cpu=cdr980,energy=cdr980,fs/disk=cdr980,mem=cdr980,pages=cdr980,vmem=cdr980|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=09:53:16,energy=0,fs/disk=5820751,mem=213112848K,pages=182,vmem=213453872K|cpu=cdr980,energy=cdr980,fs/disk=cdr980,mem=cdr980,pages=cdr980,vmem=cdr980|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=09:53:16,energy=0,fs/disk=5820751,mem=213112848K,pages=182,vmem=213453872K|energy=0,fs/disk=91402|energy=0,fs/disk=91402|energy=cdr980,fs/disk=cdr980|fs/disk=0|energy=0,fs/disk=91402|energy=cdr980,fs/disk=cdr980|fs/disk=0|energy=0,fs/disk=91402|||09:51:11|||| Slurm still claims that I used 203GB of memory on a 
node that only has 128GB.

(In reply to Martin Siegert from comment #8)
> Slurm still claims that I used 203GB of memory on a node that only has 128GB.
Is this from a new job run? Any previously recorded numbers will not change due to this config change.

This is a new job run after the change. A colleague of mine suggested using
> JobAcctGatherType=jobacct_gather/cgroup
which sounds reasonable to me. Is that worth a try?

(In reply to Martin Siegert from comment #10)
> This is a new job run after the change.
> A colleague of mine suggested to use
> JobAcctGatherType=jobacct_gather/cgroup
> which sounds reasonable to me.
> Is that worth a try?
Yes; however, it would be nice to confirm the issue and fix the code if needed. If possible, please set the following in slurm.conf:
> SlurmdDebug=debug4
Restart slurmd on the test nodes, then upload the slurmd logs from a test job run. Please revert the change when done, as it can cause Slurm to run slower.

Created attachment 17400 [details]
slurmd log for node with debug4 of JobAcctGatherType=jobacct_gather/cgroup and linux
I've attached the log for this issue.
Using JobAcctGatherType=jobacct_gather/cgroup seems to report the correct memory, while jobacct_gather/linux does not. This is a new issue since our upgrade to 20.11.0, as we had been using jobacct_gather/linux before the upgrade.
Thanks,
Adam
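For reference, a minimal sketch of the cgroup-based settings tried above, assuming cgroups are already configured on the compute nodes (cgroup.conf and the related plugin setup are site-specific and not shown):

# slurm.conf (excerpt)
# Track processes and gather accounting via cgroups:
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
# UsePSS stays disabled; it is incompatible with proctrack/cgroup (bug#10587):
#JobAcctGatherParams=UsePSS
# Temporary, only while collecting the verbose slurmd logs requested above; revert afterwards:
SlurmdDebug=debug4

After changing these values, slurmctld and the slurmd on the test nodes need to be restarted (for example with systemctl restart slurmd on each node), as was done for the runs above.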
Adam,

Reducing this to sev4 since this issue no longer actively affects your site and is now a research bug. Please reply if you have any objections.

Thanks,
--Nate

Adam,

We have determined this is a duplicate of bug#10538, which can be corrected by reverting commit d86339dc27ed1b95 in the source. We will continue debugging the issue in bug#10538.

Thanks,
--Nate

*** This ticket has been marked as a duplicate of ticket 10538 ***

Please note that we cannot view bug #10538.
- Martin

Martin - I will have Nate see if we can open the associated bug to the public. We can also keep this open and communicate the progress through this bug to you, if that is something you desire.

You should now be able to view 10538. We did mark confidential data private, but the bulk of the important bits should still be in that bug.

*** This ticket has been marked as a duplicate of ticket 10538 ***

Martin,

The fix for bug#10538 is now upstream:

(In reply to Felip Moll from comment #68)
> We committed a fix which will be in 20.11.3+, commits
> 1ba2875649272..8b68aff3dcb7d.
>
> The issue was due to a change where we modified the way we store the list of
> process statistics for a step. Now we don't remove the precs anymore because
> we need some values for each pid at the end of the job in order to
> accumulate syscpu and user cpu times. The issue was that memory was also
> incorrectly summed for all dead pids, which along with the rest of TRES
> required a different treatment than consumed cpu seconds.
>
> We invalidate the TRES values (including memory) for past precs.
>
> This release (20.11.3) should be coming out soon (around next week).
>
> Thank you for reporting.
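Once a cluster is running the fixed release, a quick way to verify is to re-run the same kind of memory-heavy test job and check the recorded peak against the node's physical RAM; a minimal sketch, where <jobid> is a placeholder for the new job:

# sinfo --version
# sacct -j <jobid> -o JobID,JobName,MaxRSS,MaxRSSNode,ReqMem,State -p

With jobacct_gather/linux on 20.11.3 or later, MaxRSS for the work step is expected to stay within the node's memory (here, under the 128GB of cdr980) rather than the roughly 203GB reported above.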