Summary: | sacct reporting maxrss higher than the node available memory
---|---
Product: | Slurm
Reporter: | dl_support-hpc
Component: | Other
Assignee: | Albert Gil <albert.gil>
Status: | RESOLVED INFOGIVEN
Severity: | 4 - Minor Issue
Version: | 20.02.1
Hardware: | Linux
OS: | Linux
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8656
Site: | WEHI
Description
dl_support-hpc, 2020-05-07 23:41:39 MDT
Hi Laszlo,

These values look wrong. We have an open issue regarding the accounting with cgroups on bug 8656, but I'm not certain whether that's your case. I'm wondering whether the fact that the step was OUT_OF_MEMORY may be related to the wrong value; do you see a similar scenario with any other jobs?

Could you attach your slurm.conf, slurmdbd.conf and cgroup.conf (in case you use cgroups)? Also, could you post the output of:

    $ sacct -j 34256 --long -p

I'll need to reproduce your issue, so any hint that you can provide to reproduce it will help.

Thanks, Albert

Created attachment 14213 [details]: slurm.conf
Created attachment 14214 [details]: slurm.conf
Created attachment 14215 [details]: cgroup.conf
Created attachment 14216 [details]: slurmdbd.conf
Hi Albert,

As requested:

[root@milton-login01 ~]# sacct -j 34256 --long -p
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
34256|34256|bionix-xHLA|regular||||||||||||||||||1|00:29:11|FAILED|1:0||Unknown|Unknown|Unknown|15Gn|0|||||||||||billing=31,cpu=1,mem=15G,node=1|billing=31,cpu=1,mem=15G,node=1||||||||||||||
34256.extern|34256.extern|extern||142576K|milton-sml-002|0|142576K|888K|milton-sml-002|0|888K|0|milton-sml-002|0|0|00:00:00|milton-sml-002|0|00:00:00|1|1|00:29:11|COMPLETED|0:0|2.59M|0|0|0|15Gn|0|0.00M|milton-sml-002|0|0.00M|0|milton-sml-002|0|0||||billing=31,cpu=1,mem=15G,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=888K,pages=0,vmem=142576K|cpu=00:00:00,energy=0,fs/disk=2012,mem=888K,pages=0,vmem=142576K|cpu=milton-sml-002,energy=milton-sml-002,fs/disk=milton-sml-002,mem=milton-sml-002,pages=milton-sml-002,vmem=milton-sml-002|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=888K,pages=0,vmem=142576K|cpu=milton-sml-002,energy=milton-sml-002,fs/disk=milton-sml-002,mem=milton-sml-002,pages=milton-sml-002,vmem=milton-sml-002|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=888K,pages=0,vmem=142576K|energy=0,fs/disk=0|energy=milton-sml-002,fs/disk=milton-sml-002|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|
34256.0|34256.0|chroot||372041024K|milton-sml-002|0|372041024K|320389892K|milton-sml-002|0|320389892K|24172|milton-sml-002|0|19|00:12:10|milton-sml-002|0|00:00:00|1|1|00:29:08|OUT_OF_MEMORY|0:125|2.59M|Unknown|Unknown|Unknown|15Gn|0|4161.86M|milton-sml-002|0|4161.86M|4578.07M|milton-sml-002|0|4578.07M||||cpu=1,mem=15G,node=1|cpu=00:00:00,energy=0,fs/disk=4364022996,mem=320389892K,pages=19,vmem=372041024K|cpu=00:12:10,energy=0,fs/disk=4364022996,mem=320389892K,pages=24172,vmem=372041024K|cpu=milton-sml-002,energy=milton-sml-002,fs/disk=milton-sml-002,mem=milton-sml-002,pages=milton-sml-002,vmem=milton-sml-002|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:12:10,energy=0,fs/disk=4364022996,mem=320389892K,pages=24172,vmem=372041024K|cpu=milton-sml-002,energy=milton-sml-002,fs/disk=milton-sml-002,mem=milton-sml-002,pages=milton-sml-002,vmem=milton-sml-002|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=4364022996,mem=320389892K,pages=19,vmem=372041024K|energy=0,fs/disk=4800456113|energy=milton-sml-002,fs/disk=milton-sml-002|fs/disk=0|energy=0,fs/disk=4800456113|energy=0,fs/disk=4800456113|

We had not seen this behaviour before; we do not have many users on the system, so it could have slipped under the radar. The requested files are attached.

Thanks, Laszlo

Hi Laszlo,

I'm still not certain about what may be going on. Can you attach the slurmd logs of milton-sml-002 from 2020-05-08? The logs of slurmctld and slurmdbd may also help. What is the actual physical memory of milton-sml-002? Is the job actually running a chroot command? Are you able to reproduce the issue?

Thanks, Albert

PS: While looking at your config files, I've detected a couple of minor issues totally unrelated to this bug:
- JobAcctGatherParams=NoOverMemoryKill was deprecated in 19.05 (https://github.com/SchedMD/slurm/blob/slurm-19.05/RELEASE_NOTES#L50)
- The DefaultStorage* options are unnecessary because you're already using AccountingStorage* and JobCompType=jobcomp/none.
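For reference, a minimal sketch (not the actual WEHI configuration) of how the relevant slurm.conf lines could look after that cleanup; the slurmdbd hostname below is a placeholder:

    # Accounting goes through slurmdbd, so the DefaultStorage* lines
    # can simply be removed rather than duplicated here.
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbd-host.example.org    # placeholder hostname
    JobCompType=jobcomp/none

    # Usage is gathered from /proc; NoOverMemoryKill was deprecated in
    # 19.05 and no longer needs to be listed in JobAcctGatherParams.
    JobAcctGatherType=jobacct_gather/linux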
Hi Albert,

I do not have the logs on milton-sml-002; this is what I got from slurmctld:

[2020-05-08T14:27:33.160] sched: _slurm_rpc_allocate_resources JobId=34256 NodeList=milton-sml-002 usec=296
[2020-05-08T14:27:33.160] prolog_running_decr: Configuration for JobId=34256 is complete
[2020-05-08T14:56:44.586] _job_complete: JobId=34256 WEXITSTATUS 1
[2020-05-08T14:56:44.586] _job_complete: JobId=34256 done

There are no details in the slurmdbd logs.

The memory on milton-sml-002:

[root@milton-sml-002 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:            113           5         105           0           2         106
Swap:             4           0           4

We do not use a chroot environment.

Thanks, Laszlo

Hi Laszlo,

> I do not have the logs on milton-sml-002; this is what I got from slurmctld [...]

Thanks for the information. Please note that the problem seems to be in the gathering of information from the OS of milton-sml-002 while the job was running on it. You are using jobacct_gather/linux, meaning that the info was gathered from /proc. The job is marked as OUT_OF_MEMORY, and as you are using task/cgroup with ConstrainRAMSpace=yes, it was constrained and killed by cgroups. This combination works just fine, and is actually the recommended one. Without knowing what was going on on milton-sml-002, and without being able to reproduce it, I'm afraid that we won't be able to provide a clear explanation.

> We do not use a chroot environment.

OK. I was just curious because the out-of-memory step (34256.0) has the JobName "chroot", and if the user didn't set it manually it is the command name by default.

If we are not able to reproduce the issue, we should close it as "cannot reproduce".

Regards, Albert

Hi Albert,

Let's close it; if it reoccurs I will open a new ticket.

Thanks, Laszlo

OK Laszlo,

Closing as infogiven.

Regards, Albert

Hi Laszlo,

I don't know how I missed it, but I just realized that you are not using either of these two JobAcctGatherParams:

NoShared: Exclude shared memory from accounting.
UsePss: Use PSS values instead of RSS to calculate real usage of memory. The PSS value will be saved as RSS.

A higher MaxRSS value may be related to a job/step with multiple processes that share some memory: as the accounting is done by default per PID, as reported by /proc, that shared memory may be aggregated several times. Maybe you want to use one or both options.

Regards, Albert
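As a rough illustration of that aggregation effect (a generic /proc check, not something Slurm itself runs), summed per-process RSS can be compared against PSS on the compute node while a multi-process step is active; the user name below is a placeholder:

    # RSS counts shared pages once per process, so summing it across a
    # step's processes can exceed what is physically resident.
    # PSS splits shared pages among their sharers.
    for pid in $(pgrep -u some_user); do
        rss=$(awk '/^VmRSS:/ {print $2}' /proc/$pid/status)
        pss=$(awk '/^Pss:/ {s += $2} END {print s}' /proc/$pid/smaps)
        echo "pid=$pid rss=${rss}kB pss=${pss}kB"
    done

NoShared and UsePss address exactly this double counting in the gathered MaxRSS.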
Hi Albert,

I am doing this:

salloc -n 1
salloc: Pending job allocation 172612
salloc: job 172612 queued and waiting for resources
salloc: job 172612 has been allocated resources
salloc: Granted job allocation 172612

bash-4.2$ scontrol show job 172612
JobId=172612 JobName=sh
   UserId=kun.l(3899) GroupId=allstaff(10908) MCS_label=N/A
   Priority=456 Nice=0 Account=wehi QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:22 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2020-07-21T17:13:42 EligibleTime=2020-07-21T17:13:42
   AccrueTime=2020-07-21T17:13:42
   StartTime=2020-07-21T17:14:04 EndTime=2020-07-23T17:14:04 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-21T17:14:04
   Partition=regular AllocNode:Sid=slurm-login:23983
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=milton-sml-006
   BatchHost=milton-sml-006
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=10M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=10M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/stornext/Home/data/allstaff/k/kun.l
   Power=
   MailUser=(null) MailType=NONE

sacct -o reqmem,maxrss,maxvm,elapsed -j 172612
    ReqMem     MaxRSS  MaxVMSize    Elapsed
---------- ---------- ---------- ----------
      10Mc                         00:03:34
      10Mc                         00:03:34
      10Mc                         00:03:31

bash-4.2$ scontrol show config | grep -i def
AcctGatherNodeFreq      = 0 sec
CpuFreqDef              = Unknown
DefMemPerCPU            = 10

bash-4.2$ scontrol show config | grep -i JobAcctGatherParams
JobAcctGatherParams     = NoShared,UsePss

Why am I not seeing any value for MaxRSS or MaxVMSize?

Thanks, Laszlo

Hi Laszlo,

> Why am I not seeing any value for MaxRSS or MaxVMSize?

This probably belongs in a new ticket, just to keep each ticket focused on one main issue or question, but I'll try to answer it here for now.

MaxRSS and MaxVMSize are values gathered "on tasks". So, for example, if I do:

>> salloc -n 1
salloc: Granted job allocation 16152
>> sleep 5
>> srun sleep 10
>> sleep 15
>> srun sleep 20
>> exit
exit
salloc: Relinquishing job allocation 16152

then I get:

>> sacct -o jobid,jobname,reqmem,maxrss,maxvm,elapsed -j 16152
       JobID    JobName     ReqMem     MaxRSS  MaxVMSize    Elapsed
------------ ---------- ---------- ---------- ---------- ----------
16152              bash       16Gn                         00:01:26
16152.0           sleep       16Gn       983K    424864K   00:00:10
16152.1           sleep       16Gn       995K    424864K   00:00:20

So you can see that I only get the MaxRSS and MaxVMSize of the actual tasks launched with srun (the 10s and 20s sleeps, not the 5s or 15s ones). We don't gather those values for the job allocation itself, just for actual steps.

A similar thing happens with sbatch:

>> sbatch -n1 --wrap "sleep 5; srun sleep 10; sleep 15; srun sleep 20"
Submitted batch job 16153
>> sacct -o jobid,jobname,reqmem,maxrss,maxvm,elapsed -j 16153
       JobID    JobName     ReqMem     MaxRSS  MaxVMSize    Elapsed
------------ ---------- ---------- ---------- ---------- ----------
16153              wrap       16Gn                         00:00:50
16153.batch       batch       16Gn      1159K    356512K   00:00:50
16153.0           sleep       16Gn       963K    424868K   00:00:10
16153.1           sleep       16Gn       981K    424868K   00:00:20

As you can see, in this case we also gather the .batch step, though.

Does that answer your question? Or are you seeing different results? If you have further questions on this, please open a new bug referring to these comments, so we can better track the issue.

Thanks, Albert
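Following that explanation, a minimal sketch of how the earlier interactive test could be rerun so that a step record with MaxRSS/MaxVMSize is produced; the workload command, memory request and <jobid> below are placeholders:

    salloc -n 1 --mem=1G     # placeholder memory request
    srun ./my_workload       # placeholder command; launching it with srun creates step <jobid>.0
    exit                     # release the allocation
    # the .0 step row should now carry the gathered values:
    sacct -o jobid,jobname,reqmem,maxrss,maxvmsize,elapsed -j <jobid>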