Ticket 13656

Summary: sstat reports all memory/swap in use, but host isn't swapping
Product: Slurm Reporter: Renata Dart <renata>
Component: Scheduling    Assignee: Albert Gil <albert.gil>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 21.08.4   
Hardware: Linux   
OS: Linux   
Site: SLAC Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurmctld.log-20220320.gz
volt004.slurmd.log.gz

Description Renata Dart 2022-03-18 08:30:59 MDT
Hi SchedMD, we had a job yesterday that seemed to say it was using 600+GB of memory (according to sstat), but the host has 400GB of memory and 200GB of swap, and no swapping was going on:


❯ sudo sstat 2110865 -a
JobID         MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask  AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------ -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- --------------- ------------------- ------------------- --------------- ------------------- ------------------- ---------------
2110865.ext+    212052K        volt004              0    108056K       844K    volt004          0       356K        0      volt004              0          0   00:00:00    volt004          0   00:00:00        1    799.93M             0             0             0              0         2012         volt004               0         2012            0          volt004                0            0 cpu=00:00:00,+ cpu=00:00:00,+ cpu=volt004,energ+ cpu=00:00:00,fs/d+ cpu=00:00:00,+ cpu=volt004,energ+ cpu=00:00:00,fs/d+ cpu=00:00:00,+ energy=0,fs/di+ energy=0,fs/di+ energy=volt004,fs/+           fs/disk=0 energy=0,fs/di+ energy=volt004,fs/+           fs/disk=0 energy=0,fs/di+
2110865.bat+ 869045328K        volt004              0  49387560K 629784824K    volt004          0  36284920K       12      volt004              0         12   01:09:35    volt004          0   01:09:35        1       869K             0             0             0              0 829132471519         volt004               0 829132471519 651390212082          volt004                0 651390212082 cpu=01:09:35,+ cpu=01:09:35,+ cpu=volt004,energ+ cpu=00:00:00,fs/d+ cpu=01:09:35,+ cpu=volt004,energ+ cpu=00:00:00,fs/d+ cpu=01:09:35,+ energy=0,fs/di+ energy=0,fs/di+ energy=volt004,fs/+           fs/disk=0 energy=0,fs/di+ energy=volt004,fs/+           fs/disk=0 energy=0,fs/di+


And sacct shows:

sacct -X -j 2110865 --format=user,jobid,start,alloctres%70
   User JobID           Start                               AllocTRES
--------- ------------ ------------------- ----------------------------------------------------------------------
 yanyanz 2110865   2022-03-16T15:21:30    billing=16,cpu=16,gres/gpu:v100=2,gres/gpu=2,mem=190000M,node=1

Is sstat the right tool to check on job resource usage, or is there something else that we should be using?  Or are our cgroups not set/working properly:

> ProctrackType=proctrack/cgroup
> TaskPlugin=task/cgroup,task/affinity

This is our gres.conf:

################################################################################
## slurm gres conf
################################################################################

#AutoDetect=nvml

###
# hep nodes
###
#NodeName=hep-gpu01           Name=gpu   Type=geforce_gtx_1080_ti   Count=8   File=/dev/nvidia[0,2-4,6-9]
#NodeName=hep-gpu01           Name=gpu   Type=titan_xp              Count=2   File=/dev/nvidia[1,5]

###
# pascal 1080ti nodes
###
NodeName=psc[000-009]        Name=gpu   Type=geforce_gtx_1080_ti   Count=10  File=/dev/nvidia[0-9]

###
# turing 2080 nodes
###
NodeName=tur[000-026]         Name=gpu   Type=geforce_rtx_2080_ti   Count=10  File=/dev/nvidia[0-9]

###
# volta v100 nodes
###
NodeName=volt[000-005]         Name=gpu   Type=v100                  Count=4  File=/dev/nvidia[0-3]

###
# ampere a100 nodes
###
NodeName=ampt[000-019]         Name=gpu  Type=a100                   Count=4 File=/dev/nvidia[0-3]



Thanks,
Renata
Comment 1 Albert Gil 2022-03-22 04:18:35 MDT
Hi Renata,

First of all let me clarify some points to avoid further confusion:

> we had a job yesterday that seemed to say that it was using
> 600+GB of memory (using sstat), but the host has 400GB of memory and 200 of
> swap and no swapping was going on,
>
> sudo sstat 2110865 -a
>
>     MaxRSS
> ----------
> 629784824K
>
> And sacct shows:
> sacct -X -j 2110865 --format=user,jobid,start,alloctres%70
>
>   AllocTRES
> -----------
> mem=190000M

Please note that AllocTRES/mem and MaxRSS are two completely different values:
- AllocTRES/mem is the memory that Slurm allocates/schedules for the job, so the job can use that amount and only that (if memory limits are enforced).
- MaxRSS is the peak physical memory actually used by the job's Linux processes, gathered from what the kernel reported.


> Is sstat the right tool to check on job resource usage, or is there
> something else that we should be using?

The difference between sstat and sacct is that sstat uses information gathered from a running job while it is still "alive" in the cluster (it talks to slurmctld and slurmd), whereas sacct talks to slurmdbd and returns mostly the same information once it has been accounted in the DB, including after the job is no longer alive.
So you can also see the MaxRSS field with sacct, and it shouldn't differ from the last MaxRSS seen in sstat, apart from the small delay needed to save the gathered data.
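As a usage sketch of that lifecycle (using the job ID from this ticket; the field names follow the sstat/sacct man pages):

```shell
# While the job is still running: live data via slurmctld/slurmd
sstat -j 2110865 --format=JobID,MaxRSS,AveRSS,MaxVMSize

# Once the job has been accounted in slurmdbd (including after it finishes)
sacct -j 2110865 --format=JobID,MaxRSS,AveRSS,MaxVMSize
```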


> Or are our cgroups not set/working
> properly:
> 
> > ProctrackType=proctrack/cgroup
> > TaskPlugin=task/cgroup,task/affinity

The MaxRSS value is gathered by the JobAcctGatherType plugin, not by ProctrackType or TaskPlugin:

- https://slurm.schedmd.com/slurm.conf.html#OPT_JobAcctGatherType

Memory limit enforcement can be done by JobAcctGatherType or by TaskPlugin, though:

- https://slurm.schedmd.com/slurm.conf.html#OPT_OverMemoryKill
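For reference, memory-based job termination via the accounting plugin is configured in slurm.conf along these lines (an illustrative fragment with example values, not a recommendation for this site):

```
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=task=30
JobAcctGatherParams=OverMemoryKill
```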

> This is our gres.conf:

Actually, the memory allocated, the memory used, and limit enforcement are not related to gres.conf.
They are mainly governed by the slurm.conf options mentioned above.
The limit enforcement can also be related to cgroup.conf:

- https://slurm.schedmd.com/cgroup.conf.html#SECTION_TASK/CGROUP-PLUGIN


That being said, this does look like a problem:

> host has 400GB of memory and 200 of swap and no swapping was going on
>
>     MaxRSS
> ----------
> 629784824K

That would mean that for some reason MaxRSS was not gathered correctly.
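A quick unit conversion makes the inconsistency concrete; a minimal Python sketch using the values already shown in this ticket:

```python
# Values taken from this ticket
max_rss_kib = 629_784_824   # MaxRSS reported by sstat for step 2110865.bat+ (KiB)
node_ram_mib = 385_106      # RealMemory reported by slurmd for volt004 (MiB)

max_rss_gib = max_rss_kib / 1024 / 1024   # ~600.6 GiB
node_ram_gib = node_ram_mib / 1024        # ~376.1 GiB

# The reported peak RSS exceeds the node's total physical RAM
# (and even RAM plus the ~200GB of swap), so the gathered value
# cannot be correct.
print(max_rss_gib > node_ram_gib)  # True
```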

Could you post the output of "sacct -j 2110865 --format=jobid,maxrss"?


Regards,
Albert
Comment 2 Renata Dart 2022-03-22 08:06:02 MDT
Hi Albert, thanks for the explanation.   Here is
the output:


[renata@sdf-login02 ~]$ sacct -j 2110865 --format=jobid,maxrss
JobID            MaxRSS 
------------ ---------- 
2110865                 
2110865.bat+ 779022116K 
2110865.ext+       844K 

This is our cgroup.conf:

[renata@sdf-login02 ~]$ cat /var/spool/slurmd/conf-cache/cgroup.conf 
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes

TaskAffinity=no

ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

MemorySwappiness=0


But we have 

JobAcctGatherType=jobacct_gather/linux

rather than cgroup.  Should we change that?

Thanks,
Renata

Comment 3 Albert Gil 2022-03-22 08:48:31 MDT
Hi Renata,

Thanks for the info.

> Here is the output:
> 
> [renata@sdf-login02 ~]$ sacct -j 2110865 --format=jobid,maxrss
> JobID            MaxRSS 
> ------------ ---------- 
> 2110865                 
> 2110865.bat+ 779022116K 
> 2110865.ext+       844K 

OK, this is consistent with the sstat values, right?
But I would agree that both seem wrong.

> This is our cgroup.conf:
> 
> [renata@sdf-login02 ~]$ cat /var/spool/slurmd/conf-cache/cgroup.conf 
>
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> MemorySwappiness=0

Those values, plus the TaskPlugin=task/cgroup that you are using, should limit the memory.
But since MaxRSS is bigger than the node's RealMemory, the problem is clearly not in the enforcement but in the gathering.
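If you want to double-check enforcement on a live job, you can inspect the memory cgroup that Slurm creates for it (a sketch assuming the cgroup v1 hierarchy at the default mount point; the uid and job ID path components below are placeholders):

```shell
# cgroup v1 memory limit applied by task/cgroup for a running job
cat /sys/fs/cgroup/memory/slurm/uid_1234/job_2110865/memory.limit_in_bytes

# current and peak usage seen by the kernel for that job's cgroup
cat /sys/fs/cgroup/memory/slurm/uid_1234/job_2110865/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/slurm/uid_1234/job_2110865/memory.max_usage_in_bytes
```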

> But we have 
> 
> JobAcctGatherType=jobacct_gather/linux
> 
> rather than cgroup.  Should we change that?

You don't need to; both JobAcctGatherType plugins should be fine here.

What we do need now is a reproducer to investigate the issue.
Besides job 2110865, are you able to reproduce the issue?

Could you attach the slurmctld logs from 2022-03-16?
Could you also attach the slurmd logs of node volt004 from 2022-03-16?

Regards,
Albert
Comment 4 Renata Dart 2022-03-22 09:42:41 MDT
Created attachment 23988 [details]
slurmctld.log-20220320.gz

Hi Albert, I don't know if it is reproducible; it may have occurred
before but gone unnoticed.  If I can successfully request this amount
of memory on a similar host, would that be a good test?  That isn't
exactly what happened in this case, though.  I have attached the two logs.

Thanks,
Renata


Comment 5 Renata Dart 2022-03-22 09:42:42 MDT
Created attachment 23989 [details]
volt004.slurmd.log.gz
Comment 7 Albert Gil 2022-03-23 08:55:03 MDT
Hi Renata,

Thanks for the logs!
They look fine, but I also see some particularities that may help me reproduce the issue.
For example, you ran a couple of reconfigures while the job was running, and the job preempted 3 other jobs.
This is all OK, but maybe it triggered some corner case, and by emulating your workload I may be able to reproduce, debug, and fix it.
I've not been able to reproduce it yet, though.

> I don't know if it is reproducible, it may have occurred
> before but gone unnoticed.

Actually, in the logs I can see that job 2108111 was on the same node at the same time. Although its logs look quite different, I would like to know if it had a similar problem. Could you post the output of:

$ sacct -j 2108111 -p --format=jobid,maxrss,reqtres,alloctres

Also, this one for the jobs that got preempted by 2110865:

$ sacct -j 2103895,2108094,2081119 -p --format=jobid,maxrss,reqtres,alloctres,state


Finally, in order to obtain more information in case you are able to reproduce the issue, please add this setting to slurm.conf and reconfigure:

DebugFlags=JobAccountGather


> If I can successfully request this amount
> of memory on a similar host would that be a good test?  That isn't 
> exactly what happened in this case though.

Not really.
Note that the problem is that the value of MaxRSS is simply wrong, because it is physically impossible.
In the logs I've seen that the node has ~400G as you said:

[2022-02-28T06:55:40.213] CPUs=32 Boards=1 Sockets=2 Cores=8 Threads=2 Memory=385106 TmpDisk=36157 Uptime=7078666 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)


So it's not that the job's memory limit was unenforced or ignored, but that the value gathered from the kernel is just wrong.
In the logs I've also seen that memory was enforced with cgroups with the right value:

[2022-03-16T15:21:31.262] [2110865.batch] task/cgroup: _memcg_initialize: job: alloc=190000MB mem.limit=190000MB memsw.limit=190000MB
[2022-03-16T15:21:31.262] [2110865.batch] task/cgroup: _memcg_initialize: step: alloc=190000MB mem.limit=190000MB memsw.limit=190000MB


In summary, the problem is that for some reason the jobacct_gather/linux plugin was not able to gather the right amount of RAM used by the job's processes, so the MaxRSS value is just "a lie":

[2022-02-28T06:55:38.935] debug:  jobacct_gather/linux: init: Job accounting gather LINUX plugin loaded
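For context on where that number comes from: jobacct_gather/linux polls each task's /proc/&lt;pid&gt;/stat and records the RSS field (reported in pages). Below is a minimal, hypothetical Python sketch of that parsing step (not SchedMD's actual implementation), which shows how a single bad read of that field would propagate directly into MaxRSS:

```python
def rss_kib_from_proc_stat(stat_line: str, page_size: int = 4096) -> int:
    """Extract RSS (field 24 of /proc/<pid>/stat, in pages) and return KiB."""
    # The comm field is parenthesised and may contain spaces, so split
    # on the last closing paren before tokenising the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is overall field 3 (state), so field 24 (rss) is rest[21].
    rss_pages = int(rest[21])
    return rss_pages * page_size // 1024

# Synthetic stat line with rss = 512 pages -> 2048 KiB at 4 KiB pages
line = ("1234 (some prog) S 1 1234 1234 0 -1 4194304 100 0 0 0 "
        "5 3 0 0 20 0 1 0 100 2097152 512 18446744073709551615")
print(rss_kib_from_proc_stat(line))  # 2048
```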

I'll try to reproduce the issue, but any help on it or the logs with DebugFlags=JobAccountGather of a new job that got the same issue will help a lot.

Thanks,
Albert
Comment 8 Renata Dart 2022-03-23 14:34:16 MDT
Hi Albert, I have included the requested output interspersed below:

On Wed, 23 Mar 2022, bugs@schedmd.com wrote:

>$ sacct -j 2108111 -p --format=jobid,maxrss,reqtres,alloctres

[renata@sdf-login02 ~]$ sacct -j 2108111 -p --format=jobid,maxrss,reqtres,alloctres
JobID|MaxRSS|ReqTRES|AllocTRES|
2108111||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=2,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|
2108111.batch|234572K||cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|
2108111.extern|1292K||billing=2,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|

>$ sacct -j 2103895,2108094,2081119 -p --format=jobid,maxrss,reqtres,alloctres,state

[renata@sdf-login02 ~]$ sacct -j 2103895,2108094,2081119 -p --format=jobid,maxrss,reqtres,alloctres,state
JobID|MaxRSS|ReqTRES|AllocTRES|State|
2103895||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=2,cpu=2,gres/gpu:v100=1,gres/gpu=1,mem=1500M,node=1|PREEMPTED|
2103895.batch|264016K||cpu=2,gres/gpu:v100=1,gres/gpu=1,mem=1500M,node=1|CANCELLED|
2103895.extern|840K||billing=2,cpu=2,gres/gpu:v100=1,gres/gpu=1,mem=1500M,node=1|COMPLETED|
2108094||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=2,cpu=2,gres/gpu:v100=1,gres/gpu=1,mem=1500M,node=1|PREEMPTED|
2108094.batch|256032K||cpu=2,gres/gpu:v100=1,gres/gpu=1,mem=1500M,node=1|CANCELLED|
2108094.extern|836K||billing=2,cpu=2,gres/gpu:v100=1,gres/gpu=1,mem=1500M,node=1|COMPLETED|
2081119||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=2,cpu=2,gres/gpu:v100=1,gres/gpu=1,mem=1500M,node=1|PREEMPTED|
2081119.batch|260248K||cpu=2,gres/gpu:v100=1,gres/gpu=1,mem=1500M,node=1|CANCELLED|
2081119.extern|836K||billing=2,cpu=2,gres/gpu:v100=1,gres/gpu=1,mem=1500M,node=1|COMPLETED|


Renata
Comment 9 Albert Gil 2022-03-25 08:19:31 MDT
Hi Renata,

> I have included the output requested interspersed

Thanks!
Unfortunately, it's not yet helping to reproduce the issue.
If you enable DebugFlags=JobAccountGather and you are able to reproduce it, please attach the logs so we can debug it further.

Thanks,
Albert
Comment 10 Albert Gil 2022-04-06 03:52:25 MDT
Hi Renata,

I hope you are doing well.

> Unfortunately, it's not yet helping to reproduce the issue.
> If you enable DebugFlags=JobAccountGather and you are able to reproduce it,
> please attach the logs so we can debug it further.

Have you been able to collect that debug information?

Regards,
Albert
Comment 11 Albert Gil 2022-04-18 07:07:27 MDT
Hi Renata,

I'm closing this ticket as timedout, but please don't hesitate to reopen it if you need further support and you are able to collect the debug information mentioned in comment 9.

Thanks,
Albert