Description
Institut Pasteur HPC Admin
2022-04-15 04:54:03 MDT
Hi,
> Can it be due to the substantial changes in Slurm
> cgroup code to prepare it for cgroup V2 as Felip Moll says in the following
> ticket
> https://bugs.schedmd.com/show_bug.cgi?id=7536 ?
This is very unlikely :)
I understand you are asking for 4GB (with --mem) and you see the processes consuming up to 9GB.
- Is it possible to see how you launched the job?
- Can you upload the slurmd log from the node where the job ran?
- Can you show me the output of the command "cat /proc/mounts" on the node?
- Running dmesg is also a good way to verify whether there have been OOM kills, and for which process and at which memory limit. The kernel OOM log shows the limit applied to the cgroup.
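For example (a minimal sketch, not from the original report: it assumes the cgroup v1 memory controller mounted under /sys/fs/cgroup/memory and the /slurm/uid_<uid>/job_<jobid> hierarchy created by task/cgroup; substitute the real IDs), the OOM events and the limit actually enforced on the job can be cross-checked on the node with:

  # recent kernel OOM messages, with human-readable timestamps
  dmesg -T | grep -i "memory cgroup out of memory"
  # limit applied to the job's memory cgroup, and the peak usage the kernel recorded
  cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.limit_in_bytes
  cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.max_usage_in_bytes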
Created attachment 24585 [details]
script used to launch the job array

Hello,

> I understand you are asking for 4GB (with --mem) and you see the processes
> consuming up to 9GB.

Not in real time (our atop monitoring has a one-minute interval), just in the sacct output.

> - Is it possible to see how you launched the job?

The user wrote a rather simple job script launched in a job array. See the attached batch_script.sh.

> - Can you upload the slurmd log from the node where the job ran?
> - Can you show me the output of the command "cat /proc/mounts" on the node?

Both are attached.

> - Running dmesg is also a good way to verify whether there have been OOM kills, and for which
>   process and at which memory limit. The kernel OOM log shows the limit applied to the cgroup.

Please find attached an excerpt of the kernel logs, which contains lines of this type:

kernel: [426332.241768] Memory cgroup out of memory: Killed process 419423 (nnUNet_train) total-vm:8989768kB, anon-rss:763324kB, file-rss:113076kB, shmem-rss:92236kB, UID:39157

Thanks in advance,

Created attachment 24586 [details]
slurmd.log
Created attachment 24587 [details]
content of /proc/mounts
Created attachment 24588 [details]
kernel logs excerpt
Hi,

The cgroup works as expected. For example, the process you pointed out (419423), which corresponds to step 0 of job 6062068 and uid 39157, had a limit of exactly 4GiB of memory, and it was killed exactly at that limit by the cgroup:

2022-04-12T15:30:14.821431+02:00 maestro-3010 kernel: [426332.241768] Memory cgroup out of memory: Killed process 419423 (nnUNet_train) total-vm:8989768kB, anon-rss:763324kB, file-rss:113076kB, shmem-rss:92236kB, UID:39157
.....
2022-04-12T15:30:19.461128+02:00 maestro-3010 kernel: [426336.914774] Task in /slurm/uid_39157/job_6062068/step_0 killed as a result of limit of /slurm/uid_39157/job_6062068
2022-04-12T15:30:19.493669+02:00 maestro-3010 kernel: [426336.933313] memory: usage 4194304kB, limit 4194304kB, failcnt 769041
2022-04-12T15:30:19.493673+02:00 maestro-3010 kernel: [426336.947373] memory+swap: usage 4194304kB, limit 9007199254740988kB, failcnt 0

Instead, what I think happens is that the stats of the job are incorrect.

Does the nnUNet_train software fork any processes?

Can you repeat the test adding "SlurmdDebug=debug2" and "DebugFlags=JobAccountGather" to slurm.conf (+ reconfig)? Then upload the slurmd logs again, please.

A last question just in case the info I request doesn't give me enough, is there any possibility to apply a debug patch in your environment?

Thanks

> > I understand you are asking for 4GB (with --mem) and you see the processes
> > consuming up to 9GB.
>
> Not in real time (our atop monitoring has a one-minute interval), just in
> the sacct output.
What does your "atop" do exactly with the consumed memory of the job?
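To illustrate the suspected accounting problem while such a job is still running, the sum of per-process RSS (roughly what jobacct_gather/linux aggregates) can be compared with what the memory cgroup itself measures. A rough sketch, not from the thread, assuming cgroup v1 and a step path like the one in the kernel log above (substitute the IDs of the running job):

  CG=/sys/fs/cgroup/memory/slurm/uid_39157/job_6062068/step_0
  # sum VmRSS (kB) over all processes of the step
  for pid in $(cat $CG/cgroup.procs); do
      awk '/^VmRSS:/ {print $2}' /proc/$pid/status
  done | awk '{sum+=$1} END {print "summed per-process RSS:", sum, "kB"}'
  # what the cgroup actually charges the step
  awk '{print "cgroup usage:", $1/1024, "kB"}' $CG/memory.usage_in_bytes

If the processes share a lot of memory (forked workers, shared libraries, shmem), the first number can be much larger than the second, which is the kind of gap that would later show up in sacct.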
Hi, do you have any feedback for me? Thanks!

Hi Felip,

Not yet, but I hope to be able to repeat the test at the beginning of next week.

Thank you for asking and sorry for the delay,

(In reply to Institut Pasteur HPC Admin from comment #9)
> Hi Felip,
>
> Not yet, but I hope to be able to repeat the test at the beginning of
> next week.
>
> Thank you for asking and sorry for the delay,

Ok, no problem. I will be waiting for your testing. Thanks!

Created attachment 25021 [details]
kernel logs excerpt for job 8948024

Hi Felip,

I was able to reproduce the behavior:

$ sacct --format=jobid%15,jobname%15,state%15,exitcode,ncpus%5,reqmem%6,maxrss -j 8948024
          JobID         JobName           State ExitCode NCPUS ReqMem     MaxRSS
--------------- --------------- --------------- -------- ----- ------ ----------
        8948024    nnUNet_train   OUT_OF_MEMORY    0:125     1    4Gc
 8948024.extern          extern       COMPLETED      0:0     1    4Gc      1296K
      8948024.0    nnUNet_train   OUT_OF_MEMORY    0:125     1    4Gc  49763804K

As you noticed previously, the attached kernel logs show that the job was killed by the cgroup "out of memory" handler, as expected. So the reported MaxRSS must be incorrect.

> Can you repeat the test adding "SlurmdDebug=debug2" and
> "DebugFlags=JobAccountGather" to slurm.conf (+ reconfig)?

Unfortunately "DebugFlags=JobAccountGather" doesn't seem to exist in 20.11.7, so the output of slurmd in debug2 mode doesn't provide any relevant piece of information.

> A last question just in case the info I request doesn't give me enough,
> is there any possibility to apply a debug patch in your environment?

We had already planned to upgrade Slurm in June. Now that we are able to reproduce the problem almost at will, maybe we could wait for that upgrade to check whether the phenomenon still exists, and try a debug patch then if you have one?

> Does the nnUNet_train software fork any processes?

Yes, it creates 12 threads.

> What does your "atop" do exactly with the consumed memory of the job?

Nothing, I just meant that with the atop one-minute sampling, we couldn't catch the RSize at the moment the job was killed.

Thanks for your help,

Created attachment 25022 [details]
debug2 mode slurmd logs for job 8948024
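While a reproduction job is still running, the same accounting values that later end up in sacct's MaxRSS can also be watched live with sstat (shown here for the step from the sacct output above), which makes it easier to spot the moment the reported RSS starts to exceed the 4GiB the cgroup actually allows:

  sstat -j 8948024.0 --format=JobID,MaxRSS,AveRSS,MaxRSSTask,MaxRSSNode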
> Unfortunately "DebugFlags=JobAccountGather" doesn't seem to exist in 20.11.7,
> so the output of slurmd in debug2 mode doesn't provide any relevant piece of
> information.

Ahh, my bad. Yes, the info is not very useful.

> > A last question just in case the info I request doesn't give me enough,
> > is there any possibility to apply a debug patch in your environment?
>
> We had already planned to upgrade Slurm in June. Now that we are able to reproduce
> the problem almost at will, maybe we could wait for that upgrade to check
> whether the phenomenon still exists, and try a debug patch then if you have one?

That's an idea, but in the meantime I will try to reproduce it locally. If you can provide me with the process layout, doing a "pstree -psla" on a node where the job is running, I will try to do the same. Ideally I would test nnUNet_train itself, but I cannot run it without CUDA.

> > Does the nnUNet_train software fork any processes?
>
> Yes, it creates 12 threads.

I said that because there is a specific case where I've seen processes becoming orphans and being double-accounted, but it shouldn't apply here. Can you show me the batch script you're using to run the job?

> > > What does your "atop" do exactly with the consumed memory of the job?
> >
> > Nothing, I just meant that with the atop one-minute sampling, we couldn't
> > catch the RSize at the moment the job was killed.

Got it.

Hi Felip,

> That's an idea, but in the meantime I will try to reproduce it locally.
> If you can provide me with the process layout, doing a "pstree -psla"
> on a node where the job is running, I will try to do the same. Ideally
> I would test nnUNet_train itself, but I cannot run it without CUDA.

Here it is:

|-slurmd,711535 -d /usr/local/sbin/slurmstepd
|-slurmstepd,1681028
| |-sleep,1681033 100000000
| |-{slurmstepd},1681029
| |-{slurmstepd},1681030
| |-{slurmstepd},1681031
| `-{slurmstepd},1681032
|-slurmstepd,1681036
| |-nnUNet_train,1681048 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681121 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681151 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681152 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681212 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681213 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681215 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681216 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681217 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681224 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681230 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681231 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-nnUNet_train,1681234 /home/user/bin/nnUNet_train 2d nnUNetTrainerV2 500 0
| | |-{nnUNet_train},1681102
| | |-{nnUNet_train},1681103
| | `-{nnUNet_train},1681235
| |-{slurmstepd},1681037
| |-{slurmstepd},1681038
| |-{slurmstepd},1681039
| |-{slurmstepd},1681041
| |-{slurmstepd},1681042
| |-{slurmstepd},1681043
| |-{slurmstepd},1681044
| |-{slurmstepd},1681045
| `-{slurmstepd},1681047

> Can you show me the batch script you're using to run the job?
I just use srun, as the user told me to:

srun -p gpu -q gpu --gres=gpu:1 nnUNet_train 2d nnUNetTrainerV2 500 0

The first time, I had to clone his git repository and create a virtual environment with the pytorch version corresponding to our CUDA version. Before launching srun, I source it, plus a file that sets some environment variables for the input/output directories. Sorry, nothing really useful.

Thanks for your help,

Hi,

I am trying to reproduce this, but I realize I have two slurm.conf files from you and I don't know which one is the most recent. Can you please upload your latest slurm.conf?

Are you using the JobAcctGather linux plugin or the cgroup one? I am also wondering what JobAcctGatherParams you have set.

It is possible that these forked processes are created after opening shared libraries, so the space accounted by jobacctgather will be counted per process even though they are not really using all this space in memory. Cgroup may account only once for an opened shared library, but jobacctgather/linux may account for it more than once... unless:

Can you set JobAcctGatherParams=UsePss and try again?

The available options are:

NoShared
    Exclude shared memory from accounting.
UsePss
    Use PSS value instead of RSS to calculate real usage of memory. The PSS value will be saved as RSS.

----
From:

RSS is the total memory actually held in RAM for a process. RSS can be misleading, because it reports the total of all of the shared libraries that the process uses, even though a shared library is only loaded into memory once regardless of how many processes use it. RSS is not an accurate representation of the memory usage for a single process.

PSS differs from RSS in that it reports the proportional size of its shared libraries, i.e. if three processes all use a shared library that has 30 pages, that library will only contribute 10 pages to the PSS that is reported for each of the three processes. PSS is a more useful number because when the PSS for all processes in the system are summed together, that is a good representation for the total memory usage in the system. When a process is killed, the shared libraries that contributed to its PSS will be proportionally distributed to the PSS totals for the remaining processes still using that library. In this way PSS can be slightly misleading, because when a process is killed, PSS does not accurately represent the memory returned to the overall system.
----

Hi,

After looking more internally, I am quite sure this can be due to
not having UsePss or NoShared in JobAcctGatherParams when using
jobacctgather/linux.

You could add this parameter, or, as an alternative option, switch
to jobacctgather/cgroup.

Please let me know if you already have NoShared set, or what your
current config is.
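To see how much of the reported RSS is actually shared between the nnUNet_train processes, per-process RSS and PSS can be compared while the step is running. A rough sketch, not part of the original exchange: it assumes the node kernel exposes /proc/<pid>/smaps_rollup (available since 4.14), must run with enough privileges to read it, and <uid> stands for the job owner:

  for pid in $(pgrep -u <uid> nnUNet_train); do
      rss=$(awk '/^VmRSS:/ {print $2}' /proc/$pid/status)
      pss=$(awk '/^Pss:/ {print $2}' /proc/$pid/smaps_rollup)
      echo "pid=$pid rss=${rss}kB pss=${pss}kB"
  done

If PSS comes out much lower than RSS for each of these processes, then summing RSS per process, which is essentially what jobacct_gather/linux reports, would overshoot the real usage and would be consistent with the MaxRSS values shown above.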
Hi Felip,

Sorry for the delay, I was on vacation.

> After looking more internally, I am quite sure this can be due to
> not having UsePss or NoShared in JobAcctGatherParams when using
> jobacctgather/linux.
>
> You could add this parameter, or, as an alternative option, switch
> to jobacctgather/cgroup.
>
> Please let me know if you already have NoShared set, or what your
> current config is.

Here is what we have in slurm.conf:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

JobAcctGatherFrequency=task=10
JobAcctGatherType=jobacct_gather/linux
# Do not count mmap on files in RSS
JobAcctGatherParams=NoShared

> Can you set JobAcctGatherParams=UsePss and try again?

Do you still want me to try it?

Regarding your former question:

> A last question just in case the info I request doesn't give me enough,
> is there any possibility to apply a debug patch in your environment?

We should upgrade to 21.08.8-2 next week. Do you want me to take this opportunity to apply a debug patch?

Thanks in advance,

(In reply to Institut Pasteur HPC Admin from comment #24)
> Hi Felip,
>
> Sorry for the delay, I was on vacation.

No problem!

> > You could add this parameter, or, as an alternative option, switch
> > to jobacctgather/cgroup.
> >
> > Please let me know if you already have NoShared set, or what your
> > current config is.
>
> Here is what we have in slurm.conf:
>
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/cgroup,task/affinity
>
> JobAcctGatherFrequency=task=10
> JobAcctGatherType=jobacct_gather/linux
> # Do not count mmap on files in RSS
> JobAcctGatherParams=NoShared

Interesting, you already had this option set. Would it be possible for you to try with jobacct_gather/cgroup? It should report the same memory metrics the OOM killer sees, since it grabs its stats from the same place as task/cgroup.

> > Can you set JobAcctGatherParams=UsePss and try again?
>
> Do you still want me to try it?

UsePss can be an option, but it is incompatible with NoShared. I really think NoShared should have worked, but I am not sure what kind of memory your processes use. If it is not a big deal, you can try UsePss.

In order to see which memory is accounted for, we should investigate the memory layout of these jobs from /proc/<pid>/status just before the process is killed. Is it possible to get this information from one job which suffers from this issue?

> Regarding your former question:
>
> > A last question just in case the info I request doesn't give me enough,
> > is there any possibility to apply a debug patch in your environment?
>
> We should upgrade to 21.08.8-2 next week. Do you want me to take this
> opportunity to apply a debug patch?
>
> Thanks in advance,

Thanks, let's first see the outcomes of UsePss + the /proc/pid/status. And maybe setting jobacct_gather/cgroup too. If these are not conclusive we can think of a patch.

Hi Felip,

Thanks for your quick reply.

> Thanks, let's first see the outcomes of UsePss + the /proc/pid/status.

Yes, okay. I think we will just save the content of /proc/pid/status, since I can't predict when the job will exceed the limit.

> And maybe setting jobacct_gather/cgroup too.

Hmmm. Given the note on JobAcctGatherType:

"""
NOTE: Changing this configuration parameter changes the contents of the messages between Slurm daemons.
Any previously running job steps are managed by a slurmstepd daemon that will persist through the lifetime of that job step and not change its communication protocol. Only change this configuration parameter when there are no running job steps.
"""

> I think I'll wait for the upgrade to change that parameter.

In theory what would happen is that currently running steps would gather metrics with the Linux plugin and new steps with the cgroup one, so for the same job there would be different ways of getting metrics, which is not desirable. So it is ok to wait until the upgrade. In the meantime let's look at the other info I requested.

> I'll get back to you as soon as I have the outcomes of the change to UsePSS
> and the last /proc/pid/status of nnUNet_train before being killed. Given
> the upgrade preparation (it's only one step in our shutdown roadmap),
> I'm not sure I'll be able to do it this week.

Ok, no problem, I will keep this open and waiting.

Hello! I just wanted to know if you have been able to do the upgrade and test UsePSS and/or the JobAcctGather cgroup plugin. (Remember that UsePSS or NoShared is not compatible with the cgroup plugin; it has no effect there.) Thanks!

Hello,

I am marking this bug as infogiven. Please reopen it when you have more feedback for me.

Regards
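For reference, the two configuration alternatives discussed above would look roughly like this in slurm.conf. This is only a sketch: as noted in the thread, UsePss and NoShared are mutually exclusive, neither has any effect with the cgroup plugin, and the gatherer should only be changed while no job steps are running:

  # Option A: keep the linux gatherer, but account shared pages proportionally (PSS)
  JobAcctGatherType=jobacct_gather/linux
  JobAcctGatherParams=UsePss

  # Option B: gather the stats from the same cgroup hierarchy that enforces the limit
  JobAcctGatherType=jobacct_gather/cgroup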