| Summary: | sacct: maxrss larger than requested memory | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | NYU HPC Team <hpc-staff> |
| Component: | Accounting | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | nate, scott |
| Version: | 19.05.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8656, https://bugs.schedmd.com/show_bug.cgi?id=10538, https://bugs.schedmd.com/show_bug.cgi?id=13856 | | |
| Site: | NYU | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurm configs for NYU set up; slurmd + messages for 3 jobs | | |
Description
NYU HPC Team
2020-05-26 15:25:50 MDT
Hi Eugene,

I'm looking into your issue, but I think I need more info to help reproduce the problem. Can you attach a copy of your slurm.conf and cgroup.conf, along with the output of "slurmd -C" from that batch node and "sacct -j 9867046 -o all -p -a"?

Thanks,
Scott

Created attachment 14414 [details]
slurm configs for NYU set up
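A minimal sketch of how the requested diagnostics might be collected on the affected batch node (commands as Scott names them above; the config paths are an assumption and may differ at the site):

    # node hardware/config as detected by slurmd
    slurmd -C
    # full accounting record for the job in question (all fields, parseable, all users)
    sacct -j 9867046 -o all -p -a
    # configs to attach (default paths assumed)
    cat /etc/slurm/slurm.conf /etc/slurm/cgroup.conf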
Hi Scott,

I've attached what you requested. Thanks!

Eugene.

Hi Eugene,

Thanks for the information provided. With it we can rule out a known issue related to jobacct_gather/cgroups. We will take a closer look at them, but could you also post the slurmd logs of a node that ran a job with such wrong values? If you also have the dmesg of those nodes, it would help. And what is the "uname -a" of the compute nodes?

You mentioned that you can reproduce the issue on several jobs. Do you notice any kind of pattern in the jobs? Can you provide the same sacct output for some of them so we can try to reproduce the issue?

We'll investigate further, but most probably we will need to increase the debug level or flags on your compute nodes, or even provide you with a patch to obtain extra debug information. Would that be OK for you?

Thanks,
Albert

Hi Albert,

I am attaching a tarball (nyu_jobs.tar.gz) with three directories jobXXX for three examples of jobs with such behavior -- using more RAM than allocated. In each directory there is an "sacct" file with the "sacct --all" output for the job. There are also slurmd logs and "messages" (I found that we have a lot of noise in our message files; unrelated but useful output from this ticket already! :))

I looked at these logs myself briefly, and to me it looks like different behavior. These three jobs, for example, ended with 3 different statuses: TIMEOUT, CANCELLED and COMPLETED. Also, for job 10034097 I see lots of OOM kill attempts; they are correlated with entries in "messages"... But I am getting a bit lost here.

These examples ran on three different nodes; on all 3, uname -a returns the same:

    # ssh c19-01 "uname -a"
    Linux c19-01 3.10.0-514.10.2.el7.x86_64 #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
    # ssh c38-01 "uname -a"
    Linux c38-01 3.10.0-514.10.2.el7.x86_64 #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
    # ssh gpu-25 "uname -a"
    Linux gpu-25 3.10.0-514.10.2.el7.x86_64 #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

We do have a few compute nodes with a different kernel, but for these three it is as quoted above.

As for increasing the debug level via a patch -- depending on how intrusive it is, our Slurm admins will be more or less reluctant to do it, but if it is needed I will do my best to convince them :) Please let me know if I can give you more info.

Created attachment 14467 [details]
slurmd + messages for 3 jobs
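A compact way to compare requested memory against the recorded peak RSS for one of the affected jobs might look like the following (a sketch only; the field names are standard sacct output fields, and 10034097 is one of the job IDs mentioned above):

    # requested memory vs. peak RSS recorded by the accounting plugin
    sacct -j 10034097 -p --format=JobID,State,ReqMem,MaxRSS,MaxRSSNode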
Hi Eugene,

Thanks for the information, it has been very helpful.

So far I can confirm that the memory constraint system that you have configured using cgroups is working properly. I can see how the cgroup in the kernel keeps the RAM used by the job under the requested RAM (plus the 25% that you have configured as AllowedRAMSpace). So the problem seems to be that, for some reason, the information gathered by jobacct_gather/linux is a bigger value.

I'll keep investigating,
Albert

PS: Btw, the "NoOverMemoryKill" parameter was deprecated in 18.08 because it became the default behavior, and the opposite option, "OverMemoryKill", was added instead:
https://github.com/SchedMD/slurm/blob/slurm-18.08/RELEASE_NOTES#L68
So you can remove that option; it is just ignored by the system.

Hi Eugene,

If you are still able to reproduce the issue with new jobs, we should try to increase the debug logs. In slurm.conf change the SlurmdDebug value from "info" (4) to "debug3" and restart slurmd on the nodes that you plan to use to reproduce the issue. Then submit the jobs to those nodes to reproduce the issue and restore slurm.conf afterwards (debug3 can be very verbose). Please attach the slurmd logs with the extra debug from those nodes.

Thanks,
Albert
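A minimal sketch of the change described above (parameter and values as Albert quotes them; the restart command is an assumption and depends on how slurmd is managed at the site):

    # slurm.conf on the nodes used to reproduce the issue (revert afterwards)
    SlurmdDebug=debug3      # temporarily raised from "info"; debug3 is very verbose

    # then restart slurmd on those nodes, e.g.
    systemctl restart slurmd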
Hi Albert,

I am not quite sure how to reproduce this -- it seems to be happening randomly on different nodes of our cluster, for different jobs and different users. So I need some time to better understand what triggers such behavior.

We do have a suspicion though -- possibly, when a multithreaded job is run where all threads use the same shared memory, Slurm (slurmd?) counts the RAM used by each thread, so the result exceeds the original ask. I am not ready to formulate this idea in more rigorous terms...

In the meantime we've decided to look at cgroup as a source for things like maxrss and possibly other metrics. Potentially I would like to set up an epilog script that would do something like:

    cat /sys/fs/cgroup/memory/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOBID/memory.max_usage_in_bytes

Unfortunately, it looks like by the time Slurm reaches the step when the epilog script runs, the cgroup directory tree for the job is already deleted. Similarly, at the prolog step the cgroup tree doesn't yet exist. Do you know if there is a way to make it work? I see that there are three different prologs:

    Prolog
    SrunProlog
    TaskProlog

and similarly three different epilogs. During any of these, is there a stage when we could access the cgroup state?

Thanks,
Eugene.

Hi Eugene,

> I am not quite sure how to reproduce this -- it seems to be happening
> randomly on different nodes of our cluster, for different jobs and different
> users. So I need some time to better understand what triggers such behavior.

Thanks for the effort.

> We do have a suspicion though -- possibly, when a multithreaded job is run
> where all threads use the same shared memory, Slurm (slurmd?) counts the
> RAM used by each thread, so the result exceeds the original ask. I am not
> ready to formulate this idea in more rigorous terms...

That makes sense. Note that jobacct_gather/linux gathers the information from /proc, so those multiple processes with shared memory may be an issue. For this reason we have the "UsePSS" option along with "NoShared". Maybe you want to use UsePSS.

> In the meantime we've decided to look at cgroup as a source for things like
> maxrss and possibly other metrics. Potentially I would like to set up an
> epilog script that would do something like:
>
> cat /sys/fs/cgroup/memory/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOBID/memory.max_usage_in_bytes
>
> Unfortunately, it looks like by the time Slurm reaches the step when the
> epilog script runs, the cgroup directory tree for the job is already deleted.
> Similarly, at the prolog step the cgroup tree doesn't yet exist. Do you know
> if there is a way to make it work? I see that there are three different prologs:
> Prolog
> SrunProlog
> TaskProlog
>
> and similarly three different epilogs.

It looks like you are trying to gather job accounting information from cgroups, and that's what jobacct_gather/cgroups was created for, right? ;-)

> During any of these, is there a stage
> when we could access the cgroup state?

Slurm can be heavily customized, and the right way to customize how job accounting information is gathered would be through the Job Accounting Gather Plugin API:
https://slurm.schedmd.com/jobacct_gatherplugins.html

But I don't recommend going down that road; instead, rely on the existing jobacct_gather/linux or jobacct_gather/cgroups. In bug 8656 we are working on a very similar issue (this one may even be a duplicate of it), but that one seems more related to jobacct_gather/cgroups.

I'll keep trying to reproduce your issue, but if you can reproduce it, I would like you to test whether UsePSS fixes the issue for you.

Regards,
Albert

Hi Eugene,

I'm closing this bug as cannotreproduce, but please don't hesitate to reopen it if you need further assistance.

Regards,
Albert
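For reference, a minimal sketch of the setting Albert suggests testing (option spelling as used in the thread; jobacct_gather/linux is the gather plugin this site already uses per the discussion above, and the exact form should be checked against the slurm.conf man page for the installed version):

    # slurm.conf -- report PSS rather than RSS so memory shared between
    # processes is not counted once per process
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherParams=UsePSS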