| Summary: | sacct: maxrss larger than requested memory | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | NYU HPC Team <hpc-staff> |
| Component: | Accounting | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | nate, scott |
| Version: | 19.05.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8656, https://bugs.schedmd.com/show_bug.cgi?id=10538, https://bugs.schedmd.com/show_bug.cgi?id=13856 | | |
| Site: | NYU | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurm configs for NYU set up; slurmd + messages for 3 jobs | | |
Description
NYU HPC Team
2020-05-26 15:25:50 MDT
Hi Eugene,

I'm looking into your issue, but I think I need more info to help reproduce the problem. Can you attach a copy of your slurm.conf and cgroup.conf, along with the output of "slurmd -C" from that batch node and "sacct -j 9867046 -o all -p -a"?

Thanks,
Scott

Created attachment 14414 [details]
slurm configs for NYU set up
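A minimal sketch of how the requested diagnostics might be collected on the affected batch node (commands as Scott names them above; the config paths are an assumption and may differ at the site):

    # node hardware/config as detected by slurmd
    slurmd -C
    # full accounting record for the job in question (all fields, parseable, all users)
    sacct -j 9867046 -o all -p -a
    # configs to attach (default paths assumed)
    cat /etc/slurm/slurm.conf /etc/slurm/cgroup.conf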
Hi Scott,

I've attached what you requested. Thanks!

Eugene.

Hi Eugene,

Thanks for the information provided. With it we can rule out a known issue related to jobacct_gather/cgroups. We will take a closer look at them, but could you also post the slurmd logs of a node that ran a job with such wrong values? If you also have the dmesg of those nodes, it would help. And what is the "uname -a" of the compute nodes?

You mentioned that you can reproduce the issue on several jobs. Do you notice any kind of pattern in the jobs? Can you provide the same sacct output for some of them so we can try to reproduce the issue?

We'll investigate further, but most probably we will need to increase the debug level or flags on your compute nodes, or even provide you with a patch to obtain extra debug information. Would that be OK for you?

Thanks,
Albert

Hi Albert,

I am attaching a tarball (nyu_jobs.tar.gz) with three directories jobXXX for three examples of jobs with such behavior -- using more RAM than allocated. In each directory there is an "sacct" file with the "sacct --all" output for the job. There are also slurmd logs and "messages" (I found that we have a lot of noise in our message files; unrelated but useful output from this ticket already! :))

I looked at these logs myself briefly, and to me it looks like different behavior. These three jobs, for example, ended with 3 different statuses: TIMEOUT, CANCELLED and COMPLETED. Also, for job 10034097 I see lots of OOM kill attempts; they are correlated with entries in "messages"... But I am getting a bit lost here.

These examples ran on three different nodes; on all 3, uname -a returns the same:

    # ssh c19-01 "uname -a"
    Linux c19-01 3.10.0-514.10.2.el7.x86_64 #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
    # ssh c38-01 "uname -a"
    Linux c38-01 3.10.0-514.10.2.el7.x86_64 #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
    # ssh gpu-25 "uname -a"
    Linux gpu-25 3.10.0-514.10.2.el7.x86_64 #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

We do have a few compute nodes with a different kernel, but for these three it is as quoted above.

As for increasing the debug level via a patch -- depending on how intrusive it is, our Slurm admins will be more or less reluctant to do it, but if it is needed I will do my best to convince them :) Please let me know if I can give you more info.

Created attachment 14467 [details]
slurmd + messages for 3 jobs
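A compact way to compare requested memory against the recorded peak RSS for one of the affected jobs might look like the following (a sketch only; the field names are standard sacct output fields, and 10034097 is one of the job IDs mentioned above):

    # requested memory vs. peak RSS recorded by the accounting plugin
    sacct -j 10034097 -p --format=JobID,State,ReqMem,MaxRSS,MaxRSSNode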
Hi Eugene,

Thanks for the information, it has been very helpful.

So far I can confirm that the memory constraint system that you have configured using cgroups is working properly. I can see how the cgroup in the kernel keeps the RAM used by the job under the requested RAM (plus the 25% that you have configured as AllowedRAMSpace). So the problem seems to be that, for some reason, the information gathered by jobacct_gather/linux is a bigger value.

I'll keep investigating,
Albert

PS: Btw, the "NoOverMemoryKill" parameter was deprecated in 18.08 because it became the default behavior, and the opposite option, "OverMemoryKill", was added instead:
https://github.com/SchedMD/slurm/blob/slurm-18.08/RELEASE_NOTES#L68
So you can remove that option; it is just ignored by the system.

Hi Eugene,

If you are still able to reproduce the issue with new jobs, we should try to increase the debug logs. In slurm.conf change the SlurmdDebug value from "info" (4) to "debug3" and restart slurmd on the nodes that you plan to use to reproduce the issue. Then submit the jobs to those nodes to reproduce the issue and restore slurm.conf afterwards (debug3 can be very verbose). Please attach the slurmd logs with the extra debug from those nodes.

Thanks,
Albert
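A minimal sketch of the change described above (parameter and values as Albert quotes them; the restart command is an assumption and depends on how slurmd is managed at the site):

    # slurm.conf on the nodes used to reproduce the issue (revert afterwards)
    SlurmdDebug=debug3      # temporarily raised from "info"; debug3 is very verbose

    # then restart slurmd on those nodes, e.g.
    systemctl restart slurmd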
Hi Albert,

I am not quite sure how to reproduce this -- it seems to be happening randomly on different nodes of our cluster, for different jobs and different users. So I need some time to better understand what triggers such behavior.

We do have a suspicion though -- possibly, when a multithreaded job is run where all threads use the same shared memory, Slurm (slurmd?) counts the RAM used by each thread, so the result exceeds the original ask. I am not ready to formulate this idea in more rigorous terms...

In the meantime we've decided to look at cgroup as a source for things like maxrss and possibly other metrics. Potentially I would like to set up an epilog script that would do something like:

    cat /sys/fs/cgroup/memory/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOBID/memory.max_usage_in_bytes

Unfortunately, it looks like by the time Slurm reaches the step when the epilog script runs, the cgroup directory tree for the job is already deleted. Similarly, at the prolog step the cgroup tree doesn't yet exist. Do you know if there is a way to make it work? I see that there are three different prologs:

    Prolog
    SrunProlog
    TaskProlog

and similarly three different epilogs. During any of these, is there a stage when we could access the cgroup state?

Thanks,
Eugene.

Hi Eugene,

> I am not quite sure how to reproduce this -- it seems to be happening
> randomly on different nodes of our cluster, for different jobs and different
> users. So I need some time to better understand what triggers such behavior.

Thanks for the effort.

> We do have a suspicion though -- possibly, when a multithreaded job is run
> where all threads use the same shared memory, Slurm (slurmd?) counts the
> RAM used by each thread, so the result exceeds the original ask. I am not
> ready to formulate this idea in more rigorous terms...

That makes sense. Note that jobacct_gather/linux gathers the information from /proc, so those multiple processes with shared memory may be an issue. For this reason we have the "UsePSS" option along with "NoShared". Maybe you want to use UsePSS.

> In the meantime we've decided to look at cgroup as a source for things like
> maxrss and possibly other metrics. Potentially I would like to set up an
> epilog script that would do something like:
>
> cat /sys/fs/cgroup/memory/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOBID/memory.max_usage_in_bytes
>
> Unfortunately, it looks like by the time Slurm reaches the step when the
> epilog script runs, the cgroup directory tree for the job is already deleted.
> Similarly, at the prolog step the cgroup tree doesn't yet exist. Do you know
> if there is a way to make it work? I see that there are three different prologs:
> Prolog
> SrunProlog
> TaskProlog
>
> and similarly three different epilogs.

It looks like you are trying to gather job accounting information from cgroups, and that's what jobacct_gather/cgroups was created for, right? ;-)

> During any of these, is there a stage
> when we could access the cgroup state?

Slurm can be heavily customized, and the right way to customize how job accounting information is gathered would be through the Job Accounting Gather Plugin API:
https://slurm.schedmd.com/jobacct_gatherplugins.html

But I don't recommend going down that road; instead, rely on the existing jobacct_gather/linux or jobacct_gather/cgroups. In bug 8656 we are working on a very similar issue (this one may even be a duplicate of it), but that one seems more related to jobacct_gather/cgroups.

I'll keep trying to reproduce your issue, but if you can reproduce it, I would like you to test whether UsePSS fixes the issue for you.

Regards,
Albert

Hi Eugene,

I'm closing this bug as cannotreproduce, but please don't hesitate to reopen it if you need further assistance.

Regards,
Albert
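For reference, a minimal sketch of the setting Albert suggests testing (option spelling as used in the thread; jobacct_gather/linux is the gather plugin this site already uses per the discussion above, and the exact form should be checked against the slurm.conf man page for the installed version):

    # slurm.conf -- report PSS rather than RSS so memory shared between
    # processes is not counted once per process
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherParams=UsePSS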