Ticket 4017 - cgroup devices directories deleted from jobs older than one day.
Summary: cgroup devices directories deleted from jobs older than one day.
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.02.1
Hardware: Linux Linux
: 3 - Medium Impact
Assignee: Alejandro Sanchez
Reported: 2017-07-21 11:33 MDT by NYU HPC Team
Modified: 2017-08-01 07:13 MDT
Site: NYU


Description NYU HPC Team 2017-07-21 11:33:57 MDT
Hi Experts,

We have observed that the cgroup devices directories are generally removed for jobs that have been running longer than one day, though not for every such long-running job. The affected directories match:
/sys/fs/cgroup/devices/slurm/uid_*/job_*
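To narrow down which running jobs have lost their directory, a small check along these lines may help. It is only a sketch: the function name check_job_cgroups is hypothetical, and it assumes the directories follow the uid_<uid>/job_<jobid> pattern shown above.

```shell
#!/bin/sh
# Report whether each job's devices cgroup directory still exists.
# Reads "<jobid> <uid>" pairs on stdin, one per line.
# $1 is the cgroup root; it defaults to the path from this report.
check_job_cgroups() {
    root="${1:-/sys/fs/cgroup/devices/slurm}"
    while read -r jobid uid; do
        # Skip blank lines.
        [ -n "$jobid" ] || continue
        dir="$root/uid_$uid/job_$jobid"
        if [ -d "$dir" ]; then
            echo "present $dir"
        else
            echo "MISSING $dir"
        fi
    done
}
```

On a compute node, the input could come from squeue, for example: squeue -h -w "$(hostname -s)" -t RUNNING -o "%A %U" | check_job_cgroups. This assumes the node's Slurm NodeName matches its short hostname.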

Our cgroup.conf is as below:
$ cat /opt/slurm/etc/cgroup.conf
CgroupAutomount=yes
CgroupReleaseAgentDir="/opt/slurm/etc/cgroup"
ConstrainCores=yes
ConstrainKmemSpace=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MaxSwapPercent=0  
ConstrainDevices=yes
TaskAffinity=yes

The release agent directory is empty:
$ ls -l /opt/slurm/etc/cgroup
total 0

As far as we know, we have no other process that cleans up /sys/fs/cgroup/devices/*. Do you have any idea where the deletion might be initiated? Thanks!
Comment 1 Danny Auble 2017-07-21 14:33:59 MDT
Starting with 17.02.0, CgroupReleaseAgentDir is no longer needed in cgroup.conf and should be removed. Slurm should clean up the cgroups automatically as needed.
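
For illustration, the cgroup.conf from the description with only the CgroupReleaseAgentDir line dropped would look roughly like this. All other settings are copied unchanged from the report and are not a recommendation in themselves.

```
CgroupAutomount=yes
ConstrainCores=yes
ConstrainKmemSpace=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MaxSwapPercent=0
ConstrainDevices=yes
TaskAffinity=yes
```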

That being said, I know there were cleanup issues that were fixed in 17.02.3. Would it be possible for you to upgrade to at least that version (if not the latest, 17.02.6) and see if that fixes your issue?

If this doesn't help, Alex will help you further.
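
To confirm whether a node already runs a release with the cleanup fix, a version comparison such as the following could be used. This is a sketch: the helper name version_at_least is hypothetical, and it relies on GNU sort's -V (version sort) option.

```shell
#!/bin/sh
# Succeed if version $1 is at least version $2.
# Relies on GNU sort -V to order dotted version strings.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}
```

For example: version_at_least "$(slurmd -V | awk '{print $2}')" 17.02.3 && echo "cleanup fix included". This assumes slurmd -V prints the version as its second field, in the form "slurm 17.02.1".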
Comment 2 Alejandro Sanchez 2017-08-01 03:10:39 MDT
Hi. Did you manage to upgrade to the latest 17.02 and test the automatic removal of the cgroup hierarchy? Is there anything else that you need from us? Thanks.
Comment 3 NYU HPC Team 2017-08-01 07:09:48 MDT
We have been running 17.02.1. However, I do see that part of the source code differs from 17.02.6. We need to schedule a downtime to upgrade Slurm, which will be after the summer. Thank you!
Comment 4 Alejandro Sanchez 2017-08-01 07:13:10 MDT
All right. The commit that Danny refers to in comment 1 is this one:

https://github.com/SchedMD/slurm/commit/24e2cb07e8e363f24dda036637be97f90507fcd6

I'm closing this as resolved/infogiven. Please reopen if you encounter further issues. Thanks.