Ticket 13503 - Configless setup and per-node cgroup.conf
Summary: Configless setup and per-node cgroup.conf
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 20.11.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Scott Hilton
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-02-23 13:38 MST by Jurij Pečar
Modified: 2022-03-03 10:40 MST

See Also:
Site: EMBL


Attachments
slurm.conf (9.81 KB, text/plain)
2022-02-28 10:24 MST, Jurij Pečar

Description Jurij Pečar 2022-02-23 13:38:05 MST
We have a couple of bigmem nodes with 2 TB of memory and about 9.5 TB of NVMe drives configured as swap. Some years, Slurm versions, and many config-file changes ago, they already chewed through jobs over 10 TB in size.

With the current setup, users report that once their job eats more than 2 TB of memory, it gets killed by the OOM killer, which is not what I expect given the configuration I have in place.

I'm now using a configless setup where the majority of compute nodes have only local prolog and epilog scripts in /etc/slurm; everything else is fetched from slurmctld, including cgroup.conf. However, on these bigmem nodes I also place a local cgroup.conf there, expecting it to take priority. It contains these lines:

CgroupAutomount=yes
ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=no
AllowedRAMSpace=450
ConstrainSwapSpace=no

Despite this, jobs get killed when they grow past 2 TB. I have a simple test case in C to verify this relatively quickly.
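The test case itself is not attached; a minimal sketch of such a memory-eater might look like this (the function name and structure are illustrative, not the reporter's original code):

```c
/*
 * Allocate `mib` MiB in 1 MiB chunks and touch every page, so the
 * cgroup memory controller actually charges the process for the
 * memory (malloc alone allocates lazily and is not charged).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 0 if all chunks were allocated and touched, 1 on failure. */
int eat_memory(size_t mib)
{
    const size_t chunk = 1024UL * 1024UL; /* 1 MiB */
    for (size_t i = 0; i < mib; i++) {
        char *p = malloc(chunk);
        if (p == NULL) {
            perror("malloc");
            return 1;
        }
        memset(p, 1, chunk); /* fault the pages in */
    }
    return 0;
}
```

Submitted as a job that tries to grow past the expected limit, a program like this should be OOM-killed when RAM constraints are in effect and survive (spilling into swap) when they are disabled.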

I see two possible reasons why this wouldn't work: either I'm reading the cgroup.conf man page wrong and the lines above do not achieve what I want, or the local cgroup.conf is not in effect and the one from slurmctld somehow takes over.

I see that slurmd pulls cgroup.conf from slurmctld into /run/slurm/conf/cgroup.conf. 

Questions: 
How can I verify which cgroup.conf is in effect?
How can I ensure that on just these machines, cgroup.conf in /etc/slurm takes precedence?

Thanks,
Comment 1 Scott Hilton 2022-02-28 09:54:41 MST
Can I get your slurm.conf?
Comment 2 Scott Hilton 2022-02-28 10:11:55 MST
Jurij,

You should be able to check which config is being used on each node by looking in the conf-cache/ directory under SlurmdSpoolDir. See:
https://slurm.schedmd.com/configless_slurm.html#INITIAL_TESTING
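For example, on one of the bigmem nodes (paths are illustrative; the SlurmdSpoolDir is site-dependent):

```
# Find where slurmd caches the pulled config files:
scontrol show config | grep -i SlurmdSpoolDir

# Inspect the cached copies and compare one against the local file:
ls -l /var/spool/slurm/conf-cache/
diff /var/spool/slurm/conf-cache/cgroup.conf /etc/slurm/cgroup.conf
```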

On the same page under Notes the precedence is shown as follows:

The order of precedence for determining what configuration source to use is as follows:

1. The slurmd --conf-server $host[:$port] option
2. The -f $config_file option
3. The SLURM_CONF environment variable (if set)
4. The default slurm config file (likely /etc/slurm.conf)
5. Any DNS SRV records (from lowest priority value to highest)

So if you are using --conf-server it would make sense that the cgroup.conf is being taken from the slurmctld. Perhaps using DNS SRV records would achieve what you are looking for.
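For reference, a configless DNS SRV record looks something like this (hostname illustrative; 6817 is the default slurmctld port):

```
; BIND zone-file entry advertising the config server to slurmd,
; with priority 10 and weight 0:
_slurmctld._tcp 3600 IN SRV 10 0 6817 ctld.example.com.
```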

-Scott
Comment 3 Jurij Pečar 2022-02-28 10:24:10 MST
Created attachment 23661
slurm.conf
Comment 4 Jurij Pečar 2022-02-28 10:29:04 MST
I am already using DNS SRV records; that's why I came up with the idea of overriding cgroup.conf on a per-node basis. The behavior I see puzzles me.

Can you tell me when cgroup.conf is read and applied? At slurmd startup or at job startup? If it's read at job startup, I can overwrite it and possibly work around this issue...
Comment 5 Scott Hilton 2022-02-28 11:30:51 MST
Jurij,

The cgroup.conf is read on startup.

What is different between the two cgroup.confs?

-Scott
Comment 6 Jurij Pečar 2022-02-28 11:36:41 MST
The cgroup.conf that I use on the majority of compute nodes is this:

CgroupAutomount=yes
ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=yes
AllowedRAMSpace=120
ConstrainKmemSpace=no #prevent cgroup leak
ConstrainSwapSpace=yes
AllowedSwapSpace=0
MaxSwapPercent=1
MemorySwappiness=0

And the behavior I see on fat nodes (sm-epyc-[01-05]) matches what's configured here.

But I would like the fat nodes to behave differently with respect to memory restrictions, allowing jobs to eat as much memory and swap as they want ...
Comment 7 Scott Hilton 2022-02-28 12:54:57 MST
Jurij,

Slurm looks for the slurm.conf file to establish where all the .conf files are. It will assume all the .conf files are in the same location. I think you would need to have all the conf files on the node if you want the cgroup.conf to be on the node.

-Scott
Comment 8 Jurij Pečar 2022-02-28 13:20:30 MST
Ok, that's informative. I didn't get that from the documentation.

I've fiddled with our Puppet config to also place slurm.conf on the fat nodes and got that change applied. I've submitted my test job, but it looks like it will take a day or two before it gets a chance to run. I'll get back to you with results when it does.
Comment 9 Jurij Pečar 2022-02-28 13:32:57 MST
While I'm waiting ... I noticed that despite slurm.conf being in /etc/slurm and the desired cgroup.conf also being there, after a slurmd restart /var/spool/slurm still gets the cgroup.conf from slurmctld that has the settings for regular nodes. Why is that?
Comment 10 Jurij Pečar 2022-02-28 15:23:32 MST
Ok, I figured that out: now conf-cache only has topology.conf and gres.conf, while slurm.conf and cgroup.conf are in /etc/slurm. The test job is still pending; let's wait for it to confirm things are now as I want them to be.
Comment 11 Jurij Pečar 2022-03-01 06:06:14 MST
Another thing I learned: slurmd doesn't like having some conf files in /etc/slurm and others in conf-cache. It's all or nothing. This is another thing that should be clarified in the documentation.

My test job ran and behaved as desired, eating freely into swap. So, case closed.
Comment 15 Scott Hilton 2022-03-02 12:59:43 MST
Jurij,

A colleague reminded me of another option you have:

There's a way to have this both ways: add a line like:

Include /some/local/cgroup-stuff.conf

in the cgroup.conf. Included files aren't sent out as part of configless, and this can let you vary the file on a per-node basis. (We may change part of this in 22.05, but as long as that's an absolute path to a node-local chunk of config we won't send that out ever.)
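Concretely, the distributed cgroup.conf might look like this (the include path is illustrative; any absolute path to a node-local file works):

```
# Cluster-wide cgroup.conf, distributed via configless:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
AllowedRAMSpace=120
# Node-local settings; included files are not distributed, so this
# file can differ per node (e.g. relaxed limits on the fat nodes):
Include /etc/slurm/cgroup-local.conf
```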

Let me know if you have any questions about this way of doing it.

-Scott
Comment 16 Jurij Pečar 2022-03-02 14:55:19 MST
Interesting option ... I wasn't aware that conf files other than slurm.conf also accept the Include directive.
 
Will keep this in mind and play with it next time I need to tinker with overall slurm configuration.
Comment 17 Scott Hilton 2022-03-03 10:40:43 MST
Closing ticket