We have a couple of bigmem nodes with 2 TB of memory and about 9.5 TB of NVMe drives configured as swap. Some years, Slurm versions, and many config-file changes ago, they already chewed through jobs over 10 TB in size. With the current setup, users report that once their job grows beyond 2 TB of memory, it gets killed by the OOM killer, which is not what I expect given the configuration I have in place.

I'm now using a configless setup where the majority of compute nodes only have local prolog and epilog scripts in /etc/slurm; everything else is fetched from slurmctld, including cgroup.conf. On these bigmem nodes, however, I also place a local cgroup.conf there, expecting it to take priority. It has these lines:

CgroupAutomount=yes
ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=no
AllowedRAMSpace=450
ConstrainSwapSpace=no

Despite this, jobs get killed when they grow beyond 2 TB. I have a simple test case in C to verify this relatively quickly.

I see two possibilities why this wouldn't work. Either I'm reading the cgroup.conf man page wrong and the lines above do not achieve what I want, OR the local cgroup.conf is not in effect and the one from slurmctld somehow takes over. I see that slurmd pulls cgroup.conf from slurmctld into /run/slurm/conf/cgroup.conf.

Questions:
1. How can I verify which cgroup.conf is in effect?
2. How can I ensure that on just these machines, the cgroup.conf in /etc/slurm takes precedence?

Thanks,
Can I get your slurm.conf?
Jurij,

You should be able to check the config being used on each node by checking SlurmdSpoolDir under the /conf-cache/. See: https://slurm.schedmd.com/configless_slurm.html#INITIAL_TESTING

On the same page, under Notes, the precedence is shown as follows:

The order of precedence for determining what configuration source to use is as follows:
1. The slurmd --conf-server $host[:$port] option
2. The -f $config_file option
3. The SLURM_CONF environment variable (if set)
4. The default slurm config file (likely /etc/slurm.conf)
5. Any DNS SRV records (from lowest priority value to highest)

So if you are using --conf-server, it would make sense that the cgroup.conf is being taken from the slurmctld. Perhaps using DNS SRV records would achieve what you are looking for.

-Scott
Created attachment 23661 [details] slurm.conf
I am using DNS SRV records; that's why I came up with the idea to override cgroup.conf on a per-node basis. And the behavior I see puzzles me. Can you tell me when cgroup.conf is read and applied? At slurmd startup or at job startup? If at job startup, I can overwrite it and possibly work around this issue...
Jurij, The cgroup.conf is read on startup. What is different between the two cgroup.confs? -Scott
The cgroup.conf that I use on the majority of compute nodes is this:

CgroupAutomount=yes
ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=yes
AllowedRAMSpace=120
ConstrainKmemSpace=no #prevent cgroup leak
ConstrainSwapSpace=yes
AllowedSwapSpace=0
MaxSwapPercent=1
MemorySwappiness=0

And the behavior I see on the fat nodes (sm-epyc-[01-05]) matches what's configured here. But I would like the fat nodes to behave differently with their memory restrictions and allow jobs to eat memory and swap as much as they want...
Jurij, Slurm looks for the slurm.conf file to establish where all the .conf files are. It will assume all the .conf files are in the same location. I think you would need to have all the conf files on the node if you want the cgroup.conf to be on the node. -Scott
Ok, that's informative. I didn't get that from the documentation. I've fiddled with our puppet setup to also place slurm.conf on the fat nodes and got that change applied. I've submitted my test job, but it looks like it will take a day or two before it gets a chance to run. I'll get back to you with results when it does.
While I'm waiting ... I noticed that despite slurm.conf being in /etc/slurm, and the desired cgroup.conf also being there, after a slurmd restart /var/spool/slurm gets the cgroup.conf from slurmctld, the one with settings for regular nodes. Why is that?
Ok, I figured that out: the conf-cache now only has topology.conf and gres.conf, while slurm.conf and cgroup.conf are in /etc/slurm. The test job is still pending; let's wait for that to confirm things are now as I want them to be.
Another thing I learned: slurmd doesn't like having some conf files in /etc/slurm and others in the conf-cache. It's either all or nothing. This is another point that should be clarified in the documentation. My test job ran and behaved as desired, eating freely into swap. So: case closed.
Jurij,

A colleague reminded me of another option you have. There's a way to have this both ways: add a line like

Include /some/local/cgroup-stuff.conf

in the cgroup.conf. Included files aren't sent out as part of configless, and this can let you vary the file on a per-node basis. (We may change part of this in 22.05, but as long as that's an absolute path to a node-local chunk of config, we won't ever send that out.)

Let me know if you have any questions about this way of doing it.

-Scott
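To illustrate (hypothetical file name and settings, borrowing values from the configs already quoted in this ticket): the cgroup.conf served configless from slurmctld would carry the site-wide settings plus the Include, and each bigmem node would hold its own node-local chunk with the overrides:

```
# cgroup.conf (served configless from slurmctld, shared by all nodes)
CgroupAutomount=yes
ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=yes
AllowedRAMSpace=120
Include /etc/slurm/cgroup-local.conf

# /etc/slurm/cgroup-local.conf (node-local, never sent out; bigmem nodes only)
ConstrainRAMSpace=no
ConstrainSwapSpace=no
```

On the regular nodes the node-local file would simply be empty, so deploying an empty /etc/slurm/cgroup-local.conf everywhere via puppet would keep the Include resolvable on every node.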
Interesting option ... I wasn't aware that conf files other than slurm.conf also accept the Include directive. I'll keep this in mind and play with it next time I need to tinker with the overall Slurm configuration.
Closing ticket