Ticket 13624 - Max number of open file descriptors is capped to 4096 in Slurm 21.08
Summary: Max number of open file descriptors is capped to 4096 in Slurm 21.08
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.6
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-03-15 13:04 MDT by Francesco De Martino
Modified: 2022-04-15 11:42 MDT
CC List: 7 users

See Also:
Site: DS9 (PSLA)
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 21.08.7, 22.05pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (1.70 KB, text/plain)
2022-03-16 02:30 MDT, Luca Carrogu
patch for 21.08 (DS9 only test) (1.01 KB, patch)
2022-03-16 15:09 MDT, Nate Rini
patch for master (2.44 KB, patch)
2022-03-16 15:20 MDT, Nate Rini
patch (1.23 KB, patch)
2022-04-15 10:10 MDT, Nate Rini

Description Francesco De Martino 2022-03-15 13:04:02 MDT
Hello,

In upgrading from Slurm 20.11 to Slurm 21.08 we identified a change in how ulimits on the max number of file descriptors are managed.

In both versions we set LimitNOFILE=131072 in the slurmd service definition.

Regardless of Slurm version, if the file descriptor limit set on the submission node is higher than the hard limit set for slurmd, the job processes fall back to the soft limit set for slurmd.

Now while in 20.11 the soft limit for slurmd was set to the value the service was started with (131072), in Slurm 21.08 this limit is capped to 4096. The change seems to come from this commit https://github.com/SchedMD/slurm/commit/18b2f4fff3f8fd5773ab14ec631bbd5f2995fa6e.

This impacts applications that set a limit on the submission node greater than 131072 and are now getting capped to 4096 rather than 131072.

Can you please share the reasoning behind this change in behavior? 

Thanks,
Francesco
Comment 1 Nate Rini 2022-03-15 13:50:41 MDT
(In reply to Francesco De Martino from comment #0)
> Can you please share the reasoning behind this change in behavior? 

This behavior was changed in bug#10254 and then (mostly) reverted in bug#12804:
> https://github.com/SchedMD/slurm/commit/d2c1a05e15de6019c1e2def91e77a0377cd1a446

Please verify your running Slurm version:
> slurmd -V
Comment 2 Maurizio Melato 2022-03-15 15:03:33 MDT
We're using Slurm version 21.08.6 where we're observing the behavior change reported above.
Comment 3 Nate Rini 2022-03-15 15:25:07 MDT
Please attach the slurm.conf. Please run this test job and attach the output:
> sbatch --wrap 'srun cat /proc/self/limits'
Comment 4 Luca Carrogu 2022-03-16 02:30:52 MDT
Created attachment 23883 [details]
slurm.conf

slurm.conf
Comment 5 Luca Carrogu 2022-03-16 02:35:07 MDT
On the head node:
$ sinfo -V
slurm 21.08.6


$ sbatch --wrap 'srun cat /proc/self/limits'
Submitted batch job 1

$ cat slurm-1.out
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            8192                 131072               files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       30446                30446                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us


On the compute node:
$ ps aux | grep slurm
root      5166  0.0  0.0 152784  5936 ?        Ss   08:09   0:00 /opt/slurm/sbin/slurmd -D -N compute1-st-compute1-i1-1

$ sudo cat /proc/5166/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            4096                 131072               files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       30446                30446                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Comment 6 Luca Carrogu 2022-03-16 02:39:55 MDT
The issue occurs when the head node sets a limit greater than the hard limit on the compute node.

on the head node:
$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30446
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 131073
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Note that open files is 131073, greater than the 131072 hard limit on the compute node.

$ sbatch --wrap 'srun cat /proc/self/limits'
Submitted batch job 3

$ cat slurm-3.out
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            4096                 131072               files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       30446                30446                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

$ sbatch --wrap 'ulimit -a'
Submitted batch job 4

$ cat slurm-4.out
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30446
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Note that open files is capped to 4096.
Comment 7 Nate Rini 2022-03-16 09:23:14 MDT
Working on replicating this locally.
Comment 8 Nate Rini 2022-03-16 09:48:49 MDT
Please provide:
> systemctl --version
Comment 10 Francesco De Martino 2022-03-16 10:57:24 MDT
(In reply to Nate Rini from comment #8)
> Please provide:
> > systemctl --version

$ systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN
Comment 11 Nate Rini 2022-03-16 11:07:51 MDT
Please also provide:
> uname -a
Comment 12 Francesco De Martino 2022-03-16 11:18:45 MDT
(In reply to Nate Rini from comment #11)
> Please also provide:
> > uname -a

[ec2-user@ip-10-0-0-57 ~]$ uname -a
Linux ip-10-0-0-57 4.14.268-205.500.amzn2.x86_64 #1 SMP Wed Mar 2 18:38:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Comment 13 Nate Rini 2022-03-16 11:28:57 MDT
Please set in the slurmd.service unit file:
> LimitNOFILE=unlimited

A restart of slurmd will be required. Please then retry your test.

(In reply to Francesco De Martino from comment #10)
> systemd 219

I have been able to replicate this issue locally: the hard limit of any child process of slurmd is capped (by systemd) at the limit set in the slurmd unit file. It looks like systemd 240 *may* relax this limitation, but my systemd is only at 239, so that will require further testing.
Comment 14 Francesco De Martino 2022-03-16 12:10:52 MDT
(In reply to Nate Rini from comment #13)

> I have been able to replicate this issue locally: the hard limit of any
> child process of slurmd is capped (by systemd) at the limit set in the
> slurmd unit file. It looks like systemd 240 *may* relax this limitation,
> but my systemd is only at 239, so that will require further testing.

The problem is not that the hard limit of child processes is capped to the limit set in the slurmd unit file - that is indeed the desired behavior. What we are experiencing is that the limit gets capped to 4096, not to the value set in the slurmd service file, when the limits propagated from the submission host are bigger.

Submission host -> 131073
slurmd -> 131072
job process -> capped to 4096
Comment 15 Nate Rini 2022-03-16 12:13:12 MDT
(In reply to Francesco De Martino from comment #14)
> What we are experiencing is that the limit gets capped to 4096 and not to
> the ones set in slurmd service file when limits propagated from submission
> host are bigger.

Do you mean the soft limit? The data provided shows the hard limit is still at the expected limit?

(In reply to Luca Carrogu from comment #6)
> Max open files            4096                 131072               files
Comment 16 Francesco De Martino 2022-03-16 12:18:19 MDT
(In reply to Nate Rini from comment #15)
> (In reply to Francesco De Martino from comment #14)
> > What we are experiencing is that the limit gets capped to 4096 and not to
> > the ones set in slurmd service file when limits propagated from submission
> > host are bigger.
> 
> Do you mean the soft limit? The data provided shows the hard limit is still
> at the expected limit?
> 
> (In reply to Luca Carrogu from comment #6)
> > Max open files            4096                 131072               files

yes soft limits
Comment 17 Nate Rini 2022-03-16 12:24:43 MDT
I suspect the limits are getting inherited. Please set this in your slurm.conf and restart all the daemons:
> PropagateResourceLimits=NONE
Comment 18 Francesco De Martino 2022-03-16 12:31:33 MDT
(In reply to Nate Rini from comment #17)
> I suspect the limits are getting inherited. Please set this in your
> slurm.conf and restart all the daemons:
> > PropagateResourceLimits=NONE

Yes, limits get propagated, but the limit on the submit node is set to 131073, not 4096. How does the end result become 4096 whenever the propagated limits are greater than slurmd's?
Comment 19 Nate Rini 2022-03-16 12:38:33 MDT
(In reply to Francesco De Martino from comment #18)
> (In reply to Nate Rini from comment #17)
> > I suspect the limits are getting inherited. Please set this in your
> > slurm.conf and restart all the daemons:
> > > PropagateResourceLimits=NONE
> 
> Yes limits get propagated, but the limit on the submit node is not set to
> 4096 but rather to 131073. How come the end result is 4096 as long as the
> propagated limits are > than the slurmd ones?

slurmstepd attempts to set the limit as requested and fails:
> [2022-03-16T12:35:08.371] [4.batch] Can't propagate RLIMIT_NPROC of 'unlimited' from submit host: Operation not permitted
> [2022-03-16T12:35:08.371] [4.batch] debug2: _set_limit: RLIMIT_NOFILE : max:5000 cur:4096 req:8000
> [2022-03-16T12:35:08.371] [4.batch] Can't propagate RLIMIT_NOFILE of 8000 from submit host: Operation not permitted

When this happens, the job retains the default limit, in this case 4096 files. Please note that slurmstepd is running as the user, not root, while attempting to change the limits.
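The "Operation not permitted" error above can be reproduced outside of Slurm: an unprivileged process cannot set a soft limit above its hard limit. A minimal Python sketch of the pre-patch behavior (the function name and return-value convention are illustrative, not Slurm's actual code):

```python
import resource

def propagate_nofile(requested):
    """Loosely mimic slurmstepd's pre-patch behavior: try to apply the
    soft limit propagated from the submit host as-is.  Illustrative only,
    not Slurm's implementation."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # The kernel rejects a soft limit above the hard limit for an
        # unprivileged process, so the attempt fails outright.
        resource.setrlimit(resource.RLIMIT_NOFILE, (requested, hard))
    except (ValueError, OSError):
        return soft  # setrlimit failed; job keeps the inherited soft limit
    return requested
```

With a hard limit of 131072 and a propagated request of 131073, the setrlimit() call fails and the job is left with whatever soft limit slurmd inherited, here 4096.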
Comment 20 Francesco De Martino 2022-03-16 12:49:16 MDT
(In reply to Nate Rini from comment #19)
> (In reply to Francesco De Martino from comment #18)
> > (In reply to Nate Rini from comment #17)
> > > I suspect the limits are getting inherited. Please set this in your
> > > slurm.conf and restart all the daemons:
> > > > PropagateResourceLimits=NONE
> > 
> > Yes limits get propagated, but the limit on the submit node is not set to
> > 4096 but rather to 131073. How come the end result is 4096 as long as the
> > propagated limits are > than the slurmd ones?
> 
> slurmstepd attempts to set the limit as requested and fails:
> > [2022-03-16T12:35:08.371] [4.batch] Can't propagate RLIMIT_NPROC of 'unlimited' from submit host: Operation not permitted
> > [2022-03-16T12:35:08.371] [4.batch] debug2: _set_limit: RLIMIT_NOFILE : max:5000 cur:4096 req:8000
> > [2022-03-16T12:35:08.371] [4.batch] Can't propagate RLIMIT_NOFILE of 8000 from submit host: Operation not permitted
> 
> When this happens, the job retains the default limit. In this case: 4096
> files. Please note that slurmstepd is running as the user while attempting
> to change the limits and not root.

And this brings me back to my original question: why did this behavior change in 21.08? Where is this default limit coming from?

What we are seeing with 20.11 is:
Submission host -> 131073
slurmd -> 131072
job process -> 131072 (not 4096)
Comment 21 Nate Rini 2022-03-16 13:55:34 MDT
(In reply to Francesco De Martino from comment #20)
> And this brings me back to my original question: why this behavior changed
> in 21.08? Where is this default limit coming from?
> 
> What we are seeing with 20.11 is:
> Submission host -> 131073
> slurmd -> 131072
> job process -> 131072 (not 4096)

Prior to bug#10254, the nofile rlimit for slurmd was not changed. However, with bug#10254, we found that forking slurmstepd with a large possible number of file descriptors is quite expensive. slurmd itself has no need for such a large limit.

I'll look at making the limit change more graceful.
Comment 22 Nate Rini 2022-03-16 15:09:08 MDT
Created attachment 23899 [details]
patch for 21.08 (DS9 only test)

(In reply to Nate Rini from comment #21)
> I'll look at making the limit change be more graceful.

This patch switches around the logic to avoid failing when the hard limit is lower than the requested soft limit. Feel free to try this patch while it undergoes our normal QA process.
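The idea of the fix can be sketched in a few lines: instead of passing the submit host's value straight through, clamp it at the local hard limit first. A hypothetical Python equivalent of the patched logic (the actual patch is C code inside slurmstepd):

```python
import resource

def set_soft_nofile_capped(requested):
    """Clamp the requested soft limit at the local hard limit, then
    apply it.  A sketch of the post-patch behavior, illustrative only."""
    _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY and requested > hard:
        requested = hard  # cap to the max possible instead of failing
    resource.setrlimit(resource.RLIMIT_NOFILE, (requested, hard))
    return requested
```

With slurmd's hard limit at 131072 and a propagated request of 131073, this yields a soft limit of 131072, matching the 20.11 behavior the reporter expected.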
Comment 25 Maurizio Melato 2022-03-22 13:46:49 MDT
Could you share which Slurm version this patch will be released in, and roughly when? Thanks.
Comment 26 Nate Rini 2022-03-22 13:55:43 MDT
(In reply to Nate Rini from comment #22)
> Created attachment 23899 [details]
> patch for 21.08 (DS9 only test)

(In reply to Nate Rini from comment #24)
> (In reply to Nate Rini from comment #23)
> > Created attachment 23900 [details]
> > patch for master

(In reply to mmelato from comment #25)
> Could you share the Slurm version and indicatively when this patch will be
> released? Thanks.

The patches for the Slurm 21.08 and Slurm 22.05 (not tagged yet) releases are above. The patch is currently undergoing review, and I don't have an ETA other than that we are targeting the slurm-22.05 release. The slurm-21.08 patch has been provided so your site can test and verify that it solves your issue.
Comment 27 Francesco De Martino 2022-03-23 09:23:30 MDT
(In reply to Nate Rini from comment #22)
> Created attachment 23899 [details]
> patch for 21.08 (DS9 only test)
> 
> (In reply to Nate Rini from comment #21)
> > I'll look at making the limit change be more graceful.
> 
> This patch switches around the logic to avoid failing when the hard limit is
> lower than the requested soft limit. Feel free to try this patch while it
> undergoes our normal QA process.

I tested the patch in my cluster and it looks like it restores the behavior we were seeing in Slurm 20.11.

[ec2-user@ip-10-0-0-114 ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30446
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 131073
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[ec2-user@ip-10-0-0-114 ~]$ sbatch --wrap "ulimit -a"
Submitted batch job 8
[ec2-user@ip-10-0-0-114 ~]$ cat slurm-8.out
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30446
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Comment 28 Francesco De Martino 2022-03-24 02:42:23 MDT
Hi Nate,

An additional question from my side: is there any downside to setting LimitNOFILE in the slurmd service to the max allowed on the system?
Comment 30 Nate Rini 2022-03-24 18:39:06 MDT
(In reply to Francesco De Martino from comment #28)
> an additional question from my side: is there any downside if we were to set
> LimitNOFILE in slurmd service to the max allowed on the system?

Slurm will attempt to close() every possible file descriptor at every fork, so there will be additional delays when starting slurmstepd, prolog, and epilog scripts. Unless the node handles a high throughput of jobs, the difference will be negligible with any modern non-virtualized kernel.
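The cost described here comes from the classic close-all-descriptors loop, whose work is proportional to the configured nofile limit regardless of how many descriptors are actually open. A simplified illustration (not Slurm's actual code; real implementations may use other mechanisms):

```python
import os
import resource

def close_fds_naive(lo=3, hi=None):
    """Attempt to close every possible fd from lo up to the nofile soft
    limit.  The loop's cost grows with the configured limit even when
    almost no descriptors are open, which is the fork-time overhead
    described above.  Illustrative only."""
    if hi is None:
        hi, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    attempts = 0
    for fd in range(lo, hi):
        attempts += 1
        try:
            os.close(fd)
        except OSError:
            pass  # fd was not open; nothing to do
    return attempts
```

With LimitNOFILE set to a huge value, this loop can run millions of iterations per fork; newer Linux kernels provide close_range(2) to avoid exactly this pattern.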
Comment 37 Nate Rini 2022-04-14 12:37:08 MDT
(In reply to Francesco De Martino from comment #27)
> (In reply to Nate Rini from comment #22)
> > Created attachment 23899 [details]
> > patch for 21.08 (DS9 only test)
> > 
> > (In reply to Nate Rini from comment #21)
> > > I'll look at making the limit change be more graceful.
> > 
> > This patch switches around the logic to avoid failing when the hard limit is
> > lower than the requested soft limit. Feel free to try this patch while it
> > undergoes our normal QA process.
> 
> I tested the patch in my cluster and it looks like it restores the behavior
> we were seeing in Slurm 20.11.

The patch is now upstream for the upcoming 21.08.7 release:
> https://github.com/SchedMD/slurm/commit/16d1f708220d213534faf7149a05f5e40079883a

Please reply if you have any more related questions. I'm going to close this ticket out.
Comment 38 Chris Samuel (NERSC) 2022-04-15 00:05:28 MDT
Great to find this already fixed. I spent a while yesterday debugging this on Perlmutter for an NCCL app running at large scale that used to work but failed when we went to 21.08.

Whilst waiting for Slurm 21.08.7 to appear, is it just this commit that's needed?

commit 16d1f708220d213534faf7149a05f5e40079883a
Author: Nathan Rini <nate@schedmd.com>
Date:   Wed Mar 16 14:52:50 2022 -0600

    slurmstepd - cap rlimit request to max possible
    
    slurmstepd runs as the job user and can't override the existing hard
    limits. Instead of failing to set the soft rlimit at all, set the soft
    rlimit to the max possible hard rlimit.
    
    Bug 13624

All the best,
Chris
Comment 39 Nate Rini 2022-04-15 10:10:23 MDT
Created attachment 24473 [details]
patch

(In reply to Chris Samuel (NERSC) from comment #38)
> Great to find this already fixed, I spent a while yesterday debugging this
> on Perlmutter for an NCCL app running at large scale that used to work but
> failed when we went to 21.08.
> 
> Whilst waiting for Slurm 21.08.7 to appear is it just this commit that's
> needed?

Yes. I have attached the patch sans the NEWS entry, which would likely break 'git apply'. If you're having broader issues with RLIMIT_NOFILE, then there is a whole pile of patches to fix that.
Comment 40 Chris Samuel (NERSC) 2022-04-15 11:42:58 MDT
(In reply to Nate Rini from comment #39)

> Yes. I have attached the patch sans the NEWS entry which will likely break
> 'git apply'. If you're having issues with rlimits nofiles: then there is a
> whole pile of patches to fix that.

OK, thanks. In that case, rather than apply that pile of patches, what I might just do is adjust our Ansible for Perlmutter to update the:

LimitNOFILE=131072

in the unit file to be greater than our limits, since I know that works on our test system, and wait for 21.08.7 to appear.
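For anyone following along, such an override can live in a systemd drop-in rather than the packaged unit file (the path and value below are illustrative, not from this ticket):

```ini
# /etc/systemd/system/slurmd.service.d/limits.conf  (illustrative drop-in)
[Service]
LimitNOFILE=262144
```

followed by `systemctl daemon-reload` and a restart of slurmd, as noted in comment 13.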

Much obliged!