Ticket 12804 - Too many open files errors when launching to >4000 nodes
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 21.08.1
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Nate Rini
 
Reported: 2021-11-02 10:32 MDT by David Gloe
Modified: 2022-01-24 09:45 MST

See Also:
Site: ORNL-OLCF
Version Fixed: 21.08.5, 22.05pre1


Attachments
patch for 21.08 (v2) (4.51 KB, patch)
2021-11-03 08:32 MDT, Nate Rini
patch for 21.08 (v3) (4.77 KB, patch)
2021-11-03 17:06 MDT, Nate Rini

Description David Gloe 2021-11-02 10:32:07 MDT
Reporting this bug from HPE, but the issue was found on a customer system at ORNL.

When launching on >4000 nodes, we're seeing error messages like this:

srun -N 4002 -n 32016 --ntasks-per-node=8 --cpu_bind=map_cpu:48,56,16,24,2,8,32,40 /ccs/home/kmcmahon/Benchmarks/bisec_bw2 8 1000
Tue 02 Nov 2021 10:24:28 AM EDT
srun: error: Unable to accept new connection: Too many open files
srun: error: Unable to accept new connection: Too many open files
srun: error: Unable to accept new connection: Too many open files
srun: error: Unable to accept new connection: Too many open files

Increasing the open files limit doesn't seem to help. Looking at /proc/<pid>/limits for the srun process, srun appears to set its own soft open files limit to 4096 regardless of the limit in effect when it starts. In other words, raising the limit with ulimit -n before running srun has no effect on this problem.

It looks like this is related to the rlimits_adjust_nofile function in src/common/slurm_rlimits_info.c, which hardcodes 4096 for the file limit.

Is there a way to configure Slurm to require fewer open files in srun?
Comment 1 Nate Rini 2021-11-02 11:17:48 MDT
Please run the following as root:
> cat /proc/sys/fs/file-max
> ulimit -a -H
> ulimit -a -S

Please run these as the test user:
> ulimit -a -H
> ulimit -a -S
Comment 2 Matt Ezell 2021-11-02 11:25:48 MDT
As root on a compute node:
# cat /proc/sys/fs/file-max 
9223372036854775807
# ulimit -a -H
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2056733
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 16384
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2056733
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
# ulimit -a -S
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2056733
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 16384
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 300000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2056733
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

As a regular user on a login node:
# ulimit -a -H
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1028069
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1028069
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
# ulimit -a -S
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1028069
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 300000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1028069
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

As a regular user on the login node srun'ing to a compute:
# srun -N1 -A STF002 /bin/bash -c 'ulimit -a -H'
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2056733
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2056733
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
# srun -N1 -A STF002 /bin/bash -c 'ulimit -a -S'
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2056733
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 300000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1028069
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

If I srun a sleep command and go look at slurmstepd on the node:
# cat /proc/62078/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            unlimited            unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             2056733              2056733              processes 
Max open files            4096                 524288               files     
Max locked memory         unlimited            unlimited            bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       2056733              2056733              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us   

I suspect it's slurmstepd trying to accept stdio/stderr connections that is hitting this problem.
Comment 5 Nate Rini 2021-11-02 11:47:25 MDT
Please try this patch to srun:
> diff --git a/src/common/slurm_rlimits_info.c b/src/common/slurm_rlimits_info.c
> index aff433b..e37e295 100644
> --- a/src/common/slurm_rlimits_info.c
> +++ b/src/common/slurm_rlimits_info.c
> @@ -206,7 +206,7 @@ extern void rlimits_adjust_nofile(void)
>       if (getrlimit(RLIMIT_NOFILE, &rlim) < 0)
>               error("getrlimit(RLIMIT_NOFILE): %m");
>  
> -     rlim.rlim_cur = MIN(4096, rlim.rlim_max);
> +     rlim.rlim_cur = MAX(4096, rlim.rlim_max);
>  
>       if (setrlimit(RLIMIT_NOFILE, &rlim) < 0)
>               error("Unable to adjust maximum number of open files: %m");
Comment 6 Matt Ezell 2021-11-02 11:48:42 MDT
I built 21.08.2 with a one-line patch to change 4096 to 16384. We will test with this to see if it resolves the issue.
Comment 8 Matt Ezell 2021-11-02 12:37:25 MDT
(In reply to Matt Ezell from comment #6)
> I built 21.08.2 with a one-line patch to change 4096 to 16384. We will test
> with this to see if it resolves the issue.

This has allowed us to run at higher node counts.
Comment 9 Nate Rini 2021-11-02 12:59:16 MDT
(In reply to Matt Ezell from comment #8)
> (In reply to Matt Ezell from comment #6)
> > I built 21.08.2 with a one-line patch to change 4096 to 16384. We will test
> > with this to see if it resolves the issue.
> 
> This has allowed us to run at higher node counts.

Okay, this limit was set due to other limits. I will have to look into what the secondary consequences of this change are.
Comment 10 Nate Rini 2021-11-03 08:32:05 MDT
Created attachment 22104 [details]
patch for 21.08 (v2)
Comment 11 Nate Rini 2021-11-03 08:34:46 MDT
(In reply to Nate Rini from comment #10)
> Created attachment 22104 [details]
> patch for 21.08 (v2)

Matt et al.

Please try this patch. It removes the arbitrary cap on RLIMIT_NOFILE while remaining efficient for smaller jobs. For large jobs, I doubt it will hurt performance noticeably.

This patch has not been QA tested, so please only try it on your test system.

Thanks,
--Nate
Comment 13 Nate Rini 2021-11-03 13:56:26 MDT
what version of glibc is installed on this cluster?
Comment 14 Matt Ezell 2021-11-03 13:58:08 MDT
(In reply to Nate Rini from comment #13)
> what version of glibc is installed on this cluster?

glibc-2.31-7.30.x86_64
Comment 15 Nate Rini 2021-11-03 17:06:29 MDT
Created attachment 22116 [details]
patch for 21.08 (v3)

Benchmarking shows that walking procfs is actually slower than blindly calling close() on every possible file descriptor. The kernel devs added a new syscall, close_range(), but it is only available with glibc 2.34+.

This patch instead removes the limit on srun. Please give it a try.
Comment 21 Nate Rini 2021-12-09 14:29:19 MST
David, Matt,

This is now fixed upstream for slurm-21.08.5:
> https://github.com/SchedMD/slurm/commit/d2c1a05e15de6019c1e2def91e77a0377cd1a446

Closing out the ticket. Please respond if there are any more related issues.

Thanks,
--Nate