Reporting this bug from HPE, but the issue was found on a customer system at ORNL. When launching on >4000 nodes, we're seeing error messages like this:

srun -N 4002 -n 32016 --ntasks-per-node=8 --cpu_bind=map_cpu:48,56,16,24,2,8,32,40 /ccs/home/kmcmahon/Benchmarks/bisec_bw2 8 1000
Tue 02 Nov 2021 10:24:28 AM EDT
srun: error: Unable to accept new connection: Too many open files
srun: error: Unable to accept new connection: Too many open files
srun: error: Unable to accept new connection: Too many open files
srun: error: Unable to accept new connection: Too many open files

Increasing the open files limit doesn't help. Looking at /proc/<pid>/limits for the srun process, it appears to set its own soft open files limit to 4096, regardless of the limit in effect before it runs; running ulimit -n before srun has no effect on this problem. This looks related to the rlimits_adjust_nofile function in src/common/slurm_rlimits_info.c, which hardcodes 4096 as the file limit.

Is there a way to configure Slurm to require fewer open files in srun?
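For reference, the clamping behavior described above can be reproduced with a minimal standalone C sketch. This is not the actual Slurm source — only the MIN(4096, rlim_max) expression is taken from the report; the function name and surrounding code are illustrative:

```c
#include <stdio.h>
#include <sys/resource.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Clamp the soft open-files limit to at most 4096, mirroring the
 * behavior described for rlimits_adjust_nofile(): whatever soft
 * limit the shell set via `ulimit -n`, the process lowers it again
 * (the hard limit is left untouched). */
int clamp_nofile(void)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_NOFILE, &rlim) < 0) {
        perror("getrlimit(RLIMIT_NOFILE)");
        return -1;
    }

    rlim.rlim_cur = MIN((rlim_t)4096, rlim.rlim_max);

    if (setrlimit(RLIMIT_NOFILE, &rlim) < 0) {
        perror("setrlimit(RLIMIT_NOFILE)");
        return -1;
    }
    return 0;
}
```

This is why `ulimit -n` in the calling shell has no visible effect: the process re-clamps its own soft limit after inheriting the shell's value.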
Please call as root:
> cat /proc/sys/fs/file-max
> ulimit -a -H
> ulimit -a -S

Please call this as the test user:
> ulimit -a -H
> ulimit -a -S
As root on a compute node:

# cat /proc/sys/fs/file-max
9223372036854775807
# ulimit -a -H
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2056733
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 16384
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2056733
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
# ulimit -a -S
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2056733
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 16384
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 300000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2056733
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

As a regular user on a login node:

# ulimit -a -H
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1028069
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1028069
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
# ulimit -a -S
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1028069
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 300000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1028069
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

As a regular user on the login node srun'ing to a compute:

# srun -N1 -A STF002 /bin/bash -c 'ulimit -a -H'
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2056733
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2056733
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
# srun -N1 -A STF002 /bin/bash -c 'ulimit -a -S'
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2056733
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 300000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1028069
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

If I srun a sleep command and go look at slurmstepd on the node:

# cat /proc/62078/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             2056733              2056733              processes
Max open files            4096                 524288               files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2056733              2056733              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

I suspect it's slurmstepd trying to accept stdio/stderr connections that is hitting this problem.
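The errors are consistent with that 4096 soft limit: once every descriptor slot is in use, any fd-creating call — accept() in srun's case — fails with EMFILE, the errno behind "Too many open files". A hedged standalone sketch (not Slurm code; it uses open() on /dev/null rather than accept() for simplicity, and cleans up after itself):

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/* Lower the soft nofile limit, open files until the kernel refuses,
 * and report the resulting errno (expected: EMFILE).  The original
 * limit and all opened descriptors are restored before returning. */
int exhaust_fds(rlim_t soft_limit)
{
    struct rlimit old, rlim;
    int first = -1, last = -1, err = 0;

    if (getrlimit(RLIMIT_NOFILE, &old) < 0)
        return -1;
    rlim = old;
    rlim.rlim_cur = soft_limit;
    if (setrlimit(RLIMIT_NOFILE, &rlim) < 0)
        return -1;

    for (;;) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd < 0) {
            err = errno;        /* expected: EMFILE */
            break;
        }
        if (first < 0)
            first = fd;         /* fds are handed out lowest-first */
        last = fd;
    }

    /* Clean up: release the descriptors and restore the old limit. */
    for (int fd = first; fd >= 0 && fd <= last; fd++)
        close(fd);
    setrlimit(RLIMIT_NOFILE, &old);
    return err;
}
```

With ~8 incoming stdio/stderr connections per node, 4000+ nodes would need far more than 4096 descriptors, which matches the observed failure threshold.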
Please try this patch to srun:

> diff --git a/src/common/slurm_rlimits_info.c b/src/common/slurm_rlimits_info.c
> index aff433b..e37e295 100644
> --- a/src/common/slurm_rlimits_info.c
> +++ b/src/common/slurm_rlimits_info.c
> @@ -206,7 +206,7 @@ extern void rlimits_adjust_nofile(void)
>  	if (getrlimit(RLIMIT_NOFILE, &rlim) < 0)
>  		error("getrlimit(RLIMIT_NOFILE): %m");
>
> -	rlim.rlim_cur = MIN(4096, rlim.rlim_max);
> +	rlim.rlim_cur = MAX(4096, rlim.rlim_max);
>
>  	if (setrlimit(RLIMIT_NOFILE, &rlim) < 0)
>  		error("Unable to adjust maximum number of open files: %m");
I built 21.08.2 with a one-line patch to change 4096 to 16384. We will test with this to see if it resolves the issue.
(In reply to Matt Ezell from comment #6)
> I built 21.08.2 with a one-line patch to change 4096 to 16384. We will test
> with this to see if it resolves the issue.

This has allowed us to run at higher node counts.
(In reply to Matt Ezell from comment #8)
> (In reply to Matt Ezell from comment #6)
> > I built 21.08.2 with a one-line patch to change 4096 to 16384. We will test
> > with this to see if it resolves the issue.
>
> This has allowed us to run at higher node counts.

Okay, this limit was set due to other limits. I will have to look into what the secondary consequences of raising it are.
Created attachment 22104 [details] patch for 21.08 (v2)
(In reply to Nate Rini from comment #10)
> Created attachment 22104 [details]
> patch for 21.08 (v2)

Matt et al.,

Please try this patch. It removes the arbitrary limit on RLIMIT_NOFILE by being more efficient for smaller jobs. For large jobs, I doubt it will hurt performance noticeably.

This patch has not been QA tested, so please only try it on your test system.

Thanks,
--Nate
What version of glibc is installed on this cluster?
(In reply to Nate Rini from comment #13)
> what version of glibc is installed on this cluster?

glibc-2.31-7.30.x86_64
Created attachment 22116 [details]
patch for 21.08 (v3)

Benchmarking shows that walking procfs is actually slower than blindly calling close() on every possible file descriptor. The kernel devs added the new syscall close_range(), but it is only available with glibc 2.34+. This patch instead removes the limit on srun. Please give it a try.
David, Matt,

This is now fixed upstream for slurm-21.08.5:
> https://github.com/SchedMD/slurm/commit/d2c1a05e15de6019c1e2def91e77a0377cd1a446

Closing out the ticket. Please respond if there are any more related issues.

Thanks,
--Nate