| Summary: | Max number of open file descriptors is capped to 4096 in Slurm 21.08 | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Francesco De Martino <fdm> |
| Component: | slurmd | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | bas.vandervlies, carrogu, csamuel, mmelato, nate, nick, schedmd-contacts |
| Version: | 21.08.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=13761 | | |
| Site: | DS9 (PSLA) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 21.08.7, 22.05pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, patch for 21.08 (DS9 only test), patch for master, patch | | |
Description
Francesco De Martino
2022-03-15 13:04:02 MDT
(In reply to Francesco De Martino from comment #0)
> Can you please share the reasoning behind this change in behavior?

This behavior was changed in bug#10254 and then (mostly) reverted in bug#12804:
> https://github.com/SchedMD/slurm/commit/d2c1a05e15de6019c1e2def91e77a0377cd1a446

Please verify your running Slurm version:
> slurmd -V

We're using Slurm version 21.08.6, where we're observing the behavior change reported above.

Please attach the slurm.conf. Please run this test job and attach the output:
> sbatch --wrap 'srun cat /proc/self/limits'
Created attachment 23883 [details]
slurm.conf
On the head node:

$ sinfo -V
slurm 21.08.6
$ sbatch --wrap 'srun cat /proc/self/limits'
Submitted batch job 1
$ cat slurm-1.out
Limit                     Soft Limit  Hard Limit  Units
Max cpu time              unlimited   unlimited   seconds
Max file size             unlimited   unlimited   bytes
Max data size             unlimited   unlimited   bytes
Max stack size            10485760    unlimited   bytes
Max core file size        0           unlimited   bytes
Max resident set          unlimited   unlimited   bytes
Max processes             unlimited   unlimited   processes
Max open files            8192        131072      files
Max locked memory         unlimited   unlimited   bytes
Max address space         unlimited   unlimited   bytes
Max file locks            unlimited   unlimited   locks
Max pending signals       30446       30446       signals
Max msgqueue size         819200      819200      bytes
Max nice priority         0           0
Max realtime priority     0           0
Max realtime timeout      unlimited   unlimited   us

On the compute node:

$ ps aux | grep slurm
root  5166  0.0  0.0  152784  5936 ?  Ss  08:09  0:00 /opt/slurm/sbin/slurmd -D -N compute1-st-compute1-i1-1
$ sudo cat /proc/5166/limits
Limit                     Soft Limit  Hard Limit  Units
Max cpu time              unlimited   unlimited   seconds
Max file size             unlimited   unlimited   bytes
Max data size             unlimited   unlimited   bytes
Max stack size            unlimited   unlimited   bytes
Max core file size        unlimited   unlimited   bytes
Max resident set          unlimited   unlimited   bytes
Max processes             unlimited   unlimited   processes
Max open files            4096        131072      files
Max locked memory         unlimited   unlimited   bytes
Max address space         unlimited   unlimited   bytes
Max file locks            unlimited   unlimited   locks
Max pending signals       30446       30446       signals
Max msgqueue size         819200      819200      bytes
Max nice priority         0           0
Max realtime priority     0           0
Max realtime timeout      unlimited   unlimited   us

The issue occurs when the limit set on the head node is greater than the hard limit on the compute node.
On the head node:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30446
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 131073
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Note: open files 131073 > 131072.

$ sbatch --wrap 'srun cat /proc/self/limits'
Submitted batch job 3
$ cat slurm-3.out
Limit                     Soft Limit  Hard Limit  Units
Max cpu time              unlimited   unlimited   seconds
Max file size             unlimited   unlimited   bytes
Max data size             unlimited   unlimited   bytes
Max stack size            10485760    unlimited   bytes
Max core file size        0           unlimited   bytes
Max resident set          unlimited   unlimited   bytes
Max processes             unlimited   unlimited   processes
Max open files            4096        131072      files
Max locked memory         unlimited   unlimited   bytes
Max address space         unlimited   unlimited   bytes
Max file locks            unlimited   unlimited   locks
Max pending signals       30446       30446       signals
Max msgqueue size         819200      819200      bytes
Max nice priority         0           0
Max realtime priority     0           0
Max realtime timeout      unlimited   unlimited   us

$ sbatch --wrap 'ulimit -a'
Submitted batch job 4
$ cat slurm-4.out
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30446
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Note: open files = 4096.

Working on replicating this locally.

Please provide:
> systemctl --version
(In reply to Nate Rini from comment #8)
> Please provide:
> > systemctl --version

$ systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

Please also provide:
> uname -a
(In reply to Nate Rini from comment #11)
> Please also provide:
> > uname -a

[ec2-user@ip-10-0-0-57 ~]$ uname -a
Linux ip-10-0-0-57 4.14.268-205.500.amzn2.x86_64 #1 SMP Wed Mar 2 18:38:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Please set in the slurmd.service unit file:
> LimitNOFILE=unlimited

A restart of slurmd will be required. Please then retry your test.

(In reply to Francesco De Martino from comment #10)
> systemd 219

I have been able to replicate this issue locally: the hard limit of any child process of slurmd is capped by systemd at the limit set in the slurmd unit file. It looks like systemd 240 *may* relax this limitation, but my systemd is only at 239, so that will require further testing.

(In reply to Nate Rini from comment #13)
> I have been able to replicate this issue locally: the hard limit of any
> child process of slurmd is capped by systemd at the limit set in the slurmd
> unit file.

The problem is not that the hard limit of child processes is capped to the limit set in the slurmd unit file - that is indeed the desired behavior. What we are experiencing is that the limit gets capped to 4096, and not to the one set in the slurmd service file, when the limits propagated from the submission host are bigger.

Submission host -> 131073
slurmd -> 131072
job process -> capped to 4096

(In reply to Francesco De Martino from comment #14)
> What we are experiencing is that the limit gets capped to 4096, and not to
> the one set in the slurmd service file, when the limits propagated from the
> submission host are bigger.

Do you mean the soft limit? The data provided shows the hard limit is still at the expected limit.
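Nate's suggestion above amounts to overriding the unit's `LimitNOFILE`. A minimal sketch of how that is typically done with a systemd drop-in (the drop-in path and file name here are illustrative, not taken from this ticket):

```ini
# /etc/systemd/system/slurmd.service.d/limits.conf  (illustrative path)
[Service]
LimitNOFILE=unlimited
```

After writing the drop-in, a `systemctl daemon-reload` followed by the slurmd restart mentioned above is needed for the new limit to take effect.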
(In reply to Luca Carrogu from comment #6)
> Max open files            4096        131072      files

(In reply to Nate Rini from comment #15)
> (In reply to Francesco De Martino from comment #14)
> > What we are experiencing is that the limit gets capped to 4096, and not to
> > the one set in the slurmd service file, when the limits propagated from
> > the submission host are bigger.
>
> Do you mean the soft limit? The data provided shows the hard limit is still
> at the expected limit.
>
> (In reply to Luca Carrogu from comment #6)
> > Max open files            4096        131072      files

Yes, soft limits.

I suspect the limits are getting inherited. Please set this in your slurm.conf and restart all the daemons:
> PropagateResourceLimits=NONE
(In reply to Nate Rini from comment #17)
> I suspect the limits are getting inherited. Please set this in your
> slurm.conf and restart all the daemons:
> > PropagateResourceLimits=NONE

Yes, limits get propagated, but the limit on the submit node is not set to 4096 but rather to 131073. How come the end result is 4096 whenever the propagated limits are greater than the slurmd ones?

(In reply to Francesco De Martino from comment #18)
> Yes, limits get propagated, but the limit on the submit node is not set to
> 4096 but rather to 131073. How come the end result is 4096 whenever the
> propagated limits are greater than the slurmd ones?

slurmstepd attempts to set the limit as requested and fails:

> [2022-03-16T12:35:08.371] [4.batch] Can't propagate RLIMIT_NPROC of 'unlimited' from submit host: Operation not permitted
> [2022-03-16T12:35:08.371] [4.batch] debug2: _set_limit: RLIMIT_NOFILE : max:5000 cur:4096 req:8000
> [2022-03-16T12:35:08.371] [4.batch] Can't propagate RLIMIT_NOFILE of 8000 from submit host: Operation not permitted

When this happens, the job retains the default limit - in this case, 4096 files. Please note that slurmstepd is running as the user, not root, while attempting to change the limits.

(In reply to Nate Rini from comment #19)
> (In reply to Francesco De Martino from comment #18)
> > (In reply to Nate Rini from comment #17)
> > > I suspect the limits are getting inherited. Please set this in your
> > > slurm.conf and restart all the daemons:
> > > > PropagateResourceLimits=NONE
> >
> > Yes, limits get propagated, but the limit on the submit node is not set to
> > 4096 but rather to 131073. How come the end result is 4096 whenever the
> > propagated limits are greater than the slurmd ones?
> slurmstepd attempts to set the limit as requested and fails:
> > [2022-03-16T12:35:08.371] [4.batch] Can't propagate RLIMIT_NPROC of 'unlimited' from submit host: Operation not permitted
> > [2022-03-16T12:35:08.371] [4.batch] debug2: _set_limit: RLIMIT_NOFILE : max:5000 cur:4096 req:8000
> > [2022-03-16T12:35:08.371] [4.batch] Can't propagate RLIMIT_NOFILE of 8000 from submit host: Operation not permitted
>
> When this happens, the job retains the default limit - in this case, 4096
> files. Please note that slurmstepd is running as the user, not root, while
> attempting to change the limits.

And this brings me back to my original question: why did this behavior change in 21.08? Where is this default limit coming from?

What we are seeing with 20.11 is:
Submission host -> 131073
slurmd -> 131072
job process -> 131072 (not 4096)

(In reply to Francesco De Martino from comment #20)
> And this brings me back to my original question: why did this behavior
> change in 21.08? Where is this default limit coming from?
>
> What we are seeing with 20.11 is:
> Submission host -> 131073
> slurmd -> 131072
> job process -> 131072 (not 4096)

Prior to bug#10254, the nofile rlimit for slurmd was not changed. However, with bug#10254, we found that forking slurmstepd with a large possible number of file descriptors is quite expensive. slurmd itself has no need for such a large limit.

I'll look at making the limit change more graceful.

Created attachment 23899 [details]
patch for 21.08 (DS9 only test)

(In reply to Nate Rini from comment #21)
> I'll look at making the limit change more graceful.

This patch switches around the logic to avoid failing when the hard limit is lower than the requested soft limit. Feel free to try this patch while it undergoes our normal QA process.

Could you share the Slurm version, and indicatively when this patch will be released? Thanks.
(In reply to Nate Rini from comment #22)
> Created attachment 23899 [details]
> patch for 21.08 (DS9 only test)

(In reply to Nate Rini from comment #24)
> (In reply to Nate Rini from comment #23)
> > Created attachment 23900 [details]
> > patch for master

(In reply to mmelato from comment #25)
> Could you share the Slurm version, and indicatively when this patch will be
> released? Thanks.

The patches for the Slurm-21.08 and Slurm-22.05 (not tagged yet) releases are above. The patch is currently undergoing review, and I don't have an ETA other than that we are targeting the slurm-22.05 release. The slurm-21.08 patch has been provided to allow your site to test and verify it solves your issue.

(In reply to Nate Rini from comment #22)
> Created attachment 23899 [details]
> patch for 21.08 (DS9 only test)
>
> (In reply to Nate Rini from comment #21)
> > I'll look at making the limit change more graceful.
>
> This patch switches around the logic to avoid failing when the hard limit is
> lower than the requested soft limit. Feel free to try this patch while it
> undergoes our normal QA process.

I tested the patch in my cluster, and it looks like it restores the behavior we were seeing in Slurm 20.11.
[ec2-user@ip-10-0-0-114 ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30446
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 131073
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[ec2-user@ip-10-0-0-114 ~]$ sbatch --wrap "ulimit -a"
Submitted batch job 8
[ec2-user@ip-10-0-0-114 ~]$ cat slurm-8.out
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 30446
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Hi Nate,

An additional question from my side: is there any downside if we were to set LimitNOFILE in the slurmd service to the max allowed on the system?

(In reply to Francesco De Martino from comment #28)
> Is there any downside if we were to set LimitNOFILE in the slurmd service to
> the max allowed on the system?

Slurm will attempt to close() all the possible nofiles at every fork, so there will be additional delays starting slurmstepd, prolog, and epilog scripts. Unless the node will be doing a large throughput of jobs, the difference will be negligible with any modern non-virtualized kernel.
(In reply to Francesco De Martino from comment #27)
> (In reply to Nate Rini from comment #22)
> > Created attachment 23899 [details]
> > patch for 21.08 (DS9 only test)
> >
> > This patch switches around the logic to avoid failing when the hard limit
> > is lower than the requested soft limit. Feel free to try this patch while
> > it undergoes our normal QA process.
>
> I tested the patch in my cluster, and it looks like it restores the behavior
> we were seeing in Slurm 20.11.

The patch is now upstream for the upcoming 21.08.7 release:
> https://github.com/SchedMD/slurm/commit/16d1f708220d213534faf7149a05f5e40079883a

Please reply if you have any more related questions. I'm going to close this ticket out.

Great to find this already fixed. I spent a while yesterday debugging this on Perlmutter for an NCCL app running at large scale that used to work but failed when we went to 21.08.

Whilst waiting for Slurm 21.08.7 to appear, is it just this commit that's needed?

commit 16d1f708220d213534faf7149a05f5e40079883a
Author: Nathan Rini <nate@schedmd.com>
Date:   Wed Mar 16 14:52:50 2022 -0600

    slurmstepd - cap rlimit request to max possible

    slurmstepd runs as the job user and can't override the existing hard
    limits. Instead of failing to set the soft rlimit at all, set the
    soft rlimit to the max possible hard rlimit.

    Bug 13624

All the best,
Chris

Created attachment 24473 [details]
patch

(In reply to Chris Samuel (NERSC) from comment #38)
> Great to find this already fixed. I spent a while yesterday debugging this
> on Perlmutter for an NCCL app running at large scale that used to work but
> failed when we went to 21.08.
>
> Whilst waiting for Slurm 21.08.7 to appear, is it just this commit that's
> needed?

Yes. I have attached the patch sans the NEWS entry, which will likely break 'git apply'.
If you're having issues with rlimits/nofiles, then there is a whole pile of patches to fix that.

(In reply to Nate Rini from comment #39)
> Yes. I have attached the patch sans the NEWS entry, which will likely break
> 'git apply'. If you're having issues with rlimits/nofiles, then there is a
> whole pile of patches to fix that.

OK, thanks. In that case, rather than apply that pile of patches, what I might just do is adjust our Ansible for Perlmutter to update the

LimitNOFILE=131072

in the unit file to be greater than our limits, as I know that works from our test system, and wait for 21.08.7 to appear.

Much obliged!