Ticket 13531

Summary: Wrong "core file size" limit on slurmd
Product: Slurm Reporter: Misha Ahmadian <misha.ahmadian>
Component: LimitsAssignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.11.7   
Hardware: Linux   
OS: Linux   
Site: TTU Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf

Description Misha Ahmadian 2022-02-28 13:11:15 MST
Created attachment 23663 [details]
slurm.conf

Hello,

Following up on Bug #13340, which has been resolved and helped us understand what was causing that issue, we're now facing a similar problem: the "core file size" limit is unlimited for the slurmd service, which occasionally causes nodes to go into the "Kill task failed" state.

A few months ago, we added the following "core file size" limits to the limits.conf file on every worker node (except the login and head nodes) to prevent worker nodes from going into the "Kill task failed" state due to the following errors:

Jan 15 19:21:01 cpu-24-24 systemd[89824]: systemd-coredump@124-89751-0.service: Failed to set up network namespacing: No space left on device
Jan 15 19:21:01 cpu-24-24 systemd[89824]: systemd-coredump@124-89751-0.service: Failed at step NETWORK spawning /usr/lib/systemd/systemd-coredump: No space left on device

So, let's pick a reserved node:

[root@login-20-25 ~]# ssh cpu-25-8
[root@cpu-25-8 ~]# grep -v '#' /etc/security/limits.conf

* soft memlock unlimited
* hard memlock unlimited
* hard nofile 8192
* soft nofile 8192
* soft core 2097152
* hard core 4194304

> As you can see, we added hard and soft limits for "core" (core file size).

[root@cpu-25-8 ~]# ulimit -a
core file size          (blocks, -c) 2097152
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

> The "ulimit" command confirms the changes

[root@cpu-25-8 ~]# ulimit -a user1
core file size          (blocks, -c) 2097152
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

> Even a nonprivileged user has the correct "core file size" outside Slurm.

However, the limits of the slurmd process running under systemd look totally different:

[root@cpu-25-8 ~]# cat /proc/$(pidof slurmd)/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             2061717              2061717              processes
Max open files            131072               131072               files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2061717              2061717              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us


So, let's log in to the login node as a normal user, then start an interactive session on the same node as above:

$ ssh user1@login-20-25
[user1@login-20-25 ~]$ salloc -N1 -n 128 -p nocona --reservation=misha -w cpu-25-8
salloc: Granted job allocation 4953070
salloc: Waiting for resource configuration
salloc: Nodes cpu-25-8 are ready for job

[user1@cpu-25-8 ~]$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 527826944
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

As you see here, the "core file size" is "unlimited," the same as the ulimit settings on login-20-25.
If I exit the interactive session and just run "srun" from the login node, I get the same:

[user1@cpu-25-8 ~]$ exit
[user1@login-20-25 ~]$ srun -n1 -N1 -p nocona --reservation=misha -w cpu-25-8 ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 4123648
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Please also note the following setting on the worker nodes (such as cpu-25-8):

1) /etc/security/limits.d/ is empty
2) The settings in /etc/security/limits.conf:

* soft memlock unlimited
* hard memlock unlimited
* hard nofile 8192
* soft nofile 8192
* soft core 2097152
* hard core 4194304

3) There are no limits applied at user login under /etc/pam.d/
4) The Slurmd service file:

[root@cpu-25-8 ~]# cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity

[Install]
WantedBy=multi-user.target

5) No limits have been set in /etc/systemd/* (we can look further if necessary!)
6) The "core file size" has already been excluded in slurm.conf from being propagated to the worker nodes:

# grep PropagateResourceLimitsExcept /etc/slurm/slurm.conf
PropagateResourceLimitsExcept=CORE,MEMLOCK,NPROC

So, I went ahead and stopped the slurmd service on cpu-25-8 and tried to run slurmd outside of systemd:

[root@cpu-25-8 ~]# systemctl stop slurmd
[root@cpu-25-8 ~]# slurmd -D &
[root@cpu-25-8 ~]# cat /proc/$(pidof slurmd)/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        4294967296           4294967296           bytes
Max resident set          unlimited            unlimited            bytes
Max processes             2061717              2061717              processes
Max open files            8192                 8192                 files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2061717              2061717              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us


I also get different ulimits inside an interactive session on cpu-25-8:

[user1@login-20-25 ~]$ salloc -N1 -n 128 -p nocona --reservation=misha -w cpu-25-8
salloc: Granted job allocation 4953070
salloc: Waiting for resource configuration
salloc: Nodes cpu-25-8 are ready for job

[user1@cpu-25-8 ~]$ ulimit -a
core file size          (blocks, -c) 4194304
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 527826944
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

As you can see above, the "core file size" limit works correctly outside of systemd. Do you have any idea what might be changing the ulimits of slurmd under systemd? Where should we look, and what would be a possible solution?

Best Regards,
Misha
Comment 2 Carlos Tripiana Montes 2022-03-02 03:31:17 MST
Hi Misha,

I want to take a look at:

  $ pstree -alnpst $(pidof slurmd)

But most probably, the explanation goes as follows:

1. As stated in [1]:

> Resource limits not configured explicitly for a unit default to the value configured in the various DefaultLimitCPU=, DefaultLimitFSIZE=, … options available in systemd-system.conf(5), and – if not configured there – the kernel or per-user defaults, as defined by the OS (the latter only for user services, see below).

As no LimitCORE seems to be defined for this systemd service, this value is taken from DefaultLimitCORE. In [2]:

> DefaultLimitCORE= does not have a default but it is worth mentioning that RLIMIT_CORE is set to "infinity" by PID 1 which is inherited by its children.
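The inheritance chain described here is ordinary rlimit behavior: a child process receives a copy of its parent's limits at fork()/exec(), which is how slurmd picks up RLIMIT_CORE=infinity from PID 1. A minimal sketch with plain shells (the value 0 is just an illustration; lowering the soft limit is always permitted):

```shell
# The inner shell inherits the core-size limit set by the outer shell,
# just as slurmd inherits RLIMIT_CORE from systemd (PID 1).
bash -c 'ulimit -S -c 0; bash -c "ulimit -S -c"'
# prints: 0
```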

Following on, regarding limits.conf and limits.d [3]: these files apply only through pam_limits.so, i.e. *only to user login sessions*. For this reason, the limits set there don't apply to systemd services, which are instead governed by [4] as a whole ([1] being part of [4]).
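For reference, limits.conf only takes effect where a PAM stack actually includes pam_limits.so; a typical line (the exact file varies by distro, e.g. /etc/pam.d/sshd or a common include) looks like:

```
session    required     pam_limits.so
```

systemd services do not go through such a PAM session stack unless explicitly configured (PAMName=), which is why the core limits in limits.conf are ignored for slurmd.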

At the end of the day, I think any service will have an unlimited core dump size, not just slurmd, unless an explicit LimitCORE is added to its service file. That should be easy to check.

In conclusion: this doesn't seem to be a Slurm bug propagating an unlimited core size from the login nodes. Rather, limits.conf is simply not used for services, and you need to add either DefaultLimitCORE to systemd's system.conf or LimitCORE to the slurmd service file.
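A sketch of the LimitCORE approach, written as a systemd drop-in so the packaged unit file stays untouched (the 4G value mirrors the hard limit of 4194304 KB from limits.conf; note that systemd takes bytes with optional suffixes, while limits.conf uses KB):

```ini
# /etc/systemd/system/slurmd.service.d/override.conf
[Service]
LimitCORE=4G
```

After creating the drop-in, run `systemctl daemon-reload`, restart slurmd, and re-check /proc/$(pidof slurmd)/limits.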

Cheers,
Carlos.

[1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Process%20Properties
[2] https://www.freedesktop.org/software/systemd/man/systemd-system.conf.html#DefaultLimitCPU=
[3] https://linux.die.net/man/5/limits.conf
[4] https://www.freedesktop.org/software/systemd/man/systemd-system.conf.html
Comment 3 Carlos Tripiana Montes 2022-03-04 01:40:47 MST
Misha,

Assuming my last answer covers the issue, I'm going to close it as "info given" for now.

Please reopen if needed.

Cheers,
Carlos.
Comment 4 Misha Ahmadian 2022-03-04 07:44:04 MST
Hi Carlos,

Sorry for the delay in my reply; I was busy with other things.
Thank you very much. I'll let you know if I have further questions.

Best,
Misha