Created attachment 23663 [details]
slurm.conf

Hello,

Following Bug #13340, which is already resolved and which taught us what was causing that issue, we are facing a similar problem: the "core file size" limit is unlimited for the slurmd service, which occasionally sends nodes into the "Kill task failed" state.

A few months ago, we added the following "core file size" limits to the limits.conf file of every worker node (except the login and head nodes) to prevent worker nodes from entering the "Kill task failed" state due to errors like these:

Jan 15 19:21:01 cpu-24-24 systemd[89824]: systemd-coredump@124-89751-0.service: Failed to set up network namespacing: No space left on device
Jan 15 19:21:01 cpu-24-24 systemd[89824]: systemd-coredump@124-89751-0.service: Failed at step NETWORK spawning /usr/lib/systemd/systemd-coredump: No space left on device

So, let's pick a reserved node:

[root@login-20-25 ~]# ssh cpu-25-8
[root@cpu-25-8 ~]# grep -v '#' /etc/security/limits.conf
* soft memlock unlimited
* hard memlock unlimited
* hard nofile 8192
* soft nofile 8192
* soft core 2097152
* hard core 4194304

> As you see, we added hard and soft limits for "core" (core file size).

[root@cpu-25-8 ~]# ulimit -a
core file size          (blocks, -c) 2097152
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

> The "ulimit" command confirms the changes.

[root@cpu-25-8 ~]# ulimit -a user1
core file size          (blocks, -c) 2097152
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

> Even a nonprivileged user has the correct "core file size" outside Slurm.

However, the limits of the slurmd process running under systemd look totally different:

[root@cpu-25-8 ~]# cat /proc/$(pidof slurmd)/limits
Limit                     Soft Limit  Hard Limit  Units
Max cpu time              unlimited   unlimited   seconds
Max file size             unlimited   unlimited   bytes
Max data size             unlimited   unlimited   bytes
Max stack size            unlimited   unlimited   bytes
Max core file size        unlimited   unlimited   bytes
Max resident set          unlimited   unlimited   bytes
Max processes             2061717     2061717     processes
Max open files            131072      131072      files
Max locked memory         unlimited   unlimited   bytes
Max address space         unlimited   unlimited   bytes
Max file locks            unlimited   unlimited   locks
Max pending signals       2061717     2061717     signals
Max msgqueue size         819200      819200      bytes
Max nice priority         0           0
Max realtime priority     0           0
Max realtime timeout      unlimited   unlimited   us

So, let's log in as a normal user on the login node, then start an interactive session on the same node as above:

$ ssh user1@login-20-25
[user1@login-20-25 ~]$ salloc -N1 -n 128 -p nocona --reservation=misha -w cpu-25-8
salloc: Granted job allocation 4953070
salloc: Waiting for resource configuration
salloc: Nodes cpu-25-8 are ready for job
[user1@cpu-25-8 ~]$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 527826944
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

As you can see, the "core file size" is "unlimited", the same as the ulimit settings on login-20-25. If I exit the interactive session and just run "srun" from the login node, I get the same:

[user1@cpu-25-8 ~]$ exit
[user1@login-20-25 ~]$ srun -n1 -N1 -p nocona --reservation=misha -w cpu-25-8 ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 4123648
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Please also note the following settings on the worker nodes (such as cpu-25-8):

1) /etc/security/limits.d/ is empty.

2) The settings in /etc/security/limits.conf:

* soft memlock unlimited
* hard memlock unlimited
* hard nofile 8192
* soft nofile 8192
* soft core 2097152
* hard core 4194304

3) There is no limit set on user authentication at login under /etc/pam.d/.

4) The slurmd service file:

[root@cpu-25-8 ~]# cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity

[Install]
WantedBy=multi-user.target

5) No limit has been set in /etc/systemd/*. (We can look further if that's necessary!)
6) The "core file size" has already been excluded in slurm.conf from being propagated to the worker nodes:

# grep PropagateResourceLimitsExcept /etc/slurm/slurm.conf
PropagateResourceLimitsExcept=CORE,MEMLOCK,NPROC

So, I went ahead and stopped the slurmd service on cpu-25-8 and tried running slurmd outside of systemd:

[root@cpu-25-8 ~]# systemctl stop slurmd
[root@cpu-25-8 ~]# slurmd -D &
[root@cpu-25-8 ~]# cat /proc/$(pidof slurmd)/limits
Limit                     Soft Limit  Hard Limit  Units
Max cpu time              unlimited   unlimited   seconds
Max file size             unlimited   unlimited   bytes
Max data size             unlimited   unlimited   bytes
Max stack size            8388608     unlimited   bytes
Max core file size        4294967296  4294967296  bytes
Max resident set          unlimited   unlimited   bytes
Max processes             2061717     2061717     processes
Max open files            8192        8192        files
Max locked memory         unlimited   unlimited   bytes
Max address space         unlimited   unlimited   bytes
Max file locks            unlimited   unlimited   locks
Max pending signals       2061717     2061717     signals
Max msgqueue size         819200      819200      bytes
Max nice priority         0           0
Max realtime priority     0           0
Max realtime timeout      unlimited   unlimited   us

I also get different ulimits inside an interactive session on cpu-25-8:

[user1@login-20-25 ~]$ salloc -N1 -n 128 -p nocona --reservation=misha -w cpu-25-8
salloc: Granted job allocation 4953070
salloc: Waiting for resource configuration
salloc: Nodes cpu-25-8 are ready for job
[user1@cpu-25-8 ~]$ ulimit -a
core file size          (blocks, -c) 4194304
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 527826944
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

As you can see above, the "core file size" works correctly outside of systemd. Do you have any idea what might be changing the ulimits of slurmd under systemd? Or where should we look, and what is a possible solution for this?

Best Regards,
Misha
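(For reference, a point that may help when reasoning about the transcripts above: rlimits are inherited across fork/exec, so whatever limits a slurmd process starts with are what its job steps begin with, unless something explicitly resets them. A minimal, hypothetical Python sketch — not part of the ticket — demonstrating that inheritance:)

```python
import resource
import subprocess
import sys

# Lower this process's soft core-file limit to 0; the hard limit is kept as-is.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (0, hard))

# A child process inherits the parent's rlimits, just as a job step starts
# from the limits of the daemon that forked it.
out = subprocess.run(
    [sys.executable, "-c",
     "import resource; print(resource.getrlimit(resource.RLIMIT_CORE)[0])"],
    capture_output=True, text=True,
).stdout.strip()
print(out)  # → 0, the soft limit the parent set
```

This is why the interactive session's "core file size" mirrors whatever environment slurmd itself was started from.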
Hi Misha,

I want to take a look at:

$ pstree -alnpst $(pidof slurmd)

But most probably it goes as follows:

1. As stated in [1]:

> Resource limits not configured explicitly for a unit default to the value configured in the various DefaultLimitCPU=, DefaultLimitFSIZE=, … options available in systemd-system.conf(5), and – if not configured there – the kernel or per-user defaults, as defined by the OS (the latter only for user services, see below).

Since no LimitCORE seems to be defined for this systemd service, the value is taken from DefaultLimitCORE. In [2]:

> DefaultLimitCORE= does not have a default but it is worth mentioning that RLIMIT_CORE is set to "infinity" by PID 1 which is inherited by its children.

2. Regarding limits.conf/limits.d [3]: these files are read only by pam_limits.so, which means they apply *only to user login sessions*. For this reason, the limits set there do not apply to systemd services, which are governed by all of [4] ([1] being part of [4]).

At the end of the day, I think any service — not just slurmd — will have an unlimited core dump size unless an explicit LimitCORE is added to its service file. That should be easy to check.

In conclusion: this does not look like a Slurm bug propagating an unlimited core size from the login nodes. Rather, limits.conf is simply not used for services, and you need to add either DefaultLimitCORE= in systemd's system.conf or LimitCORE= to the slurmd service file.

Cheers,
Carlos.

[1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Process%20Properties
[2] https://www.freedesktop.org/software/systemd/man/systemd-system.conf.html#DefaultLimitCPU=
[3] https://linux.die.net/man/5/limits.conf
[4] https://www.freedesktop.org/software/systemd/man/systemd-system.conf.html
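(Concretely, one way to apply this suggestion without editing the packaged unit file is a drop-in override — a sketch; the drop-in file name and the 4G value are assumptions, chosen here to match the hard limit from the node's limits.conf, i.e. 4194304 1-KiB blocks = 4 GiB:)

```ini
# /etc/systemd/system/slurmd.service.d/core-limit.conf  (hypothetical drop-in path)
[Service]
# Cap core dumps for slurmd and, by inheritance, its job steps.
LimitCORE=4G
```

After creating the drop-in, run `systemctl daemon-reload` and restart slurmd, then verify with `systemctl show slurmd -p LimitCORE` or by re-reading /proc/$(pidof slurmd)/limits.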
Misha,

Guessing the issue is covered by my last answer, I'm going to close this as information given for now. Please reopen if needed.

Cheers,
Carlos.
Hi Carlos,

Sorry for the delay in my reply; I was busy with other work. Thank you very much. I'll let you know if I have further questions.

Best,
Misha