Description
Azat
2019-07-05 02:33:17 MDT
Hi Azat,

Will you attach slurmctld.log? Thanks. Can you also show the output of:

$ systemctl status slurmctld
$ ulimit -a
$ cat /proc/sys/kernel/threads-max

(In reply to Alejandro Sanchez from comment #1)
> Hi Azat,
>
> Will you attach slurmctld.log? thanks.

Hi Alex,

It is quite huge. But there are no new errors which we didn't have previously, and no specific errors preceding the failure of the daemon. There are tons of errors like

> [2019-07-05T11:36:36.396] error: select/cons_res: node gwdd041 memory is under-allocated (22400-41600) for JobId=709079
> [2019-07-05T11:36:36.396] error: select/cons_res: node gwdd069 memory is under-allocated (51200-64000) for JobId=709079
> [2019-07-05T11:36:36.396] error: select/cons_res: node gwdd071 memory is under-allocated (51200-64000) for JobId=709079

Are you looking for specific errors? Here is the output of the commands:

> gwdu104:4 15:38:46 ~ # systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>    Active: active (running) since Fri 2019-07-05 09:53:14 CEST; 5h 45min ago
>   Process: 27433 ExecStart=/opt/slurm/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>  Main PID: 27435 (slurmctld)
>    CGroup: /system.slice/slurmctld.service
>            └─27435 /opt/slurm/sbin/slurmctld
>
> Jul 05 09:53:14 gwdu104 systemd[1]: Starting Slurm controller daemon...
> Jul 05 09:53:14 gwdu104 systemd[1]: PID file /var/run/slurmctld.pid not readable (yet?) after start.
> Jul 05 09:53:14 gwdu104 systemd[1]: Started Slurm controller daemon.

> gwdu104:4 15:38:57 ~ # ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 385661
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 65536
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 385661
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited

> gwdu104:4 15:41:10 ~ # cat /proc/sys/kernel/threads-max
> 771323

Thank you!

(In reply to Azat from comment #3)
> (In reply to Alejandro Sanchez from comment #1)
> > Hi Azat,
> >
> > Will you attach slurmctld.log? thanks.
>
> Hi Alex,
>
> It is quite huge.

If your logs are quite large, remember you can logrotate them (there's an example in the slurm.conf man page at the bottom). Take into account this recent improvement to that example: https://github.com/SchedMD/slurm/commit/1e3548f387c6f6b

> But there are no new errors which we didn't have
> previously. And no specific errors preceding the failure of the daemon.
> There are tons of errors like
>
> > [2019-07-05T11:36:36.396] error: select/cons_res: node gwdd041 memory is under-allocated (22400-41600) for JobId=709079
> > [2019-07-05T11:36:36.396] error: select/cons_res: node gwdd069 memory is under-allocated (51200-64000) for JobId=709079
> > [2019-07-05T11:36:36.396] error: select/cons_res: node gwdd071 memory is under-allocated (51200-64000) for JobId=709079

There's a patch pending review in bug 6769 where we are tracking this issue.

> Are you looking for specific errors?

Not particularly, just to get a feel of what's going on.
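For reference, the logrotate approach mentioned a few paragraphs up might look roughly like the sketch below. This is not the exact slurm.conf man page example: the log path, rotation policy, and file name are assumptions, and the postrotate signal follows the convention of sending SIGUSR2 so slurmctld reopens its log file (the point of the commit linked above); check the man page for the recommended stanza.

# Hypothetical /etc/logrotate.d/slurmctld -- a minimal sketch, adapt paths to your site
cat > /etc/logrotate.d/slurmctld <<'EOF'
/var/log/slurm/slurmctld.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    postrotate
        pkill -x --signal SIGUSR2 slurmctld
    endscript
}
EOF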
> Here is the output of the commands:
> > gwdu104:4 15:38:46 ~ # systemctl status slurmctld

Sorry, I was more interested in:

$ systemctl show slurmctld.service | egrep "TasksMax|LimitNPROC"

(In reply to Alejandro Sanchez from comment #4)
> If your logs are quite large, remember you can logrotate them (there's an
> example in slurm.conf man page at the bottom). Take into account this recent

Thank you, we will look into that possibility.

> There's a patch pending review in bug 6769 where we are tracking this issue.

That would probably reduce the logs a lot :)

> Sorry I was more interested in:
>
> $ systemctl show slurmctld.service | egrep "TasksMax|LimitNPROC"

Here is the output:

> gwdu104:6 16:18:02 ~ # systemctl show slurmctld.service | egrep "TasksMax|LimitNPROC"
> LimitNPROC=385661

But once again, all the limits are the same and the workload did not change after the update. Thank you again for the help!

It is weird that TasksMax isn't in the output of the previous command. Can you try?

$ systemctl show -p TasksMax slurmctld

Thanks.

(In reply to Alejandro Sanchez from comment #6)
> It is weird that TasksMax isn't in the output of the previous command. Can
> you try?
>
> $ systemctl show -p TasksMax slurmctld

Yes, it is empty indeed.

> gwdu104:2 10:22:54 ~ # systemctl show -p TasksMax slurmctld
> gwdu104:2 10:25:04 ~ #

Can you verify if you have:

LimitNOFILE=65536
TasksMax=infinity

in your slurmctld.service, and if not, add those and restart it?

(In reply to Alejandro Sanchez from comment #8)
> Can you verify if you have:
>
> LimitNOFILE=65536
> TasksMax=infinity
>
> in your slurmctld.service and if not add those and restart it?

We are using the service file provided by the build process of Slurm. Its content is:

> [Unit]
> Description=Slurm controller daemon
> After=network.target munge.service
> ConditionPathExists=/opt/slurm/etc/slurm.conf
>
> [Service]
> Type=forking
> EnvironmentFile=-/etc/sysconfig/slurmctld
> ExecStart=/opt/slurm/slurm/19.05.0/install/sbin/slurmctld $SLURMCTLD_OPTIONS
> ExecReload=/bin/kill -HUP $MAINPID
> PIDFile=/var/run/slurmctld.pid
> LimitNOFILE=65536
>
> [Install]
> WantedBy=multi-user.target

We would not like to set TasksMax to infinity: if there is some problem with Slurm related to the number of processes it creates, it could crash the head nodes of the cluster it runs on, and we would like to avoid that. We already have a fairly high limit on the number of processes set in /proc/SLURMCTLDPID/limits:

> Max processes             385661

Should we first try setting a higher limit instead? For instance:

> TasksMax=600000

That's interesting. I'll provide you with more background: the TasksMax setting was added[1] to unit files in systemd version 227. That's why the Slurm configure script uses the auxdir/x_ac_systemd.m4 macro[2] to check whether the systemd version is >= 227 and, if so, substitute SYSTEMD_TASKSMAX_OPTION[3] with TasksMax=infinity[4] at build time.

What version of systemd do you have installed in your environment? If "infinity" isn't a good value for you, the one you suggested should be fine, although depending on your systemd version the setting may not be supported in unit files yet.
[1] https://github.com/systemd/systemd/blob/master/NEWS#L4082
[2] https://github.com/SchedMD/slurm/blob/slurm-19-05-0-1/auxdir/x_ac_systemd.m4#L31
[3] alex@polaris:~/slurm$ grep -i TasksMax source/etc/slurmctld.service.in
    @SYSTEMD_TASKSMAX_OPTION@
[4] alex@polaris:~/slurm$ grep -i tasks 19.05/build/etc/slurmctld.service
    TasksMax=infinity

(In reply to Alejandro Sanchez from comment #10)

Thank you for the exhaustive information, it clarifies a lot. The systemd version on our nodes is 219:

> gwdu104:4 11:28:47 ~ # systemctl --version
> systemd 219
> ...

So that means we don't need to set TasksMax, right? What else can we do in order to find the root cause of the issue?

Is there any chance to upgrade the systemd version? Otherwise, we'd need to investigate how systemd handled this prior to the availability of the TasksMax setting.

(In reply to Alejandro Sanchez from comment #12)
> Is there any chance to upgrade systemd version?

Unfortunately it is the latest version of systemd for the Scientific Linux we are running, so we cannot easily update it.

> Otherwise, we'd need to investigate how systemd handled this prior to
> TasksMax setting availability.

But it did work with the previous version of Slurm.

The fact that it did work for your site while you were running 18.08 isn't conclusive. The problem might also occur if other resource limits are being hit (like memory or file descriptors). For instance, we had a bug from another site running 18.08 where they also reported this problem. They noticed an increased number of mmaps in /proc/<slurmctld pid>/smaps. In their case, they had their own jobcomp plugin and it wasn't properly free_buf'ing some mmap'd memory. After fixing this, they stopped seeing the error. I don't think that's your case, since it's unlikely you run your own plugin, but it wouldn't hurt attaching the output of:

$ cat /proc/$(pgrep slurmctld)/smaps

and monitoring whether slurmctld memory grows abnormally.

In the other bug they also tried increasing:

/etc/security/limits.conf nofile to 1048576
/proc/sys/kernel/threads-max to 3091639
/proc/sys/kernel/pid_max to 4194303

Could you try all this and see if it helps? AFAIC, this has just happened once since upgrading so far, am I right?

Created attachment 10857 [details]
smaps
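A sketch of how the three increases suggested above could be applied, using the values from that comment. The sysctl.d file name is an example, the wildcard domain in limits.conf is an assumption (a slurm-user-specific entry works too), and slurmctld must be restarted for the nofile change to take effect.

# Raise the kernel-wide thread and pid limits at runtime
sysctl -w kernel.threads-max=3091639
sysctl -w kernel.pid_max=4194303

# Persist them across reboots (file name is an example)
cat > /etc/sysctl.d/90-slurmctld-limits.conf <<'EOF'
kernel.threads-max = 3091639
kernel.pid_max = 4194303
EOF

# Raise the open-files limit for login sessions; note that when slurmctld is
# started by systemd, the unit's LimitNOFILE= takes precedence over limits.conf
cat >> /etc/security/limits.conf <<'EOF'
*    soft    nofile    1048576
*    hard    nofile    1048576
EOF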
(In reply to Alejandro Sanchez from comment #14)
> The fact it did work for your site while you were running 18.08 isn't
> conclusive.

That's true, I just wanted to point out that the environment for both versions was the same, and with 18.08 we never got errors like that (we never hit the limits).

> but it wouldn't hurt attaching the output of:
>
> $ cat /proc/$(pgrep slurmctld)/smaps

Attached it.

> and monitoring if slurmctld memory grows abnormally.

Didn't notice anything suspicious for a couple of hours... it stays around 22.5G; maybe we should look at it for a bit longer.

> In the other bug they also tried increasing:
>
> /etc/security/limits.conf nofile to 1048576
> /proc/sys/kernel/threads-max to 3091639
> /proc/sys/kernel/pid_max to 4194303
>
> Could you try all this and see if it helps?

We can try to increase some limits, yes.

> AFAIC, this has just happened once since upgrading so far, am I right?

That's the problem, it happens periodically, every 2-5 days so far.

(In reply to Azat from comment #16)
> > and monitoring if slurmctld memory grows abnormally.
> Didn't notice anything suspicious for a couple of hours... it stays around
> 22.5G, maybe should look at it for a bit longer.

Just to clarify, this is the virtual memory... the resident set size doesn't change much and is ~2G.

Ok. Let's do the limits change and monitor the memory growth for a few days.

Hi, how are things going so far?

(In reply to Alejandro Sanchez from comment #19)
> Hi,
>
> how are things going so far?

Dear Alex,

I have now increased the limits and we will see how the daemon behaves. I also looked into our monitoring system, and the measurements of memory consumption (~2.5 GB) and process count (~2000) are in the normal range when the daemon crashes... so it is probably not because of that, but something else. We also had another crash a few days ago, so the crashes are consistent.

Have you noticed anything relevant on your monitoring?

When Slurm added[1] support for AIX systems back in ~2004, code was put in place to explicitly set the stack size attribute for all pthreads to 1024*1024 bytes (1M). AIX is no longer a supported platform and I'm wondering if this is a factor contributing to the issue reported here. Maybe reducing the stack size attribute for all pthreads can help:

diff --git a/src/common/macros.h b/src/common/macros.h
index aa8b48040e..d35ed10fb5 100644
--- a/src/common/macros.h
+++ b/src/common/macros.h
@@ -287,7 +287,7 @@
 		errno = err;					\
 		error("pthread_attr_setscope: %m");		\
 	}							\
-	err = pthread_attr_setstacksize(attr, 1024*1024);	\
+	err = pthread_attr_setstacksize(attr, 256*1024);	\
 	if (err) {						\
 		errno = err;					\
 		error("pthread_attr_setstacksize: %m");		\

This is kind of a blind guess, since I've not been able to get slurmctld to fatal due to EAGAIN when creating threads. Note also that changing the macro will impact thread creation in all Slurm components and daemons, not just slurmctld.

On the other hand, you reported the stack size limit as per ulimit:

stack size (kbytes, -s) unlimited

but there's also the corresponding systemd limit, LimitSTACK. Can you show me the output of

$ systemctl show slurmctld.service | grep LimitSTACK

and

$ grep -A 1 stack /proc/$(pgrep slurmctld)/smaps

[1] https://github.com/SchedMD/slurm/commit/55e62ab46

(In reply to Alejandro Sanchez from comment #22)

Hi Alex,

Unfortunately, increasing the limits didn't help. We have encountered 2 crashes since I changed the limits, which means the crashes occur more or less with the same frequency.
> When Slurm added[1] support for AIX systems back in ~2004, code was put in
> place to explicitly set the stack size attribute for all pthreads to
> 1024*1024 bytes (1M).

I would like to emphasize again that the environment where Slurm is running didn't change at all; we only updated the Slurm version. Whatever is causing the crashes now most likely was not causing crashes in the previous version.

> Can you show me the output of
>
> $ systemctl show slurmctld.service | grep LimitSTACK
>
> and
>
> $ grep -A 1 stack /proc/$(pgrep slurmctld)/smaps

Sure:

> gwdu105:8 14:15:39 ~ # systemctl show slurmctld.service | grep LimitSTACK
> LimitSTACK=18446744073709551615
> gwdu105:8 14:16:56 ~ # grep -A 1 stack /proc/$(pgrep slurmctld)/smaps
> 7fff61428000-7fff61528000 rw-p 00000000 00:00 0 [stack:371201]
> Size: 1024 kB
> --
> 7fff61d31000-7fff61e31000 rw-p 00000000 00:00 0 [stack:370716]
> Size: 1024 kB
> --
> 7ffff0b2b000-7ffff0c2b000 rw-p 00000000 00:00 0 [stack:349656]
> Size: 1024 kB
> --
> 7ffff0d2d000-7ffff0e2d000 rw-p 00000000 00:00 0 [stack:349654]
> Size: 1024 kB
> --
> 7ffff0e2e000-7ffff0f2e000 rw-p 00000000 00:00 0 [stack:349653]
> Size: 1024 kB
> --
> 7ffff0f2f000-7ffff102f000 rw-p 00000000 00:00 0 [stack:349652]
> Size: 1024 kB
> --
> 7ffff1434000-7ffff1534000 rw-p 00000000 00:00 0 [stack:349651]
> Size: 1024 kB
> --
> 7ffff1740000-7ffff1840000 rw-p 00000000 00:00 0 [stack:349526]
> Size: 1024 kB
> --
> 7ffff1841000-7ffff1941000 rw-p 00000000 00:00 0 [stack:349525]
> Size: 1024 kB
> --
> 7ffff1942000-7ffff1a42000 rw-p 00000000 00:00 0 [stack:349497]
> Size: 1024 kB
> --
> 7ffff24fa000-7ffff25fa000 rw-p 00000000 00:00 0 [stack:349303]
> Size: 1024 kB
> --
> 7ffff3416000-7ffff3516000 rw-p 00000000 00:00 0 [stack:329563]
> Size: 1024 kB
> --
> 7ffff3726000-7ffff3826000 rw-p 00000000 00:00 0 [stack:329562]
> Size: 1024 kB
> --
> 7ffff7edd000-7ffff7fdd000 rw-p 00000000 00:00 0 [stack:329560]
> Size: 1024 kB
> --
> 7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
> Size: 132 kB

Hi Azat,

Alex is out until the end of next week. I will be looking at your bug in the meantime.

While I am reading through your bug, I am missing the full output of:

cat /proc/27435/limits

Can you show me the full output?

Ninety percent of the time this error is due to some limit in the system. How many open files do you have in the system right now? And how many from the Slurm user?

You say your environment hasn't changed, but does that mean you haven't upgraded the kernel or any other package?

Another idea that we could try in the meantime, if the issue bothers you a lot, is to run slurmctld outside systemd, just to rule things out. You just need to stop slurmctld and start it from the console as SlurmUser.

I'll continue looking into your issue.

(In reply to Felip Moll from comment #24)

Hi Felip,

> I will be looking at your bug in the meantime.

Thank you!

> While I am reading through your bug, I miss the full output of:
> cat /proc/27435/limits
> Can you show me the full output?
Sure:

> gwdu105:10 14:15:29 ~ # cat /proc/$(pgrep slurmctld)/limits
> Limit                     Soft Limit           Hard Limit           Units
> Max cpu time              unlimited            unlimited            seconds
> Max file size             unlimited            unlimited            bytes
> Max data size             unlimited            unlimited            bytes
> Max stack size            unlimited            unlimited            bytes
> Max core file size        unlimited            unlimited            bytes
> Max resident set          unlimited            unlimited            bytes
> Max processes             385661               385661               processes
> Max open files            65536                65536                files
> Max locked memory         65536                65536                bytes
> Max address space         unlimited            unlimited            bytes
> Max file locks            unlimited            unlimited            locks
> Max pending signals       385661               385661               signals
> Max msgqueue size         819200               819200               bytes
> Max nice priority         0                    0
> Max realtime priority     0                    0
> Max realtime timeout      unlimited            unlimited            us

> This error happens 90% due to some limits in the system. How many open files
> do you have in the system right now? And how many from the Slurm user?

Here are the lsof line counts:

> gwdu105:10 14:14:54 ~ # lsof | wc -l
> 77380
> gwdu105:10 14:15:12 ~ # lsof | grep slurm | wc -l
> 5891
> gwdu105:10 14:27:01 ~ # lsof -p $(pgrep slurmctld) | wc -l
> 68

> You say your environment haven't changed, but, does it mean you haven't
> upgraded the kernel, or any other package?

Yes, it means we haven't updated the kernel, haven't installed/updated/removed any other packages on the system, and haven't changed any system configuration either. Since those are head nodes, we do that very selectively and on rare occasions.

> Another idea that we could try in the meantime and if it bothers a lot, is
> to run slurmctld outside systemd, just to discard things. You just need to
> stop slurmctld and start it from the console as SlurmUser.

The issue actually bothers us a lot, since the slurmctld daemons crash periodically and we have to restart them manually. We will try to start them without systemd the next time it crashes.

Is there a way to get more verbose output from the slurmctld daemon about why it failed with the pthread_create error? Frankly, it doesn't look like a limits issue. Or which limits could cause the error, given that the number of open files, processes, and CPU/memory usage all seem to be fine?

> I'll continue looking into your issue.

Thanks!

> Is there a way to get more verbose output from the slurmctld daemon why it
> failed with pthread_create error?

The error you're seeing is generated from macros.h:

#define slurm_thread_create(id, func, arg)				\
do {									\
	pthread_attr_t attr;						\
	int err;							\
	slurm_attr_init(&attr);						\
	err = pthread_create(id, &attr, func, arg);			\
	if (err) {							\
		errno = err;						\
		fatal("%s: pthread_create error %m", __func__);		\
	}								\
	slurm_attr_destroy(&attr);					\
} while (0)

As you can see, the function that generates the error is pthread_create(), which is external to Slurm. This means that the error doesn't come directly from Slurm but from the system, ** which means that there's some system configuration that prevents the creation of new threads, or some resource which is exhausted. **

To create a thread we need to be able to open new files. Moreover, each thread needs space in the stack to allocate its data structures, so open files and stack size are key. Then there are possible limits on the maximum number of tasks a pid can create; in newer systemds this is controlled by TasksMax, but in yours it already seems fine (per the last command you showed). The kernel can also globally limit the number of open files, maximum processes, or maximum threads in the system, but we don't seem to be hitting any of these limits either.

-- All this info is what we already know. --

We are missing something here; how often does this error reproduce?
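To make the enumeration above concrete, here is a minimal sketch of a one-shot snapshot of the knobs that usually gate pthread_create() for a running slurmctld; command choices mirror what is used elsewhere in this ticket, and the exact selection of keys is an assumption.

# One-shot snapshot of the limits relevant to thread creation for slurmctld
pid=$(pgrep -x slurmctld)

# Per-process limits (open files, processes, stack)
grep -E 'Max (open files|processes|stack size)' /proc/$pid/limits

# System-wide ceilings
cat /proc/sys/kernel/threads-max /proc/sys/kernel/pid_max /proc/sys/fs/file-max

# Current thread count and number of memory mappings of slurmctld
ps --no-headers -L -p "$pid" | wc -l
wc -l < /proc/$pid/maps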
It would be interesting to capture the number of open files and threads in the system, held by slurmctld and by the slurm user, at the exact moment the error is happening.

What is your exact kernel version? It's unlikely, but a bug existed in the kernel (https://bugzilla.kernel.org/show_bug.cgi?id=154011) which sporadically created this issue.

I'd like to know why this mismatch:

> gwdu105:10 14:15:12 ~ # lsof | grep slurm | wc -l
> 5891
> gwdu105:10 14:27:01 ~ # lsof -p $(pgrep slurmctld) | wc -l
> 68

What are these 5891-68 files opened?

I also would like to see your system logs at the exact time the error happened, to see if any other component is complaining.

There's also kernel.pid_max, which can affect the creation of new threads. What's your value (sysctl kernel.pid_max)? In fact, seeing a 'sysctl -a' would be useful.

You mentioned systemd 219-42, but what's your exact version (systemctl --version)? RHEL (and therefore SL) have backported the fix into 219-4: https://access.redhat.com/errata/RHBA-2017:2297

Finally, I'd like you to check:

]$ cat /proc/$(pgrep slurmctld)/cgroup
....
5:pids:/user.slice/user-1000.slice/session-2.scope
....

Then find the pids.max in the subdirectory, e.g.:

]$ find /sys/fs/cgroup/pids/user.slice/user-1000.slice -name pids.max -exec echo -n '{}: ' \; -exec cat '{}' \;

https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt

If it is too low, feel free to increase the value manually to test whether it crashes again.

If none of these limits applies, then your server must really be having some other problem, and the system logs should give us a hint. Another matter will then be to see why Slurm is trying to create so many threads (if it really does); sdiag may help there.

Sorry to insist on limits and for asking so many questions, but all of these are the main reasons why pthread_create() fails.

Thanks

(In reply to Felip Moll from comment #26)
> The error you're seeing is generated from macros.h:
>
> ...
> err = pthread_create(id, &attr, func, arg);
> if (err) {
> 	errno = err;
> 	fatal("%s: pthread_create error %m", __func__);
> ...
>
> As you can see, the function that generates the error is pthread_create()
> which is external to Slurm. This means that the error doesn't come directly
> from Slurm but from the system, ** which means that there's some system
> configuration that avoids the creation of new threads or some resource which
> is exhausted. **

That makes sense.

> We are missing something here, how often do this error reproduce? It would
> be interesting to just capture the nr of open files, threads in the system
> by the slurmctld and slurm user at the exact moment the error is happening.

We have monitoring running on the node with a 10 s interval. The total number of threads is always in an adequate range, 800-2000. Can you please tell me what command would be enough to track the number of open files of the slurm user? Would "lsof -p $(pgrep slurmctld) | wc -l" be sufficient? I can add it to the monitoring.

> What is your exact kernel version? It's unlikely but a bug existed in the
> kernel https://bugzilla.kernel.org/show_bug.cgi?id=154011 which sporadically
> created this issue.

We have an older kernel version:

> gwdu105:10 10:33:01 ~ # uname -r
> 3.10.0-514.26.2.el7.x86_64

> I'd like to know why this mismatch:
> > gwdu105:10 14:15:12 ~ # lsof | grep slurm | wc -l
> > 5891
> > gwdu105:10 14:27:01 ~ # lsof -p $(pgrep slurmctld) | wc -l
> > 68
>
> What are these 5891-68 files opened?
It is because "lsof | grep slurm" returns every fd once per thread, whereas the second command lists each fd only once, since the fds are shared with the main process (roughly, that means there are about 5891/68 threads across the slurm processes; a few other fds belong to other processes and merely contain "slurm").

> I also would like to see your system logs at the exact time when the error
> happened to see if any other component is complaining.

Yes, but there is nothing preceding the failure:

> Aug 02 01:30:22 gwdu105 conmand[5375]: Console [gwdu103] connected to <10.109.49.203>
> Aug 02 01:30:33 gwdu105 CROND[739662]: pam_ldap(crond:session): error opening connection to nslcd: No such file or directory
> Aug 02 01:31:31 gwdu105 slurmctld[329559]: fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
> Aug 02 01:31:32 gwdu105 systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE

The crash happened at 01:31:31. The messages above it happen periodically and have nothing to do with slurmctld.

> There's even the kernel.pid_max which can affect the creation of new
> threads. What's your value (sysctl kernel.pid_max)? In fact seeing a 'sysctl
> -a' would be useful.

I will attach the output of it.

> You mentioned systemd 219-42, but, what's your exact version (systemctl
> --version)?
> RHEL (and therefore SL) have backported the fix into 219-4:
> https://access.redhat.com/errata/RHBA-2017:2297

It seems we have an older version:

> gwdu105:10 11:16:04 ~ # rpm -q systemd
> systemd-219-30.el7_3.3.x86_64

> Finally, I'd like you to check:
>
> ]$ cat /proc/$(pgrep slurmctld)/cgroup

We don't have any cgroups configured for slurmctld:

> gwdu105:10 11:32:14 ~ # cat /proc/$(pgrep slurmctld)/cgroup
> 11:pids:/
> 10:perf_event:/
> 9:cpuset:/
> 8:freezer:/
> 7:memory:/
> 6:hugetlb:/
> 5:devices:/
> 4:cpuacct,cpu:/
> 3:blkio:/
> 2:net_prio,net_cls:/
> 1:name=systemd:/system.slice/slurmctld.service

> If none of this limits applies then it must be that your server is really
> having some problem, then system logs should give us some hint.
> Another matter will be then to see why Slurm is trying to open many threads
> (if it really does), an sdiag then may help.

It doesn't seem that Slurm needs many threads to run, and the limits are high enough on our system.

> Sorry to insist on limits and for making so many questions, but all of these
> are the main reasons why pthread_create() fails.

Not a problem, this crash error still confuses me and I hope we can figure out its root cause :)

Created attachment 11100 [details]
sysctl -a output
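When scanning a full `sysctl -a` dump like the one attached, the handful of keys that can matter for thread creation could be pulled out with something along these lines. The file name is a placeholder, and including vm.max_map_count here is a judgement call: each new thread stack adds a memory mapping, so a process can also run out of mappings rather than pids or files.

# Extract the thread/pid/file/mapping related knobs from a saved `sysctl -a` dump
grep -E '^(kernel\.(threads-max|pid_max)|fs\.(file-max|nr_open)|vm\.max_map_count)' sysctl-a-output.txt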
Thank you. It seems nothing is wrong in your setup... I will think about other options to investigate. If you see anything strange on the server, it would be valuable to comment on it here. Remember that if it crashes again you can try to run it manually from the command line instead of from systemd, which would remove one component from the diagnosis.

> We have a monitoring running on the node with 10s interval. The total number of threads is always in the adequate range 800-2000.
> Can you please tell me what command would be enough to track the number of open files of the slurm user?
> "lsof -p $(pgrep slurmctld) | wc -l" will it be sufficient? I can add it to the monitoring.

I think showing the number of open files held by the slurmctld process is what we want, so your command should suffice.

P.S.: As you've seen, I made two typos/mistakes in that comment:

> > You mentioned systemd 219-42, but, what's your exact version (systemctl
> > --version)? RHEL (and therefore SL) have backported the fix into 219-4:

Should've been: 'you mentioned systemd 219', and 'backported the fix into 219-42'. You have 219-30, so you should not have TasksMax in systemd yet, which is 'good'.

Created attachment 11121 [details]
bug7360_diagnostics_v0.patch

Hi Azat,

I am attaching a small patch here that replaces the 'fatal' with a 'fatal_abort' in slurm_thread_create. With this patch I aim to get a bit more information about which threads were created and existed in the slurmctld instance when it fails again. Are you able to apply this patch and recompile Slurm? If so, when it fails again you'll get a coredump; open it with gdb, run the following commands, and attach the output here:

info threads
thread apply all bt full

Thanks

Created attachment 11122 [details]
bug7360_diagnostics_v1.patch

This patch version also covers failures during the creation of detached threads. Please apply this one if possible.

Created attachment 11138 [details]
slurmctld #fd monitoring

(In reply to Felip Moll from comment #29)
> Thank you.
> It seems nothing is wrong in all your setup.. I will think about other
> options to investigate. If you see any strange thing in the server it could
> be valuable to comment it here.

Thank you. We didn't notice anything suspicious about our servers recently.

> Remember that if it crashes again you can try to run it manually from
> command line instead of from systemd which would remove one component in the
> diagnose.

Yes, it crashed again and we started it without systemd... let's see what happens.

> > We have a monitoring running on the node with 10s interval. The total number of threads is always in the adequate range 800-2000.
> > Can you please tell me what command would be enough to track the number of open files of the slurm user?
> > "lsof -p $(pgrep slurmctld) | wc -l" will it be sufficient? I can add it to the monitoring.
>
> I think showing the nr of open files by slurmctld process is what we want,
> so your command should suffice.

We monitored the number of files for a couple of days before it crashed; here are some results (see the attachment):

- the average #fds is 70
- sometimes it goes up to almost 400
- before the crash, #fds is ~400 too

You can see in the attachment the monitoring data (every 10 sec) before the crash, which happened around 20:50.

> Are you able to apply this patch and recompile slurm?

Since we are currently trying to run slurmctld without systemd, we postponed patching Slurm. If it is not because of systemd, we will apply the patch and get a coredump.
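When that coredump does appear, the gdb session suggested with the diagnostics patch would look roughly like this; the core file path is a placeholder, and the binary path is taken from the systemctl output earlier in the ticket.

# Load the core file against the matching slurmctld binary (paths are examples)
gdb /opt/slurm/sbin/slurmctld /var/spool/slurmctld/core.329559

# Then, at the (gdb) prompt, collect the requested information and attach it:
#   (gdb) info threads
#   (gdb) thread apply all bt full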
> you can see in the attachment the monitoring data (every 10 sec) before the
> crash that happened around 20:50.

I find it interesting that before the crash the thread count was simply increasing.

> > Are you able to apply this patch and recompile slurm?

We have other customers with this issue, and the info from one of them seems interesting. It would be nice, if it happens again, to get the backtrace from you too to confirm some suspicions.

> Since we are trying currently to run slurmctld without systemd, we postponed
> patching Slurm. If it is not because of systemd, we will apply the patch and
> get a coredump.

Keep me posted if it fails again; I'd like to know about it as soon as it happens.

Two more questions:
1. I need your recent slurm.conf
2. I need a:

free -m

on your slurmctld host.

By the way, Alex is available again, so he may be responding to this bug from now on too.

Created attachment 11189 [details]
slurmctld #fd and #threads monitoring

(In reply to Felip Moll from comment #34)
> I see this intersting that before the crash it was just increasing threads.

Yes, you are right. I have attached another graph with the total number of threads in the system included on the right axis. As you can see, when the number of open files increases, the number of threads increases as well. (Ignore the high peaks in the number of threads; they are caused by other software running on the host and are completely normal.)

> We've other customers with this issue, and the info of one of them seems to
> be interesting. It would be nice, if it happens again, to get the backtrace
> from you too to confirm some suspicions.

Ok, we will apply the patch; we are still waiting for the next crash after having started the daemons without systemd.

> Two more questions:
> 1. I need your recent slurm.conf

You can find it in #7423, or I can attach it here too if needed.

> 2. I need a:
>
> free -m
>
> in your slurmctld host.

Here is the output:

> gwdu105:0 10:44:41 ~ # free -m
>               total        used        free      shared  buff/cache   available
> Mem:          96500       12758       10474        3304       73267       79041
> Swap:         15624        2589       13035

Hi. Any updates after starting ctld without systemd? Thanks.

(In reply to Alejandro Sanchez from comment #36)
> Hi. Any updates after starting ctld without systemd? thanks.

Hi Alex!

Yes, the daemon crashed without systemd too. Today we patched the sources and recompiled slurmctld. The patched version is now running without systemd, and after the next crash I will provide the output of gdb on the coredump.

Ok, thanks for your feedback. Could you show the output of:

$ cat /proc/$(pidof slurmctld)/limits

And, if possible, execute a script over time that monitors these metrics and appends each execution's output to a file?

#!/bin/bash
echo ctld_thread_count=$(ps --no-headers -p $(pidof slurmctld) -L | wc -l)
echo ctld_opened_files=$(lsof -p $(pidof slurmctld) | wc -l)
echo all_opened_files=$(lsof | wc -l)
echo ctld_maps=$(cat /proc/$(pidof slurmctld)/maps | wc -l)
echo file-nr=$(cat /proc/sys/fs/file-nr)
echo inode-nr=$(cat /proc/sys/fs/inode-nr)
echo nr_open=$(cat /proc/sys/fs/nr_open)
echo "== status =="
cat /proc/$(pidof slurmctld)/status
echo "== ipcs =="
ipcs -p $(pidof slurmctld)
echo "== free =="
free -m
echo "== vsz =="
ps aux --sort=-vsz | head -20
echo "== rsz =="
ps aux --sort=-rss | head -20

(In reply to Alejandro Sanchez from comment #39)
> Ok, thanks for your feedback.
> Could you show the output of:
>
> $ cat /proc/$(pidof slurmctld)/limits

Sure:

> gwdu105:0 13:55:30 ~ # cat /proc/$(pidof slurmctld)/limits
> Limit                     Soft Limit           Hard Limit           Units
> Max cpu time              unlimited            unlimited            seconds
> Max file size             unlimited            unlimited            bytes
> Max data size             unlimited            unlimited            bytes
> Max stack size            unlimited            unlimited            bytes
> Max core file size        unlimited            unlimited            bytes
> Max resident set          unlimited            unlimited            bytes
> Max processes             385661               385661               processes
> Max open files            265536               265536               files
> Max locked memory         unlimited            unlimited            bytes
> Max address space         unlimited            unlimited            bytes
> Max file locks            unlimited            unlimited            locks
> Max pending signals       385661               385661               signals
> Max msgqueue size         819200               819200               bytes
> Max nice priority         0                    0
> Max realtime priority     0                    0
> Max realtime timeout      unlimited            unlimited            us

> And if possible execute a script over time that will monitor these metrics
> and append executions output to a file?

Ok, we are now collecting the output of the script every 30 seconds. Once slurmctld crashes, I will attach the concatenated output.

Hi,

We've been able to reproduce this issue. As a temporary workaround, we suggest disabling the SlurmctldProlog and the EpilogSlurmctld to prevent more fatals from happening. In the meantime, we're studying a fix for this. We suspect a change in the 19.05 logic where a thread changed from detached to joinable and was never joined, leaking anonymous pages that would eventually leave no resources for subsequent pthread_create calls to succeed. But for now, we still need to verify this is the actual root cause of the problem. Thanks.

(In reply to Alejandro Sanchez from comment #41)
> Hi,
>
> We've been able to reproduce this issue. As a temporal workaround, we
> suggest disabling the SlurmctldProlog and the EpilogSlurmctld

The correct option name is PrologSlurmctld ...

Hi, this has been fixed in the following commit, which will be available in 19.05.3:

https://github.com/SchedMD/slurm/commit/a04eea2e03af418

I'd suggest upgrading to 19.05.2, which includes this other fix:

https://github.com/SchedMD/slurm/commit/d1863b963cb1bac

and applying a04eea2e03af418 on top until .3 is released. With that applied, you should be able to re-enable any [Prolog|Epilog]Slurmctld, and the created threads should be properly cleaned up and stop leaking anonymous pages. Please let us know how it goes so we can close this, or ask any further questions otherwise. Thanks for your collaboration.

Created attachment 11296 [details]
slurmctld debug
Hi Alex,
Thank you for the info and patches. We have now compiled Slurm 19.05.2 and applied the a04eea2e03af418 patch. We will let you know if it crashes; if not, the ticket can be closed after two weeks.
Meanwhile, the previous Slurm version with the core-dump patch crashed; the output from the script you provided (I have added dates to make navigation easier) and the gdb output are attached here. The crash happened on 20.08.2019 at 03:32:20 +/- 10 seconds.
PS: Please mark the attachment as private, since it contains the output of the ps tool, which might show usernames if we have run commands with usernames on the node.
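For anyone following the same upgrade path, applying the single referenced commit on top of an unpacked 19.05.2 source tree might look roughly like this; the directory layout and rebuild steps are assumptions, so adapt them to your packaging.

# Fetch the referenced commit from GitHub as a patch and apply it to the tree
cd slurm-19.05.2
curl -L https://github.com/SchedMD/slurm/commit/a04eea2e03af418.patch | patch -p1

# Rebuild and reinstall slurmctld afterwards, e.g.:
#   ./configure --prefix=/opt/slurm && make -j && make install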
Hi, ok, thanks for the feedback. I think there's no need to apply the fatal_abort patch nor to keep running the monitoring script anymore. I've marked the attachment as private (you can also do so from your side for future attachments if needed, just FYI). Let's wait a couple of weeks and see if the ctld gets stable with the fix. I'm also lowering the severity for now. Please let us know if anything else comes up. Thanks.

Hi. It's been a couple of weeks already with the patch applied. Have you seen any more problems, or can we close this out? Thanks.

(In reply to Alejandro Sanchez from comment #59)
> Hi. It's been a couple of weeks already with the patch applied. Have you
> seen any more problems or can we close this out? thanks.

Hi Alex,

Since we applied the patch, everything has been working smoothly. We can close this as resolved. Thank you once again for the help!

All right, thanks for your feedback and cooperation.

*** Ticket 7705 has been marked as a duplicate of this ticket. ***