Ticket 7571

Summary: fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
Product: Slurm Reporter: Eric <etomeo>
Component: slurmctld Assignee: Jacob Jenson <jacob>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: taylor, whowell
Version: 19.05.0   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7532
https://bugs.schedmd.com/show_bug.cgi?id=7360
Site: -Other-
Machine Name: HiPerGator

Description Eric 2019-08-13 14:16:24 MDT
On the same day we upgraded to Slurm 19.05.0, we started experiencing frequent slurmctld crashes with the following messages. Each block is a separate crash instance:
>[2019-07-26T20:02:06.013] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

>[2019-07-29T08:48:47.316] fatal: _agent_retry: pthread_create error Resource temporarily unavailable

>[2019-08-02T12:01:26.596] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

>[2019-08-05T11:21:32.822] fatal: agent: pthread_create error Resource temporarily unavailable
>[2019-08-05T11:21:32.822] fatal: agent: pthread_create error Resource temporarily unavailable

>[2019-08-06T20:22:52.614] fatal: _agent_retry: pthread_create error Resource temporarily unavailable

>[2019-08-07T20:36:35.209] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
>[2019-08-07T20:36:43.041] fatal: _agent_retry: pthread_create error Resource temporarily unavailable
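
For context, pthread_create() failing with "Resource temporarily unavailable" is EAGAIN, which generally means the process hit a thread/process limit or couldn't get the memory/address space for another thread stack. A spot check along these lines right after a crash should show whether slurmctld was anywhere near a per-process limit (a rough sketch; it assumes pgrep finds the single slurmctld PID):

# current thread count of slurmctld vs. its per-process limits
PID=$(pgrep -o slurmctld)
grep Threads /proc/$PID/status
prlimit -p $PID | grep -E 'NPROC|STACK|AS'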

Following the similar open tickets 7532 and 7360, we tried adjusting max procs and open files, but increasing those limits didn't help.
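
For reference, a sketch of bumping those limits on a running slurmctld without systemd, using prlimit with the values shown in the output below (illustrative only, not necessarily exactly how we applied them; it assumes pgrep finds the single slurmctld PID):

PID=$(pgrep -o slurmctld)
# raise soft:hard NPROC and NOFILE on the live process (values from the prlimit output below)
prlimit --pid $PID --nproc=127764:127764 --nofile=262144:262144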

Even with these values, slurmctld still crashed:
[root@slurm2 ~]# prlimit -p $(pgrep slurmctld)
RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
AS         address space limit                unlimited unlimited bytes
CORE       max core file size                 unlimited unlimited blocks
CPU        CPU time                           unlimited unlimited seconds
DATA       max data size                      unlimited unlimited bytes
FSIZE      max file size                      unlimited unlimited blocks
LOCKS      max number of file locks held      unlimited unlimited
MEMLOCK    max locked-in-memory address space     65536     65536 bytes
MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
NICE       max nice prio allowed to raise             0         0
NOFILE     max number of open files              262144    262144
NPROC      max number of processes               127764    127764
RSS        max resident set size              unlimited unlimited pages
RTPRIO     max real-time priority                     0         0
RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
SIGPENDING max number of pending signals         127764    127764
STACK      max stack size                     unlimited unlimited bytes

We're not using systemd; slurmctld is running on Red Hat EL7 (kernel 3.10.0-862.3.2.el7.x86_64).
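
Since there's no systemd TasksMax in the picture, the other ceilings we're aware of that can make pthread_create return EAGAIN are the kernel-wide ones; a quick sketch of checking them (for completeness, not suggesting these are the cause):

# system-wide limits that can also bound thread creation
sysctl kernel.threads-max kernel.pid_max vm.max_map_count
# available memory/swap also matters for new thread stacks
free -m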

We haven't changed anything between our previous 18.08 environment and 19.05. 

At this point we haven't seen a crash since Wednesday, August 7, but I think that's partly because we had been restarting slurmctld fairly frequently since then and over the weekend. It was restarted again today, but we don't intend to restart it any further, so we can see whether it crashes on its own.

Looking at system load, process counts, threads, etc. in Grafana, we don't see a gradual increase in anything leading up to a crash. It looks more like a sudden spike or flood of activity that slurmctld simply can't handle. We do see some spikes, but for the most part slurmctld stays up. It's also possible that Grafana isn't capturing exactly what happens right before a crash because of our monitoring interval: if the spike comes and goes within about 30 seconds, we may miss it.
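
In case it's useful, a rough sketch of a tighter sampling loop we could run alongside Grafana so a sub-30-second spike isn't missed (hypothetical; the one-second interval and log path are just examples):

# hypothetical 1-second sampler for the slurmctld thread count
while true; do
    echo "$(date '+%F %T') $(grep Threads /proc/$(pgrep -o slurmctld)/status)" >> /tmp/slurmctld_threads.log
    sleep 1
done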

Ultimately, we'd like a fix for this problem, or a quick rundown of what it would take to roll back to Slurm 18.08 if that is the preferred route here. We're following the other tickets with the same issue and are happy to provide additional information to help resolve it. Thanks,

Eric