Starting the same day we upgraded to Slurm v19.05.0, we have been experiencing frequent slurmctld crashes with the following messages. Each block is a separate crash instance:

>[2019-07-26T20:02:06.013] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

>[2019-07-29T08:48:47.316] fatal: _agent_retry: pthread_create error Resource temporarily unavailable

>[2019-08-02T12:01:26.596] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

>[2019-08-05T11:21:32.822] fatal: agent: pthread_create error Resource temporarily unavailable
>[2019-08-05T11:21:32.822] fatal: agent: pthread_create error Resource temporarily unavailable

>[2019-08-06T20:22:52.614] fatal: _agent_retry: pthread_create error Resource temporarily unavailable

>[2019-08-07T20:36:35.209] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

>[2019-08-07T20:36:43.041] fatal: _agent_retry: pthread_create error Resource temporarily unavailable

Following the similar open bug reports 7532 and 7360, we have tried adjusting the max-processes and open-files limits, but increasing them doesn't help. Even with the values below, slurmctld still crashed:

[root@slurm2 ~]# prlimit -p $(pgrep slurmctld)
RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
AS         address space limit                unlimited unlimited bytes
CORE       max core file size                 unlimited unlimited blocks
CPU        CPU time                           unlimited unlimited seconds
DATA       max data size                      unlimited unlimited bytes
FSIZE      max file size                      unlimited unlimited blocks
LOCKS      max number of file locks held      unlimited unlimited
MEMLOCK    max locked-in-memory address space     65536     65536 bytes
MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
NICE       max nice prio allowed to raise             0         0
NOFILE     max number of open files              262144    262144
NPROC      max number of processes               127764    127764
RSS        max resident set size              unlimited unlimited pages
RTPRIO     max real-time priority                     0         0
RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
SIGPENDING max number of pending signals         127764    127764
STACK      max stack size                     unlimited unlimited bytes

We're not starting slurmctld via systemd; it runs on Red Hat EL7 (kernel 3.10.0-862.3.2.el7.x86_64). Nothing else changed between our previous 18.08 environment and 19.05.

At this point we haven't seen a crash since Wednesday, August 7, but I think that is partly because we had been restarting slurmctld fairly frequently since then and over the weekend. It was also restarted today, but we don't intend to restart it again, so we can see whether it crashes.

Looking at system load, processes, threads, etc. in Grafana, there is no gradual increase in anything. It looks as though some spike of activity or flood arrives that slurmctld simply can't handle, and it crashes. We do see some spikes, but for the most part slurmctld stays up. It's possible we're not seeing exactly what happens right before a crash in Grafana because of the monitoring interval: if it happens within 30 seconds, we may not capture it (we've appended a rough finer-grained sampler at the bottom of this message).

Ultimately, we'd like to get a fix for this apparent problem, or a quick rundown of what it would take to go back to Slurm 18.08 if that is the preferred route here. We're following the other reports of this issue and would like to help resolve it by providing additional information.

Thanks,
Eric
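Since pthread_create failing with "Resource temporarily unavailable" (EAGAIN) generally means a thread/process or memory limit was hit at that instant, here is a rough one-second sampler we could run on the controller to catch a spike that the 30-second Grafana interval would miss. This is only a sketch; the log path and the single-slurmctld-process assumption are ours, not anything from Slurm.

    #!/bin/sh
    # Sketch: sample slurmctld's thread and fd counts once per second so a
    # short-lived spike is visible even if the Grafana scrape misses it.
    # The log path below is arbitrary (our choice), not anything Slurm uses.
    LOG=/var/log/slurmctld_threads.log
    while true; do
        PID=$(pgrep -xo slurmctld)                           # oldest exact match
        if [ -n "$PID" ]; then
            NLWP=$(ps -o nlwp= -p "$PID" | tr -d ' ')        # thread count
            FDS=$(ls "/proc/$PID/fd" 2>/dev/null | wc -l)    # open file descriptors
            echo "$(date '+%F %T') pid=$PID threads=$NLWP fds=$FDS" >> "$LOG"
        else
            echo "$(date '+%F %T') slurmctld not running" >> "$LOG"
        fi
        sleep 1
    done

If it would help, we can leave something like this running under nohup on slurm2 and attach the log from around the next crash, so you can see whether the thread count is climbing toward a limit right before the fatal pthread_create error.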