Ticket 12716 - slurmctld crashing due to pthread_create error Resource temporarily unavailable
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: - Unsupported Older Versions
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
Reported: 2021-10-21 09:27 MDT by Johnathan Lee
Modified: 2021-12-22 08:27 MST
Site: ASU


Description Johnathan Lee 2021-10-21 09:27:27 MDT
Hey guys,

I know Lee had previously opened a ticket about this issue and applied a patch that made things pretty stable. We were able to go a month or so without trouble, but as of this week slurmctld has been crashing about every 10 hours.

We are in the process of building a new cluster, and upgrading the current cluster would require too much work at this time. We plan to move this cluster to the latest version after we finish the new cluster build. Are there any steps we could take to get this running a bit more smoothly? I am not seeing any information on the previous tickets Lee had opened, but I can send over some additional details if needed.


[2021-10-21T00:09:55.019] error: fork(): Cannot allocate memory
[2021-10-21T00:09:55.155] error: fork(): Cannot allocate memory
[2021-10-21T00:09:55.296] error: fork(): Cannot allocate memory
[2021-10-21T00:09:55.425] error: fork(): Cannot allocate memory
[2021-10-21T00:09:55.425] fatal: _agent_retry: pthread_create error Resource temporarily unavailable
[2021-10-21T00:09:55.569] error: fork(): Cannot allocate memory
[2021-10-21T00:09:55.701] error: fork(): Cannot allocate memory
[2021-10-21T00:09:55.702] error: slurm_receive_msgs: Zero Bytes were transmitted or received
[2021-10-21T00:09:59.537] error: slurm_receive_msgs: Socket timed out on send/recv operation


Thanks!
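
For context on the log above: on Linux, pthread_create fails with EAGAIN ("Resource temporarily unavailable") when a thread limit or a resource shortage prevents creating another thread, and fork() fails with ENOMEM when the kernel cannot allocate the memory it needs. The following is a minimal Python sketch, not taken from the ticket, for collecting the relevant figures from a running slurmctld. The function names are hypothetical; the sketch only reads the standard /proc/<pid>/limits and /proc/<pid>/status files and does not diagnose the cause by itself.

```python
import re


def parse_limits(limits_text):
    """Map limit name -> (soft, hard) from the text of /proc/<pid>/limits."""
    limits = {}
    for line in limits_text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def thread_count(status_text):
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return None


def report(pid):
    """Print the process limit and current thread count for a live PID (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        limits = parse_limits(f.read())
    with open(f"/proc/{pid}/status") as f:
        threads = thread_count(f.read())
    print("Max processes (soft, hard):", limits.get("Max processes"))
    print("Threads in process:", threads)
```

Calling report() with the slurmctld PID prints both figures. The "Max processes" limit counts all processes and threads owned by the same user, so the two numbers are not directly comparable; system-wide ceilings such as /proc/sys/kernel/threads-max and /proc/sys/kernel/pid_max, and overall memory pressure, are also worth checking.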
Comment 3 Dominik Bartkiewicz 2021-10-22 03:45:24 MDT
Hi

Does slurmctld generate a core file when it crashes?
If yes, could you send me the output of this gdb command:
gdb -ex 't a a bt' -batch <slurmctld path> <corefile>

Do you still use the debugging patch from bug 11510 comment 5?

Could you send me a larger chunk of the slurmctld log?

Dominik
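
The gdb command in comment 3 prints a backtrace for every thread ("t a a bt" is gdb shorthand for "thread apply all bt"), so it needs a core file to exist. As a rough pre-check, and not something stated in the ticket, the sketch below reads the soft core-size limit of a process from the text of /proc/<pid>/limits. The function names are hypothetical, and a non-zero limit is only a necessary condition: where the core file lands also depends on /proc/sys/kernel/core_pattern and on the daemon's working directory being writable.

```python
import re


def core_limit(limits_text):
    """Return the soft 'Max core file size' value from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if fields[0] == "Max core file size" and len(fields) >= 2:
            return fields[1]
    return None


def core_dumps_possible(limits_text):
    """True when the soft core size limit is present and non-zero."""
    soft = core_limit(limits_text)
    return soft is not None and soft != "0"
```

For a live daemon, pass in the contents of /proc/<pid>/limits for the slurmctld PID; a result of False means the process cannot write a core file under its current limits.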
Comment 5 Dominik Bartkiewicz 2021-11-15 09:10:02 MST
Hi

Does this problem still occur? If yes, could you send the data mentioned in comment 3?

Dominik
Comment 6 Dominik Bartkiewicz 2021-12-16 07:43:49 MST
Hi

Any update on this?
If there is no update here, I will close this ticket in a few days as timed out.

Dominik