Ticket 12716

Summary: slurmctld crashing due to pthread_create error Resource temporarily unavailable
Product: Slurm
Reporter: Johnathan Lee <johnathan.lee>
Component: slurmctld
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED TIMEDOUT
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: marshall
Version: - Unsupported Older Versions
Hardware: Linux
OS: Linux
Site: ASU
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---

Description Johnathan Lee 2021-10-21 09:27:27 MDT
Hey guys,

I know Lee had previously opened a ticket about this issue and applied a patch, which made things pretty stable. We were able to go a month or so without trouble, but as of this week slurmctld has been crashing about every 10 hours.

We are in the process of building a new cluster, and upgrading the current cluster would require too much work at this time. We plan to move this cluster to the latest version after we finish the new cluster build. Are there any steps we could take to get this running a bit more smoothly? I am not seeing any information on the previous tickets Lee had opened, but I can send over some additional details if needed.


[2021-10-21T00:09:55.019] error: fork(): Cannot allocate memory
[2021-10-21T00:09:55.155] error: fork(): Cannot allocate memory
[2021-10-21T00:09:55.296] error: fork(): Cannot allocate memory
[2021-10-21T00:09:55.425] error: fork(): Cannot allocate memory
[2021-10-21T00:09:55.425] fatal: _agent_retry: pthread_create error Resource temporarily unavailable
[2021-10-21T00:09:55.569] error: fork(): Cannot allocate memory
[2021-10-21T00:09:55.701] error: fork(): Cannot allocate memory
[2021-10-21T00:09:55.702] error: slurm_receive_msgs: Zero Bytes were transmitted or received
[2021-10-21T00:09:59.537] error: slurm_receive_msgs: Socket timed out on send/recv operation


Thanks!
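
Note: nothing in this ticket confirms a root cause. As general background, pthread_create() returning "Resource temporarily unavailable" (EAGAIN) and fork() returning "Cannot allocate memory" (ENOMEM) can both occur when a process or thread limit, or kernel memory, is exhausted. The following is a minimal Python sketch, under that assumption, for collecting the relevant Linux limits on the controller host. The helper names (read_int, thread_count, limits_report) are hypothetical and not part of Slurm.

```python
import resource
from pathlib import Path
from typing import Optional


def read_int(path: str) -> Optional[int]:
    """Return the integer stored in a /proc file, or None if it is unreadable."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def thread_count(status_text: str) -> Optional[int]:
    """Extract the 'Threads:' field from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return None


def limits_report() -> dict:
    """Collect limits that can cap thread and process creation on Linux."""
    # RLIMIT_NPROC here is the limit of the *calling* process, not slurmctld.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "rlimit_nproc_soft": soft,
        "rlimit_nproc_hard": hard,
        "threads_max": read_int("/proc/sys/kernel/threads-max"),
        "pid_max": read_int("/proc/sys/kernel/pid_max"),
    }


if __name__ == "__main__":
    for name, value in limits_report().items():
        print(f"{name}: {value}")
```

Because getrlimit() reports the limits of the process that calls it, the limits that actually apply to the running daemon are better read from /proc/<pid>/limits, and its current thread count from thread_count(Path(f"/proc/{pid}/status").read_text()), where pid is the slurmctld process ID.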
Comment 3 Dominik Bartkiewicz 2021-10-22 03:45:24 MDT
Hi

Does slurmctld generate a core file when it crashes?
If yes, could you send me the output of this gdb command:
gdb -ex 't a a bt' -batch <slurmctld path> <corefile>

Do you still use the debugging patch from bug 11510 comment 5?

Could you send me a bigger chunk of the slurmctld log?

Dominik
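
Note: when gathering the larger slurmctld log chunk requested above, a per-message tally can help show which errors dominate before the fatal line. This is a rough Python sketch, assuming the "[timestamp] level: message" layout seen in the excerpt in the description; the names LOG_LINE, SAMPLE, and tally are hypothetical, and the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Matches lines such as "[2021-10-21T00:09:55.019] error: fork(): Cannot allocate memory".
LOG_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>\w+):\s+(?P<msg>.*)$")

# Lines copied from the log excerpt in the ticket description.
SAMPLE = [
    "[2021-10-21T00:09:55.019] error: fork(): Cannot allocate memory",
    "[2021-10-21T00:09:55.155] error: fork(): Cannot allocate memory",
    "[2021-10-21T00:09:55.296] error: fork(): Cannot allocate memory",
    "[2021-10-21T00:09:55.425] error: fork(): Cannot allocate memory",
    "[2021-10-21T00:09:55.425] fatal: _agent_retry: pthread_create error Resource temporarily unavailable",
    "[2021-10-21T00:09:55.569] error: fork(): Cannot allocate memory",
    "[2021-10-21T00:09:55.701] error: fork(): Cannot allocate memory",
    "[2021-10-21T00:09:55.702] error: slurm_receive_msgs: Zero Bytes were transmitted or received",
    "[2021-10-21T00:09:59.537] error: slurm_receive_msgs: Socket timed out on send/recv operation",
]


def tally(lines):
    """Count occurrences of each (level, message) pair; unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[(match.group("level"), match.group("msg"))] += 1
    return counts


if __name__ == "__main__":
    for (level, msg), n in tally(SAMPLE).most_common():
        print(f"{n:4d}  {level}: {msg}")
```

To run it against a real log, pass an open file object instead of SAMPLE, for example tally(open(path)), where path is wherever slurmctld writes its log on the site in question.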
Comment 5 Dominik Bartkiewicz 2021-11-15 09:10:02 MST
Hi

Does this problem still occur? If yes, could you send the data mentioned in comment 3?

Dominik
Comment 6 Dominik Bartkiewicz 2021-12-16 07:43:49 MST
Hi

Any update on this?
If there is no update here, I will close this in a few days as timed out.

Dominik