Ticket 11510 - Slurmctld crashed
Summary: Slurmctld crashed
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: - Unsupported Older Versions
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-05-03 13:25 MDT by Lee Reynolds
Modified: 2021-08-09 02:19 MDT

See Also:
Site: ASU
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Patch supplied to us to fix thread issue (1.99 KB, patch)
2021-06-04 12:28 MDT, Lee Reynolds
core file creation patch (727 bytes, patch)
2021-06-07 03:56 MDT, Dominik Bartkiewicz

Description Lee Reynolds 2021-05-03 13:25:14 MDT
Our director has asked me to submit a report for this.  We're running 19.05.8, and slurmctld crashed last week with the following messages in its log:

[2021-04-27T15:56:59.412] email msg to amtarave@asu.edu: Slurm Job_id=9119877 Name=snakejob.fastqc_analysis.438.sh Began, Queued time 00:03:20
[2021-04-27T15:56:59.412] backfill: Started JobId=9119877 in serial on cg2-9    
[2021-04-27T15:56:59.416] email msg to amtarave@asu.edu: Slurm Job_id=9119879 Name=snakejob.fastqc_analysis.152.sh Began, Queued time 00:03:19
[2021-04-27T15:56:59.416] backfill: Started JobId=9119879 in serial on cg12-11  
[2021-04-27T15:56:59.419] email msg to amtarave@asu.edu: Slurm Job_id=9119892 Name=snakejob.fastqc_analysis.31.sh Began, Queued time 00:02:41
[2021-04-27T15:56:59.419] backfill: Started JobId=9119892 in serial on cg12-11  
[2021-04-27T15:56:59.422] email msg to amtarave@asu.edu: Slurm Job_id=9119895 Name=snakejob.fastqc_analysis.160.sh Began, Queued time 00:02:22
[2021-04-27T15:56:59.422] backfill: Started JobId=9119895 in serial on cg3-2    
[2021-04-27T15:56:59.424] email msg to amtarave@asu.edu: Slurm Job_id=9119961 Name=snakejob.fastqc_analysis.34.sh Began, Queued time 00:00:10
[2021-04-27T15:56:59.425] backfill: Started JobId=9119961 in serial on cg4-7    
[2021-04-27T15:56:59.427] email msg to amtarave@asu.edu: Slurm Job_id=9119903 Name=snakejob.fastqc_analysis.347.sh Began, Queued time 00:01:50
[2021-04-27T15:56:59.427] backfill: Started JobId=9119903 in serial on cg4-1    
[2021-04-27T15:56:59.430] email msg to amtarave@asu.edu: Slurm Job_id=9119907 Name=snakejob.fastqc_analysis.76.sh Began, Queued time 00:01:43
[2021-04-27T15:56:59.431] backfill: Started JobId=9119907 in serial on cg4-1    
[2021-04-27T15:57:00.297] error: fork(): Cannot allocate memory                 
[2021-04-27T15:57:00.297] fatal: _agent_retry: pthread_create error Resource temporarily unavailable
[2021-04-27T15:57:00.393] _pick_best_nodes: JobId=8671447 never runnable in partition asinghargpu1
[2021-04-27T15:57:00.403] _pick_best_nodes: JobId=8671447 never runnable in partition rcgpu5
[2021-04-27T15:57:00.410] sched: Allocate JobId=9119968 NodeList=cg38-5 #CPUs=8 Partition=htc
[2021-04-27T15:57:00.488] job_submit_defaults.c : job_submit() : Setting empty feature list to 'GenCPU' in job_modify
[2021-04-27T15:57:00.488] job_submit_defaults.c : job_submit() : Changing switch setting to 1 in job_submit
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : ### partition plugin start ###
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : partition = (null)
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : Features = GenCPU
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : min_cpus = 8  
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : min_nodes = 4294967294
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : max_nodes = 4294967294
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : time_limit = 120
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : Generic SERIAL job with a time limit <= 240 minutes / 4 hours, sending it to the 'HTC' partitions

This looks like an instance of the bug that caused slurmctld to run out of threads while handling email notifications.  We're running a patch provided by SchedMD to mitigate this issue.

I'd like to verify that my diagnosis of this issue is accurate.

We're planning to upgrade to a supported version of Slurm during the summer break.
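[Editorial note: the paired "fork(): Cannot allocate memory" and "pthread_create ... Resource temporarily unavailable" errors above usually mean the Slurm user's RLIMIT_NPROC (which counts threads as well as processes) was exhausted. A quick check of the live thread count against that limit can be sketched as below; this snippet is an illustration, not part of the original report. It defaults to the current shell's PID so it runs anywhere; on a controller node set PID=$(pidof slurmctld).]

```shell
# Compare a process's live thread count with its "Max processes" soft limit.
# pthread_create() fails with EAGAIN ("Resource temporarily unavailable")
# once the owning user's total threads + processes reach RLIMIT_NPROC.
PID="${PID:-$$}"                 # defaults to this shell; use pidof slurmctld
THREADS=$(awk '/^Threads:/ {print $2}' "/proc/$PID/status")
NPROC=$(awk '/^Max processes/ {print $3}' "/proc/$PID/limits")
echo "pid $PID: $THREADS threads (soft nproc limit: $NPROC)"
```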
Comment 1 Dominik Bartkiewicz 2021-05-04 04:40:03 MDT
Hi

Without better logs, it will be difficult to track down what caused this crash.
Could you point me to the ticket that patch came from? Did this happen frequently, or only once?

Dominik
Comment 2 Dominik Bartkiewicz 2021-05-26 03:04:42 MDT
Hi

c43b1066a63 -- This commit should resolve the issue by reducing the number of concurrently forked mail processes from 256 to 64.

Dominik
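[Editorial note: the idea behind that commit can be sketched in shell. This is an illustration of bounding concurrent forks only, not the actual Slurm C code; `send_mail` is a stand-in for forking the configured MailProg, and the batching here is coarser than what slurmctld does.]

```shell
# Illustration only: cap how many mail processes may run at once, so that a
# burst of job-start/job-end notifications cannot exhaust RLIMIT_NPROC.
# 64 is the cap mentioned in comment 2.
MAX_MAIL_PROC=64
send_mail() { :; }          # stand-in for forking the configured MailProg

active=0
for i in $(seq 1 200); do   # pretend 200 notifications arrive in a burst
    send_mail &
    active=$((active + 1))
    if [ "$active" -ge "$MAX_MAIL_PROC" ]; then
        wait                # drain the batch before forking any more
        active=0
    fi
done
wait
echo "sent 200 notifications with at most $MAX_MAIL_PROC concurrent processes"
```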
Comment 3 Lee Reynolds 2021-06-04 12:28:26 MDT
Created attachment 19816 [details]
Patch supplied to us to fix thread issue
Comment 4 Lee Reynolds 2021-06-04 12:29:44 MDT
It has just happened again.

Here's the end of the log from the point that it crashed:

[2021-06-04T11:14:17.653] email msg to amtarave@asu.edu: Slurm Job_id=9611683 Name=snakejob.getAlleleFrq.1521.sh Ended, Run time 00:00:34, COMPLETED, ExitCode 0
[2021-06-04T11:14:17.653] _job_complete: JobId=9611683 done                     
[2021-06-04T11:14:17.657] _job_complete: JobId=9611679 WEXITSTATUS 0            
[2021-06-04T11:14:17.657] email msg to amtarave@asu.edu: Slurm Job_id=9611679 Name=snakejob.getAlleleFrq.299.sh Ended, Run time 00:00:34, COMPLETED, ExitCode 0
[2021-06-04T11:14:17.657] _job_complete: JobId=9611679 done                     
[2021-06-04T11:14:17.819] error: fork(): Cannot allocate memory                 
[2021-06-04T11:14:17.820] error: fork(): Cannot allocate memory                 
[2021-06-04T11:14:17.820] error: fork(): Cannot allocate memory                 
[2021-06-04T11:14:17.825] error: fork(): Cannot allocate memory                 
[2021-06-04T11:14:17.827] error: fork(): Cannot allocate memory                 
[2021-06-04T11:14:17.827] error: fork(): Cannot allocate memory                 
[2021-06-04T11:14:17.827] fatal: _agent_retry: pthread_create error Resource temporarily unavailable
[2021-06-04T11:14:17.828] error: fork(): Cannot allocate memory                 
[2021-06-04T11:14:17.828] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable


I'm attaching the patch that we were provided with before.
Comment 5 Dominik Bartkiewicz 2021-06-07 03:56:48 MDT
Created attachment 19821 [details]
core file creation patch

Hi

Could you check the limit values of the slurmctld process?
e.g.:
cat /proc/`pidof slurmctld`/limits

Does dmesg/syslog contain anything relevant from around the time of the slurmctld crash?

You can also apply this patch. If you hit this issue again,
we will have a core dump and will be able to find the root cause.

Dominik
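[Editorial note: for the core-file patch to actually produce a dump, the kernel and service manager must also allow one. The recipe below is an assumed site-specific setup for a systemd-managed slurmctld (requires root; unit name and paths vary by site), not something taken from this ticket. It is a configuration fragment, shown for illustration.]

```shell
# Give crashing daemons somewhere to write cores, with a predictable name.
mkdir -p /var/crash
echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern

# systemd drop-in so slurmctld is not started with RLIMIT_CORE=0.
mkdir -p /etc/systemd/system/slurmctld.service.d
cat > /etc/systemd/system/slurmctld.service.d/core.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
systemctl daemon-reload && systemctl restart slurmctld
```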
Comment 6 Dominik Bartkiewicz 2021-06-15 07:49:12 MDT
Hi

Any news?

Dominik
Comment 7 Lee Reynolds 2021-06-17 14:05:08 MDT
(In reply to Dominik Bartkiewicz from comment #5)

We've been running this patch for a little while now.  I suspect that it is working as intended, but that the load on our cluster is high enough that, even with the thread adjustment, we are still exceeding the limit.  Our cluster is not very large in terms of cores, but we have many simultaneous users on it at any given time.

Here are the current limits:

cat /proc/`pidof slurmctld`/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            unlimited            unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             31112                31112                processes 
Max open files            65536                65536                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       31112                31112                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us       

I've looked at dmesg and the main syslog in the past, and have not seen anything that looked out of place.

I do have a core dump from the 4th that I can provide to you.

Is there a Dropbox or other service you'd like me to use?  The file is 500 MB+.
Comment 8 Dominik Bartkiewicz 2021-06-21 05:36:37 MDT
Hi

Sorry for the late response.
A core file without the matching binaries and libraries is useless on its own.
Could you load the core file into gdb and share the backtrace with us?

e.g. ('t a a bt' is shorthand for 'thread apply all bt'):
gdb -ex 't a a bt' -batch <slurmctld path> <corefile>

Dominik
Comment 9 Dominik Bartkiewicz 2021-07-28 09:24:47 MDT
Hi

Any news?

Dominik
Comment 12 Dominik Bartkiewicz 2021-08-09 02:19:34 MDT
Hi

I haven't seen an update on this ticket for a month, so I'll go ahead and close it,
but if the issue comes up again, or you can collect the information mentioned in comment 8, feel free to reopen the ticket.

Dominik