Our director has asked me to submit a report on this. We're running 19.05.8, and slurmctld crashed last week with the following messages in its logs:

[2021-04-27T15:56:59.412] email msg to amtarave@asu.edu: Slurm Job_id=9119877 Name=snakejob.fastqc_analysis.438.sh Began, Queued time 00:03:20
[2021-04-27T15:56:59.412] backfill: Started JobId=9119877 in serial on cg2-9
[2021-04-27T15:56:59.416] email msg to amtarave@asu.edu: Slurm Job_id=9119879 Name=snakejob.fastqc_analysis.152.sh Began, Queued time 00:03:19
[2021-04-27T15:56:59.416] backfill: Started JobId=9119879 in serial on cg12-11
[2021-04-27T15:56:59.419] email msg to amtarave@asu.edu: Slurm Job_id=9119892 Name=snakejob.fastqc_analysis.31.sh Began, Queued time 00:02:41
[2021-04-27T15:56:59.419] backfill: Started JobId=9119892 in serial on cg12-11
[2021-04-27T15:56:59.422] email msg to amtarave@asu.edu: Slurm Job_id=9119895 Name=snakejob.fastqc_analysis.160.sh Began, Queued time 00:02:22
[2021-04-27T15:56:59.422] backfill: Started JobId=9119895 in serial on cg3-2
[2021-04-27T15:56:59.424] email msg to amtarave@asu.edu: Slurm Job_id=9119961 Name=snakejob.fastqc_analysis.34.sh Began, Queued time 00:00:10
[2021-04-27T15:56:59.425] backfill: Started JobId=9119961 in serial on cg4-7
[2021-04-27T15:56:59.427] email msg to amtarave@asu.edu: Slurm Job_id=9119903 Name=snakejob.fastqc_analysis.347.sh Began, Queued time 00:01:50
[2021-04-27T15:56:59.427] backfill: Started JobId=9119903 in serial on cg4-1
[2021-04-27T15:56:59.430] email msg to amtarave@asu.edu: Slurm Job_id=9119907 Name=snakejob.fastqc_analysis.76.sh Began, Queued time 00:01:43
[2021-04-27T15:56:59.431] backfill: Started JobId=9119907 in serial on cg4-1
[2021-04-27T15:57:00.297] error: fork(): Cannot allocate memory
[2021-04-27T15:57:00.297] fatal: _agent_retry: pthread_create error Resource temporarily unavailable
[2021-04-27T15:57:00.393] _pick_best_nodes: JobId=8671447 never runnable in partition asinghargpu1
[2021-04-27T15:57:00.403] _pick_best_nodes: JobId=8671447 never runnable in partition rcgpu5
[2021-04-27T15:57:00.410] sched: Allocate JobId=9119968 NodeList=cg38-5 #CPUs=8 Partition=htc
[2021-04-27T15:57:00.488] job_submit_defaults.c : job_submit() : Setting empty feature list to 'GenCPU' in job_modify
[2021-04-27T15:57:00.488] job_submit_defaults.c : job_submit() : Changing switch setting to 1 in job_submit
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : ### partition plugin start ###
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : partition = (null)
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : Features = GenCPU
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : min_cpus = 8
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : min_nodes = 4294967294
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : max_nodes = 4294967294
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : time_limit = 120
[2021-04-27T15:57:00.488] job_submit_partition.c : job_submit() : Generic SERIAL job with a time limit <= 240 minutes / 4 hours, sending it to the 'HTC' partitions

This looks like an instance of the bug that caused slurmctld to run out of threads while sending email notifications. We're running a patch provided by SchedMD to compensate for this issue. I'd like to verify that my diagnosis is accurate. We're planning to upgrade to a supported version of Slurm during the summer break.
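One way to check whether the diagnosis fits is to watch the daemon's thread count in /proc and see whether it climbs toward the "Max processes" limit before a crash. A minimal sketch, demonstrated here against the current shell's PID for safety; in practice you would substitute `$(pidof slurmctld)`:

```shell
# Report the thread count of a process from /proc.
# Shown against the current shell ($$) for illustration;
# replace with $(pidof slurmctld) to watch the daemon.
pid=$$

# The Threads: field of /proc/<pid>/status gives the count directly.
awk '/^Threads:/ {print $2}' "/proc/$pid/status"

# Cross-check: each thread has an entry under /proc/<pid>/task.
ls "/proc/$pid/task" | wc -l
```

Logging this periodically (e.g. from cron) around busy periods would show whether slurmctld is genuinely approaching the thread ceiling when the fork()/pthread_create failures appear.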
Hi

Without better logs, it will be difficult to track down what caused this crash. Could you point me to the bug associated with this patch? Did this happen frequently, or only once?

Dominik
Hi

c43b1066a63 -- this commit should resolve the issue by reducing the number of forked mail processes (from 256 to 64).

Dominik
Created attachment 19816 [details] Patch supplied to us to fix thread issue
It has just happened again. Here's the end of the log from the point that it crashed:

[2021-06-04T11:14:17.653] email msg to amtarave@asu.edu: Slurm Job_id=9611683 Name=snakejob.getAlleleFrq.1521.sh Ended, Run time 00:00:34, COMPLETED, ExitCode 0
[2021-06-04T11:14:17.653] _job_complete: JobId=9611683 done
[2021-06-04T11:14:17.657] _job_complete: JobId=9611679 WEXITSTATUS 0
[2021-06-04T11:14:17.657] email msg to amtarave@asu.edu: Slurm Job_id=9611679 Name=snakejob.getAlleleFrq.299.sh Ended, Run time 00:00:34, COMPLETED, ExitCode 0
[2021-06-04T11:14:17.657] _job_complete: JobId=9611679 done
[2021-06-04T11:14:17.819] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.820] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.820] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.825] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.827] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.827] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.827] fatal: _agent_retry: pthread_create error Resource temporarily unavailable
[2021-06-04T11:14:17.828] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.828] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

I'm attaching the patch that we were provided with before.
Created attachment 19821 [details]
core file creation patch

Hi

Could you check the limits of the slurmctld process? e.g.:

cat /proc/`pidof slurmctld`/limits

Does dmesg/syslog contain any relevant information close to the slurmctld crash?

You can also apply this patch. The next time you hit this issue, we will have a core dump and will be able to find the root cause.

Dominik
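Independent of the attached patch, the OS also has to permit core files for a dump to appear. A hedged sketch of the usual checks (values are illustrative; a systemd-managed slurmctld would use `LimitCORE=infinity` in its unit instead of `ulimit`):

```shell
# Allow unlimited core files in the environment that starts slurmctld.
ulimit -c unlimited
ulimit -c    # should now report "unlimited"

# Check where the kernel writes core files; if core_pattern pipes to a
# handler such as systemd-coredump, the dump lands there rather than in
# the daemon's working directory.
cat /proc/sys/kernel/core_pattern
```

If `core_pattern` starts with `|`, use the named handler's tooling (e.g. `coredumpctl` for systemd-coredump) to retrieve the dump.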
Hi

Any news?

Dominik
(In reply to Dominik Bartkiewicz from comment #5)
> Created attachment 19821 [details]
> core file creation patch
>
> Hi
>
> Could you check the limits value of slurmctld process?
> eg.:
> cat /proc/`pidof slurmctld`/limits
>
> Does dmesg/syslog contain any relevant info close to slurmctld crash?
>
> You can also apply this patch. If you hit this issue next time, we will
> have core dump and we will be able to find the root cause of this.
>
> Dominik

We've been running this patch for a little while now. I suspect that it is working as it should, but that the load on our cluster is such that, even with the adjusted thread setting, it is still exceeding the threshold. Our cluster is not very large in terms of cores, but we have many, many simultaneous users utilizing it at any given time.

Here are the current limits:

cat /proc/`pidof slurmctld`/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            unlimited    unlimited    bytes
Max core file size        unlimited    unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             31112        31112        processes
Max open files            65536        65536        files
Max locked memory         65536        65536        bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       31112        31112        signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us

I've looked at dmesg and the main syslog in the past, and have not seen anything that looked out of place. I do have a core dump from the 4th that I can provide to you. Is there a Dropbox or other service you'd like me to use? It is 500 MB+.
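If the process/thread ceiling ("Max processes" above, 31112) really is what slurmctld is hitting under load, raising the daemon's limits may buy headroom until the planned upgrade. Assuming slurmctld runs under systemd (the override path and values below are illustrative assumptions, not taken from this cluster), a drop-in might look like:

```ini
# /etc/systemd/system/slurmctld.service.d/limits.conf  (hypothetical path)
[Service]
# Raises the ceiling reported as "Max processes" in /proc/<pid>/limits
LimitNPROC=65536
# Keeps core files enabled for debugging the crash
LimitCORE=infinity
```

A `systemctl daemon-reload` and a slurmctld restart would be needed for the override to take effect; whether a higher limit actually prevents the crash depends on available memory, since the log also shows fork() failing with "Cannot allocate memory".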
Hi

Sorry for the late response. A core file without the matching binaries and libraries is of little use on its own. Could you load the core file into gdb and share the backtrace with us? e.g.:

gdb -ex 't a a bt' -batch <slurmctld path> <corefile>

Dominik
Hi

I haven't seen an update to this ticket for a month, so I'll go ahead and close it. If the issue comes up again, or if you can collect the information mentioned in comment 8, feel free to reopen the ticket.

Dominik