| Summary: | Slurmctld crashed | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Lee Reynolds <Lee.Reynolds> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | ASU | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | Patch supplied to us to fix thread issue <br> core file creation patch | | |
Description
Lee Reynolds
2021-05-03 13:25:14 MDT
Hi

Without better logs it will be difficult to track down what caused this crash. Could you point me to the bug that supplied this patch? Did this happen frequently or only once?

Dominik

Hi

c43b1066a63 -- this commit should solve the issue by reducing the number of forked mail processes (from 256 to 64).

Dominik

Created attachment 19816 [details]
Patch supplied to us to fix thread issue
It has just happened again. Here's the end of the log from the point where it crashed:

```
[2021-06-04T11:14:17.653] email msg to amtarave@asu.edu: Slurm Job_id=9611683 Name=snakejob.getAlleleFrq.1521.sh Ended, Run time 00:00:34, COMPLETED, ExitCode 0
[2021-06-04T11:14:17.653] _job_complete: JobId=9611683 done
[2021-06-04T11:14:17.657] _job_complete: JobId=9611679 WEXITSTATUS 0
[2021-06-04T11:14:17.657] email msg to amtarave@asu.edu: Slurm Job_id=9611679 Name=snakejob.getAlleleFrq.299.sh Ended, Run time 00:00:34, COMPLETED, ExitCode 0
[2021-06-04T11:14:17.657] _job_complete: JobId=9611679 done
[2021-06-04T11:14:17.819] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.820] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.820] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.825] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.827] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.827] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.827] fatal: _agent_retry: pthread_create error Resource temporarily unavailable
[2021-06-04T11:14:17.828] error: fork(): Cannot allocate memory
[2021-06-04T11:14:17.828] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
```

I'm attaching the patch that we were provided with before.

Created attachment 19821 [details]
core file creation patch
Hi

Could you check the limits values of the slurmctld process? e.g.:

```
cat /proc/`pidof slurmctld`/limits
```

Does dmesg/syslog contain any relevant info close to the slurmctld crash?

You can also apply this patch. If you hit this issue next time, we will have a core dump and will be able to find the root cause.

Dominik
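[Editor's note] The checks Dominik requests can be scripted. A minimal sketch, assuming a standard Linux `/proc` layout; it falls back to the current shell's PID so the commands can be tried on a box where slurmctld is not running:

```shell
# Resolve the slurmctld PID; fall back to this shell if the daemon is absent.
pid=$(pidof slurmctld 2>/dev/null) || pid=$$

# Full per-process limit table, as the kernel enforces it.
cat "/proc/$pid/limits"

# "Max processes" caps threads for the daemon's user, so exhausting it
# produces exactly the fork()/pthread_create failures seen in the log above.
grep -E 'Max (processes|pending signals)' "/proc/$pid/limits"

# Kernel messages near the crash (may be restricted to root; harmless if so).
dmesg 2>/dev/null | tail -n 50
```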
Hi

Any news?

Dominik

(In reply to Dominik Bartkiewicz from comment #5)
> Created attachment 19821 [details]
> core file creation patch
>
> Hi
>
> Could you check the limits value of slurmctld process?
> eg.:
> cat /proc/`pidof slurmctld`/limits
>
> Does dmesg/syslog contain any relevant info close to slurmctld crash?
>
> You can also apply this patch. If you hit this issue next time,
> we will have core dump and we will be able to find the root cause of this.
>
> Dominik

We've been running this patch for a little while now. I suspect that it is working as it should, but that the load on our cluster is such that even with the thread setting adjustment it still exceeds the threshold. Our cluster is not very large in terms of cores, but we have many, many simultaneous users on it at any given time. Here are the current limits:

```
cat /proc/`pidof slurmctld`/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             31112                31112                processes
Max open files            65536                65536                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       31112                31112                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
```

I've looked at dmesg and the main syslog in the past and have not seen anything that looked out of place. I do have a core dump from the 4th that I can provide to you. Is there a dropbox or other service you'd like me to use? It is 500M+.

Hi

Sorry for the late response. A core file without the binaries and libs is useless. Could you load the core file into gdb and share the backtrace with us? e.g.:

```
gdb -ex 't a a bt' -batch <slurmctld path> <corefile>
```

Dominik

Hi

Any news?
Dominik

Hi

I haven't seen an update to this ticket for a month, so I'll go ahead and close it. If the crash comes up again, or you can collect the information mentioned in comment 8, feel free to reopen the ticket.

Dominik
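[Editor's note] Lee's limits show `Max processes 31112`, which bounds both fork() and pthread_create for the controller's user. If that ceiling is the bottleneck, one common mitigation is raising it via a systemd drop-in for the slurmctld unit. This is a sketch only; the path and values are assumptions, not something this ticket prescribes:

```
# /etc/systemd/system/slurmctld.service.d/limits.conf  (hypothetical path/values)
[Service]
LimitNPROC=65536
LimitNOFILE=131072
```

After `systemctl daemon-reload` and a restart of slurmctld, the new values should be visible in `/proc/$(pidof slurmctld)/limits`.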