| Summary: | slurmctld died | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | ARC Admins <arc-slurm-admins> |
| Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | bart |
| Version: | 21.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | University of Michigan | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave Sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | lhctld slurm.conf, lhctld slurmsched.log, lhctld slurmctld.log | | |
Created attachment 26305 [details]
lhctld slurmsched.log

Created attachment 26306 [details]
lhctld slurmctld.log

Unfortunately, there are no clues in the log files as to why slurmctld crashed. The best clue we have is:

```
Aug 12 06:47:37 lhctld.arc-ts.umich.edu slurmctld[325190]: malloc_consolidate(): invalid chunk size
Aug 12 06:47:45 lhctld.arc-ts.umich.edu systemd[1]: slurmctld.service: Main process exited, code=dumped, status=6/ABRT
```

It looks like glibc's malloc() detected a problem with the heap (an invalid chunk size found during consolidation) and aborted. Questions:

* Do you know how much memory slurmctld was using at or near the time of the crash? How about the memory usage of the controller node? Is that something you monitor?
  - We've fixed a few memory leaks since 21.08.8, so it is possible that slurmctld leaked memory.
* What limit do you have on the size of the core file? Are you able to increase that so we can get a whole core file? We really need the backtrace from the coredump in order to debug crashes.

(In reply to Marshall Garey from comment #4)

Marshall,

> * Do you know how much memory slurmctld was using at or near the time of the crash? How about the memory usage of the controller node? Is that something you monitor?
>   - We've fixed a few memory leaks since 21.08.8, so it is possible that slurmctld leaked memory.

We do not have the value/memory used at the time of the event, unfortunately.

> * What limit do you have on the size of the core file? Are you able to increase that so we can get a whole core file? We really need the backtrace from the coredump in order to debug crashes.
We currently allow for unlimited core file size, it looks like:

```
[root@lhctld etc]# date; ulimit -a
Mon Aug 15 08:52:30 EDT 2022
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256335
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 20480
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 256335
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
```

David

Without a coredump, I have no idea what caused this crash (and therefore don't know what needs to be fixed or how to fix it). I'm closing this as CANNOTREPRODUCE, but please re-open the ticket if it happens again.
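Editor's note: an interactive root shell's `ulimit -a` does not necessarily bind a systemd-started daemon, and on systems where systemd-coredump handles cores (as the `Storage: ... (truncated)` line below suggests), the stored core is additionally capped by `coredump.conf` limits (`ProcessSizeMax`/`ExternalSizeMax`, which default to 2G on many distros) regardless of the core file size rlimit. A sketch of raising both, assuming systemd-coredump is in use; the drop-in filename `core.conf` and the 32G values are illustrative, not from this ticket:

```
# Hypothetical drop-in: /etc/systemd/system/slurmctld.service.d/core.conf
# (the unit already uses a drop-in directory, per the systemctl status output)
[Service]
LimitCORE=infinity
```

```
# Hypothetical override in /etc/systemd/coredump.conf: raise the limits at
# which systemd-coredump truncates stored cores; a large slurmctld core can
# be truncated here even when ulimit -c is unlimited.
[Coredump]
ProcessSizeMax=32G
ExternalSizeMax=32G
```

After editing, `systemctl daemon-reload` (and a slurmctld restart) would be needed for the drop-in to take effect.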
Created attachment 26304 [details]
lhctld slurm.conf

Hello,

This morning, about 6:47am EST, we found our slurmctld had died and core-dumped. Unfortunately, the core file was truncated so I can't get a backtrace. Here's what systemd showed:

```
[root@lhctld ~]# systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmctld.service.d
           └─nofile.conf
   Active: failed (Result: core-dump) since Fri 2022-08-12 06:47:45 EDT; 59min ago
  Process: 325190 ExecStart=/opt/slurm/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS (code=dumped, signal=ABRT)
 Main PID: 325190 (code=dumped, signal=ABRT)

Aug 12 06:33:45 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: sched: Allocate JobId=1851017 NodeList=lh0419 #CPUs=20 Partition=sigbio
Aug 12 06:33:45 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: sched: Allocate JobId=1851018 NodeList=lh0419 #CPUs=20 Partition=sigbio
Aug 12 06:33:45 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: retry_list retry_list_size:154 msg_type=REQUEST_LAUNCH_PROLOG,REQUEST_BA>
Aug 12 06:37:10 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: sched/backfill: _start_job: Started JobId=1850681 in sigbio on lh0407
Aug 12 06:43:37 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: _job_complete: JobId=1850973 OOM failure
Aug 12 06:43:37 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: _job_complete: JobId=1850973 done
Aug 12 06:43:45 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: sched: Allocate JobId=1851019 NodeList=lh0412 #CPUs=20 Partition=sigbio
Aug 12 06:47:37 lhctld.arc-ts.umich.edu slurmctld[325190]: malloc_consolidate(): invalid chunk size
Aug 12 06:47:45 lhctld.arc-ts.umich.edu systemd[1]: slurmctld.service: Main process exited, code=dumped, status=6/ABRT
Aug 12 06:47:45 lhctld.arc-ts.umich.edu systemd[1]: slurmctld.service: Failed with result 'core-dump'.
```

and coredumpctl:

```
[root@lhctld slurm]# coredumpctl info
           PID: 325190 (slurmctld)
           UID: 495 (slurm)
           GID: 497 (slurm)
        Signal: 6 (ABRT)
     Timestamp: Fri 2022-08-12 06:47:37 EDT (1h 48min ago)
  Command Line: /opt/slurm/sbin/slurmctld -D -s
    Executable: /opt/slurm/sbin/slurmctld
 Control Group: /system.slice/slurmctld.service
          Unit: slurmctld.service
         Slice: system.slice
       Boot ID: cbd2a931bd9f49bfa64bb05bd6447e09
    Machine ID: 0834a8f00d8d4a5588dfb56dc67495a9
      Hostname: lhctld.arc-ts.umich.edu
       Storage: /var/lib/systemd/coredump/core.slurmctld.495.cbd2a931bd9f49bfa64bb05bd6447e09.325190.1660301257000000.lz4 (truncated)
       Message: Process 325190 (slurmctld) of user 495 dumped core.

                Stack trace of thread 325190:
                #0  0x00007fd12b69137f n/a (n/a)
```

We just upgraded to 21.08.8 from 21.08.7, as a note.

David
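Editor's note: if a future crash leaves a complete (untruncated) core in the journal, the backtrace Marshall asked for can be pulled straight from coredumpctl. A sketch, assuming gdb and debug symbols for this slurmctld build are installed; the PID shown is the one from this incident, for illustration:

```
[root@lhctld ~]# coredumpctl gdb 325190       # open the stored core in gdb
(gdb) thread apply all bt full                # full backtrace of every thread
```

Attaching that `bt full` output to the ticket is usually enough for support to locate the corruption site.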