Ticket 14743

Summary: slurmctld died
Product: Slurm
Reporter: ARC Admins <arc-slurm-admins>
Component: slurmctld
Assignee: Marshall Garey <marshall>
Status: RESOLVED CANNOTREPRODUCE
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: bart
Version: 21.08.8
Hardware: Linux
OS: Linux
Site: University of Michigan
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---
Attachments: lhctld slurm.conf
lhctld slurmsched.log
lhctld slurmctld.log

Description ARC Admins 2022-08-12 07:04:17 MDT
Created attachment 26304
lhctld slurm.conf

Hello,

This morning, at about 6:47am EDT, we found that our slurmctld had died and dumped core. Unfortunately, the core file was truncated, so I can't get a backtrace. Here's what systemd showed:

```
[root@lhctld ~]# systemctl status slurmctld.service 
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmctld.service.d
           └─nofile.conf
   Active: failed (Result: core-dump) since Fri 2022-08-12 06:47:45 EDT; 59min ago
  Process: 325190 ExecStart=/opt/slurm/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS (code=dumped, signal=ABRT)
 Main PID: 325190 (code=dumped, signal=ABRT)

Aug 12 06:33:45 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: sched: Allocate JobId=1851017 NodeList=lh0419 #CPUs=20 Partition=sigbio
Aug 12 06:33:45 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: sched: Allocate JobId=1851018 NodeList=lh0419 #CPUs=20 Partition=sigbio
Aug 12 06:33:45 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld:    retry_list retry_list_size:154 msg_type=REQUEST_LAUNCH_PROLOG,REQUEST_BA>
Aug 12 06:37:10 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: sched/backfill: _start_job: Started JobId=1850681 in sigbio on lh0407
Aug 12 06:43:37 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: _job_complete: JobId=1850973 OOM failure
Aug 12 06:43:37 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: _job_complete: JobId=1850973 done
Aug 12 06:43:45 lhctld.arc-ts.umich.edu slurmctld[325190]: slurmctld: sched: Allocate JobId=1851019 NodeList=lh0412 #CPUs=20 Partition=sigbio
Aug 12 06:47:37 lhctld.arc-ts.umich.edu slurmctld[325190]: malloc_consolidate(): invalid chunk size
Aug 12 06:47:45 lhctld.arc-ts.umich.edu systemd[1]: slurmctld.service: Main process exited, code=dumped, status=6/ABRT
Aug 12 06:47:45 lhctld.arc-ts.umich.edu systemd[1]: slurmctld.service: Failed with result 'core-dump'.
```

and coredumpctl:

```
[root@lhctld slurm]# coredumpctl info
           PID: 325190 (slurmctld)
           UID: 495 (slurm)
           GID: 497 (slurm)
        Signal: 6 (ABRT)
     Timestamp: Fri 2022-08-12 06:47:37 EDT (1h 48min ago)
  Command Line: /opt/slurm/sbin/slurmctld -D -s
    Executable: /opt/slurm/sbin/slurmctld
 Control Group: /system.slice/slurmctld.service
          Unit: slurmctld.service
         Slice: system.slice
       Boot ID: cbd2a931bd9f49bfa64bb05bd6447e09
    Machine ID: 0834a8f00d8d4a5588dfb56dc67495a9
      Hostname: lhctld.arc-ts.umich.edu
       Storage: /var/lib/systemd/coredump/core.slurmctld.495.cbd2a931bd9f49bfa64bb05bd6447e09.325190.1660301257000000.lz4 (truncated)
       Message: Process 325190 (slurmctld) of user 495 dumped core.

                Stack trace of thread 325190:
                #0  0x00007fd12b69137f n/a (n/a)
```
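The core file name can be cross-checked against the journal timestamps. The following is a minimal sketch, assuming the second-to-last dot-separated field of the file name is microseconds since the Unix epoch and that the process name contains no dots; the helper name `core_timestamp` and the fixed UTC-4 offset for EDT are illustrative choices, not part of any Slurm or systemd tooling.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Eastern Daylight Time (assumption: the host was on EDT).
EDT = timezone(timedelta(hours=-4), "EDT")


def core_timestamp(core_name: str) -> datetime:
    """Decode the timestamp field of a core file name as a UTC datetime.

    Assumed layout: core.<comm>.<uid>.<boot id>.<pid>.<usec>.<compression>
    """
    usec = int(core_name.split(".")[5])
    # Integer division avoids float rounding on the microsecond value.
    return datetime.fromtimestamp(usec // 1_000_000, tz=timezone.utc)


name = "core.slurmctld.495.cbd2a931bd9f49bfa64bb05bd6447e09.325190.1660301257000000.lz4"
print(core_timestamp(name).astimezone(EDT).isoformat())
# prints 2022-08-12T06:47:37-04:00
```

Under those assumptions, the decoded time matches the `malloc_consolidate()` message at 06:47:37 EDT in the journal.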

Note that we just upgraded from 21.08.7 to 21.08.8.

David
Comment 1 ARC Admins 2022-08-12 07:04:38 MDT
Created attachment 26305
lhctld slurmsched.log
Comment 2 ARC Admins 2022-08-12 07:04:59 MDT
Created attachment 26306
lhctld slurmctld.log
Comment 4 Marshall Garey 2022-08-12 17:53:50 MDT
Unfortunately, there are no clues in the log files as to why the slurmctld crashed. The best clue we have is:

Aug 12 06:47:37 lhctld.arc-ts.umich.edu slurmctld[325190]: malloc_consolidate(): invalid chunk size
Aug 12 06:47:45 lhctld.arc-ts.umich.edu systemd[1]: slurmctld.service: Main process exited, code=dumped, status=6/ABRT


It looks like malloc() failed.

Questions:

* Do you know how much memory slurmctld was using at or near the time of the crash? How about the memory usage of the controller node? Is that something you monitor?
  - We've fixed a few memory leaks since 21.08.8, so it is possible that slurmctld leaked memory.
* What limit do you have on the size of the core file? Are you able to increase that so we can get a whole core file? We really need the backtrace from the coredump in order to debug crashes.
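
For the memory question, one low-effort way to have data on hand next time is to sample the daemon's resident set size periodically. This is only a sketch: it assumes a Linux `/proc` filesystem, and the function names `parse_vmrss_kb` and `read_rss_kb` are hypothetical helpers, not part of Slurm.

```python
import re
from pathlib import Path
from typing import Optional


def parse_vmrss_kb(status_text: str) -> Optional[int]:
    """Return the VmRSS value in kB from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid: int) -> Optional[int]:
    """Read the current resident set size of a running process, in kB."""
    return parse_vmrss_kb(Path(f"/proc/{pid}/status").read_text())
```

Calling `read_rss_kb()` with the slurmctld PID from a cron job or timer and logging the result with a timestamp would show whether memory grows steadily before a crash.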
Comment 5 ARC Admins 2022-08-15 06:53:49 MDT
(In reply to Marshall Garey from comment #4)
Marshall,

> * Do you know how much memory slurmctld was using at or near the time of the
> crash? How about the memory usage of the controller node? Is that something
> you monitor?
>   - We've fixed a few memory leaks since 21.08.8, so it is possible that
> slurmctld leaked memory.

Unfortunately, we do not have memory usage figures from the time of the event.

> * What limit do you have on the size of the core file? Are you able to
> increase that so we can get a whole core file? We really need the backtrace
> from the coredump in order to debug crashes.

It looks like we currently allow an unlimited core file size:

```
[root@lhctld etc]# date; ulimit -a
Mon Aug 15 08:52:30 EDT 2022
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256335
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 20480
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 256335
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
```
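
Note that `ulimit -a` reports the limits of the interactive root shell, which may differ from those of the slurmctld process started by systemd. A minimal sketch for reading the limit that applies to the running daemon follows; it assumes a Linux `/proc` filesystem, and `parse_core_limit` and `core_limit_for_pid` are hypothetical helper names. Separately, systemd-coredump has its own size settings (`ProcessSizeMax=` and `ExternalSizeMax=` in `coredump.conf`), which may also be worth checking when a stored core is marked as truncated.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def parse_core_limit(limits_text: str) -> Optional[Tuple[str, str]]:
    """Return (soft, hard) core file size limits from /proc/<pid>/limits text."""
    match = re.search(
        r"^Max core file size\s+(\S+)\s+(\S+)", limits_text, re.MULTILINE
    )
    return (match.group(1), match.group(2)) if match else None


def core_limit_for_pid(pid: int) -> Optional[Tuple[str, str]]:
    """Read the core file size limits of a running process."""
    return parse_core_limit(Path(f"/proc/{pid}/limits").read_text())
```

Passing the slurmctld PID to `core_limit_for_pid()` returns the soft and hard values as strings, for example `unlimited` or a byte count.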

David
Comment 7 Marshall Garey 2022-08-17 16:53:55 MDT
Without a coredump, I have no idea what caused this crash (and therefore don't know what needs to be fixed or how to fix it).

I'm closing this as cannotreproduce, but please re-open the ticket if it happens again.