Ticket 10492 - slurmstepd processes are not cleaned up & slurmd failed
Summary: slurmstepd processes are not cleaned up & slurmd failed
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.11.1
Hardware: Linux
OS: Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
 
Reported: 2020-12-21 05:08 MST by Ramy Adly
Modified: 2020-12-27 01:34 MST

Site: KAUST


Attachments
slurmd from gpu210-06 (28.08 KB, text/plain)
2020-12-21 22:41 MST, Ramy Adly
Details
slurmctld log file (11.55 MB, application/gzip)
2020-12-21 22:43 MST, Ramy Adly
Details
dmesg for gpu210-06 (18.10 KB, text/plain)
2020-12-22 05:03 MST, Ramy Adly
Details
Journalctl for gpu210-06 (26.16 KB, text/plain)
2020-12-22 05:03 MST, Ramy Adly
Details
syslog for gpu210-06 (670.37 KB, text/x-log)
2020-12-22 05:04 MST, Ramy Adly
Details
ABRT notification. (63.52 KB, text/plain)
2020-12-23 02:31 MST, Greg Wickham
Details
Backtrace of three Core Dumps (5.77 KB, text/plain)
2020-12-23 04:36 MST, Greg Wickham
Details
using "thread apply all bt full" (8.43 KB, text/plain)
2020-12-23 06:21 MST, Greg Wickham
Details
ptype $_siginfo and p $_siginfo (5.59 KB, text/plain)
2020-12-23 06:56 MST, Greg Wickham
Details

Description Ramy Adly 2020-12-21 05:08:23 MST
Hello,

Slurmd keeps failing randomly on some nodes, leaving slurmstepd processes and their job processes running without cleanup:

---------------------------------------------------------------
Dec 21 10:14:24 gpu210-06 systemd[1]: slurmd.service failed.
-----------------
$ sacct -j 13368592 -o jobid,state,end
       JobID      State                 End 
------------ ---------- ------------------- 
13368592         FAILED 2020-12-21T10:14:24 
13368592.ba+     FAILED 2020-12-21T10:14:24 
13368592.ex+  CANCELLED 2020-12-21T10:24:35 
-------------------------------------------------------


From slurmctld log:
-------------------------------
[2020-12-21T10:14:24.000] _job_complete: JobId=13368592 WEXITSTATUS 137
[2020-12-21T10:14:24.000] _job_complete: JobId=13368592 done
[2020-12-21T10:15:32.775] update_node: node gpu210-06 reason set to: NHC: check_slurm: slurmd PID 41886 not present
-----------------------------------------------------


After a while, the node state goes DOWN and its jobs are requeued, likewise leaving their slurmstepd and job processes uncleaned:

-----------------------------------
[2020-12-21T10:23:12.121] error: Nodes gpu210-06 not responding
[2020-12-21T10:24:35.795] requeue job JobId=13348662_7(13348662) due to failure of node gpu210-06
[2020-12-21T10:24:35.795] email msg to masheal.alghamdi@kaust.edu.sa: Slurm Array Summary Job_id=13348662_* (13348662) Name=2DC3D Requeued
[2020-12-21T10:24:35.795] email msg to masheal.alghamdi@kaust.edu.sa: Slurm Array Summary Job_id=13348662_* (13348662) Name=2DC3D Ended, NODE_FAIL, ExitCode [0-0], with requeued tasks
[2020-12-21T10:24:35.795] Requeuing JobId=13348662_7(13348662)
[2020-12-21T10:24:35.795] cleanup_completing: JobId=13368592 completion process took 611 seconds
[2020-12-21T10:24:35.795] requeue job JobId=13368599 due to failure of node gpu210-06
[2020-12-21T10:24:35.795] Requeuing JobId=13368599
[2020-12-21T10:24:35.795] requeue job JobId=13275645_1159(13368671) due to failure of node gpu210-06
[2020-12-21T10:24:35.796] Requeuing JobId=13275645_1159(13368671)
[2020-12-21T10:24:35.796] error: Nodes gpu210-06 not responding, setting DOWN
-------------------------------------------------

-------------------------------------------
root      16096  0.0  0.0 296040  4004 ?        Sl   Dec20   0:05 slurmstepd: [13348662.extern]
root      16140  0.0  0.0 296308  4840 ?        Sl   Dec20   0:09 slurmstepd: [13348662.batch]
alghmm0b  16176  0.0  0.0 113288  1224 ?        S    Dec20   0:00  \_ /bin/sh /var/spool/slurm/slurmd/job13348662/slurm_script
root      13846  0.0  0.0 230512  4324 ?        Sl   06:10   0:02 slurmstepd: [13368599.extern]
root      13853  0.0  0.0 230788  5484 ?        Sl   06:10   0:09 slurmstepd: [13368599.batch]
x_hamma+  13864  0.0  0.0 113292  1404 ?        S    06:10   0:00  \_ /bin/bash /var/spool/slurm/slurmd/job13368599/slurm_script
root      30826  0.0  0.0 230512  4320 ?        Sl   06:39   0:01 slurmstepd: [13368671.extern]
root      30835  0.0  0.0 230780  4816 ?        Sl   06:39   0:02 slurmstepd: [13368671.batch]
soldanm   30843  0.0  0.0 113304  1204 ?        S    06:39   0:00  \_ /bin/sh /var/spool/slurm/slurmd/job13368671/slurm_script
root      78978  0.0  0.0 230512  4320 ?        Sl   07:59   0:01 slurmstepd: [13368592.extern]
----------------------------------------------------------------------

The reason for the slurmd failures, and for the processes not being cleaned up, is not clear to us.

Please advise how to resolve this.

Regards,
Ramy
Comment 1 Greg Wickham 2020-12-21 06:57:10 MST
Dear Team,

This issue appears to be happening far too frequently. Each time slurmd fails, the jobs on that node are marked NODE_FAIL:

> sacct -X -D  -S 2020-12-10  --format jobid,start,user,state,nodelist  | grep -i NODE_FAIL | wc -l
86

This issue started happening at 2020-12-17T19:59:04, which is roughly when Ibex went back into service after the upgrade to 20.11.1.

   -Greg
Comment 2 Jason Booth 2020-12-21 11:32:55 MST
Hi Ramy and Greg - I will have Felip work with you on this. Would you please attach your slurmd (from the affected node/s) and slurmctld logs to this bug?
Comment 3 Greg Wickham 2020-12-21 21:46:37 MST
FYI we've set SlurmdTimeout=600 and are running a cron job that restarts slurmd if it's not running. Since then there have been no more NODE_FAIL events.
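
Roughly along these lines, as an /etc/cron.d sketch (interval and paths are illustrative, not our exact job):

# /etc/cron.d/slurmd-watchdog -- restart slurmd if its process has gone away
*/5 * * * * root pidof slurmd >/dev/null || /usr/bin/systemctl restart slurmd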
Comment 4 Ramy Adly 2020-12-21 22:41:41 MST
Created attachment 17243 [details]
slurmd from gpu210-06
Comment 5 Ramy Adly 2020-12-21 22:43:22 MST
Created attachment 17244 [details]
slurmctld log file
Comment 6 Felip Moll 2020-12-22 04:09:35 MST
Hi Ramy, Greg,

The key point is to find out why slurmd dies.

Can you check the system logs (dmesg, journalctl, OOM messages, and so on) and any coredumps generated between 2020-12-21 07:59 and 18:45? Can you upload the system logs?

As you can see, the gap in slurmd logging is exactly in that window:

[2020-12-21T07:59:07.097] Launching batch job 13368592 for UID 800146
... <no more slurmd daemon logging> ...
[2020-12-21T18:45:01.891] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
[2020-12-21T18:45:01.916] slurmd version 20.11.1 started
[2020-12-21T18:45:01.932] slurmd started on Mon, 21 Dec 2020 18:45:01 +0300


Is slurmd started with systemd or manually?
I am also interested in checking which cgroup slurmd resides under:

cat /proc/$(pidof slurmd)/cgroup

Do you have any means by which you think you could reproduce this? If so, starting slurmd manually under gdb on one node and reproducing the issue would help, in order to get a backtrace.
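
For example, using the binary path from your logs (stop the systemd unit first so only the gdb-run instance is listening; a sketch, adjust paths as needed):

$ systemctl stop slurmd
$ gdb --args /opt/slurm/cluster/ibex/install/sbin/slurmd -D
(gdb) run
... reproduce the failure; once it crashes:
(gdb) thread apply all bt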
Comment 7 Ramy Adly 2020-12-22 05:03:15 MST
Created attachment 17246 [details]
dmesg for gpu210-06
Comment 8 Ramy Adly 2020-12-22 05:03:42 MST
Created attachment 17247 [details]
Journalctl for gpu210-06
Comment 9 Ramy Adly 2020-12-22 05:04:17 MST
Created attachment 17248 [details]
syslog for gpu210-06
Comment 10 Ramy Adly 2020-12-22 05:09:43 MST
Hello Felip,

Thanks for the reply.

I have attached the dmesg, journalctl and syslog files.

slurmd is started from systemd.

root@gpu210-06: ~ # cat /proc/9666/cgroup 
11:cpuacct,cpu:/system.slice/slurmd.service
10:perf_event:/
9:memory:/system.slice/slurmd.service
8:freezer:/
7:hugetlb:/
6:pids:/system.slice/slurmd.service
5:devices:/system.slice/slurmd.service
4:net_prio,net_cls:/
3:blkio:/system.slice/slurmd.service
2:cpuset:/
1:name=systemd:/system.slice/slurmd.service


We just had another daemon failure an hour ago, but the cron job now restarts slurmd once it fails.
Do you want me to upload the new logs as well?

Regards,
Ramy
Comment 11 Greg Wickham 2020-12-22 05:43:22 MST
We have had four node failures so far today:

13340826       zhouy0g 2020-12-18T20:12:39  NODE_FAIL 
13341103       chenj0g 2020-12-18T23:34:50  NODE_FAIL 
13341106       chenj0g 2020-12-18T23:36:51  NODE_FAIL 
13341110       chenj0g 2020-12-19T04:15:12  NODE_FAIL 

It's puzzling why one particular user is affected >70% of the time.

   -greg
Comment 12 Greg Wickham 2020-12-22 05:52:33 MST
A slurmd has been launched under gdb.

User chenj0g will send jobs to this node.

:fingers crossed:

   -greg
Comment 13 Felip Moll 2020-12-22 06:25:21 MST
Oh.. from your logs:

Dec 21 10:13:40 gpu210-06 kernel: Out of memory: Kill process 79043 (python) score 725 or sacrifice child
Dec 21 10:13:40 gpu210-06 kernel: Killed process 79043 (python), UID 800146, total-vm:619547352kB, anon-rss:588297684kB, file-rss:260kB, shmem-rss:2640kB
Dec 21 10:14:22 gpu210-06 kernel: beegfs: python(79440): NodeConn (acquire stream): Connected: beegfs-meta@10.109.149.91:8005 (protocol: TCP; fallback route)

... some OOMs... and then:

Dec 21 10:14:24 gpu210-06 abrt-hook-ccpp[4099]: Process 41886 (slurmd) of user 0 killed by SIGBUS - dumping core
Dec 21 10:14:24 gpu210-06 systemd[1]: slurmd.service: main process exited, code=dumped, status=7/BUS

... there should be a core somewhere as stated here ... SIGBUS is not a nice signal.

If you can find the abrt directory (/var/spool/abrt) and paste the backtrace from the dump here (thread apply all bt), that would give us a quick lead... but...

Dec 21 10:14:24 gpu210-06 systemd[1]: Unit slurmd.service entered failed state.
Dec 21 10:14:24 gpu210-06 systemd[1]: slurmd.service failed.
Dec 21 10:14:24 gpu210-06 abrt-server[4100]: Executable '/opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Dec 21 10:14:24 gpu210-06 abrt-server[4100]: 'post-create' on '/var/spool/abrt/ccpp-2020-12-21-10:14:24-41886' exited with 1
Dec 21 10:14:24 gpu210-06 abrt-server[4100]: Deleting problem directory '/var/spool/abrt/ccpp-2020-12-21-10:14:24-41886'

<-- I see there was a problem generating the abrt report; it seems 'ProcessUnpackaged' is set to 'no'...
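
If you want abrt to keep dumps for binaries that don't belong to any package, the knob should be in /etc/abrt/abrt-action-save-package-data.conf (a sketch; please verify on your CentOS 7 image):

# /etc/abrt/abrt-action-save-package-data.conf
ProcessUnpackaged = yes

and then restart abrtd (systemctl restart abrtd).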

Let me know what you find.
Comment 14 Greg Wickham 2020-12-22 23:03:23 MST
Program received signal SIGHUP, Hangup.
0x00007ffff62559dd in accept () from /usr/lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.5-4.el7.x86_64 glibc-2.17-317.el7.x86_64 hdf5-1.8.12-11.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 infiniband-diags-51mlnx1-1.51258.x86_64 libaec-1.0.4-1.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libibumad-51mlnx1-1.51258.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 lz4-1.8.3-1.el7.x86_64 munge-libs-0.5.11-3.el7.x86_64 numactl-libs-2.0.12-5.el7.x86_64 pam-1.1.8-23.el7.x86_64 sssd-client-1.16.5-10.el7_9.5.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) 
(gdb) 
(gdb) where
#0  0x00007ffff62559dd in accept () from /usr/lib64/libpthread.so.0
#1  0x00007ffff770cb1c in slurm_accept_msg_conn (fd=<optimized out>, addr=<optimized out>) at slurm_protocol_socket.c:452
#2  0x000000000040fe07 in _msg_engine () at slurmd.c:473
#3  main (argc=2, argv=0x7fffffffe068) at slurmd.c:394
(gdb)
Comment 15 Greg Wickham 2020-12-22 23:13:40 MST
Felip,

Looking into this a little, it is apparent we are using SIGHUP for log rotation when in fact SIGUSR2 should be used.

We'll change that.
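
Something like this in the logrotate config (log path and options are a sketch for our site, not the final version):

/var/log/slurm/slurmd.log {
    missingok
    notifempty
    postrotate
        pkill -USR2 -x slurmd || true
    endscript
}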

  -Greg
Comment 16 Greg Wickham 2020-12-23 02:31:38 MST
Created attachment 17258 [details]
ABRT notification.

reason:         slurmd killed by SIGBUS
cmdline:        /opt/slurm/cluster/ibex/install/sbin/slurmd -D
executable:     /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd
pid:            44470
pwd:            /



root@cn514-02-r: /var/spool/abrt/ccpp-2020-12-23-12:10:57-44470 # gdb /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd -c coredump 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd...done.
[New LWP 44470]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm/cluster/ibex/install/sbin/slurmd -D'.
Program terminated with signal 7, Bus error.
#0  0x000000000040fe07 in _msg_engine () at slurmd.c:473
473	slurmd.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.5-4.el7.x86_64 glibc-2.17-317.el7.x86_64 hdf5-1.8.12-11.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 infiniband-diags-51mlnx1-1.51258.x86_64 libaec-1.0.4-1.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libibumad-51mlnx1-1.51258.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 lz4-1.8.3-1.el7.x86_64 munge-libs-0.5.11-3.el7.x86_64 numactl-libs-2.0.12-5.el7.x86_64 pam-1.1.8-23.el7.x86_64 sssd-client-1.16.5-10.el7_9.5.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) where
#0  0x000000000040fe07 in _msg_engine () at slurmd.c:473
#1  main (argc=2, argv=0x7ffe6726fb98) at slurmd.c:394
(gdb)
Comment 17 Greg Wickham 2020-12-23 02:43:27 MST
GDB info with sources:

$ gdb /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd -c /tmp/coredump 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd...done.
[New LWP 44470]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm/cluster/ibex/install/sbin/slurmd -D'.
Program terminated with signal 7, Bus error.
#0  0x000000000040fe07 in _msg_engine () at slurmd.c:473
473			if ((sock = slurm_accept_msg_conn(conf->lfd, cli)) >= 0) {
Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.5-4.el7.x86_64 glibc-2.17-317.el7.x86_64 hdf5-1.8.12-11.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 infiniband-diags-51mlnx1-1.51258.x86_64 libaec-1.0.4-1.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libibumad-51mlnx1-1.51258.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 lz4-1.8.3-1.el7.x86_64 munge-libs-0.5.11-3.el7.x86_64 numactl-libs-2.0.12-5.el7.x86_64 pam-1.1.8-23.el7.x86_64 sssd-client-1.16.5-10.el7_9.5.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) where
#0  0x000000000040fe07 in _msg_engine () at slurmd.c:473
#1  main (argc=2, argv=0x7ffe6726fb98) at slurmd.c:394
(gdb) p conf
$1 = (slurmd_conf_t *) 0x1359a30
(gdb) p conf->lfd
$2 = 5
(gdb) p cli
$3 = (slurm_addr_t *) 0x14bdeb0
Comment 18 Greg Wickham 2020-12-23 04:07:25 MST
More failures today:

$ gdb /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd -c /tmp/cn514-02-r/ccpp-2020-12-23-12\:10\:57-44470/coredump 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd...done.
[New LWP 44470]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm/cluster/ibex/install/sbin/slurmd -D'.
Program terminated with signal 7, Bus error.
#0  0x000000000040fe07 in _msg_engine () at slurmd.c:473
473			if ((sock = slurm_accept_msg_conn(conf->lfd, cli)) >= 0) {
Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.5-4.el7.x86_64 glibc-2.17-317.el7.x86_64 hdf5-1.8.12-11.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 infiniband-diags-51mlnx1-1.51258.x86_64 libaec-1.0.4-1.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libibumad-51mlnx1-1.51258.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 lz4-1.8.3-1.el7.x86_64 munge-libs-0.5.11-3.el7.x86_64 numactl-libs-2.0.12-5.el7.x86_64 pam-1.1.8-23.el7.x86_64 sssd-client-1.16.5-10.el7_9.5.x86_64 zlib-1.2.7-18.el7.x86_64



$ gdb /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd -c /tmp/cn506-03-r/ccpp-2020-12-23-11\:26\:15-45584/coredump
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd...done.
[New LWP 175241]
[New LWP 45584]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/slurm/cluster/ibex/install/sbin/slurmd -D'.
Program terminated with signal 7, Bus error.
#0  0x000000000040e35f in run_script_health_check () at slurmd.c:2620
2620			rc = run_script("health_check", slurm_conf.health_check_program,
Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.5-4.el7.x86_64 glibc-2.17-317.el7.x86_64 hdf5-1.8.12-11.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 infiniband-diags-51mlnx1-1.51258.x86_64 libaec-1.0.4-1.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libibumad-51mlnx1-1.51258.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 lz4-1.8.3-1.el7.x86_64 munge-libs-0.5.11-3.el7.x86_64 numactl-libs-2.0.12-5.el7.x86_64 pam-1.1.8-23.el7.x86_64 sssd-client-1.16.5-10.el7_9.5.x86_64 zlib-1.2.7-18.el7.x86_64
Comment 19 Felip Moll 2020-12-23 04:20:15 MST
(In reply to Greg Wickham from comment #15)
> Felip,
> 
> Looking into this a little, it is apparent we are using SIGHUP for log
> rotation when in fact SIGUSR2 should be used.
> 
> We'll change that.
> 
>   -Greg

Have you tried this, and has it helped?

In the coredumps, please attach to them with gdb and show me the output of:

thread apply all bt


I see it failed at different points, so I am guessing something external is killing slurmd.
Comment 20 Greg Wickham 2020-12-23 04:36:41 MST
Created attachment 17259 [details]
Backtrace of three Core Dumps

Thread backtrace of three core dumps attached (to save polluting the comments here).

If there is a benefit of changing from HUP to USR2 for log rotation it won't be noticed until tomorrow at 4am.
Comment 21 Felip Moll 2020-12-23 06:09:14 MST
(In reply to Greg Wickham from comment #20)
> Created attachment 17259 [details]
> Backtrace of three Core Dumps
> 
> Thread backtrace of three core dumps attached (to save polluting the
> comments here).

There's nothing I can say from these dumps. It just receives SIGBUS from different locations. Can you attach to them again and run:

 ptype $_siginfo
 p $_siginfo

We'll see the signal code.

I have three things in mind:

1. There's an underlying file operation failing on an NFS filesystem. Do you have NFS filesystems that Slurm relies on mounted on the nodes?

I see the SIGBUS happens just (2 sec) after python (from the batch step) is killed and BeeGFS gets a connection. I am wondering if there's any relation.

Dec 21 10:14:22 gpu210-06 kernel: beegfs: python(79440): NodeConn (acquire stream): Connected: beegfs-meta@10.109.149.91:8005 (protocol: TCP; fallback route)
Dec 21 10:14:24 gpu210-06 abrt-hook-ccpp: Process 41886 (slurmd) of user 0 killed by SIGBUS - dumping core
Dec 21 10:14:24 gpu210-06 systemd: slurmd.service: main process exited, code=dumped, status=7/BUS

2. The installation links new binaries against an old version of the Slurm libraries, thus accessing incorrect memory addresses. Can you tell me briefly how you performed the upgrade?

3. There's a bug in one of your OS libraries calling a non-async-signal-safe function. Sorry to ask again, but a 'thread apply all bt full' would now give us the exact call when it happens. If it is different on each occasion, I'd probably discard this option.

> If there is a benefit of changing from HUP to USR2 for log rotation it won't
> be noticed until tomorrow at 4am.

There shouldn't be any difference. I tried this locally and sending a SIGHUP doesn't terminate slurmd.


See also bug 10496, bug 10488, bug 9827, and https://sourceware.org/bugzilla/show_bug.cgi?id=26713
Comment 22 Felip Moll 2020-12-23 06:10:28 MST
> If there is a benefit of changing from HUP to USR2 for log rotation it won't
> be noticed until tomorrow at 4am.

There shouldn't be any benefit for our issue; in my testbed I just sent a SIGHUP to my slurmd and it didn't die (as expected, the signal is masked).
Comment 24 Greg Wickham 2020-12-23 06:21:00 MST
Created attachment 17260 [details]
using "thread apply all bt full"
Comment 26 Greg Wickham 2020-12-23 06:35:50 MST
Hi Felip,

Answers:

1/ The slurm binaries are all coming off an NFS share, but we're not seeing any NFS issues.

2/ While possible, this is unlikely, as this site uses a Slurm build script that hasn't substantially changed for quite some time. I can share more about this if relevant.

3/ We did upgrade to CentOS 7.9 at the same time as moving to Slurm 20.11. We did bench 20.11 on CentOS 7.9 and this problem wasn't apparent. Checking the statistics since 17 Dec (when 20.11 went live), 72,435 jobs have been processed, with 88 affected by NODE_FAIL. Strangely, one particular user has been affected 46 times (over half the faults) - they are using python / pytorch. If it weren't for this user wondering why his jobs kept failing, we might not have noticed.
Comment 27 Greg Wickham 2020-12-23 06:56:27 MST
Created attachment 17261 [details]
ptype $_siginfo and p $_siginfo
Comment 28 Felip Moll 2020-12-23 09:26:05 MST
(In reply to Greg Wickham from comment #26)
> Hi Felip,
> 
> Answers:
> 
> 1/ The slurm binaries are all coming off an NFS share, but we're not seeing
> any NFS issues.

Can you check the slurmd binary's timestamp on the NFS share and correlate it with how long the slurmd daemons have been running on each node?

I want to see whether the binary on disk is newer than the one running on the nodes.

(https://rachelbythebay.com/w/2018/03/15/core/)
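
For example, on one node (a sketch using the binary path from this ticket):

$ BIN=/opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd
$ ls -l --time-style=full-iso "$BIN"        # mtime of the binary on the share
$ ps -o lstart= -p "$(pidof slurmd)"        # start time of the running daemon

If the binary's mtime is newer than the daemon's start time, the running slurmd is executing pages of a file that was replaced underneath it, which is exactly the scenario from the article above.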
Comment 29 Greg Wickham 2020-12-23 10:34:04 MST
Hi Felip,

$ ls -las /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd
1080 -rwxr-xr-x. 1 wickhagj g-wickhagj 1094696 Dec 13 00:48 /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-CentOS-7.9.2009-MLNX/sbin/slurmd

The issue has occurred on 22 nodes, and never the same node twice (as far as we can tell from the sacct records).

Of these 22 nodes, they were all booted on the 16th December or later

The date of the slurm install was at Dec 13 00:44 +/- a few minutes.

I've just gone and re-started all of the slurmd across the cluster. (If your assertion is correct then this should fix the issue).
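
For the record, a sketch of the cluster-wide restart (assuming ClusterShell is available; not necessarily the exact command we ran):

$ clush -bw "$(sinfo -ho %N)" 'systemctl restart slurmd'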

[2020-12-23T20:27:52.182] slurmd started on Wed, 23 Dec 2020 20:27:52 +0300

   -greg
Comment 30 Felip Moll 2020-12-23 11:43:17 MST
(In reply to Greg Wickham from comment #29)
> Hi Felip,
> 
> $ ls -las
> /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-
> CentOS-7.9.2009-MLNX/sbin/slurmd
> 1080 -rwxr-xr-x. 1 wickhagj g-wickhagj 1094696 Dec 13 00:48
> /opt/slurm/install/slurm-20.11.1-9d619f1921b63ce48ff2a6d17c5af5866b9c3976-
> CentOS-7.9.2009-MLNX/sbin/slurmd
> 
> The issue has occurred on 22 nodes, and never the same node twice (as far as
> we can tell from the sacct records).
> 
> Of these 22 nodes, they were all booted on the 16th December or later
> 
> The date of the slurm install was at Dec 13 00:44 +/- a few minutes.
> 
> I've just gone and re-started all of the slurmd across the cluster. (If your
> assertion is correct then this should fix the issue).
> 
> [2020-12-23T20:27:52.182] slurmd started on Wed, 23 Dec 2020 20:27:52 +0300
> 
>    -greg

Cool. Let me know if this fixes the issue.
Comment 31 Greg Wickham 2020-12-26 21:52:55 MST
Dear Felip,

After restarting slurmd there have been no further failures.

Please close the ticket.

with thanks,

   -greg
Comment 32 Felip Moll 2020-12-27 01:34:09 MST
(In reply to Greg Wickham from comment #31)
> Dear Felip,
> 
> After restarting slurmd there have been no further failures.
> 
> Please close the ticket.
> 
> with thanks,
> 
>    -greg

Thanks Greg. So I assume this was the issue; a binary changing underneath a running daemon tends to be a common cause of SIGBUS, since the process can fault the moment it pages in code from the modified executable.
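
One common way to avoid it in the future is to install each build into its own versioned directory and switch a symlink, rather than overwriting binaries that running daemons still have mapped. A sketch with hypothetical paths (your hash-named install directories suggest you already do something close to this):

$ make install prefix=/opt/slurm/install/slurm-20.11.2-<hash>   # new immutable tree
$ ln -sfn /opt/slurm/install/slurm-20.11.2-<hash> /opt/slurm/current
$ systemctl restart slurmd                                      # then restart on each node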

Closing the ticket then; let me know if you see this again.

Regards