Created attachment 13484 [details]
gdb trace of a core dump

Hi,

Over the past few days we have noticed slurmctld crashing several times a day. Please find attached the gdb trace from the generated core dump.

Thank you in advance,
adel
Adel,

Could you please attach the slurmctld log from the time of the failures?

You opened this as severity 4, which is fine, although I'll list our severity definitions below so you can check which one best describes the impact.

cheers,
Marcin

Severity 1 – Major Impact
A Severity 1 issue occurs when there is a continued system outage that affects a large set of end users. The system is down and nonfunctional due to Slurm problem(s) and no procedural workaround exists.

Severity 2 – High Impact
A Severity 2 issue is a high impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system.

Severity 3 – Medium Impact
A Severity 3 issue is a medium to low impact problem that includes partial noncritical loss of system access or which impairs some operations on the system but allows the end user to continue to function on the system with workarounds.

Severity 4 – Minor Issues
A Severity 4 issue is a minor issue with limited or no loss in functionality within the customer environment. Severity 4 issues may also be used for recommendations for future product enhancements or modifications.
Adel,

The core dump looks strange. It doesn't contain any indication of an abort, segmentation fault, or bus error. It's surprising that you have two threads waiting for a reply from munge. How many munge threads do you have? Can you execute the "remunge" command and share the results with us?

cheers,
Marcin
Created attachment 13500 [details]
slurmlog

> How many munge threads do you have?

16
Adel,

The logfile does indicate that there were some outages; however, they were not the result of a fatal error in slurmctld, since no fatal error was logged. Is it possible that slurmctld was killed directly, e.g. by kill -9 $(pidof slurmctld)?

How did you get the core dump? Can you share the full output from gdb?

Did you notice any oom-killer activity on the host - anything in the dmesg output?

Could you please check cgroup/security limits on the slurmctld process? Are you using the systemd unit files delivered with the Slurm sources or a modified version?

Two things visible in the logs (I don't expect them to be the root cause of the slurmctld crash, but still...) are:

1) error: chdir(/var/log): Permission denied
You should adjust the /var/log permissions to let the Slurm user make it its current working directory.

2) error: _shutdown_bu_thread:send/recv xyz02: Connection refused
It looks like you have a backup controller configured but it's not actually running. I'd recommend either starting it or removing it from the config.

cheers,
Marcin
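PS. For reference, a quick way to check the limits and cgroup of the running slurmctld process is via procfs and systemd (a sketch; it assumes the process is named slurmctld and is managed by a slurmctld unit):

#cat /proc/$(pidof slurmctld)/limits    (effective resource limits of the running process)
#cat /proc/$(pidof slurmctld)/cgroup    (cgroup membership of the process)
#systemctl cat slurmctld                (unit file actually in use, including any drop-in overrides)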
Adel,

Just a friendly reminder - could you please answer the questions from comment 4?

cheers,
Marcin
> Is it possible that slurmctld was killed by direct kill -9 $(pidof slurmctld)?

No, slurmctld wasn't killed, it crashed.

> How did you get the core dump? Can you share a full output from gdb?

slurmctld generated the core dump in /var/lib/systemd/coredump and we ran the following command to produce the gdb trace:

gdb slurmctld <core file> <<< "
set pagination off
set logging on core.xxxx.log
thread apply all bt
quit"

You already have the full gdb trace produced.

> Did you notice any oom-killer activity on the host - anything in dmesg output?

Nothing suspicious in dmesg, and the oom-killer wasn't triggered.

> Could you please check cgroup/security limits on slurmctld process?

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             2063504              2063504              processes
Max open files            65536                65536                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2063504              2063504              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

> Are you using systemd unit-files delivered with slurm sources or a modified version?

The same unit files delivered with the Slurm sources. Of course the paths are different.

> Two things visible in logs (I don't expect them to be a root cause for slurmctld crash, but still...) are:
> 1) error: chdir(/var/log): Permission denied
> You should adjust /var/log permissions to let Slurm user make it a current working directory.
> 2) error: _shutdown_bu_thread:send/recv xyz02: Connection refused
> It looks like you have a backup controller configured but it's not really running. I'd recommend either starting it or removing it from the config.

These lines don't seem to be relevant. They have always been there, and the crashes only started happening recently.
Adel,

Please attach the output of:
#coredumpctl info /PATH/TO/SLURMCTLD

Thanks for the ulimits. Can you check whether the slurmctld process is limited by any cgroup?

> These lines don't seem to be relevant. They have always been there and the crashes started happening recently.

I concur (comment 4). However, I still recommend fixing the incorrect (error-prone) configuration.

cheers,
Marcin
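PS. For reference, a couple of related coredumpctl invocations that can simplify the analysis (the executable path and output file below are placeholders):

#coredumpctl info /PATH/TO/SLURMCTLD
#coredumpctl gdb /PATH/TO/SLURMCTLD                           (open the most recent matching core directly in gdb)
#coredumpctl dump /PATH/TO/SLURMCTLD -o /tmp/slurmctld.core   (extract the core to a file)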
Created attachment 13807 [details]
coredumpctl info

> Please attach the output of:
> #coredumpctl info /PATH/TO/SLURMCTLD

Attached.

> Thanks for the ulimits. Can you check whether the slurmctld process is limited by any cgroup?

No, slurmctld is not limited by any cgroup.
Please load the core dump into gdb and attach the results of the following commands:

>(gdb)set print pretty on
>(gdb)print $_siginfo
>(gdb)info thread
>(gdb)t a a bt full

What is the glibc version you use?

cheers,
Marcin
Created attachment 13812 [details]
gdb+glibc

Please find attached the requested info/output.
Could you please share the result of:
journalctl --unit=slurmctld

Are you able to configure an audit rule monitoring signal activity? I'd be interested in setting something like:
auditctl -a exit,always -F arch=b64 -S kill -k audit_kill

This will log information about the process sending signal 11 to slurmctld, so we'll know where it came from. The log normally goes to /var/log/audit/audit.log.

It looks really strange, since the siginfo_t structure that was passed to the signal handler (registered via rt_sigaction) looks uninitialized. When a "Segmentation fault" happens, the pid of a kworker is stored there - the same applies to the kill command and glibc's sigqueue. In this case, we have some random negative value and all bits set for si_uid.

cheers,
Marcin
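PS. Once the rule is loaded, the matching records can be pulled out of the audit log by key (a sketch, assuming auditd writes to the default /var/log/audit/audit.log):

#ausearch -k audit_kill
or simply:
#grep audit_kill /var/log/audit/audit.log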
Adel,

Were you able to check journalctl and configure the audit rule?

cheers,
Marcin
Just a friendly reminder. Please reply to comment 11. In case of no reply, I'll close the ticket as timed out.

cheers,
Marcin
Hi Marcin,

We will update the ticket once slurmctld crashes again. In the meantime, we have enabled auditing.
What about `journalctl --unit=slurmctld`?

cheers,
Marcin
Created attachment 14068 [details]
journalctl

> What about `journalctl --unit=slurmctld`?

You can find it attached.
Created attachment 14236 [details]
audit file

Hi Marcin,

I have attached the audit output file.
Adel,

Thanks for audit.log - the case seems to be even more interesting. The only kill syscall activity logged was related to sshd (SIGTERM) and bash (SIGINT). I missed the fact that signals may also be delivered through the real-time signal syscalls. Please add the following audit rules:

#auditctl -a exit,always -F arch=b64 -S RT_SIGQUEUEINFO -k audit_kill
#auditctl -a exit,always -F arch=b64 -S RT_TGSIGQUEUEINFO -k audit_kill

Let me know when you notice slurmctld crash again. I hope we'll catch the sender this time. Once it happens, the line with "key=audit_kill" will contain ppid=XX and pid=YY numbers. Please check the processes behind those PIDs.

cheers,
Marcin
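PS. As an illustration only (XX and YY below stand for the numbers from the audit record), the processes behind those PIDs can be inspected with:

#ps -o pid,ppid,user,args -p YY
#ps -o pid,ppid,user,args -p XX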
Created attachment 14314 [details]
audit logs

Hi Marcin,

Please find attached the required audit logs with the additional info.
Created attachment 14315 [details] another audit log
Adel,

Thanks for the logs. I see two log entries related to slurmctld, but neither comes from the rules we added:

type=ANOM_ABEND msg=audit(1589940420.146:996483): auid=4294967295 uid=450 gid=450 ses=4294967295 pid=102789 comm="rpcmgr" exe="/cm/shared/apps/slurm/18.08.7/sbin/slurmctld" sig=11

One strange thing here is that normally I'd expect a "reason" field in the audit logs stating something like "memory violation".

It's just one of the possibilities, but a segmentation fault may also be the result of malfunctioning hardware - can you run memtester on the system?

Could you please load the related core dumps into gdb and share the output like in comment 9?

Please attach dmesg from the server.

cheers,
Marcin
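PS. A minimal example invocation of memtester, assuming it is installed; the amount of memory and the number of passes below are placeholders to adjust for the host:

#memtester 4G 3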
Created attachment 14319 [details]
core1

#auditctl -l
-a always,exit -F arch=x86_64 -S kill -F key=audit_kill
-a always,exit -F arch=x86_64 -S rt_sigqueueinfo -F key=audit_kill
-a always,exit -F arch=x86_64 -S rt_tgsigqueueinfo -F key=audit_kill
Created attachment 14320 [details] core2
Adel, By "like in comment 9" I meant: >Please load the core dump to gdb and attach results of the folowing commands: >>(gdb)set print pretty on >>(gdb)print $_siginfo >>(gdb)info thread >>(gdb)t a a bt full sorry for confusion. Please attach dmesg from the server. cheers, Marcin
Created attachment 14351 [details]
core3

Hi Marcin,

Please find the requested core dump attached.
Created attachment 14352 [details] dmesg
Created attachment 14356 [details]
testProgram.c

Adel,

The dmesg output contains some interesting information. It shows that the actual SEGV happens in libpthread (mostly when called from the slurm rpcmgr thread, but once it also happened from the slurm agent thread). The file is probably truncated, but the last occurrence there is from Feb 3, 2020.

Could you please (additionally to the previously requested commands) load the core dump into gdb and execute:
(gdb) p slurmctld_config.server_thread_count

Previously requested:
>By "like in comment 9" I meant:
>>Please load the core dump to gdb and attach results of the following commands:
>>>(gdb)set print pretty on
>>>(gdb)print $_siginfo
>>>(gdb)info thread
>>>(gdb)t a a bt full

What is the result of:
#readelf -x .interp /path/to/slurmctld

Could you please build and install the attached test program the same way as you do for Slurm and run it on the server where slurmctld executes, in a bash loop like:
#for i in $(seq 1 1000); do /tmp/pthreadTest ; done

Am I correct that the machine has ECC memory - could you please check the result of dmidecode -t 17?

Please enable the Protocol debug flag[1] and share slurmctld.log together with the gdb output with us when the issue happens again.

cheers,
Marcin

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags
Hi Marcin,

> Could you please (additionally to previously requested commands) load the coredump to gdb and execute:
> (gdb) p slurmctld_config.server_thread_count

So you would like the output of the following, correct?

(gdb)set print pretty on
(gdb)print $_siginfo
(gdb)info thread
(gdb)t a a bt full
(gdb)p slurmctld_config.server_thread_count

> Please enable Protocol debug flag[1] and share slurmctld.log together with gdb output with us when the issue happens again.

So we should add "DebugFlags=Protocol" in slurm.conf and reload the configs, correct?

I will work on getting the rest of the required output.
> So you would like the output of the following, correct?
> (gdb)set print pretty on
> (gdb)print $_siginfo
> (gdb)info thread
> (gdb)t a a bt full
> (gdb)p slurmctld_config.server_thread_count

Yes.

> So we should add "DebugFlags=Protocol" in slurm.conf and reload the configs, correct?

You can add it to slurm.conf (to make it permanent) and then just execute:
scontrol setdebugflags +Protocol

cheers,
Marcin
> You can add it to slurm.conf (to make it permanent) and then just execute:
> scontrol setdebugflags +Protocol

No extra debug level is required? SlurmctldDebug=verbose is enough?
> No extra debug level is required? SlurmctldDebug=verbose is enough?

Correct.
Created attachment 14383 [details]
logs

Hi Marcin,

Please find the required output attached.

readelf -x .interp $(which slurmctld)

Hex dump of section '.interp':
  0x00400238 2f6c6962 36342f6c 642d6c69 6e75782d /lib64/ld-linux-
  0x00400248 7838362d 36342e73 6f2e3200          x86-64.so.2.

I will attach the latest slurmctld.log as soon as I receive it.
Created attachment 14390 [details] slurmctld logs
Adel,

In the last attached dmesg we're seeing an indication of memory and potentially other hardware problems on the server:

>[867426.453322] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
>[867426.453326] {1}[Hardware Error]: It has been corrected by h/w and requires no further action

I think that you need to replace a failing DIMM. The error is correctable, but to my best knowledge and experience, the DIMM should be replaced. (This is obviously outside of SchedMD's expertise.)

>[867688.111552] mce: [Hardware Error]: Machine check events logged

To better understand the issues behind this one you can use mcelog; again, this is general system administration rather than Slurm support.

I think that both may be a reason for unexpected segfaults, but since the rpcmgr segfault happens at exactly the same address in libpthread-2.22.so I'm not definitive about that.

Additionally, we're seeing an indication of a SYN flood on the slurmctld port:

>[1953852.476028] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters

The easiest way to overcome this OS-level limitation is to set slurmctld to listen on multiple ports[1].

Unfortunately, the output from gdb is not complete. If you check the output of:
>(gdb) t a a bt full
we only have a backtrace of threads 48-41; threads 40-1 are missing, while the one of primary interest is thread 1. (This happened for some previous backtraces as well.) Could you please double-check that and attach the full output of the command?

Are you sure that the Protocol debug flag was enabled? I'd expect additional info messages in the log like: "_slurmctld_rpc_mgr: accept() connection from...".

I'm sorry I didn't ask specifically for "dmesg -T", but could you attach it together with the slurmctld.log containing Protocol messages? By default dmesg messages carry a timestamp relative to system boot time, which is hard to match with the date/time stamps in the logs.

I'm seeing some new hostnames in the log file; could you please attach the current slurm.conf? (I'm working off the one shared before in the other bug, but it looks outdated.)

Do you see any coincidence of the segfault/crash with configuration changes applied by scontrol reconfigure?

cheers,
Marcin

PS. I know that you don't think it's of any importance from Bug 7946, but I still recommend disabling the HA configuration if you're not running an HA pair. This leads to unnecessary network connection attempts and error messages in the log. The same applies to the /var/log permissions resulting in:
>error: chdir(/var/log): Permission denied

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldPort
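For completeness, a minimal slurm.conf sketch for listening on multiple ports; the exact range below is only an example and has to match whatever is reachable between the nodes and the controller:

SlurmctldPort=6817-6820

Note that a port change generally requires restarting the Slurm daemons with the updated slurm.conf rather than just a reconfigure.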
Created attachment 14479 [details] new logs
Adel,

Looking at the logs we see that the rpcmgr segfault happens at:

>[Fri May 29 04:06:41 2020] rpcmgr[123216]: segfault at 7fff9e6e6d13 ip 00007ffff75f9ee0 sp 00007fffee467cc0 error 4 in libpthread-2.22.so[7ffff75f1000+18000]

However, slurmctld starts just a few minutes before this event:

>[2020-05-29T04:01:06.830] slurmctld version 18.08.7 started on cluster slurm_cluster

And in the minute when the segfault happens, the only error logged is the connection refused to the backup controller:

>[2020-05-29T04:04:54.713] _job_complete: JobId=14549249_51(14552820) done
>[2020-05-29T04:06:08.036] error: _shutdown_bu_thread:send/recv xyz02: Connection refused
>[2020-05-29T04:11:08.417] error: _shutdown_bu_thread:send/recv xyz02: Connection refused
>[2020-05-29T04:16:08.770] error: _shutdown_bu_thread:send/recv xyz02: Connection refused

What is interesting is that, according to the logs, slurmctld continues execution without interruption after the rpcmgr segfault. That's really strange.

Unfortunately, the Protocol debug flag was enabled after the segfault:

>[2020-05-29T14:10:18.183] _slurm_rpc_reconfigure_controller: completed usec=40376

so we didn't get more detailed log messages from the rpcmgr thread. Did you notice a segfault with it enabled?

Could you please remove the backup controller configuration (as recommended before in Bug 7946 and here in comment 34)? I'm curious if this may be related. I'll try to analyze the code to look for some conditions where I can reproduce the issue.

Could you please check if `dmesg -T` timestamps are in sync with the OS time? Is the date correctly set on the system? I'm surprised that on 2020-06-02 you uploaded a log file ending on 2020-05-29.

I have to add that even if we find a bug causing the issue, we will not be providing a patch for 18.08, and since we're quite some time past its end of support I recommend an upgrade to a supported release.

cheers,
Marcin
One very strange thing visible in the backtrace is that you have 17 threads waiting for munge. Other (incomplete) backtraces indicate a similar situation. Did you notice any indication of failure in munge.log? Please attach it to the case so I can review it.

cheers,
Marcin
Created attachment 14545 [details]
logs-20200605

Hi Marcin,

Please find the required logs attached.
Adel,

I'm unable to trigger this segfault. If you wish to continue working on this, then I will need answers to comment 34 and comment 36. Up until now I have not received answers to these, so I cannot make any further progress until you provide this needed information.

cheers,
Marcin
Hi Marcin, I think I have answered all the questions. Can you please specify which questions are still not answered?
Could you please comment on the hardware issues I mentioned at the beginning of comment 34?

Please find quoted questions and recommendations from previous comments below:

> I'm seeing new hostnames in the log file, could you please attach the current slurm.conf? (I'm working off the one shared before in the other bug, but it looks outdated.)

> Do you see any coincidence of the segfault/crash with configuration changes applied by scontrol reconfigure?

> Could you please check if `dmesg -T` timestamps are in sync with the OS time? Is the date correctly set on the system?

> Additionally, we're seeing an indication of a SYN flood on the slurmctld port:
>>[1953852.476028] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters
> The easiest way to overcome this OS-level limitation is to set slurmctld to listen on multiple ports[1].

> [...] but I still recommend disabling the HA configuration if you're not running an HA pair. This leads to unnecessary network connection attempts and error messages in the log. The same applies to the /var/log permissions resulting in:
> error: chdir(/var/log): Permission denied

----

Additionally, could you please check the result of the `remunge` command?

cheers,
Marcin
Created attachment 14648 [details]
slurm.conf latest

> Could you please comment on the hardware issues I mentioned at the beginning of comment 34?

We believe that there are no hardware errors at the moment.

> I'm seeing new hostnames in the log file, could you please attach the current slurm.conf? (I'm working off the one shared before in the other bug, but it looks outdated.)

Attached.

> Do you see any coincidence of the segfault/crash with configuration changes applied by scontrol reconfigure?

No.

> Could you please check if `dmesg -T` timestamps are in sync with the OS time? Is the date correctly set on the system?

Yes, they are in sync. Yes, the date is set correctly.

> Additionally, we're seeing an indication of a SYN flood on the slurmctld port:
>>[1953852.476028] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters
> The easiest way to overcome this OS-level limitation is to set slurmctld to listen on multiple ports[1].

Yes, we will start using 4 ports soon.

> [...] but I still recommend disabling the HA configuration if you're not running an HA pair. This leads to unnecessary network connection attempts and error messages in the log.

We are using HA and we don't have plans to disable it soon. What we can do is start the slurmctld service on the passive (backup) head node to get rid of this message in the logs. If slurmctld crashes when the passive (backup) head node becomes unavailable or unreachable, then don't you think that this is a bug candidate in Slurm?

> The same applies to /var/log permissions resulting in:
> error: chdir(/var/log): Permission denied

The current permissions on /var/log are as follows:

ls -ld /var/log/
drwxr-xr-x 16 root root 4096 12 jun 11:17 /var/log/

It doesn't seem that Slurm has problems writing logs under /var/log, so we believe that this is just a false error in the logs. Besides, we don't plan to make /var/log world-readable. Why would Slurm want /var/log/ to be world-readable?

> Additionally could you please check the result of `remunge` command?

# remunge
2020-06-12 14:52:09 Spawning 1 thread for encoding
2020-06-12 14:52:09 Processing credentials for 1 second
2020-06-12 14:52:10 Processed 23645 credentials in 1.000s (23639 creds/sec)
Adel,

>> Could you please comment on the hardware issues I mentioned at the beginning of comment 34?
> We believe that there are no hardware errors at the moment.

How do you read:
>[867426.453322] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
>[867426.453326] {1}[Hardware Error]: It has been corrected by h/w and requires [...]
>[867688.111552] mce: [Hardware Error]: Machine check events logged

> Yes, they are in sync. [...]

Could you please just measure the difference and share it with me?
# date | tee /dev/kmsg
# dmesg -T | tail -1

>> The same applies to /var/log permissions resulting in:
>> error: chdir(/var/log): Permission denied
> It doesn't seem that Slurm has problems writing logs under /var/log, so we believe that this is just a false error in the logs. Besides, we don't plan to make /var/log world-readable. Why would Slurm want /var/log/ to be world-readable?

This error message comes from the code changing the slurmctld working directory (chdir) to the directory of the slurmctld logfile. You should be able to reproduce the issue by doing something like:
#sudo -u slurm /bin/bash
#cd /var/log

It's not about "world-readable" permissions, but about the permissions of the slurm user, which is not root in your case.

> If slurmctld will crash when the passive (backup) head node will become unavailable or unreachable, then don't you think that this is a bug candidate in Slurm?

We have been looking at this as a potential bug from the beginning of the ticket; however, we're unable to reproduce the issue. Did you get a similar segfault with the backup controller enabled, as you preferred?

cheers,
Marcin
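PS. One common way to avoid the chdir error without touching /var/log itself is to point the slurmctld log into a subdirectory owned by the slurm user (paths and user name below are illustrative):

#mkdir -p /var/log/slurm
#chown slurm:slurm /var/log/slurm

and then in slurm.conf:

SlurmctldLogFile=/var/log/slurm/slurmctld.log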
> Could you please just measure the difference and share it with me?
> # date | tee /dev/kmsg
> # dmesg -T | tail -1

ams01-bhn-01:~ # date | tee /dev/kmsg
Mon Jun 29 18:53:57 CEST 2020
ams01-bhn-01:~ # dmesg -T | tail -1
[Mon Jun 29 19:04:14 2020] Mon Jun 29 18:53:57 CEST 2020
ams01-bhn-01:~ #

We will start the slurmctld service on the passive head node and let you know if it crashes again.
Hi Adel,

I wanted to give you an update regarding this issue. As Marcin mentioned, we have not been able to reproduce it. With any bug we work on, we need a way to reproduce it so that we can fix it. We feel that we have exhausted all our avenues while trying to debug and reproduce this issue.

Furthermore, we would like you to move this system to a supported version of Slurm, since 18.08 reached its end of life for support back in February 2020. We would be more than happy to look at this issue again if you can reproduce it reliably on a supported version.

For now, I am resolving this issue.

Jason Booth
Director of Support, SchedMD LLC
Commercial Slurm Development and Support