Created attachment 13484 [details]
gdb trace of a core dump

Hi,

Over the past few days we have noticed slurmctld crashing several times a day. Please find attached the gdb trace from the generated core dump.

Thank you in advance,
adel
Adel,

Could you please attach the slurmctld log from the time of the failures?

You opened this as severity 4, which is fine, although I'll list our severity definitions below so you can check which one best describes the impact.

cheers,
Marcin

Severity 1 – Major Impact
A Severity 1 issue occurs when there is a continued system outage that affects a large set of end users. The system is down and nonfunctional due to Slurm problem(s) and no procedural workaround exists.

Severity 2 – High Impact
A Severity 2 issue is a high impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system.

Severity 3 – Medium Impact
A Severity 3 issue is a medium to low impact problem that includes partial noncritical loss of system access or which impairs some operations on the system but allows the end user to continue to function on the system with workarounds.

Severity 4 – Minor Issues
A Severity 4 issue is a minor issue with limited or no loss in functionality within the customer environment. Severity 4 issues may also be used for recommendations for future product enhancements or modifications.
Adel,

The core dump looks strange. It doesn't contain any indication of an abort, segmentation fault, or bus error. It's surprising that you have two threads waiting for a reply from munge. How many munge threads do you have? Can you execute the "remunge" command and share the results with us?

cheers,
Marcin
Created attachment 13500 [details]
slurmlog

> How many munge threads do you have?

16
Adel,

The logfile does indicate that there were some outages; however, they were not the result of a fatal error in slurmctld, since no fatal error was logged. Is it possible that slurmctld was killed directly, e.g. by kill -9 $(pidof slurmctld)?

How did you get the core dump? Can you share the full output from gdb?

Did you notice any oom-killer activity on the host - anything in the dmesg output?

Could you please check cgroup/security limits on the slurmctld process? Are you using the systemd unit files delivered with the Slurm sources or a modified version?

Two things visible in the logs (I don't expect them to be the root cause of the slurmctld crash, but still...) are:

1) error: chdir(/var/log): Permission denied
You should adjust the /var/log permissions to let the Slurm user make it its current working directory.

2) error: _shutdown_bu_thread:send/recv xyz02: Connection refused
It looks like you have a backup controller configured but it's not actually running. I'd recommend either starting it or removing it from the config.

cheers,
Marcin
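PS. For reference, a quick way to check the limits and cgroup of the running slurmctld process is via procfs and systemd (a sketch; it assumes the process is named slurmctld and is managed by a slurmctld unit):

#cat /proc/$(pidof slurmctld)/limits    (effective resource limits of the running process)
#cat /proc/$(pidof slurmctld)/cgroup    (cgroup membership of the process)
#systemctl cat slurmctld                (unit file actually in use, including any drop-in overrides)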
Adel,

Just a friendly reminder - could you please answer the questions from comment 4?

cheers,
Marcin
> Is it possible that slurmctld was killed by direct kill -9 $(pidof slurmctld)?

No, slurmctld wasn't killed, it crashed.

> How did you get the core dump? Can you share a full output from gdb?

slurmctld generated the core dump in /var/lib/systemd/coredump and we ran the following command to produce the gdb trace:

gdb slurmctld <core file> <<< "
set pagination off
set logging on core.xxxx.log
thread apply all bt
quit"

You already have the full gdb trace produced.

> Did you notice any oom-killer activity on the host - anything in dmesg output?

Nothing suspicious in dmesg, and the oom-killer wasn't triggered.

> Could you please check cgroup/security limits on slurmctld process?

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             2063504              2063504              processes
Max open files            65536                65536                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2063504              2063504              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

> Are you using systemd unit-files delivered with slurm sources or a modified version?

The same unit files delivered with the Slurm sources. Of course the paths are different.

> Two things visible in logs (I don't expect them to be a root cause for slurmctld crash, but still...) are:
> 1) error: chdir(/var/log): Permission denied
> You should adjust /var/log permissions to let Slurm user make it a current working directory.
> 2) error: _shutdown_bu_thread:send/recv xyz02: Connection refused
> It looks like you have a backup controller configured but it's not really running. I'd recommend either starting it or removing it from the config.

These lines don't seem to be relevant. They have always been there, and the crashes only started happening recently.
Adel,

Please attach the output of:
#coredumpctl info /PATH/TO/SLURMCTLD

Thanks for the ulimits. Can you check whether the slurmctld process is limited by any cgroup?

> These lines don't seem to be relevant. They have always been there and the crashes started happening recently.

I concur (comment 4). However, I still recommend fixing the incorrect (error-prone) configuration.

cheers,
Marcin
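PS. For reference, a couple of related coredumpctl invocations that can simplify the analysis (the executable path and output file below are placeholders):

#coredumpctl info /PATH/TO/SLURMCTLD
#coredumpctl gdb /PATH/TO/SLURMCTLD                           (open the most recent matching core directly in gdb)
#coredumpctl dump /PATH/TO/SLURMCTLD -o /tmp/slurmctld.core   (extract the core to a file)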
Created attachment 13807 [details]
coredumpctl info

> Please attach the output of:
> #coredumpctl info /PATH/TO/SLURMCTLD

Attached.

> Thanks for the ulimits. Can you check whether the slurmctld process is limited by any cgroup?

No, slurmctld is not limited by any cgroup.
Please load the core dump into gdb and attach the results of the following commands:

>(gdb)set print pretty on
>(gdb)print $_siginfo
>(gdb)info thread
>(gdb)t a a bt full

What is the glibc version you use?

cheers,
Marcin
Created attachment 13812 [details]
gdb+glibc

Please find attached the requested info/output.
Could you please share the result of:
journalctl --unit=slurmctld

Are you able to configure an audit rule monitoring signal activity? I'd be interested in setting something like:
auditctl -a exit,always -F arch=b64 -S kill -k audit_kill

This will log information about the process sending signal 11 to slurmctld, so we'll know where it came from. The log normally goes to /var/log/audit/audit.log.

It looks really strange, since the siginfo_t structure that was passed to the signal handler (registered via rt_sigaction) looks uninitialized. When a "Segmentation fault" happens, the pid of a kworker is stored there - the same applies to the kill command and glibc's sigqueue. In this case, we have some random negative value and all bits set for si_uid.

cheers,
Marcin
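PS. Once the rule is loaded, the matching records can be pulled out of the audit log by key (a sketch, assuming auditd writes to the default /var/log/audit/audit.log):

#ausearch -k audit_kill
or simply:
#grep audit_kill /var/log/audit/audit.log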
Adel,

Were you able to check journalctl and configure the audit rule?

cheers,
Marcin
Just a friendly reminder. Please reply to comment 11. In case of no reply, I'll close the ticket as timed out.

cheers,
Marcin
Hi Marcin,

We will update the ticket once slurmctld crashes again. In the meantime, we have enabled auditing.
What about `journalctl --unit=slurmctld`?

cheers,
Marcin
Created attachment 14068 [details]
journalctl

> What about `journalctl --unit=slurmctld`?

You can find it attached.
Created attachment 14236 [details]
audit file

Hi Marcin,

I have attached the audit output file.
Adel,

Thanks for audit.log - the case seems to be even more interesting. The only kill syscall activity logged was related to sshd (SIGTERM) and bash (SIGINT). I missed the fact that signals may also be delivered through the real-time signal syscalls. Please add the following audit rules:

#auditctl -a exit,always -F arch=b64 -S RT_SIGQUEUEINFO -k audit_kill
#auditctl -a exit,always -F arch=b64 -S RT_TGSIGQUEUEINFO -k audit_kill

Let me know when you notice slurmctld crash again. I hope we'll catch the sender this time. Once it happens, the line with "key=audit_kill" will contain ppid=XX and pid=YY numbers. Please check the processes behind those PIDs.

cheers,
Marcin
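PS. As an illustration only (XX and YY below stand for the numbers from the audit record), the processes behind those PIDs can be inspected with:

#ps -o pid,ppid,user,args -p YY
#ps -o pid,ppid,user,args -p XX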
Created attachment 14314 [details]
audit logs

Hi Marcin,

Please find attached the required audit logs with the additional info.
Created attachment 14315 [details] another audit log
Adel,

Thanks for the logs. I see two log entries related to slurmctld, but neither comes from the rules we added:

type=ANOM_ABEND msg=audit(1589940420.146:996483): auid=4294967295 uid=450 gid=450 ses=4294967295 pid=102789 comm="rpcmgr" exe="/cm/shared/apps/slurm/18.08.7/sbin/slurmctld" sig=11

One strange thing here is that normally I'd expect a "reason" field in the audit logs stating something like "memory violation".

It's just one of the possibilities, but a segmentation fault may also be the result of malfunctioning hardware - can you run memtester on the system?

Could you please load the related core dumps into gdb and share the output like in comment 9?

Please attach dmesg from the server.

cheers,
Marcin
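PS. A minimal example invocation of memtester, assuming it is installed; the amount of memory and the number of passes below are placeholders to adjust for the host:

#memtester 4G 3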
Created attachment 14319 [details]
core1

#auditctl -l
-a always,exit -F arch=x86_64 -S kill -F key=audit_kill
-a always,exit -F arch=x86_64 -S rt_sigqueueinfo -F key=audit_kill
-a always,exit -F arch=x86_64 -S rt_tgsigqueueinfo -F key=audit_kill
Created attachment 14320 [details] core2
Adel, By "like in comment 9" I meant: >Please load the core dump to gdb and attach results of the folowing commands: >>(gdb)set print pretty on >>(gdb)print $_siginfo >>(gdb)info thread >>(gdb)t a a bt full sorry for confusion. Please attach dmesg from the server. cheers, Marcin
Created attachment 14351 [details]
core3

Hi Marcin,

Please find the requested core dump attached.
Created attachment 14352 [details] dmesg
Created attachment 14356 [details]
testProgram.c

Adel,

The dmesg output contains some interesting information. It shows that the actual SEGV happens in libpthread (mostly when called from the slurm rpcmgr thread, but once it also happened from the slurm agent thread). The file is probably truncated, but the last occurrence there is from Feb 3, 2020.

Could you please (additionally to the previously requested commands) load the core dump into gdb and execute:
(gdb) p slurmctld_config.server_thread_count

Previously requested:
>By "like in comment 9" I meant:
>>Please load the core dump to gdb and attach results of the following commands:
>>>(gdb)set print pretty on
>>>(gdb)print $_siginfo
>>>(gdb)info thread
>>>(gdb)t a a bt full

What is the result of:
#readelf -x .interp /path/to/slurmctld

Could you please build and install the attached test program the same way as you do for Slurm and run it on the server where slurmctld executes, in a bash loop like:
#for i in $(seq 1 1000); do /tmp/pthreadTest ; done

Am I correct that the machine has ECC memory - could you please check the result of dmidecode -t 17?

Please enable the Protocol debug flag[1] and share slurmctld.log together with the gdb output with us when the issue happens again.

cheers,
Marcin

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_DebugFlags
Hi Marcin,

> Could you please (additionally to previously requested commands) load the coredump to gdb and execute:
> (gdb) p slurmctld_config.server_thread_count

So you would like the output of the following, correct?

(gdb)set print pretty on
(gdb)print $_siginfo
(gdb)info thread
(gdb)t a a bt full
(gdb)p slurmctld_config.server_thread_count

> Please enable Protocol debug flag[1] and share slurmctld.log together with gdb output with us when the issue happens again.

So we should add "DebugFlags=Protocol" in slurm.conf and reload the configs, correct?

I will work on getting the rest of the required output.
> So you would like the output of the following, correct?
> (gdb)set print pretty on
> (gdb)print $_siginfo
> (gdb)info thread
> (gdb)t a a bt full
> (gdb)p slurmctld_config.server_thread_count

Yes.

> So we should add "DebugFlags=Protocol" in slurm.conf and reload the configs, correct?

You can add it to slurm.conf (to make it permanent) and then just execute:
scontrol setdebugflags +Protocol

cheers,
Marcin
> You can add it to slurm.conf (to make it permanent) and then just execute:
> scontrol setdebugflags +Protocol

No extra debug level is required? SlurmctldDebug=verbose is enough?
> No extra debug level is required? SlurmctldDebug=verbose is enough?

Correct.
Created attachment 14383 [details]
logs

Hi Marcin,

Please find the required output attached.

readelf -x .interp $(which slurmctld)

Hex dump of section '.interp':
  0x00400238 2f6c6962 36342f6c 642d6c69 6e75782d /lib64/ld-linux-
  0x00400248 7838362d 36342e73 6f2e3200          x86-64.so.2.

I will attach the latest slurmctld.log as soon as I receive it.
Created attachment 14390 [details] slurmctld logs
Adel,

In the last attached dmesg we're seeing an indication of memory and potentially other hardware problems on the server:

>[867426.453322] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
>[867426.453326] {1}[Hardware Error]: It has been corrected by h/w and requires no further action

I think that you need to replace a failing DIMM. The error is correctable, but to my best knowledge and experience, the DIMM should be replaced. (This is obviously outside of SchedMD's expertise.)

>[867688.111552] mce: [Hardware Error]: Machine check events logged

To better understand the issues behind this one you can use mcelog; again, this is general system administration rather than Slurm support.

I think that both may be a reason for unexpected segfaults, but since the rpcmgr segfault happens at exactly the same address in libpthread-2.22.so I'm not definitive about that.

Additionally, we're seeing an indication of a SYN flood on the slurmctld port:

>[1953852.476028] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters

The easiest way to overcome this OS-level limitation is to set slurmctld to listen on multiple ports[1].

Unfortunately, the output from gdb is not complete. If you check the output of:
>(gdb) t a a bt full
we only have a backtrace of threads 48-41; threads 40-1 are missing, while the one of primary interest is thread 1. (This happened for some previous backtraces as well.) Could you please double-check that and attach the full output of the command?

Are you sure that the Protocol debug flag was enabled? I'd expect additional info messages in the log like: "_slurmctld_rpc_mgr: accept() connection from...".

I'm sorry I didn't ask specifically for "dmesg -T", but could you attach it together with the slurmctld.log containing Protocol messages? By default dmesg messages carry a timestamp relative to system boot time, which is hard to match with the date/time stamps in the logs.

I'm seeing some new hostnames in the log file; could you please attach the current slurm.conf? (I'm working off the one shared before in the other bug, but it looks outdated.)

Do you see any coincidence of the segfault/crash with configuration changes applied by scontrol reconfigure?

cheers,
Marcin

PS. I know that you don't think it's of any importance from Bug 7946, but I still recommend disabling the HA configuration if you're not running an HA pair. This leads to unnecessary network connection attempts and error messages in the log. The same applies to the /var/log permissions resulting in:
>error: chdir(/var/log): Permission denied

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldPort
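For completeness, a minimal slurm.conf sketch for listening on multiple ports; the exact range below is only an example and has to match whatever is reachable between the nodes and the controller:

SlurmctldPort=6817-6820

Note that a port change generally requires restarting the Slurm daemons with the updated slurm.conf rather than just a reconfigure.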
Created attachment 14479 [details] new logs
Adel,

Looking at the logs we see that the rpcmgr segfault happens at:

>[Fri May 29 04:06:41 2020] rpcmgr[123216]: segfault at 7fff9e6e6d13 ip 00007ffff75f9ee0 sp 00007fffee467cc0 error 4 in libpthread-2.22.so[7ffff75f1000+18000]

However, slurmctld starts just a few minutes before this event:

>[2020-05-29T04:01:06.830] slurmctld version 18.08.7 started on cluster slurm_cluster

And in the minute when the segfault happens, the only error logged is the connection refused to the backup controller:

>[2020-05-29T04:04:54.713] _job_complete: JobId=14549249_51(14552820) done
>[2020-05-29T04:06:08.036] error: _shutdown_bu_thread:send/recv xyz02: Connection refused
>[2020-05-29T04:11:08.417] error: _shutdown_bu_thread:send/recv xyz02: Connection refused
>[2020-05-29T04:16:08.770] error: _shutdown_bu_thread:send/recv xyz02: Connection refused

What is interesting is that, according to the logs, slurmctld continues execution without interruption after the rpcmgr segfault. That's really strange.

Unfortunately, the Protocol debug flag was enabled after the segfault:

>[2020-05-29T14:10:18.183] _slurm_rpc_reconfigure_controller: completed usec=40376

so we didn't get more detailed log messages from the rpcmgr thread. Did you notice a segfault with it enabled?

Could you please remove the backup controller configuration (as recommended before in Bug 7946 and here in comment 34)? I'm curious if this may be related. I'll try to analyze the code to look for some conditions where I can reproduce the issue.

Could you please check if `dmesg -T` timestamps are in sync with the OS time? Is the date correctly set on the system? I'm surprised that on 2020-06-02 you uploaded a log file ending on 2020-05-29.

I have to add that even if we find a bug causing the issue, we will not be providing a patch for 18.08, and since we're quite some time past its end of support I recommend an upgrade to a supported release.

cheers,
Marcin
One very strange thing visible in the backtrace is that you have 17 threads waiting for munge. Other (incomplete) backtraces indicate a similar situation. Did you notice any indication of failure in munge.log? Please attach it to the case so I can review it.

cheers,
Marcin
Created attachment 14545 [details]
logs-20200605

Hi Marcin,

Please find the required logs attached.
Adel,

I'm unable to trigger this segfault. If you wish to continue working on this, then I will need answers to comment 34 and comment 36. Up until now I have not received answers to these, so I cannot make any further progress until you provide this needed information.

cheers,
Marcin
Hi Marcin, I think I have answered all the questions. Can you please specify which questions are still not answered?
Could you please comment on the hardware issues I mentioned at the beginning of comment 34?

Please find quoted questions and recommendations from previous comments below:

> I'm seeing new hostnames in the log file, could you please attach the current slurm.conf? (I'm working off the one shared before in the other bug, but it looks outdated.)

> Do you see any coincidence of the segfault/crash with configuration changes applied by scontrol reconfigure?

> Could you please check if `dmesg -T` timestamps are in sync with the OS time? Is the date correctly set on the system?

> Additionally, we're seeing an indication of a SYN flood on the slurmctld port:
>>[1953852.476028] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters
> The easiest way to overcome this OS-level limitation is to set slurmctld to listen on multiple ports[1].

> [...] but I still recommend disabling the HA configuration if you're not running an HA pair. This leads to unnecessary network connection attempts and error messages in the log. The same applies to the /var/log permissions resulting in:
> error: chdir(/var/log): Permission denied

----

Additionally, could you please check the result of the `remunge` command?

cheers,
Marcin
Created attachment 14648 [details]
slurm.conf latest

> Could you please comment on the hardware issues I mentioned at the beginning of comment 34?

We believe that there are no hardware errors at the moment.

> I'm seeing new hostnames in the log file, could you please attach the current slurm.conf? (I'm working off the one shared before in the other bug, but it looks outdated.)

Attached.

> Do you see any coincidence of the segfault/crash with configuration changes applied by scontrol reconfigure?

No.

> Could you please check if `dmesg -T` timestamps are in sync with the OS time? Is the date correctly set on the system?

Yes, they are in sync. Yes, the date is set correctly.

> Additionally, we're seeing an indication of a SYN flood on the slurmctld port:
>>[1953852.476028] TCP: request_sock_TCP: Possible SYN flooding on port 6817. Sending cookies. Check SNMP counters
> The easiest way to overcome this OS-level limitation is to set slurmctld to listen on multiple ports[1].

Yes, we will start using 4 ports soon.

> [...] but I still recommend disabling the HA configuration if you're not running an HA pair. This leads to unnecessary network connection attempts and error messages in the log.

We are using HA and we don't have plans to disable it soon. What we can do is start the slurmctld service on the passive (backup) head node to get rid of this message in the logs. If slurmctld crashes when the passive (backup) head node becomes unavailable or unreachable, then don't you think that this is a bug candidate in Slurm?

> The same applies to /var/log permissions resulting in:
> error: chdir(/var/log): Permission denied

The current permissions on /var/log are as follows:

ls -ld /var/log/
drwxr-xr-x 16 root root 4096 12 jun 11:17 /var/log/

It doesn't seem that Slurm has problems writing logs under /var/log, so we believe that this is just a false error in the logs. Besides, we don't plan to make /var/log world-readable. Why would Slurm want /var/log/ to be world-readable?

> Additionally could you please check the result of `remunge` command?

# remunge
2020-06-12 14:52:09 Spawning 1 thread for encoding
2020-06-12 14:52:09 Processing credentials for 1 second
2020-06-12 14:52:10 Processed 23645 credentials in 1.000s (23639 creds/sec)
Adel,

>> Could you please comment on the hardware issues I mentioned at the beginning of comment 34?
> We believe that there are no hardware errors at the moment.

How do you read:
>[867426.453322] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
>[867426.453326] {1}[Hardware Error]: It has been corrected by h/w and requires [...]
>[867688.111552] mce: [Hardware Error]: Machine check events logged

> Yes, they are in sync. [...]

Could you please just measure the difference and share it with me?
# date | tee /dev/kmsg
# dmesg -T | tail -1

>> The same applies to /var/log permissions resulting in:
>> error: chdir(/var/log): Permission denied
> It doesn't seem that Slurm has problems writing logs under /var/log, so we believe that this is just a false error in the logs. Besides, we don't plan to make /var/log world-readable. Why would Slurm want /var/log/ to be world-readable?

This error message comes from the code changing the slurmctld working directory (chdir) to the directory of the slurmctld logfile. You should be able to reproduce the issue by doing something like:
#sudo -u slurm /bin/bash
#cd /var/log

It's not about "world-readable" permissions, but about the permissions of the slurm user, which is not root in your case.

> If slurmctld will crash when the passive (backup) head node will become unavailable or unreachable, then don't you think that this is a bug candidate in Slurm?

We have been looking at this as a potential bug from the beginning of the ticket; however, we're unable to reproduce the issue. Did you get a similar segfault with the backup controller enabled, as you preferred?

cheers,
Marcin
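PS. One common way to avoid the chdir error without touching /var/log itself is to point the slurmctld log into a subdirectory owned by the slurm user (paths and user name below are illustrative):

#mkdir -p /var/log/slurm
#chown slurm:slurm /var/log/slurm

and then in slurm.conf:

SlurmctldLogFile=/var/log/slurm/slurmctld.log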
> Could you please just measure the difference and share it with me?
> # date | tee /dev/kmsg
> # dmesg -T | tail -1

ams01-bhn-01:~ # date | tee /dev/kmsg
Mon Jun 29 18:53:57 CEST 2020
ams01-bhn-01:~ # dmesg -T | tail -1
[Mon Jun 29 19:04:14 2020] Mon Jun 29 18:53:57 CEST 2020
ams01-bhn-01:~ #

We will start the slurmctld service on the passive head node and let you know if it crashes again.
Hi Adel,

I wanted to give you an update regarding this issue. As Marcin mentioned, we have not been able to reproduce it. With any bug we work on, we need a way to reproduce it so that we can fix it. We feel that we have exhausted all our avenues while trying to debug and reproduce this issue.

Furthermore, we would like you to move this system to a supported version of Slurm, since 18.08 reached its end of life for support back in February 2020. We would be more than happy to look at this issue again if you can reproduce it reliably on a supported version.

For now, I am resolving this issue.

Jason Booth
Director of Support, SchedMD LLC
Commercial Slurm Development and Support