Ticket 6572

Summary: slurmctld keeps crashing
Product: Slurm Reporter: whong
Component: slurmctld Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact    
Priority: --- CC: bart
Version: 18.08.5   
Hardware: Linux   
OS: Linux   
Site: Swinburne
Attachments: all errors in slurmctld on day 23 Feb
core dump of slurmctld on 5 Jan
core dump of slurmctld on 2 Jan
another core dump of slurmctld on 2 Jan
core dump of slurmctld on 1 Jan
core dump on 1 Jan
core dump on 5 Jan
core dump on 2 Jan
another core dump on 2 Jan
slurm.conf
system log file, slurmctld file on those crashed days 1/2/5 Jan & 23 Feb as well as current dmesg
gdb result for three core dump files
new gdb for those 3 core dump files as requested

Description whong 2019-02-24 16:21:19 MST
Created attachment 9274 [details]
all errors in slurmctld on day 23 Feb

For the last two months, our slurmctld has been crashing every few days, and sometimes many times per day. On 23 Feb it crashed 6 times. To avoid an outage, we set up a script to automatically restart slurmctld whenever this happens. There is no core dump created.

Here is a record of the crashes that happened on that day; attached is a file of the errors logged in slurmctld.log.

Feb 23 00:43:42 transom1 slurmctld: restarted
Feb 23 04:02:12 transom1 slurmctld: restarted
Feb 23 06:31:17 transom1 slurmctld: restarted
Feb 23 12:22:56 transom1 slurmctld: restarted
Feb 23 19:23:20 transom1 slurmctld: restarted
Feb 23 20:19:23 transom1 slurmctld: restarted

I have drained the unresponsive nodes this morning. I hope that will help.

I am new to Slurm, so please include detailed troubleshooting steps in your response. I appreciate it.

Thanks,

Wei
Comment 1 whong 2019-02-24 19:43:12 MST
Created attachment 9275 [details]
core dump of slurmctld on 5 Jan
Comment 2 whong 2019-02-24 19:43:59 MST
Created attachment 9276 [details]
core dump of slurmctld on 2 Jan
Comment 3 whong 2019-02-24 19:44:27 MST
Created attachment 9277 [details]
another core dump of slurmctld on 2 Jan
Comment 4 whong 2019-02-24 19:45:09 MST
Created attachment 9278 [details]
core dump of slurmctld on 1 Jan
Comment 5 Dominik Bartkiewicz 2019-02-25 00:26:28 MST
Hi

Without the matching binaries, a core file is useless to us.
Could you generate a backtrace on the slurmctld host?

eg:
gdb -batch -ex "thread apply all bt full" <slurmctld path> <coredump>


Dominik
Comment 6 whong 2019-02-25 03:48:58 MST
Hi Dominik,

I did attach 4 core dump files, though they are not from the latest crashes. Did you get a chance to check them out?

Thanks,

Wei

Comment 7 Jason Booth 2019-02-25 09:10:43 MST
Hi Wei,

Dominik meant to say that core files without the matching binaries are useless to us, since in most cases we cannot get a good backtrace from them. This is in part due to the OS and library differences between our systems.


Please generate a backtrace on the slurmctld host, as mentioned in comment #5.

eg:
gdb -batch -ex "thread apply all bt full" <slurmctld path> <coredump>


It would also be helpful to see backtraces from a few of your core files, so that we can determine whether they are all crashing in the same place.

-Jason
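A loop like the following can generate one backtrace file per dump (a sketch only; the slurmctld install path and the core.* naming are assumptions, so adjust them to your actual layout):

```shell
#!/bin/sh
# Sketch: batch-generate full backtraces for every core file in the
# current directory. The binary path and core.* glob are assumptions.
SLURMCTLD=/apps/slurm/latest/sbin/slurmctld
OUTDIR=bt-output
mkdir -p "$OUTDIR"

for core in core.*; do
  [ -e "$core" ] || continue          # glob matched nothing; skip
  command -v gdb >/dev/null || { echo "gdb not installed" >&2; break; }
  # One backtrace file per dump, e.g. bt-output/core.173981.txt
  gdb -batch -ex "thread apply all bt full" "$SLURMCTLD" "$core" \
      > "$OUTDIR/$core.txt" 2>&1
done
```

The resulting bt-output/*.txt files can then be attached to the ticket.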
Comment 9 whong 2019-02-25 16:00:38 MST
Created attachment 9295 [details]
core dump on 1 Jan
Comment 10 whong 2019-02-25 16:01:09 MST
Created attachment 9296 [details]
core dump on 5 Jan
Comment 11 whong 2019-02-25 16:01:40 MST
Created attachment 9297 [details]
core dump on 2 Jan
Comment 12 whong 2019-02-25 16:02:03 MST
Created attachment 9298 [details]
another core dump on 2 Jan
Comment 13 whong 2019-02-25 16:05:28 MST
Hi Dominik/Jason,

Thanks for your reply. Sorry I misunderstood earlier. Now please see the attached core dump files.

Thanks,
Wei
Comment 14 Michael Hinton 2019-02-25 16:10:31 MST
Hi Wei,

Would you mind also attaching your slurm.conf?

Thanks,
Michael
Comment 15 whong 2019-02-25 16:21:20 MST
Created attachment 9299 [details]
slurm.conf
Comment 16 whong 2019-02-25 16:24:06 MST
Also be aware that the core dump files from 1/2 Jan are from Slurm v18.08.3, while the 5 Jan one is from v18.08.4. We are currently running v18.08.5.
Comment 17 Michael Hinton 2019-02-25 16:42:50 MST
Hi Wei,

We'll take a look at those old core dump stack traces. In the meantime, could you get a stack trace from a recent core dump file produced while on 18.08.5, and give us the slurmctld.log, syslog, and output of dmesg as well? That would help us a lot.

One thing to note about your four old stack traces is that one of them crashed in a different way than the other three, so there could be more than one issue.

Thanks,
Michael
Comment 18 whong 2019-02-25 17:02:28 MST
Hi Michael,

There have not been any more core dump files created since 5 Jan, which is why I uploaded those files from Jan. I assume the cause is the same, since we haven't done anything to fix it apart from two upgrades. I noticed that one of them is due to signal 11 instead of signal 6.

I will try to collect other logs for you.

Thanks,
Wei
Comment 19 whong 2019-02-25 17:42:23 MST
Created attachment 9306 [details]
system log file, slurmctld file on those crashed days 1/2/5 Jan & 23 Feb as well as current dmesg

Please see attached system log file, slurmctld file on those crashed days 1/2/5 Jan & 23 Feb as well as current dmesg.

Thanks,
Wei
Comment 20 Michael Hinton 2019-02-27 13:04:50 MST
Wei,

I'm confused about why you said "on 23 Feb it crashed 6 times." slurmctld-20190223.log shows slurmctld being sent a SIGTERM and then starting again; it is not aborting or segfaulting here. Maybe your script is restarting slurmctld prematurely.

As for the core dumps: Can you run these GDB commands for me?

    gdb -batch -ex "frame 4" -ex "print item" -ex "frame 5" -ex "print job_entry" <slurmctld path> <gdb.173981 path> 

    gdb -batch -ex "frame 4" -ex "print item" -ex "frame 5" -ex "print job_entry" <slurmctld path> <gdb.258485 path> 

    gdb -batch -ex "frame 4" -ex "print item" -ex "frame 5" -ex "print job_entry" <slurmctld path> <gdb.221758 path> 

Of course, you can interactively run the "frame" and "print" commands while in gdb, if you prefer. I just need to know what item and job_entry look like for each of the aborts.

As for the segfault in gdb.282452... this is still a mystery. It's segfaulting inside libc's malloc_consolidate() function. Most of the things I've read about this kind of error indicate that there was probably a buffer overrun in memory somewhere, corrupting memory management structures used by libc.

It's possible that Slurm's to blame, or maybe even your lua job submit script or something else. It's hard to say without duplicating it. Have you changed your job submit plugin recently?

See
* https://stackoverflow.com/questions/6725164/segmentation-fault-in-malloc-consolidate-malloc-c-that-valgrind-doesnt-detect
* https://stackoverflow.com/questions/14820533/malloc-causing-segmentation-fault-by-int-malloc
* https://stackoverflow.com/questions/15915920/segfault-of-malloc-consolidate-from-gethostbyname
* https://stackoverflow.com/questions/3100193/segfaults-in-malloc-and-malloc-consolidate

Thanks,
Michael
Comment 21 whong 2019-02-27 15:43:02 MST
Created attachment 9346 [details]
gdb result for three core dump files
Comment 22 whong 2019-02-27 15:49:25 MST
Here is the script we use to restart slurmctld. Basically, it just starts the daemon once it finds it is gone.

====================
#!/bin/sh

while true; do
  p=`pgrep slurmctld`
  if [ "$p" = "" ]; then
     /apps/slurm/latest/sbin/slurmctld < /dev/null
     echo restarted
     echo restarted | logger -t slurmctld
  fi
  sleep 30
done
======================
Comment 23 Michael Hinton 2019-02-27 16:08:04 MST
Sorry, I gave you the wrong commands earlier, so the gdb output that you sent is not usable.

Could you do this instead?

    gdb <slurmctld path> <gdb.173918 path>
    frame 4
    print *item
    print **item
    frame 5
    print *job_entry
    
The * and ** should properly dereference the pointers and print their contents. Then type `quit` to close out gdb. Could you repeat this for gdb.258485 and gdb.221758 as well? Thanks.
    
Feel free to paste the text here in Bugzilla; it's easier for me than trying to unpack an attached file.
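The interactive session above can also be run non-interactively; here is a sketch, with the slurmctld path as a placeholder and the core file names taken from this thread:

```shell
#!/bin/sh
# Batch-mode sketch of the interactive gdb session above.
# /apps/slurm/latest/sbin/slurmctld and the core names are placeholders.
for core in gdb.173918 gdb.258485 gdb.221758; do
  [ -e "$core" ] || continue             # skip cores that are not present
  gdb -batch \
      -ex "frame 4" -ex "print *item" -ex "print **item" \
      -ex "frame 5" -ex "print *job_entry" \
      /apps/slurm/latest/sbin/slurmctld "$core" > "print-$core.txt" 2>&1
done
```

Each print-gdb.*.txt file then holds the item/job_entry output for one dump.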
Comment 24 Michael Hinton 2019-02-27 17:17:17 MST
(In reply to whong from comment #22)
> Here is the script to restart our slurmctld. basically just start the daemon
> once find it is gone.
> 
> ====================
> #!/bin/sh
> 
> while true; do
>   p=`pgrep slurmctld`
>   if [ "$p" = "" ]; then
>      /apps/slurm/latest/sbin/slurmctld < /dev/null
>      echo restarted 
>      echo restarted | logger -t slurmctld 
>      sleep 30 
>      echo
> ======================
Looking at messages-20190223, I see this:

    Feb 23 06:31:16 transom1 systemd: Stopping Slurm controller daemon...
    Feb 23 06:31:17 transom1 slurmctld: restarted
    Feb 23 06:31:17 transom1 systemd: Starting Slurm controller daemon...

I'm assuming systemd is the one stopping the slurmctld. Then your script senses this and tries to restart slurmctld, when I think systemd is already trying to do that. That's probably what's causing this line:

    Feb 23 06:31:17 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.

At any rate, isn't the whole point of using systemd to help monitor processes in case they crash? I would recommend getting rid of your bash script and instead use systemd to monitor slurmctld.

You probably want to add `Restart=always` to the slurmctld systemd service file that you apparently already have somewhere.

See https://www.digitalocean.com/community/tutorials/how-to-configure-a-linux-service-to-start-automatically-after-a-crash-or-reboot-part-1-practical-examples#auto-starting-services-with-systemd.
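A minimal sketch of that change as a systemd drop-in (the unit name slurmctld.service and the drop-in path are assumptions; `systemctl edit slurmctld` will create the file for you):

```ini
# /etc/systemd/system/slurmctld.service.d/restart.conf  (assumed path)
[Service]
Restart=always
RestartSec=5
```

After saving, run `systemctl daemon-reload` so systemd picks up the override.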
Comment 25 whong 2019-02-27 17:34:34 MST
Created attachment 9350 [details]
new gdb for those 3 core dump files as requested
Comment 26 Michael Hinton 2019-02-27 18:04:52 MST
(In reply to whong from comment #25)
> Created attachment 9350 [details]
> new gdb for those 3 core dump files as requested
Thanks!

So, looking at these three stack traces, the aborts are happening within libc, during some kind of free (_int_free()). This is actually a very similar situation to the segfault stack trace, except that libc is able to abort before segfaulting. This issue is also likely caused by memory corruption happening somewhere else. See https://stackoverflow.com/questions/2334352/why-do-i-get-a-sigabrt-here

You could run slurmctld under valgrind and hope that it detects the moment memory first gets corrupted, so you can find out what code is responsible. I'm not sure what else can be done, especially since you haven't seen this problem in a month; for all we know, the issue could already have been fixed in 18.08.5.
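A sketch of such a valgrind run (the install path and flags are assumptions; -D keeps slurmctld in the foreground so valgrind can track it, and expect a substantial slowdown on a production controller):

```shell
#!/bin/sh
# Sketch: run slurmctld under valgrind's memcheck so memory corruption is
# reported at the point it first occurs. The install path is an assumption.
SLURMCTLD=/apps/slurm/latest/sbin/slurmctld

if command -v valgrind >/dev/null && [ -x "$SLURMCTLD" ]; then
  valgrind --tool=memcheck --track-origins=yes --leak-check=full \
           --log-file=/tmp/valgrind-slurmctld.%p.log \
           "$SLURMCTLD" -D
else
  echo "skipping: valgrind or $SLURMCTLD not available" >&2
fi
```

The %p in --log-file expands to the PID, giving one log per slurmctld process.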

As for the recent (Feb 23) slurmctld outages, do you agree that it is being restarted by systemd and isn't really crashing?
Comment 27 whong 2019-02-27 18:57:03 MST
Hi Michael,

Yes, we use the script to manage a crashed slurmctld because all our partitions are set not to start by default, so we need to manually enable each partition after slurmctld is up.

I agree that the restart events mentioned in the log are triggered by systemd, though I am not aware of anyone doing this at 6:31 am, which is quite early. The script is not perfect, but my colleagues don't feel safe setting those partitions to be up by default.

Also, I don't think all of the restart events triggered by our script were caused by systemd.

Thanks,

Wei

Comment 28 Michael Hinton 2019-03-01 18:36:47 MST
(In reply to whong from comment #27)
> I agree that the mentioned restarted event in the log is triggered by
> systemd. Though I am not aware of anyone did this at 6:31am, which is quite
> early. The script is not perfect, but my colleague don't feel safe to set
> those partitions to be on by default.
> 
> Plus I don't think all those script triggered restarted event is caused by
> systemd.
I think systemd is causing *ALL* the restarts, not just the one I posted. messages-20190223 shows that every restart event had something to do with systemd.

The point is, Slurm isn't crashing, as far as I can tell. If it crashed (segfault or abort), then there would have been a core dump file and maybe some errors in the logs. 

My advice would be to turn off the systemd slurmctld service file and your custom restart script to prove that Slurm is really crashing.

It's possible that your restart script and systemd are negatively interfering with each other, since they are both apparently trying to restart slurmctld at the same time.

Until you figure that out, I don't think there is much else I can do to help.

Thanks,
-Michael

messages-20190223
=======================================

Feb 22 17:02:26 transom1 systemd: Stopping Slurm controller daemon...
Feb 22 17:02:26 transom1 slurmctld: restarted
Feb 22 17:02:27 transom1 systemd: Starting Slurm controller daemon...
Feb 22 17:02:27 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 22 17:02:27 transom1 systemd: Started Slurm controller daemon.

Feb 22 18:43:39 transom1 systemd: Stopping Slurm controller daemon...
Feb 22 18:43:39 transom1 slurmctld: restarted
Feb 22 18:43:39 transom1 systemd: Starting Slurm controller daemon...
Feb 22 18:43:39 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 22 18:43:39 transom1 systemd: Started Slurm controller daemon.

Feb 22 22:30:29 transom1 slurmctld: restarted
Feb 22 22:30:29 transom1 systemd: Stopping Slurm controller daemon...
Feb 22 22:30:29 transom1 systemd: Starting Slurm controller daemon...
Feb 22 22:30:29 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 22 22:30:29 transom1 systemd: Started Slurm controller daemon.

Feb 23 00:43:42 transom1 slurmctld: restarted
Feb 23 00:43:42 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 00:43:43 transom1 systemd: Starting Slurm controller daemon...
Feb 23 00:43:43 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 23 00:43:43 transom1 systemd: Started Slurm controller daemon.

Feb 23 04:02:12 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 04:02:12 transom1 slurmctld: restarted
Feb 23 04:02:12 transom1 systemd: Starting Slurm controller daemon...
Feb 23 04:02:12 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 23 04:02:12 transom1 systemd: Started Slurm controller daemon.

Feb 23 06:31:16 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 06:31:17 transom1 slurmctld: restarted
Feb 23 06:31:17 transom1 systemd: Starting Slurm controller daemon...
Feb 23 06:31:17 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 23 06:31:17 transom1 systemd: Started Slurm controller daemon.

Feb 23 12:22:56 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 12:22:56 transom1 slurmctld: restarted
Feb 23 12:22:56 transom1 systemd: Starting Slurm controller daemon...
Feb 23 12:22:56 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 23 12:22:56 transom1 systemd: Started Slurm controller daemon.

Feb 23 19:23:20 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 19:23:20 transom1 slurmctld: restarted
Feb 23 19:23:20 transom1 systemd: Starting Slurm controller daemon...
Feb 23 19:23:20 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 23 19:23:20 transom1 systemd: Started Slurm controller daemon.

Feb 23 20:19:23 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 20:19:23 transom1 slurmctld: restarted
Feb 23 20:19:23 transom1 systemd: Starting Slurm controller daemon...
Feb 23 20:19:23 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 23 20:19:23 transom1 systemd: Started Slurm controller daemon.
Comment 29 whong 2019-03-03 15:44:23 MST
Hi Michael,

Understood. You may close this ticket. We will modify our script and continue monitoring the system.

Thanks for all the help.

Kind regards,
Wei

Comment 30 Michael Hinton 2019-03-04 08:49:43 MST
(In reply to whong from comment #29)
> Hi Michael,
> 
> Understood. You may close this job. We will modify our script & continue
> monitor the system.
> 
> Thanks for all the help.
> 
> Kind regards,
> Wei
No problem. Feel free to reopen this if anything changes or you find new information.

Thanks,
Michael