Ticket 2274

Summary: suspended job got resumed for no reason after slurmctld restart
Product: Slurm Reporter: Sergey Meirovich <sergey_meirovich>
Component: slurmdAssignee: Moe Jette <jette>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: tim
Version: 15.08.6   
Hardware: Linux   
OS: Linux   
Site: AMAT Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 15.08.7
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: dcalph001 slurmd log
slurm.conf
slurmctld log
fix for bug

Description Sergey Meirovich 2015-12-22 00:22:38 MST
Created attachment 2532 [details]
dcalph001 slurmd log

Hi,

I have recently upgraded 15.08.5 => 15.08.6 as per:
http://slurm.schedmd.com/quickstart_admin.html#upgrade

After the upgrade observed clear anomaly

======================================================================================
-sh-4.1$ squeue | grep 001
727   e154414  R  normal       wrap      (null)    Dec  8 12:54 20   1     dcalph001                                         
1131  e154414  S  low          wrap      (null)    Dec  1 22:28 36   1     dcalph001                                         
-sh-4.1$ scontrol show node=dcalph001
NodeName=dcalph001 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=36 CPUErr=0 CPUTot=36 CPULoad=56.00 Features=E5-2699v3,128G,cae
   Gres=(null)
   NodeAddr=dcalph001 NodeHostName=dcalph001 Version=15.08
   OS=Linux RealMemory=128951 AllocMem=0 FreeMem=58209 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=1951 Weight=1 Owner=N/A
   BootTime=Nov 19 12:04 SlurmdStartTime=Dec 21  7:43
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

-sh-4.1$ 
======================================================================================

Please note: "CPUAlloc=36 ... CPULoad=56.00" I have checked the situation on host=dcalph001  and indeed JobID=1131 is running despite that slurmctld thinks that it is suspended. Moreover that is clear from slurmd log od dcalph001 that JobID=1131  was deliberately resumed by slurmd around slurmctld restart.

======================================================================================
[root@dcalph001 ~]# grep 1131.0 /var/log/slurmd  | tail -n 37
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112291 (vasp_parallel)
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112292 (vasp_parallel)
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112293 (vasp_parallel)
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112294 (vasp_parallel)
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112295 (vasp_parallel)
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112296 (vasp_parallel)
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112297 (vasp_parallel)
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112298 (vasp_parallel)
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112299 (vasp_parallel)
[2015-12-21T09:42:32.087] [[2015-12-21T09:42:32.087] [1131] R1131.0] Sending signal 18 to pid 112300 (vasp_parallel)
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112301 (vasp_parallel)
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112302 (vasp_parallel)
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112303 (vasp_parallel)
[2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112304 (vasp_parallel)
[2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112305 (vasp_parallel)
[2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112306 (vasp_parallel)
[2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112307 (vasp_parallel)
[2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112308 (vasp_parallel)
[2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112309 (vasp_parallel)
[2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112310 (vasp_parallel)
[2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112311 (vasp_parallel)
[2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112312 (vasp_parallel)
[2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112313 (vasp_parallel)
[2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112314 (vasp_parallel)
[2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112315 (vasp_parallel)
[2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112316 (vasp_parallel)
[2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112317 (vasp_parallel)
[2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112318 (vasp_parallel)
[2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112319 (vasp_parallel)
[2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112320 (vasp_parallel)
[2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112321 (vasp_parallel)
[2015-12-21T09:42:32.091] [1131.0] Sending signal 18 to pid 112322 (vasp_parallel)
[2015-12-21T09:42:32.091] [1131.0] Sending signal 18 to pid 112323 (vasp_parallel)
[2015-12-21T09:42:32.091] [1131.0] Sending signal 18 to pid 112324 (vasp_parallel)
[2015-12-21T09:42:32.091] [1131.0] Sending signal 18 to pid 112325 (vasp_parallel)
[2015-12-21T09:42:32.092] [1131.0] Sending signal 18 to pid 112326 (vasp_parallel)
[2015-12-21T09:42:32.109] [1131.0] Resumed 1131.0
[root@dcalph001 ~]# 
======================================================================================


slurmctld recovered job as suspended (State=0x2):
======================================================================================
[root@DCALPH000 ~]# grep 1131 /var/log/slurmctld | tail -n 2
[2015-12-21T07:42:31.049] recovered job step 1131.0
[2015-12-21T07:42:31.049] Recovered JobID=1131 State=0x2 NodeCnt=0 Assoc=166
[root@DCALPH000 ~]# 
======================================================================================

Please be not confused by 2 hours difference. That happened because we also changed CST=>PST timezone on our servers. Since slurmctld had been restarted first it was already reporting PST while slurmd was still technically in CST at that time.


Not sure if that 15.08.6 or 15.08.5 issue though. Because when that wrong resume happened slurmcld was at 15.08.6 and affected slurmd was still at 15.08.5.

Attaching slurm.conf/slurmd log from dcalph001/slurmctld log here
Comment 1 Sergey Meirovich 2015-12-22 00:24:27 MST
Created attachment 2533 [details]
slurm.conf
Comment 2 Sergey Meirovich 2015-12-22 00:25:30 MST
Created attachment 2534 [details]
slurmctld log
Comment 3 David Bigagli 2015-12-22 01:39:29 MST
Hi,
    thanks for your report. We will examine the data you have provided and
get back to you.

David
Comment 4 David Bigagli 2015-12-22 23:01:36 MST
Hi,
   I cannot reproduce what you reported and Slurm is not suppose to resume
or change state of job after reconfigure or restart. 
The logs show that 'scontrol resume 1131' command was ran to resume the job
on purpose that's a clearly logged sequence which can be reproduced easily.

David
Comment 5 Sergey Meirovich 2015-12-23 02:08:06 MST
Hi David,

Where do you see manual resume? There should be "Processing RPC: REQUEST_SUSPEND(resume) from uid=..." in slurmctld log that is coming from next:
/src/slurmctld/proc_req.c:	info("Processing RPC: REQUEST_SUSPEND(%s) from uid=%u",

However that is not present in the slurmctld log that is attached.

====================================================================
root@debian:/mnt/dnld# zcat slurmctld.gz | grep REQUEST_SUSPEND
root@debian:/mnt/dnld# 
====================================================================

SchedMD clearly stated to us that manual suspend/resume is dangerous:
http://bugs.schedmd.com/show_bug.cgi?id=2031

So we are strictly refraining from it on our production cluster.
Comment 6 David Bigagli 2015-12-27 23:09:42 MST
Hi Sergey,
         yes my initial thought was not correct. When gang scheduling is configured at reconfiguration/restart of the slurmctld the jobs are resumed
by the gang scheduling plugin but they are also suspended immediately.
What seems to happen in your case is that the job got resumed one more time.
Do you observe this for all jobs or just sporadically?

David
Comment 8 Sergey Meirovich 2015-12-28 00:19:26 MST
Hi David,

I have seen that only once so far.

Indeed I see the behavior you have explained for other JobID - there was two jobs in suspended status during last restart: 
=========================================================================================
root@debian:/mnt/dnld# zgrep 'Recovered ' slurmctld.gz | grep State=0x2 | grep 2015-12-21
[2015-12-21T07:42:31.049] Recovered JobID=1131 State=0x2 NodeCnt=0 Assoc=166
[2015-12-21T07:42:31.049] Recovered JobID=2019 State=0x2 NodeCnt=0 Assoc=166
root@debian:/mnt/dnld# 
=========================================================================================

JobID=2019 had resumed and immediately suspended exactly as you have explained:

=========================================================================================
[root@dcalph045 ~]# grep 2015-12-21 /var/log/slurmd | grep 2019
[2015-12-21T07:42:31.118] [2019.0] Sending signal 18 to pid 1715 (vasp_parallel)
[2015-12-21T07:42:31.118] [2019.0] Sending signal 18 to pid 1716 (vasp_parallel)
[2015-12-21T07:42:31.118] [2019.0] Sending signal 18 to pid 1717 (vasp_parallel)
[2015-12-21T07:42:31.118] [2019.0] Sending signal 18 to pid 1718 (vasp_parallel)
[2015-12-21T07:42:31.138] [2019] Sending signal 18 to pid 1701 (srun)
[2015-12-21T07:42:31.155] [2019] Sending signal 18 to pid 1700 (srun)
[2015-12-21T07:42:31.155] [2019] Sending signal 18 to pid 1699 (slurm_script)
[2015-12-21T07:42:31.155] [2019] Resumed 2019.4294967294
[2015-12-21T07:42:31.144] [2019.0] Sending signal 18 to pid 1719 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1720 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1721 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1722 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1723 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1724 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1725 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1726 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1727 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1728 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1729 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1730 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1731 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1732 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1733 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1734 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1735 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1736 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1737 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1738 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1739 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1740 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1741 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1742 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1743 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1780 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1784 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1785 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1786 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1787 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1788 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1789 (vasp_parallel)
[2015-12-21T07:42:31.160] [2019.0] Resumed 2019.0
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1715 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1716 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1717 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1718 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1719 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1720 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1721 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1722 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1723 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1724 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1725 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1726 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1727 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1728 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1729 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1730 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1731 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1732 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1733 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1734 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1735 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1736 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1737 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1738 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1739 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1740 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1741 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1742 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1743 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1780 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1784 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1785 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1786 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1787 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1788 (vasp_parallel)
[2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1789 (vasp_parallel)
[2015-12-21T07:42:31.248] [2019] Sending signal 20 to pid 1701 (srun)
[2015-12-21T07:42:31.263] [2019] Sending signal 20 to pid 1700 (srun)
[2015-12-21T07:42:31.263] [2019] Sending signal 20 to pid 1699 (slurm_script)
[2015-12-21T07:42:33.268] [2019.0] Sending signal 19 to pid 1715 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1716 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1717 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1718 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1719 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1720 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1721 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1722 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1723 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1724 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1725 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1726 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1727 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1728 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1729 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1730 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1731 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1732 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1733 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1734 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1735 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1736 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1737 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1738 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1739 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1740 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1741 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1742 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1743 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1780 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1784 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1785 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1786 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1787 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1788 (vasp_parallel)
[2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1789 (vasp_parallel)
[2015-12-21T07:42:33.277] [2019.0] Suspended 2019.0
[2015-12-21T07:42:33.288] [2019] Sending signal 19 to pid 1701 (srun)
[2015-12-21T07:42:33.288] [2019] Sending signal 19 to pid 1700 (srun)
[2015-12-21T07:42:33.288] [2019] Sending signal 19 to pid 1699 (slurm_script)
[2015-12-21T07:42:33.288] [2019] Suspended 2019.4294967294
[root@dcalph045 ~]# 
=========================================================================================

Not sure why that didn't happen the same for the JobID=1131
Comment 9 David Bigagli 2015-12-28 00:32:46 MST
Thanks for the update. I was able to reproduce the problem once when I used
your configuration and enabled the gang scheduling. We are now investigating
a possible cause of the problem.

David
Comment 14 Moe Jette 2015-12-28 04:52:33 MST
Created attachment 2548 [details]
fix for bug

I was able to confirm the behaviour described. The original logic was resuming all jobs when slurmctld restarted or was reconfigured, then suspended jobs. This change leaves suspended jobs suspended.
Comment 15 Moe Jette 2015-12-28 04:54:00 MST
The fix will be in version 15.08.7 when released, likely in mid-January. The commit is here:
https://github.com/SchedMD/slurm/commit/aac1434c27f9a16e48ff9f55c54137a9f4f9c413

Please re-open the ticket if you encounter further problems with this patch.