Created attachment 2532 [details] dcalph001 slurmd log Hi, I have recently upgraded 15.08.5 => 15.08.6 as per: http://slurm.schedmd.com/quickstart_admin.html#upgrade After the upgrade observed clear anomaly ====================================================================================== -sh-4.1$ squeue | grep 001 727 e154414 R normal wrap (null) Dec 8 12:54 20 1 dcalph001 1131 e154414 S low wrap (null) Dec 1 22:28 36 1 dcalph001 -sh-4.1$ scontrol show node=dcalph001 NodeName=dcalph001 Arch=x86_64 CoresPerSocket=18 CPUAlloc=36 CPUErr=0 CPUTot=36 CPULoad=56.00 Features=E5-2699v3,128G,cae Gres=(null) NodeAddr=dcalph001 NodeHostName=dcalph001 Version=15.08 OS=Linux RealMemory=128951 AllocMem=0 FreeMem=58209 Sockets=2 Boards=1 State=ALLOCATED ThreadsPerCore=1 TmpDisk=1951 Weight=1 Owner=N/A BootTime=Nov 19 12:04 SlurmdStartTime=Dec 21 7:43 CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s -sh-4.1$ ====================================================================================== Please note: "CPUAlloc=36 ... CPULoad=56.00" I have checked the situation on host=dcalph001 and indeed JobID=1131 is running despite that slurmctld thinks that it is suspended. Moreover that is clear from slurmd log od dcalph001 that JobID=1131 was deliberately resumed by slurmd around slurmctld restart. ====================================================================================== [root@dcalph001 ~]# grep 1131.0 /var/log/slurmd | tail -n 37 [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112291 (vasp_parallel) [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112292 (vasp_parallel) [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112293 (vasp_parallel) [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112294 (vasp_parallel) [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112295 (vasp_parallel) [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112296 (vasp_parallel) [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112297 (vasp_parallel) [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112298 (vasp_parallel) [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112299 (vasp_parallel) [2015-12-21T09:42:32.087] [[2015-12-21T09:42:32.087] [1131] R1131.0] Sending signal 18 to pid 112300 (vasp_parallel) [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112301 (vasp_parallel) [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112302 (vasp_parallel) [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112303 (vasp_parallel) [2015-12-21T09:42:32.087] [1131.0] Sending signal 18 to pid 112304 (vasp_parallel) [2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112305 (vasp_parallel) [2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112306 (vasp_parallel) [2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112307 (vasp_parallel) [2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112308 (vasp_parallel) [2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112309 (vasp_parallel) [2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112310 (vasp_parallel) [2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112311 (vasp_parallel) [2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112312 (vasp_parallel) [2015-12-21T09:42:32.088] [1131.0] Sending signal 18 to pid 112313 (vasp_parallel) [2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112314 (vasp_parallel) [2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112315 (vasp_parallel) [2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112316 (vasp_parallel) [2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112317 (vasp_parallel) [2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112318 (vasp_parallel) [2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112319 (vasp_parallel) [2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112320 (vasp_parallel) [2015-12-21T09:42:32.090] [1131.0] Sending signal 18 to pid 112321 (vasp_parallel) [2015-12-21T09:42:32.091] [1131.0] Sending signal 18 to pid 112322 (vasp_parallel) [2015-12-21T09:42:32.091] [1131.0] Sending signal 18 to pid 112323 (vasp_parallel) [2015-12-21T09:42:32.091] [1131.0] Sending signal 18 to pid 112324 (vasp_parallel) [2015-12-21T09:42:32.091] [1131.0] Sending signal 18 to pid 112325 (vasp_parallel) [2015-12-21T09:42:32.092] [1131.0] Sending signal 18 to pid 112326 (vasp_parallel) [2015-12-21T09:42:32.109] [1131.0] Resumed 1131.0 [root@dcalph001 ~]# ====================================================================================== slurmctld recovered job as suspended (State=0x2): ====================================================================================== [root@DCALPH000 ~]# grep 1131 /var/log/slurmctld | tail -n 2 [2015-12-21T07:42:31.049] recovered job step 1131.0 [2015-12-21T07:42:31.049] Recovered JobID=1131 State=0x2 NodeCnt=0 Assoc=166 [root@DCALPH000 ~]# ====================================================================================== Please be not confused by 2 hours difference. That happened because we also changed CST=>PST timezone on our servers. Since slurmctld had been restarted first it was already reporting PST while slurmd was still technically in CST at that time. Not sure if that 15.08.6 or 15.08.5 issue though. Because when that wrong resume happened slurmcld was at 15.08.6 and affected slurmd was still at 15.08.5. Attaching slurm.conf/slurmd log from dcalph001/slurmctld log here
Created attachment 2533 [details] slurm.conf
Created attachment 2534 [details] slurmctld log
Hi, thanks for your report. We will examine the data you have provided and get back to you. David
Hi, I cannot reproduce what you reported and Slurm is not suppose to resume or change state of job after reconfigure or restart. The logs show that 'scontrol resume 1131' command was ran to resume the job on purpose that's a clearly logged sequence which can be reproduced easily. David
Hi David, Where do you see manual resume? There should be "Processing RPC: REQUEST_SUSPEND(resume) from uid=..." in slurmctld log that is coming from next: /src/slurmctld/proc_req.c: info("Processing RPC: REQUEST_SUSPEND(%s) from uid=%u", However that is not present in the slurmctld log that is attached. ==================================================================== root@debian:/mnt/dnld# zcat slurmctld.gz | grep REQUEST_SUSPEND root@debian:/mnt/dnld# ==================================================================== SchedMD clearly stated to us that manual suspend/resume is dangerous: http://bugs.schedmd.com/show_bug.cgi?id=2031 So we are strictly refraining from it on our production cluster.
Hi Sergey, yes my initial thought was not correct. When gang scheduling is configured at reconfiguration/restart of the slurmctld the jobs are resumed by the gang scheduling plugin but they are also suspended immediately. What seems to happen in your case is that the job got resumed one more time. Do you observe this for all jobs or just sporadically? David
Hi David, I have seen that only once so far. Indeed I see the behavior you have explained for other JobID - there was two jobs in suspended status during last restart: ========================================================================================= root@debian:/mnt/dnld# zgrep 'Recovered ' slurmctld.gz | grep State=0x2 | grep 2015-12-21 [2015-12-21T07:42:31.049] Recovered JobID=1131 State=0x2 NodeCnt=0 Assoc=166 [2015-12-21T07:42:31.049] Recovered JobID=2019 State=0x2 NodeCnt=0 Assoc=166 root@debian:/mnt/dnld# ========================================================================================= JobID=2019 had resumed and immediately suspended exactly as you have explained: ========================================================================================= [root@dcalph045 ~]# grep 2015-12-21 /var/log/slurmd | grep 2019 [2015-12-21T07:42:31.118] [2019.0] Sending signal 18 to pid 1715 (vasp_parallel) [2015-12-21T07:42:31.118] [2019.0] Sending signal 18 to pid 1716 (vasp_parallel) [2015-12-21T07:42:31.118] [2019.0] Sending signal 18 to pid 1717 (vasp_parallel) [2015-12-21T07:42:31.118] [2019.0] Sending signal 18 to pid 1718 (vasp_parallel) [2015-12-21T07:42:31.138] [2019] Sending signal 18 to pid 1701 (srun) [2015-12-21T07:42:31.155] [2019] Sending signal 18 to pid 1700 (srun) [2015-12-21T07:42:31.155] [2019] Sending signal 18 to pid 1699 (slurm_script) [2015-12-21T07:42:31.155] [2019] Resumed 2019.4294967294 [2015-12-21T07:42:31.144] [2019.0] Sending signal 18 to pid 1719 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1720 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1721 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1722 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1723 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1724 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1725 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1726 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1727 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1728 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1729 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1730 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1731 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1732 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1733 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1734 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1735 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1736 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1737 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1738 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1739 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1740 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1741 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1742 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1743 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1780 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1784 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1785 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1786 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1787 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1788 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1789 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Resumed 2019.0 [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1715 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1716 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1717 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1718 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1719 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1720 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1721 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1722 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1723 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1724 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1725 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1726 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1727 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1728 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1729 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1730 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1731 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1732 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1733 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1734 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1735 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1736 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1737 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1738 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1739 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1740 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1741 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1742 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1743 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1780 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1784 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1785 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1786 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1787 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1788 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1789 (vasp_parallel) [2015-12-21T07:42:31.248] [2019] Sending signal 20 to pid 1701 (srun) [2015-12-21T07:42:31.263] [2019] Sending signal 20 to pid 1700 (srun) [2015-12-21T07:42:31.263] [2019] Sending signal 20 to pid 1699 (slurm_script) [2015-12-21T07:42:33.268] [2019.0] Sending signal 19 to pid 1715 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1716 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1717 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1718 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1719 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1720 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1721 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1722 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1723 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1724 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1725 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1726 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1727 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1728 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1729 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1730 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1731 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1732 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1733 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1734 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1735 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1736 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1737 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1738 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1739 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1740 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1741 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1742 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1743 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1780 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1784 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1785 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1786 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1787 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1788 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1789 (vasp_parallel) [2015-12-21T07:42:33.277] [2019.0] Suspended 2019.0 [2015-12-21T07:42:33.288] [2019] Sending signal 19 to pid 1701 (srun) [2015-12-21T07:42:33.288] [2019] Sending signal 19 to pid 1700 (srun) [2015-12-21T07:42:33.288] [2019] Sending signal 19 to pid 1699 (slurm_script) [2015-12-21T07:42:33.288] [2019] Suspended 2019.4294967294 [root@dcalph045 ~]# ========================================================================================= Not sure why that didn't happen the same for the JobID=1131
Thanks for the update. I was able to reproduce the problem once when I used your configuration and enabled the gang scheduling. We are now investigating a possible cause of the problem. David
Created attachment 2548 [details] fix for bug I was able to confirm the behaviour described. The original logic was resuming all jobs when slurmctld restarted or was reconfigured, then suspended jobs. This change leaves suspended jobs suspended.
The fix will be in version 15.08.7 when released, likely in mid-January. The commit is here: https://github.com/SchedMD/slurm/commit/aac1434c27f9a16e48ff9f55c54137a9f4f9c413 Please re-open the ticket if you encounter further problems with this patch.