| Summary: | suspended job got resumed for no reason after slurmctld restart | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sergey Meirovich <sergey_meirovich> |
| Component: | slurmd | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | tim |
| Version: | 15.08.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | AMAT | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 15.08.7 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: |
dcalph001 slurmd log
slurm.conf slurmctld log fix for bug |
||
|
Description
Sergey Meirovich
2015-12-22 00:22:38 MST
Created attachment 2533 [details]
slurm.conf
Created attachment 2534 [details]
slurmctld log
Hi,
thanks for your report. We will examine the data you have provided and
get back to you.
David
Hi, I cannot reproduce what you reported and Slurm is not suppose to resume or change state of job after reconfigure or restart. The logs show that 'scontrol resume 1131' command was ran to resume the job on purpose that's a clearly logged sequence which can be reproduced easily. David Hi David,
Where do you see manual resume? There should be "Processing RPC: REQUEST_SUSPEND(resume) from uid=..." in slurmctld log that is coming from next:
/src/slurmctld/proc_req.c: info("Processing RPC: REQUEST_SUSPEND(%s) from uid=%u",
However that is not present in the slurmctld log that is attached.
====================================================================
root@debian:/mnt/dnld# zcat slurmctld.gz | grep REQUEST_SUSPEND
root@debian:/mnt/dnld#
====================================================================
SchedMD clearly stated to us that manual suspend/resume is dangerous:
http://bugs.schedmd.com/show_bug.cgi?id=2031
So we are strictly refraining from it on our production cluster.
Hi Sergey,
yes my initial thought was not correct. When gang scheduling is configured at reconfiguration/restart of the slurmctld the jobs are resumed
by the gang scheduling plugin but they are also suspended immediately.
What seems to happen in your case is that the job got resumed one more time.
Do you observe this for all jobs or just sporadically?
David
Hi David, I have seen that only once so far. Indeed I see the behavior you have explained for other JobID - there was two jobs in suspended status during last restart: ========================================================================================= root@debian:/mnt/dnld# zgrep 'Recovered ' slurmctld.gz | grep State=0x2 | grep 2015-12-21 [2015-12-21T07:42:31.049] Recovered JobID=1131 State=0x2 NodeCnt=0 Assoc=166 [2015-12-21T07:42:31.049] Recovered JobID=2019 State=0x2 NodeCnt=0 Assoc=166 root@debian:/mnt/dnld# ========================================================================================= JobID=2019 had resumed and immediately suspended exactly as you have explained: ========================================================================================= [root@dcalph045 ~]# grep 2015-12-21 /var/log/slurmd | grep 2019 [2015-12-21T07:42:31.118] [2019.0] Sending signal 18 to pid 1715 (vasp_parallel) [2015-12-21T07:42:31.118] [2019.0] Sending signal 18 to pid 1716 (vasp_parallel) [2015-12-21T07:42:31.118] [2019.0] Sending signal 18 to pid 1717 (vasp_parallel) [2015-12-21T07:42:31.118] [2019.0] Sending signal 18 to pid 1718 (vasp_parallel) [2015-12-21T07:42:31.138] [2019] Sending signal 18 to pid 1701 (srun) [2015-12-21T07:42:31.155] [2019] Sending signal 18 to pid 1700 (srun) [2015-12-21T07:42:31.155] [2019] Sending signal 18 to pid 1699 (slurm_script) [2015-12-21T07:42:31.155] [2019] Resumed 2019.4294967294 [2015-12-21T07:42:31.144] [2019.0] Sending signal 18 to pid 1719 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1720 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1721 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1722 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1723 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1724 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1725 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1726 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1727 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1728 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1729 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1730 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1731 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1732 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1733 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1734 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1735 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1736 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1737 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1738 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1739 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1740 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1741 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1742 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1743 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1780 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1784 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1785 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1786 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1787 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1788 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Sending signal 18 to pid 1789 (vasp_parallel) [2015-12-21T07:42:31.160] [2019.0] Resumed 2019.0 [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1715 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1716 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1717 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1718 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1719 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1720 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1721 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1722 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1723 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1724 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1725 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1726 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1727 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1728 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1729 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1730 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1731 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1732 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1733 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1734 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1735 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1736 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1737 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1738 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1739 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1740 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1741 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1742 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1743 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1780 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1784 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1785 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1786 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1787 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1788 (vasp_parallel) [2015-12-21T07:42:31.242] [2019.0] Sending signal 20 to pid 1789 (vasp_parallel) [2015-12-21T07:42:31.248] [2019] Sending signal 20 to pid 1701 (srun) [2015-12-21T07:42:31.263] [2019] Sending signal 20 to pid 1700 (srun) [2015-12-21T07:42:31.263] [2019] Sending signal 20 to pid 1699 (slurm_script) [2015-12-21T07:42:33.268] [2019.0] Sending signal 19 to pid 1715 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1716 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1717 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1718 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1719 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1720 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1721 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1722 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1723 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1724 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1725 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1726 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1727 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1728 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1729 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1730 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1731 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1732 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1733 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1734 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1735 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1736 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1737 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1738 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1739 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1740 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1741 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1742 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1743 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1780 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1784 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1785 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1786 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1787 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1788 (vasp_parallel) [2015-12-21T07:42:33.276] [2019.0] Sending signal 19 to pid 1789 (vasp_parallel) [2015-12-21T07:42:33.277] [2019.0] Suspended 2019.0 [2015-12-21T07:42:33.288] [2019] Sending signal 19 to pid 1701 (srun) [2015-12-21T07:42:33.288] [2019] Sending signal 19 to pid 1700 (srun) [2015-12-21T07:42:33.288] [2019] Sending signal 19 to pid 1699 (slurm_script) [2015-12-21T07:42:33.288] [2019] Suspended 2019.4294967294 [root@dcalph045 ~]# ========================================================================================= Not sure why that didn't happen the same for the JobID=1131 Thanks for the update. I was able to reproduce the problem once when I used your configuration and enabled the gang scheduling. We are now investigating a possible cause of the problem. David Created attachment 2548 [details]
fix for bug
I was able to confirm the behaviour described. The original logic was resuming all jobs when slurmctld restarted or was reconfigured, then suspended jobs. This change leaves suspended jobs suspended.
The fix will be in version 15.08.7 when released, likely in mid-January. The commit is here: https://github.com/SchedMD/slurm/commit/aac1434c27f9a16e48ff9f55c54137a9f4f9c413 Please re-open the ticket if you encounter further problems with this patch. |