Created attachment 11306 [details] slurmctld.log from head node

We upgraded from 17.11.12 to 18.08.8 recently. We now find that many hundreds of jobs are left pending with the reason "launch failed requeued held". I've attached the slurmctld log file from our management node. Can anyone offer some insight into this?
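For reference, a hedged sketch of how to list the affected jobs (and later release them in bulk). The reason string is matched loosely since its exact form can vary between Slurm versions, and the commands are guarded so the snippet is harmless on a non-Slurm host:

```shell
# Filter "<jobid> <reason...>" lines down to the affected job IDs.
filter_held() {
    grep -i 'launch.*failed.*requeued.*held' | awk '{print $1}'
}

# Live usage (assumes standard squeue/scontrol):
if command -v squeue >/dev/null 2>&1; then
    squeue -t PD -h -o "%i %r" | filter_held
    # Once the root cause is fixed, release the held jobs in bulk:
    # squeue -t PD -h -o "%i %r" | filter_held | xargs -r -n1 scontrol release
fi
```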
Roy - It is easier to focus on a single job and the nodes that job was assigned to. For this, would you please gather the following output and attach it to the ticket as well:

scontrol show job <jobID>

From that output, look at the "NodeList=" field and also gather the slurmd.log from that node.

scontrol show nodes

Also, based on the logs attached I see a number of errors related to "Kill task failed", which suggests that the job was unable to be killed. This is generally associated with a task that is hung on I/O, but it could also be some other transient failure. The best way to diagnose this is to log into the node, look at the tasks with "ps aux", and also look at the output of dmesg to see what is happening on the node.
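The node-side triage above can be sketched as follows. Tasks hung on I/O typically sit in uninterruptible sleep, shown as "D" in the STAT column of ps; the dmesg tail is guarded since it may need elevated privileges:

```shell
# ps aux prints STAT as the 8th column; select processes stuck in D state.
hung_tasks() {
    awk '$8 ~ /^D/'
}

# Run on the compute node taken from the job's NodeList= field:
ps aux | hung_tasks

# Kernel messages often show the underlying cause (e.g. NFS timeouts, OOM kills):
dmesg 2>/dev/null | tail -n 50 || true
```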
Created attachment 11308 [details] 1191484_307__20190821_umbc

I've attached these files. As for the "Kill task failed" errors, that's the subject of another ticket. One of the resolutions proposed there was to update Slurm; after updating Slurm we started to see the issue with the "launch failed requeued held" reason.
Created attachment 11309 [details] 20190821_slurmAllNodes
Created attachment 11310 [details] 20180821_cnode013_slurmd.log
Roy - the issue seems to be with your prolog. The script ran for 0 seconds, which is odd. Would you please check your script and verify that it is doing the correct thing. You can also test by commenting out that setting and restarting the slurmd on a test node.

> Prolog=/cm/local/apps/cmd/scripts/prolog

> [2019-08-21T09:48:15.398] _run_prolog: prolog with lock for job 1191484 ran for 0 seconds
> [2019-08-21T09:48:15.400] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No such process
> [2019-08-21T09:48:15.400] Launching batch job 1191484 for UID 13357
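A hedged sketch of a basic sanity check on that prolog script (path from the slurm.conf line quoted above). A prolog that cannot be exec'd at all (missing, not executable, no shebang) can cause the batch launch to fail and the job to be requeued and held:

```shell
PROLOG=/cm/local/apps/cmd/scripts/prolog   # path from slurm.conf

check_prolog() {
    script=$1
    [ -e "$script" ]                   || { echo "missing";        return 1; }
    [ -x "$script" ]                   || { echo "not executable"; return 1; }
    head -n1 "$script" | grep -q '^#!' || { echo "no shebang";     return 1; }
    echo "ok"
}

check_prolog "$PROLOG" || true   # harmless if the path doesn't exist here
```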
I've made this change on the cluster. We're noting no change in the systemctl status after slurmd fails to start.

We're seeing this in the system log: "systemd: PID file /var/run/slurmd.pid not readable (yet?) after start."

Could this be related?
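That systemd warning usually means either that the daemon died before writing its pidfile, or that PIDFile= in the slurmd unit disagrees with SlurmdPidFile in slurm.conf. A hedged sketch for comparing the two (the paths below are common defaults, not verified for this site):

```shell
# Pull SlurmdPidFile=... out of a slurm.conf stream.
pidfile_from_conf() {
    grep -i '^SlurmdPidFile=' | cut -d= -f2
}

if [ -r /etc/slurm/slurm.conf ]; then
    conf_pid=$(pidfile_from_conf < /etc/slurm/slurm.conf)
    unit_pid=$(systemctl show -p PIDFile slurmd 2>/dev/null | cut -d= -f2)
    echo "slurm.conf SlurmdPidFile: ${conf_pid:-/var/run/slurmd.pid (default)}"
    echo "unit file  PIDFile      : ${unit_pid:-<unset>}"
fi
```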
Also: `systemctl restart slurmd` hangs on the nodes in question.
Hi Roy - "systemctl restart slurmd" should not be hanging. Is this a Bright Cluster Manager setup? If so, how did you do the upgrade? Was this via Bright's provided RPMs, or did you install under /cm/shared/apps/slurm/<version> and symlink to /cm/shared/apps/slurm/current?

If you would like, I can join you on Zoom and look at the situation directly.

-Jason
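A quick consistency check that may help here: after a partially applied upgrade, slurmctld and the slurmd on a node can end up running different versions. This sketch (guarded so it's safe on a non-Slurm host; the /cm path is from the question above) prints the version each daemon reports and where the symlink points:

```shell
# "slurm 18.08.8" -> "18.08.8"
slurm_ver() {
    awk '{print $2; exit}'
}

command -v slurmd   >/dev/null 2>&1 && slurmd -V | slurm_ver || true
command -v scontrol >/dev/null 2>&1 && scontrol version | slurm_ver || true
ls -l /cm/shared/apps/slurm/current 2>/dev/null || true
```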
Yes, this Slurm setup is managed through Bright. We did the upgrade by following their recommendations and using their RPMs.

I'd be happy to host a meeting on Zoom, or you can host the meeting and invite me to it.
Hi Roy - I'm just following up with you about this issue and checking on the current status. When we were discussing this last week, you had a few issues going on with the way Bright was managing services and config files. Let me know if those were resolved and what the current status is for this bug.
We can consider this ticket resolved! We still have some lingering issues, but they seem unrelated. We will follow up in other tickets, if necessary.
Resolving