Ticket 7610 - Launch Failed Requeued Held After Upgrade
Summary: Launch Failed Requeued Held After Upgrade
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 18.08.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Jason Booth
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-08-21 09:43 MDT by Roy
Modified: 2019-09-09 12:45 MDT

See Also:
Site: UMBC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: CentOS
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld.log from head node (14.98 MB, text/plain)
2019-08-21 09:43 MDT, Roy
1191484_307__20190821_umbc (1.54 KB, application/octet-stream)
2019-08-21 10:27 MDT, Roy
20190821_slurmAllNodes (140.81 KB, application/octet-stream)
2019-08-21 10:27 MDT, Roy
20180821_cnode013_slurmd.log (1.31 MB, application/octet-stream)
2019-08-21 10:27 MDT, Roy

Description Roy 2019-08-21 09:43:32 MDT
Created attachment 11306 [details]
slurmctld.log from head node

We upgraded from 17.11.12 to 18.08.8 recently. We now find that many hundreds of jobs are left pending with the reason "launch failed requeued held".

I've attached the slurmctld log file from our management node.

Can anyone offer some insight into this?
Comment 1 Jason Booth 2019-08-21 10:15:16 MDT
Roy - It is easier to focus on a single job and the nodes that job was assigned to. To that end, would you please gather the following output and attach it to the ticket as well.

scontrol show job <jobID>

From that output look at the "NodeList=" and also gather the slurmd.log from that node.

scontrol show nodes


Also, based on the logs attached, I see a number of errors related to "Kill task failed", which suggests that the job could not be killed. This is generally associated with a task that is hung on I/O, but it could also be some other transient failure. The best way to diagnose this is to log into the node, look at the tasks with "ps aux", and check the output of dmesg to see what is happening on the node.
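[Support note] The triage steps above can be sketched as shell commands. This is a hedged sketch: the job ID and node name are taken from later in this ticket, and the slurmd log path depends on the SlurmdLogFile setting in slurm.conf, so adjust for the local layout.

```shell
# On the head node: inspect one stuck job and note its NodeList= field.
# 1191484 is the example job ID from this ticket; substitute as needed.
scontrol show job 1191484 | grep -o 'NodeList=[^ ]*'

# On the affected node (e.g. cnode013 from the attachments):
ps aux | awk '$8 ~ /^D/'   # tasks stuck in uninterruptible (I/O) sleep
dmesg | tail -n 50         # recent kernel messages

# Attach that node's slurmd log (path per SlurmdLogFile in slurm.conf,
# commonly /var/log/slurmd.log) to the ticket.
```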
Comment 2 Roy 2019-08-21 10:27:35 MDT
Created attachment 11308 [details]
1191484_307__20190821_umbc

I've attached these files.

As far as the "Kill task failed" errors go, those are the subject of another ticket.
One of the proposed resolutions was to update slurm. After updating slurm,
we started to see the issue with the "launch failed requeued held" reason.


Comment 3 Roy 2019-08-21 10:27:36 MDT
Created attachment 11309 [details]
20190821_slurmAllNodes
Comment 4 Roy 2019-08-21 10:27:36 MDT
Created attachment 11310 [details]
20180821_cnode013_slurmd.log
Comment 5 Jason Booth 2019-08-21 11:52:36 MDT
Roy - the issue seems to be with your prolog. The script ran for 0 seconds, which is odd. Would you please check the script and verify that it is doing the correct thing? You can also test by commenting out that setting and restarting slurmd on a test node.

> Prolog=/cm/local/apps/cmd/scripts/prolog


>[2019-08-21T09:48:15.398] _run_prolog: prolog with lock for job 1191484 ran for 0 seconds
>[2019-08-21T09:48:15.400] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No such process
>[2019-08-21T09:48:15.400] Launching batch job 1191484 for UID 13357
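[Support note] The suggested test can be sketched as follows. This is a rough check only: Slurm normally runs the prolog with job-specific environment variables set, so a by-hand run will not reproduce those, but it will surface syntax errors or missing dependencies.

```shell
# On an affected compute node: run the configured prolog by hand,
# tracing each command and printing the exit code.
bash -x /cm/local/apps/cmd/scripts/prolog; echo "exit=$?"

# To disable it temporarily, comment out the Prolog= line in slurm.conf:
#   Prolog=/cm/local/apps/cmd/scripts/prolog  ->  #Prolog=...
# then push the change and restart slurmd on the test node:
scontrol reconfigure        # from the controller
systemctl restart slurmd    # on the test node
```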
Comment 6 Roy 2019-08-21 12:33:31 MDT
I've made this change on the cluster. We see no change in the systemctl
status; slurmd still fails to start.

We're seeing this in the system log: "systemd: PID file /var/run/slurmd.pid not
readable (yet?) after start."

Could this be related?
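[Support note] That systemd warning usually means slurmd never wrote its pid file where systemd expects it. The path is set in two places and must agree; a hedged way to compare them (unit-file and config locations shown are common defaults and may differ on a Bright-managed system):

```shell
# Where systemd expects the pid file (slurmd unit):
grep PIDFile /usr/lib/systemd/system/slurmd.service

# Where slurmd is configured to write it (defaults to /var/run/slurmd.pid):
grep -i SlurmdPidFile /etc/slurm/slurm.conf
```

If the two paths differ, the service start will time out even though the daemon itself may be running.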


Comment 7 Roy 2019-08-21 12:34:25 MDT
Also: `systemctl restart slurmd` hangs on the nodes in question.



Comment 8 Jason Booth 2019-08-21 12:48:56 MDT
Hi Roy - "systemctl restart slurmd" should not be hanging. Is this a Bright Cluster Manager setup? If so, how did you do the upgrade? Was it via Bright's provided RPMs, or did you install under /cm/shared/apps/slurm/<version> and symlink to /cm/shared/apps/slurm/current?

If you would like I can join you on Zoom and look at the situation directly.

-Jason
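[Support note] The version-directory-plus-symlink layout mentioned above can be illustrated with a self-contained sketch. The real paths live under /cm/shared; here a throwaway temp directory stands in for them so the sketch can be run anywhere.

```shell
#!/bin/sh
# Sketch of the install layout: one directory per Slurm version,
# with 'current' as a symlink to the active one.
base=$(mktemp -d)
mkdir -p "$base/slurm/17.11.12" "$base/slurm/18.08.8"

# An upgrade then amounts to repointing the symlink at the new version.
ln -sfn "$base/slurm/18.08.8" "$base/slurm/current"

readlink "$base/slurm/current"
```

The advantage of this scheme is that a rollback is a single `ln -sfn` back to the old version directory.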
Comment 9 Roy 2019-08-21 13:55:47 MDT
Yes, this Slurm setup is managed through Bright. We did the upgrade by
following their recommendations and using their RPMs.

I'd be happy to host a meeting on zoom, or you can host the meeting and
invite me to it.

Comment 11 Jason Booth 2019-08-29 14:37:36 MDT
Hi Roy - I'm just following up about this issue to check on the current status. When we discussed this last week, you had a few issues going on with the way Bright was managing services and config files. Let me know whether those were resolved and what the current status is for this bug.
Comment 12 Roy 2019-09-06 08:33:35 MDT
We can consider this ticket resolved! We still have some lingering issues,
but they seem unrelated. We will follow up in other tickets if necessary.

Roy Prouty
UMBC Office: ENGR 201A | (410) 455-6351
Cell: (443) 617-5771

The most dangerous phrase in the language is, "We've always done it this
way."

- Grace Hopper



Comment 13 Jason Booth 2019-09-09 12:45:46 MDT
Resolving