Ticket 4833

Summary: Job array stuck and can't cancel - annoying user with multiple emails
Product: Slurm Reporter: Wei Feinstein <wfeinstein>
Component: User Commands    Assignee: Brian Christiansen <brian>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: aradeva, brian.gilmer, bweigand, jonathon.anderson, kaylea.nelson, remi-externe.palancher
Version: 17.11.3   
Hardware: Linux   
OS: Linux   
Site: LBNL - Lawrence Berkeley National Laboratory
Attachments: Logs for job in slurmctld.log.

Description Wei Feinstein 2018-02-26 12:58:45 MST
Created attachment 6239 [details]
Logs for job in slurmctld.log.

The following jobs are orphaned and can not be cancelled by either the user or an admin:
        11840060_2       lr3 thm_dime   aszasz RH       0:00      1 (JobHoldMaxRequeue)
        11840060_3       lr3 thm_dime   aszasz RH       0:00      1 (JobHoldMaxRequeue)
        11840060_4       lr3 thm_dime   aszasz RH       0:00      1 (JobHoldMaxRequeue)
        11840060_5       lr3 thm_dime   aszasz RH       0:00      1 (JobHoldMaxRequeue)
        11840060_6       lr3 thm_dime   aszasz RH       0:00      1 (JobHoldMaxRequeue)
        11840060_7       lr3 thm_dime   aszasz RH       0:00      1 (JobHoldMaxRequeue)
        11840060_8       lr3 thm_dime   aszasz RH       0:00      1 (JobHoldMaxRequeue)
        11840060_9       lr3 thm_dime   aszasz RH       0:00      1 (JobHoldMaxRequeue)
       11840060_10       lr3 thm_dime   aszasz RH       0:00      1 (JobHoldMaxRequeue)
       11840060_11       lr3 thm_dime   aszasz RH       0:00      1 (JobHoldMaxRequeue)
       11840060_13       lr3 thm_dime   aszasz RH       0:00      1 (JobHoldMaxRequeue)
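
A listing like the one above can be filtered to pull out just the stuck task IDs for a bulk scancel attempt. This is a sketch against saved squeue output (the sample text and the column layout are taken from the listing; on a live system you would pipe squeue itself):

```shell
# Extract task IDs of array tasks held with reason JobHoldMaxRequeue.
# The first column of squeue's default output is the JOBID.
squeue_out='
  11840060_2  lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)
  11840060_3  lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)
'
ids=$(printf '%s\n' "$squeue_out" | awk '/JobHoldMaxRequeue/ {print $1}')
printf '%s\n' "$ids"
# On a live system you would then run, e.g.:  scancel $ids
```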

Tried the following steps:
scontrol release (to clear the JobHoldMaxRequeue hold), and then scancel with the options -f and --signal=TERM.
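
For reference, the attempts above as commands (they require a live slurmctld; the array job ID comes from the listing above):

```shell
scontrol release 11840060       # clear the JobHoldMaxRequeue hold on the array
scancel -f 11840060             # force-cancel all tasks of the array
scancel --signal=TERM 11840060  # or send SIGTERM to the tasks instead
```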

When the jobs were released they were PD with reason (BeginTime), and then they went right back to JobHoldMaxRequeue.

The user is getting annoyed by the number of emails he is receiving, and I have not found a way to remove his orphaned tasks. I am enclosing the slurmctld logs that show the status of his job.

scontrol show job no longer shows the job, so I cannot kill it.

Also is there a way to stop the emails for the user until this is resolved?

Thanks

Jackie
Comment 3 Brian Christiansen 2018-02-26 14:34:17 MST
Hey Jackie,

The jobs almost certainly got into this state due to a bug that was fixed in the following commit (included in 17.11.4):

https://github.com/SchedMD/slurm/commit/f381e4e6abca6ce45709b86989112442487f856a

The job array hashes are getting corrupted, so the job can't be found in order to kill it. Also, the meta job of the array is being cleared from the slurmctld's memory -- it should stay in memory until all of the tasks are gone -- so the tasks can't be requeued (the meta job holds the job script for the array). If you can apply this patch before upgrading to 17.11.4, that will help prevent future incidents.

As for resolving the issue now: if you restart the slurmctld, the jobs should go away. Since the meta job of the array is already out of slurmctld's memory, on restart the slurmctld will see that the jobs' job script is missing, mark them as failed, and clean them up.
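
A sketch of that restart-and-verify step. The systemd unit name "slurmctld" and the log path are assumptions; adjust for your site's SlurmctldLogFile setting:

```shell
# On the controller host:
#   sudo systemctl restart slurmctld
#   grep 'state set to FAILED' /var/log/slurmctld.log
# The grep pattern matches the cleanup message quoted below, e.g.:
msg='error: Script for job 15152 lost, state set to FAILED'
printf '%s\n' "$msg" | grep -q 'state set to FAILED' && echo 'cleanup logged'
```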

You'll see messages like:
error: Script for job 15152 lost, state set to FAILED

Can you try restarting the controller?

Thanks,
Brian
Comment 4 Brian Christiansen 2018-02-26 14:49:09 MST
An "scontrol reconfigure" will also cause the slurmctld to check for jobs without job files.
Comment 5 Wei Feinstein 2018-02-26 18:55:01 MST
I will perform that tomorrow, since it is already after hours.

Thanks

Jackie

Comment 6 Brian Christiansen 2018-03-02 08:33:53 MST
Hey Jackie, were you able to clear up the array jobs?

Thanks,
Brian
Comment 7 Wei Feinstein 2018-03-02 08:55:19 MST
Yes, restarting slurmctld worked.

Thanks

Jackie

Comment 8 Brian Christiansen 2018-03-02 08:56:25 MST
Great. Thanks Jackie.
Comment 9 Tim Wickberg 2018-04-19 11:00:12 MDT
*** Ticket 5079 has been marked as a duplicate of this ticket. ***
Comment 10 Isaac Hartung 2018-05-16 09:53:42 MDT
*** Ticket 5170 has been marked as a duplicate of this ticket. ***
Comment 11 Brian Christiansen 2018-06-07 08:54:02 MDT
*** Ticket 5272 has been marked as a duplicate of this ticket. ***
Comment 12 Jason Booth 2020-08-26 16:41:32 MDT
*** Ticket 9676 has been marked as a duplicate of this ticket. ***
Comment 13 Ben Roberts 2022-01-06 13:45:00 MST
*** Ticket 13138 has been marked as a duplicate of this ticket. ***
Comment 14 Jonathon Anderson 2022-01-06 13:45:08 MST
My last day with RC and CU is 22 October 2021. For Research Computing, please contact rc-help@colorado.edu.