Ticket 13138 - cannot cancel jobs in RH state
Summary: cannot cancel jobs in RH state
Status: RESOLVED DUPLICATE of ticket 4833
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: - Unsupported Older Versions
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Ben Roberts
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-01-06 12:16 MST by ar
Modified: 2022-01-06 13:45 MST (History)
0 users

See Also:
Site: Columbia University
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description ar 2022-01-06 12:16:31 MST
We have a few jobs in the RH (REQUEUE_HOLD) state that we cannot cancel.
The "scancel" command runs without any errors, but it does not remove the jobs from the squeue output, and they show in the database as pending.

Please see below:

 
[root@roll ar2667]# squeue
....
          26126703     short drc_a_20  msd2202  R    7:39:25      1 node021
          26126704     short drc_a_20  msd2202  R    7:39:25      1 node022
          26126706     short drc_a_20  msd2202  R    7:39:25      1 node051
          26126710     short drc_a_20  msd2202  R    7:39:25      1 node028
        26114272_8 short,cmt       B0   am5328 RH       0:00      1 (JobHoldMaxRequeue)
        26114272_9 short,cmt       B0   am5328 RH       0:00      1 (JobHoldMaxRequeue)
       26114272_10 short,cmt       B0   am5328 RH       0:00      1 (JobHoldMaxRequeue)
       26114272_11 short,cmt       B0   am5328 RH       0:00      1 (JobHoldMaxRequeue)
       26114272_12 short,cmt       B0   am5328 RH       0:00      1 (JobHoldMaxRequeue)
       26114272_73 short,cmt       B0   am5328 RH       0:00      1 (JobHoldMaxRequeue)
       26114272_74 short,cmt       B0   am5328 RH       0:00      1 (JobHoldMaxRequeue)
[root@roll ar2667]# scontrol show job 26114272_8
slurm_load_jobs error: Invalid job id specified
[root@roll ar2667]# scancel --full 26114272_8
[root@roll ar2667]# 

[ar2667@node289 ~]$ sacct -j 26114272_8  --format=User,JobID,jobname,state,time,start,end,elapsed,ReqTRES,nodelist
     User        JobID    JobName      State  Timelimit               Start                 End    Elapsed    ReqTRES        NodeList 
--------- ------------ ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- --------------- 
   am5328 26114272_8           B0    PENDING   11:59:00             Unknown             Unknown   00:00:00 cpu=24,me+   None assigned
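For reference, the stuck tasks can be picked out of the squeue listing mechanically. The sketch below (an editorial illustration, not part of the original report) filters squeue's default output for lines whose state column is RH; the sample text is abridged from the listing above.

```python
# Minimal sketch: find job IDs stuck in the RH (REQUEUE_HOLD) state in
# squeue's default output. Assumes the default column layout shown above,
# where the state is the fifth whitespace-separated field.

SQUEUE_OUTPUT = """\
  26126703     short drc_a_20  msd2202  R    7:39:25      1 node021
26114272_8 short,cmt       B0   am5328 RH       0:00      1 (JobHoldMaxRequeue)
26114272_9 short,cmt       B0   am5328 RH       0:00      1 (JobHoldMaxRequeue)
"""

def held_job_ids(squeue_text):
    """Return the job IDs whose state column is RH."""
    ids = []
    for line in squeue_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[4] == "RH":
            ids.append(fields[0])
    return ids

print(held_job_ids(SQUEUE_OUTPUT))  # → ['26114272_8', '26114272_9']
```

The same list could be produced directly with `squeue -t REQUEUE_HOLD`; parsing the text is shown only to make the state column explicit.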
Comment 1 Ben Roberts 2022-01-06 12:57:27 MST
Hi,

We have seen an issue that is very similar to what you are describing in older versions of Slurm (prior to 17.11.4).  You marked the ticket as being for an older version of Slurm.  Which version exactly are you running?

The solution in that case was to restart slurmctld, which would try to get information about these jobs when it started back up, and they would be cancelled when the job information couldn't be found.  Newer versions of Slurm have a fix that should prevent array jobs from getting into this state.  You can read more about it in bug 4833.  Let us know if this doesn't help in your case.

Thanks,
Ben
Comment 2 ar 2022-01-06 13:02:56 MST
Thank you, Ben.
We are running 17.11.2. We will restart slurmctld.
~a


*---*
Axinia Radeva
Manager, Research Computing Services

Comment 3 ar 2022-01-06 13:32:56 MST
I restarted the controller and the RH jobs are gone from the squeue output.

Thank you

Comment 4 Ben Roberts 2022-01-06 13:45:00 MST
I'm glad to hear that restarting the controller cleared out these jobs.  I'll mark this ticket as a duplicate of bug 4833.

I'll also take this opportunity to encourage you to make plans to upgrade to a recent version of Slurm.  There have been a lot of bug fixes that have gone into the software since 17.11, and the amount of support we can provide for 17.11 is limited.  You can review our documentation on the upgrade process here:
https://slurm.schedmd.com/quickstart_admin.html#upgrade

Feel free to let us know if you have any questions about upgrading.

Thanks,
Ben

*** This ticket has been marked as a duplicate of ticket 4833 ***