We have a few jobs in state RH (REQUEUE_HOLD) that we cannot cancel. The "scancel" command runs without any errors, but it does not remove the jobs from the squeue output, and they still show in the DB as pending. Please see below:

[root@roll ar2667]# squeue
....
26126703 short drc_a_20 msd2202 R 7:39:25 1 node021
26126704 short drc_a_20 msd2202 R 7:39:25 1 node022
26126706 short drc_a_20 msd2202 R 7:39:25 1 node051
26126710 short drc_a_20 msd2202 R 7:39:25 1 node028
26114272_8 short,cmt B0 am5328 RH 0:00 1 (JobHoldMaxRequeue)
26114272_9 short,cmt B0 am5328 RH 0:00 1 (JobHoldMaxRequeue)
26114272_10 short,cmt B0 am5328 RH 0:00 1 (JobHoldMaxRequeue)
26114272_11 short,cmt B0 am5328 RH 0:00 1 (JobHoldMaxRequeue)
26114272_12 short,cmt B0 am5328 RH 0:00 1 (JobHoldMaxRequeue)
26114272_73 short,cmt B0 am5328 RH 0:00 1 (JobHoldMaxRequeue)
26114272_74 short,cmt B0 am5328 RH 0:00 1 (JobHoldMaxRequeue)

[root@roll ar2667]# scontrol show job 26114272_8
slurm_load_jobs error: Invalid job id specified
[root@roll ar2667]# scancel --full 26114272_8
[root@roll ar2667]#

[ar2667@node289 ~]$ sacct -j 26114272_8 --format=User,JobID,jobname,state,time,start,end,elapsed,ReqTRE,nodelist
     User        JobID    JobName      State  Timelimit               Start                 End    Elapsed    ReqTRES        NodeList
--------- ------------ ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------------
   am5328   26114272_8         B0    PENDING   11:59:00             Unknown             Unknown   00:00:00 cpu=24,me+   None assigned
Hi, We have seen an issue that is very similar to what you are describing in older versions of Slurm (prior to 17.11.4). You marked the ticket as being for an older version of Slurm. Which version exactly are you running? The solution in that case was to restart slurmctld, which would try to get information about these jobs when it started back up and they would be cancelled when the job information couldn't be found. Newer versions of Slurm have a fix that should prevent array jobs from getting in this state. You can read more about it in bug 4833. Let us know if this doesn't help in your case. Thanks, Ben
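To confirm that the restart actually cleared the jobs, it may help to record which jobs are in RH before restarting slurmctld and compare afterwards. The following is only a rough sketch, not something taken from this ticket: the helper names `requeue_hold_jobs` and `list_stuck_jobs` are made up for illustration, and it assumes `squeue` is on the PATH and is asked for pipe-separated output (`%i|%t|%r`, i.e. job ID, compact state, reason).

```python
import subprocess


def requeue_hold_jobs(squeue_text):
    """Return the job IDs whose compact state is RH.

    Expects one job per line in the form 'jobid|state|reason'
    (hypothetical layout, produced by --format=%i|%t|%r below).
    """
    stuck = []
    for line in squeue_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        job_id, state = fields[0], fields[1]
        if state == "RH":
            stuck.append(job_id)
    return stuck


def list_stuck_jobs():
    """Run squeue and return the RH job IDs (assumes squeue is on PATH)."""
    result = subprocess.run(
        ["squeue", "--noheader", "--array", "--format=%i|%t|%r"],
        check=True,
        capture_output=True,
        text=True,
    )
    return requeue_hold_jobs(result.stdout)
```

Calling `list_stuck_jobs()` once before and once after the restart should show whether the list of RH jobs has emptied. The parsing is kept separate from the `squeue` call so it can be checked against saved output without touching the cluster.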
Thank you, Ben.
We are running 17.11.2. We will restart slurmctld.
~a

*---*
Axinia Radeva
Manager, Research Computing Services

On Thu, Jan 6, 2022 at 2:57 PM <bugs@schedmd.com> wrote:
> Comment # 1 <https://bugs.schedmd.com/show_bug.cgi?id=13138#c1> on bug 13138 <https://bugs.schedmd.com/show_bug.cgi?id=13138> from Ben Roberts <ben@schedmd.com>
I restarted the controller and the RH jobs are gone from the squeue.
Thank you.

*---*
Axinia Radeva
Manager, Research Computing Services
I'm glad to hear that restarting the controller cleared out these jobs. I'll mark this ticket as a duplicate of bug 4833. I'll also take this opportunity to encourage you to make plans to upgrade to a recent version of Slurm. There have been a lot of bug fixes that have gone into the software since 17.11 and the amount of support we can provide for 17.11 is limited. You can review our documentation on the upgrade process here: https://slurm.schedmd.com/quickstart_admin.html#upgrade Feel free to let us know if you have any questions about upgrading. Thanks, Ben *** This ticket has been marked as a duplicate of ticket 4833 ***