Ticket 5170

Summary: unable to cancel array job
Product: Slurm Reporter: Jonathon Anderson <jonathon.anderson>
Component: slurmctld Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 17.11.1   
Hardware: Linux   
OS: Linux   
Site: University of Colorado

Description Jonathon Anderson 2018-05-14 11:17:17 MDT
This user would like to cancel his array job 796729.

[root@slurm5 ~]# squeue -u luev6784
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            915122      smem mdd_ldms luev6784  R 1-00:54:01      1 smem0401
            915121      smem mdd_ldms luev6784  R 1-00:54:13      1 smem0201
            915117      smem      gad luev6784  R 1-01:03:46      1 smem0101
            915116      smem      gad luev6784  R 1-01:04:46      1 smem0501
          796729_2      smem       LD luev6784 RH       0:00      1 (JobHoldMaxRequeue)
          796729_4      smem       LD luev6784 RH       0:00      1 (JobHoldMaxRequeue)
          796729_5      smem       LD luev6784 RH       0:00      1 (JobHoldMaxRequeue)
          796729_6      smem       LD luev6784 RH       0:00      1 (JobHoldMaxRequeue)
          796729_7      smem       LD luev6784 RH       0:00      1 (JobHoldMaxRequeue)
         796729_12      smem       LD luev6784 RH       0:00      1 (JobHoldMaxRequeue)

But scancel fails whether it is run as the user or as an admin. Each attempt below is followed by the corresponding slurmctld log lines.


[root@slurm5 ~]# scancel 796729

[2018-05-14T11:15:47.377] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 796729 uid 0
[2018-05-14T11:15:47.377] job_str_signal(3): invalid job id 796729
[2018-05-14T11:15:47.377] _slurm_rpc_kill_job: job_str_signal() job 796729 sig 9 returned Invalid job id specified


[root@slurm5 ~]# scancel 796729_2

[2018-05-14T11:15:50.516] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 796729_2 uid 0
[2018-05-14T11:15:50.517] job_str_signal(5): invalid job id 796729_2
[2018-05-14T11:15:50.517] _slurm_rpc_kill_job: job_str_signal() job 796729_2 sig 9 returned Invalid job id specified

Please advise.

Comment 3 Isaac Hartung 2018-05-16 09:53:42 MDT
Hi Jonathon,

This bug has been fixed in 17.11.4.  For an explanation of the problem and its solution, you can look at bug 4833.

I am going to close this bug as a duplicate, but if the fix in 17.11.4 does not solve your problem, please comment here or reopen this ticket.

Regards,
Isaac

*** This ticket has been marked as a duplicate of ticket 4833 ***
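
Since the resolution depends on running 17.11.4 or later, a quick check of the installed version may be useful before retrying scancel. The following is only a sketch: the helper names version_ge and has_array_cancel_fix are hypothetical, the comparison relies on GNU `sort -V`, and it assumes `scontrol --version` prints a line of the form "slurm 17.11.1".

```shell
#!/bin/sh
# version_ge A B: hypothetical helper, true when version A >= version B.
# Relies on `sort -V` (GNU coreutils version sort) to order the two strings.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# has_array_cancel_fix VERSION: true when VERSION is 17.11.4 or later,
# the release named in this ticket as containing the fix.
has_array_cancel_fix() {
    version_ge "$1" "17.11.4"
}

# Only query Slurm when scontrol is actually on the PATH.
# Assumption: `scontrol --version` prints e.g. "slurm 17.11.1".
if command -v scontrol >/dev/null 2>&1; then
    installed=$(scontrol --version | awk '{print $2}')
    if has_array_cancel_fix "$installed"; then
        echo "Slurm $installed should include the fix from 17.11.4"
    else
        echo "Slurm $installed predates 17.11.4; an upgrade is likely needed"
    fi
fi
```

On the 17.11.1 installation reported here, this check would take the second branch.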