Ticket 5170 - unable to cancel array job
Summary: unable to cancel array job
Status: RESOLVED DUPLICATE of ticket 4833
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.1
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
Reported: 2018-05-14 11:17 MDT by Jonathon Anderson
Modified: 2018-05-16 09:53 MDT
Site: University of Colorado


Description Jonathon Anderson 2018-05-14 11:17:17 MDT
This user would like to cancel his array job 796729.

[root@slurm5 ~]# squeue -u luev6784
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            915122      smem mdd_ldms luev6784  R 1-00:54:01      1 smem0401
            915121      smem mdd_ldms luev6784  R 1-00:54:13      1 smem0201
            915117      smem      gad luev6784  R 1-01:03:46      1 smem0101
            915116      smem      gad luev6784  R 1-01:04:46      1 smem0501
          796729_2      smem       LD luev6784 RH       0:00      1 (JobHoldMaxRequeue)
          796729_4      smem       LD luev6784 RH       0:00      1 (JobHoldMaxRequeue)
          796729_5      smem       LD luev6784 RH       0:00      1 (JobHoldMaxRequeue)
          796729_6      smem       LD luev6784 RH       0:00      1 (JobHoldMaxRequeue)
          796729_7      smem       LD luev6784 RH       0:00      1 (JobHoldMaxRequeue)
         796729_12      smem       LD luev6784 RH       0:00      1 (JobHoldMaxRequeue)

But scancel fails, whether run as the user or as an admin. The slurmctld log lines produced by each attempt are shown after the command:


[root@slurm5 ~]# scancel 796729

[2018-05-14T11:15:47.377] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 796729 uid 0
[2018-05-14T11:15:47.377] job_str_signal(3): invalid job id 796729
[2018-05-14T11:15:47.377] _slurm_rpc_kill_job: job_str_signal() job 796729 sig 9 returned Invalid job id specified


[root@slurm5 ~]# scancel 796729_2

[2018-05-14T11:15:50.516] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 796729_2 uid 0
[2018-05-14T11:15:50.517] job_str_signal(5): invalid job id 796729_2
[2018-05-14T11:15:50.517] _slurm_rpc_kill_job: job_str_signal() job 796729_2 sig 9 returned Invalid job id specified

Please advise.
Comment 3 Isaac Hartung 2018-05-16 09:53:42 MDT
Hi Jonathon,

This bug has been fixed in 17.11.4.  For an explanation of the problem and its solution, you can look at bug 4833.

I am going to close this bug as a duplicate, but should the fix in 17.11.4 not solve your problem, please comment here/reopen this ticket.

Regards,
Isaac

*** This ticket has been marked as a duplicate of ticket 4833 ***
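
Since Comment 3 ties the fix to 17.11.4 and the ticket was filed against 17.11.1, a site might want a quick way to check whether a given controller already carries the fix. The following is only a minimal sketch: the helper names `parse_slurm_version` and `has_fix` are hypothetical, and it assumes the version string has the form `slurm 17.11.1` (the kind of output printed by commands such as `sinfo --version`). It also assumes that any release numbered above 17.11.4 includes the fix, which this ticket does not itself confirm.

```python
import re

# Release named in Comment 3 as containing the fix.
FIXED_IN = (17, 11, 4)


def parse_slurm_version(text):
    """Extract (major, minor, micro) from a string such as 'slurm 17.11.1'.

    Hypothetical helper: assumes the first x.y.z triple is the version.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no x.y.z version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(version_text, fixed_in=FIXED_IN):
    """Return True if the version is at or above the release with the fix.

    Assumption: every release after `fixed_in` also carries the fix.
    """
    return parse_slurm_version(version_text) >= fixed_in


if __name__ == "__main__":
    # 17.11.1 is the version this ticket was reported against.
    for reported in ("slurm 17.11.1", "slurm 17.11.4"):
        status = "has fix" if has_fix(reported) else "predates fix"
        print(f"{reported}: {status}")
```

The comparison uses integer tuples rather than strings so that, for example, 17.11.10 would sort after 17.11.4.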