Ticket 994 - scancel fails to cancel job
Summary: scancel fails to cancel job
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 14.03.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: David Bigagli
 
Reported: 2014-07-27 17:14 MDT by Stuart Midgley
Modified: 2014-07-28 08:08 MDT

See Also:
Site: DownUnder GeoSolutions
Version Fixed: 14.03.7


Description Stuart Midgley 2014-07-27 17:14:23 MDT
Here is the output from about 10 s of a running job. Notice how the job goes from PD -> R -> CG -> SE.

Given where I issued the scancel, why wasn't the job cancelled?


20140728130813 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     100        error_job             stuartm PD       0:00      1 (None)          5060732
20140728130814 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     100        error_job             stuartm PD       0:00      1 (None)          5060732
20140728130815 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     100        error_job             stuartm  R       0:00      1 bud98           5060732
20140728130815 bud30:scripts> scancel 5060732
20140728130820 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     0          error_job             stuartm CG       0:00      1 bud98           5060732
20140728130821 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     0          error_job             stuartm CG       0:00      1 bud98           5060732
20140728130829 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     0          error_job             stuartm SE       0:00      1 (JobHeldUser)   5060732

The job is

#!/bin/bash
#rj cpus=1 name=error_job io=0 mem=1000m queue=teamdev

sleep 10
exit 100
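
As an aside (not from the ticket itself): the "status 25600" entries in the slurmctld log below are the raw wait(2) status of the batch script, in which the exit code occupies the high byte, so `exit 100` shows up as 25600:

```shell
# Decode a raw wait(2) status as logged by slurmctld: the script's
# exit code is the high byte, so shift right by 8 (100 << 8 == 25600).
raw_status=25600
exit_code=$(( raw_status >> 8 ))
echo "exit code: ${exit_code}"
```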


The requeue occurred after the cancel:

140728115411 pque0001:etc# grep 5060732 /var/log/slurm/slurmctld.log
[2014-07-28T10:47:08.913] _slurm_rpc_submit_batch_job JobId=5060732 usec=5631
[2014-07-28T10:47:11.488] sched: Allocate JobId=5060732 NodeList=bud98 #CPUs=1
[2014-07-28T10:47:22.986] completing job 5060732 status 25600
[2014-07-28T10:47:23.078] sched: job_complete for JobId=5060732 successful, exit code=25600
[2014-07-28T10:47:23.613] _slurm_rpc_requeue: 5060732: usec=486
[2014-07-28T10:47:24.334] requeue batch job 5060732
[2014-07-28T11:49:38.042] sched: update_job: releasing hold for job_id 5060732, new priority is 100
[2014-07-28T11:49:38.043] _slurm_rpc_update_job complete JobId=5060732 uid=0 usec=514
[2014-07-28T11:49:43.136] sched: Allocate JobId=5060732 NodeList=bud98 #CPUs=1
[2014-07-28T11:49:55.072] completing job 5060732 status 25600
[2014-07-28T11:49:55.611] sched: job_complete for JobId=5060732 successful, exit code=25600
[2014-07-28T11:49:56.147] _slurm_rpc_requeue: 5060732: usec=1137
[2014-07-28T11:49:57.020] requeue batch job 5060732
[2014-07-28T13:08:05.175] sched: update_job: releasing hold for job_id 5060732, new priority is 100
[2014-07-28T13:08:05.188] _slurm_rpc_update_job complete JobId=5060732 uid=3005 usec=12979
[2014-07-28T13:08:15.496] sched: Allocate JobId=5060732 NodeList=bud98 #CPUs=1
[2014-07-28T13:08:20.451] sched: Cancel of JobId=5060732 by UID=3005, usec=193107
[2014-07-28T13:08:21.264] _slurm_rpc_requeue: 5060732: usec=13706
[2014-07-28T13:08:21.542] completing job 5060732 status 15
[2014-07-28T13:08:23.685] requeue batch job 5060732


I assume this is due to the job record staying around inside Slurm for a few minutes, allowing it to be resurrected?
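
For context: the SE state shown by squeue is Slurm's SpecialExit hold, which a site can set from an EpilogSlurmctld script via `scontrol requeuehold`. A minimal sketch of such a script follows; the script itself, the exit-code-100 convention, and the use of SLURM_JOB_EXIT_CODE2 are assumptions for illustration, not something confirmed by this ticket:

```shell
#!/bin/bash
# Hypothetical EpilogSlurmctld sketch (assumed, not from the ticket):
# requeue-hold jobs that exit with the site's "special" code 100.

special_exit() {
    # SLURM_JOB_EXIT_CODE2 has the form "<exit>:<signal>"; keep the exit part.
    local code="${1%%:*}"
    [ "$code" -eq 100 ] 2>/dev/null
}

if special_exit "${SLURM_JOB_EXIT_CODE2:-0:0}"; then
    # Puts the job in the SpecialExit (SE) hold state seen in squeue.
    scontrol requeuehold State=SpecialExit "${SLURM_JOB_ID}"
fi
```

A script along these lines would explain the uid=0 `_slurm_rpc_requeue` entries in the log and the JobHeldUser reason in the squeue output.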
Comment 1 David Bigagli 2014-07-28 08:08:17 MDT
Hi,
   this was a bug. The job got requeued by the EpilogSlurmctld because the
job's exit status was not cleared from the previous requeue.
Fixed in commit b0db5afc16c01c0.

David