Ticket 994

Summary: scancel fails to cancel job
Product: Slurm    Reporter: Stuart Midgley <stuartm>
Component: Other    Assignee: David Bigagli <david>
Status: RESOLVED FIXED    QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: da
Version: 14.03.6   
Hardware: Linux   
OS: Linux   
Site: DownUnder GeoSolutions
Version Fixed: 14.03.7

Description Stuart Midgley 2014-07-27 17:14:23 MDT
Here is the output from about 10 seconds of a running job.  Notice how the job goes from PD -> R -> CG -> SE.

Given where I issued the scancel, why wasn't the job cancelled?


20140728130813 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     100        error_job             stuartm PD       0:00      1 (None)          5060732
20140728130814 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     100        error_job             stuartm PD       0:00      1 (None)          5060732
20140728130815 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     100        error_job             stuartm  R       0:00      1 bud98           5060732
20140728130815 bud30:scripts> scancel 5060732
20140728130820 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     0          error_job             stuartm CG       0:00      1 bud98           5060732
20140728130821 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     0          error_job             stuartm CG       0:00      1 bud98           5060732
20140728130829 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     0          error_job             stuartm SE       0:00      1 (JobHeldUser)   5060732

The job script is:

#!/bin/bash
#rj cpus=1 name=error_job io=0 mem=1000m queue=teamdev

sleep 10
exit 100


The requeue occurred after the cancel:

140728115411 pque0001:etc# grep 5060732 /var/log/slurm/slurmctld.log
[2014-07-28T10:47:08.913] _slurm_rpc_submit_batch_job JobId=5060732 usec=5631
[2014-07-28T10:47:11.488] sched: Allocate JobId=5060732 NodeList=bud98 #CPUs=1
[2014-07-28T10:47:22.986] completing job 5060732 status 25600
[2014-07-28T10:47:23.078] sched: job_complete for JobId=5060732 successful, exit code=25600
[2014-07-28T10:47:23.613] _slurm_rpc_requeue: 5060732: usec=486
[2014-07-28T10:47:24.334] requeue batch job 5060732
[2014-07-28T11:49:38.042] sched: update_job: releasing hold for job_id 5060732, new priority is 100
[2014-07-28T11:49:38.043] _slurm_rpc_update_job complete JobId=5060732 uid=0 usec=514
[2014-07-28T11:49:43.136] sched: Allocate JobId=5060732 NodeList=bud98 #CPUs=1
[2014-07-28T11:49:55.072] completing job 5060732 status 25600
[2014-07-28T11:49:55.611] sched: job_complete for JobId=5060732 successful, exit code=25600
[2014-07-28T11:49:56.147] _slurm_rpc_requeue: 5060732: usec=1137
[2014-07-28T11:49:57.020] requeue batch job 5060732
[2014-07-28T13:08:05.175] sched: update_job: releasing hold for job_id 5060732, new priority is 100
[2014-07-28T13:08:05.188] _slurm_rpc_update_job complete JobId=5060732 uid=3005 usec=12979
[2014-07-28T13:08:15.496] sched: Allocate JobId=5060732 NodeList=bud98 #CPUs=1
[2014-07-28T13:08:20.451] sched: Cancel of JobId=5060732 by UID=3005, usec=193107
[2014-07-28T13:08:21.264] _slurm_rpc_requeue: 5060732: usec=13706
[2014-07-28T13:08:21.542] completing job 5060732 status 15
[2014-07-28T13:08:23.685] requeue batch job 5060732
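
As an aside, the status values in the log above are raw wait(2)-style statuses and decode as follows (the exit code is in the high byte, the terminating signal in the low byte):

```shell
# "completing job 5060732 status 25600": the script's "exit 100"
# encoded as 100 << 8 in the high byte of the wait status.
status=25600
echo "exit code: $(( status >> 8 ))"     # prints: exit code: 100

# "completing job 5060732 status 15" after the scancel: a signal
# number in the low byte, i.e. terminated by signal 15 (SIGTERM).
status=15
echo "signal: $(( status & 127 ))"       # prints: signal: 15
```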


I assume this is due to the job record staying around inside Slurm for a few minutes, allowing the job to be resurrected?
Comment 1 David Bigagli 2014-07-28 08:08:17 MDT
Hi,
   this was a bug: the job got requeued by the slurmctld epilog handling because the job's exit status from the previous requeue had not been cleared.
Fixed in commit b0db5afc16c01c0.

David
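
The explanation above can be sketched in a few lines of shell. This is an illustrative model only; the names are hypothetical and this is not Slurm's actual code. It shows how an exit status left over from a previous requeue can resurrect a job that was just cancelled:

```shell
#!/bin/sh
# Hypothetical model of the bug (not Slurm internals).

# Epilog-style requeue decision: requeue when the recorded exit
# code (high byte of the wait status) is nonzero.
should_requeue() {
    [ $(( $1 >> 8 )) -ne 0 ]
}

# First run: the script exits 100, so the recorded status is
# 100 << 8 = 25600, and the job is requeued.
buggy_status=$(( 100 << 8 ))   # buggy: stale status survives the requeue
fixed_status=0                 # fixed: status cleared on requeue

# Second run is scancel'ed before any new exit code is recorded,
# so the requeue decision sees whatever status was left behind.
should_requeue "$buggy_status" && echo "buggy: requeued again"
should_requeue "$fixed_status" || echo "fixed: stays cancelled"
```

With the stale status in place the second run is requeued despite the cancel, matching the PD -> R -> CG -> SE sequence in the original report; clearing the status on requeue lets the cancel stick.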