Here is the output from about 10 seconds of a running job. Notice how the job goes from PD -> R -> CG -> SE. Given where I issued the scancel, why wasn't the job cancelled?

20140728130813 bud30:scripts> squeue -u stuartm
PARTITION PRIORITY      NAME    USER ST TIME NODES NODELIST(REASON   JOBID
  teamdev      100 error_job stuartm PD 0:00     1 (None)          5060732
20140728130814 bud30:scripts> squeue -u stuartm
PARTITION PRIORITY      NAME    USER ST TIME NODES NODELIST(REASON   JOBID
  teamdev      100 error_job stuartm PD 0:00     1 (None)          5060732
20140728130815 bud30:scripts> squeue -u stuartm
PARTITION PRIORITY      NAME    USER ST TIME NODES NODELIST(REASON   JOBID
  teamdev      100 error_job stuartm R  0:00     1 bud98           5060732
20140728130815 bud30:scripts> scancel 5060732
20140728130820 bud30:scripts> squeue -u stuartm
PARTITION PRIORITY      NAME    USER ST TIME NODES NODELIST(REASON   JOBID
  teamdev        0 error_job stuartm CG 0:00     1 bud98           5060732
20140728130821 bud30:scripts> squeue -u stuartm
PARTITION PRIORITY      NAME    USER ST TIME NODES NODELIST(REASON   JOBID
  teamdev        0 error_job stuartm CG 0:00     1 bud98           5060732
20140728130829 bud30:scripts> squeue -u stuartm
PARTITION PRIORITY      NAME    USER ST TIME NODES NODELIST(REASON   JOBID
  teamdev        0 error_job stuartm SE 0:00     1 (JobHeldUser)   5060732

The job is:

#!/bin/bash
#rj cpus=1 name=error_job io=0 mem=1000m queue=teamdev
sleep 10
exit 100

The requeue occurred after the cancel:

140728115411 pque0001:etc# grep 5060732 /var/log/slurm/slurmctld.log
[2014-07-28T10:47:08.913] _slurm_rpc_submit_batch_job JobId=5060732 usec=5631
[2014-07-28T10:47:11.488] sched: Allocate JobId=5060732 NodeList=bud98 #CPUs=1
[2014-07-28T10:47:22.986] completing job 5060732 status 25600
[2014-07-28T10:47:23.078] sched: job_complete for JobId=5060732 successful, exit code=25600
[2014-07-28T10:47:23.613] _slurm_rpc_requeue: 5060732: usec=486
[2014-07-28T10:47:24.334] requeue batch job 5060732
[2014-07-28T11:49:38.042] sched: update_job: releasing hold for job_id 5060732, new priority is 100
[2014-07-28T11:49:38.043] _slurm_rpc_update_job complete JobId=5060732 uid=0 usec=514
[2014-07-28T11:49:43.136] sched: Allocate JobId=5060732 NodeList=bud98 #CPUs=1
[2014-07-28T11:49:55.072] completing job 5060732 status 25600
[2014-07-28T11:49:55.611] sched: job_complete for JobId=5060732 successful, exit code=25600
[2014-07-28T11:49:56.147] _slurm_rpc_requeue: 5060732: usec=1137
[2014-07-28T11:49:57.020] requeue batch job 5060732
[2014-07-28T13:08:05.175] sched: update_job: releasing hold for job_id 5060732, new priority is 100
[2014-07-28T13:08:05.188] _slurm_rpc_update_job complete JobId=5060732 uid=3005 usec=12979
[2014-07-28T13:08:15.496] sched: Allocate JobId=5060732 NodeList=bud98 #CPUs=1
[2014-07-28T13:08:20.451] sched: Cancel of JobId=5060732 by UID=3005, usec=193107
[2014-07-28T13:08:21.264] _slurm_rpc_requeue: 5060732: usec=13706
[2014-07-28T13:08:21.542] completing job 5060732 status 15
[2014-07-28T13:08:23.685] requeue batch job 5060732

I assume this is because the job record stays around inside Slurm for a few minutes, allowing it to be resurrected?
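As an aside, the "status" values in the log are raw wait()-style statuses, which is why the script's `exit 100` shows up as 25600 (100 << 8) and the scancel run shows up as 15 (SIGTERM). A minimal sketch to decode them, using the two values from the log above:

```shell
#!/bin/bash
# Decode a raw wait()-style status: low 7 bits = terminating signal,
# high byte = exit code (only meaningful when no signal was delivered).
decode() {
  local status=$1
  local sig=$((status & 0x7f))
  local code=$((status >> 8))
  if [ "$sig" -ne 0 ]; then
    echo "status $status: killed by signal $sig"
  else
    echo "status $status: exited with code $code"
  fi
}

decode 25600   # the script's "exit 100": 100 << 8 = 25600
decode 15      # after scancel: SIGTERM is signal 15
```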
Hi,

This was a bug: the job got requeued by the slurmctld epilog handling because the job's exit status was not cleared from the previous requeue. Fixed in commit b0db5afc16c01c0.

David
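For sites hit by this before picking up the fix, one possible mitigation (my own suggestion, not part of David's reply) is to suppress automatic requeueing, either cluster-wide or per job. A sketch:

```shell
# slurm.conf fragment: disable automatic requeue of batch jobs cluster-wide
# JobRequeue=0

# or per job, at submission time:
sbatch --no-requeue error_job.sh
```

Note this changes behaviour for legitimate requeues (e.g. after node failure) as well, so it is a trade-off rather than a drop-in fix.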