Ticket 994 - scancel fails to cancel job
Summary: scancel fails to cancel job
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 14.03.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: David Bigagli
 
Reported: 2014-07-27 17:14 MDT by Stuart Midgley
Modified: 2014-07-28 08:08 MDT

See Also:
Site: DownUnder GeoSolutions
Version Fixed: 14.03.7


Description Stuart Midgley 2014-07-27 17:14:23 MDT
Here is the output from about 10 s of a running job. Notice how the job goes from PD -> R -> CG -> SE.

Given where I issued the scancel, why wasn't the job cancelled?


20140728130813 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     100        error_job             stuartm PD       0:00      1 (None)          5060732
20140728130814 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     100        error_job             stuartm PD       0:00      1 (None)          5060732
20140728130815 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     100        error_job             stuartm  R       0:00      1 bud98           5060732
20140728130815 bud30:scripts> scancel 5060732
20140728130820 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     0          error_job             stuartm CG       0:00      1 bud98           5060732
20140728130821 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     0          error_job             stuartm CG       0:00      1 bud98           5060732
20140728130829 bud30:scripts> squeue -u stuartm
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teamdev     0          error_job             stuartm SE       0:00      1 (JobHeldUser)   5060732

The job is

#!/bin/bash
#rj cpus=1 name=error_job io=0 mem=1000m queue=teamdev

sleep 10
exit 100
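
As an aside (not from the ticket itself): the "status 25600" entries in the slurmctld log below are the raw wait(2) status of the batch script, in which the exit code occupies the high byte, so `exit 100` shows up as 25600:

```shell
# Decode a raw wait(2) status as logged by slurmctld: the script's
# exit code is the high byte, so shift right by 8 (100 << 8 == 25600).
raw_status=25600
exit_code=$(( raw_status >> 8 ))
echo "exit code: ${exit_code}"
```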


The requeue occurred after the cancel:

140728115411 pque0001:etc# grep 5060732 /var/log/slurm/slurmctld.log
[2014-07-28T10:47:08.913] _slurm_rpc_submit_batch_job JobId=5060732 usec=5631
[2014-07-28T10:47:11.488] sched: Allocate JobId=5060732 NodeList=bud98 #CPUs=1
[2014-07-28T10:47:22.986] completing job 5060732 status 25600
[2014-07-28T10:47:23.078] sched: job_complete for JobId=5060732 successful, exit code=25600
[2014-07-28T10:47:23.613] _slurm_rpc_requeue: 5060732: usec=486
[2014-07-28T10:47:24.334] requeue batch job 5060732
[2014-07-28T11:49:38.042] sched: update_job: releasing hold for job_id 5060732, new priority is 100
[2014-07-28T11:49:38.043] _slurm_rpc_update_job complete JobId=5060732 uid=0 usec=514
[2014-07-28T11:49:43.136] sched: Allocate JobId=5060732 NodeList=bud98 #CPUs=1
[2014-07-28T11:49:55.072] completing job 5060732 status 25600
[2014-07-28T11:49:55.611] sched: job_complete for JobId=5060732 successful, exit code=25600
[2014-07-28T11:49:56.147] _slurm_rpc_requeue: 5060732: usec=1137
[2014-07-28T11:49:57.020] requeue batch job 5060732
[2014-07-28T13:08:05.175] sched: update_job: releasing hold for job_id 5060732, new priority is 100
[2014-07-28T13:08:05.188] _slurm_rpc_update_job complete JobId=5060732 uid=3005 usec=12979
[2014-07-28T13:08:15.496] sched: Allocate JobId=5060732 NodeList=bud98 #CPUs=1
[2014-07-28T13:08:20.451] sched: Cancel of JobId=5060732 by UID=3005, usec=193107
[2014-07-28T13:08:21.264] _slurm_rpc_requeue: 5060732: usec=13706
[2014-07-28T13:08:21.542] completing job 5060732 status 15
[2014-07-28T13:08:23.685] requeue batch job 5060732


I assume this is due to the job record staying around inside Slurm for a few minutes, allowing it to be resurrected?
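
For context: the SE state shown by squeue is Slurm's SpecialExit hold, which a site can set from an EpilogSlurmctld script via `scontrol requeuehold`. A minimal sketch of such a script follows; the script itself, the exit-code-100 convention, and the use of SLURM_JOB_EXIT_CODE2 are assumptions for illustration, not something confirmed by this ticket:

```shell
#!/bin/bash
# Hypothetical EpilogSlurmctld sketch (assumed, not from the ticket):
# requeue-hold jobs that exit with the site's "special" code 100.

special_exit() {
    # SLURM_JOB_EXIT_CODE2 has the form "<exit>:<signal>"; keep the exit part.
    local code="${1%%:*}"
    [ "$code" -eq 100 ] 2>/dev/null
}

if special_exit "${SLURM_JOB_EXIT_CODE2:-0:0}"; then
    # Puts the job in the SpecialExit (SE) hold state seen in squeue.
    scontrol requeuehold State=SpecialExit "${SLURM_JOB_ID}"
fi
```

A script along these lines would explain the uid=0 `_slurm_rpc_requeue` entries in the log and the JobHeldUser reason in the squeue output.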
Comment 1 David Bigagli 2014-07-28 08:08:17 MDT
Hi,
   this was a bug. The job got requeued by the EpilogSlurmctld because the
job's exit status was not cleared from the previous requeue.
Fixed in commit b0db5afc16c01c0.

David