Ticket 3072

Summary:	"scancel garbage" returns 0 and has no output
Product:	Slurm	Reporter:	Ryan Cox <ryan_cox>
Component:	User Commands	Assignee:	Alejandro Sanchez <alex>
Status:	RESOLVED FIXED	QA Contact:
Severity:	5 - Enhancement
Priority:	---	CC:	alex
Version:	15.08.12
Hardware:	Linux
OS:	Linux
Site:	BYU - Brigham Young University	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	17.02
Target Release:	17.02	DevPrio:	4 - Medium
Emory-Cloud Sites:	---

Description Ryan Cox 2016-09-09 15:42:15 MDT

$ scancel garbage
$ echo $?
0
$

One of users was mistyping job IDs and didn't realize it since it never errored out.

Comment 1 Alejandro Sanchez 2016-09-12 04:00:51 MDT

Hi Ryan, I can easily reproduce this. Not sure at all if scancel should exit with 1 since we're not requesting a valid jobid to be scancel'd. Gonna discuss it with the team.

Comment 2 Alejandro Sanchez 2016-09-12 05:28:58 MDT

I've been observing how the Linux `kill` command behaves on different use cases. I think scancel should behave similar to this command, but it doesn't.

So for instance, we request to kill 1 valid process and one invalid process:

$ sleep 99999 &
[1] 17007
$ ps -ef | grep sleep | grep -v grep
alex     17007  4208  0 13:14 pts/10   00:00:00 sleep 99999
$ kill 17007 garbage
bash: kill: garbage: arguments must be process or job IDs
$ echo $?
0
[1]+  Terminated              sleep 99999

Result: valid process is killed, error message is shown indicating the invalid process and exit code is 0, so it's an optimistic exit code.

If instead we request to kill a single invalid process, the exit code is 1:

$ kill garbage
bash: kill: garbage: arguments must be process or job IDs
$ echo $?
1

So it seems that by default the exit code is 1 unless a valid process is signaled with success.

Now let's see how scancel behaves when requested 1 valid and 1 invalid job_id: 

$ scancel 20010 garbage
scancel: error: Invalid job_id garbage
$ echo $?
1

I see two differences here respect to `kill`: first one is that valid job_id is not killed. Second one is the exit code.

Now let's see behavior with single invalid job_id (your description comment):

$ scancel garbage
$ echo $?
0

Behavior is also different to `kill` command.

Another issue I see is the exit code when trying to scancel a finished job_id:

$ sacct -j 20012 -o jobid,state
       JobID      State 
------------ ---------- 
20012        CANCELLED+ 
20012.batch   CANCELLED 
$ scancel 20012
$ echo $?
0

So again my understanding is that scancel should behave as `kill`, but this can be discussed with the team and we're also open to listen to your thoughts on this. Thanks Ryan for reporting.

Comment 3 Ryan Cox 2016-09-12 09:09:03 MDT

I hadn't thought about how it might be designed to be like the kill command.  /bin/kill seems to only use return codes but the built-in kill in both bash and tcsh do output errors for non-existent processes.

At the very least scancel should use a non-zero return code for a non-existent job.  I would also vote for an error message on stderr just like bash and tcsh's kill implementations do.  I'm always a fan of more verbose error messages.

I gave an example of "garbage" as the job ID, but the user was accidentally using a truncated job ID.  It was an integer, just not an integer that represented a real job.  So really, we should be talking about "scancel 123" (where 123 doesn't exist) instead of "scancel garbage" as I initially posted.

Comment 4 Alejandro Sanchez 2016-09-12 09:24:02 MDT

Ok so I'd vote for this design:

1. scancel invalid_jobid
Result: show error message and exit with status 1.

2. scancel valid_jobid invalid_jobid
Result: try to cancel the valid_jobid, show error message for the invalid one and use the highest rc, which in this case would be 1 because of the invalid_jobid.

3. Trying to scancel jobs in states different to RUNNING, PENDING or SUSPENDED should show an error too and exit with status 1.

Also would it be a problem if we change severity to 5 and address this change to 17.02? Thanks.

Comment 5 Ryan Cox 2016-09-12 09:32:00 MDT

Sounds good. Thanks.

Comment 9 Moe Jette 2017-01-12 18:58:19 MST

I've committed a few patches for this.

This one from Alex logs an error any time the job ID on the execute line doesn't parse as a number:
https://github.com/SchedMD/slurm/commit/3c713c10b48a3d10c64db445c57a0c6f45614237
$ scancel foo
scancel: error: Invalid job id foo
$ echo $?
1

The second patch logs when no jobs satisfy the filter specified, but only in verbose mode:
https://github.com/SchedMD/slurm/commit/568b23aff683aa76b32d283a8ec45896e2554c44
$ scancel --account=aa --qos=bb -v
scancel: Linear node selection plugin loaded with argument 20
scancel: Consumable Resources (CR) Node Selection plugin loaded with argument 20
scancel: Cray node selection plugin loaded
scancel: Serial Job Resource Selection plugin loaded with argument 20
scancel: error: No active jobs matching ALL job filters, including: account=aa qos=bb 
$ echo $?
0

Finally, a test was added to our regression test suite:
https://github.com/SchedMD/slurm/commit/4db1dabcdeca41f74d390140b624833d3d623746

Is this sufficient?

Comment 12 Alejandro Sanchez 2017-02-03 05:48:39 MST

Ryan, can we go ahead and mark this as resolved as per the fixes provided? Thanks.

Comment 13 Ryan Cox 2017-02-03 08:17:32 MST

Yes. Thanks