Ticket 3072

Summary: "scancel garbage" returns 0 and has no output
Product: Slurm Reporter: Ryan Cox <ryan_cox>
Component: User CommandsAssignee: Alejandro Sanchez <alex>
Status: RESOLVED FIXED QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: alex
Version: 15.08.12   
Hardware: Linux   
OS: Linux   
Site: BYU - Brigham Young University Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 17.02
Target Release: 17.02 DevPrio: 4 - Medium
Emory-Cloud Sites: ---

Description Ryan Cox 2016-09-09 15:42:15 MDT
$ scancel garbage
$ echo $?
0
$

One of users was mistyping job IDs and didn't realize it since it never errored out.
Comment 1 Alejandro Sanchez 2016-09-12 04:00:51 MDT
Hi Ryan, I can easily reproduce this. Not sure at all if scancel should exit with 1 since we're not requesting a valid jobid to be scancel'd. Gonna discuss it with the team.
Comment 2 Alejandro Sanchez 2016-09-12 05:28:58 MDT
I've been observing how the Linux `kill` command behaves on different use cases. I think scancel should behave similar to this command, but it doesn't.

So for instance, we request to kill 1 valid process and one invalid process:

$ sleep 99999 &
[1] 17007
$ ps -ef | grep sleep | grep -v grep
alex     17007  4208  0 13:14 pts/10   00:00:00 sleep 99999
$ kill 17007 garbage
bash: kill: garbage: arguments must be process or job IDs
$ echo $?
0
[1]+  Terminated              sleep 99999

Result: valid process is killed, error message is shown indicating the invalid process and exit code is 0, so it's an optimistic exit code.

If instead we request to kill a single invalid process, the exit code is 1:

$ kill garbage
bash: kill: garbage: arguments must be process or job IDs
$ echo $?
1

So it seems that by default the exit code is 1 unless a valid process is signaled with success.

Now let's see how scancel behaves when requested 1 valid and 1 invalid job_id: 

$ scancel 20010 garbage
scancel: error: Invalid job_id garbage
$ echo $?
1

I see two differences here respect to `kill`: first one is that valid job_id is not killed. Second one is the exit code.

Now let's see behavior with single invalid job_id (your description comment):

$ scancel garbage
$ echo $?
0

Behavior is also different to `kill` command.

Another issue I see is the exit code when trying to scancel a finished job_id:

$ sacct -j 20012 -o jobid,state
       JobID      State 
------------ ---------- 
20012        CANCELLED+ 
20012.batch   CANCELLED 
$ scancel 20012
$ echo $?
0

So again my understanding is that scancel should behave as `kill`, but this can be discussed with the team and we're also open to listen to your thoughts on this. Thanks Ryan for reporting.
Comment 3 Ryan Cox 2016-09-12 09:09:03 MDT
I hadn't thought about how it might be designed to be like the kill command.  /bin/kill seems to only use return codes but the built-in kill in both bash and tcsh do output errors for non-existent processes.

At the very least scancel should use a non-zero return code for a non-existent job.  I would also vote for an error message on stderr just like bash and tcsh's kill implementations do.  I'm always a fan of more verbose error messages.

I gave an example of "garbage" as the job ID, but the user was accidentally using a truncated job ID.  It was an integer, just not an integer that represented a real job.  So really, we should be talking about "scancel 123" (where 123 doesn't exist) instead of "scancel garbage" as I initially posted.
Comment 4 Alejandro Sanchez 2016-09-12 09:24:02 MDT
Ok so I'd vote for this design:

1. scancel invalid_jobid
Result: show error message and exit with status 1.

2. scancel valid_jobid invalid_jobid
Result: try to cancel the valid_jobid, show error message for the invalid one and use the highest rc, which in this case would be 1 because of the invalid_jobid.

3. Trying to scancel jobs in states different to RUNNING, PENDING or SUSPENDED should show an error too and exit with status 1.

Also would it be a problem if we change severity to 5 and address this change to 17.02? Thanks.
Comment 5 Ryan Cox 2016-09-12 09:32:00 MDT
Sounds good. Thanks.
Comment 9 Moe Jette 2017-01-12 18:58:19 MST
I've committed a few patches for this.

This one from Alex logs an error any time the job ID on the execute line doesn't parse as a number:
https://github.com/SchedMD/slurm/commit/3c713c10b48a3d10c64db445c57a0c6f45614237
$ scancel foo
scancel: error: Invalid job id foo
$ echo $?
1

The second patch logs when no jobs satisfy the filter specified, but only in verbose mode:
https://github.com/SchedMD/slurm/commit/568b23aff683aa76b32d283a8ec45896e2554c44
$ scancel --account=aa --qos=bb -v
scancel: Linear node selection plugin loaded with argument 20
scancel: Consumable Resources (CR) Node Selection plugin loaded with argument 20
scancel: Cray node selection plugin loaded
scancel: Serial Job Resource Selection plugin loaded with argument 20
scancel: error: No active jobs matching ALL job filters, including: account=aa qos=bb 
$ echo $?
0

Finally, a test was added to our regression test suite:
https://github.com/SchedMD/slurm/commit/4db1dabcdeca41f74d390140b624833d3d623746

Is this sufficient?
Comment 12 Alejandro Sanchez 2017-02-03 05:48:39 MST
Ryan, can we go ahead and mark this as resolved as per the fixes provided? Thanks.
Comment 13 Ryan Cox 2017-02-03 08:17:32 MST
Yes. Thanks