| Summary: | "scancel garbage" returns 0 and has no output | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Ryan Cox <ryan_cox> |
| Component: | User Commands | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | alex |
| Version: | 15.08.12 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | BYU - Brigham Young University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 17.02 | Target Release: | 17.02 |
| DevPrio: | 4 - Medium | Emory-Cloud Sites: | --- |
|
Description
Ryan Cox
2016-09-09 15:42:15 MDT
Hi Ryan, I can easily reproduce this. Not sure at all if scancel should exit with 1 since we're not requesting a valid jobid to be scancel'd. Gonna discuss it with the team. I've been observing how the Linux `kill` command behaves on different use cases. I think scancel should behave similar to this command, but it doesn't.
So for instance, we request to kill 1 valid process and one invalid process:
$ sleep 99999 &
[1] 17007
$ ps -ef | grep sleep | grep -v grep
alex 17007 4208 0 13:14 pts/10 00:00:00 sleep 99999
$ kill 17007 garbage
bash: kill: garbage: arguments must be process or job IDs
$ echo $?
0
[1]+ Terminated sleep 99999
Result: valid process is killed, error message is shown indicating the invalid process and exit code is 0, so it's an optimistic exit code.
If instead we request to kill a single invalid process, the exit code is 1:
$ kill garbage
bash: kill: garbage: arguments must be process or job IDs
$ echo $?
1
So it seems that by default the exit code is 1 unless a valid process is signaled with success.
Now let's see how scancel behaves when requested 1 valid and 1 invalid job_id:
$ scancel 20010 garbage
scancel: error: Invalid job_id garbage
$ echo $?
1
I see two differences here respect to `kill`: first one is that valid job_id is not killed. Second one is the exit code.
Now let's see behavior with single invalid job_id (your description comment):
$ scancel garbage
$ echo $?
0
Behavior is also different to `kill` command.
Another issue I see is the exit code when trying to scancel a finished job_id:
$ sacct -j 20012 -o jobid,state
JobID State
------------ ----------
20012 CANCELLED+
20012.batch CANCELLED
$ scancel 20012
$ echo $?
0
So again my understanding is that scancel should behave as `kill`, but this can be discussed with the team and we're also open to listen to your thoughts on this. Thanks Ryan for reporting.
I hadn't thought about how it might be designed to be like the kill command. /bin/kill seems to only use return codes but the built-in kill in both bash and tcsh do output errors for non-existent processes. At the very least scancel should use a non-zero return code for a non-existent job. I would also vote for an error message on stderr just like bash and tcsh's kill implementations do. I'm always a fan of more verbose error messages. I gave an example of "garbage" as the job ID, but the user was accidentally using a truncated job ID. It was an integer, just not an integer that represented a real job. So really, we should be talking about "scancel 123" (where 123 doesn't exist) instead of "scancel garbage" as I initially posted. Ok so I'd vote for this design: 1. scancel invalid_jobid Result: show error message and exit with status 1. 2. scancel valid_jobid invalid_jobid Result: try to cancel the valid_jobid, show error message for the invalid one and use the highest rc, which in this case would be 1 because of the invalid_jobid. 3. Trying to scancel jobs in states different to RUNNING, PENDING or SUSPENDED should show an error too and exit with status 1. Also would it be a problem if we change severity to 5 and address this change to 17.02? Thanks. Sounds good. Thanks. I've committed a few patches for this. This one from Alex logs an error any time the job ID on the execute line doesn't parse as a number: https://github.com/SchedMD/slurm/commit/3c713c10b48a3d10c64db445c57a0c6f45614237 $ scancel foo scancel: error: Invalid job id foo $ echo $? 1 The second patch logs when no jobs satisfy the filter specified, but only in verbose mode: https://github.com/SchedMD/slurm/commit/568b23aff683aa76b32d283a8ec45896e2554c44 $ scancel --account=aa --qos=bb -v scancel: Linear node selection plugin loaded with argument 20 scancel: Consumable Resources (CR) Node Selection plugin loaded with argument 20 scancel: Cray node selection plugin loaded scancel: Serial Job Resource Selection plugin loaded with argument 20 scancel: error: No active jobs matching ALL job filters, including: account=aa qos=bb $ echo $? 0 Finally, a test was added to our regression test suite: https://github.com/SchedMD/slurm/commit/4db1dabcdeca41f74d390140b624833d3d623746 Is this sufficient? Ryan, can we go ahead and mark this as resolved as per the fixes provided? Thanks. Yes. Thanks |