| Summary: | users can cancel each other's array jobs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmctld | Assignee: | David Bigagli <david> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 15.08.3 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 15.08.4 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Hello,
thanks for your report and detailed analysis. This bug is now fixed.
commit 8e66e26773352e5a27445a6b60a2134b632c3453
Author: David Bigagli <david@schedmd.com>
Date: Wed Nov 11 13:04:28 2015 +0100
Fix job cancelation bug.
The job array must have had at least some elements pending for this bug
to happen.
Thanks,
David
Hello,

After seeing a related post on the slurm-dev list from Markus Stohr, I decided to test if my array jobs could be deleted by another user (same account, no operator or other account coordination capabilities).

    dmj@cori07:~/svn/slurm_scripts> sbatch -p regular -a 1-10 --wrap "sleep 90"
    Submitted batch job 23545
    dmj@cori07:~/svn/slurm_scripts>

    ### another term
    nid00837:~ # su - yunhe
    yunhe@nid00837:~> scancel 23545
    yunhe@nid00837:~> sacctmgr show user yunhe
          User   Def Acct     Admin
    ---------- ---------- ---------
         yunhe      mpccc      None
    yunhe@nid00837:~>

    ### back to original
    dmj@cori07:~/svn/slurm_scripts> sacct -j 23545
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    23545_[1-10]       wrap    regular      mpccc          1 CANCELLED+      0:0
    dmj@cori07:~/svn/slurm_scripts> sbatch -p regular --wrap "sleep 90"
    Submitted batch job 23546
    dmj@cori07:~/svn/slurm_scripts>

    ### attempt to cancel non-array job
    yunhe@nid00837:~> scancel 23546
    scancel: error: Kill job error on job id 23546: Access/permission denied
    yunhe@nid00837:~>

slurmctld logs show:

    [2015-11-11T00:15:54.908] debug: _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 23545 uid 18456
    [2015-11-11T00:15:57.161] burst_buffer/cray: bb_p_job_cancel: JobID=23545_*
    [2015-11-11T00:15:57.161] _job_signal: of pending JobID=23545_* State=0x4 NodeCnt=0 successful
    ...
    [2015-11-11T00:17:36.130] debug: _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 23546 uid 18456
    [2015-11-11T00:17:36.131] error: Security violation, JOB_CANCEL RPC for jobID 23546 from uid 18456
    [2015-11-11T00:17:36.131] error: _slurm_rpc_kill_job2: job_str_signal() job 23546 sig 9 returned Access/permission denied

Looks like the issue is that for array jobs _job_signal is called instead of job_signal (job_signal seems to do the uid verification).

src/slurmctld/job_mgr.c:

    ...
    extern int job_str_signal(char *job_id_str, uint16_t signal, uint16_t flags,
                              uid_t uid, bool preempt)
    ...
            if (job_ptr && (job_ptr->array_task_id == NO_VAL) &&
                (job_ptr->array_recs == NULL)) {
                    /* This is a regular job, not a job array */
                    return job_signal(job_id, signal, flags, uid, preempt);
            }

            if (job_ptr && job_ptr->array_recs) {
                    /* This is a job array */
                    job_ptr_done = job_ptr;
                    rc = _job_signal(job_ptr, signal, flags, uid, preempt);
                    jobs_signalled++;
                    if (rc == ESLURM_ALREADY_DONE) {
                            jobs_done++;
                            rc = SLURM_SUCCESS;
                    }
            }
    ...

Thanks for looking at this,
Doug
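To illustrate the pattern described above, here is a minimal, self-contained sketch of checking the requesting uid once, before the code branches into the array or non-array path. It is not the change made in the fix commit and does not use Slurm's real structures. Every name in it (`sketch_job`, `sketch_uid_may_signal`, `sketch_signal_unchecked`, `sketch_job_signal`, the `SKETCH_*` codes) is hypothetical, and the policy "only the owner or root may signal" is an assumption that leaves out operators and account coordinators.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-ins; these are NOT Slurm's real types. */
#define SKETCH_SUCCESS        0
#define SKETCH_ACCESS_DENIED  1

struct sketch_job {
	unsigned int job_id;
	uid_t        owner_uid;  /* user who submitted the job */
	bool         is_array;   /* true when the record is a job array */
	bool         cancelled;
};

/* Assumed policy for this sketch: only the owner or root (uid 0). */
static bool sketch_uid_may_signal(const struct sketch_job *job, uid_t uid)
{
	assert(job != NULL);
	return (uid == 0) || (uid == job->owner_uid);
}

/*
 * Low-level helper that performs no permission check, analogous to
 * the unchecked path described in the report.
 */
static int sketch_signal_unchecked(struct sketch_job *job)
{
	assert(job != NULL);
	job->cancelled = true;
	return SKETCH_SUCCESS;
}

/*
 * Entry point: validate the uid first, so array and regular jobs
 * both go through the same check before any state is changed.
 */
int sketch_job_signal(struct sketch_job *job, uid_t uid)
{
	if (!sketch_uid_may_signal(job, uid))
		return SKETCH_ACCESS_DENIED;
	return sketch_signal_unchecked(job);
}
```

With this shape, a request from uid 18456 against an array job owned by a different uid is rejected before the job state is touched, which matches the behaviour already seen for the non-array job 23546.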