Ticket 2132

Summary: users can cancel each other's array jobs
Product: Slurm Reporter: Doug Jacobsen <dmjacobsen>
Component: slurmctld Assignee: David Bigagli <david>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 15.08.3   
Hardware: Cray XC   
OS: Linux   
Site: NERSC Alineos Sites: ---
Machine Name: CLE Version:
Version Fixed: 15.08.4 Target Release: ---

Description Doug Jacobsen 2015-11-10 18:31:25 MST
Hello,

After seeing a related post on the slurm-dev list from Markus Stohr, I decided to test if my array jobs could be deleted by another user (same account, no operator or other account coordination capabilities).


dmj@cori07:~/svn/slurm_scripts> sbatch -p regular -a 1-10 --wrap "sleep 90"
Submitted batch job 23545
dmj@cori07:~/svn/slurm_scripts>

### another term
nid00837:~ # su - yunhe
yunhe@nid00837:~> scancel 23545
yunhe@nid00837:~> sacctmgr show user yunhe
      User   Def Acct     Admin
---------- ---------- ---------
     yunhe      mpccc      None
yunhe@nid00837:~>


### back to original
dmj@cori07:~/svn/slurm_scripts> sacct -j 23545
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
23545_[1-10]       wrap    regular      mpccc          1 CANCELLED+      0:0
dmj@cori07:~/svn/slurm_scripts> sbatch -p regular  --wrap "sleep 90"
Submitted batch job 23546
dmj@cori07:~/svn/slurm_scripts>


### attempt to cancel non-array job
yunhe@nid00837:~> scancel 23546
scancel: error: Kill job error on job id 23546: Access/permission denied
yunhe@nid00837:~>



slurmctld logs show:
[2015-11-11T00:15:54.908] debug:  _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 23545 uid 18456
[2015-11-11T00:15:57.161] burst_buffer/cray: bb_p_job_cancel: JobID=23545_*
[2015-11-11T00:15:57.161] _job_signal: of pending JobID=23545_* State=0x4 NodeCnt=0 successful
...
[2015-11-11T00:17:36.130] debug:  _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 23546 uid 18456
[2015-11-11T00:17:36.131] error: Security violation, JOB_CANCEL RPC for jobID 23546 from uid 18456
[2015-11-11T00:17:36.131] error: _slurm_rpc_kill_job2: job_str_signal() job 23546 sig 9 returned Access/permission denied



It looks like the issue is that, for array jobs, _job_signal() is called directly instead of job_signal() (job_signal() seems to do the uid verification).


src/slurmctld/job_mgr.c:
...
extern int job_str_signal(char *job_id_str, uint16_t signal, uint16_t flags,
              uid_t uid, bool preempt)
...
        if (job_ptr && (job_ptr->array_task_id == NO_VAL) &&
            (job_ptr->array_recs == NULL)) {
            /* This is a regular job, not a job array */
            return job_signal(job_id, signal, flags, uid, preempt);
        }

        if (job_ptr && job_ptr->array_recs) {
            /* This is a job array */
            job_ptr_done = job_ptr;
            rc = _job_signal(job_ptr, signal, flags, uid, preempt);
            jobs_signalled++;
            if (rc == ESLURM_ALREADY_DONE) {
                jobs_done++;
                rc = SLURM_SUCCESS;
            }
        }
...
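
To illustrate the kind of check that appears to be missing on the array path, here is a minimal, self-contained sketch. It is not the slurmctld code and not the fix that was eventually committed. Every name prefixed with sketch_ is hypothetical, and the policy (only the owner or root may signal) is a simplifying assumption that ignores operators and account coordinators. The idea is to authorize the requesting uid once, before branching, so array and non-array jobs cannot diverge.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical return codes; not the real Slurm error values. */
#define SKETCH_SUCCESS        0
#define SKETCH_ACCESS_DENIED  1
#define SKETCH_INVALID_JOB    2

/* Hypothetical, heavily reduced stand-in for a job record. */
struct sketch_job {
    unsigned int job_id;
    uid_t        user_id;    /* owner of the job */
    bool         is_array;   /* true for a job array record */
    bool         cancelled;
};

/* Assumed policy: only the job owner or root (uid 0) may signal a job. */
static bool sketch_uid_may_signal(const struct sketch_job *job, uid_t uid)
{
    return (uid == 0) || (uid == job->user_id);
}

/* Low-level routine that performs no uid check, like the array path above. */
static int sketch_signal_unchecked(struct sketch_job *job)
{
    job->cancelled = true;
    return SKETCH_SUCCESS;
}

/* Dispatcher: the uid is verified before any array/non-array branching. */
static int sketch_job_str_signal(struct sketch_job *job, uid_t uid)
{
    if (job == NULL)
        return SKETCH_INVALID_JOB;
    if (!sketch_uid_may_signal(job, uid))
        return SKETCH_ACCESS_DENIED;
    return sketch_signal_unchecked(job);
}
```

With this shape, a request from uid 18456 against an array job owned by a different uid would be rejected in the same way as the request against job 23546 in the logs above.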

Thanks for looking at this,
Doug

Comment 1 David Bigagli 2015-11-10 22:05:42 MST
Hello,
      thanks for your report and detailed analysis. This bug is now fixed.

commit 8e66e26773352e5a27445a6b60a2134b632c3453
Author: David Bigagli <david@schedmd.com>
Date:   Wed Nov 11 13:04:28 2015 +0100

    Fix job cancelation bug.

The job array must have had at least some elements pending for this bug
to happen.

Thanks,
        David