Ticket 17156

Summary: Users are able to modify each other's job array limits
Product: Slurm
Component: User Commands
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Priority: ---
Version: 21.08.8
Hardware: Linux
OS: Linux
Reporter: sysadmin
Assignee: Marcin Stolarek <cinek>
CC: cinek
Site: Allen Institute
Version Fixed: 23.02.4, 23.11rc1

Description sysadmin 2023-07-07 15:49:06 MDT
[amused.admin@slurm ~]# scontrol show job 10026927_8
JobId=10031043 ArrayJobId=10026927 ArrayTaskId=8 ArrayTaskThrottle=5 JobName=txSageRnaSeq
   UserId=victim(01134) GroupId=users(31770) MCS_label=N/A
[bored.user@slurm ~]$ scontrol update JobId=10026927_[3-180] ArrayTaskThrottle=6 
10026927_3-180: Invalid user id
[bored.user@slurm ~]$ logout
[amused.admin@slurm ~]# squeue | grep 10026927
  10026927_[9-180] celltypes txSageRn victim PD       0:00      1 (JobArrayTaskLimit)
        10026927_3 celltypes txSageRn victim  R      13:05      1 n294
        10026927_4 celltypes txSageRn victim  R      13:05      1 n294
        10026927_5 celltypes txSageRn victim  R      13:05      1 n294
        10026927_6 celltypes txSageRn victim  R       8:32      1 n293
        10026927_7 celltypes txSageRn victim  R       1:04      1 n293
        10026927_8 celltypes txSageRn victim  R       1:04      1 n291
[amused.admin@aidc-hpc-prd ~]# scontrol show job 10026927_8
JobId=10031043 ArrayJobId=10026927 ArrayTaskId=8 ArrayTaskThrottle=6 JobName=txSageRnaSeq
   UserId=victim(01134) GroupId=users(31770) MCS_label=N/A


In the logs, I also see this change:
[2023-07-07T14:12:41.210] _update_job: set max_run_tasks to 5 for job array JobId=10026927_*
[2023-07-07T14:12:41.226] _slurm_rpc_update_job: complete JobId=10026927_6-180 uid=20415 usec=16973
[2023-07-07T14:12:42.639] sched: Allocate JobId=10026927_6(10026933) NodeList=n293 #CPUs=32 Partition=celltypes
[2023-07-07T14:19:39.185] _job_complete: JobId=10026927_2(10026929) WEXITSTATUS 0
[2023-07-07T14:19:39.185] _job_complete: JobId=10026927_2(10026929) done
[2023-07-07T14:20:02.609] _update_job: set max_run_tasks to 6 for job array JobId=10026927_*

I claim that users should not be able to change the array limits of jobs they do not own: a user could carve out more resources for themselves by throttling another user's array, or could accidentally overwrite all job array limits in a mass modification. Let me know if this is just a configuration option I missed.
Comment 2 Marcin Stolarek 2023-07-10 05:41:35 MDT
I can reproduce the issue. I'm sending a patch to address that to review. I'll keep you posted on the progress.

cheers,
Marcin
Comment 19 Marcin Stolarek 2023-10-18 05:35:35 MDT
This was fixed by commit 442c8442d8, which was already released in Slurm 23.02.4; sorry for not letting you know earlier. We kept the ticket open to work on additional improvements in this area (commit 6b6c75cb5b), which landed in the master branch (Slurm 23.11 to be).

I'm closing the ticket as fixed now. Should you have any questions, please reopen.

cheers,
Marcin