| Summary: | Users are able to modify each other's job array limits | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | sysadmin |
| Component: | User Commands | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | cinek |
| Version: | 21.08.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Allen Institute | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 23.02.4, 23.11rc1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
I can reproduce the issue. I'm sending a patch addressing it for review. I'll keep you posted on the progress.

cheers,
Marcin

The bug was fixed by 442c8442d8, which was already released in Slurm 23.02.4; sorry for not letting you know earlier. We kept the bug open to work on additional improvements in this area (6b6c75cb5b) that landed in the master branch, to become Slurm 23.11. I'm closing the bug as fixed now. Should you have any questions, please reopen.

cheers,
Marcin
```
[amused.admin@slurm ~]# scontrol show job 10026927_8
JobId=10031043 ArrayJobId=10026927 ArrayTaskId=8 ArrayTaskThrottle=5 JobName=txSageRnaSeq
   UserId=victim(01134) GroupId=users(31770) MCS_label=N/A

[bored.user@slurm ~]$ scontrol update JobId=10026927_[3-180] ArrayTaskThrottle=6
10026927_3-180: Invalid user id
[bored.user@slurm ~]$ logout

[amused.admin@slurm ~]# squeue | grep 10026927
10026927_[9-180] celltypes txSageRn victim PD  0:00  1 (JobArrayTaskLimit)
10026927_3       celltypes txSageRn victim  R 13:05  1 n294
10026927_4       celltypes txSageRn victim  R 13:05  1 n294
10026927_5       celltypes txSageRn victim  R 13:05  1 n294
10026927_6       celltypes txSageRn victim  R  8:32  1 n293
10026927_7       celltypes txSageRn victim  R  1:04  1 n293
10026927_8       celltypes txSageRn victim  R  1:04  1 n291

[amused.admin@aidc-hpc-prd ~]# scontrol show job 10026927_8
JobId=10031043 ArrayJobId=10026927 ArrayTaskId=8 ArrayTaskThrottle=6 JobName=txSageRnaSeq
   UserId=victim(01134) GroupId=users(31770) MCS_label=N/A
```

In the logs, I also see this change:

```
[2023-07-07T14:12:41.210] _update_job: set max_run_tasks to 5 for job array JobId=10026927_*
[2023-07-07T14:12:41.226] _slurm_rpc_update_job: complete JobId=10026927_6-180 uid=20415 usec=16973
[2023-07-07T14:12:42.639] sched: Allocate JobId=10026927_6(10026933) NodeList=n293 #CPUs=32 Partition=celltypes
[2023-07-07T14:19:39.185] _job_complete: JobId=10026927_2(10026929) WEXITSTATUS 0
[2023-07-07T14:19:39.185] _job_complete: JobId=10026927_2(10026929) done
[2023-07-07T14:20:02.609] _update_job: set max_run_tasks to 6 for job array JobId=10026927_*
```

Note that the update took effect even though `scontrol` reported "Invalid user id": `ArrayTaskThrottle` changed from 5 to 6 after another user's command. I believe users should not be able to change other users' array limits: a user could carve out more resources for themselves by throttling other users' jobs, or a mass modification could accidentally overwrite every job array limit on the system. Let me know if this is just a configuration option I missed.