| Summary: | Array job resize apparently overrun limit | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Michael Gutteridge <mrg> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | bart |
| Version: | 22.05.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | FHCRC - Fred Hutchinson Cancer Research Center | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | current slurm config | ||
Hi
I can't recreate the first issue ("NumNodes=4")
Could you send me output from "sacctmgr show qos public"?
Could you also send me the part of slurmctld.log covering the update of those jobs?
For the second issue, the modified parameters are not checked against limits because you updated those jobs as root.
man scontrol:
---
Note that update requests done by either root, SlurmUser or Administrators are not subject
to certain restrictions. For instance, if an Administrator changes the QOS on a pending
job, certain limits such as the TimeLimit will not be changed automatically as changes
made by the Administrators are allowed to violate these restrictions.
---
Dominik
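The behavior described in the manpage excerpt can be sketched as a small decision function. This is a hypothetical illustration, not Slurm's actual code path: the point is only that limit validation is skipped when the requester is root/SlurmUser/an administrator.

```shell
# Sketch of the check described above (hypothetical, not Slurm source):
# limit validation is skipped when the requesting UID is privileged.
check_update() {
  local uid=$1 requested_cpus=$2 max_cpus=$3
  if [ "$uid" -eq 0 ]; then
    echo "accepted (operator: limits not enforced)"
  elif [ "$requested_cpus" -le "$max_cpus" ]; then
    echo "accepted"
  else
    echo "rejected (exceeds limit)"
  fi
}

check_update 0     4000 1000   # root: accepted even past the limit
check_update 53337 4000 1000   # regular user: rejected
```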
The relevant section of the controller log is:
------------------------------>8 snip 8<-----------------------------
[2022-10-13T16:35:45.998] _update_job: setting min_cpus from 1 to 4 for JobId=66576563_*
[2022-10-13T16:35:45.998] _update_job: updating accounting
[2022-10-13T16:35:45.998] _slurm_rpc_update_job: complete JobId=66576563 uid=0 usec=1219
[2022-10-13T16:35:46.026] _update_job: setting min_cpus from 1 to 4 for JobId=66576562_*
[2022-10-13T16:35:46.026] _update_job: updating accounting
[2022-10-13T16:35:46.027] _slurm_rpc_update_job: complete JobId=66576562 uid=0 usec=1148
[2022-10-13T16:35:46.057] _update_job: setting min_cpus from 1 to 4 for JobId=66576561_*
[2022-10-13T16:35:46.057] _update_job: updating accounting
[2022-10-13T16:35:46.057] _slurm_rpc_update_job: complete JobId=66576561 uid=0 usec=1269
[2022-10-13T16:35:46.091] _update_job: setting min_cpus from 1 to 4 for JobId=66576560_*
[2022-10-13T16:35:46.091] _update_job: updating accounting
[2022-10-13T16:35:46.091] _slurm_rpc_update_job: complete JobId=66576560 uid=0 usec=1227
[2022-10-13T16:35:46.124] _update_job: setting min_cpus from 1 to 4 for JobId=66576558_*
[2022-10-13T16:35:46.124] _update_job: updating accounting
[2022-10-13T16:35:46.124] _slurm_rpc_update_job: complete JobId=66576558 uid=0 usec=1348
[2022-10-13T16:35:46.154] _update_job: setting min_cpus from 1 to 4 for JobId=66576557_*
[2022-10-13T16:35:46.155] _update_job: updating accounting
[2022-10-13T16:35:46.155] _slurm_rpc_update_job: complete JobId=66576557 uid=0 usec=1143
[2022-10-13T16:35:46.186] _update_job: setting min_cpus from 1 to 4 for JobId=66576556_*
[2022-10-13T16:35:46.186] _update_job: updating accounting
[2022-10-13T16:35:46.186] _slurm_rpc_update_job: complete JobId=66576556 uid=0 usec=1117
[2022-10-13T16:35:46.217] _update_job: setting min_cpus from 1 to 4 for JobId=66576555_*
[2022-10-13T16:35:46.217] _update_job: updating accounting
[2022-10-13T16:35:46.218] _slurm_rpc_update_job: complete JobId=66576555 uid=0 usec=1183
[2022-10-13T16:35:46.248] _update_job: setting min_cpus from 1 to 4 for JobId=66576554_*
------------------------------>8 snip 8<-----------------------------
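A quick way to pull the affected array JobIds back out of an excerpt like the one above. The heredoc here is a two-line canned sample in the same format; on the real system, grep slurmctld.log directly.

```shell
# Extract the array JobIds touched by the min_cpus updates.
# The heredoc stands in for slurmctld.log.
grep -o 'JobId=[0-9]*_\*' <<'EOF' | cut -d= -f2 | sort -u
[2022-10-13T16:35:45.998] _update_job: setting min_cpus from 1 to 4 for JobId=66576563_*
[2022-10-13T16:35:46.026] _update_job: setting min_cpus from 1 to 4 for JobId=66576562_*
EOF
```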
The relevant QOS configuration is below. I've included "normal" as "public" is the partition QOS and "normal" is the default QOS for the association.
------------------------------>8 snip 8<-----------------------------
Name Priority GraceTime Preempt PreemptExemptTime PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
---------- ---------- ---------- ---------- ------------------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
normal 100000 00:00:00 restart,r+ cluster 1.000000 50000 cpu=1
public 0 00:00:00 cluster 1.000000 cpu=1020 cpu=1000
------------------------------>8 snip 8<-----------------------------
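For reference, the over-limit condition can be checked by hand from these values. A minimal sketch, with the MaxTRESPU string hard-coded from the "public" row above and the usage figure taken from the original report; on a live system the limit would come from something like `sacctmgr -nP show qos public format=MaxTRESPU`.

```shell
# Compare observed per-user CPU usage against the QOS MaxTRESPU limit.
maxtrespu="cpu=1020"   # value from the "public" row above
limit=${maxtrespu#cpu=}
usage=2990             # cores in use per the original report
if [ "$usage" -gt "$limit" ]; then
  echo "over MaxTRESPU by $((usage - limit)) CPUs"
fi
```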
> For the second issue, modified parameters are not checked against limits because you update those jobs as a root
My reading of that section of the manpage suggests the restriction applies to allowing a job to *be* in a QOS with a restriction. For example, I have a "short" partition with a time limit of 12 hours. I can't submit into that partition with "timelimit=3-0", but I can submit with "timelimit=10:00:00", adjust the time limit after submission, and the job will still run.
I don't think that affects association limits (like those on TRES), since a job's eligibility to run is evaluated separately. If my association is already using 1000 cores, adjusting the number of cores for a pending job doesn't directly affect whether it runs or not.
I can't seem to replicate this either... I guess we can chalk this up to planetary alignment or something equally bizarre. Can you confirm that the command I used (scontrol update jobid=<> numcpus=4) is the appropriate way to adjust the CPUs for job arrays? If some jobs in the array are already running, I assume those would not be affected.
Thanks
- Michael
Hi
This also affects TRES limits. We don't check TRES values overridden by an admin:
https://github.com/SchedMD/slurm/blob/0f880b51ebbc98dc8bf1fd54993eb2a9ef6a723b/src/slurmctld/job_mgr.c#L12302-L12308
https://github.com/SchedMD/slurm/blob/0f880b51ebbc98dc8bf1fd54993eb2a9ef6a723b/src/slurmctld/job_mgr.c#L13128
Dominik
Hi
Thanks for the info. I think perhaps I'm not explaining the situation with the usage limits well. I'm not sure that's important, as I've been unable to reproduce it, which also raises the possibility that I misread the output of my diagnostic commands. We can set that issue aside until I have better information.
The other issue I'd had with this operation is that NumNodes for these jobs was getting set to anything between 1 and 4 when allocated. I did a little experimentation, and it looks like to get this to do what I intended I need to update _both_ NumCPUs and CPUs/Task:
scontrol update jobid=2111746 numcpus=4 cpuspertask=4
Is that correct for updating pending array jobs? I basically want to change the pending job to run as if it had been submitted with `sbatch -c 4 ...`. It looks like NumNodes is a minimum, so I think that explains why these jobs were getting split over multiple nodes.
Thanks
- Michael
Hi
Sorry about the delay in my response. If you want to be sure that the sbatch step has access to all the resources, this is the correct procedure.
Dominik
Got it. Thanks for the info. I think that's all we need for now. Thanks again for the help.
- Michael
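To summarize the resolution reached in the thread: for pending array tasks, update both NumCPUs and CPUs/Task. A minimal sketch follows; the job and task IDs are illustrative, and the heredoc stands in for a squeue call so the loop can be shown without a cluster. The `echo` is a dry run: drop it to actually issue the updates.

```shell
# Build the scontrol commands for every pending task of one array job.
# On a real cluster, replace the heredoc with something like:
#   squeue -h -j 66576556 -t pd -o %i
while read -r taskid; do
  # Update both values so the task runs as if submitted with `sbatch -c 4`
  echo sudo scontrol update "jobid=${taskid}" numcpus=4 cpuspertask=4
done <<'EOF'
66576556_7108
66576556_7109
EOF
```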
Created attachment 27277 [details]
current slurm config
I've got a situation where adjustments to an array job seem to have allowed that individual to exceed the limits set in the partition and QOS. I resized some array jobs using scontrol:
squeue -h -u <user> -t pd -o %A | xargs -I{} sudo scontrol update jobid={} numcpus=4
which seems to have resized the job to include "NumNodes=4" for some reason:
------------------------------>8 snip 8<-----------------------------
JobId=66576556 ArrayJobId=66576556 ArrayTaskId=7108-7124 JobName=call_sim_feature_select.sh
   UserId=*******(53337) GroupId=*******(53337) MCS_label=N/A
   Priority=0 Nice=0 Account=******* QOS=normal
   JobState=PENDING Reason=JobHeldUser Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=14-00:00:00 TimeMin=N/A
   SubmitTime=2022-10-12T08:57:06 EligibleTime=Unknown
   AccrueTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-10-14T08:39:29 Scheduler=Main
   Partition=campus-new AllocNode:Sid=*******:5519
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=4 NumCPUs=4 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=385631M,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
------------------------------>8 snip 8<-----------------------------
What this has apparently done as well is allowed this user to blow right past the limit set in the QOS for the partition. The partition QOS (named "public") has a TRESPerAccount limit of "cpu=1000" and a TRESPerUser limit of "cpu=1020". At this moment the user has around 2990 cores in use.
I'll include our slurm.conf. At this point I'm not sure what can be done to figure out why this happened, but I am concerned that I used the incorrect process to adjust the pending jobs in the array. Any advice would be welcome.
Thanks.
- Michael
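The xargs fan-out in the report can be simulated offline. In this sketch the printf stands in for the `squeue -h -u <user> -t pd -o %A` call, and `echo` prints the scontrol commands that would run rather than executing them; the job IDs are taken from the controller log in the thread.

```shell
# Simulate the resize loop: each pending job ID becomes one scontrol call.
# printf stands in for: squeue -h -u <user> -t pd -o %A
printf '%s\n' 66576563 66576562 66576561 |
  xargs -I{} echo sudo scontrol update jobid={} numcpus=4
```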