| Summary: | Array job resize apparently overrun limit | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Michael Gutteridge <mrg> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | bart |
| Version: | 22.05.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | FHCRC - Fred Hutchinson Cancer Research Center | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | current slurm config | ||
Hi
I can't recreate the first issue ("NumNodes=4")
Could you send me output from "sacctmgr show qos public"?
Could you also send me the part of slurmctld.log covering the update of those jobs?
For the second issue, the modified parameters are not checked against limits because you updated those jobs as root.
man scontrol:
---
Note that update requests done by either root, SlurmUser or Administrators are not subject
to certain restrictions. For instance, if an Administrator changes the QOS on a pending
job, certain limits such as the TimeLimit will not be changed automatically as changes
made by the Administrators are allowed to violate these restrictions.
---
Dominik
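The behavior described in the manpage excerpt can be sketched as a small decision function. This is a hypothetical illustration, not Slurm's actual code path: the point is only that limit validation is skipped when the requester is root/SlurmUser/an administrator.

```shell
# Sketch of the check described above (hypothetical, not Slurm source):
# limit validation is skipped when the requesting UID is privileged.
check_update() {
  local uid=$1 requested_cpus=$2 max_cpus=$3
  if [ "$uid" -eq 0 ]; then
    echo "accepted (operator: limits not enforced)"
  elif [ "$requested_cpus" -le "$max_cpus" ]; then
    echo "accepted"
  else
    echo "rejected (exceeds limit)"
  fi
}

check_update 0     4000 1000   # root: accepted even past the limit
check_update 53337 4000 1000   # regular user: rejected
```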
The relevant section of the controller log is:
------------------------------>8 snip 8<-----------------------------
[2022-10-13T16:35:45.998] _update_job: setting min_cpus from 1 to 4 for JobId=66576563_*
[2022-10-13T16:35:45.998] _update_job: updating accounting
[2022-10-13T16:35:45.998] _slurm_rpc_update_job: complete JobId=66576563 uid=0 usec=1219
[2022-10-13T16:35:46.026] _update_job: setting min_cpus from 1 to 4 for JobId=66576562_*
[2022-10-13T16:35:46.026] _update_job: updating accounting
[2022-10-13T16:35:46.027] _slurm_rpc_update_job: complete JobId=66576562 uid=0 usec=1148
[2022-10-13T16:35:46.057] _update_job: setting min_cpus from 1 to 4 for JobId=66576561_*
[2022-10-13T16:35:46.057] _update_job: updating accounting
[2022-10-13T16:35:46.057] _slurm_rpc_update_job: complete JobId=66576561 uid=0 usec=1269
[2022-10-13T16:35:46.091] _update_job: setting min_cpus from 1 to 4 for JobId=66576560_*
[2022-10-13T16:35:46.091] _update_job: updating accounting
[2022-10-13T16:35:46.091] _slurm_rpc_update_job: complete JobId=66576560 uid=0 usec=1227
[2022-10-13T16:35:46.124] _update_job: setting min_cpus from 1 to 4 for JobId=66576558_*
[2022-10-13T16:35:46.124] _update_job: updating accounting
[2022-10-13T16:35:46.124] _slurm_rpc_update_job: complete JobId=66576558 uid=0 usec=1348
[2022-10-13T16:35:46.154] _update_job: setting min_cpus from 1 to 4 for JobId=66576557_*
[2022-10-13T16:35:46.155] _update_job: updating accounting
[2022-10-13T16:35:46.155] _slurm_rpc_update_job: complete JobId=66576557 uid=0 usec=1143
[2022-10-13T16:35:46.186] _update_job: setting min_cpus from 1 to 4 for JobId=66576556_*
[2022-10-13T16:35:46.186] _update_job: updating accounting
[2022-10-13T16:35:46.186] _slurm_rpc_update_job: complete JobId=66576556 uid=0 usec=1117
[2022-10-13T16:35:46.217] _update_job: setting min_cpus from 1 to 4 for JobId=66576555_*
[2022-10-13T16:35:46.217] _update_job: updating accounting
[2022-10-13T16:35:46.218] _slurm_rpc_update_job: complete JobId=66576555 uid=0 usec=1183
[2022-10-13T16:35:46.248] _update_job: setting min_cpus from 1 to 4 for JobId=66576554_*
------------------------------>8 snip 8<-----------------------------
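A quick way to pull the affected array JobIds back out of an excerpt like the one above. The heredoc here is a two-line canned sample in the same format; on the real system, grep slurmctld.log directly.

```shell
# Extract the array JobIds touched by the min_cpus updates.
# The heredoc stands in for slurmctld.log.
grep -o 'JobId=[0-9]*_\*' <<'EOF' | cut -d= -f2 | sort -u
[2022-10-13T16:35:45.998] _update_job: setting min_cpus from 1 to 4 for JobId=66576563_*
[2022-10-13T16:35:46.026] _update_job: setting min_cpus from 1 to 4 for JobId=66576562_*
EOF
```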
The relevant QOS configuration is below. I've included "normal" as "public" is the partition QOS and "normal" is the default QOS for the association.
------------------------------>8 snip 8<-----------------------------
Name Priority GraceTime Preempt PreemptExemptTime PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
---------- ---------- ---------- ---------- ------------------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
normal 100000 00:00:00 restart,r+ cluster 1.000000 50000 cpu=1
public 0 00:00:00 cluster 1.000000 cpu=1020 cpu=1000
------------------------------>8 snip 8<-----------------------------
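For reference, the over-limit condition can be checked by hand from these values. A minimal sketch, with the MaxTRESPU string hard-coded from the "public" row above and the usage figure taken from the original report; on a live system the limit would come from something like `sacctmgr -nP show qos public format=MaxTRESPU`.

```shell
# Compare observed per-user CPU usage against the QOS MaxTRESPU limit.
maxtrespu="cpu=1020"   # value from the "public" row above
limit=${maxtrespu#cpu=}
usage=2990             # cores in use per the original report
if [ "$usage" -gt "$limit" ]; then
  echo "over MaxTRESPU by $((usage - limit)) CPUs"
fi
```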
> For the second issue, modified parameters are not checked against limits because you update those jobs as a root
My reading of that section of the manpage suggests the restriction applies to allowing a job to *be* in a QOS with a restriction. For example, I have a "short" partition with a time limit of 12 hours. I can't submit into that partition with "timelimit=3-0", but I can submit with "timelimit=10:00:00", adjust the time limit after submission, and the job will still run.
I don't think that affects association limits (like those on TRES), since a job's eligibility to run is evaluated separately. If my association is already using 1000 cores, adjusting the number of cores for a pending job doesn't directly affect whether it runs or not.
I can't seem to replicate this either... I guess we can chalk this up to planetary alignment or something equally bizarre. Can you confirm that the command I used (scontrol update jobid=<> numcpus=4) is the appropriate way to adjust the CPUs for job arrays? If some jobs in the array are already running, I assume those would not be affected.
Thanks
- Michael
Hi
This also affects TRES limits. We don't check TRES values overridden by an admin:
https://github.com/SchedMD/slurm/blob/0f880b51ebbc98dc8bf1fd54993eb2a9ef6a723b/src/slurmctld/job_mgr.c#L12302-L12308
https://github.com/SchedMD/slurm/blob/0f880b51ebbc98dc8bf1fd54993eb2a9ef6a723b/src/slurmctld/job_mgr.c#L13128
Dominik
Hi
Thanks for the info. I think perhaps I'm not explaining the situation with the usage limits well. I'm not sure that's important, as I've been unable to reproduce it, which also raises the possibility that I misread the output of my diagnostic commands. We can set that issue aside until I have better information.
The other issue I'd had with this operation is that NumNodes for these jobs was getting set to anything between 1 and 4 when allocated. I did a little experimentation, and it looks like to get this to do what I intended I need to update _both_ NumCPUs and CPUs/Task:
scontrol update jobid=2111746 numcpus=4 cpuspertask=4
Is that correct for updating pending array jobs? I basically want to change the pending job to run as if it had been submitted with `sbatch -c 4 ...`. It looks like NumNodes is a minimum, so I think that explains why these jobs were getting split over multiple nodes.
Thanks
- Michael
Hi
Sorry about the delay in my response. If you want to be sure that the sbatch step has access to all the resources, this is the correct procedure.
Dominik
Got it. Thanks for the info. I think that's all we need for now. Thanks again for the help.
- Michael
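To summarize the resolution reached in the thread: for pending array tasks, update both NumCPUs and CPUs/Task. A minimal sketch follows; the job and task IDs are illustrative, and the heredoc stands in for a squeue call so the loop can be shown without a cluster. The `echo` is a dry run: drop it to actually issue the updates.

```shell
# Build the scontrol commands for every pending task of one array job.
# On a real cluster, replace the heredoc with something like:
#   squeue -h -j 66576556 -t pd -o %i
while read -r taskid; do
  # Update both values so the task runs as if submitted with `sbatch -c 4`
  echo sudo scontrol update "jobid=${taskid}" numcpus=4 cpuspertask=4
done <<'EOF'
66576556_7108
66576556_7109
EOF
```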
Created attachment 27277 [details]
current slurm config
I've got a situation where adjustments to an array job seem to have allowed that individual to exceed the limits set in the partition and QOS. I resized some array jobs using scontrol:
squeue -h -u <user> -t pd -o %A | xargs -I{} sudo scontrol update jobid={} numcpus=4
which seems to have resized the job to include "NumNodes=4" for some reason:
------------------------------>8 snip 8<-----------------------------
JobId=66576556 ArrayJobId=66576556 ArrayTaskId=7108-7124 JobName=call_sim_feature_select.sh
   UserId=*******(53337) GroupId=*******(53337) MCS_label=N/A
   Priority=0 Nice=0 Account=******* QOS=normal
   JobState=PENDING Reason=JobHeldUser Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=14-00:00:00 TimeMin=N/A
   SubmitTime=2022-10-12T08:57:06 EligibleTime=Unknown
   AccrueTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-10-14T08:39:29 Scheduler=Main
   Partition=campus-new AllocNode:Sid=*******:5519
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=4 NumCPUs=4 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=385631M,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
------------------------------>8 snip 8<-----------------------------
What this has apparently done as well is allowed this user to blow right past the limit set in the QOS for the partition. The partition QOS (named "public") has a TRESPerAccount limit of "cpu=1000" and a TRESPerUser limit of "cpu=1020". At this moment the user has around 2990 cores in use.
I'll include our slurm.conf. At this point I'm not sure what can be done to figure out why this happened, but I am concerned that I used the incorrect process to adjust the pending jobs in the array. Any advice would be welcome.
Thanks.
- Michael
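The xargs fan-out in the report can be simulated offline. In this sketch the printf stands in for the `squeue -h -u <user> -t pd -o %A` call, and `echo` prints the scontrol commands that would run rather than executing them; the job IDs are taken from the controller log in the thread.

```shell
# Simulate the resize loop: each pending job ID becomes one scontrol call.
# printf stands in for: squeue -h -u <user> -t pd -o %A
printf '%s\n' 66576563 66576562 66576561 |
  xargs -I{} echo sudo scontrol update jobid={} numcpus=4
```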