| Summary: | Large daily reservations are not running jobs with reason "Resources" | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Trey Dockendorf <tdockendorf> |
| Component: | reservations | Assignee: | Scott Hilton <scott> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | scott, troy |
| Version: | 21.08.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Ohio State OSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf | | |
Description
Trey Dockendorf
2021-12-03 06:12:59 MST
I managed to check on one of the reservations when this happened and see what was going on. Slurm ran a job on the reserved node, but the issue appears to be that someone at OSC extended that job's walltime, which caused issues for the reservation and prevented the reservation's job from starting.

The job that should have run:

# scontrol show job=8079531
JobId=8079531 JobName=x001-m-20211203-12
   UserId=wxops(30211) GroupId=PYS0343(5387) MCS_label=N/A
   Priority=1200332727 Nice=0 Account=pys1043 QOS=pitzer-override-tres
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:10:00 TimeMin=N/A
   SubmitTime=2021-12-03T11:04:36 EligibleTime=2021-12-03T11:20:04
   AccrueTime=2021-12-03T11:20:04
   StartTime=2021-12-07T16:05:20 EndTime=2021-12-07T17:15:20 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-12-03T11:24:27 Scheduler=Main
   Partition=parallel-48core AllocNode:Sid=0.0.0.0:155366
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=p[0504-0529,0531-0536,0538-0539,0541-0543,0545-0548,0550-0554,0556-0568,0571-0585,0587-0594,0610,0617-0618,0625-0626,0634,0642,0706,0718,0781]
   NumNodes=92-92 NumCPUs=4416 NumTasks=4416 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4416,mem=16767552M,node=92,billing=4416
   Socks/Node=* NtasksPerN:B:S:C=48:0:*:1 CoreSpec=*
   MinCPUsNode=48 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=x001-12
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/12/runscript
   WorkDir=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/12
   Comment=stdout=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/12/runscript.out
   StdErr=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/12/runscript.out
   StdIn=/dev/null
   StdOut=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/12/runscript.out
   Power=

The reservation:

# scontrol show res=x001-12
ReservationName=x001-12 StartTime=2021-12-03T11:20:00 EndTime=2021-12-03T12:35:00 Duration=01:15:00
   Nodes=p[0504-0529,0531-0536,0538-0539,0541-0543,0545-0548,0550-0554,0556-0568,0571-0585,0587-0594,0610,0617-0618,0625-0626,0634,0642,0706,0718,0781] NodeCnt=92 CoreCnt=4416
   Features=c6420&48core PartitionName=parallel-48core
   Flags=DAILY,REPLACE_DOWN,PURGE_COMP=00:02:00 TRES=cpu=4416
   Users=(null) Groups=(null) Accounts=PYS1043 Licenses=(null)
   State=ACTIVE BurstBuffer=(null) Watts=n/a MaxStartDelay=(null)

The state of the nodes for the reservation:

# sinfo -n p[0504-0529,0531-0536,0538-0539,0541-0543,0545-0548,0550-0554,0556-0568,0571-0585,0587-0594,0610,0617-0618,0625-0626,0634,0642,0706,0718,0781] -p parallel-48core
PARTITION       AVAIL TIMELIMIT  NODES STATE NODELIST
parallel-48core up    4-00:00:00    91 resv  p[0504-0523,0525-0529,0531-0536,0538-0539,0541-0543,0545-0548,0550-0554,0556-0568,0571-0585,0587-0594,0610,0617-0618,0625-0626,0634,0642,0706,0718,0781]
parallel-48core up    4-00:00:00     1 alloc p0524

The job that ended up on p0524 when it was reserved:

# scontrol show job=8068468
JobId=8068468 JobName=bench6_8m_intel
   UserId=skhuvis(23231) GroupId=PZS0710(5511) MCS_label=N/A
   Priority=200003722 Nice=0 Account=pzs0720 QOS=pitzer-default
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=19:20:37 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2021-12-02T16:05:02 EligibleTime=2021-12-02T16:05:02
   AccrueTime=2021-12-02T16:05:02
   StartTime=2021-12-02T16:05:20 EndTime=2021-12-07T16:05:20 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-12-02T16:05:20 Scheduler=Backfill
   Partition=serial-48core AllocNode:Sid=pitzer-login01:50721
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p0524 BatchHost=p0524
   NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=182256M,node=1,billing=48
   Socks/Node=* NtasksPerN:B:S:C=48:0:*:1 CoreSpec=*
   MinCPUsNode=48 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/fs/ess/scratch/PZS0710/skhuvis/setsm/runs/v4.3.10/run.slurm
   WorkDir=/fs/ess/scratch/PZS0710/skhuvis/setsm/runs/v4.3.10
   Comment=stdout=/fs/ess/scratch/PZS0710/skhuvis/setsm/runs/v4.3.10/slurm-8068468.out
   StdErr=/fs/ess/scratch/PZS0710/skhuvis/setsm/runs/v4.3.10/slurm-8068468.out
   StdIn=/dev/null
   StdOut=/fs/ess/scratch/PZS0710/skhuvis/setsm/runs/v4.3.10/slurm-8068468.out
   Power=

It looks like one of our staff increased the runtime of the job that ended up on the reserved node:

# grep 8068468 /var/log/slurm/slurmsched.log
sched: [2021-12-03T09:08:11.010] _update_job: setting time_limit to 1440 for JobId=8068468
sched: [2021-12-03T09:51:36.143] _update_job: setting time_limit to 7200 for JobId=8068468

I canceled that job and now the reservation job is running. This is not an issue we had seen in 20.08.x versions.

Scott Hilton

Trey,

Glad you figured it out. A regular user shouldn't be able to increase the TimeLimit; it must be done by an admin. I reproduced the behavior you mentioned and I don't see any change in behavior in this area between 21.08 and 20.11.

Do you have any more questions?

-Scott

Trey Dockendorf

I'm not sure this is resolved. Is this expected behavior? If a node is part of a future daily reservation and it's running a job, and that job has its walltime extended to overlap with the reservation, is Slurm not going to re-allocate a different node to ensure the reservation has the appropriate number of idle nodes to accommodate it?

Along with my previous question, the user who changed the TimeLimit on their job:
# sacctmgr show user where name=skhuvis
User       Def Acct   Admin
---------- ---------- ---------
skhuvis    pzs0710    Operator
But the man page for scontrol says this for TimeLimit:
Only the Slurm administrator or root can increase job's TimeLimit.
So should an Operator have been allowed to change their own job's time limit?

Scott Hilton

Trey,

Yes, Operators can also extend time limits; Coordinators and regular users cannot: https://slurm.schedmd.com/user_permissions.html

Slurm is operating as expected. Extending time limits is not supposed to happen often, which is why it is limited to admins and Operators. The things we could do to prevent this issue have greater drawbacks than possible gains. We could replan the reservation when someone extends an existing job, but there would be no guarantee that nodes would still be free at such a late stage. We could reject the time limit extension on the job since it would collide with the reservation, but, as noted, regular users are not granted that ability and it is meant as an administrative override only; we don't want to deny the admin/operator the capability to do what they want.

-Scott

Trey Dockendorf

Thanks, I'm going to close this. I think our path forward is to write a script that checks whether a running job about to have its walltime extended will run into a reservation, and if it does not, runs something like this:

scontrol update job=$job timelimit=$time admincomment="timelimit changed"

Then we will have to update our Lua job_submit filter to look for TimeLimit modify events and only allow them if the admin comment is set. It's not foolproof, but it might catch staff forgetting to run the scripts we already have to avoid this issue.
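
As a rough illustration of that plan, here is a minimal sketch of such a pre-check wrapper. The script name and collision logic are hypothetical (this is not an existing OSC or SchedMD tool); it assumes bash with GNU date and grep -P, takes the new limit in minutes, relies on `scontrol -o show reservations` printing one reservation per line, and only sees the next occurrence of a recurring DAILY reservation:

```bash
#!/bin/bash
# Hypothetical wrapper: refuse to extend a running job's walltime when the new
# end time would overlap a reservation on the job's nodes; otherwise apply the
# change and stamp AdminComment so the change can be audited/enforced later.
# Usage: extend-job-safely.sh <jobid> <new_timelimit_in_minutes>
set -u

job="$1"
new_limit_min="$2"

# Start time and allocated nodes of the running job (%S = start, %N = nodelist).
read -r start nodelist < <(squeue -h -j "$job" -o '%S %N') || {
    echo "Job $job not found or not running" >&2
    exit 1
}

# Projected end time (epoch seconds) if the new limit were applied.
new_end=$(( $(date -d "$start" +%s) + new_limit_min * 60 ))

# Expand the job's nodelist once for the overlap checks below.
job_nodes=$(scontrol show hostnames "$nodelist" | sort)

# Walk every reservation; --oneliner (-o) prints one reservation per line.
while read -r res; do
    res_start=$(grep -oP 'StartTime=\S+' <<<"$res" | cut -d= -f2)
    res_nodes=$(grep -oP ' Nodes=\S+' <<<"$res" | cut -d= -f2)
    [ -z "$res_nodes" ] || [ "$res_nodes" = "(null)" ] && continue

    # Reservations that start after the job's new end time cannot collide.
    # (Only the next occurrence of a DAILY reservation is visible here.)
    [ "$(date -d "$res_start" +%s)" -ge "$new_end" ] && continue

    # Any node in common means the extension would run into the reservation.
    overlap=$(comm -12 <(echo "$job_nodes") \
                       <(scontrol show hostnames "$res_nodes" | sort))
    if [ -n "$overlap" ]; then
        echo "Refusing to extend JobId=$job: would overlap a reservation on: $overlap" >&2
        exit 1
    fi
done < <(scontrol -o show reservations)

# No collision found: apply the new limit and record how the change was made.
scontrol update job="$job" timelimit="$new_limit_min" admincomment="timelimit changed"
```

The admincomment stamp at the end is what the Lua job_submit filter mentioned above could key on: reject any TimeLimit increase on a running job unless the comment shows the change came through the wrapper.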