Ticket 12954

Summary: Large daily reservations are not running jobs with reason "Resources"
Product: Slurm
Reporter: Trey Dockendorf <tdockendorf>
Component: reservations
Assignee: Scott Hilton <scott>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---
CC: scott, troy
Version: 21.08.4
Hardware: Linux
OS: Linux
Site: Ohio State OSC
Attachments: slurm.conf

Description Trey Dockendorf 2021-12-03 06:12:59 MST
Created attachment 22510 [details]
slurm.conf

We recently upgraded from Slurm 20.11.7 to 21.08.4, and one of our paying customers has reported that their jobs using daily reservations never run. So far the common factor between the multiple affected reservations is that they are all 92-node reservations. This customer has many other, smaller reservations that are configured exactly the same except for StartTime and NodeCnt, and so far those are working.

This may be related to bug #12943, but we have since worked around that issue, and these jobs are getting assigned a no_consume GRES based on filesystem.

I've tried to reproduce this issue with similar 92-node reservations and similar job submit flags, but have been unable to.

Here is one such job, which should have run at 2021-12-02T23:20:00 (EST):

# scontrol show job=8072914
JobId=8072914 JobName=x001-m-20211203-00
   UserId=wxops(30211) GroupId=PYS0343(5387) MCS_label=N/A
   Priority=1200332779 Nice=0 Account=pys1043 QOS=pitzer-override-tres
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:10:00 TimeMin=N/A
   SubmitTime=2021-12-02T23:02:19 EligibleTime=2021-12-02T23:20:02
   AccrueTime=2021-12-02T23:20:02
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-12-03T00:34:58 Scheduler=Main
   Partition=parallel-48core AllocNode:Sid=0.0.0.0:9112
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=92-92 NumCPUs=4416 NumTasks=4416 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4416,mem=16767552M,node=92,billing=4416
   Socks/Node=* NtasksPerN:B:S:C=48:0:*:1 CoreSpec=*
   MinCPUsNode=48 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=x001-00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/00/runscript
   WorkDir=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/00
   Comment=stdout=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/00/runscript.out 
   StdErr=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/00/runscript.out
   StdIn=/dev/null
   StdOut=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/00/runscript.out
   Power=


This is the reservation:

# scontrol show res=x001-00
ReservationName=x001-00 StartTime=2021-12-03T23:20:00 EndTime=2021-12-04T00:35:00 Duration=01:15:00
   Nodes=p[0501-0505,0507-0530,0532-0533,0535-0554,0556-0558,0560-0569,0571-0573,0575-0585,0587-0588,0591,0593-0594,0610,0617-0618,0625-0626,0634,0642,0706,0718] NodeCnt=92 CoreCnt=4416 Features=c6420&48core PartitionName=parallel-48core Flags=DAILY,REPLACE_DOWN,PURGE_COMP=00:02:00
   TRES=cpu=4416
   Users=(null) Groups=(null) Accounts=PYS1043 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

These are the #SBATCH lines for the job; I believe the reservation is passed via an sbatch command-line flag:

#SBATCH --account=PYS1043
#SBATCH --chdir=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/00
#SBATCH --export=NONE
#SBATCH --job-name=x001-m-20211203-00
#SBATCH --mem-per-cpu=0
#SBATCH --nice=0
#SBATCH --no-requeue
#SBATCH --nodes=92
#SBATCH --ntasks-per-node=48
#SBATCH --open-mode=append
#SBATCH --output=runscript.out
#SBATCH --time=01:10:00
#SBATCH --partition=parallel-48core

Here are some log entries from the DebugFlags we enabled:

(This one repeats until the reservation moves to the next day):
sched: [2021-12-02T23:20:02.618] JobId=8072914. State=PENDING. Reason=Resources. Priority=1200332727. Partition=parallel-48core.

(This message repeats before the reservation starts):
Dec  2 22:35:19 pitzer-slurm01 slurmctld[16679]: RESERVATION: job_test_resv: reservation x001-00 uses full nodes or JobId=8067226 will not share nodes

(This message repeats while the reservation is active):
Dec  2 23:20:02 pitzer-slurm01 slurmctld[16679]: job_test_resv: JobId=8072914 reservation:x001-00 nodes:p[0501-0505,0507-0530,0532-0533,0535-0554,0556-0558,0560-0569,0571-0573,0575-0585,0587-0588,0591,0593-0594,0610,0617-0618,0625-0626,0634,0642,0706,0718]
Comment 1 Trey Dockendorf 2021-12-03 09:33:08 MST
I managed to observe one of the reservations while this was happening. Slurm had run an unrelated job on a reserved node: it appears that someone at OSC extended that job's walltime, which conflicted with the reservation and prevented the reservation's job from starting.

The job that should have run:

# scontrol show job=8079531
JobId=8079531 JobName=x001-m-20211203-12
   UserId=wxops(30211) GroupId=PYS0343(5387) MCS_label=N/A
   Priority=1200332727 Nice=0 Account=pys1043 QOS=pitzer-override-tres
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:10:00 TimeMin=N/A
   SubmitTime=2021-12-03T11:04:36 EligibleTime=2021-12-03T11:20:04
   AccrueTime=2021-12-03T11:20:04
   StartTime=2021-12-07T16:05:20 EndTime=2021-12-07T17:15:20 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-12-03T11:24:27 Scheduler=Main
   Partition=parallel-48core AllocNode:Sid=0.0.0.0:155366
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=p[0504-0529,0531-0536,0538-0539,0541-0543,0545-0548,0550-0554,0556-0568,0571-0585,0587-0594,0610,0617-0618,0625-0626,0634,0642,0706,0718,0781]
   NumNodes=92-92 NumCPUs=4416 NumTasks=4416 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4416,mem=16767552M,node=92,billing=4416
   Socks/Node=* NtasksPerN:B:S:C=48:0:*:1 CoreSpec=*
   MinCPUsNode=48 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=x001-12
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/12/runscript
   WorkDir=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/12
   Comment=stdout=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/12/runscript.out 
   StdErr=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/12/runscript.out
   StdIn=/dev/null
   StdOut=/fs/ess/scratch/PYS0343/wxops/runs/ufs/20211203/12/runscript.out
   Power=


The reservation:

# scontrol show res=x001-12
ReservationName=x001-12 StartTime=2021-12-03T11:20:00 EndTime=2021-12-03T12:35:00 Duration=01:15:00
   Nodes=p[0504-0529,0531-0536,0538-0539,0541-0543,0545-0548,0550-0554,0556-0568,0571-0585,0587-0594,0610,0617-0618,0625-0626,0634,0642,0706,0718,0781] NodeCnt=92 CoreCnt=4416 Features=c6420&48core PartitionName=parallel-48core Flags=DAILY,REPLACE_DOWN,PURGE_COMP=00:02:00
   TRES=cpu=4416
   Users=(null) Groups=(null) Accounts=PYS1043 Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)


The state of the nodes for the reservation:

# sinfo -n p[0504-0529,0531-0536,0538-0539,0541-0543,0545-0548,0550-0554,0556-0568,0571-0585,0587-0594,0610,0617-0618,0625-0626,0634,0642,0706,0718,0781] -p parallel-48core
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST
parallel-48core    up 4-00:00:00     91   resv p[0504-0523,0525-0529,0531-0536,0538-0539,0541-0543,0545-0548,0550-0554,0556-0568,0571-0585,0587-0594,0610,0617-0618,0625-0626,0634,0642,0706,0718,0781]
parallel-48core    up 4-00:00:00      1  alloc p0524


The job that ended up on p0524 when it was reserved:

# scontrol show job=8068468
JobId=8068468 JobName=bench6_8m_intel
   UserId=skhuvis(23231) GroupId=PZS0710(5511) MCS_label=N/A
   Priority=200003722 Nice=0 Account=pzs0720 QOS=pitzer-default
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=19:20:37 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2021-12-02T16:05:02 EligibleTime=2021-12-02T16:05:02
   AccrueTime=2021-12-02T16:05:02
   StartTime=2021-12-02T16:05:20 EndTime=2021-12-07T16:05:20 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-12-02T16:05:20 Scheduler=Backfill
   Partition=serial-48core AllocNode:Sid=pitzer-login01:50721
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p0524
   BatchHost=p0524
   NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=182256M,node=1,billing=48
   Socks/Node=* NtasksPerN:B:S:C=48:0:*:1 CoreSpec=*
   MinCPUsNode=48 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/fs/ess/scratch/PZS0710/skhuvis/setsm/runs/v4.3.10/run.slurm
   WorkDir=/fs/ess/scratch/PZS0710/skhuvis/setsm/runs/v4.3.10
   Comment=stdout=/fs/ess/scratch/PZS0710/skhuvis/setsm/runs/v4.3.10/slurm-8068468.out 
   StdErr=/fs/ess/scratch/PZS0710/skhuvis/setsm/runs/v4.3.10/slurm-8068468.out
   StdIn=/dev/null
   StdOut=/fs/ess/scratch/PZS0710/skhuvis/setsm/runs/v4.3.10/slurm-8068468.out
   Power=

It looks like one of our staff increased the runtime of the job that ended up on the reserved node:

# grep 8068468 /var/log/slurm/slurmsched.log
sched: [2021-12-03T09:08:11.010] _update_job: setting time_limit to 1440 for JobId=8068468
sched: [2021-12-03T09:51:36.143] _update_job: setting time_limit to 7200 for JobId=8068468
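As a sanity check on those log entries (assuming GNU date; timestamps copied from the scontrol output above), the raised limit of 7200 minutes pushes the job's end exactly five days past its start, well into the reservation window:

```shell
# JobId=8068468 started 2021-12-02T16:05:20; its time_limit was raised to 7200 min.
start=$(date -d '2021-12-02T16:05:20' +%s)
resv_start=$(date -d '2021-12-03T11:20:00' +%s)   # start of reservation x001-12
new_end=$(( start + 7200 * 60 ))                  # 7200 minutes = 5 days
date -d "@$new_end" +%Y-%m-%dT%H:%M:%S            # matches EndTime=2021-12-07T16:05:20
[ "$new_end" -gt "$resv_start" ] && echo "job overlaps reservation"
```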

I canceled that job and the reservation job is now running. This is not an issue we had seen in 20.08.x versions.
Comment 3 Scott Hilton 2021-12-03 15:19:06 MST
Trey,

Glad you figured it out.

A regular user shouldn't be able to increase the TimeLimit. It must be done by an admin.

I reproduced the behavior you mentioned, and I don't see any change in this area between 21.08 and 20.11.

Do you have any more questions?

-Scott
Comment 5 Trey Dockendorf 2021-12-03 15:23:33 MST
I'm not sure this is resolved. Is this expected behavior? If a node is part of a future daily reservation and is running a job, and that job's walltime is extended so that it overlaps with the reservation, will Slurm not re-allocate a different node to ensure the reservation has the appropriate number of idle nodes to accommodate its jobs?
Comment 6 Trey Dockendorf 2021-12-03 17:09:50 MST
Along with my previous question, the user who changed the TimeLimit on their job:

# sacctmgr show user where name=skhuvis                                                                                                          
      User   Def Acct     Admin 
---------- ---------- --------- 
   skhuvis    pzs0710  Operator 


But the man page for scontrol says this for TimeLimit:

Only the Slurm administrator or root can increase job's TimeLimit.

So should an Operator have been allowed to increase their own job's time limit?
Comment 7 Scott Hilton 2021-12-06 10:23:31 MST
Trey,

Yes, Operators can also extend time limits. Coordinators and regular users cannot.
https://slurm.schedmd.com/user_permissions.html

Slurm is operating as expected; extending time limits is not supposed to happen often, which is why the ability is limited to admins and operators.

The things we could do to prevent this issue have greater drawbacks than the possible gains:

We could replan the reservation when someone extends an existing job, but there would be no guarantee that nodes would be free at such a late stage.

We could reject the time limit extension because it would collide with the reservation, but, as noted, regular users are not granted that ability anyway; it's meant as an administrative override only, and we don't want to deny the admin/operator the capability to do what they want.

-Scott
Comment 8 Trey Dockendorf 2021-12-07 07:07:57 MST
Thanks, I'm going to close this. I think our path forward is to write a script that checks whether a running job about to have its walltime extended would run into a reservation, and only if it would not, runs something like this:

scontrol update job=$job timelimit=$time admincomment="timelimit changed"

Then we will update our Lua job_submit filter to look for timelimit modify events and only allow them if the admin comment is set. It's not foolproof, but it might avoid staff forgetting to run the scripts we already have in place to avoid this issue.
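A minimal sketch of that guard script's core check, assuming GNU bash. The function name is my own, and the commented scontrol parsing is illustrative only, not verified against a live cluster:

```shell
#!/bin/bash
# Would a job whose new end time is $1 (epoch seconds) overlap a
# reservation starting at $2 (epoch seconds)?
extension_collides() {
    local new_end=$1 resv_start=$2
    [ "$new_end" -gt "$resv_start" ]
}

# Illustrative use against a live cluster (hypothetical parsing, unverified):
#   job_start=$(date -d "$(scontrol show job="$job" | grep -oP 'StartTime=\K\S+')" +%s)
#   new_end=$(( job_start + new_limit_minutes * 60 ))
#   resv_start=$(date -d "$(scontrol show res="$resv" | grep -oP 'StartTime=\K\S+')" +%s)
#   if ! extension_collides "$new_end" "$resv_start"; then
#       scontrol update job="$job" timelimit="$time" admincomment="timelimit changed"
#   fi
```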