| Summary: | Job requesting reservation does not start with reason=BadConstraints | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Troy Baer <troy> |
| Component: | reservations | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | bart, nate, tdockendorf |
| Version: | 20.02.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Ohio State OSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Hi Troy,
I believe I see what is causing the job to fail to start in the reservation. It looks like the reservation was created, specifying the 'batch' partition and 'c6420&48core' as features. However, the job requests the 'parallel-48core' partition. I didn't try to dig up a slurm.conf you've sent in another ticket, but my guess is that the nodes in the 'parallel-48core' partition don't have the same features. Is that right?
If so, I can reproduce this behavior. I created a reservation that specifies a feature that is only on half the nodes in my cluster.
$ scontrol create reservation reservationname=test_osc partition=debug feature=rack2 nodecnt=5 flags=purge_comp=5:00 account=sub1 starttime=11:25:00 duration=30:00
Reservation created: test_osc
$ scontrol show res
ReservationName=test_osc StartTime=2020-10-12T11:25:00 EndTime=2020-10-12T11:55:00 Duration=00:30:00
Nodes=node[09-13] NodeCnt=5 CoreCnt=120 Features=rack2 PartitionName=debug Flags=PURGE_COMP=00:05:00
TRES=cpu=120
Users=(null) Accounts=sub1 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
Then I submit a job that requests a partition that doesn't have the same nodes available to it. When it's time for it to start, it fails with a 'BadConstraints' reason. Another thing to notice is that the Priority drops to '0' when the job fails to start.
$ sbatch -N5 --reservation=test_osc -pgpu -Asub1 -t10:00 --wrap='srun sleep 300'
sbatch: In original lua submit function
Submitted batch job 753
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
753 gpu wrap ben PD 0:00 5 (BadConstraints)
$ scontrol show job 753
JobId=753 JobName=wrap
UserId=ben(1000) GroupId=ben(1000) MCS_label=N/A
Priority=0 Nice=0 Account=sub1 QOS=normal
JobState=PENDING Reason=BadConstraints Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
SubmitTime=2020-10-12T11:24:21 EligibleTime=2020-10-12T11:25:00
AccrueTime=Unknown
StartTime=2020-10-12T11:25:11 EndTime=2020-10-12T11:35:11 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-10-12T11:25:00
Partition=gpu AllocNode:Sid=kitt:7656
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=5-5 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=5,node=5,billing=5
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Reservation=test_osc
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/ben/slurm/src/20-02/kitt/etc
StdErr=/home/ben/slurm/src/20-02/kitt/etc/slurm-753.out
StdIn=/dev/null
StdOut=/home/ben/slurm/src/20-02/kitt/etc/slurm-753.out
Power=
MailUser=(null) MailType=NONE
If I submit the job with another partition specified that overlaps with the nodes in the reservation, it runs fine. Let me know if this looks like what you're running into.
Thanks,
Ben
(In reply to Ben Roberts from comment #2)
> I believe I see what is causing the job to fail to start in the reservation.
> It looks like the reservation was created, specifying the 'batch' partition
> and 'c6420&48core' as features. However, the job requests the
> 'parallel-48core' partition. I didn't try to dig up a slurm.conf you've
> sent in another ticket, but my guess is that the nodes in the
> 'parallel-48core' partition don't have the same features. Is that right?

No, that is not the case that I can find. For instance, one of the reserved nodes is p0501, which has both the c6420 and 48core features:

troy@pitzer-login01:~$ scontrol show node p0501
NodeName=p0501 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=48 CPULoad=0.11
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
   Gres=pfsdir:scratch:1,pfsdir:ess:1,ime:1,gpfs:project:1,gpfs:scratch:1,gpfs:ess:1
   NodeAddr=10.4.11.1 NodeHostName=p0501 Version=20.02.5
   OS=Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Wed Feb 12 14:08:31 UTC 2020
   RealMemory=182272 AllocMem=0 FreeMem=179001 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=2 Owner=N/A MCS_label=N/A
   Partitions=batch,debug,debug-48core,parallel,parallel-48core,serial,serial-48core,systems
   BootTime=2020-08-18T12:05:04 SlurmdStartTime=2020-09-15T09:13:49
   CfgTRES=cpu=48,mem=178G,billing=48,gres/gpfs:ess=1,gres/gpfs:project=1,gres/gpfs:scratch=1,gres/ime=1,gres/pfsdir=2,gres/pfsdir:ess=1,gres/pfsdir:scratch=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

troy@pitzer-login01:~$ scontrol show nodes p[0501-0518,0521-0522,0527,0529] | egrep 'NodeName|Features'
NodeName=p0501 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
NodeName=p0502 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
NodeName=p0503 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
NodeName=p0504 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
NodeName=p0505 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
NodeName=p0506 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
NodeName=p0507 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
NodeName=p0508 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
NodeName=p0509 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
NodeName=p0510 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
NodeName=p0511 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
NodeName=p0512 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
NodeName=p0513 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
NodeName=p0514 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
NodeName=p0515 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
NodeName=p0516 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
NodeName=p0517 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c05
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c05
NodeName=p0518 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c05
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c05
NodeName=p0521 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c06
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c06
NodeName=p0522 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c06
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c06
NodeName=p0527 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c07
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c07
NodeName=p0529 Arch=x86_64 CoresPerSocket=24
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c08
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c08

I can't find any of the reserved nodes that don't have both of those features:

troy@pitzer-login01:~$ scontrol show nodes p[0501-0518,0521-0522,0527,0529] | grep 'ActiveFeatures' | egrep -v '48core|c6420'
[...nothing...]
troy@pitzer-login01:~/slurm-bugs$ scontrol show nodes p[0501-0518,0521-0522,0527,0529] | grep 'ActiveFeatures' | egrep -v '48core.*c6420'
[...nothing...]

BTW, there was no partition specified when these reservations were created, so I think the batch partition got picked up by the reservation since it's the default:

troy@pitzer-login01:~/slurm-bugs/9973$ scontrol show partition batch
PartitionName=batch
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=p0[001-195],p0[401-404,501-792],p02[25-39],p03[01-19],p035[1-2],p02[57-60],p09[01-12]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=INACTIVE TotalCPUs=24512 TotalNodes=543 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

As you can see, the batch partition is inactive; we use it for routing in the submit filter.
Would it help if we changed the reservations' PartitionName value to parallel-48core? (And is it possible to set a reservation partition to a list of partitions rather than just one?)

Thanks for confirming that, Troy. One of my colleagues said that this looks like an issue he has worked on. There was a change he checked in to 20.02.5 to address a problem with adding constraints with a count on the number of features. Unfortunately this caused unintended side effects with features, and it has been reverted in the code base, but the revert won't be in the official releases until 20.02.6. You can find the commit that reverts the problem code here:
https://github.com/SchedMD/slurm/commit/19c72c188b0f27526d9402dda8e329999442fe3b
Our apologies for the inconvenience caused by this bug. Would you be able to apply this patch to your system to address the problem you're running into?
Thanks,
Ben

Ben,
I noticed one difference between your test and mine: your reservation has Flags=PURGE_COMP=00:05:00 and mine has Flags=MAINT,DAILY,PURGE_COMP=00:05:00. We've been operating under the assumption that the MAINT flag was just to disable accounting on the reservation, but I noticed on my test system that the reserved nodes went into the MAINT state during the reservation. Is it possible that the MAINT flag is the issue?

I've been doing some more testing to see if I could identify anything else that might cause the problem. You're right that I left off the MAINT and DAILY flags in my initial test. I tested whether using a different partition, one that includes the reserved nodes but isn't the partition specified in the reservation, would make a difference. In my testing that still worked fine. I also tried including the MAINT and DAILY flags to see if they made a difference, but they didn't change the behavior. Here is another example of a test with the flags and different partitions.
I added a second feature to all the nodes, with nodes 13-16 having both 'rack2' and 'feat4'. I modified my partitions so that 'debug' has all the nodes, 'gpu' still excludes nodes with the 'rack2' feature, and the 'high' partition includes nodes 13-16, along with some others.
$ scontrol create reservation reservationname=test_osc partition=debug feature="rack2&feat4" nodecnt=4 flags=maint,daily,purge_comp=5:00 account=sub1 starttime=15:07:00 duration=30:00
Reservation created: test_osc
$ scontrol show res
ReservationName=test_osc StartTime=2020-10-12T15:07:00 EndTime=2020-10-12T15:37:00 Duration=00:30:00
Nodes=node[13-16] NodeCnt=4 CoreCnt=96 Features=rack2&feat4 PartitionName=debug Flags=MAINT,DAILY,PURGE_COMP=00:05:00
TRES=cpu=96
Users=(null) Accounts=sub1 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 4 maint node[13-16]
debug* up infinite 14 idle node[01-12,17-18]
high up infinite 4 maint node[13-16]
high up infinite 8 idle node[05-12]
gpu up infinite 8 idle node[01-08]
socket up infinite 4 maint node[13-16]
socket up infinite 14 idle node[01-12,17-18]
$ sbatch -N4 --reservation=test_osc -phigh -Asub1 -t10:00 --wrap='srun sleep 60'
Submitted batch job 762
$ sbatch -N4 --reservation=test_osc -pgpu -Asub1 -t10:00 --wrap='srun sleep 60'
Submitted batch job 763
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
763 gpu wrap ben PD 0:00 4 (BadConstraints)
762 high wrap ben R 0:19 4 node[13-16]
The 'gpu' partition job still fails in this way, but the 'high' partition job, which has access to the right nodes, is able to run. The fact that the nodes show in a 'MAINT' state shouldn't have an effect on their ability to run these jobs.
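The pattern above comes down to whether the job's partition shares any nodes with the reservation. Below is a hypothetical pre-submit sanity check; the node lists are hard-coded to mirror the test cluster above, whereas on a real system they would be parsed from the Nodes= fields of `scontrol show res` and `scontrol show partition` output.

```shell
# Hypothetical check: does the target partition share any nodes with the
# reservation? Lists below mirror the test cluster above (not live scontrol data).
res_nodes="node13 node14 node15 node16"
gpu_nodes="node01 node02 node03 node04 node05 node06 node07 node08"
high_nodes="node05 node06 node07 node08 node13 node14 node15 node16"

# Print the nodes common to two space-separated lists.
overlap() {
    for n in $1; do
        case " $2 " in *" $n "*) echo "$n" ;; esac
    done
}

if [ -z "$(overlap "$res_nodes" "$gpu_nodes")" ]; then
    echo "gpu: no overlap with reservation (job would pend with BadConstraints)"
fi
if [ -n "$(overlap "$res_nodes" "$high_nodes")" ]; then
    echo "high: overlaps the reservation (job can run)"
fi
```

Run as written, this flags the 'gpu' partition and passes the 'high' partition, matching the squeue output above.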
You mention that you are using a test system. Are you able to reproduce the behavior there?
Thanks,
Ben
> You mention that you are using a test system. Are you able to reproduce the behavior there?
I decided to go back to first principles and build up from there to something similar to what I was trying to do:
scontrol create reservation=test start=09:40:00 duration=10:00 accounts=PZS0708 nodecnt=2
sbatch --time=05:00 --nodes=1 --reservation=test serial-short.job
# works
scontrol create reservation=test start=09:45:00 duration=10:00 accounts=PZS0708 nodecnt=2 flags=daily
sbatch --time=05:00 --nodes=1 --reservation=test serial-short.job
# works
scontrol create reservation=test start=09:50:00 duration=10:00 accounts=PZS0708 nodecnt=2 flags=daily,maint
sbatch --time=05:00 --nodes=1 --reservation=test serial-short.job
# works
scontrol create reservation=test start=09:55:00 duration=10:00 accounts=PZS0708 nodecnt=2 flags=daily,maint feature='haswell&vm'
sbatch --time=05:00 --nodes=1 --reservation=test serial-short.job
# works
So there is something different between my test system and my production system. I'll follow up on this in a couple hours.
OK, we've modified our test environment to be more like Pitzer in that some nodes have different features than others, and reverted to the version of Slurm on the production system. However, we still haven't been able to reproduce this there. I've been asked by my management to escalate this to the highest priority level.

Hi Troy,
That's interesting that it isn't happening on your test system either. Do you have some nodes on your production system we can use to create a reservation and try and narrow down what's happening?
Thanks,
Ben

(In reply to Ben Roberts from comment #11)
> Hi Troy,
>
> That's interesting that it isn't happening on your test system either. Do
> you have some nodes on your production system we can use to create a
> reservation and try and narrow down what's happening?
>
> Thanks,
> Ben

That's what I've been doing right now:

troy@pitzer-login04:~$ scontrol create reservation=test nodecnt=30 feature='c6420&48core' Flags=MAINT,DAILY,PURGE_COMP=00:05:00 start=13:20 duration=00:40:00 accounts=PZS0708
Reservation created: test
troy@pitzer-login04:~$ sbatch --nodes=1 --reservation=test --time=10:00 test.job
Submitted batch job 2029112
troy@pitzer-login04:~$ sbatch --nodes=2 --reservation=test --time=10:00 test.job
Submitted batch job 2029113
troy@pitzer-login04:~$ sbatch --nodes=4 --reservation=test --time=10:00 test.job
Submitted batch job 2029114
troy@pitzer-login04:~$ sbatch --nodes=8 --reservation=test --time=10:00 test.job
Submitted batch job 2029115
troy@pitzer-login04:~$ sbatch --nodes=16 --reservation=test --time=10:00 test.job
Submitted batch job 2029117
troy@pitzer-login04:~$ sbatch --nodes=30 --reservation=test --time=10:00 test.job
Submitted batch job 2029118

[…wait till after 13:20…]

troy@pitzer-login04:~$ squeue -u troy
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2029118 parallel- test troy PD 0:00 30 (PartitionNodeLimit)
2029113 parallel- test troy PD 0:00 2 (Resources)
2029112 serial-40 test troy PD 0:00 1 (Priority)
2029117 parallel- test troy R 1:23 16 p[0501-0516]
2029115 parallel- test troy R 1:23 8 p[0517-0518,0531-0533,0613-0615]
2029114 parallel- test troy R 1:23 4 p[0525,0536,0645-0646]
troy@pitzer-login04:~$ scancel 2029117 2029115 2029114
troy@pitzer-login04:~$ squeue -u troy
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2029118 parallel- test troy PD 0:00 30 (BadConstraints)
2029113 parallel- test troy R 0:40 2 p[0645-0646]
2029112 serial-48 test troy R 0:40 1 p0525
troy@pitzer-login04:~$ sbatch --nodes=29 --reservation=test --time=10:00 test.job
Submitted batch job 2029164
troy@pitzer-login04:~$ sbatch --nodes=28 --reservation=test --time=10:00 test.job
Submitted batch job 2029165
troy@pitzer-login04:~$ squeue -u troy
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2029118 parallel- test troy PD 0:00 30 (PartitionNodeLimit)
2029164 parallel- test troy PD 0:00 29 (PartitionNodeLimit)
2029165 parallel- test troy PD 0:00 28 (PartitionNodeLimit)
2029113 parallel- test troy R 4:13 2 p[0645-0646]
2029112 serial-48 test troy R 4:13 1 p0525
troy@pitzer-login04:~$ scancel 2029112 2029113
troy@pitzer-login04:~$ squeue -u troy
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2029118 parallel- test troy PD 0:00 30 (PartitionNodeLimit)
2029165 parallel- test troy PD 0:00 28 (PartitionNodeLimit)
2029164 parallel- test troy R 1:35 29 p[0501-0518,0525,0531-0533,0536,0575,0613-0615,0645-0646]

So it's funny that the one job that wouldn't start is the one that was the same size as the reservation. Maybe we just need to make the reservation 1-2 nodes bigger? (And yes, the 28-node job did eventually start.)
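If padding the reservation does turn out to be the workaround, the resize is a one-line `scontrol update`. The sketch below only prints the command rather than running it; the reservation name and sizes are the ones from this test, and whether a 1-2 node pad is actually sufficient is exactly the open question.

```shell
# Hypothetical workaround sketch: grow the reservation a couple of nodes past
# the largest job it must hold, so a job equal to the original size still starts.
largest_job=30   # node count of the job that would not start
pad=2
echo "scontrol update ReservationName=test NodeCnt=$((largest_job + pad))"
# → scontrol update ReservationName=test NodeCnt=32
```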
It's also a little strange that in this case, the reason given for the 30-node job not running *isn't* BadConstraints:

troy@pitzer-login04:~$ scontrol show job 2029118
JobId=2029118 JobName=test
   UserId=troy(6624) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=100627430 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:p[0401-0404,0501,0581] Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-10-13T13:18:15 EligibleTime=2020-10-13T13:20:01
   AccrueTime=2020-10-13T13:20:01
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-10-13T13:36:29
   Partition=parallel-40core,parallel-48core,condo-olivucci-backfill-parallel,gpubackfill-parallel-40core,condo-osumed-gpu-48core-backfill-parallel,gpubackfill-parallel-48core,condo-datta-backfill-parallel,condo-belloni-backfill-parallel,condo-honscheid-backfill-parallel,gpubackfill-parallel-quad,condo-ccapp-backfill-parallel,condo-osumed-gpu-quad-backfill-parallel,condo-osumed-gpu-40core-backfill-parallel
   AllocNode:Sid=pitzer-login04:257153
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=30-30 NumCPUs=30 NumTasks=30 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=30,mem=136680M,node=30,billing=30
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=test
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/users/sysp/troy/test.job
   WorkDir=/users/sysp/troy
   Comment=stdout=/users/sysp/troy/%x.o2029118
   StdErr=/users/sysp/troy/test.o2029118
   StdIn=/dev/null
   StdOut=/users/sysp/troy/test.o2029118
   Power=
   MailUser=(null) MailType=NONE
troy@pitzer-login04:~$ sinfo -T
RESV_NAME STATE START_TIME END_TIME DURATION NODELIST
[...]
test ACTIVE 2020-10-13T13:20:00 2020-10-13T14:00:00 00:40:00 p[0501-0518,0525,0531-0533,0536,0575,0613-0615,0645-0647]

What's odd about this is that all of the listed nodes are down, but only p0501 was in the reservation.

Thanks for putting that together. The thing that stands out to me is that the reason that shows up for the 28- and 30-node jobs is PartitionNodeLimit. I can see the details for the 'batch' partition, which doesn't have a MaxNodes limit set. Can I have you send the output of 'scontrol show partition' to verify that one of the other partitions being used doesn't have a limit set?

I see you updated with information about the nodes too. Nodes 0401-0404 were ones we were looking at yesterday as being in the partition, but not having the same features.

Also, I'm happy to work with you quickly to get this resolved, but I would refer you to our definitions of severity levels.
-----------------
Severity 1 — Major Impact
A Severity 1 issue occurs when there is a continued system outage that affects a large set of end users. The system is down and non-functional due to Slurm problem(s) and no procedural workaround exists.

Severity 2 — High Impact
A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system.
-----------------
This seems like it fits the definition of a Severity 2 ticket better. We do have an "Importance" field that allows you to reflect the impact to your site. As I said, it won't change my responsiveness.
Thanks,
Ben

troy@pitzer-login04:~$ scontrol show partitions
PartitionName=batch
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=p0[001-195],p0[401-404,501-792],p02[25-39],p03[01-19],p035[1-2],p02[57-60],p09[01-12]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=INACTIVE TotalCPUs=24512 TotalNodes=543 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
PartitionName=condo-belloni-backfill-parallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[19-24]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=240 TotalNodes=6 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556
PartitionName=condo-belloni-backfill-serial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[19-24]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=240 TotalNodes=6 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556
PartitionName=condo-belloni-parallel
   AllowGroups=ALL AllowAccounts=pcon0060 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-belloni
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[19-24]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=240 TotalNodes=6 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556
PartitionName=condo-belloni-serial
   AllowGroups=ALL AllowAccounts=pcon0060 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-belloni
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[19-24]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=240 TotalNodes=6 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556
PartitionName=condo-ccapp-backfill-parallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[08-18]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=440 TotalNodes=11 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556
PartitionName=condo-ccapp-backfill-serial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[08-18]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=440 TotalNodes=11 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556
PartitionName=condo-ccapp-parallel
   AllowGroups=ALL AllowAccounts=pcon0003 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-ccapp
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[08-18]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=440 TotalNodes=11 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556
PartitionName=condo-ccapp-serial
   AllowGroups=ALL AllowAccounts=pcon0003 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-ccapp
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[08-18]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=440 TotalNodes=11 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556
PartitionName=condo-datta-backfill-parallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[793-840]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2304 TotalNodes=48 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3797 MaxMemPerCPU=3797
PartitionName=condo-datta-backfill-serial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[793-840]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2304 TotalNodes=48 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3797 MaxMemPerCPU=3797
PartitionName=condo-datta-parallel
   AllowGroups=ALL AllowAccounts=pcon0014,pcon0015,pcon0016 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-datta
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[793-840]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2304 TotalNodes=48 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3797 MaxMemPerCPU=3797
PartitionName=condo-datta-serial
   AllowGroups=ALL AllowAccounts=pcon0014,pcon0015,pcon0016 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-datta
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[793-840]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2304 TotalNodes=48 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3797 MaxMemPerCPU=3797
PartitionName=condo-honscheid-backfill-parallel
AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40 Nodes=p020[3-7] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=200 TotalNodes=5 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4556 MaxMemPerCPU=4556 PartitionName=condo-honscheid-backfill-serial AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40 Nodes=p020[3-7] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=200 TotalNodes=5 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4556 MaxMemPerCPU=4556 PartitionName=condo-honscheid-parallel AllowGroups=ALL AllowAccounts=pcon0008 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-honscheid DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO 
MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40 Nodes=p020[3-7] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=200 TotalNodes=5 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4556 MaxMemPerCPU=4556 PartitionName=condo-honscheid-serial AllowGroups=ALL AllowAccounts=pcon0008 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-honscheid DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40 Nodes=p020[3-7] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=200 TotalNodes=5 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4556 MaxMemPerCPU=4556 PartitionName=condo-olivucci-backfill-parallel AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40 Nodes=p0[196-202] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=280 TotalNodes=7 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4556 MaxMemPerCPU=4556 PartitionName=condo-olivucci-backfill-serial AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40 Nodes=p0[196-202] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=280 TotalNodes=7 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4556 MaxMemPerCPU=4556 PartitionName=condo-olivucci-parallel AllowGroups=ALL AllowAccounts=pcon0010 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-olivucci DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40 Nodes=p0[196-202] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=280 TotalNodes=7 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4556 MaxMemPerCPU=4556 PartitionName=condo-olivucci-serial AllowGroups=ALL AllowAccounts=pcon0010 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-olivucci DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40 Nodes=p0[196-202] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=280 TotalNodes=7 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4556 MaxMemPerCPU=4556 PartitionName=condo-osumed-gpu-40core-backfill-parallel AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-backfill DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40 Nodes=p02[40-56] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=680 TotalNodes=17 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=9292 MaxMemPerCPU=9292 PartitionName=condo-osumed-gpu-40core-backfill-serial AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-backfill DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40 Nodes=p02[40-56] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=680 TotalNodes=17 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=9292 MaxMemPerCPU=9292 PartitionName=condo-osumed-gpu-40core-parallel AllowGroups=ALL 
AllowAccounts=pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-40core DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40 Nodes=p02[40-56] PriorityJobFactor=3000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=680 TotalNodes=17 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=9292 MaxMemPerCPU=9292 PartitionName=condo-osumed-gpu-40core-serial AllowGroups=ALL AllowAccounts=pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-40core DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40 Nodes=p02[40-56] PriorityJobFactor=3000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=680 TotalNodes=17 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=9292 MaxMemPerCPU=9292 PartitionName=condo-osumed-gpu-48core-backfill-parallel AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-backfill DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48 Nodes=p03[20-42] PriorityJobFactor=1000 
PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=1104 TotalNodes=23 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=7744 MaxMemPerCPU=7744 PartitionName=condo-osumed-gpu-48core-backfill-serial AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-backfill DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p03[20-42] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=1104 TotalNodes=23 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=7744 MaxMemPerCPU=7744 PartitionName=condo-osumed-gpu-48core-parallel AllowGroups=ALL AllowAccounts=pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-48core DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48 Nodes=p03[20-42] PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=1104 TotalNodes=23 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=7744 MaxMemPerCPU=7744 PartitionName=condo-osumed-gpu-48core-serial AllowGroups=ALL 
AllowAccounts=pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-48core DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p03[20-42] PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=1104 TotalNodes=23 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=7744 MaxMemPerCPU=7744 PartitionName=condo-osumed-gpu-quad-backfill-parallel AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-backfill-quad DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48 Nodes=p035[3-4] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=15872 MaxMemPerCPU=15872 PartitionName=condo-osumed-gpu-quad-backfill-serial AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-backfill-quad DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p035[3-4] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=15872 MaxMemPerCPU=15872 PartitionName=condo-osumed-gpu-quad-parallel AllowGroups=ALL AllowAccounts=pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48 Nodes=p035[3-4] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=15872 MaxMemPerCPU=15872 PartitionName=condo-osumed-gpu-quad-serial AllowGroups=ALL AllowAccounts=pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p035[3-4] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE 
PreemptMode=OFF State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=15872 MaxMemPerCPU=15872 PartitionName=debug AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p0[001-195],p0[401-404,501-792] PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=INACTIVE TotalCPUs=22008 TotalNodes=491 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED PartitionName=debug-40core AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=debug DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p0[001-195] PriorityJobFactor=5000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=7800 TotalNodes=195 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4556 MaxMemPerCPU=4556 PartitionName=debug-48core AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=debug DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p0[401-404,501-792] PriorityJobFactor=5000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=14208 TotalNodes=296 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=3797 MaxMemPerCPU=3797 PartitionName=gpubackfill-parallel-40core AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=4 MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40 Nodes=p02[25-39] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=600 TotalNodes=15 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=9292 MaxMemPerCPU=9292 PartitionName=gpubackfill-parallel-48core AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48 Nodes=p03[01-19] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=912 TotalNodes=19 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=7744 MaxMemPerCPU=7744 PartitionName=gpubackfill-parallel-quad AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=2 MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48 Nodes=p035[1-2] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=15872 MaxMemPerCPU=15872 PartitionName=gpubackfill-serial-40core AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40 Nodes=p02[25-39] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=600 TotalNodes=15 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=9292 MaxMemPerCPU=9292 PartitionName=gpubackfill-serial-48core AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p03[01-19] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=912 TotalNodes=19 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=7744 MaxMemPerCPU=7744 PartitionName=gpubackfill-serial-quad AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p035[1-2] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=15872 MaxMemPerCPU=15872 PartitionName=gpudebug AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p02[25-39],p03[01-19] PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=INACTIVE TotalCPUs=1512 TotalNodes=34 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED PartitionName=gpudebug-40core AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=gpudebug DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40 Nodes=p02[25-39] PriorityJobFactor=5000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=600 TotalNodes=15 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=9292 MaxMemPerCPU=9292 PartitionName=gpudebug-48core AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=gpudebug DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p03[01-19] PriorityJobFactor=5000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=912 TotalNodes=19 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=7744 MaxMemPerCPU=7744 PartitionName=gpudebug-quad AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=gpudebug DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p035[1-2] PriorityJobFactor=5000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=15872 MaxMemPerCPU=15872 PartitionName=gpuparallel AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-gpuparallel-partition DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=10 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48 Nodes=p02[25-39],p03[01-19],p035[1-2] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=INACTIVE TotalCPUs=1608 TotalNodes=36 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED PartitionName=gpuparallel-40core AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-gpuparallel-partition DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=10 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40 Nodes=p02[25-39] PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=600 TotalNodes=15 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=9292 MaxMemPerCPU=9292 PartitionName=gpuparallel-48core AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-gpuparallel-partition DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=10 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48 Nodes=p03[01-19] PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=912 TotalNodes=19 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=7744 MaxMemPerCPU=7744 PartitionName=gpuparallel-quad AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-gpu-quad-partition DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=2 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48 Nodes=p035[1-2] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=15872 MaxMemPerCPU=15872 PartitionName=gpuserial AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-gpuserial-partition DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p02[25-39],p03[01-19],p035[1-2] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=INACTIVE TotalCPUs=1608 TotalNodes=36 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED PartitionName=gpuserial-40core AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-gpuserial-partition DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40 Nodes=p02[25-39] PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=600 TotalNodes=15 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=9292 MaxMemPerCPU=9292 PartitionName=gpuserial-48core AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-gpuserial-partition DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p03[01-19] PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=912 TotalNodes=19 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=7744 MaxMemPerCPU=7744 PartitionName=gpuserial-quad AllowGroups=ALL 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-gpu-quad-partition DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p035[1-2] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=15872 MaxMemPerCPU=15872 PartitionName=hugemem AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-hugemem-partition DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=80 Nodes=p02[57-60] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=320 TotalNodes=4 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=38259 MaxMemPerCPU=38259 PartitionName=hugemem-parallel AllowGroups=ALL AllowAccounts=pzs0708,pzs0710,pzs0712 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 
MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=80 Nodes=p02[57-60] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=320 TotalNodes=4 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=38259 MaxMemPerCPU=38259 PartitionName=largemem AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=pitzer-largemem-partition DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p09[01-12] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=576 TotalNodes=12 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=15872 MaxMemPerCPU=15872 PartitionName=largemem-parallel AllowGroups=ALL AllowAccounts=pzs0708,pzs0710,pzs0712 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48 Nodes=p09[01-12] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=576 TotalNodes=12 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=15872 MaxMemPerCPU=15872 PartitionName=longserial AllowGroups=ALL AllowAccounts=pzs0708,pzs0710,pzs0714,pzs0712,pas0426,pfs0183,paa0209,pas1350,pas1117,pjs0320,pas1501 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO 
GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40 Nodes=p0[001-195] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=7800 TotalNodes=195 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4556 MaxMemPerCPU=4556 PartitionName=parallel AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=40 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48 Nodes=p0[001-195],p0[401-404,501-792] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=INACTIVE TotalCPUs=22008 TotalNodes=491 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED PartitionName=parallel-40core AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=40 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40 Nodes=p0[001-195] PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO 
OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=7800 TotalNodes=195 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4556 MaxMemPerCPU=4556 PartitionName=parallel-48core AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=40 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48 Nodes=p0[401-404,501-792] PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=14208 TotalNodes=296 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=3797 MaxMemPerCPU=3797 PartitionName=serial AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p0[001-195],p0[401-404,501-792] PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=INACTIVE TotalCPUs=22008 TotalNodes=491 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerNode=UNLIMITED 
MaxMemPerNode=UNLIMITED PartitionName=serial-40core AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40 Nodes=p0[001-195] PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=7800 TotalNodes=195 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=4556 MaxMemPerCPU=4556 PartitionName=serial-48core AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48 Nodes=p0[401-404,501-792] PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=14208 TotalNodes=296 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=3797 MaxMemPerCPU=3797 PartitionName=systems AllowGroups=sysstf,sappstf 
DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=p0[001-195],p0[401-404,501-792],p02[25-39],p03[01-19],p035[1-2],p02[57-60],p09[01-12] PriorityJobFactor=10000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=24512 TotalNodes=543 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

WRT the severity, I'm following the instructions of my manager. Also, the Importance field is what I set -- I can't find a severity field in my view other than that one.

I see that job 2029118 has a large list of partitions. Some of those partitions do have a MaxNodes limit that the job would exceed, 'gpubackfill-parallel-40core' for example. I assume the other test jobs you ran had the same partition list though, is that right? It is interesting that the nodes it's listing as unavailable are down. Can you run a test again with a couple of 29-node jobs so that one is pending? While it's pending, I'd like to see whether it references nodes outside the reservation as well as unavailable ones. The fact that it's referencing nodes that don't match the reservation makes me think this could be related to the bug I referenced yesterday.
I wasn't able to reproduce the behavior you're seeing where the bug came into play, but there are a lot of variables in your environment that I didn't include in my testing, and they may not have been a factor in your test environment either. What are your thoughts on applying the patch I referenced?

Thanks, Ben

One other hinky thing I just noticed is that one of the reserved nodes, p0501, was in an IDLE+DRAIN state. I cleared that just in case.

Ben, we've made the following changes on our production system:

1. We've made the customer reservations 1 node larger than the jobs (i.e. 31 nodes instead of 30), in case one of the nodes gets into a bad state but remains in the reservation, like we saw with p0501 in the test reservation earlier. (That's something I made a habit of with large reservations in Moab after being burned by node failures a number of times, so it's nothing new.)
2. We've updated Slurm with the patch you provided yesterday.

Hopefully between those two, we'll have this fixed. I'll do some tests myself here in a little bit, and then the next iteration of the customer's reservation is around 17:35 EDT tonight.

Thanks for the update, I do think those two actions should help. I'll stay posted for an update this afternoon.

Thanks, Ben

Based on my tests, I think this is going to work:
troy@pitzer-login04:~$ scontrol create reservation=test nodecnt=30 feature='c6420&48core' Flags=MAINT,PURGE_COMP=00:05:00 start=14:55:00 duration=01:00:00 accounts=PZS0708
Reservation created: test
troy@pitzer-login04:~$ sbatch --nodes=30 --reservation=test --time=10:00 test.job
Submitted batch job 2029438
troy@pitzer-login04:~$ sbatch --nodes=29 --reservation=test --time=10:00 test.job
Submitted batch job 2029439
troy@pitzer-login04:~$ sbatch --nodes=28 --reservation=test --time=10:00 test.job
Submitted batch job 2029440
troy@pitzer-login04:~$ sbatch --nodes=24 --reservation=test --time=10:00 test.job
Submitted batch job 2029441
troy@pitzer-login04:~$ sbatch --nodes=16 --reservation=test --time=10:00 test.job
Submitted batch job 2029442
troy@pitzer-login04:~$ sbatch --nodes=8 --reservation=test --time=10:00 test.job
Submitted batch job 2029443
troy@pitzer-login04:~$ sbatch --nodes=4 --reservation=test --time=10:00 test.job
Submitted batch job 2029444
troy@pitzer-login04:~$ sbatch --nodes=2 --reservation=test --time=10:00 test.job
Submitted batch job 2029445
troy@pitzer-login04:~$ sbatch --nodes=1 --reservation=test --time=10:00 test.job
Submitted batch job 2029447
troy@pitzer-login04:~$ squeue -u troy
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2029438 parallel- test troy PD 0:00 30 (Reservation)
2029439 parallel- test troy PD 0:00 29 (Reservation)
2029440 parallel- test troy PD 0:00 28 (Reservation)
2029441 parallel- test troy PD 0:00 24 (Reservation)
2029442 parallel- test troy PD 0:00 16 (Reservation)
2029443 parallel- test troy PD 0:00 8 (Reservation)
2029444 parallel- test troy PD 0:00 4 (Reservation)
2029445 parallel- test troy PD 0:00 2 (Reservation)
2029447 serial-40 test troy PD 0:00 1 (Reservation)
[...wait till ~15:06...]
troy@pitzer-login04:~$ squeue -u troy
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2029440 parallel- test troy PD 0:00 28 (PartitionNodeLimit)
2029441 parallel- test troy PD 0:00 24 (PartitionNodeLimit)
2029442 parallel- test troy PD 0:00 16 (PartitionNodeLimit)
2029443 parallel- test troy PD 0:00 8 (PartitionNodeLimit)
2029444 parallel- test troy PD 0:00 4 (PartitionNodeLimit)
2029445 parallel- test troy PD 0:00 2 (Priority)
2029439 parallel- test troy R 4:54 29 p[0501-0518,0521-0522,0527,0529,0531-0533,0535-0538]
2029447 serial-48 test troy R 4:51 1 p0546
# what happened to the 30-node job?
troy@pitzer-login04:~$ sacct -j2029438
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2029438 test parallel-+ pzs0708 1440 COMPLETED 0:0
2029438.bat+ batch pzs0708 48 COMPLETED 0:0
2029438.ext+ extern pzs0708 1440 COMPLETED 0:0
That does look hopeful. To see what happened to the 30-node job, I would start with the slurmctld logs. Do they have entries for that job ID showing when it started and ended?

(In reply to Ben Roberts from comment #22)
> That does look hopeful. To see what happened to the 30-node job, I would
> start with the slurmctld logs. Do they have entries for that job ID showing
> when it started and ended?

I can tell it ran from the accounting:

troy@pitzer-login04:~$ sacct -j2029438 -X -o jobid,start,end,nodelist%100
JobID        Start               End                 NodeList
------------ ------------------- ------------------- -----------------------------------------------------
2029438      2020-10-13T14:55:00 2020-10-13T15:00:02 p[0501-0518,0521-0522,0527,0529,0531-0533,0535-0538,0546]

Yeah, I was looking for the time stamps, so thank you for the sacct output. It looks like it was the first job to run when the reservation started. It ended at 15:00:02, and in your comment it looks like you waited until ~15:06 to look at the status again. That is enough time for the default MinJobAge of 300 seconds to have been met, so the record would have been purged from slurmctld's memory, requiring you to look at sacct for information about the job. This looks like normal behavior, unless you've increased MinJobAge from the default.

Thanks, Ben

The customer's latest job just started within the requested reservation. I'd like to see a couple more of these before we declare victory, but this is very encouraging.

I'm glad to hear it. Since the job started and you'd like to monitor a few more iterations, I'll lower the severity but leave the ticket open while we wait to see how it goes.
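The MinJobAge timing described above can be sanity-checked directly. This is just a sketch of the arithmetic: the ~15:06 timestamp is approximate, and 300 seconds is Slurm's documented default MinJobAge (viewable with `scontrol show config`), not a value confirmed from this site's slurm.conf:

```python
from datetime import datetime, timedelta

MIN_JOB_AGE = timedelta(seconds=300)  # Slurm's default MinJobAge

job_end = datetime.fromisoformat("2020-10-13T15:00:02")    # from the sacct output
squeue_at = datetime.fromisoformat("2020-10-13T15:06:00")  # approximate, "~15:06"

# slurmctld keeps a completed job's record in memory for at least MinJobAge
# seconds after it ends; after that it may be purged, and only accounting
# (sacct) can still report the job.
purge_eligible_at = job_end + MIN_JOB_AGE

print(squeue_at > purge_eligible_at)  # True: the record could already be gone
```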
Thanks, Ben

The customer just had this fail again, in what looks to be the same way:

root@pitzer-slurm01:~# grep 2032519 /var/log/slurm/slurmctld.log
Oct 14 11:26:11 pitzer-slurm01 slurmctld[59138]: _slurm_rpc_submit_batch_job: JobId=2032519 InitPrio=1200107024 usec=443
Oct 14 11:35:01 pitzer-slurm01 slurmctld[59138]: _pick_best_nodes: JobId=2032519 never runnable in partition parallel-48core
Oct 14 11:35:01 pitzer-slurm01 slurmctld[59138]: sched: schedule: JobId=2032519 non-runnable: Requested node configuration is not available
Oct 14 11:37:43 pitzer-slurm01 slurmctld[59138]: _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=2032519 uid 30211

They deleted the job before I had a chance to take a closer look at it. I've created a test reservation to have the customer test this for themselves:

troy@pitzer-login01:~$ scontrol create reservation=test nodecnt=2 feature='c6420&48core' start=12:30 duration=05:00:00 accounts=PZS0708,PYS1043 flags=maint
Reservation created: test
troy@pitzer-login01:~$ sinfo -T
RESV_NAME  STATE     START_TIME           END_TIME             DURATION  NODELIST
x005-00    INACTIVE  2020-10-14T23:35:00  2020-10-15T00:15:00  00:40:00  p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537]
x005-06    INACTIVE  2020-10-15T05:35:00  2020-10-15T06:15:00  00:40:00  p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
x005-12    INACTIVE  2020-10-15T11:35:00  2020-10-15T12:15:00  00:40:00  p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537]
x005-18    INACTIVE  2020-10-14T17:35:00  2020-10-14T18:15:00  00:40:00  p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
test       INACTIVE  2020-10-14T12:30:00  2020-10-14T17:30:00  05:00:00  p[0501-0502]

I just had another colleague bring up a ticket he's working on where the 'MAINT' flag seems to cause problems with jobs being able to start in a reservation. I'll do some testing with this to see if I can narrow down a reproducer, but I wanted to bring it up as something you might look at on your side as well.
Thanks, Ben

Looking in our slurmctld logs, we get a *LOT* of messages about modifying the node list for these reservations, but they usually seem to be about the same nodes:
root@pitzer-slurm01:~# grep -i X005-00 /var/log/slurm/slurmctld.log | head
Oct 14 03:28:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:28:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:29:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:29:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:30:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:30:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:31:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:31:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:32:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:32:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
root@pitzer-slurm01:~# grep -i X005-00 /var/log/slurm/slurmctld.log | grep modified | awk '{print $NF}' | sort | uniq -c
10 p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
1099 p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537]
1 p[0501-0518,0521-0522,0525,0528-0529,0532,0535,0537]
root@pitzer-slurm01:~# grep -i X005-06 /var/log/slurm/slurmctld.log | head
Oct 14 03:28:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:28:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:29:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:29:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:30:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:30:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:31:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:31:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:32:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:32:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
root@pitzer-slurm01:~# grep -i X005-06 /var/log/slurm/slurmctld.log | grep modified | awk '{print $NF}' | sort | uniq -c
10 p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
1101 p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
root@pitzer-slurm01:~# grep -i X005-12 /var/log/slurm/slurmctld.log | head
Oct 14 03:28:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:28:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:29:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:29:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:30:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:30:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:31:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:31:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:32:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:32:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
root@pitzer-slurm01:~# grep -i X005-12 /var/log/slurm/slurmctld.log | grep modified | awk '{print $NF}' | sort | uniq -c
10 p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
1085 p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537]
1 p[0501-0518,0521-0522,0525,0528-0529,0532,0535,0537]
root@pitzer-slurm01:~# grep -i X005-18 /var/log/slurm/slurmctld.log | head
Oct 14 03:28:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:28:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:29:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:29:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:30:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:30:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:31:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:31:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:32:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:32:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
root@pitzer-slurm01:~# grep -i X005-18 /var/log/slurm/slurmctld.log | grep modified | awk '{print $NF}' | sort | uniq -c
10 p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
1096 p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
It seems like p0508 and p0510 come and go from these reservations but the other nodes stay largely the same. I see a couple messages from slurmctld about p0510 but nothing about p0508:
root@pitzer-slurm01:~# grep p0508 /var/log/slurm/slurmctld.log
[...nothing...]
root@pitzer-slurm01:~# grep p0510 /var/log/slurm/slurmctld.log
Oct 14 03:32:49 pitzer-slurm01 slurmctld[59138]: update_node: node p0510 reason set to: NHC: check_cmd_output: TIMEOUT after 15s for "/bin/cat /fs/project/.testfile"; subprocess terminated.
Oct 14 03:32:49 pitzer-slurm01 slurmctld[59138]: update_node: node p0510 state set to DRAINED
Oct 14 03:42:37 pitzer-slurm01 slurmctld[59138]: update_node: node p0510 state set to IDLE
Oct 14 06:53:23 pitzer-slurm01 slurmctld[59138]: sched: Allocate JobId=2031930 NodeList=p0510 #CPUs=48 Partition=serial-48core
Very few jobs have run on them either:
root@pitzer-slurm01:~# sacct -X -S 2020-10-14T00:00:00 -E 2020-10-14T12:45:00 -N p0508,p0510 -o jobid,jobname,start,end,nodelist
JobID JobName Start End NodeList
------------ ---------- ------------------- ------------------- ---------------
2031930 ondemand/+ 2020-10-14T06:53:23 2020-10-14T11:53:36 p0510
What would cause sustained flurries of "modified reservation <rsv> due to unusable nodes" events like this?
We've been using the MAINT flag to prevent charging for reservations. The successful run we had last night was with the MAINT flag set, so if it is a problem, it's inconsistently so.

Agreed, if it is something related to the MAINT flag, it is not a consistent problem. I'm still looking into what might be happening with that flag.

However, with the log messages you're seeing, I am able to reproduce similar behavior. I set up a daily reservation and waited for it not to be currently active, so there was a reservation in the future. Then, if I set one of the nodes in the reservation down, I get log entries like the ones you're seeing. This looks like a bug, because it says it should be picking new nodes, but I see it selecting the same nodes, including the one that is down, so it repeats that message until the node comes back up. I still need to do more testing, but I haven't seen cases where reservations start with a down node (unless created with the ignstate flag), so I assume that as it gets closer to the start time it would actually change the nodes; I will confirm that.

It's interesting that you don't see log entries about p0508 changing state and only one change for p0510. Can you monitor those nodes for a while to see if they do change state? Are you able to run a single-node job on either of them?

Thanks, Ben

(In reply to Ben Roberts from comment #33)
> It's interesting that you don't see log entries about p0508 changing state
> and only one change for p0510. Can you monitor those nodes for a while to
> see if they do change state? Are you able to run a single-node job on
> either of them?
Yes, I can run short test jobs on both of them:

troy@pitzer-login01:~$ sbatch --nodes=1 --ntasks=48 --nodelist=p0508 test.job
Submitted batch job 2033005
troy@pitzer-login01:~$ sbatch --nodes=1 --ntasks=48 --nodelist=p0510 test.job
Submitted batch job 2033006
troy@pitzer-login01:~$ squeue -u troy
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2033005 serial-48     test     troy  R       0:03      1 p0508
           2033006 serial-48     test     troy  R       0:03      1 p0510

(In reply to Ben Roberts from comment #33)
> However, with the log messages you're seeing, I am able to reproduce similar
> behavior. I set up a daily reservation and waited for it not to be
> currently active so there is a reservation in the future. Then if I set one
> of the nodes in the reservation down I get log entries like you're seeing.
> This looks like a bug because it says it should be picking new nodes, but I
> see it selecting the same nodes, including the one that is down, so it
> starts repeating that message until the node comes back up. I still need to
> do more testing, but I haven't seen cases where reservations start with a
> down node (unless created with the ignstate flag) so I assume as it gets
> closer to the start time it would actually change the nodes, but I will
> confirm that.

Please do. If the reservation includes enough down nodes to prevent a job from starting, that might explain the BadConstraints symptom we see with this.

There is a bit of a 'duh' moment on my part. I forgot about the 'REPLACE_DOWN' flag until I stopped to think about it. The default behavior of reservations is that they won't replace nodes that go down, which does explain why jobs that request the full reservation weren't able to run (when a node was down) and why the log entries repeat without changing the nodes in the reservation. However, I did some testing with this flag, and it doesn't replace down nodes when the 'MAINT' flag is also included on the reservation. Without 'MAINT' it works as I would expect.
I know that maintenance reservations are treated differently in some ways, and this is probably one of them. I can see why they wouldn't replace down nodes for a maintenance reservation, because nodes are expected to go down. I'll confirm that it is expected behavior for maintenance reservations not to replace down nodes. Let me know if a node being down while the reservation is active doesn't line up with what you have been seeing, though.

Thanks,
Ben

(In reply to Ben Roberts from comment #36)
> There is a bit of a 'duh' moment on my part. I forgot about the
> 'REPLACE_DOWN' flag until I stopped to think about it. The default behavior
> of reservations is that they won't replace nodes that go down, which does
> explain why jobs that request the full reservation weren't able to run (when
> a node was down) and why the log entries repeat without changing the nodes
> in the reservation.

OK, it sounds like I need to set REPLACE_DOWN on these reservations. I (perhaps foolishly) thought that was the default when setting NodeCnt.

I traced what was happening with a maintenance reservation. When one of the nodes in a reservation goes down, it calls _resv_node_replace, which in turn calls _select_nodes. In _select_nodes there is a check for the MAINT or OVERLAP flags on the reservation, and if either is set, the node selection is skipped. Here are what I found to be the relevant lines:
https://github.com/SchedMD/slurm/blob/09769ad701e91ce7956e4b093933d89f0f10ec86/src/slurmctld/reservation.c#L3924-L3931

You would want to use the REPLACE_DOWN flag. I know you've mentioned that you use the MAINT flag to disable accounting for the reservation. Is removing that flag, so that down nodes can be replaced, a possibility?

Thanks,
Ben

> You would want to use the REPLACE_DOWN flag. I know you've mentioned that
> you use the MAINT flag to disable accounting for the reservation. Is
> removing that flag so that down nodes can be replaced a possibility?
Yes, the MAINT flag was only there to prevent charging for reservations while we were working through problems; I wasn't expecting that it might cause them. I'll remove that flag.
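The gating Ben traced in _select_nodes can be paraphrased as a toy model. This is purely illustrative, not Slurm's actual code: only the flag names (MAINT, OVERLAP, REPLACE_DOWN) and the skip-on-MAINT behavior come from the discussion above; the bit values, function names, and node selection policy here are hypothetical.

```python
# Toy model of the check Ben describes: if the reservation carries the
# MAINT or OVERLAP flag, node selection is skipped entirely, so a down
# node is never swapped out even when REPLACE_DOWN is also set.
RESERVE_FLAG_MAINT = 0x0001        # hypothetical bit values for illustration
RESERVE_FLAG_OVERLAP = 0x0002
RESERVE_FLAG_REPLACE_DOWN = 0x0004

def replace_down_nodes(resv_flags, resv_nodes, down_nodes, spare_pool):
    """Return the reservation's node set after attempting replacement."""
    if resv_flags & (RESERVE_FLAG_MAINT | RESERVE_FLAG_OVERLAP):
        # MAINT/OVERLAP reservations keep their node list as-is.
        return set(resv_nodes)
    if not (resv_flags & RESERVE_FLAG_REPLACE_DOWN):
        # Default behavior: down nodes are not replaced either.
        return set(resv_nodes)
    replaced = set(resv_nodes)
    for node in sorted(replaced & set(down_nodes)):
        spares = sorted(set(spare_pool) - replaced - set(down_nodes))
        if spares:
            replaced.remove(node)
            replaced.add(spares[0])
    return replaced

nodes = ["p0508", "p0510"]
# With MAINT set, the down node p0508 stays in the reservation:
kept = replace_down_nodes(RESERVE_FLAG_MAINT | RESERVE_FLAG_REPLACE_DOWN,
                          nodes, ["p0508"], ["p0511", "p0512"])
# Without MAINT, REPLACE_DOWN swaps it for a spare:
swapped = replace_down_nodes(RESERVE_FLAG_REPLACE_DOWN,
                             nodes, ["p0508"], ["p0511", "p0512"])
print(sorted(kept))     # ['p0508', 'p0510']
print(sorted(swapped))  # ['p0510', 'p0511']
```

This matches the observed symptom: with MAINT on the reservation, the down node stays put and full-reservation jobs cannot start; dropping MAINT lets REPLACE_DOWN do its job.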
Ok, I'm glad to hear that the flag isn't critical to how you are doing things. I'll wait to hear how things go with the change to replace the down nodes.

Thanks,
Ben

The latest customer job was able to launch in its reservation. I think we're getting there, but I'd like to see a couple more successes before we declare victory.

We've seen another successful job launch this morning, so we consider this resolved.

Thank you for the feedback. Marking as resolved.

I think we may have spoken too soon about this being solved. I created another, larger (~80-node) set of reservations for this customer, and while the earlier reservations seem to be working, these new ones are not. The especially curious thing is that after the job got reason=BadConstraints, I tried increasing the NodeCnt parameter on the reservation from 81 to 84, but AFAICT the job was never reevaluated for scheduling after the initial "never runnable in partition" message.

troy@pitzer-login01:~$ scontrol show job 2049096
JobId=2049096 JobName=x003-12-20201016
   UserId=wxops(30211) GroupId=PYS0343(5387) MCS_label=N/A
   Priority=0 Nice=0 Account=pys1043 QOS=pitzer-override-tres
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:30:00 TimeMin=N/A
   SubmitTime=2020-10-16T12:36:17 EligibleTime=2020-10-16T12:50:04
   AccrueTime=2020-10-16T12:50:04
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-10-16T12:50:05
   Partition=parallel-48core AllocNode:Sid=pitzer-login01:250275
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=80-80 NumCPUs=3840 NumTasks=3840 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=3840,mem=14580480M,node=80,billing=3840,gres/gpfs:ess=80
   Socks/Node=* NtasksPerN:B:S:C=48:0:*:1 CoreSpec=*
   MinCPUsNode=48 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=x003-12
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/fs/ess/scratch/PYS0343/wxops/runs/mpas/coldstart-g9-spirero/20201016/12/runscript
   WorkDir=/fs/ess/scratch/PYS0343/wxops/runs/mpas/coldstart-g9-spirero/20201016/12
   Comment=stdout=/fs/ess/scratch/PYS0343/wxops/runs/mpas/coldstart-g9-spirero/20201016/12/runscript.out
   StdErr=/fs/ess/scratch/PYS0343/wxops/runs/mpas/coldstart-g9-spirero/20201016/12/runscript.out
   StdIn=/dev/null
   StdOut=/fs/ess/scratch/PYS0343/wxops/runs/mpas/coldstart-g9-spirero/20201016/12/runscript.out
   Power=
   TresPerNode=gpfs:ess:1
   MailUser=(null) MailType=NONE

troy@pitzer-login01:~$ scontrol show reservation x003-12
ReservationName=x003-12 StartTime=2020-10-16T12:50:00 EndTime=2020-10-16T15:20:00 Duration=02:30:00
   Nodes=p[0501-0506,0509-0511,0514,0519-0522,0525-0539,0542-0547,0549-0550,0556-0557,0560,0563-0565,0567-0569,0571-0572,0574,0576-0580,0586,0589-0602,0624,0643,0668,0670,0683-0685,0687,0693,0699,0775,0778,0902,0905,0912] NodeCnt=84 CoreCnt=4032
   Features=c6420&48core PartitionName=batch Flags=DAILY,REPLACE_DOWN,PURGE_COMP=00:05:00
   TRES=cpu=4032
   Users=(null) Accounts=PYS1043 Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

root@pitzer-slurm01:~# grep 2049096 /var/log/slurm/slurmctld.log
Oct 16 12:36:17 pitzer-slurm01 slurmctld[59138]: _slurm_rpc_submit_batch_job: JobId=2049096 InitPrio=1200285399 usec=16432
Oct 16 12:50:05 pitzer-slurm01 slurmctld[59138]: _pick_best_nodes: JobId=2049096 never runnable in partition parallel-48core
Oct 16 12:50:05 pitzer-slurm01 slurmctld[59138]: sched: schedule: JobId=2049096 non-runnable: Requested node configuration is not available

(In reply to Troy Baer from comment #44)
> The especially curious thing is that after the job got reason=BadConstraints,
> I tried increasing the NodeCnt parameter on the reservation from 81 to 84,
> but AFAICT the job was never reevaluated for scheduling after the initial
> "never runnable in partition" message.

Please verify that the patch from comment #5 has been applied to slurmctld.
Jobs are only re-evaluated under certain states. We can, however, force the job to be re-evaluated by holding and releasing it. Please perform the following:

  sdiag -r
  scontrol show job -d 2049096
  scontrol setdebugflags +TraceJobs
  scontrol setdebugflags +SelectType
  scontrol setdebugflags +Reservation
  scontrol setdebug debug3
  scontrol hold 2049096
  sleep 1
  scontrol show job -d 2049096
  scontrol release 2049096
  sleep 500
  scontrol setdebugflags -TraceJobs
  scontrol setdebugflags -SelectType
  scontrol setdebugflags -Reservation
  scontrol setdebug info
  scontrol show job -d 2049096
  sdiag

Please attach the slurmctld logs generated during this period.

Nate, thanks for the update. We've verified that we are indeed using the patch in question. The job and reservation from Friday have long since passed, but we'll use that procedure the next time we see this, which could be as soon as ~40 minutes from now.

(In reply to Troy Baer from comment #46)
> The job and reservation from Friday have long since
> passed, but we'll use that procedure the next time we see this, which could
> be as soon as ~40 minutes from now.

Reducing ticket severity while we wait for the problem to reappear.

We did a test of this at 12:50 EDT today. Naturally, when I was watching it with the debugging turned up, it worked. I'd like to see a couple more successes before we declare victory on this (again).

Thanks for the update, Troy. That is frustrating that the problem didn't happen when you were ready for it. We're happy to wait for a few more iterations.

Thanks,
Ben

We've had three more successful goes at this, so we're declaring victory. Thanks for all the help.

I'm glad to hear it's kept working. Let us know if anything else comes up.

Thanks,
Ben
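The hold/release trick from the procedure above can be sketched as a toy model. This is illustrative only, not slurmctld's actual state handling; it is grounded in two facts from the thread: a job marked non-runnable sits with Reason=BadConstraints and Priority=0, and holding then releasing it forces re-evaluation. The class, state names, and the recomputed priority value are hypothetical.

```python
# Toy model: once a job is flagged non-runnable (Reason=BadConstraints,
# Priority=0), the scheduling loop skips it, so fixing the reservation
# alone does not help. Hold followed by release clears that state and
# makes the job eligible for evaluation again.
class Job:
    def __init__(self):
        self.reason = "BadConstraints"
        self.priority = 0  # matches the Priority=0 seen in scontrol output

    def hold(self):
        self.reason = "JobHeldUser"

    def release(self):
        # Release resets the job so the scheduler reconsiders it.
        self.reason = "None"
        self.priority = 1200285399  # hypothetical recomputed priority

def schedule(job):
    # Non-runnable (zero-priority) jobs are skipped by the main loop.
    return "skipped" if job.priority == 0 else "evaluated"

job = Job()
print(schedule(job))  # skipped
job.hold()
job.release()
print(schedule(job))  # evaluated
```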
We have a client who needs daily reservations to handle a recurring workflow. However, we're running into problems when that workflow actually attempts to run. One of the reservations in question looks like this:

troy@pitzer-login01:~$ scontrol show reservation x005-18
ReservationName=x005-18 StartTime=2020-10-12T17:35:00 EndTime=2020-10-12T18:15:00 Duration=00:40:00
   Nodes=p[0501-0518,0521-0522,0527,0529] NodeCnt=30 CoreCnt=1440
   Features=c6420&48core PartitionName=batch Flags=MAINT,DAILY,PURGE_COMP=00:05:00
   TRES=cpu=1440
   Users=(null) Accounts=PYS1043 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

The user typically submits the job using this reservation 5-10 minutes before the start of the reservation:

# before the start of the reservation
troy@pitzer-login01:~$ date ; scontrol show job 2015017
Fri Oct 9 17:33:42 EDT 2020
JobId=2015017 JobName=x005-20201009-18
   UserId=wxops(30211) GroupId=PYS0343(5387) MCS_label=N/A
   Priority=1200107024 Nice=0 Account=pys1043 QOS=pitzer-override-tres
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:35:00 TimeMin=N/A
   SubmitTime=2020-10-09T17:28:19 EligibleTime=Unknown
   AccrueTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-10-09T17:28:19
   Partition=parallel-48core AllocNode:Sid=pitzer-login02:13543
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=30-30 NumCPUs=1440 NumTasks=1440 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1440,mem=5467680M,node=30,billing=1440,gres/gpfs:ess=30
   Socks/Node=* NtasksPerN:B:S:C=48:0:*:1 CoreSpec=*
   MinCPUsNode=48 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=x005-18
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript
   WorkDir=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18
   Comment=stdout=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript.out
   StdErr=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript.out
   StdIn=/dev/null
   StdOut=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript.out
   Power=
   TresPerNode=gpfs:ess:1
   MailUser=(null) MailType=NONE

However, what we see is that when the reservation starts, the job does not start, and instead it has reason=BadConstraints:

# right after the start of the reservation
troy@pitzer-login01:~$ date ; scontrol show job 2015017
Fri Oct 9 17:35:58 EDT 2020
JobId=2015017 JobName=x005-20201009-18
   UserId=wxops(30211) GroupId=PYS0343(5387) MCS_label=N/A
   Priority=0 Nice=0 Account=pys1043 QOS=pitzer-override-tres
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:35:00 TimeMin=N/A
   SubmitTime=2020-10-09T17:28:19 EligibleTime=2020-10-09T17:35:04
   AccrueTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-10-09T17:35:04
   Partition=parallel-48core AllocNode:Sid=pitzer-login02:13543
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=30-30 NumCPUs=1440 NumTasks=1440 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1440,mem=5467680M,node=30,billing=1440,gres/gpfs:ess=30
   Socks/Node=* NtasksPerN:B:S:C=48:0:*:1 CoreSpec=*
   MinCPUsNode=48 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=x005-18
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript
   WorkDir=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18
   Comment=stdout=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript.out
   StdErr=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript.out
   StdIn=/dev/null
   StdOut=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript.out
   Power=
   TresPerNode=gpfs:ess:1
   MailUser=(null) MailType=NONE

In the occurrence shown above, I turned up the debug logging to try to
determine what the bad constraint was, and then stored the resulting log for analysis. However, I didn't really get much additional information:

troy@pitzer-login01:~$ zgrep 2015017 X005-18-slurmctld.log.gz
Oct 9 17:28:19 pitzer-slurm01 slurmctld[4699]: _slurm_rpc_submit_batch_job: JobId=2015017 InitPrio=1200107024 usec=675
Oct 9 17:28:31 pitzer-slurm01 slurmctld[4699]: debug3: Writing job id 2015017 to header record of job_state file
Oct 9 17:28:36 pitzer-slurm01 slurmctld[4699]: debug3: Writing job id 2015017 to header record of job_state file
Oct 9 17:34:16 pitzer-slurm01 slurmctld[4699]: debug2: priority for job 2015017 is now 1200107024
Oct 9 17:35:04 pitzer-slurm01 slurmctld[4699]: _pick_best_nodes: JobId=2015017 never runnable in partition parallel-48core
Oct 9 17:35:04 pitzer-slurm01 slurmctld[4699]: sched: schedule: JobId=2015017 non-runnable: Requested node configuration is not available

To my knowledge, the same job will run if it doesn't request a reservation. Please advise, and let me know what additional information is needed.
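One way to think about the "never runnable in partition" message: the job can only use reservation nodes that are also in its partition, satisfy the reservation's feature expression, and are up. A toy feasibility check follows; the AND expression 'c6420&48core' comes from the reservation above, but the node names, feature assignments, and counting logic are hypothetical illustrations, not Slurm's implementation.

```python
# Toy check: count the reservation's nodes that sit in the job's
# partition, carry every feature in the '&'-joined expression, and are
# not down. If that count falls below the job's node requirement, the
# job can never start, which surfaces as Reason=BadConstraints.
def usable_nodes(resv_nodes, partition_nodes, node_features, feature_expr, down):
    required = feature_expr.split("&")  # 'c6420&48core' -> all must match
    return [n for n in resv_nodes
            if n in partition_nodes
            and n not in down
            and all(f in node_features.get(n, set()) for f in required)]

# Hypothetical three-node example:
features = {"p0501": {"c6420", "48core"},
            "p0502": {"c6420", "48core"},
            "p0503": {"c6420", "40core"}}
resv = ["p0501", "p0502", "p0503"]
partition = {"p0501", "p0502", "p0503"}

ok = usable_nodes(resv, partition, features, "c6420&48core", down=set())
print(len(ok))  # 2: p0503 lacks the 48core feature

# A 2-node job fits; if p0501 then goes down (and is not replaced,
# e.g. because MAINT suppresses REPLACE_DOWN), it can never run:
ok = usable_nodes(resv, partition, features, "c6420&48core", down={"p0501"})
print(len(ok) >= 2)  # False -> the job pends with Reason=BadConstraints
```

This toy model ties the two observed failure modes together: a feature or partition mismatch shrinks the usable set permanently, while an unreplaced down node shrinks it for as long as the node stays down.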