Ticket 9973 - Job requesting reservation does not start with reason=BadConstraints
Summary: Job requesting reservation does not start with reason=BadConstraints
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: reservations
Version: 20.02.5
Hardware: Linux
Importance: 4 - Minor Issue
Assignee: Ben Roberts
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-10-12 09:40 MDT by Troy Baer
Modified: 2020-10-21 12:09 MDT

See Also:
Site: Ohio State OSC
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Troy Baer 2020-10-12 09:40:17 MDT
We have a client who needs daily reservations to handle a recurring workflow.  However, we're running into problems when that workflow actually attempts to run.

One of the reservations in question looks like this:


troy@pitzer-login01:~$ scontrol show reservation x005-18
ReservationName=x005-18 StartTime=2020-10-12T17:35:00 EndTime=2020-10-12T18:15:00 Duration=00:40:00
   Nodes=p[0501-0518,0521-0522,0527,0529] NodeCnt=30 CoreCnt=1440 Features=c6420&48core PartitionName=batch Flags=MAINT,DAILY,PURGE_COMP=00:05:00
   TRES=cpu=1440
   Users=(null) Accounts=PYS1043 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

The user typically submits the job using this reservation 5-10 minutes before the start of the reservation:

# before the start of the reservation
troy@pitzer-login01:~$ date ; scontrol show job 2015017
Fri Oct 9 17:33:42 EDT 2020
JobId=2015017 JobName=x005-20201009-18
    UserId=wxops(30211) GroupId=PYS0343(5387) MCS_label=N/A
    Priority=1200107024 Nice=0 Account=pys1043 QOS=pitzer-override-tres
    JobState=PENDING Reason=Reservation Dependency=(null)
    Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
    RunTime=00:00:00 TimeLimit=00:35:00 TimeMin=N/A
    SubmitTime=2020-10-09T17:28:19 EligibleTime=Unknown
    AccrueTime=Unknown
    StartTime=Unknown EndTime=Unknown Deadline=N/A
    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-10-09T17:28:19
    Partition=parallel-48core AllocNode:Sid=pitzer-login02:13543
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=(null)
    NumNodes=30-30 NumCPUs=1440 NumTasks=1440 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    TRES=cpu=1440,mem=5467680M,node=30,billing=1440,gres/gpfs:ess=30
    Socks/Node=* NtasksPerN:B:S:C=48:0:*:1 CoreSpec=*
    MinCPUsNode=48 MinMemoryCPU=3797M MinTmpDiskNode=0
    Features=(null) DelayBoot=00:00:00
    Reservation=x005-18
    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
    Command=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript
    WorkDir=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18
    Comment=stdout=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript.out
    StdErr=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript.out
    StdIn=/dev/null
    StdOut=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript.out
    Power=
    TresPerNode=gpfs:ess:1
    MailUser=(null) MailType=NONE

However, what we see is that when the reservation starts, the job does not start, and instead it has reason=BadConstraints:

# right after the start of the reservation
troy@pitzer-login01:~$ date ; scontrol show job 2015017
Fri Oct 9 17:35:58 EDT 2020
JobId=2015017 JobName=x005-20201009-18
    UserId=wxops(30211) GroupId=PYS0343(5387) MCS_label=N/A
    Priority=0 Nice=0 Account=pys1043 QOS=pitzer-override-tres
    JobState=PENDING Reason=BadConstraints Dependency=(null)
    Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
    RunTime=00:00:00 TimeLimit=00:35:00 TimeMin=N/A
    SubmitTime=2020-10-09T17:28:19 EligibleTime=2020-10-09T17:35:04
    AccrueTime=Unknown
    StartTime=Unknown EndTime=Unknown Deadline=N/A
    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-10-09T17:35:04
    Partition=parallel-48core AllocNode:Sid=pitzer-login02:13543
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=(null)
    NumNodes=30-30 NumCPUs=1440 NumTasks=1440 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    TRES=cpu=1440,mem=5467680M,node=30,billing=1440,gres/gpfs:ess=30
    Socks/Node=* NtasksPerN:B:S:C=48:0:*:1 CoreSpec=*
    MinCPUsNode=48 MinMemoryCPU=3797M MinTmpDiskNode=0
    Features=(null) DelayBoot=00:00:00
    Reservation=x005-18
    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
    Command=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript
    WorkDir=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18
    Comment=stdout=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript.out
    StdErr=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript.out
    StdIn=/dev/null
    StdOut=/fs/ess/scratch/PYS0343/wxops/tmp/20201009/18/runscript.out
    Power=
    TresPerNode=gpfs:ess:1
    MailUser=(null) MailType=NONE

In the occurrence shown above, I turned up the debug logging to try to determine what the bad constraint was, then saved the resulting log for analysis.  However, I didn't get much additional information:

troy@pitzer-login01:~$ zgrep 2015017 X005-18-slurmctld.log.gz
Oct 9 17:28:19 pitzer-slurm01 slurmctld[4699]: _slurm_rpc_submit_batch_job: JobId=2015017 InitPrio=1200107024 usec=675
Oct 9 17:28:31 pitzer-slurm01 slurmctld[4699]: debug3: Writing job id 2015017 to header record of job_state file
Oct 9 17:28:36 pitzer-slurm01 slurmctld[4699]: debug3: Writing job id 2015017 to header record of job_state file
Oct 9 17:34:16 pitzer-slurm01 slurmctld[4699]: debug2: priority for job 2015017 is now 1200107024
Oct 9 17:35:04 pitzer-slurm01 slurmctld[4699]: _pick_best_nodes: JobId=2015017 never runnable in partition parallel-48core
Oct 9 17:35:04 pitzer-slurm01 slurmctld[4699]: sched: schedule: JobId=2015017 non-runnable: Requested node configuration is not available

To my knowledge, the same job will run if it doesn't request a reservation.

Please advise.  Let me know what additional information is needed.
Comment 2 Ben Roberts 2020-10-12 10:33:34 MDT
Hi Troy,

I believe I see what is causing the job to fail to start in the reservation.  It looks like the reservation was created specifying the 'batch' partition and 'c6420&48core' as features, while the job requests the 'parallel-48core' partition.  I haven't dug up the slurm.conf you've sent in another ticket, but my guess is that the nodes in the 'parallel-48core' partition don't have the same features.  Is that right?
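One way to check this on your side can be sketched as below.  This is a minimal sketch, assuming the reservation's Features= expression is an AND of 'c6420' and '48core'; the node data is inlined sample output (p0401's feature list here is made up for illustration), and on the cluster you would instead pipe in real output, e.g. `scontrol show nodes p[0501-0518,0521-0522,0527,0529]`:

```shell
required="c6420 48core"

# Print one line per node, flagging any node whose ActiveFeatures line is
# missing a required feature.  Substring matching is good enough for a
# sketch; exact token matching would need to split on commas.
check_features() {
  awk -v req="$required" '
    /NodeName=/       { split($1, a, "="); node = a[2] }
    /ActiveFeatures=/ {
      n = split(req, want, " ")
      missing = ""
      for (i = 1; i <= n; i++)
        if (index($1, want[i]) == 0) missing = missing " " want[i]
      if (missing != "") print node ": missing" missing
      else               print node ": ok"
    }'
}

check_features <<'EOF'
NodeName=p0501 Arch=x86_64 CoresPerSocket=24
   ActiveFeatures=48core,expansion,exp,c6420,cpu
NodeName=p0401 Arch=x86_64 CoresPerSocket=24
   ActiveFeatures=40core,expansion,exp,cpu
EOF
```

On the sample data this prints `p0501: ok` and `p0401: missing c6420 48core`.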

If so, I can reproduce this behavior.  I created a reservation that specifies a feature that is only on half the nodes in my cluster.
$ scontrol create reservation reservationname=test_osc partition=debug feature=rack2 nodecnt=5 flags=purge_comp=5:00 account=sub1 starttime=11:25:00 duration=30:00
Reservation created: test_osc

$ scontrol show res
ReservationName=test_osc StartTime=2020-10-12T11:25:00 EndTime=2020-10-12T11:55:00 Duration=00:30:00
   Nodes=node[09-13] NodeCnt=5 CoreCnt=120 Features=rack2 PartitionName=debug Flags=PURGE_COMP=00:05:00
   TRES=cpu=120
   Users=(null) Accounts=sub1 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)




Then I submitted a job that requests a partition that doesn't have the same nodes available to it.  When it's time for the job to start, it fails with a 'BadConstraints' reason.  Another thing to notice is that its Priority drops to '0' when it fails to start.

$ sbatch -N5 --reservation=test_osc -pgpu -Asub1 -t10:00 --wrap='srun sleep 300'
sbatch: In original lua submit function
Submitted batch job 753

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
               753       gpu     wrap      ben PD       0:00      5 (BadConstraints) 

$ scontrol show job 753
JobId=753 JobName=wrap
   UserId=ben(1000) GroupId=ben(1000) MCS_label=N/A
   Priority=0 Nice=0 Account=sub1 QOS=normal
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-10-12T11:24:21 EligibleTime=2020-10-12T11:25:00
   AccrueTime=Unknown
   StartTime=2020-10-12T11:25:11 EndTime=2020-10-12T11:35:11 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-10-12T11:25:00
   Partition=gpu AllocNode:Sid=kitt:7656
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=5-5 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=5,node=5,billing=5
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=test_osc
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/ben/slurm/src/20-02/kitt/etc
   StdErr=/home/ben/slurm/src/20-02/kitt/etc/slurm-753.out
   StdIn=/dev/null
   StdOut=/home/ben/slurm/src/20-02/kitt/etc/slurm-753.out
   Power=
   MailUser=(null) MailType=NONE


If I submit the job with another partition specified that overlaps with the nodes in the reservation, it runs fine.  Let me know if this looks like what you're running into.

Thanks,
Ben
Comment 3 Troy Baer 2020-10-12 10:46:30 MDT
(In reply to Ben Roberts from comment #2)
> I believe I see what is causing the job to fail to start in the reservation.
> It looks like the reservation was created, specifying the 'batch' partition
> and 'c6420&48core' as features.  However, the job requests the
> 'parallel-48core' partition.  I didn't try to dig up a slurm.conf you've
> sent in another ticket, but my guess is that the nodes in the
> 'parallel-48core' partition don't have the same features.  Is that right?


No, that is not the case as far as I can find.  For instance, one of the reserved nodes is p0501, which has both the c6420 and 48core features:

troy@pitzer-login01:~$ scontrol show node p0501
NodeName=p0501 Arch=x86_64 CoresPerSocket=24 
   CPUAlloc=0 CPUTot=48 CPULoad=0.11
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
   Gres=pfsdir:scratch:1,pfsdir:ess:1,ime:1,gpfs:project:1,gpfs:scratch:1,gpfs:ess:1
   NodeAddr=10.4.11.1 NodeHostName=p0501 Version=20.02.5
   OS=Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Wed Feb 12 14:08:31 UTC 2020 
   RealMemory=182272 AllocMem=0 FreeMem=179001 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=2 Owner=N/A MCS_label=N/A
   Partitions=batch,debug,debug-48core,parallel,parallel-48core,serial,serial-48core,systems 
   BootTime=2020-08-18T12:05:04 SlurmdStartTime=2020-09-15T09:13:49
   CfgTRES=cpu=48,mem=178G,billing=48,gres/gpfs:ess=1,gres/gpfs:project=1,gres/gpfs:scratch=1,gres/ime=1,gres/pfsdir=2,gres/pfsdir:ess=1,gres/pfsdir:scratch=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

troy@pitzer-login01:~$ scontrol show nodes p[0501-0518,0521-0522,0527,0529] | egrep 'NodeName|Features'
NodeName=p0501 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
NodeName=p0502 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
NodeName=p0503 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
NodeName=p0504 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c01
NodeName=p0505 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
NodeName=p0506 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
NodeName=p0507 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
NodeName=p0508 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c02
NodeName=p0509 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
NodeName=p0510 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
NodeName=p0511 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
NodeName=p0512 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s14,ib-i4,pitzer-rack11,pitzer-rack11-c03
NodeName=p0513 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
NodeName=p0514 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
NodeName=p0515 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
NodeName=p0516 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c04
NodeName=p0517 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c05
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c05
NodeName=p0518 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c05
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c05
NodeName=p0521 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c06
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c06
NodeName=p0522 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c06
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c06
NodeName=p0527 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c07
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c07
NodeName=p0529 Arch=x86_64 CoresPerSocket=24 
   AvailableFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c08
   ActiveFeatures=48core,expansion,exp,c6420,cpu,eth-pitzer-rack11h1,ib-i4l1s16,ib-i4,pitzer-rack11,pitzer-rack11-c08


I can't find any reserved node that is missing either of those features:

troy@pitzer-login01:~$ scontrol show nodes p[0501-0518,0521-0522,0527,0529] | grep 'ActiveFeatures' | egrep -v '48core|c6420'
[...nothing...]

troy@pitzer-login01:~/slurm-bugs$ scontrol show nodes p[0501-0518,0521-0522,0527,0529] | grep 'ActiveFeatures' | egrep -v '48core.*c6420'
[...nothing...]
Comment 4 Troy Baer 2020-10-12 11:01:39 MDT
BTW, there was no partition specified when these reservations were created, so I think the batch partition got picked up by the reservation since it's the default:

troy@pitzer-login01:~/slurm-bugs/9973$ scontrol show partition batch
PartitionName=batch
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=p0[001-195],p0[401-404,501-792],p02[25-39],p03[01-19],p035[1-2],p02[57-60],p09[01-12]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=INACTIVE TotalCPUs=24512 TotalNodes=543 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

As you can see, the batch partition is inactive; we use it for routing in the submit filter.

Would it help if we changed the reservations' PartitionName value to parallel-48core?  (And is it possible to set a reservation partition to a list of partitions rather than just one?)
Comment 5 Ben Roberts 2020-10-12 11:22:04 MDT
Thanks for confirming that, Troy.  One of my colleagues said this looks like an issue he has worked on.  He checked a change into 20.02.5 to address a problem with constraints that request a count of features.  Unfortunately that change had unintended side effects with features, and it has been reverted in the code base, but the revert won't be in an official release until 20.02.6.  You can find the commit that reverts the problem code here:
https://github.com/SchedMD/slurm/commit/19c72c188b0f27526d9402dda8e329999442fe3b

Our apologies for the inconvenience caused by this bug.  Would you be able to apply this patch to your system to address the problem you're running into?  
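A minimal sketch of the patch workflow, exercised here in a scratch directory so it can run anywhere.  On the real system you would instead cd into your slurm-20.02.5 source tree, fetch the commit with `curl -L -O https://github.com/SchedMD/slurm/commit/19c72c188b0f27526d9402dda8e329999442fe3b.patch`, apply it the same way, then rebuild and restart slurmctld.  The file name below is a stand-in for illustration, not the file the real commit touches:

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p src
printf 'feature_count_logic\n' > src/node_scheduler.c   # stand-in source file

# A unified diff using the same a/ b/ path prefixes GitHub's .patch files use:
cat > revert.patch <<'EOF'
--- a/src/node_scheduler.c
+++ b/src/node_scheduler.c
@@ -1 +1 @@
-feature_count_logic
+reverted_feature_count_logic
EOF

patch -p1 --dry-run < revert.patch   # confirm it applies cleanly first
patch -p1 < revert.patch             # then apply for real
```

The `--dry-run` pass is worth keeping on a production tree: it reports any hunks that would fail without touching the source.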

Thanks,
Ben
Comment 7 Troy Baer 2020-10-12 13:24:10 MDT
Ben, I noticed one difference between your test and mine: your reservation has Flags=PURGE_COMP=00:05:00 and mine has Flags=MAINT,DAILY,PURGE_COMP=00:05:00.  We've been operating under the assumption that the MAINT flag just disables accounting on the reservation, but I noticed on my test system that the reserved nodes went into the MAINT state during the reservation.  Is it possible that the MAINT flag is the issue?
Comment 8 Ben Roberts 2020-10-12 14:34:29 MDT
I've been doing some more testing to see if I could identify anything else that might cause the problem.  You're right that I left off the MAINT and DAILY flags in my initial test.  I tested whether using a different partition, one that includes the reserved nodes but isn't the partition specified in the reservation, would make a difference; in my testing that still worked fine.  I also tried including the MAINT and DAILY flags, but they didn't change the behavior.  Here is another example of a test with the flags and different partitions.

I added a second feature to all the nodes with nodes 13-16 having both 'rack2' and 'feat4'.  I modified my partitions so that 'debug' has all the nodes, 'gpu' still excludes the 'rack2' feature and the 'high' partition includes nodes 13-16, along with some others.

$ scontrol create reservation reservationname=test_osc partition=debug feature="rack2&feat4" nodecnt=4 flags=maint,daily,purge_comp=5:00 account=sub1 starttime=15:07:00 duration=30:00
Reservation created: test_osc

$ scontrol show res
ReservationName=test_osc StartTime=2020-10-12T15:07:00 EndTime=2020-10-12T15:37:00 Duration=00:30:00
   Nodes=node[13-16] NodeCnt=4 CoreCnt=96 Features=rack2&feat4 PartitionName=debug Flags=MAINT,DAILY,PURGE_COMP=00:05:00
   TRES=cpu=96
   Users=(null) Accounts=sub1 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
debug*       up   infinite      4  maint node[13-16] 
debug*       up   infinite     14   idle node[01-12,17-18] 
high         up   infinite      4  maint node[13-16] 
high         up   infinite      8   idle node[05-12] 
gpu          up   infinite      8   idle node[01-08] 
socket       up   infinite      4  maint node[13-16] 
socket       up   infinite     14   idle node[01-12,17-18] 




$ sbatch -N4 --reservation=test_osc -phigh -Asub1 -t10:00 --wrap='srun sleep 60'
Submitted batch job 762
$ sbatch -N4 --reservation=test_osc -pgpu -Asub1 -t10:00 --wrap='srun sleep 60'
Submitted batch job 763

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
               763       gpu     wrap      ben PD       0:00      4 (BadConstraints) 
               762      high     wrap      ben  R       0:19      4 node[13-16] 


The 'gpu' partition job still fails in this way, but the 'high' partition job, which has access to the right nodes, is able to run.  The fact that the nodes show in a 'MAINT' state shouldn't have an effect on their ability to run these jobs.  

You mention that you are using a test system.  Are you able to reproduce the behavior there?

Thanks,
Ben
Comment 9 Troy Baer 2020-10-13 08:01:29 MDT
> You mention that you are using a test system.  Are you able to reproduce the behavior there?

I decided to go back to first principles and build up from there to something similar to what I was trying to do:

scontrol create reservation=test start=09:40:00 duration=10:00 accounts=PZS0708 nodecnt=2
sbatch --time=05:00 --nodes=1 --reservation=test serial-short.job
# works

scontrol create reservation=test start=09:45:00 duration=10:00 accounts=PZS0708 nodecnt=2 flags=daily
sbatch --time=05:00 --nodes=1 --reservation=test serial-short.job
# works

scontrol create reservation=test start=09:50:00 duration=10:00 accounts=PZS0708 nodecnt=2 flags=daily,maint
sbatch --time=05:00 --nodes=1 --reservation=test serial-short.job
# works

scontrol create reservation=test start=09:55:00 duration=10:00 accounts=PZS0708 nodecnt=2 flags=daily,maint feature='haswell&vm'
sbatch --time=05:00 --nodes=1 --reservation=test serial-short.job
# works

So there is something different between my test system and my production system.  I'll follow up on this in a couple hours.
Comment 10 Troy Baer 2020-10-13 11:07:13 MDT
OK, we've modified our test environment to be more like Pitzer, in that some nodes have different features than others, and reverted to the version of Slurm on the production system.  However, we still haven't been able to reproduce this there.

I've been asked by my management to escalate this to the highest priority level.
Comment 11 Ben Roberts 2020-10-13 11:25:02 MDT
Hi Troy,

That's interesting that it isn't happening on your test system either.  Do you have some nodes on your production system we can use to create a reservation and try and narrow down what's happening?

Thanks,
Ben
Comment 12 Troy Baer 2020-10-13 11:30:04 MDT
(In reply to Ben Roberts from comment #11)
> Hi Troy,
> 
> That's interesting that it isn't happening on your test system either.  Do
> you have some nodes on your production system we can use to create a
> reservation and try and narrow down what's happening?
> 
> Thanks,
> Ben

That's what I've been doing right now:

troy@pitzer-login04:~$ scontrol create reservation=test nodecnt=30 feature='c6420&48core' Flags=MAINT,DAILY,PURGE_COMP=00:05:00 start=13:20 duration=00:40:00 accounts=PZS0708
Reservation created: test
 
troy@pitzer-login04:~$ sbatch --nodes=1 --reservation=test --time=10:00 test.job
Submitted batch job 2029112
 
troy@pitzer-login04:~$ sbatch --nodes=2 --reservation=test --time=10:00 test.job
Submitted batch job 2029113
 
troy@pitzer-login04:~$ sbatch --nodes=4 --reservation=test --time=10:00 test.job
Submitted batch job 2029114
 
troy@pitzer-login04:~$ sbatch --nodes=8 --reservation=test --time=10:00 test.job
Submitted batch job 2029115
 
troy@pitzer-login04:~$ sbatch --nodes=16 --reservation=test --time=10:00 test.job
Submitted batch job 2029117
 
troy@pitzer-login04:~$ sbatch --nodes=30 --reservation=test --time=10:00 test.job
Submitted batch job 2029118
 
[…wait till after 13:20…]
troy@pitzer-login04:~$ squeue -u troy
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2029118 parallel-     test     troy PD       0:00     30 (PartitionNodeLimit)
           2029113 parallel-     test     troy PD       0:00      2 (Resources)
           2029112 serial-40     test     troy PD       0:00      1 (Priority)
           2029117 parallel-     test     troy  R       1:23     16 p[0501-0516]
           2029115 parallel-     test     troy  R       1:23      8 p[0517-0518,0531-0533,0613-0615]
           2029114 parallel-     test     troy  R       1:23      4 p[0525,0536,0645-0646]
 
troy@pitzer-login04:~$ scancel 2029117 2029115 2029114
 
troy@pitzer-login04:~$ squeue -u troy
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2029118 parallel-     test     troy PD       0:00     30 (BadConstraints)
           2029113 parallel-     test     troy  R       0:40      2 p[0645-0646]
           2029112 serial-48     test     troy  R       0:40      1 p0525
 
troy@pitzer-login04:~$ sbatch --nodes=29 --reservation=test --time=10:00 test.job 
Submitted batch job 2029164

troy@pitzer-login04:~$ sbatch --nodes=28 --reservation=test --time=10:00 test.job 
Submitted batch job 2029165

troy@pitzer-login04:~$ squeue -u troy
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           2029118 parallel-     test     troy PD       0:00     30 (PartitionNodeLimit) 
           2029164 parallel-     test     troy PD       0:00     29 (PartitionNodeLimit) 
           2029165 parallel-     test     troy PD       0:00     28 (PartitionNodeLimit) 
           2029113 parallel-     test     troy  R       4:13      2 p[0645-0646] 
           2029112 serial-48     test     troy  R       4:13      1 p0525 

troy@pitzer-login04:~$ scancel 2029112 2029113

troy@pitzer-login04:~$ squeue -u troy
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           2029118 parallel-     test     troy PD       0:00     30 (PartitionNodeLimit) 
           2029165 parallel-     test     troy PD       0:00     28 (PartitionNodeLimit) 
           2029164 parallel-     test     troy  R       1:35     29 p[0501-0518,0525,0531-0533,0536,0575,0613-0615,0645-0646] 


So it’s funny that the one job that wouldn’t start is the one that was the same size as the reservation.  Maybe we just need to make the reservation 1-2 nodes bigger?
Comment 13 Troy Baer 2020-10-13 11:33:11 MDT
(And yes, the 28 node job did eventually start.)
Comment 14 Troy Baer 2020-10-13 11:41:56 MDT
It's also a little strange that in this case, the reason given for the 30-node job not running *isn't* BadConstraints:

troy@pitzer-login04:~$ scontrol show job 2029118
JobId=2029118 JobName=test
   UserId=troy(6624) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=100627430 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:p[0401-0404,0501,0581] Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-10-13T13:18:15 EligibleTime=2020-10-13T13:20:01
   AccrueTime=2020-10-13T13:20:01
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-10-13T13:36:29
   Partition=parallel-40core,parallel-48core,condo-olivucci-backfill-parallel,gpubackfill-parallel-40core,condo-osumed-gpu-48core-backfill-parallel,gpubackfill-parallel-48core,condo-datta-backfill-parallel,condo-belloni-backfill-parallel,condo-honscheid-backfill-parallel,gpubackfill-parallel-quad,condo-ccapp-backfill-parallel,condo-osumed-gpu-quad-backfill-parallel,condo-osumed-gpu-40core-backfill-parallel AllocNode:Sid=pitzer-login04:257153
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=30-30 NumCPUs=30 NumTasks=30 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=30,mem=136680M,node=30,billing=30
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=test
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/users/sysp/troy/test.job
   WorkDir=/users/sysp/troy
   Comment=stdout=/users/sysp/troy/%x.o2029118 
   StdErr=/users/sysp/troy/test.o2029118
   StdIn=/dev/null
   StdOut=/users/sysp/troy/test.o2029118
   Power=
   MailUser=(null) MailType=NONE

troy@pitzer-login04:~$ sinfo -T
RESV_NAME     STATE           START_TIME             END_TIME     DURATION  NODELIST
[...]
test         ACTIVE  2020-10-13T13:20:00  2020-10-13T14:00:00     00:40:00  p[0501-0518,0525,0531-0533,0536,0575,0613-0615,0645-0647]

What's odd about this is that all of the listed nodes are down, but only p0501 was in the reservation.
Comment 15 Ben Roberts 2020-10-13 11:45:17 MDT
Thanks for putting that together.  The thing that stands out to me is that the reason shown for the 28- and 30-node jobs is PartitionNodeLimit.  I can see the details for the 'batch' partition, which doesn't have a MaxNodes limit set.  Can I have you send the output of 'scontrol show partition' so we can verify that none of the other partitions being used has a limit set?
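A quick way to scan for such a limit can be sketched as below.  The partition data is inlined sample output with a made-up MaxNodes value on a hypothetical partition; on the cluster you would pipe in real output from `scontrol show partitions`:

```shell
# Flag any partition whose MaxNodes is smaller than the job size passed as $1.
find_maxnodes_limits() {
  awk -v jobsize="$1" '
    /PartitionName=/ { split($1, a, "="); part = a[2] }
    /MaxNodes=/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^MaxNodes=/) {
          split($i, m, "=")
          if (m[2] != "UNLIMITED" && m[2] + 0 < jobsize)
            print part " caps jobs at " m[2] " nodes"
        }
    }'
}

find_maxnodes_limits 30 <<'EOF'
PartitionName=batch
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0
PartitionName=some-partition
   MaxNodes=20 MaxTime=4-00:00:00 MinNodes=0
EOF
```

On the sample data this prints `some-partition caps jobs at 20 nodes`; no output would mean no partition limit is below the job size.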

I see you updated with information about the nodes too.  Nodes 0401-0404 were the ones we were looking at yesterday as being in the partition but not having the same features.

Also, I'm happy to work with you quickly to get this resolved, but I would refer you to our definitions of severity levels.  
-----------------
Severity 1 — Major Impact
A Severity 1 issue occurs when there is a continued system outage that affects a large set of end users. The system is down and non-functional due to Slurm problem(s) and no procedural workaround exists.

Severity 2 — High Impact
A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system.
-----------------

This seems like it fits the definition of a Severity 2 ticket better.  We do have an "Importance" field that allows you to reflect the impact on your site.  As I said, it won't change my responsiveness.

Thanks,
Ben
Comment 16 Troy Baer 2020-10-13 11:53:59 MDT
troy@pitzer-login04:~$ scontrol show partitions
PartitionName=batch
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=p0[001-195],p0[401-404,501-792],p02[25-39],p03[01-19],p035[1-2],p02[57-60],p09[01-12]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=INACTIVE TotalCPUs=24512 TotalNodes=543 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=condo-belloni-backfill-parallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[19-24]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=240 TotalNodes=6 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-belloni-backfill-serial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[19-24]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=240 TotalNodes=6 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-belloni-parallel
   AllowGroups=ALL AllowAccounts=pcon0060 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-belloni
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[19-24]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=240 TotalNodes=6 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-belloni-serial
   AllowGroups=ALL AllowAccounts=pcon0060 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-belloni
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[19-24]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=240 TotalNodes=6 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-ccapp-backfill-parallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[08-18]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=440 TotalNodes=11 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-ccapp-backfill-serial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[08-18]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=440 TotalNodes=11 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-ccapp-parallel
   AllowGroups=ALL AllowAccounts=pcon0003 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-ccapp
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[08-18]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=440 TotalNodes=11 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-ccapp-serial
   AllowGroups=ALL AllowAccounts=pcon0003 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-ccapp
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[08-18]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=440 TotalNodes=11 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-datta-backfill-parallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[793-840]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2304 TotalNodes=48 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3797 MaxMemPerCPU=3797

PartitionName=condo-datta-backfill-serial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[793-840]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2304 TotalNodes=48 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3797 MaxMemPerCPU=3797

PartitionName=condo-datta-parallel
   AllowGroups=ALL AllowAccounts=pcon0014,pcon0015,pcon0016 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-datta
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[793-840]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2304 TotalNodes=48 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3797 MaxMemPerCPU=3797

PartitionName=condo-datta-serial
   AllowGroups=ALL AllowAccounts=pcon0014,pcon0015,pcon0016 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-datta
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[793-840]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2304 TotalNodes=48 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3797 MaxMemPerCPU=3797

PartitionName=condo-honscheid-backfill-parallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p020[3-7]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=200 TotalNodes=5 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-honscheid-backfill-serial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p020[3-7]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=200 TotalNodes=5 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-honscheid-parallel
   AllowGroups=ALL AllowAccounts=pcon0008 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-honscheid
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p020[3-7]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=200 TotalNodes=5 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-honscheid-serial
   AllowGroups=ALL AllowAccounts=pcon0008 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-honscheid
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p020[3-7]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=200 TotalNodes=5 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-olivucci-backfill-parallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p0[196-202]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=280 TotalNodes=7 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-olivucci-backfill-serial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p0[196-202]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=280 TotalNodes=7 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-olivucci-parallel
   AllowGroups=ALL AllowAccounts=pcon0010 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-olivucci
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p0[196-202]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=280 TotalNodes=7 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-olivucci-serial
   AllowGroups=ALL AllowAccounts=pcon0010 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-olivucci
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p0[196-202]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=280 TotalNodes=7 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=condo-osumed-gpu-40core-backfill-parallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-backfill
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[40-56]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=680 TotalNodes=17 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9292 MaxMemPerCPU=9292

PartitionName=condo-osumed-gpu-40core-backfill-serial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-backfill
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[40-56]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=680 TotalNodes=17 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9292 MaxMemPerCPU=9292

PartitionName=condo-osumed-gpu-40core-parallel
   AllowGroups=ALL AllowAccounts=pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-40core
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[40-56]
   PriorityJobFactor=3000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=680 TotalNodes=17 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9292 MaxMemPerCPU=9292

PartitionName=condo-osumed-gpu-40core-serial
   AllowGroups=ALL AllowAccounts=pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-40core
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[40-56]
   PriorityJobFactor=3000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=680 TotalNodes=17 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9292 MaxMemPerCPU=9292

PartitionName=condo-osumed-gpu-48core-backfill-parallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-backfill
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[20-42]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=1104 TotalNodes=23 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=7744 MaxMemPerCPU=7744

PartitionName=condo-osumed-gpu-48core-backfill-serial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-backfill
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[20-42]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=1104 TotalNodes=23 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=7744 MaxMemPerCPU=7744

PartitionName=condo-osumed-gpu-48core-parallel
   AllowGroups=ALL AllowAccounts=pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-48core
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[20-42]
   PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=1104 TotalNodes=23 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=7744 MaxMemPerCPU=7744

PartitionName=condo-osumed-gpu-48core-serial
   AllowGroups=ALL AllowAccounts=pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-48core
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[20-42]
   PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=1104 TotalNodes=23 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=7744 MaxMemPerCPU=7744

PartitionName=condo-osumed-gpu-quad-backfill-parallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-backfill-quad
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p035[3-4]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=15872 MaxMemPerCPU=15872

PartitionName=condo-osumed-gpu-quad-backfill-serial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-condo-osumed-gpu-backfill-quad
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p035[3-4]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=15872 MaxMemPerCPU=15872

PartitionName=condo-osumed-gpu-quad-parallel
   AllowGroups=ALL AllowAccounts=pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p035[3-4]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=15872 MaxMemPerCPU=15872

PartitionName=condo-osumed-gpu-quad-serial
   AllowGroups=ALL AllowAccounts=pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p035[3-4]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=15872 MaxMemPerCPU=15872

PartitionName=debug
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[001-195],p0[401-404,501-792]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=INACTIVE TotalCPUs=22008 TotalNodes=491 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=debug-40core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=debug
   DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[001-195]
   PriorityJobFactor=5000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=7800 TotalNodes=195 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=debug-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=debug
   DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[401-404,501-792]
   PriorityJobFactor=5000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=14208 TotalNodes=296 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3797 MaxMemPerCPU=3797

PartitionName=gpubackfill-parallel-40core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=4 MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[25-39]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=600 TotalNodes=15 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9292 MaxMemPerCPU=9292

PartitionName=gpubackfill-parallel-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[01-19]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=912 TotalNodes=19 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=7744 MaxMemPerCPU=7744

PartitionName=gpubackfill-parallel-quad
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=2 MaxTime=04:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p035[1-2]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=15872 MaxMemPerCPU=15872

PartitionName=gpubackfill-serial-40core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[25-39]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=600 TotalNodes=15 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9292 MaxMemPerCPU=9292

PartitionName=gpubackfill-serial-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[01-19]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=912 TotalNodes=19 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=7744 MaxMemPerCPU=7744

PartitionName=gpubackfill-serial-quad
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p035[1-2]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=15872 MaxMemPerCPU=15872

PartitionName=gpudebug
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p02[25-39],p03[01-19]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=INACTIVE TotalCPUs=1512 TotalNodes=34 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=gpudebug-40core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=gpudebug
   DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[25-39]
   PriorityJobFactor=5000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=600 TotalNodes=15 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9292 MaxMemPerCPU=9292

PartitionName=gpudebug-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=gpudebug
   DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[01-19]
   PriorityJobFactor=5000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=912 TotalNodes=19 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=7744 MaxMemPerCPU=7744

PartitionName=gpudebug-quad
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=gpudebug
   DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=2 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p035[1-2]
   PriorityJobFactor=5000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=15872 MaxMemPerCPU=15872

PartitionName=gpuparallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-gpuparallel-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=10 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p02[25-39],p03[01-19],p035[1-2]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=INACTIVE TotalCPUs=1608 TotalNodes=36 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=gpuparallel-40core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-gpuparallel-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=10 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[25-39]
   PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=600 TotalNodes=15 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9292 MaxMemPerCPU=9292

PartitionName=gpuparallel-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-gpuparallel-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=10 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[01-19]
   PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=912 TotalNodes=19 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=7744 MaxMemPerCPU=7744

PartitionName=gpuparallel-quad
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-gpu-quad-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=2 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p035[1-2]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=15872 MaxMemPerCPU=15872

PartitionName=gpuserial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-gpuserial-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p02[25-39],p03[01-19],p035[1-2]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=INACTIVE TotalCPUs=1608 TotalNodes=36 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=gpuserial-40core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-gpuserial-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p02[25-39]
   PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=600 TotalNodes=15 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9292 MaxMemPerCPU=9292

PartitionName=gpuserial-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-gpuserial-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[01-19]
   PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=912 TotalNodes=19 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=7744 MaxMemPerCPU=7744

PartitionName=gpuserial-quad
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-gpu-quad-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p035[1-2]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=96 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=15872 MaxMemPerCPU=15872

PartitionName=hugemem
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-hugemem-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=80
   Nodes=p02[57-60]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=320 TotalNodes=4 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=38259 MaxMemPerCPU=38259

PartitionName=hugemem-parallel
   AllowGroups=ALL AllowAccounts=pzs0708,pzs0710,pzs0712 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=80
   Nodes=p02[57-60]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=320 TotalNodes=4 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=38259 MaxMemPerCPU=38259

PartitionName=largemem
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-largemem-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p09[01-12]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=576 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=15872 MaxMemPerCPU=15872

PartitionName=largemem-parallel
   AllowGroups=ALL AllowAccounts=pzs0708,pzs0710,pzs0712 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p09[01-12]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=576 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=15872 MaxMemPerCPU=15872

PartitionName=longserial
   AllowGroups=ALL AllowAccounts=pzs0708,pzs0710,pzs0714,pzs0712,pas0426,pfs0183,paa0209,pas1350,pas1117,pjs0320,pas1501 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p0[001-195]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=7800 TotalNodes=195 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=parallel
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=40 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[001-195],p0[401-404,501-792]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=INACTIVE TotalCPUs=22008 TotalNodes=491 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=parallel-40core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=40 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=40
   Nodes=p0[001-195]
   PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=7800 TotalNodes=195 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=parallel-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=40 MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[401-404,501-792]
   PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=14208 TotalNodes=296 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3797 MaxMemPerCPU=3797

PartitionName=serial
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[001-195],p0[401-404,501-792]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=INACTIVE TotalCPUs=22008 TotalNodes=491 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=serial-40core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=40
   Nodes=p0[001-195]
   PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=7800 TotalNodes=195 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4556 MaxMemPerCPU=4556

PartitionName=serial-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p0[401-404,501-792]
   PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=14208 TotalNodes=296 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3797 MaxMemPerCPU=3797

PartitionName=systems
   AllowGroups=sysstf,sappstf DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0100,pcon0101 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=p0[001-195],p0[401-404,501-792],p02[25-39],p03[01-19],p035[1-2],p02[57-60],p09[01-12]
   PriorityJobFactor=10000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=24512 TotalNodes=543 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

WRT the severity, I'm following the instructions of my manager.  Also, the Importance field is what I set -- I can't find a severity field in my view other than that one.
Comment 17 Ben Roberts 2020-10-13 12:12:47 MDT
I see that job 2029118 has a large list of partitions.  Some of those partitions do have a MaxNodes limit that would be exceeded by the job, like 'gpubackfill-parallel-40core' as an example.  I assume the other test jobs you ran had the same partition list though, is that right?  

It is interesting that the nodes it's listing as unavailable are down.  Could you run the test again with a couple of 29-node jobs so that one is pending?  While it's pending I'd like to see whether it references nodes outside of the reservation in addition to the unavailable ones.  

The fact that it's making reference to nodes that don't match the reservation makes me think it could be related to the bug I referenced yesterday.  I wasn't able to reproduce the behavior you're seeing in the cases where that bug came into play, but there are a lot of variables in your environment that I didn't include in my testing and that may not have been present in your test environment either.  What are your thoughts on applying the patch I referenced?
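
A minimal sketch of the requested test, mirroring the sbatch/squeue syntax used elsewhere in this ticket (the reservation name `test` and job script `test.job` are assumptions):

```shell
# Hypothetical reproduction: submit two 29-node jobs against a 30-node
# reservation so the second job pends while the first runs.
sbatch --nodes=29 --reservation=test --time=10:00 test.job
sbatch --nodes=29 --reservation=test --time=10:00 test.job
# While the second job pends, check its reason and any node list it reports:
squeue -u "$USER" -o "%i %t %R"
```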

Thanks,
Ben
Comment 18 Troy Baer 2020-10-13 12:22:23 MDT
One other hinky thing I just noticed is that one of the reserved nodes, p0501, was in an IDLE+DRAIN state.  I cleared that just in case.
Comment 19 Troy Baer 2020-10-13 12:50:36 MDT
Ben, we've made the following changes on our production system:

1.  We've made the customer reservations 1 node larger than the jobs (i.e. 31 nodes instead of 30), in case one of the nodes gets into a bad state but remains in the reservation like we saw with p0501 in the test reservation earlier.  (That's something I made a habit of with large reservations in Moab after being burned by node failures a number of times, so it's nothing new.)

2.  We've updated Slurm with the patch you provided yesterday.

Hopefully between those two, we'll have this fixed.  I'll do some tests myself here in a little bit, and then the next iteration of the customer's reservation is around 17:35EDT tonight.
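
For reference, change (1) amounts to something like the following (a hedged sketch using the reservation name shown earlier in this ticket; the exact invocation on our system may differ):

```shell
# Pad the daily reservation by one node (31 nodes for a 30-node job) so a
# single node going into a bad state inside the reservation can't block the job.
scontrol update reservationname=x005-18 nodecnt=31
```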
Comment 20 Ben Roberts 2020-10-13 13:01:51 MDT
Thanks for the update, I do think those two actions should help.  I'll stay posted for an update this afternoon.

Thanks,
Ben
Comment 21 Troy Baer 2020-10-13 13:09:37 MDT
Based on my tests, I think this is going to work:

troy@pitzer-login04:~$ scontrol create reservation=test nodecnt=30 feature='c6420&48core' Flags=MAINT,PURGE_COMP=00:05:00 start=14:55:00 duration=01:00:00 accounts=PZS0708
Reservation created: test

troy@pitzer-login04:~$ sbatch --nodes=30 --reservation=test --time=10:00 test.job
Submitted batch job 2029438

troy@pitzer-login04:~$ sbatch --nodes=29 --reservation=test --time=10:00 test.job
Submitted batch job 2029439

troy@pitzer-login04:~$ sbatch --nodes=28 --reservation=test --time=10:00 test.job
Submitted batch job 2029440

troy@pitzer-login04:~$ sbatch --nodes=24 --reservation=test --time=10:00 test.job
Submitted batch job 2029441

troy@pitzer-login04:~$ sbatch --nodes=16 --reservation=test --time=10:00 test.job
Submitted batch job 2029442

troy@pitzer-login04:~$ sbatch --nodes=8 --reservation=test --time=10:00 test.job
Submitted batch job 2029443

troy@pitzer-login04:~$ sbatch --nodes=4 --reservation=test --time=10:00 test.job
Submitted batch job 2029444

troy@pitzer-login04:~$ sbatch --nodes=2 --reservation=test --time=10:00 test.job
Submitted batch job 2029445

troy@pitzer-login04:~$ sbatch --nodes=1 --reservation=test --time=10:00 test.job
Submitted batch job 2029447

troy@pitzer-login04:~$ squeue -u troy
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           2029438 parallel-     test     troy PD       0:00     30 (Reservation) 
           2029439 parallel-     test     troy PD       0:00     29 (Reservation) 
           2029440 parallel-     test     troy PD       0:00     28 (Reservation) 
           2029441 parallel-     test     troy PD       0:00     24 (Reservation) 
           2029442 parallel-     test     troy PD       0:00     16 (Reservation) 
           2029443 parallel-     test     troy PD       0:00      8 (Reservation) 
           2029444 parallel-     test     troy PD       0:00      4 (Reservation) 
           2029445 parallel-     test     troy PD       0:00      2 (Reservation) 
           2029447 serial-40     test     troy PD       0:00      1 (Reservation) 

[...wait till ~15:06...]

troy@pitzer-login04:~$ squeue -u troy
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           2029440 parallel-     test     troy PD       0:00     28 (PartitionNodeLimit) 
           2029441 parallel-     test     troy PD       0:00     24 (PartitionNodeLimit) 
           2029442 parallel-     test     troy PD       0:00     16 (PartitionNodeLimit) 
           2029443 parallel-     test     troy PD       0:00      8 (PartitionNodeLimit) 
           2029444 parallel-     test     troy PD       0:00      4 (PartitionNodeLimit) 
           2029445 parallel-     test     troy PD       0:00      2 (Priority) 
           2029439 parallel-     test     troy  R       4:54     29 p[0501-0518,0521-0522,0527,0529,0531-0533,0535-0538] 
           2029447 serial-48     test     troy  R       4:51      1 p0546 

# what happened to the 30-node job?
troy@pitzer-login04:~$ sacct -j2029438
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
2029438            test parallel-+    pzs0708       1440  COMPLETED      0:0 
2029438.bat+      batch               pzs0708         48  COMPLETED      0:0 
2029438.ext+     extern               pzs0708       1440  COMPLETED      0:0
Comment 22 Ben Roberts 2020-10-13 13:36:19 MDT
That does look hopeful.  To see what happened to the 30 node job I would start with the slurmctld logs.  Do they have entries for that job id showing when it started and ended?
Comment 23 Troy Baer 2020-10-13 13:41:27 MDT
(In reply to Ben Roberts from comment #22)
> That does look hopeful.  To see what happened to the 30 node job I would
> start with the slurmctld logs.  Do they have entries for that job id showing
> when it started and ended?

I can tell it ran from the accounting:

troy@pitzer-login04:~$ sacct -j2029438 -X -o jobid,start,end,nodelist%100
       JobID               Start                 End                                                  NodeList 
------------ ------------------- ------------------- ---------------------------------------------------------
2029438      2020-10-13T14:55:00 2020-10-13T15:00:02 p[0501-0518,0521-0522,0527,0529,0531-0533,0535-0538,0546]
Comment 24 Ben Roberts 2020-10-13 14:14:11 MDT
Yeah, I was looking for the time stamps, so thank you for the sacct output.  It looks like it was the first to run when the reservation started.  It ended at 15:00:02 and in your comment it looks like you waited until ~15:06 to look at the status again.  This is enough time that the default MinJobAge value of 300 seconds would have been met and the record would have been purged from slurmctld's memory, requiring you to look at sacct for information about the job.  This does look like normal behavior unless you've increased your MinJobAge from the default.
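
The timing argument above can be sketched as a small check (the 300-second value is Slurm's default MinJobAge; the timestamps are taken from the sacct output in this ticket):

```python
from datetime import datetime, timedelta

MIN_JOB_AGE = timedelta(seconds=300)  # Slurm's default MinJobAge

def purged_from_ctld(end_time: datetime, query_time: datetime) -> bool:
    """A completed job drops out of slurmctld's memory (and thus squeue)
    once it has been finished for longer than MinJobAge; it remains
    visible via sacct."""
    return query_time - end_time > MIN_JOB_AGE

end = datetime(2020, 10, 13, 15, 0, 2)    # job 2029438 End time from sacct
query = datetime(2020, 10, 13, 15, 6, 0)  # approximate time of the squeue check
print(purged_from_ctld(end, query))  # True: ~6 minutes exceeds 300 seconds
```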

Thanks,
Ben
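
[Editor's note: a quick way to confirm the MinJobAge in effect, as discussed above. This is a sketch using the standard Slurm CLI and requires a live cluster; it is not from the original ticket.]

```shell
# How long completed jobs stay in slurmctld's memory before being purged
# (default is 300 seconds; after that, only sacct has the job record).
scontrol show config | grep -i MinJobAge
```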
Comment 25 Troy Baer 2020-10-13 15:39:23 MDT
The customer's latest job just started within the requested reservation.  I'd like to have a couple more of these before we declare victory, but this is very encouraging.
Comment 26 Ben Roberts 2020-10-13 15:49:51 MDT
I'm glad to hear it.  Since the job started and you'd like to monitor a few more times I'll lower the severity but leave the ticket open while we wait to see how it goes.

Thanks,
Ben
Comment 27 Troy Baer 2020-10-14 10:18:37 MDT
The customer just had this fail again, in what looks to be the same way:

root@pitzer-slurm01:~# grep 2032519 /var/log/slurm/slurmctld.log
Oct 14 11:26:11 pitzer-slurm01 slurmctld[59138]: _slurm_rpc_submit_batch_job: JobId=2032519 InitPrio=1200107024 usec=443
Oct 14 11:35:01 pitzer-slurm01 slurmctld[59138]: _pick_best_nodes: JobId=2032519  never runnable in partition parallel-48core
Oct 14 11:35:01 pitzer-slurm01 slurmctld[59138]: sched: schedule: JobId=2032519 non-runnable: Requested node configuration is not available
Oct 14 11:37:43 pitzer-slurm01 slurmctld[59138]: _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=2032519 uid 30211

They deleted the job before I had a chance to take a closer look at it.
Comment 28 Troy Baer 2020-10-14 10:27:17 MDT
I've created a test reservation to have the customer test this for themselves:

troy@pitzer-login01:~$ scontrol create reservation=test nodecnt=2 feature='c6420&48core' start=12:30 duration=05:00:00 accounts=PZS0708,PYS1043 flags=maint
Reservation created: test

troy@pitzer-login01:~$ sinfo -T
RESV_NAME STATE START_TIME END_TIME DURATION NODELIST
x005-00 INACTIVE 2020-10-14T23:35:00 2020-10-15T00:15:00 00:40:00 p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537]
x005-06 INACTIVE 2020-10-15T05:35:00 2020-10-15T06:15:00 00:40:00 p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
x005-12 INACTIVE 2020-10-15T11:35:00 2020-10-15T12:15:00 00:40:00 p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537]
x005-18 INACTIVE 2020-10-14T17:35:00 2020-10-14T18:15:00 00:40:00 p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
test INACTIVE 2020-10-14T12:30:00 2020-10-14T17:30:00 05:00:00 p[0501-0502]
Comment 30 Ben Roberts 2020-10-14 10:47:38 MDT
I just had another colleague bring up a ticket he's working on in which the 'MAINT' flag seems to prevent jobs from starting in a reservation.  I'll do some testing to see if I can narrow down a reproducer, but I wanted to mention it as something you might look at on your side as well.

Thanks,
Ben
Comment 31 Troy Baer 2020-10-14 10:51:22 MDT
Looking in our slurmctld logs, we get a *LOT* of messages about modifying the node list for these reservations, but they usually seem to be on the same nodes:

root@pitzer-slurm01:~# grep -i X005-00 /var/log/slurm/slurmctld.log | head
Oct 14 03:28:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:28:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:29:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:29:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:30:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:30:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:31:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:31:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:32:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:32:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-00 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]

root@pitzer-slurm01:~# grep -i X005-00 /var/log/slurm/slurmctld.log | grep modified | awk '{print $NF}' | sort | uniq -c
     10 p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
   1099 p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537]
      1 p[0501-0518,0521-0522,0525,0528-0529,0532,0535,0537]

root@pitzer-slurm01:~# grep -i X005-06 /var/log/slurm/slurmctld.log | head
Oct 14 03:28:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:28:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:29:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:29:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:30:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:30:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:31:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:31:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:32:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:32:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-06 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]

root@pitzer-slurm01:~# grep -i X005-06 /var/log/slurm/slurmctld.log | grep modified | awk '{print $NF}' | sort | uniq -c
     10 p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
   1101 p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]

root@pitzer-slurm01:~# grep -i X005-12 /var/log/slurm/slurmctld.log | head
Oct 14 03:28:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:28:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:29:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:29:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:30:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:30:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:31:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:31:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:32:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
Oct 14 03:32:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-12 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]

root@pitzer-slurm01:~# grep -i X005-12 /var/log/slurm/slurmctld.log | grep modified | awk '{print $NF}' | sort | uniq -c
     10 p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537]
   1085 p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537]
      1 p[0501-0518,0521-0522,0525,0528-0529,0532,0535,0537]

root@pitzer-slurm01:~# grep -i X005-18 /var/log/slurm/slurmctld.log | head
Oct 14 03:28:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:28:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:29:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:29:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:30:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:30:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:31:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:31:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:32:06 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
Oct 14 03:32:36 pitzer-slurm01 slurmctld[59138]: modified reservation x005-18 due to unusable nodes, new nodes: p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]

root@pitzer-slurm01:~# grep -i X005-18 /var/log/slurm/slurmctld.log | grep modified | awk '{print $NF}' | sort | uniq -c
     10 p[0501-0507,0509-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]
   1096 p[0501-0509,0511-0518,0521-0522,0525,0528-0529,0532,0535,0537-0538]

It seems like p0508 and p0510 come and go from these reservations, but the other nodes stay largely the same.  I see a couple of messages from slurmctld about p0510 but nothing about p0508:

root@pitzer-slurm01:~# grep p0508 /var/log/slurm/slurmctld.log
[...nothing...]

root@pitzer-slurm01:~# grep p0510 /var/log/slurm/slurmctld.log
Oct 14 03:32:49 pitzer-slurm01 slurmctld[59138]: update_node: node p0510 reason set to: NHC: check_cmd_output:  TIMEOUT after 15s for "/bin/cat /fs/project/.testfile"; subprocess terminated.
Oct 14 03:32:49 pitzer-slurm01 slurmctld[59138]: update_node: node p0510 state set to DRAINED
Oct 14 03:42:37 pitzer-slurm01 slurmctld[59138]: update_node: node p0510 state set to IDLE
Oct 14 06:53:23 pitzer-slurm01 slurmctld[59138]: sched: Allocate JobId=2031930 NodeList=p0510 #CPUs=48 Partition=serial-48core

Very few jobs have run on them either:

root@pitzer-slurm01:~# sacct -X -S 2020-10-14T00:00:00 -E 2020-10-14T12:45:00 -N p0508,p0510 -o jobid,jobname,start,end,nodelist
       JobID    JobName               Start                 End        NodeList 
------------ ---------- ------------------- ------------------- --------------- 
2031930      ondemand/+ 2020-10-14T06:53:23 2020-10-14T11:53:36           p0510

What would cause sustained flurries of "modified reservation <rsv> due to unusable nodes" events like this?
Comment 32 Troy Baer 2020-10-14 10:53:35 MDT
We've been using the MAINT flag to prevent charging for reservations.  The successful run we had last night was with the MAINT flag set, so if it is a problem, it's inconsistently so.
Comment 33 Ben Roberts 2020-10-14 11:22:04 MDT
Agreed, if it is something related to the MAINT flag it is not a consistent problem.  I'm still looking into what might be happening with that flag.

However, with the log messages you're seeing, I am able to reproduce similar behavior.  I set up a daily reservation and waited until it was no longer active, so there was a recurrence in the future.  Then, if I set one of the nodes in the reservation down, I get log entries like the ones you're seeing.  This looks like a bug: the log says it should be picking new nodes, but I see it selecting the same nodes, including the one that is down, so it repeats that message until the node comes back up.  I still need to do more testing, but I haven't seen cases where reservations start with a down node (unless created with the ignstate flag), so I assume that as it gets closer to the start time it would actually change the nodes, but I will confirm that.

It's interesting that you don't see log entries about p0508 changing state and only one change for p0510.  Can you monitor those nodes for a while to see if they do change state?  Are you able to run a single node job on either of them?  

Thanks,
Ben
Comment 34 Troy Baer 2020-10-14 11:27:57 MDT
(In reply to Ben Roberts from comment #33)
> It's interesting that you don't see log entries about p0508 changing state
> and only one change for p0510.  Can you monitor those nodes for a while to
> see if they do change state?  Are you able to run a single node job on
> either of them?  

Yes, I can run short test jobs on both of them:

troy@pitzer-login01:~$ sbatch --nodes=1 --ntasks=48 --nodelist=p0508 test.job
Submitted batch job 2033005

troy@pitzer-login01:~$ sbatch --nodes=1 --ntasks=48 --nodelist=p0510 test.job
Submitted batch job 2033006

troy@pitzer-login01:~$ squeue -u troy
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           2033005 serial-48     test     troy  R       0:03      1 p0508 
           2033006 serial-48     test     troy  R       0:03      1 p0510
Comment 35 Troy Baer 2020-10-14 11:33:18 MDT
(In reply to Ben Roberts from comment #33)
> However, with the log messages you're seeing, I am able to reproduce similar
> behavior.  I set up a daily reservation and waited for it not to be
> currently active so there is a reservation in the future.  Then if I set one
> of the nodes in the reservation down I get log entries like you're seeing. 
> This looks like a bug because it says it should be picking new nodes, but I
> see it selecting the same nodes, including the one that is down, so it
> starts repeating that message until the node comes back up.  I still need to
> do more testing, but I haven't seen cases where reservations start with a
> down node (unless created with the ignstate flag) so I assume as it gets
> closer to the start time it would actually change the nodes, but I will
> confirm that.  

Please do.  If the reservation includes enough down nodes to prevent a job from starting, that might explain the BadConstraints symptom we see with this.
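
[Editor's note: a sketch of one way to check for down nodes inside a reservation, using standard Slurm CLI tools; the reservation name is taken from the earlier output.  This requires a live cluster and is not from the original ticket.]

```shell
# Extract the Nodes= list from the reservation, then show each node's state.
# Any node reported as down/drained here would block a job that needs
# the full reservation.
nodes=$(scontrol show reservation x005-18 | sed -n 's/.*Nodes=\([^ ]*\).*/\1/p')
sinfo -n "$nodes" -N -o "%N %T"
```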
Comment 36 Ben Roberts 2020-10-14 12:04:17 MDT
A bit of a 'duh' moment on my part: I forgot about the 'REPLACE_DOWN' flag until I stopped to think about it.  By default, reservations won't replace nodes that go down, which explains why jobs requesting the full reservation weren't able to run (when a node was down) and why the log entries repeat without the reservation's nodes changing.

However, I did some testing with this flag and it doesn't replace down nodes when the 'MAINT' flag is also included on the reservation.  Without 'MAINT' it works as I would expect.  I know that maintenance reservations are treated differently in some ways and this is probably one of them.  I can see that they wouldn't replace down nodes for a maintenance reservation because nodes are expected to go down.  I'll confirm that it is expected behavior for maintenance reservations not to replace down nodes.  

Let me know if a node being down when the reservation is active doesn't line up with what you have been seeing though.

Thanks,
Ben
Comment 37 Troy Baer 2020-10-14 12:07:14 MDT
(In reply to Ben Roberts from comment #36)
> There is a bit of a 'duh' moment on my part.  I forgot about the
> 'REPLACE_DOWN' flag until I stopped to think about it.  The default behavior
> of reservations is that they won't replace nodes that go down, which does
> explain why jobs that request the full reservation weren't able to run (when
> a node was down) and why the log entries repeat without changing the nodes
> in the reservation.

OK, it sounds like I need to set REPLACE_DOWN on these reservations.  I (perhaps foolishly) thought that was the default when setting NodeCnt.
Comment 38 Ben Roberts 2020-10-14 13:12:25 MDT
I traced what was happening with a maintenance reservation.  When one of the nodes in a reservation goes down, slurmctld calls _resv_node_replace, which in turn calls _select_nodes.  _select_nodes checks whether the MAINT or OVERLAP flag is set on the reservation and, if so, skips node selection.  Here are the lines I found relevant:
https://github.com/SchedMD/slurm/blob/09769ad701e91ce7956e4b093933d89f0f10ec86/src/slurmctld/reservation.c#L3924-L3931

You would want to use the REPLACE_DOWN flag.  I know you've mentioned that you use the MAINT flag to disable accounting for the reservation.  Is removing that flag so that down nodes can be replaced a possibility?

Thanks,
Ben
Comment 39 Troy Baer 2020-10-14 14:23:56 MDT
> You would want to use the REPLACE_DOWN flag.  I know you've mentioned that you use the MAINT flag to disable accounting for the reservation.  Is removing that flag so that down nodes can be replaced a possibility?

Yes, the MAINT flag is only there to prevent charging for reservations while we work through problems; we weren't expecting that it might cause them.  I'll remove that flag.
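
[Editor's note: a sketch of what the flag change might look like, based on the flags shown in the earlier scontrol output.  scontrol update replaces the full flag list, so all desired flags are restated; the exact accepted syntax can vary by Slurm version.  Not from the original ticket.]

```shell
# Drop MAINT and add REPLACE_DOWN so down nodes are swapped out of the
# reservation; keep the existing DAILY and PURGE_COMP behavior.
scontrol update ReservationName=x005-18 Flags=DAILY,REPLACE_DOWN,PURGE_COMP=00:05:00
```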
Comment 40 Ben Roberts 2020-10-14 15:15:13 MDT
Ok, I'm glad to hear that the flag isn't critical to how you are doing things.  I'll wait to hear how things go with the change to replace the down nodes.

Thanks,
Ben
Comment 41 Troy Baer 2020-10-14 15:40:44 MDT
The latest customer job was able to launch in its reservation.  I think we're getting there, but I'd like to see a couple more successes before we declare victory.
Comment 42 Troy Baer 2020-10-15 10:23:46 MDT
We've seen another successful job launch this morning, so we consider this resolved.
Comment 43 Jason Booth 2020-10-15 10:35:53 MDT
Thank you for the feedback. Marking as resolved.
Comment 44 Troy Baer 2020-10-16 10:58:31 MDT
I think we may have spoken too soon about this being solved.  I created another, larger (~80-node) set of reservations for this customer, and while the earlier reservations seem to be working, these new ones are not.  The especially curious thing is that after the job got reason=BadConstraints, I tried increasing the NodeCnt parameter on the reservation from 81 to 84, but AFAICT the job was never reevaluated for scheduling after the initial "never runnable in partition" message.

troy@pitzer-login01:~$ scontrol show job 2049096
JobId=2049096 JobName=x003-12-20201016
   UserId=wxops(30211) GroupId=PYS0343(5387) MCS_label=N/A
   Priority=0 Nice=0 Account=pys1043 QOS=pitzer-override-tres
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:30:00 TimeMin=N/A
   SubmitTime=2020-10-16T12:36:17 EligibleTime=2020-10-16T12:50:04
   AccrueTime=2020-10-16T12:50:04
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-10-16T12:50:05
   Partition=parallel-48core AllocNode:Sid=pitzer-login01:250275
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=80-80 NumCPUs=3840 NumTasks=3840 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=3840,mem=14580480M,node=80,billing=3840,gres/gpfs:ess=80
   Socks/Node=* NtasksPerN:B:S:C=48:0:*:1 CoreSpec=*
   MinCPUsNode=48 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=x003-12
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/fs/ess/scratch/PYS0343/wxops/runs/mpas/coldstart-g9-spirero/20201016/12/runscript
   WorkDir=/fs/ess/scratch/PYS0343/wxops/runs/mpas/coldstart-g9-spirero/20201016/12
   Comment=stdout=/fs/ess/scratch/PYS0343/wxops/runs/mpas/coldstart-g9-spirero/20201016/12/runscript.out 
   StdErr=/fs/ess/scratch/PYS0343/wxops/runs/mpas/coldstart-g9-spirero/20201016/12/runscript.out
   StdIn=/dev/null
   StdOut=/fs/ess/scratch/PYS0343/wxops/runs/mpas/coldstart-g9-spirero/20201016/12/runscript.out
   Power=
   TresPerNode=gpfs:ess:1
   MailUser=(null) MailType=NONE

troy@pitzer-login01:~$ scontrol show reservation x003-12
ReservationName=x003-12 StartTime=2020-10-16T12:50:00 EndTime=2020-10-16T15:20:00 Duration=02:30:00
   Nodes=p[0501-0506,0509-0511,0514,0519-0522,0525-0539,0542-0547,0549-0550,0556-0557,0560,0563-0565,0567-0569,0571-0572,0574,0576-0580,0586,0589-0602,0624,0643,0668,0670,0683-0685,0687,0693,0699,0775,0778,0902,0905,0912] NodeCnt=84 CoreCnt=4032 Features=c6420&48core PartitionName=batch Flags=DAILY,REPLACE_DOWN,PURGE_COMP=00:05:00
   TRES=cpu=4032
   Users=(null) Accounts=PYS1043 Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

root@pitzer-slurm01:~# grep 2049096 /var/log/slurm/slurmctld.log
Oct 16 12:36:17 pitzer-slurm01 slurmctld[59138]: _slurm_rpc_submit_batch_job: JobId=2049096 InitPrio=1200285399 usec=16432
Oct 16 12:50:05 pitzer-slurm01 slurmctld[59138]: _pick_best_nodes: JobId=2049096  never runnable in partition parallel-48core
Oct 16 12:50:05 pitzer-slurm01 slurmctld[59138]: sched: schedule: JobId=2049096 non-runnable: Requested node configuration is not available
Comment 45 Nate Rini 2020-10-16 17:07:22 MDT
(In reply to Troy Baer from comment #44)
> The especially curious thing is that after the job got reason=BadConstraints, I
Please verify that the patch from comment#5 has been applied to slurmctld.

> tried increased the NodeCnt parameter on the reservation from 81 to 84, but
> AFAICT the job was never reevaluated for scheduling after the initial "never
> runnable in partition" message.

Jobs are only re-evaluated in certain states.  We can, however, force the job to be re-evaluated by holding and releasing it.

Please perform the following:
> sdiag -r
> scontrol show job -d 2049096
> scontrol setdebugflags +TraceJobs
> scontrol setdebugflags +SelectType
> scontrol setdebugflags +Reservation
> scontrol setdebug debug3
> scontrol hold 2049096
> sleep 1
> scontrol show job -d 2049096
> scontrol release 2049096
> sleep 500
> scontrol setdebugflags -TraceJobs
> scontrol setdebugflags -SelectType
> scontrol setdebugflags -Reservation
> scontrol setdebug info
> scontrol show job -d 2049096
> sdiag

Please attach the slurmctld logs generated during this period.
Comment 46 Troy Baer 2020-10-19 10:10:17 MDT
Nate, thanks for the update.  We've verified that we are indeed using the patch in question.  The job and reservation from Friday have long since passed, but we'll use that procedure the next time we see this, which could be as soon as ~40 minutes from now.
Comment 47 Nate Rini 2020-10-19 10:46:29 MDT
(In reply to Troy Baer from comment #46)
> The job and reservation from Friday have long since
> passed, but we'll use that procedure the next time we see this, which could
> be as soon as ~40 minutes from now.

Reducing ticket severity while we wait for the problem to reappear.
Comment 48 Troy Baer 2020-10-20 10:52:07 MDT
We did a test of this at 12:50 EDT today.  Naturally, when I was watching it with the debugging turned up, it worked.

I'd like to see a couple more successes before we declare victory on this (again).
Comment 49 Ben Roberts 2020-10-20 11:20:25 MDT
Thanks for the update, Troy.  It's frustrating that the problem didn't happen when you were ready for it.  We're happy to wait for a few more iterations.

Thanks,
Ben
Comment 50 Troy Baer 2020-10-21 11:49:25 MDT
We've had three more successful goes at this, so we're declaring victory.  Thanks for all the help.
Comment 51 Ben Roberts 2020-10-21 12:09:25 MDT
I'm glad to hear it's kept working.  Let us know if anything else comes up.

Thanks,
Ben