It would appear the issue is that if --mem divided by MaxMemPerCPU is greater than --ntasks-per-node, the job is rejected:

$ sbatch -N 1 --ntasks-per-node=4 -p gpuserial-48core --gpus-per-node=1 --mem=32G --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available
$ sbatch -N 1 --ntasks-per-node=4 -p gpuserial-48core --gpus-per-node=1 --mem=31G --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available
$ sbatch -N 1 --ntasks-per-node=4 -p gpuserial-48core --gpus-per-node=1 --mem=30G --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 18606

Removing --gpus-per-node and submitting to the identical set of nodes via the backfill partition (same MaxMemPerCPU) hits the same issue. I can submit with --mem=32G but without --ntasks-per-node, or with --ntasks-per-node but without --mem=32G; I cannot use both together, at least not with --ntasks-per-node=4.
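The cutoff seen above is just integer arithmetic on the partition limit. A minimal sketch (the function name pn_min_cpus echoes the slurmctld field, but this is illustrative Python, not a Slurm API):

```python
import math

# Values from this report: the gpuserial-48core partition has
# MaxMemPerCPU=7744 (MB) and the jobs request --ntasks-per-node=4.
MAX_MEM_PER_CPU_MB = 7744
NTASKS_PER_NODE = 4

def pn_min_cpus(mem_mb):
    """CPUs needed so mem_mb divided by the CPU count stays at or
    under MaxMemPerCPU (mirrors the slurmctld auto-adjustment)."""
    return math.ceil(mem_mb / MAX_MEM_PER_CPU_MB)

for gb in (30, 31, 32):
    cpus = pn_min_cpus(gb * 1024)
    verdict = "accepted" if cpus <= NTASKS_PER_NODE else "rejected"
    print(f"--mem={gb}G -> pn_min_cpus={cpus} -> {verdict}")
```

With 30G the floor is 4 CPUs (fits the 4 tasks); 31G and 32G both need 5 CPUs, exceeding the task count, which matches the accepted/rejected pattern above.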
$ sbatch -N 1 -p gpuserial-48core --gpus-per-node=1 --mem=32G --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 18609
$ cat slurm-18609.out
JobId=18609 JobName=wrap
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=200023940 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2020-09-01T14:43:27 EligibleTime=2020-09-01T14:43:27
   AccrueTime=2020-09-01T14:43:27
   StartTime=2020-09-01T14:43:28 EndTime=2020-09-01T15:43:28 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-01T14:43:28
   Partition=gpuserial-48core AllocNode:Sid=pitzer-rw01:128863
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p0302
   BatchHost=p0302
   NumNodes=1 NumCPUs=5 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=5,node=1,billing=5,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=5 MinMemoryNode=32G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/sysp/tdockendorf/slurm-tests
   Comment=stdout=/users/sysp/tdockendorf/slurm-tests/slurm-18609.out
   StdErr=/users/sysp/tdockendorf/slurm-tests/slurm-18609.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/slurm-tests/slurm-18609.out
   Power=
   TresPerNode=gpu:1
   MailUser=(null) MailType=NONE

$ sbatch -N 1 -p gpuserial-48core --gpus-per-node=1 --ntasks-per-node=4 --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 18610
$ cat slurm-18610.out
JobId=18610 JobName=wrap
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=200023963 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2020-09-01T14:44:06 EligibleTime=2020-09-01T14:44:06
   AccrueTime=2020-09-01T14:44:06
   StartTime=2020-09-01T14:44:07
   EndTime=2020-09-01T15:44:07 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-01T14:44:07
   Partition=gpuserial-48core AllocNode:Sid=pitzer-rw01:128863
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p0302
   BatchHost=p0302
   NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,node=1,billing=4,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=4:0:*:1 CoreSpec=*
   MinCPUsNode=4 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/sysp/tdockendorf/slurm-tests
   Comment=stdout=/users/sysp/tdockendorf/slurm-tests/slurm-18610.out
   StdErr=/users/sysp/tdockendorf/slurm-tests/slurm-18610.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/slurm-tests/slurm-18610.out
   Power=
   TresPerNode=gpu:1
   MailUser=(null) MailType=NONE

Hi Trey,

This is an issue in the select plugin. It happens in both cons_res and cons_tres. I am trying a fix. The problem is caused by an auto-adjustment:

slurmctld: debug: Setting job's pn_min_cpus to 5 due to memory limit
slurmctld: _pick_best_nodes: JobId=7011 never runnable in partition p1
slurmctld: _slurm_rpc_submit_batch_job: Requested node configuration is not available

We need 5 cores for 32GB of RAM due to MaxMemPerCPU in the partition, but the number of tasks is 4, and in the select plugin we take this 4 instead of the 5 to decide how many cores to allocate. I will give you some feedback this week.

Trey, I submitted an internal proposal to mitigate your specific issue, but you should know that auto-adjustment with MaxMemPerCPU is a very limited feature. While we discuss it internally, you can read the background in bug 5240 if you want. See also this small doc
patch explaining some other limitations:
https://github.com/SchedMD/slurm/commit/375c568914461cb53c7da81cd642588d274547f3

commit 375c568914461cb53c7da81cd642588d274547f3
Author: Felip Moll <felip.moll@schedmd.com>
Date:   Tue Feb 4 16:01:56 2020 +0100

    Docs - Clarify auto-adjustments limitation on MaxMemPerCPU

    Auto-adjustment of job requests was introduced, but it has
    limitations. A multi-partition job request where every partition has
    a different MaxMemPerCPU limit, and possibly different involved QOS,
    in a heterogeneous cluster, makes it impossible to provide an
    accurate auto-adjusted request before an allocation is granted. This
    patch adds a note about this limitation.

    Bug 7876

My only pushback is that MaxMemPerCPU is extremely important when doing memory allocations and charging based on CPU usage. If I have a 100G node with 10 cores and someone does --mem=100G and -c 1, then I want them charged for 10 cores, not for 1 core that took up the entire node because of memory.

The multiple-partitions aspect is important because of the mixed nature of our cluster. If we had a uniform MaxMemPerCPU then we'd be underutilizing some nodes.

If there are limitations to this, then I think it would be important to address them at some point so that these important and long-standing features can be properly utilized. If that requires an enhancement request, let me know. I just don't know that I have a full grasp of the limitation yet.
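The charging concern above reduces to taking the larger of the CPUs a job requested and the CPUs its memory footprint effectively occupies. A minimal sketch under that assumption (billed_cpus is an illustrative name, not a Slurm function or the site's actual accounting policy):

```python
import math

def billed_cpus(mem_mb, requested_cpus, max_mem_per_cpu_mb):
    """CPUs a job is charged for: the larger of what it requested
    and what its memory request effectively occupies."""
    return max(requested_cpus, math.ceil(mem_mb / max_mem_per_cpu_mb))

# 100G node with 10 cores -> MaxMemPerCPU = 100G / 10 = 10G per CPU.
# --mem=100G with -c 1 should be billed as 10 cores, not 1:
print(billed_cpus(100 * 1024, 1, 10 * 1024))  # 10
```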
I don't know if this sheds any more light on the issue, but --ntasks=4 is accepted and adjusted correctly, while --ntasks-per-node=4 is the one that gets rejected:

$ sbatch -N 1 --ntasks-per-node=4 -p gpuserial-48core --gpus-per-node=1 --mem=32G --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available
$ sbatch -N 1 --ntasks=4 -p gpuserial-48core --gpus-per-node=1 --mem=32G --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 20265
$ cat slurm-20265.out
JobId=20265 JobName=wrap
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=200023963 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2020-09-03T13:14:16 EligibleTime=2020-09-03T13:14:16
   AccrueTime=2020-09-03T13:14:16
   StartTime=2020-09-03T13:14:18 EndTime=2020-09-03T14:14:18 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-03T13:14:18
   Partition=gpuserial-48core AllocNode:Sid=pitzer-rw01:128863
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p0301
   BatchHost=p0301
   NumNodes=1 NumCPUs=5 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=5,mem=32G,node=1,billing=5,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=5 MinMemoryNode=32G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/sysp/tdockendorf/slurm-tests
   Comment=stdout=/users/sysp/tdockendorf/slurm-tests/slurm-20265.out
   StdErr=/users/sysp/tdockendorf/slurm-tests/slurm-20265.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/slurm-tests/slurm-20265.out
   Power=
   TresPerNode=gpu:1
   MailUser=(null) MailType=NONE

(In reply to Trey Dockendorf from comment #8)
> My only push back is that MaxMemPerCPU is extremely important when doing
> memory allocations and doing charging on CPU usage.
> If I have a 100G node with 10 cores and someone does --mem=100G and -c 1
> then I want them charged for 10 cores and not 1 core that took up the
> entire node because of memory.

In that case you should have MaxMemPerCPU = 100/10 = 10G. But as I commented, there are these known drawbacks. I will keep you informed about the proposed fix I mentioned, which should make this work.

> The multiple partitions aspect is important because of the mixed nature of
> our cluster. If we had a uniform MaxMemPerCPU then we'd be underutilizing
> some nodes.

I see and understand your concerns. Nevertheless, note that heterogeneous nodes in a single partition would be a problem, and you would have to set MaxMemPerCPU to the ratio of the node with the lowest memory per CPU, though that is not your case at the moment.

> If there are limitations to this then I think it would be important to at
> some point address those limitations so that these important and
> long-standing features can be properly utilized. If that requires an
> Enhancement request, let me know. I just don't know I have a full grasp of
> the limitation yet.

Let me check whether a quick fix for your issue is possible. Otherwise I will give you more details on the issues, but basically there would be a need to change the auto-adjustment logic plus adapt the select plugins (cons_res and cons_tres), which is not trivial. The other path, which works around this quite well, is to use a job_submit.lua script. Please stay tuned.

Wanted to check if there is any update on a possible fix. Any timeline for a possible fix is helpful. Depending on the timeline, we may have to implement job submit filter logic to work around this issue, which is not something we want to do but something we would have to do before we go live with this system if a patch is not available.

(In reply to Trey Dockendorf from comment #13)
> Wanted to check if any update on a possible fix. Any timeline on a possible
> fix is helpful.
> Depending on timeline we may have to implement job submit filter logic to
> work around this issue which is not something we want to do but something
> we would have to do before we go live with this system if a patch is not
> available.

There's a patch which fixes the issue, but it goes against the cons_tres design, which doesn't allow allocating more cores than tasks unless --exclusive is set in the job request. I am analyzing the implications of my patch at the moment. I will let you know about the decision soon. For the moment, the safest path is a job submit plugin.

Hi Trey,

I just wanted to give a quick update. We've found a possible way to fix the specific issue you reported, but we've seen that cons_tres reports an oversubscribe error after the fix, and there is also a change in binding behavior. It seems unimportant, but we are checking whether something else is going on. There is also a similar issue with --cpus-per-gpu, which I want to mitigate with the same fix. This week I am out, but afterwards I will keep working on the issue as my top priority. In the meantime, the workaround is a job submit Lua plugin which modifies the user's request. If you don't mind, I will drop this to sev-3 for the moment. Thank you for your patience.

Is there any update on this issue? We've received a new report where --ntasks and --mem-per-cpu cause similar problems when combined with a GRES. I can provide further details if this sounds like a different issue.

Thanks,
- Trey

(In reply to Trey Dockendorf from comment #32)
> Is there any update on this issue? We've received a new report where
> --ntasks and --mem-per-cpu is causing similar problems when combined with a
> GRES. I can provide further details if this sounds like a different issue.

Hi Trey,

Sorry for the delay. There is still work in progress. Please provide details of your new issue; it may be something different, and there have been some improvements with GRES in recent versions, e.g.
bug 10077.

This is the issue I just had reported:

$ sbatch -n 2 --mem-per-cpu=8G --gres=pfsdir --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available
$ sbatch -n 2 --mem-per-cpu=4G --gres=pfsdir --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 9804

The nodes in question have MaxMemPerCPU=4315 and CPUs=28. The following illustrates the issue:

$ sbatch -n 2 --mem-per-cpu=4315M --gres=pfsdir --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 9805
$ sbatch -n 2 --mem-per-cpu=4316M --gres=pfsdir --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available

- Trey

Hello Trey,

I've finally managed to fix this, and I am glad to inform you that the patch has been committed to our source code. The patches, and hence the fix, are included starting with version 20.11.3:

commit f840968e42b538bdab57b664ca5a3c709d3bc9c2 (HEAD -> slurm-20.11, origin/slurm-20.11)
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Fri Dec 25 18:07:57 2020 +0100
Commit:     Danny Auble <da@schedmd.com>
CommitDate: Tue Jan 5 15:59:14 2021 -0700

    Fix false error about oversubscribing in cons_tres

    In cons_tres, when checking _at_tpn_limit we only detected whether we
    were below the tasks-per-node limit or at the limit. With this fix we
    can now detect whether we are at the limit or beyond it, thus
    avoiding an incorrect overcommit error message when allocating more
    CPUs than tasks.

    Bug 9716

commit bb9f3d4f46684764b7065a425d024c4dc8f2a751
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Fri Dec 25 18:05:46 2020 +0100
Commit:     Danny Auble <da@schedmd.com>
CommitDate: Tue Jan 5 15:52:58 2021 -0700

    Fix rejecting jobs under MaxMemPerCPU when allocating more cpus than tasks

    MaxMemPerCPU can cause a job to be auto-adjusted, increasing
    pn_min_cpus due to memory limits.
    In that situation, if we also request --ntasks-per-node, the job may
    be rejected, because we may end up with too many CPUs allocated to
    the job. There shouldn't be any problem in allocating more CPUs and
    not using them if MaxMemPerCPU requires it. This already works with
    the --exclusive flag.

    This patch detects whether there are enough CPUs based on pn_min_cpus
    for each node, and then picks the maximum between this number and the
    CPUs required by the job request.

    Bug 9716

Thanks for reporting!
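The selection rule described in that second commit message can be sketched in a few lines (illustrative Python, not the actual cons_tres code):

```python
def cpus_to_allocate(pn_min_cpus, ntasks_per_node, cpus_per_task=1):
    """Per-node CPU count after the fix: take the maximum of the
    memory-driven CPU floor (pn_min_cpus) and the task-driven
    request, instead of rejecting when the floor exceeds the
    task count."""
    return max(pn_min_cpus, ntasks_per_node * cpus_per_task)

# The case from this bug: 32G under MaxMemPerCPU=7744 implies a floor
# of 5 CPUs, while --ntasks-per-node=4 only asks for 4.
print(cpus_to_allocate(5, 4))  # 5: the job runs with one idle CPU
```

The extra CPU is allocated but left unused, which is exactly the behavior the commit message says --exclusive already allowed.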
Created attachment 15680 [details]
slurm.conf

We have a partition gpuserial-48core with MaxMemPerCPU=7744. I submit a job with --mem=32G and the job is rejected. Slurm should not reject this job; it should be assigned 5 CPUs based on the memory request, but instead the job gets rejected. The issue does not go away if I use SelectTypeParameters=CR_Core instead of CR_Core_Memory. This is not specific to GPU requests; it happens on our gpubackfill partitions too, which are configured the same but don't require GPUs to be requested and have a short MaxTime.

This is a rather serious issue we need addressed. We go live with our Slurm cluster in one month, and we cannot have this setup not working.

$ sbatch -N 1 --ntasks-per-node=4 --gpus=1 --mem=32G -p gpuserial-48core --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available
$ sbatch -N 1 --ntasks-per-node=4 -p gpubackfill-serial-48core --time=00:05:00 --mem=32G --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available

A job with --mem=30G is accepted:

$ sbatch -N 1 --ntasks-per-node=4 --gpus=1 --mem=30G -p gpuserial-48core --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 18569

These nodes have 363GB of RealMemory configured, for example:

$ scontrol show node=p0301
NodeName=p0301 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=48 CPULoad=0.22
   AvailableFeatures=48core,expansion,exp,r740,gpu,eth-pitzer-rack09h1,ib-i4l1s12,ib-i4,pitzer-rack08,v100-32g
   ActiveFeatures=48core,expansion,exp,r740,gpu,eth-pitzer-rack09h1,ib-i4l1s12,ib-i4,pitzer-rack08,v100-32g
   Gres=gpu:v100-32g:2(S:0-1),pfsdir:scratch:1,pfsdir:ess:1,ime:1,gpfs:project:1,gpfs:scratch:1,gpfs:ess:1,vis:1
   NodeAddr=10.4.8.1 NodeHostName=p0301 Version=20.02.4
   OS=Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Wed Feb 12 14:08:31 UTC 2020
   RealMemory=371712 AllocMem=0 FreeMem=367618 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1
   TmpDisk=0 Weight=4 Owner=N/A MCS_label=N/A
   Partitions=batch,gpubackfill-parallel-48core,gpubackfill-serial-48core,gpudebug,gpudebug-48core,gpuparallel,gpuparallel-48core,gpuserial,gpuserial-48core,systems
   BootTime=2020-08-24T13:39:32 SlurmdStartTime=2020-08-26T17:06:40
   CfgTRES=cpu=48,mem=363G,billing=48,gres/gpfs:ess=1,gres/gpfs:project=1,gres/gpfs:scratch=1,gres/gpu=2,gres/gpu:v100-32g=2,gres/ime=1,gres/pfsdir=2,gres/pfsdir:ess=1,gres/pfsdir:scratch=1,gres/vis=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Partition:

$ scontrol show partition=gpuserial-48core
PartitionName=gpuserial-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-gpuserial-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=12:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[01-42]
   PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2016 TotalNodes=42 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerCPU=7744

$ scontrol show partition=gpubackfill-serial-48core
PartitionName=gpubackfill-serial-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[01-42]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2016 TotalNodes=42 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerCPU=7744
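Incidentally, the MaxMemPerCPU=7744 in these partitions is exactly the per-core share of the nodes' memory; a quick consistency check using the RealMemory and CPUTot values from the scontrol show node output above:

```python
# p0301: RealMemory=371712 (MB), CPUTot=48, from scontrol show node.
real_memory_mb = 371712
cpu_tot = 48

max_mem_per_cpu = real_memory_mb // cpu_tot
print(max_mem_per_cpu)  # 7744, matching the partitions' MaxMemPerCPU
```

This is why --mem=32G lands on 5 CPUs: 5 is the smallest CPU count whose combined 7744 MB shares cover 32768 MB.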