It would appear the issue is that if --mem divided by MaxMemPerCPU is greater than --ntasks-per-node, the job is rejected:

$ sbatch -N 1 --ntasks-per-node=4 -p gpuserial-48core --gpus-per-node=1 --mem=32G --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available
$ sbatch -N 1 --ntasks-per-node=4 -p gpuserial-48core --gpus-per-node=1 --mem=31G --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available
$ sbatch -N 1 --ntasks-per-node=4 -p gpuserial-48core --gpus-per-node=1 --mem=30G --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 18606

Removing --gpus-per-node and submitting to the identical set of nodes via the backfill partition (same MaxMemPerCPU) hits the same issue. I can submit with --mem=32G but without --ntasks-per-node, or with --ntasks-per-node but without --mem=32G; I cannot use both together, at least not with --ntasks-per-node=4.
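The cutoff seen above is just integer arithmetic on the partition limit. A minimal sketch (the function name pn_min_cpus echoes the slurmctld field, but this is illustrative Python, not a Slurm API):

```python
import math

# Values from this report: the gpuserial-48core partition has
# MaxMemPerCPU=7744 (MB) and the jobs request --ntasks-per-node=4.
MAX_MEM_PER_CPU_MB = 7744
NTASKS_PER_NODE = 4

def pn_min_cpus(mem_mb):
    """CPUs needed so mem_mb divided by the CPU count stays at or
    under MaxMemPerCPU (mirrors the slurmctld auto-adjustment)."""
    return math.ceil(mem_mb / MAX_MEM_PER_CPU_MB)

for gb in (30, 31, 32):
    cpus = pn_min_cpus(gb * 1024)
    verdict = "accepted" if cpus <= NTASKS_PER_NODE else "rejected"
    print(f"--mem={gb}G -> pn_min_cpus={cpus} -> {verdict}")
```

With 30G the floor is 4 CPUs (fits the 4 tasks); 31G and 32G both need 5 CPUs, exceeding the task count, which matches the accepted/rejected pattern above.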
$ sbatch -N 1 -p gpuserial-48core --gpus-per-node=1 --mem=32G --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 18609
$ cat slurm-18609.out
JobId=18609 JobName=wrap
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=200023940 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2020-09-01T14:43:27 EligibleTime=2020-09-01T14:43:27
   AccrueTime=2020-09-01T14:43:27
   StartTime=2020-09-01T14:43:28 EndTime=2020-09-01T15:43:28 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-01T14:43:28
   Partition=gpuserial-48core AllocNode:Sid=pitzer-rw01:128863
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p0302
   BatchHost=p0302
   NumNodes=1 NumCPUs=5 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=5,node=1,billing=5,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=5 MinMemoryNode=32G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/sysp/tdockendorf/slurm-tests
   Comment=stdout=/users/sysp/tdockendorf/slurm-tests/slurm-18609.out
   StdErr=/users/sysp/tdockendorf/slurm-tests/slurm-18609.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/slurm-tests/slurm-18609.out
   Power=
   TresPerNode=gpu:1
   MailUser=(null) MailType=NONE

$ sbatch -N 1 -p gpuserial-48core --gpus-per-node=1 --ntasks-per-node=4 --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 18610
$ cat slurm-18610.out
JobId=18610 JobName=wrap
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=200023963 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2020-09-01T14:44:06 EligibleTime=2020-09-01T14:44:06
   AccrueTime=2020-09-01T14:44:06
   StartTime=2020-09-01T14:44:07
   EndTime=2020-09-01T15:44:07 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-01T14:44:07
   Partition=gpuserial-48core AllocNode:Sid=pitzer-rw01:128863
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p0302
   BatchHost=p0302
   NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,node=1,billing=4,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=4:0:*:1 CoreSpec=*
   MinCPUsNode=4 MinMemoryCPU=3797M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/sysp/tdockendorf/slurm-tests
   Comment=stdout=/users/sysp/tdockendorf/slurm-tests/slurm-18610.out
   StdErr=/users/sysp/tdockendorf/slurm-tests/slurm-18610.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/slurm-tests/slurm-18610.out
   Power=
   TresPerNode=gpu:1
   MailUser=(null) MailType=NONE

Hi Trey,

This is an issue in the select plugin. It happens in both cons_res and cons_tres. I am trying a fix. The problem is caused by an auto-adjustment:

slurmctld: debug: Setting job's pn_min_cpus to 5 due to memory limit
slurmctld: _pick_best_nodes: JobId=7011 never runnable in partition p1
slurmctld: _slurm_rpc_submit_batch_job: Requested node configuration is not available

We need 5 cores for 32GB of RAM due to MaxMemPerCPU in the partition, but the number of tasks is 4, and in the select plugin we take this 4 instead of the 5 to decide how many cores to allocate. I will give you some feedback this week.

Trey, I submitted an internal proposal to mitigate your specific issue, but you should know that auto-adjustment with MaxMemPerCPU is a very limited feature. While we discuss it internally, you can read the background in bug 5240 if you want. See also this small doc
patch explaining some other limitations:
https://github.com/SchedMD/slurm/commit/375c568914461cb53c7da81cd642588d274547f3

commit 375c568914461cb53c7da81cd642588d274547f3
Author: Felip Moll <felip.moll@schedmd.com>
Date:   Tue Feb 4 16:01:56 2020 +0100

    Docs - Clarify auto-adjustments limitation on MaxMemPerCPU

    Auto-adjustment of job requests was introduced, but it has
    limitations. A multi-partition job request where every partition has
    a different MaxMemPerCPU limit, and possibly different involved QOS,
    in a heterogeneous cluster, makes it impossible to provide an
    accurate auto-adjusted request before an allocation is granted. This
    patch adds a note about this limitation.

    Bug 7876

My only pushback is that MaxMemPerCPU is extremely important when doing memory allocations and charging based on CPU usage. If I have a 100G node with 10 cores and someone does --mem=100G and -c 1, then I want them charged for 10 cores, not for 1 core that took up the entire node because of memory.

The multiple-partitions aspect is important because of the mixed nature of our cluster. If we had a uniform MaxMemPerCPU then we'd be underutilizing some nodes.

If there are limitations to this, then I think it would be important to address them at some point so that these important and long-standing features can be properly utilized. If that requires an enhancement request, let me know. I just don't know that I have a full grasp of the limitation yet.
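The charging concern above reduces to taking the larger of the CPUs a job requested and the CPUs its memory footprint effectively occupies. A minimal sketch under that assumption (billed_cpus is an illustrative name, not a Slurm function or the site's actual accounting policy):

```python
import math

def billed_cpus(mem_mb, requested_cpus, max_mem_per_cpu_mb):
    """CPUs a job is charged for: the larger of what it requested
    and what its memory request effectively occupies."""
    return max(requested_cpus, math.ceil(mem_mb / max_mem_per_cpu_mb))

# 100G node with 10 cores -> MaxMemPerCPU = 100G / 10 = 10G per CPU.
# --mem=100G with -c 1 should be billed as 10 cores, not 1:
print(billed_cpus(100 * 1024, 1, 10 * 1024))  # 10
```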
I don't know if this sheds any more light on the issue, but --ntasks=4 is accepted and adjusted correctly, while --ntasks-per-node=4 is the one that gets rejected:

$ sbatch -N 1 --ntasks-per-node=4 -p gpuserial-48core --gpus-per-node=1 --mem=32G --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available
$ sbatch -N 1 --ntasks=4 -p gpuserial-48core --gpus-per-node=1 --mem=32G --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 20265
$ cat slurm-20265.out
JobId=20265 JobName=wrap
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=200023963 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2020-09-03T13:14:16 EligibleTime=2020-09-03T13:14:16
   AccrueTime=2020-09-03T13:14:16
   StartTime=2020-09-03T13:14:18 EndTime=2020-09-03T14:14:18 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-03T13:14:18
   Partition=gpuserial-48core AllocNode:Sid=pitzer-rw01:128863
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p0301
   BatchHost=p0301
   NumNodes=1 NumCPUs=5 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=5,mem=32G,node=1,billing=5,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=5 MinMemoryNode=32G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/users/sysp/tdockendorf/slurm-tests
   Comment=stdout=/users/sysp/tdockendorf/slurm-tests/slurm-20265.out
   StdErr=/users/sysp/tdockendorf/slurm-tests/slurm-20265.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/slurm-tests/slurm-20265.out
   Power=
   TresPerNode=gpu:1
   MailUser=(null) MailType=NONE

(In reply to Trey Dockendorf from comment #8)
> My only push back is that MaxMemPerCPU is extremely important when doing
> memory allocations and doing charging on CPU usage.
> If I have a 100G node with 10 cores and someone does --mem=100G and -c 1
> then I want them charged for 10 cores and not 1 core that took up the
> entire node because of memory.

In that case you should have MaxMemPerCPU = 100/10 = 10G. But as I commented, there are these known drawbacks. I will keep you informed about the proposed fix I mentioned, which should make this work.

> The multiple partitions aspect is important because of the mixed nature of
> our cluster. If we had a uniform MaxMemPerCPU then we'd be underutilizing
> some nodes.

I see and understand your concerns. Nevertheless, note that heterogeneous nodes in a single partition would be a problem, and you would have to set MaxMemPerCPU to the ratio of the node with the lowest memory per CPU, though that is not your case at the moment.

> If there are limitations to this then I think it would be important to at
> some point address those limitations so that these important and
> long-standing features can be properly utilized. If that requires an
> Enhancement request, let me know. I just don't know I have a full grasp of
> the limitation yet.

Let me check whether a quick fix for your issue is possible. Otherwise I will give you more details on the issues, but basically there would be a need to change the auto-adjustment logic plus adapt the select plugins (cons_res and cons_tres), which is not trivial. The other path, which works around this quite well, is to use a job_submit.lua script. Please stay tuned.

Wanted to check if there is any update on a possible fix. Any timeline for a possible fix is helpful. Depending on the timeline, we may have to implement job submit filter logic to work around this issue, which is not something we want to do but something we would have to do before we go live with this system if a patch is not available.

(In reply to Trey Dockendorf from comment #13)
> Wanted to check if any update on a possible fix. Any timeline on a possible
> fix is helpful.
> Depending on timeline we may have to implement job submit filter logic to
> work around this issue which is not something we want to do but something
> we would have to do before we go live with this system if a patch is not
> available.

There's a patch which fixes the issue, but it goes against the cons_tres design, which doesn't allow allocating more cores than tasks unless --exclusive is set in the job request. I am analyzing the implications of my patch at the moment. I will let you know about the decision soon. For the moment, the safest path is a job submit plugin.

Hi Trey,

I just wanted to give a quick update. We've found a possible way to fix the specific issue you reported, but we've seen that cons_tres reports an oversubscribe error after the fix, and there is also a change in binding behavior. It seems unimportant, but we are checking whether something else is going on. There is also a similar issue with --cpus-per-gpu, which I want to mitigate with the same fix. This week I am out, but afterwards I will keep working on the issue as my top priority. In the meantime, the workaround is a job submit Lua plugin which modifies the user's request. If you don't mind, I will drop this to sev-3 for the moment. Thank you for your patience.

Is there any update on this issue? We've received a new report where --ntasks and --mem-per-cpu cause similar problems when combined with a GRES. I can provide further details if this sounds like a different issue.

Thanks,
- Trey

(In reply to Trey Dockendorf from comment #32)
> Is there any update on this issue? We've received a new report where
> --ntasks and --mem-per-cpu is causing similar problems when combined with a
> GRES. I can provide further details if this sounds like a different issue.

Hi Trey,

Sorry for the delay. There is still work in progress. Please provide details of your new issue; it may be something different, and there have been some improvements with GRES in recent versions, e.g.
bug 10077.

This is the issue I just had reported:

$ sbatch -n 2 --mem-per-cpu=8G --gres=pfsdir --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available
$ sbatch -n 2 --mem-per-cpu=4G --gres=pfsdir --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 9804

The nodes in question have MaxMemPerCPU=4315 and CPUs=28. The following illustrates the issue:

$ sbatch -n 2 --mem-per-cpu=4315M --gres=pfsdir --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 9805
$ sbatch -n 2 --mem-per-cpu=4316M --gres=pfsdir --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available

- Trey

Hello Trey,

I've finally managed to fix this, and I am glad to inform you that the patch has been committed to our source code. The patches, and hence the fix, are included starting with version 20.11.3:

commit f840968e42b538bdab57b664ca5a3c709d3bc9c2 (HEAD -> slurm-20.11, origin/slurm-20.11)
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Fri Dec 25 18:07:57 2020 +0100
Commit:     Danny Auble <da@schedmd.com>
CommitDate: Tue Jan 5 15:59:14 2021 -0700

    Fix false error about oversubscribing in cons_tres

    In cons_tres, when checking _at_tpn_limit we only detected whether we
    were below the tasks-per-node limit or at the limit. With this fix we
    can now detect whether we are at the limit or beyond it, thus
    avoiding an incorrect overcommit error message when allocating more
    CPUs than tasks.

    Bug 9716

commit bb9f3d4f46684764b7065a425d024c4dc8f2a751
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Fri Dec 25 18:05:46 2020 +0100
Commit:     Danny Auble <da@schedmd.com>
CommitDate: Tue Jan 5 15:52:58 2021 -0700

    Fix rejecting jobs under MaxMemPerCPU when allocating more cpus than tasks

    MaxMemPerCPU can cause a job to be auto-adjusted, increasing
    pn_min_cpus due to memory limits.
    In that situation, if we also request --ntasks-per-node, the job may
    be rejected, because we may end up with too many CPUs allocated to
    the job. There shouldn't be any problem in allocating more CPUs and
    not using them if MaxMemPerCPU requires it. This already works with
    the --exclusive flag.

    This patch detects whether there are enough CPUs based on pn_min_cpus
    for each node, and then picks the maximum between this number and the
    CPUs required by the job request.

    Bug 9716

Thanks for reporting!
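The selection rule described in that second commit message can be sketched in a few lines (illustrative Python, not the actual cons_tres code):

```python
def cpus_to_allocate(pn_min_cpus, ntasks_per_node, cpus_per_task=1):
    """Per-node CPU count after the fix: take the maximum of the
    memory-driven CPU floor (pn_min_cpus) and the task-driven
    request, instead of rejecting when the floor exceeds the
    task count."""
    return max(pn_min_cpus, ntasks_per_node * cpus_per_task)

# The case from this bug: 32G under MaxMemPerCPU=7744 implies a floor
# of 5 CPUs, while --ntasks-per-node=4 only asks for 4.
print(cpus_to_allocate(5, 4))  # 5: the job runs with one idle CPU
```

The extra CPU is allocated but left unused, which is exactly the behavior the commit message says --exclusive already allowed.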
Created attachment 15680 [details]
slurm.conf

We have a partition gpuserial-48core with MaxMemPerCPU=7744. I submit a job with --mem=32G and the job is rejected. Slurm should not reject this job; it should be assigned 5 CPUs based on the memory request, but instead the job gets rejected. The issue does not go away if I use SelectTypeParameters=CR_Core instead of CR_Core_Memory. This is not specific to GPU requests; it happens on our gpubackfill partitions too, which are configured the same but don't require GPUs to be requested and have a short MaxTime.

This is a rather serious issue we need addressed. We go live with our Slurm cluster in one month, and we cannot have this setup not working.

$ sbatch -N 1 --ntasks-per-node=4 --gpus=1 --mem=32G -p gpuserial-48core --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available
$ sbatch -N 1 --ntasks-per-node=4 -p gpubackfill-serial-48core --time=00:05:00 --mem=32G --wrap 'scontrol show job=$SLURM_JOB_ID'
sbatch: error: Batch job submission failed: Requested node configuration is not available

A job with --mem=30G is accepted:

$ sbatch -N 1 --ntasks-per-node=4 --gpus=1 --mem=30G -p gpuserial-48core --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 18569

These nodes have 363GB of RealMemory configured, for example:

$ scontrol show node=p0301
NodeName=p0301 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=48 CPULoad=0.22
   AvailableFeatures=48core,expansion,exp,r740,gpu,eth-pitzer-rack09h1,ib-i4l1s12,ib-i4,pitzer-rack08,v100-32g
   ActiveFeatures=48core,expansion,exp,r740,gpu,eth-pitzer-rack09h1,ib-i4l1s12,ib-i4,pitzer-rack08,v100-32g
   Gres=gpu:v100-32g:2(S:0-1),pfsdir:scratch:1,pfsdir:ess:1,ime:1,gpfs:project:1,gpfs:scratch:1,gpfs:ess:1,vis:1
   NodeAddr=10.4.8.1 NodeHostName=p0301 Version=20.02.4
   OS=Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Wed Feb 12 14:08:31 UTC 2020
   RealMemory=371712 AllocMem=0 FreeMem=367618 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1
   TmpDisk=0 Weight=4 Owner=N/A MCS_label=N/A
   Partitions=batch,gpubackfill-parallel-48core,gpubackfill-serial-48core,gpudebug,gpudebug-48core,gpuparallel,gpuparallel-48core,gpuserial,gpuserial-48core,systems
   BootTime=2020-08-24T13:39:32 SlurmdStartTime=2020-08-26T17:06:40
   CfgTRES=cpu=48,mem=363G,billing=48,gres/gpfs:ess=1,gres/gpfs:project=1,gres/gpfs:scratch=1,gres/gpu=2,gres/gpu:v100-32g=2,gres/ime=1,gres/pfsdir=2,gres/pfsdir:ess=1,gres/pfsdir:scratch=1,gres/vis=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Partition:

$ scontrol show partition=gpuserial-48core
PartitionName=gpuserial-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-gpuserial-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=12:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[01-42]
   PriorityJobFactor=2000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2016 TotalNodes=42 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerCPU=7744

$ scontrol show partition=gpubackfill-serial-48core
PartitionName=gpubackfill-serial-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0080 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=04:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[01-42]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2016 TotalNodes=42 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerCPU=7744
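Incidentally, the MaxMemPerCPU=7744 in these partitions is exactly the per-core share of the nodes' memory; a quick consistency check using the RealMemory and CPUTot values from the scontrol show node output above:

```python
# p0301: RealMemory=371712 (MB), CPUTot=48, from scontrol show node.
real_memory_mb = 371712
cpu_tot = 48

max_mem_per_cpu = real_memory_mb // cpu_tot
print(max_mem_per_cpu)  # 7744, matching the partitions' MaxMemPerCPU
```

This is why --mem=32G lands on 5 CPUs: 5 is the smallest CPU count whose combined 7744 MB shares cover 32768 MB.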