Ticket 9432

Summary: SMT job scheduling to nodes without enough memory
Product: Slurm Reporter: Sebastian Smith <stsmith>
Component: Scheduling    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: Nevada Reno Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: slurm configs, submission script, and debug-level logs

Description Sebastian Smith 2020-07-20 16:13:11 MDT
Hi,

Batch jobs that request more memory than a node provides aren't being rejected with "requested node configuration is not available".

NOTE: We are running Slurm v18.08.4 -- to be updated in a few months.

Node configuration:
```
NodeName=cpu-[64-107] CoresPerSocket=16 Feature=intelv5 MemSpecLimit=24000 RealMemory=191000 Sockets=2 ThreadsPerCore=2 Weight=100
```

Partition configuration:
```
PartitionName=cpu-s1-bionres-0 AllowGroups=RC-cpu_s1_bionres_0-enabled AllowQOS=ALL Default=NO ExclusiveUser=NO LLN=NO MaxNodes=1 MaxTime=14-00:00:00 Nodes=cpu-68 PriorityTier=1 State=UP TRESBillingWeights="CPU=1,mem=1G,node=1"
```

Relevant slurm.conf options(?):
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY

Example submission script:
```
#!/bin/bash

#SBATCH --account=cpu-s1-bionres-0
#SBATCH --partition=cpu-s1-bionres-0
#SBATCH --job-name=alloc-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=4800M
#SBATCH --time=00:02:00
#SBATCH --hint=compute_bound

hostname
```

This submission is expected to fail: it should result in a memory request of 4800 MB/CPU * 64 CPUs = 307,200 MB, which exceeds both the 167,000 MB logical maximum (RealMemory - MemSpecLimit) and the 191,000 MB physical maximum. Instead, the job is accepted by the scheduler:

scontrol show job 2462499
JobId=2462499 JobName=alloc-test
   UserId=stsmithsa(2000004) GroupId=p-stsmithsa(2000004) MCS_label=N/A
   Priority=19506 Nice=0 Account=cpu-s1-bionres-0 QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2020-07-20T14:06:43 EligibleTime=2020-07-20T14:06:43
   AccrueTime=2020-07-20T14:06:43
   StartTime=2020-08-02T21:42:33 EndTime=2020-08-02T21:44:33 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-07-20T14:40:28
   Partition=cpu-s1-bionres-0 AllocNode:Sid=ph-head-0:121838
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:1
   TRES=cpu=32,mem=150G,node=1,billing=183
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=4800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/gpfs/home/stsmithsa/tmp/hostname_0.sl
   WorkDir=/data/gpfs/home/stsmithsa/tmp
   StdErr=/data/gpfs/home/stsmithsa/tmp/slurm-2462499.out
   StdIn=/dev/null
   StdOut=/data/gpfs/home/stsmithsa/tmp/slurm-2462499.out
   Power=

At submission time the NodeList isn't populated. The "null" node appears to ignore ThreadsPerCore and calculates the requested memory as 150GB (should this be 153,600MB?). This is within the 167GB max. When the NodeList is assigned, the math is worked out correctly and the job runs with TRES=mem=300G (should this be 307,200MB?). This has resulted in jobs running our nodes out of memory. Can you help me understand what's happening, and the best way to prevent it?
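For reference, the two memory figures can be reproduced with a quick sketch (my own arithmetic, using the numbers from the job output and node definition above):

```python
# Sketch of the memory accounting seen above (values from this ticket).
mem_per_cpu_mb = 4800          # --mem-per-cpu=4800M
cpus_per_task = 32             # --cpus-per-task=32
threads_per_core = 2           # node ThreadsPerCore=2

# Pending job: TRES mem appears to be computed from the task's CPU count only.
pending_mb = mem_per_cpu_mb * cpus_per_task                     # 153600 MB (150G)

# Running job: all 64 allocated CPUs (32 cores x 2 threads) are charged.
running_mb = mem_per_cpu_mb * cpus_per_task * threads_per_core  # 307200 MB (300G)

real_memory_mb = 191000        # RealMemory on cpu-[64-107]
mem_spec_limit_mb = 24000      # MemSpecLimit
usable_mb = real_memory_mb - mem_spec_limit_mb                  # 167000 MB

print(pending_mb, running_mb, usable_mb)
print(running_mb > real_memory_mb)  # True: the running request oversubscribes the node
```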

The job above hasn't been scheduled -- bad example. I've run the same job on another partition to demonstrate the problem. The job info is below followed by the node configuration. The "final" job configuration exceeds the node's RealMemory.

Final configuration:

(base) [stsmithsa@ph-head-0 tmp]$ scontrol show job 2462506
JobId=2462506 JobName=alloc-test
   UserId=stsmithsa(2000004) GroupId=p-stsmithsa(2000004) MCS_label=N/A
   Priority=20958 Nice=0 Account=gpu-s2-oit-0 QOS=renter
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2020-07-20T15:01:16 EligibleTime=2020-07-20T15:01:16
   AccrueTime=2020-07-20T15:01:16
   StartTime=2020-07-20T15:01:16 EndTime=2020-07-20T15:01:17 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-07-20T15:01:16
   Partition=gpu-core-0 AllocNode:Sid=ph-head-0:121838
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=gpu-8
   BatchHost=gpu-8
   NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:1
   TRES=cpu=64,mem=300G,node=1,billing=365
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=4800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=gpu-s2-oit-0_22
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/gpfs/home/stsmithsa/tmp/hostname_0.sl
   WorkDir=/data/gpfs/home/stsmithsa/tmp
   StdErr=/data/gpfs/home/stsmithsa/tmp/slurm-2462506.out
   StdIn=/dev/null
   StdOut=/data/gpfs/home/stsmithsa/tmp/slurm-2462506.out
   Power=

(base) [stsmithsa@ph-head-0 tmp]$ scontrol show node gpu-8
NodeName=gpu-8 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUTot=64 CPULoad=0.08
   AvailableFeatures=intelv4,p100
   ActiveFeatures=intelv4,p100
   Gres=gpu:4
   NodeAddr=gpu-8 NodeHostName=gpu-8 Version=18.08
   OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
   RealMemory=256000 AllocMem=0 FreeMem=237386 Sockets=2 Boards=1
   MemSpecLimit=24000
   State=RESERVED ThreadsPerCore=2 TmpDisk=0 Weight=1000 Owner=N/A MCS_label=N/A
   Partitions=gpu-s2-core-0,gpu-core-0,gpu-s3-sponsored-0
   BootTime=2020-06-25T14:00:51 SlurmdStartTime=2020-06-25T14:03:32
   CfgTRES=cpu=64,mem=250G,billing=314
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


Thank you,

Sebastian
Comment 1 Marcin Stolarek 2020-07-23 03:46:46 MDT
Sebastian,

The first example you gave (as you later state) was incorrect, but just to be clear:
>#SBATCH --ntasks=1
>#SBATCH --cpus-per-task=32
>#SBATCH --mem-per-cpu=4800M
results in 1 x 32 x 4800MB = 153600MB = 150GiB (I agree that a bare "G" may be misleading, but by convention we usually interpret M, G, P as powers of 2). In the job structure we keep the value in MB, and you can always retrieve it using sacct with the --noconvert option.
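The unit convention Marcin describes can be illustrated with a small sketch (mine, not Slurm code): the value is stored in MB and the "G" display divides by 1024, which is why 153600 MB shows as 150G:

```python
# Memory request: ntasks x cpus-per-task x mem-per-cpu, stored in MB.
mem_mb = 1 * 32 * 4800
print(mem_mb)                # 153600

# Display convention: "G" is a power of two (MB / 1024).
print(f"{mem_mb // 1024}G")  # 150G, the value shown in TRES
```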

I failed to reproduce the behavior you noticed for the second job using 18.08.4. I'm surprised by NumCPUs being bumped to 64. The closest I can get is with OverSubscribe=EXCLUSIVE as a partition configuration parameter:
>slurm-18.08.4/bin/srun --cpus-per-task=32 --mem-per-cpu=4800 
>slurm-18.08/bin/scontrol show job | grep -A5 NumCPU 
   NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=300G,node=1,billing=64
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=4800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
My node has the same configuration as yours, but as you can see I'm getting OverSubscribe=NO. That bug is fixed in Slurm 19.05.

Could you please share your slurm.conf and debug-level slurmctld logs with the SelectType debug flag enabled[1], covering the time the jobs are submitted and started?


cheers,
Marcin

[1]
enable:
scontrol setdebugflag +SelectType; scontrol setdebug debug
disable (return to info, or your regular debug level):
scontrol setdebugflag -SelectType; scontrol setdebug info
Comment 2 Sebastian Smith 2020-07-23 18:08:15 MDT
Created attachment 15155 [details]
slurm configs, submission script, and debug-level logs
Comment 3 Sebastian Smith 2020-07-23 18:08:44 MDT
Hi Marcin,

Thanks for your reply. I want to start with a proper demo (unlike last time).

Here's the batch script:

#!/bin/bash

#SBATCH --job-name=alloc-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=4800M
#SBATCH --time=00:02:00
#SBATCH --hint=compute_bound

hostname
ulimit -l
sleep 60

Here's the job while it's pending:

(ph-tools) [stsmithsa@ph-head-0 tmp]$ scontrol show job 2463065
JobId=2463065 JobName=alloc-test
   UserId=stsmithsa(2000004) GroupId=p-stsmithsa(2000004) MCS_label=N/A
   Priority=99841 Nice=0 Account=cpu-s6-test-0 QOS=normal
   JobState=PENDING Reason=AssocMaxJobsLimit Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2020-07-23T16:18:51 EligibleTime=2020-07-23T16:18:51
   AccrueTime=2020-07-23T16:18:51
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-07-23T16:18:52
   Partition=cpu-s6-test-0 AllocNode:Sid=ph-head-0:194120
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:1
   TRES=cpu=32,mem=150G,node=1,billing=183
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=4800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/gpfs/home/stsmithsa/tmp/hostname_0.sl
   WorkDir=/data/gpfs/home/stsmithsa/tmp
   StdErr=/data/gpfs/home/stsmithsa/tmp/slurm-2463065.out
   StdIn=/dev/null
   StdOut=/data/gpfs/home/stsmithsa/tmp/slurm-2463065.out
   Power=

Here's the job while it's running:

(ph-tools) [stsmithsa@ph-head-0 tmp]$ scontrol show job 2463065
JobId=2463065 JobName=alloc-test
   UserId=stsmithsa(2000004) GroupId=p-stsmithsa(2000004) MCS_label=N/A
   Priority=63398 Nice=0 Account=cpu-s6-test-0 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:24 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2020-07-23T16:18:51 EligibleTime=2020-07-23T16:18:51
   AccrueTime=2020-07-23T16:18:51
   StartTime=2020-07-23T16:19:50 EndTime=2020-07-23T16:21:51 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-07-23T16:19:50
   Partition=cpu-s6-test-0 AllocNode:Sid=ph-head-0:194120
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cpu-65
   BatchHost=cpu-65
   NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:1
   TRES=cpu=64,mem=300G,node=1,billing=365
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=4800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/gpfs/home/stsmithsa/tmp/hostname_0.sl
   WorkDir=/data/gpfs/home/stsmithsa/tmp
   StdErr=/data/gpfs/home/stsmithsa/tmp/slurm-2463065.out
   StdIn=/dev/null
   StdOut=/data/gpfs/home/stsmithsa/tmp/slurm-2463065.out
   Power=

I believe there is an issue with `--hint=compute_bound`. It properly checks the CPU count, but not memory. It uses the node configuration to automatically request 32 cores * 2 threads = 64 CPUs. If I request more cores, submission fails with a "node configuration not available" error. If I exceed the `--mem-per-cpu` density of 64 threads, it does not fail; if I exceed the `--mem-per-cpu` density of 32 cores, it does fail. With compute_bound it appears to ignore threads per core when calculating memory. The issue may be more subtle, though, because I've run jobs where this works correctly.
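The boundary described above is consistent with the submit-time check multiplying `--mem-per-cpu` by the core count rather than the thread count. A sketch of that hypothesis (my own model, not Slurm code; the 6500M figure is a hypothetical over-limit request):

```python
# Hypothesis: only the core count (32), not the thread count (64),
# enters the submit-time memory check against RealMemory.
real_memory_mb = 191000  # RealMemory on cpu-[64-107]
cores = 32               # 2 sockets x 16 cores
threads = 64             # cores x ThreadsPerCore=2

def submit_check(mem_per_cpu_mb):
    """Model of the suspected submit-time validation."""
    return mem_per_cpu_mb * cores <= real_memory_mb

# 4800M/CPU passes the check (4800*32 = 153600), even though the
# running job charges 4800*64 = 307200 MB and oversubscribes the node.
print(submit_check(4800))  # True

# A density exceeding the 32-core limit is rejected at submission.
print(submit_check(6500))  # False
```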

I've attached the relevant Slurm config files, submission script, and debug-level logs. The test job is 2463093.

Thanks,

Sebastian
Comment 4 Marcin Stolarek 2020-07-28 03:54:23 MDT
Sebastian,

I did reproduce the issue you're experiencing on both 18.08.4 and 18.08.8. It was fixed with the introduction of new logic in commits:
796d8d2d3a41ddcf050471fc0aad3a735938d52f
c332b4800123c2e30aa2e7fbd727065cf80220e9

Unfortunately, the fix isn't easy to apply without other changes made prior to 19.05 (part of the cons_tres addition), but the issue will be resolved when you upgrade.

The issue is caused by incorrect handling of --threads-per-core=1 (which is implied by --hint=compute_bound). You can work around it by commenting out a line in proc_args.c, like below:
> 835                 } else if (xstrcasecmp(tok, "compute_bound") == 0) {             
> 836                         *min_sockets = NO_VAL;                                   
> 837                         *min_cores   = NO_VAL;                                   
> 838                 //      *min_threads = 1;                                        
> 839                         if (cpu_bind_type)                                       
> 840                                 *cpu_bind_type |= CPU_BIND_TO_CORES;   

However, direct use of --threads-per-core=1 will still result in the wrong memory value being set.

Do you have any additional questions regarding the case?

cheers,
Marcin
Comment 5 Sebastian Smith 2020-07-28 10:45:02 MDT
Hi Marcin,

Thank you for the effort.  I have no additional questions, and will work around the issue until we upgrade.  We're looking forward to cons_tres!

- Sebastian


Comment 6 Marcin Stolarek 2020-07-29 00:44:48 MDT
Sebastian,

>and will work around the issue until we upgrade.  We're looking forward to cons_tres!
Just to be clear, the issue is fixed in cons_res too in 19.05; it's just that the number of changes in common code required to apply the patch is too high to easily backport it to 18.08.

I'll go ahead and close this now as information given. Should you have any questions, please don't hesitate to reopen.

cheers,
Marcin