| Summary: | SMT job scheduling to nodes without enough memory | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Sebastian Smith <stsmith> |
| Component: | Scheduling | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Nevada Reno | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurm configs, submission script, and debug-level logs | | |
Description
Sebastian Smith
2020-07-20 16:13:11 MDT
Sebastian,

The first example you gave (as you later state) was incorrect, but just to be clear:

>#SBATCH --ntasks=1
>#SBATCH --cpus-per-task=32
>#SBATCH --mem-per-cpu=4800M

results in 1 x 32 x 4800MB = 153600MB = 150GiB (I agree that a bare "G" may be misleading, but it's kind of a convention that we usually interpret M, G, P as powers of 2). In the job structure we keep the value in MB, and you can always get it using sacct with the --noconvert option.

I failed to reproduce the behavior you noticed on the second job using 18.08.4. I'm surprised by NumCPUs being bumped to 64. The closest I can get is with OverSubscribe=exclusive as a partition configuration parameter:

>slurm-18.08.4/bin/srun --cpus-per-task=32 --mem-per-cpu=4800
>slurm-18.08/bin/scontrol show job | grep -A5 NumCPU
   NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=300G,node=1,billing=64
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=4800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)

My node has the same configuration as yours, but as you can see I'm getting OverSubscribe=NO. This bug is fixed in Slurm 19.05.

Could you please share your slurm.conf and debug-level slurmctld logs, with the SelectType debug flag enabled [1], from the time the jobs get submitted and started?

cheers,
Marcin

[1] enable: scontrol setdebugflag +SelectType; scontrol setdebug debug
disable/get back to info (use the debug level of your regular preference): scontrol setdebugflag -SelectType; scontrol setdebug info

Created attachment 15155 [details]
slurm configs, submission script, and debug-level logs
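As a quick cross-check of the memory arithmetic quoted above (plain Python, not part of the original ticket; the values come straight from the sbatch directives):

```python
# Memory request from the quoted sbatch directives; nothing Slurm-specific.
ntasks = 1            # --ntasks=1
cpus_per_task = 32    # --cpus-per-task=32
mem_per_cpu_mb = 4800 # --mem-per-cpu=4800M

total_mb = ntasks * cpus_per_task * mem_per_cpu_mb
print(total_mb)          # 153600 -> the MB value stored in the job record
print(total_mb // 1024)  # 150    -> 150 GiB, reading M/G as powers of 2
```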
Hi Marcin,

Thanks for your reply. I want to start with a proper demo (unlike last time). Here's the batch script:

#!/bin/bash
#SBATCH --job-name=alloc-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=4800M
#SBATCH --time=00:02:00
#SBATCH --hint=compute_bound

hostname
ulimit -l
sleep 60

Here's the job while it's pending:

(ph-tools) [stsmithsa@ph-head-0 tmp]$ scontrol show job 2463065
JobId=2463065 JobName=alloc-test
   UserId=stsmithsa(2000004) GroupId=p-stsmithsa(2000004) MCS_label=N/A
   Priority=99841 Nice=0 Account=cpu-s6-test-0 QOS=normal
   JobState=PENDING Reason=AssocMaxJobsLimit Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2020-07-23T16:18:51 EligibleTime=2020-07-23T16:18:51
   AccrueTime=2020-07-23T16:18:51
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-07-23T16:18:52
   Partition=cpu-s6-test-0 AllocNode:Sid=ph-head-0:194120
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1-1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:1
   TRES=cpu=32,mem=150G,node=1,billing=183
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=4800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/gpfs/home/stsmithsa/tmp/hostname_0.sl
   WorkDir=/data/gpfs/home/stsmithsa/tmp
   StdErr=/data/gpfs/home/stsmithsa/tmp/slurm-2463065.out
   StdIn=/dev/null
   StdOut=/data/gpfs/home/stsmithsa/tmp/slurm-2463065.out
   Power=

Here's the job while it's running:

(ph-tools) [stsmithsa@ph-head-0 tmp]$ scontrol show job 2463065
JobId=2463065 JobName=alloc-test
   UserId=stsmithsa(2000004) GroupId=p-stsmithsa(2000004) MCS_label=N/A
   Priority=63398 Nice=0 Account=cpu-s6-test-0 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:24 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2020-07-23T16:18:51 EligibleTime=2020-07-23T16:18:51
   AccrueTime=2020-07-23T16:18:51
   StartTime=2020-07-23T16:19:50 EndTime=2020-07-23T16:21:51 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-07-23T16:19:50
   Partition=cpu-s6-test-0 AllocNode:Sid=ph-head-0:194120
   ReqNodeList=(null) ExcNodeList=(null) NodeList=cpu-65
   BatchHost=cpu-65
   NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:1
   TRES=cpu=64,mem=300G,node=1,billing=365
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=4800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/gpfs/home/stsmithsa/tmp/hostname_0.sl
   WorkDir=/data/gpfs/home/stsmithsa/tmp
   StdErr=/data/gpfs/home/stsmithsa/tmp/slurm-2463065.out
   StdIn=/dev/null
   StdOut=/data/gpfs/home/stsmithsa/tmp/slurm-2463065.out
   Power=

I believe there is an issue with `--hint=compute_bound`. It's properly checking the CPU count, but not memory. It uses the node configuration to automatically request 32 cores x 2 threads = 64 CPUs. If I request more cores, it fails with a "node configuration not available" error. If I exceed the `--mem-per-cpu` density of 64 threads, it will not fail. If I exceed the `--mem-per-cpu` density of 32 cores, it will fail. With compute_bound it appears to ignore threads per core when calculating memory. I think the issue may be more subtle, because I've run jobs where this works correctly.

I've attached the relevant Slurm config files, submission script, and debug-level logs. The test job is 2463093.

Thanks,
Sebastian

Sebastian,
I did reproduce the issue you're experiencing on both 18.08.4 and 18.08.8. The issue was fixed by the introduction of new logic in commits:
796d8d2d3a41ddcf050471fc0aad3a735938d52f
c332b4800123c2e30aa2e7fbd727065cf80220e9
Unfortunately, it's not easy to apply without the other changes made prior to 19.05 (part of the cons_tres addition), but it will be resolved when you upgrade.
The issue is caused by incorrect handling of --threads-per-core=1 (which is part of --hint=compute_bound). You can work around it by commenting out a line in proc_args.c, as below:
> 835 } else if (xstrcasecmp(tok, "compute_bound") == 0) {
> 836 *min_sockets = NO_VAL;
> 837 *min_cores = NO_VAL;
> 838 // *min_threads = 1;
> 839 if (cpu_bind_type)
> 840 *cpu_bind_type |= CPU_BIND_TO_CORES;
however, direct use of --threads-per-core=1 will still result in the wrong memory value being set.
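The two memory values seen in the job outputs can be recovered from a small Python model (a sketch for illustration only, not Slurm source code; it assumes, per the outputs above, that the buggy path multiplies --mem-per-cpu by every hardware thread of the allocated cores rather than by the requested CPU count):

```python
def charged_cpus(requested_cpus: int, threads_per_core: int) -> int:
    """Whole cores are allocated; with --threads-per-core=1 each requested
    CPU consumes a full core, so every hardware thread gets charged."""
    return requested_cpus * threads_per_core

def mem_mb(buggy: bool, requested_cpus: int, threads_per_core: int,
           mem_per_cpu_mb: int) -> int:
    """Buggy 18.08 path: multiply --mem-per-cpu by the charged thread count.
    Intended behavior: multiply by the CPUs the job actually requested."""
    cpus = (charged_cpus(requested_cpus, threads_per_core)
            if buggy else requested_cpus)
    return cpus * mem_per_cpu_mb

# 32 requested CPUs, 2 threads/core, --mem-per-cpu=4800M (from the ticket):
print(mem_mb(True, 32, 2, 4800) // 1024)   # 300 -> mem=300G (running job)
print(mem_mb(False, 32, 2, 4800) // 1024)  # 150 -> mem=150G (pending job)
```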
Do you have any additional questions regarding the case?
cheers,
Marcin
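To see why this earns the ticket's title, "SMT job scheduling to nodes without enough memory": per the outputs above, the pending job is sized at 150G but the running allocation charges 300G, so a node can pass the scheduling check yet be unable to supply the allocated memory. A sketch (the 192 GiB node size is hypothetical, chosen only to expose the gap):

```python
REQUESTED_CPUS = 32    # --cpus-per-task=32
THREADS_PER_CORE = 2   # node hardware (2 threads per core)
MEM_PER_CPU_MB = 4800  # --mem-per-cpu=4800M

# Value used when the job is checked against the node (pending: mem=150G):
checked_mb = REQUESTED_CPUS * MEM_PER_CPU_MB
# Value actually set once whole cores are charged (running: mem=300G):
allocated_mb = REQUESTED_CPUS * THREADS_PER_CORE * MEM_PER_CPU_MB

node_mem_mb = 192 * 1024  # HYPOTHETICAL node with 192 GiB of RAM

print(checked_mb <= node_mem_mb)    # True  -> the job lands on the node
print(allocated_mb <= node_mem_mb)  # False -> the node can't cover 300G
```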
Hi Marcin,

Thank you for the effort. I have no additional questions, and will work around the issue until we upgrade. We're looking forward to cons_tres!

- Sebastian

--
Sebastian Smith
High-Performance Computing Engineer
Office of Information Technology, University of Nevada, Reno
email: stsmith@unr.edu
website: http://rc.unr.edu

Sebastian,
>and will work around the issue until we upgrade. We're looking forward to cons_tres!
Just to be clear, the issue is also fixed for cons_res in 19.05; it's just that the number of changes in common code required to apply the patch is too high to easily backport it to 18.08.
I'll go ahead and close this now as "information given". Should you have any questions, please don't hesitate to reopen.
cheers,
Marcin