Nodes are defined as:

>NodeName=nid001008 Arch=x86_64 CoresPerSocket=64
>   CPUAlloc=0 CPUTot=256 CPULoad=0.00
>   RealMemory=245000 AllocMem=235520 FreeMem=229461 Sockets=2 Boards=1
>   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>SelectType = select/cons_tres
>SelectTypeParameters = CR_CORE_MEMORY,OTHER_CONS_RES,CR_ONE_TASK_PER_CORE
>DefMemPerNode = UNLIMITED

From Pascal:

>1) salloc --ntasks-per-node=34 --threads-per-core=1 --cpus-per-task=2 --ntasks-per-core=1 --mem-per-cpu=1828
>   NumNodes=1 NumCPUs=136 NumTasks=34 CPUs/Task=2 ReqB:S:C:T=0:0:*:1
>   TRES=cpu=136,mem=124304M,node=1,billing=136
>   Socks/Node=* NtasksPerN:B:S:C=34:0:*:1 CoreSpec=*
>   MinCPUsNode=68 MinMemoryCPU=1828M MinTmpDiskNode=0
>2) salloc --ntasks-per-node=34 --threads-per-core=2 --cpus-per-task=2 --ntasks-per-core=1 --mem-per-cpu=1828
>   NumNodes=1 NumCPUs=68 NumTasks=34 CPUs/Task=2 ReqB:S:C:T=0:0:*:2
>   TRES=cpu=68,mem=124304M,node=1,billing=68
>   Socks/Node=* NtasksPerN:B:S:C=34:0:*:1 CoreSpec=*
>   MinCPUsNode=68 MinMemoryCPU=1828M MinTmpDiskNode=0
>3) salloc --ntasks-per-node=34 --threads-per-core=2 --cpus-per-task=3 --ntasks-per-core=1 --mem-per-cpu=1828
>salloc: Pending job allocation 63585
>salloc: job 63585 queued and waiting for resources
>HANGS INDEFINITELY

It is odd that 3) hangs indefinitely: the physical cores required are 34 * 2 = 68, and the memory should be 34 * 3 * 1828 ≈ 186456 M. If I reduce the memory requirement, I see that the memory actually being requested is 34 * 4 * 1828, which cannot be satisfied by any node. Yet the job is not rejected. Why is it not rejected?
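A quick arithmetic check makes the pending state less mysterious, assuming (as the rest of this thread discusses) that Slurm rounds each task up to whole cores, so a 3-CPU task on a ThreadsPerCore=2 node occupies 2 full cores = 4 CPUs. This is a sketch of that inference, not the actual Slurm logic:

```shell
#!/bin/sh
# Assumed rule (inferred from this thread, not from Slurm source): whole
# cores only, so --cpus-per-task=3 with --threads-per-core=2 becomes 2 cores
# = 4 CPUs per task, and --mem-per-cpu is charged for all 4.
ntasks=34; cpt=3; tpc=2; mpc=1828; real_mem=245000

cores_per_task=$(( (cpt + tpc - 1) / tpc ))     # 3 CPUs -> 2 full cores
mem=$(( ntasks * cores_per_task * tpc * mpc ))  # charged for 4 CPUs per task
echo "effective request: ${mem}M vs RealMemory ${real_mem}M"
# -> effective request: 248608M vs RealMemory 245000M
```

That reproduces the 34 * 4 * 1828 figure observed above, which no node can satisfy.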
Let's discuss the case of ThreadsPerCore=4, which is more generic than ThreadsPerCore=2, on a configuration like:

>NodeName=DEFAULT Sockets=4 CoresPerSocket=4 ThreadsPerCore=4 RealMemory=15000

1. Just using --mem-per-cpu:

># srun --mem-per-cpu=10 /bin/bash -c 'scontrol show job $SLURM_JOB_ID | grep TRES'
> TRES=cpu=4,mem=40M,node=1,billing=4

This is what we expect: since a CPU is by default a hardware thread and whole cores are allocated, we get the entire core and end up with 40M (4 ThreadsPerCore x 10M).

2. Adding --cpus-per-task into consideration:

># srun --mem-per-cpu=10 --cpus-per-task=1 /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep TRES' | uniq -c
> 4 TRES=cpu=4,mem=40M,node=1,billing=4

This results in 4 tasks being launched: since we've got 4 CPUs, the number of tasks got adjusted. Similarly:

># srun --mem-per-cpu=10 --cpus-per-task=2 /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep TRES' | uniq -c
> 2 TRES=cpu=4,mem=40M,node=1,billing=4

Requesting two CPUs per task on 4 CPUs (a single core), we get 2 tasks.

># srun --mem-per-cpu=10 --cpus-per-task=3 /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep TRES' | uniq -c
> 1 TRES=cpu=4,mem=40M,node=1,billing=4

When 3 CPUs are requested we have only one task, since we can't start more on 4 CPUs. When we request --cpus-per-task=5 (or 6, 7, 8) we get 8 CPUs = 2 cores allocated:

># srun --mem-per-cpu=10 --cpus-per-task=5 /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep TRES' | uniq -c
> 1 TRES=cpu=8,mem=80M,node=1,billing=8
># srun --mem-per-cpu=10 --cpus-per-task=6 /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep Nodes=tes' | uniq -c
> 1 Nodes=test01 CPU_IDs=0-7 Mem=80 GRES=
3. Add --threads-per-core to the specification (limiting the number of threads to be used per core):

># srun --mem-per-cpu=10 --cpus-per-task=2 --threads-per-core=1 /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep TRES' | uniq -c
> 1 TRES=cpu=8,mem=20M,node=1,billing=8
># srun --mem-per-cpu=10 --cpus-per-task=2 --threads-per-core=1 /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep Nodes=tes' | uniq -c
> 1 Nodes=test01 CPU_IDs=0-7 Mem=20 GRES=

In the previous case we got 2 tasks on a single core and only 4 CPUs were allocated; now we get 1 task, but because of the request for 2 CPUs per task combined with the requirement of 1 thread per core, two cores were allocated. The major reason we decided to limit the memory allocation to 20MB (2 CPUs in use) is that on a hybrid cluster such a request will result in the same number of tasks starting on nodes with 1 ThreadPerCore as on nodes with 4 ThreadsPerCore, and we shouldn't allocate more memory just because a different node was selected. The allocation here is calculated reliably regardless of the node type used.

Let me know if that helps, and please ask further questions. I believe it's better to discuss than to send another very complicated case analysis, since every added option makes the case grow exponentially more complex.

cheers,
Marcin
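The rule the examples above illustrate can be condensed into a tiny helper. This is a sketch of one reading of this thread (not the actual Slurm code), for the same ThreadsPerCore=4 test node with --mem-per-cpu=10 and a single task:

```shell
#!/bin/sh
# Assumed rule: only whole cores are allocated and all of their hardware
# threads are held, but memory is charged only for the threads the job
# may actually use under its --threads-per-core limit.
node_ht=4        # ThreadsPerCore on the node
mem_per_cpu=10   # --mem-per-cpu in MB

alloc() {  # alloc <cpus-per-task> <threads-per-core limit>
  cpt=$1; tpc=$2
  cores=$(( (cpt + tpc - 1) / tpc ))     # round up to whole cores
  cpus=$(( cores * node_ht ))            # every HT of those cores is held
  mem=$(( cores * tpc * mem_per_cpu ))   # only usable threads count for memory
  echo "cpus=$cpus mem=${mem}M"
}

alloc 2 1   # -> cpus=8 mem=20M  (the --threads-per-core=1 case above)
alloc 5 4   # -> cpus=8 mem=80M  (the --cpus-per-task=5 case above)
```

Both lines match the TRES output quoted above, which suggests the model is at least consistent with these examples.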
Pascal,

Is the explanation clear to you? Does it raise any additional questions?

cheers,
Marcin
Hi Marcin,

Thanks for the info. I've been busy with other stuff, but if I have any further questions, I'll try to post them soon.

Cheers,
Pascal
Pascal, Did you have some time to take a look at the case?
Hi Marcin,

So I have looked over the ticket, and I think the underlying issue is that --mem-per-cpu is too fluid. I know you mentioned it before, but consider the simple examples you show and things like:

srun -p debug --ntasks=1 --mem-per-cpu=128 --threads-per-core=1 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'
  NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=3 ReqB:S:C:T=0:0:*:1
  TRES=cpu=6,mem=384M,node=1,billing=6

srun -p debug --ntasks=1 --mem-per-cpu=128 --threads-per-core=2 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'
  NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=3 ReqB:S:C:T=0:0:*:2
  TRES=cpu=4,mem=512M,node=1,billing=4

Someone reading the request would assume that I am only asking for 3 execution threads, and so the memory should always be 384M, whether it is spread over 6 or 4 virtual cores.

The other issue highlighted here is that I can run something like

srun -p debug --nodes=1 --ntasks=34 --mem-per-cpu=1850 --threads-per-core=1 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'

and given the configuration settings, this should be fine: it should reserve 3*2*34 = 204 virtual cores (102 physical cores) and 102*1850 = 188700 M. However, the request fails with

srun: error: Unable to allocate resources: Requested node configuration is not available

If I use --threads-per-core=2 and then reserve round_up_to_nearest_even(3*2*34) = 204 virtual cores, it is fine and gives

srun -p debug --nodes=1 --ntasks=34 --mem-per-cpu=1850 --threads-per-core=2 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'
  NumNodes=1 NumCPUs=204 NumTasks=34 CPUs/Task=6 ReqB:S:C:T=0:0:*:2
  TRES=cpu=204,mem=188700M,node=1,billing=204

Even weirder for me, now having retested it, is that mem-per-cpu requests < 1850 seem to hang indefinitely. Or at least there are values in [1802,1840] that just sit indefinitely. Why? It does suggest underlying bugs. I will test other systems with the same version of Slurm.
Cheers, Pascal
Pascal,

Sorry for the long delay. I wanted to keep this in sync with Bug 14397, where we had a longer internal discussion and had to spend some time on code analysis to answer a potential RFE question.

>Someone reading the request would assume that I am only asking for 3 execution threads and so memory should always be 384, whether or not it is spread on 6 or 4 virtual cores.

At the allocation stage Slurm will only assign full cores to jobs, so if you have 2 hyper-threads (HT) per core, requesting 3 CPUs per task with a limit of 2 threads per core requires the allocation of 4 CPUs (2 full cores). If you limit the use of HT per core to 1, then 3 cores, which in your configuration means 6 CPUs, are required to fulfill the allocation spec.

Since you're testing with an allocating srun, those values are automatically inherited by the running step; however, it doesn't have to be like that. For instance, one can specify resources for the allocation with sbatch/salloc and then use only a subset of them in the steps running inside the allocation.

I see that it may be confusing, but if we decided to change it, probably even more users of the option would get results they don't expect. A job specification may contain many parameters, and when those aren't direct, the "interpretation" can always differ. In such cases (not only related to --mem-per-cpu) we always advise users to specify directly what they need. For instance, if the user knows the amount of memory required per node, it's best to state it directly using --mem.

> [...]reserve 3*2*34 virtual cores = 204 virtual cores (102 physical cores)

Since you've specified --threads-per-core=1, only one HT per core may be allocated, which requires 102 physical cores on the node; since only full cores are allocated, you're getting 204 CPUs (and you're accounted for that amount when you look at sacct). When you let the job use 2 threads per core, the allocation is smaller in terms of the number of CPUs given to the user.
The user isn't accounted for the additional CPUs, and the number of CPUs considered for the memory calculation is smaller.

>Even weirder for me now having retested it is that mem-per-cpu requests < 1850 seem to hang indefinitely. Or least there are values between [1802,1840] just sit indefinitely. Why? It does suggest underlying bugs.

I'd need to see the definitions of the nodes to comment. You can use the `--test-only`[1] option of srun to get an estimate of when the job can start.

cheers,
Marcin

[1] https://slurm.schedmd.com/srun.html#OPT_test-only
Pascal, Any update from your side? cheers, Marcin
(In reply to Marcin Stolarek from comment #7)
> Pascal,
>
> Any update from your side?
>
> cheers,
> Marcin

Hi Marcin,

So we've migrated to 22.05.2 on Pawsey systems (and will eventually need to update to 22.05.6 based on the bug fixes) and have also updated hwloc, so that Slurm is now L3-cache aware in that we can use l3cache_as_socket. I have been busy getting Setonix's software stack up and running, but I will quickly note that the computation of memory and billing still seems confusing to me. Sometimes I understand the underlying calculation. Example:

$ srun -p debug --nodes=1 --ntasks=34 --mem-per-cpu=50 --threads-per-core=1 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'

will report

NumNodes=1 NumCPUs=204 NumTasks=34 CPUs/Task=3 ReqB:S:C:T=0:0:*:1
TRES=cpu=204,mem=5100M,node=1,billing=204

$ srun -p debug --nodes=1 --ntasks=34 --mem-per-cpu=50 --threads-per-core=2 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'

reports

NumNodes=1 NumCPUs=136 NumTasks=34 CPUs/Task=3 ReqB:S:C:T=0:0:*:2
TRES=cpu=136,mem=6800M,node=1,billing=136

So it does look like --cpus-per-task is reserving 4 virtual CPUs per MPI task even though 3*34 = an even number, hence the 136 CPUs instead of the 102 it should be reserving. For the former case the total memory is 102*50 = 5100. The latter request asks for more total memory because, even though the request is for 102 CPUs, it rounds the CPUs per task up to a multiple of 2 and gets 136*50 = 6800. The issue here is that the calculated memory request at first glance appears to be at odds with the CPU reservation.
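For what it's worth, both TRES lines can be reproduced from the whole-core rule Marcin described earlier. The arithmetic below is a reconstruction from this thread, not the Slurm source (node ThreadsPerCore=2):

```shell
#!/bin/sh
# cores = ntasks * ceil(cpus-per-task / threads-per-core limit)
# cpu   = cores * node ThreadsPerCore  (whole cores are held)
# mem   = usable threads * --mem-per-cpu
ntasks=34; cpt=3; node_ht=2; mpc=50

for tpc in 1 2; do
  cores=$(( ntasks * ( (cpt + tpc - 1) / tpc ) ))
  echo "threads-per-core=$tpc: cpu=$(( cores * node_ht )),mem=$(( cores * tpc * mpc ))M"
done
# -> threads-per-core=1: cpu=204,mem=5100M
# -> threads-per-core=2: cpu=136,mem=6800M
```

Under this model the 6800M is not actually at odds with the CPU count: with --threads-per-core=2 each task is rounded up to 2 full cores (4 CPUs), and all 4 are charged for memory.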
Now updating the memory request but keeping the CPU requests the same leads to more confusion (and the mem-per-cpu bug cropping up):

$ srun -p debug --nodes=1 --ntasks=34 --mem-per-cpu=1850 --threads-per-core=2 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'

reports

NumNodes=1 NumCPUs=204 NumTasks=34 CPUs/Task=6 ReqB:S:C:T=0:0:*:2
TRES=cpu=204,mem=188700M,node=1,billing=204
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=3 MinMemoryCPU=925M MinTmpDiskNode=0

$ srun -p debug --nodes=1 --ntasks=34 --mem-per-cpu=1850 --threads-per-core=1 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'

fails, but should have been calculating 1850*102 = 188700.

Likely I'll wait till we get the fully patched version to see how it behaves, and continue to ask users to specify --mem when possible. Happy to close the ticket and reopen it if necessary once we have the latest version.

Cheers,
Pascal
Pascal,

I agree that one may want to differentiate the meaning of --mem-per-cpu between the allocation and the step, especially when thinking about it in terms of DefMemPerCPU set in slurm.conf. That's the reason we considered the potential enhancement from Bug 14397 (development of something like --mem-per-core / DefMemPerCore) as a possible way to go.

I'll mark the bug as resolved now; as agreed, please reopen if you have any further questions.

cheers,
Marcin