I'm currently testing Slurm 22.05.03 and seeing some inconsistent behavior. The node in this case has 32 physical cores with hyperthreading, so 64 logical CPUs. In slurm.conf we have essentially these settings:

GresTypes=gpu
TaskPlugin=task/cgroup,task/affinity
DefMemPerCPU=512
DefMemPerGPU=16384
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE,CR_CORE_DEFAULT_DIST_BLOCK

NodeName=gpu[50] Feature="gpu,avx2,naples,ampere" CPUs=64 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=126976 Gres=gpu:a5000:4

PartitionName=gpu3 Nodes=gpu[50] DefCpuPerGPU=16 OverSubscribe=NO

The idea is to slice the machine into "GPU slots", so if I just do

$ salloc -p gpu3 -G 1

I get one GPU and a quarter of the CPU resources. Since CR_ONE_TASK_PER_CORE is set, the maximum number of tasks should be 32, and if I request more I get an error message (as it should be):

$ salloc -p gpu3 -N 1 -n 64
salloc: error: Job submit/allocate failed: Requested node configuration is not available

However, if I request all 4 GPUs and then check TASKS_PER_NODE, I see 64:

$ salloc -p gpu3 -N 1 -G 4
$ env | grep TASKS
SLURM_TASKS_PER_NODE=64

Also possibly related: if I increase the DefMemPerCPU setting to 1024, I cannot allocate 4 GPUs:

$ salloc -p gpu3 -N 1 -G 4
salloc: error: Job submit/allocate failed: Requested node configuration is not available
$ salloc -p gpu3 -N 1 -G 4 --mem=120G
salloc: Granted job allocation 9

I've tested with lower values and the breaking point is at 992, i.e. RealMemory/64/2.
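For reference, the intended "GPU slot" sizing follows directly from the defaults in the config above; a quick sketch (variable names are mine, values taken from the config):

```shell
# Each GPU pulls in DefCpuPerGPU logical CPUs and DefMemPerGPU MiB by default,
# so a 4-GPU request should claim the whole node's CPU count.
def_cpu_per_gpu=16
def_mem_per_gpu=16384   # MiB
gpus=4
echo "cpus=$(( gpus * def_cpu_per_gpu )) mem_mib=$(( gpus * def_mem_per_gpu ))"
# -> cpus=64 mem_mib=65536 (i.e. 64G)
```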
Sorry, some typos got into the latter part of the text while I was editing. Since I'm reposting that part, I might as well add some more info.

Also possibly related: if I increase the DefMemPerCPU setting to 1024, I cannot allocate 4 GPUs without the --mem flag:

$ salloc -p gpu3 -N 1 -G 4
salloc: error: Job submit/allocate failed: Requested node configuration is not available
$ salloc -p gpu3 -N 1 -G 4 --mem=120G
salloc: Granted job allocation 9

Note that the first command should request 64G, which fails, yet I can request 120G explicitly. I've tested with lower values and the breaking point is at 992, i.e. RealMemory/64/2. So it seems the first command is actually checking for 128G instead of 64G. Yet the final allocation has the correct amount; here with DefMemPerCPU=992:

$ salloc -p gpu3 -N 1 -G 4
salloc: Granted job allocation 10
$ scontrol show jobid=10
   NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=64G,node=1,billing=64
   CpusPerTres=gpu:16
   MemPerTres=gpu:16384
   TresPerJob=gres:gpu:4
[...]

The behavior is the same if I remove DefMemPerGPU from the config file and keep only DefMemPerCPU, so it does not seem to be related to that. The only difference is that the allocated memory then clearly comes from the DefMemPerCPU setting, since I get 62G instead of the 64G above:
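To make the breaking-point observation explicit, here is the arithmetic (my reconstruction from the numbers above, not from Slurm source; the hypothesis is that the availability check tests twice the memory that the final allocation reports):

```shell
real_memory=126976                              # MiB, the node's RealMemory
logical_cpus=64
# Largest DefMemPerCPU that still fits if the check doubles the request:
echo $(( real_memory / logical_cpus / 2 ))      # -> 992
# With DefMemPerCPU=1024 the doubled check would exceed RealMemory:
def_mem_per_cpu=1024
echo $(( logical_cpus * def_mem_per_cpu * 2 ))  # -> 131072 MiB, i.e. 128G > 126976
```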
   NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=62G,node=1,billing=64
   CpusPerTres=gpu:16
   TresPerJob=gres:gpu:4

But again with DefMemPerCPU=1024:

$ salloc -v -p gpu3 -N 1 -G 4
salloc: defined options
salloc: -------------------- --------------------
salloc: gpus                : 4
salloc: nodes               : 1
salloc: partition           : gpu3
salloc: verbose             : 1
salloc: -------------------- --------------------
salloc: end of defined options
salloc: select/cray_aries: init: Cray/Aries node selection plugin loaded
salloc: select/linear: init: Linear node selection plugin loaded with argument 4372
salloc: select/cons_res: common_init: select/cons_res loaded
salloc: select/cons_tres: common_init: select/cons_tres loaded
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 12 has been revoked.
$ scontrol show jobid=12
JobId=12 JobName=interactive
   JobState=FAILED Reason=BadConstraints Dependency=(null)
   Scheduler=Main
   Partition=gpu3 AllocNode:Sid=login2:173281
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=N/A CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1G,node=1,billing=1
   CpusPerTres=gpu:16
   TresPerJob=gres:gpu:4
[...]
I've looked through the closed bug reports and found https://bugs.schedmd.com/show_bug.cgi?id=14223

I think the suggestion there, to manually set the number of CPUs to the number of physical cores, takes care of the inconsistencies I'm seeing here, i.e. changing these lines in my slurm.conf to:

DefMemPerCPU=1024

NodeName=gpu[50] Feature="gpu,avx2,naples,ampere" CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=126976 Gres=gpu:a5000:4

PartitionName=gpu3 Nodes=gpu[50] DefCpuPerGPU=8 OverSubscribe=NO

I suppose configuring resources as physical cores is an explicit way of forcing CR_ONE_TASK_PER_CORE?
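For completeness, the revised numbers still tile the node exactly; a quick sanity check (variable names are mine, values from the revised config):

```shell
# With CPUs counted as physical cores, DefCpuPerGPU drops to 8, so a
# 4-GPU job again consumes the whole node's cores.
gpus=4
def_cpu_per_gpu=8
cores=$(( gpus * def_cpu_per_gpu ))
echo "cores=${cores}"                          # -> cores=32
# Default memory at DefMemPerCPU=1024 stays well within RealMemory=126976:
def_mem_per_cpu=1024
echo "mem_mib=$(( cores * def_mem_per_cpu ))"  # -> mem_mib=32768
```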
As much as I enjoy my own company, I did file this bug in the hope of getting some feedback ;-)
Stefan, Jess Arrington <jess@schedmd.com> was supposed to have contacted you about this last week. Did you get a direct email from Jess about this? Jacob
No, I didn't receive any email. But I do get and delete spam on that account, so I could have accidentally deleted it. Could you ask Jess to resend it, please? I'll add the email address to my whitelist.