| Summary: | SLURM_TASKS_PER_NODE inconsistent with CR_ONE_TASK_PER_CORE and DefCpuPerGPU | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Stefan <stefan> |
| Component: | Scheduling | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | | |
| Priority: | --- | | |
| Version: | 22.05.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Stefan
2022-09-15 12:48:58 MDT
Sorry, some typos crept into the latter part of the text while I was editing. Since I'm reposting that part anyway, I might as well add some more information.

Also possibly related: if I increase the DefMemPerCPU setting to 1024, I cannot allocate 4 GPUs without the --mem flag:

```
$ salloc -p gpu3 -N 1 -G 4
salloc: error: Job submit/allocate failed: Requested node configuration is not available
$ salloc -p gpu3 -N 1 -G 4 --mem=120G
salloc: Granted job allocation 9
```

Note that the first command should request 64G, yet it fails, while I can request 120G explicitly. I've tested with lower values and the breaking point is at 992, i.e. RealMemory/64/2. So it seems the first command is actually checking for 128G instead of 64G. Yet the final allocation has the correct amount; here with DefMemPerCPU=992:

```
$ salloc -p gpu3 -N 1 -G 4
salloc: Granted job allocation 10
$ scontrol show jobid=10
NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=64G,node=1,billing=64
CpusPerTres=gpu:16
MemPerTres=gpu:16384
TresPerJob=gres:gpu:4
[...]
```

The behavior is the same if I remove DefMemPerGPU from the config file and keep only DefMemPerCPU, so it does not seem to be related to that. The only difference is that the allocated memory now clearly comes from the DefMemPerCPU setting, since I get 62G instead of the 64G above:

```
NumNodes=1 NumCPUs=64 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=62G,node=1,billing=64
CpusPerTres=gpu:16
TresPerJob=gres:gpu:4
```

But again with DefMemPerCPU=1024:

```
$ salloc -v -p gpu3 -N 1 -G 4
salloc: defined options
salloc: -------------------- --------------------
salloc: gpus                : 4
salloc: nodes               : 1
salloc: partition           : gpu3
salloc: verbose             : 1
salloc: -------------------- --------------------
salloc: end of defined options
salloc: select/cray_aries: init: Cray/Aries node selection plugin loaded
salloc: select/linear: init: Linear node selection plugin loaded with argument 4372
salloc: select/cons_res: common_init: select/cons_res loaded
salloc: select/cons_tres: common_init: select/cons_tres loaded
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 12 has been revoked.
$ scontrol show jobid=12
JobId=12 JobName=interactive
JobState=FAILED Reason=BadConstraints Dependency=(null)
Scheduler=Main
Partition=gpu3 AllocNode:Sid=login2:173281
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=1 NumTasks=N/A CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=1G,node=1,billing=1
CpusPerTres=gpu:16
TresPerJob=gres:gpu:4
[...]
```

I've looked through the closed bug reports and found https://bugs.schedmd.com/show_bug.cgi?id=14223. I think the idea from there, manually setting the number of CPUs to the number of physical cores, takes care of the inconsistencies I'm seeing here, i.e. changing these lines in my slurm.conf to:

```
DefMemPerCPU=1024
NodeName=gpu[50] Feature="gpu,avx2,naples,ampere" CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=126976 Gres=gpu:a5000:4
PartitionName=gpu3 Nodes=gpu[50] DefCpuPerGPU=8 OverSubscribe=NO
```

I suppose configuring the resources as physical cores is an explicit way of forcing CR_ONE_TASK_PER_CORE?

As much as I enjoy my own company, I did file this bug in the hope of getting some feedback ;-)

Stefan, Jess Arrington <jess@schedmd.com> was supposed to have contacted you about this last week. Did you get a direct email from Jess about this?

Jacob

No, I didn't receive any email. But I do get and delete spam on that account, so I could have accidentally deleted it. Could you ask Jess to resend it, please? I'll add the email address to my whitelist.
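
For anyone following the numbers above, here is a minimal sketch of the arithmetic behind the 992 MB breaking point. The figures come from the gpu50 node definition quoted earlier (RealMemory=126976, 64 hardware threads exposed as CPUs, NumCPUs=64 in the granted allocation); the factor-of-two multiplication is only a reading of the observed rejection, not a documented Slurm rule.

```
# Memory the 4-GPU job should need with DefMemPerCPU=1024: 64 CPUs x 1024 MB
$ echo $(( 64 * 1024 ))
65536
# What the availability check appears to test instead (threads counted twice):
# 131072 MB > RealMemory=126976, hence "Requested node configuration is not available"
$ echo $(( 64 * 2 * 1024 ))
131072
# With DefMemPerCPU=992 the doubled figure equals RealMemory exactly, so the request passes
$ echo $(( 64 * 2 * 992 ))
126976
```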