Ticket 10262 - Memory TRES is not calculated correctly with HT and --mem-per-cpu
Summary: Memory TRES is not calculated correctly with HT and --mem-per-cpu
Status: RESOLVED DUPLICATE of ticket 9724
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.6
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marshall Garey
 
Reported: 2020-11-20 06:19 MST by CSC sysadmins
Modified: 2020-11-23 15:59 MST

See Also:
Site: CSC - IT Center for Science


Attachments
current config (16.20 KB, text/plain)
2020-11-20 06:25 MST, CSC sysadmins

Description CSC sysadmins 2020-11-20 06:19:16 MST
Hi,

test case: srun -N 1 -n1 -c24 --mem-per-cpu=4000

A job submitted to a partition with the following kind of nodes is unable to run.
CPUS=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=190000


JobState=PENDING Reason=Resources Dependency=(null)

NumNodes=1-1 NumCPUs=24 NumTasks=1 CPUs/Task=24 ReqB:S:C:T=0:0:*:1
TRES=cpu=24,mem=96000M,node=1,billing=24
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=24 MinMemoryCPU=4000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00

The actually requested amount of memory is 2x 96000M, but that is really hard to find out. I set slurmctld to debug3 and it still could not tell me why the job was pending.

backfill: Failed to start JobId=xxx avail: Requested nodes are busy
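The doubling described above can be modeled with a small sketch. This is not Slurm source code, just an illustration under the assumption stated in this ticket: when individual hyperthreads cannot be allocated, each requested CPU pulls in its sibling thread, and every allocated thread is charged --mem-per-cpu.

```python
# Hedged sketch (not Slurm source): a model of how --mem-per-cpu can double
# the real memory request on hyperthreaded nodes.
# Assumption: each requested CPU is backed by a full core, so both sibling
# threads count toward mem-per-cpu.

THREADS_PER_CORE = 2     # from the node: ThreadsPerCore=2
REAL_MEMORY_MB = 190000  # from the node: RealMemory=190000

def effective_mem_mb(requested_cpus: int, mem_per_cpu_mb: int) -> int:
    # Every requested CPU also allocates its sibling hyperthread,
    # and each allocated thread is charged mem-per-cpu.
    allocated_cpus = requested_cpus * THREADS_PER_CORE
    return allocated_cpus * mem_per_cpu_mb

# srun -N 1 -n1 -c24 --mem-per-cpu=4000:
print(effective_mem_mb(24, 4000))  # 192000 -> exceeds RealMemory=190000
# yet slurmctld displays TRES mem=96000M (24 * 4000M), hiding the doubling
```

Under this model the job asks for 192000M on a 190000M node, so it can never start, while the displayed TRES value of 96000M suggests it should fit.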
Comment 1 CSC sysadmins 2020-11-20 06:25:21 MST
Created attachment 16754 [details]
current config
Comment 2 CSC sysadmins 2020-11-20 06:50:40 MST
Also, from the user's point of view the situation is awkward:

srun -p fmitest -N 1 -n1 -c1 --mem-per-cpu=200000 --pty $SHELL
srun: error: Memory specification can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available

which is understandable but this is not:

srun -p fmitest -N 1 -n1 -c1 --mem-per-cpu=100000 --pty $SHELL
srun: job 4088872 queued and waiting for resources

Effectively this is a 200G job, which is not possible to run on that partition.
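The asymmetry between the two srun calls can be sketched as follows. This is a hypothetical model, not Slurm's actual check: the assumption is that the up-front feasibility check compares --mem-per-cpu against node memory without the hyperthread doubling, so only the 200000M request is rejected immediately, while the 100000M request passes the check and then pends forever.

```python
# Hedged sketch of why --mem-per-cpu=200000 fails fast but =100000 queues
# forever on a ThreadsPerCore=2, RealMemory=190000 node.
# Assumption: the up-front check ignores the hyperthread doubling.

threads_per_core = 2
real_memory_mb = 190000

def upfront_check_passes(mem_per_cpu_mb: int) -> bool:
    # Simplistic model: reject only if one CPU's memory alone
    # already exceeds the node's RealMemory.
    return mem_per_cpu_mb <= real_memory_mb

for mem in (200000, 100000):
    doubled = mem * threads_per_core  # what the allocation really needs
    print(mem, upfront_check_passes(mem), doubled > real_memory_mb)
# 200000 -> rejected immediately ("Memory specification can not be satisfied")
# 100000 -> accepted, but the 200000M effective request can never fit
```

In this model both requests are unsatisfiable, but only one is detected at submission time, which is the awkward behavior the comment above describes.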
Comment 5 Marshall Garey 2020-11-20 14:51:37 MST
Hi Tommi,

This is a duplicate of bug 9724. The title isn't exactly the same, but the fix there resolves this issue with --mem-per-cpu and hyperthreads.

Let me know if you have any more questions. For now, I'm closing this as a dup of 9724.

*** This ticket has been marked as a duplicate of ticket 9724 ***