Ticket 10262

Summary: Memory TRES is not calculated correctly with HT and --mem-per-cpu
Product: Slurm
Reporter: CSC sysadmins <csc-slurm-tickets>
Component: Scheduling
Assignee: Marshall Garey <marshall>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact
Priority: ---
CC: cinek, marshall
Version: 20.02.6
Hardware: Linux
OS: Linux
Site: CSC - IT Center for Science
Attachments: current config

Description CSC sysadmins 2020-11-20 06:19:16 MST
Hi,

test case: srun -N 1 -n1 -c24 --mem-per-cpu=4000

A job submitted to a partition with the following kind of nodes is not able to run:
CPUS=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=190000


JobState=PENDING Reason=Resources Dependency=(null)

NumNodes=1-1 NumCPUs=24 NumTasks=1 CPUs/Task=24 ReqB:S:C:T=0:0:*:1
TRES=cpu=24,mem=96000M,node=1,billing=24
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=24 MinMemoryCPU=4000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00

The actually requested amount of memory is 2x 96000M, but that is really hard to find out. I set slurmctld to debug3 and it could not tell me why the job was pending.
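The arithmetic behind the doubled request can be sketched as follows. This is an illustrative model, not Slurm internals: it assumes that with --mem-per-cpu on a hyperthreaded node, whole cores are allocated, so every hardware thread of each allocated core counts toward the per-CPU memory request, while the TRES line only shows the undoubled product.

```python
def effective_mem_mb(mem_per_cpu_mb, cpus_per_task, threads_per_core):
    """Hypothetical sketch of the effective memory request (names illustrative)."""
    # What the job record shows: TRES mem = mem-per-cpu * requested CPUs.
    displayed = mem_per_cpu_mb * cpus_per_task
    # What is actually required: whole cores are allocated, so each core's
    # threads all count against the per-CPU memory limit.
    actual = displayed * threads_per_core
    return displayed, actual

# srun -N 1 -n1 -c24 --mem-per-cpu=4000 on a ThreadsPerCore=2 node:
displayed, actual = effective_mem_mb(4000, 24, 2)
print(displayed, actual)  # 96000 192000
```

With RealMemory=190000 on the node, the effective 192000M request can never fit, which is why the job pends forever with only "Requested nodes are busy" in the logs.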

backfill: Failed to start JobId=xxx avail: Requested nodes are busy
Comment 1 CSC sysadmins 2020-11-20 06:25:21 MST
Created attachment 16754: current config
Comment 2 CSC sysadmins 2020-11-20 06:50:40 MST
Also, from the user's point of view the situation is awkward:

srun -p fmitest -N 1 -n1 -c1 --mem-per-cpu=200000 --pty $SHELL
srun: error: Memory specification can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available

which is understandable but this is not:

srun -p fmitest -N 1 -n1 -c1 --mem-per-cpu=100000 --pty $SHELL
srun: job 4088872 queued and waiting for resources

With hyperthreading the 100G request effectively becomes 200G, and a 200G job can never run on that partition, so it should be rejected instead of queued.
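The inconsistency above can be sketched as two separate checks (illustrative assumptions only: submission-time validation compares the raw per-CPU request against node memory, while scheduling uses the thread-doubled request; the node figures come from the partition definition above):

```python
NODE_MEM_MB = 190000      # RealMemory of the fmitest nodes
THREADS_PER_CORE = 2

def submit_check(mem_per_cpu_mb):
    # Submission-time validation: compares the raw per-CPU request.
    return mem_per_cpu_mb <= NODE_MEM_MB

def schedulable(mem_per_cpu_mb, cpus=1):
    # Scheduling: the request is doubled on a hyperthreaded node.
    return mem_per_cpu_mb * cpus * THREADS_PER_CORE <= NODE_MEM_MB

# --mem-per-cpu=200000: rejected immediately at submit time.
print(submit_check(200000))                       # False
# --mem-per-cpu=100000: passes submission, but can never be scheduled.
print(submit_check(100000), schedulable(100000))  # True False
```

The second case is the awkward one: the job is accepted and queued even though, under the doubled request, no node in the partition can ever satisfy it.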
Comment 5 Marshall Garey 2020-11-20 14:51:37 MST
Hi Tommi,

This is a duplicate of bug 9724. The title isn't exactly the same, but the fix there also resolves the issue with --mem-per-cpu and hyperthreads.

Let me know if you have any more questions. For now, I'm closing this as a dup of 9724.

*** This ticket has been marked as a duplicate of ticket 9724 ***