Ticket 10262

Summary: Memory TRES is not calculated correctly with HT and --mem-per-cpu
Product: Slurm
Reporter: CSC sysadmins <csc-slurm-tickets>
Component: Scheduling
Assignee: Marshall Garey <marshall>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact
Priority: ---
CC: cinek, marshall
Version: 20.02.6
Hardware: Linux
OS: Linux
Site: CSC - IT Center for Science
Attachments: current config

Description CSC sysadmins 2020-11-20 06:19:16 MST
Hi,

test case: srun -N 1 -n1 -c24 --mem-per-cpu=4000

A job submitted to a partition with the following kind of nodes is not able to run:
CPUS=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=190000


JobState=PENDING Reason=Resources Dependency=(null)

NumNodes=1-1 NumCPUs=24 NumTasks=1 CPUs/Task=24 ReqB:S:C:T=0:0:*:1
TRES=cpu=24,mem=96000M,node=1,billing=24
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=24 MinMemoryCPU=4000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00

The actually requested amount of memory is 2x 96000M, but that is really hard to find out. I set slurmctld to debug3 and it could not tell me why the job was pending.
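The arithmetic behind the doubled request can be sketched as follows. This is an illustrative model, not Slurm internals: it assumes that with --mem-per-cpu on a hyperthreaded node, whole cores are allocated, so every hardware thread of each allocated core counts toward the per-CPU memory request, while the TRES line only shows the undoubled product.

```python
def effective_mem_mb(mem_per_cpu_mb, cpus_per_task, threads_per_core):
    """Hypothetical sketch of the effective memory request (names illustrative)."""
    # What the job record shows: TRES mem = mem-per-cpu * requested CPUs.
    displayed = mem_per_cpu_mb * cpus_per_task
    # What is actually required: whole cores are allocated, so each core's
    # threads all count against the per-CPU memory limit.
    actual = displayed * threads_per_core
    return displayed, actual

# srun -N 1 -n1 -c24 --mem-per-cpu=4000 on a ThreadsPerCore=2 node:
displayed, actual = effective_mem_mb(4000, 24, 2)
print(displayed, actual)  # 96000 192000
```

With RealMemory=190000 on the node, the effective 192000M request can never fit, which is why the job pends forever with only "Requested nodes are busy" in the logs.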

backfill: Failed to start JobId=xxx avail: Requested nodes are busy
Comment 1 CSC sysadmins 2020-11-20 06:25:21 MST
Created attachment 16754: current config
Comment 2 CSC sysadmins 2020-11-20 06:50:40 MST
Also, from the user's point of view the situation is awkward:

srun -p fmitest -N 1 -n1 -c1 --mem-per-cpu=200000 --pty $SHELL
srun: error: Memory specification can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available

which is understandable but this is not:

srun -p fmitest -N 1 -n1 -c1 --mem-per-cpu=100000 --pty $SHELL
srun: job 4088872 queued and waiting for resources

With hyperthreading the 100G request effectively becomes 200G, and a 200G job can never run on that partition, so it should be rejected instead of queued.
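The inconsistency above can be sketched as two separate checks (illustrative assumptions only: submission-time validation compares the raw per-CPU request against node memory, while scheduling uses the thread-doubled request; the node figures come from the partition definition above):

```python
NODE_MEM_MB = 190000      # RealMemory of the fmitest nodes
THREADS_PER_CORE = 2

def submit_check(mem_per_cpu_mb):
    # Submission-time validation: compares the raw per-CPU request.
    return mem_per_cpu_mb <= NODE_MEM_MB

def schedulable(mem_per_cpu_mb, cpus=1):
    # Scheduling: the request is doubled on a hyperthreaded node.
    return mem_per_cpu_mb * cpus * THREADS_PER_CORE <= NODE_MEM_MB

# --mem-per-cpu=200000: rejected immediately at submit time.
print(submit_check(200000))                       # False
# --mem-per-cpu=100000: passes submission, but can never be scheduled.
print(submit_check(100000), schedulable(100000))  # True False
```

The second case is the awkward one: the job is accepted and queued even though, under the doubled request, no node in the partition can ever satisfy it.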
Comment 5 Marshall Garey 2020-11-20 14:51:37 MST
Hi Tommi,

This is a duplicate of bug 9724. The title isn't exactly the same, but the fix there also resolves the issue with --mem-per-cpu and hyperthreads.

Let me know if you have any more questions. For now, I'm closing this as a dup of 9724.

*** This ticket has been marked as a duplicate of ticket 9724 ***