Ticket 10262 - Memory TRES is not calculated correctly with HT and --mem-per-cpu
Summary: Memory TRES is not calculated correctly with HT and --mem-per-cpu
Status: RESOLVED DUPLICATE of ticket 9724
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.6
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marshall Garey
 
Reported: 2020-11-20 06:19 MST by CSC sysadmins
Modified: 2020-11-23 15:59 MST

See Also:
Site: CSC - IT Center for Science


Attachments
current config (16.20 KB, text/plain)
2020-11-20 06:25 MST, CSC sysadmins

Description CSC sysadmins 2020-11-20 06:19:16 MST
Hi,

test case: srun -N 1 -n1 -c24 --mem-per-cpu=4000

A job submitted to a partition with the following kind of nodes is unable to run.
CPUS=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=190000


JobState=PENDING Reason=Resources Dependency=(null)

NumNodes=1-1 NumCPUs=24 NumTasks=1 CPUs/Task=24 ReqB:S:C:T=0:0:*:1
TRES=cpu=24,mem=96000M,node=1,billing=24
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=24 MinMemoryCPU=4000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00

The actually requested amount of memory is 2x 96000M, but that is really hard to find out. I set slurmctld to debug3 and it still could not tell me why the job was pending.

backfill: Failed to start JobId=xxx avail: Requested nodes are busy
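The doubling described above can be modeled with a small sketch. This is not Slurm source code, just an illustration under the assumption stated in this ticket: when individual hyperthreads cannot be allocated, each requested CPU pulls in its sibling thread, and every allocated thread is charged --mem-per-cpu.

```python
# Hedged sketch (not Slurm source): a model of how --mem-per-cpu can double
# the real memory request on hyperthreaded nodes.
# Assumption: each requested CPU is backed by a full core, so both sibling
# threads count toward mem-per-cpu.

THREADS_PER_CORE = 2     # from the node: ThreadsPerCore=2
REAL_MEMORY_MB = 190000  # from the node: RealMemory=190000

def effective_mem_mb(requested_cpus: int, mem_per_cpu_mb: int) -> int:
    # Every requested CPU also allocates its sibling hyperthread,
    # and each allocated thread is charged mem-per-cpu.
    allocated_cpus = requested_cpus * THREADS_PER_CORE
    return allocated_cpus * mem_per_cpu_mb

# srun -N 1 -n1 -c24 --mem-per-cpu=4000:
print(effective_mem_mb(24, 4000))  # 192000 -> exceeds RealMemory=190000
# yet slurmctld displays TRES mem=96000M (24 * 4000M), hiding the doubling
```

Under this model the job asks for 192000M on a 190000M node, so it can never start, while the displayed TRES value of 96000M suggests it should fit.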
Comment 1 CSC sysadmins 2020-11-20 06:25:21 MST
Created attachment 16754 [details]
current config
Comment 2 CSC sysadmins 2020-11-20 06:50:40 MST
Also, from the user's point of view the situation is awkward:

srun -p fmitest -N 1 -n1 -c1 --mem-per-cpu=200000 --pty $SHELL
srun: error: Memory specification can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available

which is understandable but this is not:

srun -p fmitest -N 1 -n1 -c1 --mem-per-cpu=100000 --pty $SHELL
srun: job 4088872 queued and waiting for resources

Effectively this is a 200G job, which is not possible to run on that partition.
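The asymmetry between the two srun calls can be sketched as follows. This is a hypothetical model, not Slurm's actual check: the assumption is that the up-front feasibility check compares --mem-per-cpu against node memory without the hyperthread doubling, so only the 200000M request is rejected immediately, while the 100000M request passes the check and then pends forever.

```python
# Hedged sketch of why --mem-per-cpu=200000 fails fast but =100000 queues
# forever on a ThreadsPerCore=2, RealMemory=190000 node.
# Assumption: the up-front check ignores the hyperthread doubling.

threads_per_core = 2
real_memory_mb = 190000

def upfront_check_passes(mem_per_cpu_mb: int) -> bool:
    # Simplistic model: reject only if one CPU's memory alone
    # already exceeds the node's RealMemory.
    return mem_per_cpu_mb <= real_memory_mb

for mem in (200000, 100000):
    doubled = mem * threads_per_core  # what the allocation really needs
    print(mem, upfront_check_passes(mem), doubled > real_memory_mb)
# 200000 -> rejected immediately ("Memory specification can not be satisfied")
# 100000 -> accepted, but the 200000M effective request can never fit
```

In this model both requests are unsatisfiable, but only one is detected at submission time, which is the awkward behavior the comment above describes.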
Comment 5 Marshall Garey 2020-11-20 14:51:37 MST
Hi Tommi,

This is a duplicate of bug 9724. The title isn't exactly the same, but the fix there resolves this issue with --mem-per-cpu and hyperthreads.

Let me know if you have any more questions. For now, I'm closing this as a dup of 9724.

*** This ticket has been marked as a duplicate of ticket 9724 ***