| Summary: | Memory QoS and multi-partition job submission request | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | NASA JSC Aerolab <JSC-DL-AEROLAB-ADMIN> |
| Component: | Configuration | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.02.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Johnson Space Center | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Slurm config files: slurm.conf, job_submit.lua, cgroup.conf; QoS Limits Info | | |
Hello,

Thanks for the details and information you included. From the PDF you included, it looks like you are using the MaxTRESPU limit on the QOS to limit the amount of memory and the number of CPUs. This should work to limit the CPUs/memory allocated to a user at any given time. It's possible that another limit is coming into play or that the limits aren't being picked up properly. Can I have you send the output showing the full details of the QOS definitions?

    sacctmgr show qos

I would also like to see how the usage is being reported when you have an array that is using more memory than it should. While the system is in this state, can you run the following command and send the output?

    scontrol show assoc flags=qos

Thanks,
Ben

Created attachment 17702 [details]
QoS Limits Info
Ben, thanks for your quick answer. Yes, in addition to the MaxTRESPU we have a couple of MaxTRES limits, but they should not be interfering with the corresponding partitions. Take a look at the attached files containing the output of the commands you asked us to run.

Thanks for providing that information. It does look like the MaxTRESPU definitions on the QOS should be the only ones enforced with the way you've got them configured. I looked at the output of 'scontrol show assoc flags=qos' to see how these limits are being enforced, and this output does show that each of these QOSes is using up to the defined limits and not more. The lines I'm looking at are here:
QOS=normlimit(67)
...
MaxTRESPU=cpu=840(210),mem=3440640(3440640),energy=N(0),node=N(70),billing=N(210),fs/disk=N(0),vmem=N(0),pages=N(0)
...
QOS=hpfsl(71)
...
MaxTRESPU=cpu=140(35),mem=573440(573440),energy=N(0),node=N(12),billing=N(35),fs/disk=N(0),vmem=N(0),pages=N(0)
These show the different attributes that can have limits placed on them, like cpu, mem, etc. The first number is the limit to be enforced (with memory in megabytes), or 'N' if no limit is defined. The number in parentheses is the amount of that resource currently in use for the QOS. You can see that for both of these QOSes the memory limit and the amount in use show the same value.
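The `limit(used)` format of these lines can be pulled apart with a short shell loop. This is just an illustration of how to read the output, not a SchedMD tool; the sample line is copied from the output above.

```shell
#!/bin/sh
# Parse a MaxTRESPU line from `scontrol show assoc flags=qos` into
# per-resource limit/used pairs. 'N' means no limit is defined, and
# mem values are in megabytes.
line='MaxTRESPU=cpu=140(35),mem=573440(573440),energy=N(0),node=N(12)'

echo "${line#MaxTRESPU=}" | tr ',' '\n' | while IFS='=' read -r res val; do
    limit=${val%%\(*}     # text before the parenthesis: the enforced limit
    used=${val#*\(}       # text inside the parentheses: current usage
    used=${used%\)}
    printf '%s: limit=%s used=%s\n' "$res" "$limit" "$used"
done
```

For the sample line this prints one row per resource, e.g. `cpu: limit=140 used=35` and `mem: limit=573440 used=573440`, making the "limit equals usage" condition easy to spot.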
Reading through the output of the ticket you sent earlier, I think I may see why the numbers you were looking for didn't align. In one of the updates Darby Vickers shows a script he put together to calculate the amount of memory used by each partition. He calculates the number of jobs running in each queue like this:
squeue -M FSL -p normal -t R | wc -l
The problem with this is that it includes a header line. You can add the '-h' flag to print the output without the header. Here is an example:
$ squeue -phigh -tR
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2069_19 high wrap ben R 4:32 1 node01
2069_20 high wrap ben R 4:32 1 node02
$ squeue -phigh -tR -h
2069_19 high wrap ben R 4:35 1 node01
2069_20 high wrap ben R 4:35 1 node02
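The resulting off-by-one can be reproduced without a cluster; here the squeue output is simulated with a fixed string (the job and node names are just placeholders), so the snippet runs anywhere.

```shell
#!/bin/sh
# Demonstrate that counting squeue lines without `-h` includes the header.
with_header='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2069_19 high wrap ben R 4:32 1 node01
2069_20 high wrap ben R 4:32 1 node02'

# Piping straight to wc -l counts the header as a "job": reports 3
echo "$with_header" | wc -l

# Dropping the header line (what `squeue -h` does) gives the true count: 2
echo "$with_header" | tail -n +2 | wc -l
```

So every partition counted this way is over-reported by exactly one job, which matters once the count is multiplied by per-job memory.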
If you subtract 32 from the amount of memory reported in the hpnormal QOS output in that ticket update, it goes from 608G to 576G. This is still higher than the 560G you have defined, but that could be due to rounding in the conversion from MB to GB. The usage in the normal QOS was lower than expected, but something else may have been running that was preventing some of these jobs from being able to start.
I think the best way to verify exactly how much CPU/memory is allocated at any given time would be to submit another test job array and get the details for the array to look at which jobs are running in which partition and exactly how much memory is allocated for each job. Can you run 'scontrol show job <array job_id>' with another test array and send that output? It would also be good to have the 'scontrol show assoc flags=qos' output while the array has running jobs too to compare those numbers with what is showing up in the job details. I apologize that I didn't ask for that output when you ran the last test.
Thanks,
Ben
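As an aside, the ceilings the limits should produce can be computed directly from the QoS definitions. A minimal sketch for the 16G-per-task test array (the --mem value from the wrapper script; the limit values are the MaxTRESPU memory caps from this ticket, in GB):

```shell
#!/bin/sh
# Expected per-QoS task and memory ceilings, derived from the MaxTRESPU
# memory limits (values in GB; integer division gives whole tasks).
norm_limit=3360   # normlimit QoS memory cap
hp_limit=560      # hpfsl QoS memory cap
per_task=16       # --mem per array task

n_nor=$((norm_limit / per_task))   # tasks that fit in the normal partition
n_hp=$((hp_limit / per_task))      # tasks that fit in the hipri partition
echo "n_tot $((n_nor + n_hp)), n_nor $n_nor, n_hp $n_hp"
echo "mem_tot $(((n_nor + n_hp) * per_task)), mem_nor $((n_nor * per_task)), mem_hp $((n_hp * per_task))"
```

With these inputs the expected figures are 210 + 35 = 245 running tasks and 3360 + 560 = 3920 GB allocated, which is what a header-free job count should converge to.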
Ben, I have already adapted the script for my test examples. I can confirm that the values shown for the two examples I described are correct. Is there anything else you would like me to check? Thanks, -Hugo

Hi Hugo,

OK, that's good. If you can run another test array that has more jobs start than should be allowed, I would like to see what squeue shows about the job, with some modification of the fields shown in the output. Can you run this while a problem array is running (replacing "<array job id>" with the job ID you get for the array)?

    squeue -j<array job id> -o "%18i %.9P %.8u %.2t %.8c %.8m"

Along with that output I would like to see the output of 'scontrol show assoc flags=qos' again, to compare with what squeue shows.

Thanks,
Ben

Ben, yeah, you were right: we discovered the issue. We were counting entries including the header, and after removing it the numbers just matched. Thank you for pointing that out! For a 1,000-task array job with each task needing 16G of memory we had:

    [17:07:49 hugo@hernandez cluster-emulated] ./count_running_jobs 16
    n_tot 245, n_nor 210, n_hp 35
    mem_tot 3920, mem_nor 3360, mem_hp 560

Now, about the problem with the bigger memory allocations like the ones submitted by Darby: in addition to the header problem when counting running jobs, we found that the emulated cluster was restricting memory per virtual compute node to the total physical memory (64GB) on the host running the emulated cluster. See the NodeName line in the slurm.conf file:

    NodeName=blade[91-94] Port=40091-40094 NodeHostName=hernandez NodeAddr=192.52.98.83 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 Procs=8 RealMemory=64139 Weight=10 Feature=ivy,core20

We were missing this information when trying to understand why the allocated cores/memory were below the limit values. One more thing: I tried to set up the emulated cluster to look like our production cluster, where we have compute nodes with 20, 28, and 32 cores, all with 128GB of memory.
When I translated their configuration I wasn't able to start slurmd on any of the virtual nodes, and ended up with only 8 cores and 64GB of memory on all of them, matching what we have on the host running the emulated cluster. Thoughts?

I'm glad to hear that the limits are being correctly enforced. We do have a parameter you can set for situations like this, where you're trying to emulate hardware that is different from the hardware you're testing with. You can set the following in your slurm.conf, and then you should be able to set the amount of memory and number of cores to whatever you want for testing purposes:

    SlurmdParameters=config_overrides

You can read more about this parameter here: https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdParameters

Let me know if you have problems getting that to work.

Thanks,
Ben

Hi Hugo, I wanted to see if you have any additional questions about this. Let me know if there is anything else I can do to help or if this ticket is OK to close. Thanks, Ben

Ben, thanks for reaching out. Please go ahead and close this ticket. Thanks for the help!

I'm glad things are working. Closing the ticket now.
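For reference, a minimal slurm.conf fragment for the emulation scenario discussed here might look like the following. This is a sketch, not the site's actual configuration: the node names are taken from the ticket, but the core/memory values (a 20-core, 128GB production node) and the socket split are assumptions.

```
# Sketch: emulate 128GB / 20-core production nodes on a 64GB, 8-core host.
# config_overrides tells slurmd to use the configured node description
# instead of the hardware it actually detects.
SlurmdParameters=config_overrides

# Assumed layout: 2 sockets x 10 cores = 20 cores, 128GB (131072 MB)
NodeName=blade[91-94] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=131072 Weight=10
```

Without config_overrides, slurmd validates the configured resources against the detected hardware, which is why the virtual nodes were clamped to the host's 8 cores and 64GB.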
Created attachment 17689 [details]
Slurm config files: slurm.conf, job_submit.lua, cgroup.conf

Hello there, we have recently configured QoS to control the total memory used per user on specific partitions. We have also configured a high-priority partition that allows users to run a small number of jobs at higher priority without having to use this partition explicitly: the job submission plugin (job_submit.lua) automatically adds this hipri partition to any job submitted to the correlated (normal) partition. A user's submitted jobs will run on the hipri partition first (up to the QoS limits), and all remaining jobs will run on the normal partition.

When we submit jobs to the normal partition, we have noticed that jobs hitting the QoS-based memory limits behave in a way we are trying to understand. The following QoSs apply to the two partitions:

    hpfsl     cpu=140,mem=560G
    normlimit cpu=840,mem=3360G

and we have set a default memory limit of 4GB in slurm.conf. The partitions are set as:

    PartitionName=normal Nodes=normal Priority=10000 State=UP OverSubscribe=Yes Default=YES MaxTime=08:00:00 QoS=normlimit PreemptMode=off DefaultTime=04:00:00 MaxNodes=10
    PartitionName=hpnormal Nodes=normal Priority=50000 State=UP OverSubscribe=Yes Default=No MaxTime=08:00:00 QoS=hpfsl PreemptMode=off DefaultTime=04:00:00 MaxNodes=10

If we submit a 1,000-task array job, we would expect it to be balanced across the two partitions with a total memory use of 3,920GB (in this case the CPU limit won't be reached, because the memory limit is hit first), but this is what we are getting (two examples from the same job):

- (each task using 5GB of memory) Once submitted, 740 tasks ran on the (emulated) cluster: 626 on the normal partition and 114 on the hipri partition (hpnormal). The total memory used was 3,700GB: 3,130GB from the normal partition and 570GB from the hipri partition. We expected the array job to allocate 3,920GB across 784 running tasks. We weren't expecting tasks on the hipri partition to exceed 560GB.

- (each task using 16GB of memory) Once submitted, 249 tasks ran on the cluster: 212 on the normal partition and 37 on the hipri partition. The total memory used was 3,984GB: 3,392GB from the normal partition and 592GB from the hipri partition. For both partitions the memory limits are exceeded. We were expecting only 245 tasks to be allocated.

Can you please help us understand how this works? We are attaching the relevant configuration files from our emulated cluster, as well as details from an open GitLab issue that explains the problem. Our dummy array job uses this wrapper script:

    #!/bin/bash
    #SBATCH --job-name=testme
    #SBATCH --ntasks=1
    #SBATCH --mem=16G
    #SBATCH --array=1-1000
    echo "My SLURM_ARRAY_JOB_ID is $SLURM_ARRAY_JOB_ID."
    echo "My SLURM_ARRAY_TASK_ID is $SLURM_ARRAY_TASK_ID"
    echo "Executing on the machine:" $(hostname)
    uptime
    date
    hostname
    whoami
    sleep 300
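The QoS limits described in this ticket would typically be created with sacctmgr along these lines. This is a sketch under the assumption of a working slurmdbd: the QoS names and limit values are taken from the ticket, but any site-specific flags or defaults are omitted.

```shell
# Sketch: define the two QoSs with per-user TRES caps. MaxTRESPerUser is
# the sacctmgr name for the limit shown as MaxTRESPU in scontrol output.
# The -i flag answers "yes" to the commit prompt.
sacctmgr -i add qos normlimit
sacctmgr -i modify qos normlimit set MaxTRESPerUser=cpu=840,mem=3360G

sacctmgr -i add qos hpfsl
sacctmgr -i modify qos hpfsl set MaxTRESPerUser=cpu=140,mem=560G

# Verify the resulting limits
sacctmgr show qos format=Name,MaxTRESPerUser
```

Each QoS is then attached to its partition via the QoS= option on the PartitionName lines, as in the configuration above.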