| Summary: | Memory QoS and multi-partition job submission request | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | NASA JSC Aerolab <JSC-DL-AEROLAB-ADMIN> |
| Component: | Configuration | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.02.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Johnson Space Center | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Slurm config files: slurm.conf, job_submit.lua, cgroup.conf; QoS Limits Info | | |
Hello,

Thanks for the details and information you included. From the PDF you included, it looks like you are using the MaxTRESPU limit on the QOS to limit the amount of memory and the number of CPUs. This should work to limit the CPUs/memory allocated to a user at any given time. It's possible that another limit is coming into play or that the limits aren't being picked up properly. Can I have you send the output showing the full details of the QOS definitions?

    sacctmgr show qos

I would also like to see how the usage is being reported when you have an array that is using more memory than it should. While the system is in this state, can you run the following command and send the output?

    scontrol show assoc flags=qos

Thanks,
Ben

Created attachment 17702 [details]
QoS Limits Info
Ben, thanks for your quick answer. Yes, in addition to the MaxTRESPU we have a couple of MaxTRES limits, but they should not be interfering with the corresponding partitions. Take a look at the attached files containing the output of the commands you asked us to run.

Thanks for providing that information. It does look like the MaxTRESPU definitions on the QOS should be the only ones enforced with the way you've got them configured. I looked at the output of 'scontrol show assoc flags=qos' to see how these limits are being enforced, and this output does show that each of these QOSes is using up to the defined limits and not more. The lines I'm looking at are here:
QOS=normlimit(67)
...
MaxTRESPU=cpu=840(210),mem=3440640(3440640),energy=N(0),node=N(70),billing=N(210),fs/disk=N(0),vmem=N(0),pages=N(0)
...
QOS=hpfsl(71)
...
MaxTRESPU=cpu=140(35),mem=573440(573440),energy=N(0),node=N(12),billing=N(35),fs/disk=N(0),vmem=N(0),pages=N(0)
These show the different attributes that can have limits placed on them, like cpu, mem, etc. The first number is the limit to be enforced (with memory in megabytes), or 'N' if no limit is defined. The number in parentheses is the amount of that resource currently in use for the QOS. You can see that for both of these QOSes the memory limit and the amount in use show the same value.
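The `limit(used)` format of these lines can be pulled apart with a short shell loop. This is just an illustration of how to read the output, not a SchedMD tool; the sample line is copied from the output above.

```shell
#!/bin/sh
# Parse a MaxTRESPU line from `scontrol show assoc flags=qos` into
# per-resource limit/used pairs. 'N' means no limit is defined, and
# mem values are in megabytes.
line='MaxTRESPU=cpu=140(35),mem=573440(573440),energy=N(0),node=N(12)'

echo "${line#MaxTRESPU=}" | tr ',' '\n' | while IFS='=' read -r res val; do
    limit=${val%%\(*}     # text before the parenthesis: the enforced limit
    used=${val#*\(}       # text inside the parentheses: current usage
    used=${used%\)}
    printf '%s: limit=%s used=%s\n' "$res" "$limit" "$used"
done
```

For the sample line this prints one row per resource, e.g. `cpu: limit=140 used=35` and `mem: limit=573440 used=573440`, making the "limit equals usage" condition easy to spot.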
Reading through the output of the ticket you sent earlier, I think I may see why the numbers you were looking for didn't align. In one of the updates Darby Vickers shows a script he put together to calculate the amount of memory used by each partition. He calculates the number of jobs running in each queue like this:
squeue -M FSL -p normal -t R | wc -l
The problem with this is that it includes a header line. You can add the '-h' flag to print the output without the header. Here is an example:
$ squeue -phigh -tR
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2069_19 high wrap ben R 4:32 1 node01
2069_20 high wrap ben R 4:32 1 node02
$ squeue -phigh -tR -h
2069_19 high wrap ben R 4:35 1 node01
2069_20 high wrap ben R 4:35 1 node02
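The resulting off-by-one can be reproduced without a cluster; here the squeue output is simulated with a fixed string (the job and node names are just placeholders), so the snippet runs anywhere.

```shell
#!/bin/sh
# Demonstrate that counting squeue lines without `-h` includes the header.
with_header='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2069_19 high wrap ben R 4:32 1 node01
2069_20 high wrap ben R 4:32 1 node02'

# Piping straight to wc -l counts the header as a "job": reports 3
echo "$with_header" | wc -l

# Dropping the header line (what `squeue -h` does) gives the true count: 2
echo "$with_header" | tail -n +2 | wc -l
```

So every partition counted this way is over-reported by exactly one job, which matters once the count is multiplied by per-job memory.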
If you subtract 32 from the amount of memory reported in the hpnormal QOS output in that ticket update, it goes from 608G to 576G. This is still higher than the 560G you have defined, but that could be due to rounding in the conversion from MB to GB. The usage in the normal QOS was lower than expected, but something else may have been running that was preventing some of these jobs from being able to start.
I think the best way to verify exactly how much CPU/memory is allocated at any given time would be to submit another test job array and get the details for the array to look at which jobs are running in which partition and exactly how much memory is allocated for each job. Can you run 'scontrol show job <array job_id>' with another test array and send that output? It would also be good to have the 'scontrol show assoc flags=qos' output while the array has running jobs too to compare those numbers with what is showing up in the job details. I apologize that I didn't ask for that output when you ran the last test.
Thanks,
Ben
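As an aside, the ceilings the limits should produce can be computed directly from the QoS definitions. A minimal sketch for the 16G-per-task test array (the --mem value from the wrapper script; the limit values are the MaxTRESPU memory caps from this ticket, in GB):

```shell
#!/bin/sh
# Expected per-QoS task and memory ceilings, derived from the MaxTRESPU
# memory limits (values in GB; integer division gives whole tasks).
norm_limit=3360   # normlimit QoS memory cap
hp_limit=560      # hpfsl QoS memory cap
per_task=16       # --mem per array task

n_nor=$((norm_limit / per_task))   # tasks that fit in the normal partition
n_hp=$((hp_limit / per_task))      # tasks that fit in the hipri partition
echo "n_tot $((n_nor + n_hp)), n_nor $n_nor, n_hp $n_hp"
echo "mem_tot $(((n_nor + n_hp) * per_task)), mem_nor $((n_nor * per_task)), mem_hp $((n_hp * per_task))"
```

With these inputs the expected figures are 210 + 35 = 245 running tasks and 3360 + 560 = 3920 GB allocated, which is what a header-free job count should converge to.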
Ben, I have already adapted the script for my test examples. I can confirm that the values shown for the two examples I described are correct. Is there anything else you would like me to check? Thanks, -Hugo

Hi Hugo,

OK, that's good. If you can run another test array that has more jobs start than should be allowed, I would like to see what squeue shows about the job, with some modification of the fields shown in the output. Can you run this while a problem array is running (replacing "<array job id>" with the job ID you get for the array)?

    squeue -j<array job id> -o "%18i %.9P %.8u %.2t %.8c %.8m"

Along with that output I would like to see the output of 'scontrol show assoc flags=qos' again, to compare with what squeue shows.

Thanks,
Ben

Ben, yeah, you were right: we discovered the issue. We were counting entries including the header, and after removing it the numbers just matched. Thank you for pointing that out! For a 1,000-task array job with each task needing 16G of memory we had:

    [17:07:49 hugo@hernandez cluster-emulated] ./count_running_jobs 16
    n_tot 245, n_nor 210, n_hp 35
    mem_tot 3920, mem_nor 3360, mem_hp 560

Now, about the problem with the bigger memory allocations like the ones submitted by Darby: in addition to the header problem when counting running jobs, we found that the emulated cluster was restricting memory per virtual compute node to the total physical memory (64GB) on the host running the emulated cluster. See the NodeName line in the slurm.conf file:

    NodeName=blade[91-94] Port=40091-40094 NodeHostName=hernandez NodeAddr=192.52.98.83 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 Procs=8 RealMemory=64139 Weight=10 Feature=ivy,core20

We were missing this information when trying to understand why the allocated cores/memory were below the limit values. One more thing: I tried to set up the emulated cluster to look like our production cluster, where we have compute nodes with 20, 28, and 32 cores, all with 128GB of memory.
When I translated their configuration I wasn't able to start slurmd on any of the virtual nodes, and ended up with only 8 cores and 64GB of memory on all of them, matching what we have on the host running the emulated cluster. Thoughts?

I'm glad to hear that the limits are being correctly enforced. We do have a parameter you can set for situations like this, where you're trying to emulate hardware that is different from the hardware you're testing with. You can set the following in your slurm.conf, and then you should be able to set the amount of memory and number of cores to whatever you want for testing purposes:

    SlurmdParameters=config_overrides

You can read more about this parameter here: https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdParameters

Let me know if you have problems getting that to work.

Thanks,
Ben

Hi Hugo, I wanted to see if you have any additional questions about this. Let me know if there is anything else I can do to help or if this ticket is OK to close. Thanks, Ben

Ben, thanks for reaching out. Please go ahead and close this ticket. Thanks for the help!

I'm glad things are working. Closing the ticket now.
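For reference, a minimal slurm.conf fragment for the emulation scenario discussed here might look like the following. This is a sketch, not the site's actual configuration: the node names are taken from the ticket, but the core/memory values (a 20-core, 128GB production node) and the socket split are assumptions.

```
# Sketch: emulate 128GB / 20-core production nodes on a 64GB, 8-core host.
# config_overrides tells slurmd to use the configured node description
# instead of the hardware it actually detects.
SlurmdParameters=config_overrides

# Assumed layout: 2 sockets x 10 cores = 20 cores, 128GB (131072 MB)
NodeName=blade[91-94] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=131072 Weight=10
```

Without config_overrides, slurmd validates the configured resources against the detected hardware, which is why the virtual nodes were clamped to the host's 8 cores and 64GB.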
Created attachment 17689 [details]
Slurm config files: slurm.conf, job_submit.lua, cgroup.conf

Hello there, we have recently configured QoS to control the total memory used per user on specific partitions. We have also configured a high-priority partition that allows users to run a small number of jobs at higher priority without having to use this partition explicitly: the job submission plugin (job_submit.lua) automatically adds this hipri partition to any job submitted to the correlated (normal) partition. A user's submitted jobs will run on the hipri partition first (up to the QoS limits), and all remaining jobs will run on the normal partition.

When we submit jobs to the normal partition, we have noticed that jobs hitting the QoS-based memory limits behave in a way we are trying to understand. The following QoSs apply to the two partitions:

    hpfsl     cpu=140,mem=560G
    normlimit cpu=840,mem=3360G

and we have set a default memory limit of 4GB in slurm.conf. The partitions are set as:

    PartitionName=normal Nodes=normal Priority=10000 State=UP OverSubscribe=Yes Default=YES MaxTime=08:00:00 QoS=normlimit PreemptMode=off DefaultTime=04:00:00 MaxNodes=10
    PartitionName=hpnormal Nodes=normal Priority=50000 State=UP OverSubscribe=Yes Default=No MaxTime=08:00:00 QoS=hpfsl PreemptMode=off DefaultTime=04:00:00 MaxNodes=10

If we submit a 1,000-task array job, we would expect it to be balanced across the two partitions with a total memory use of 3,920GB (in this case the CPU limit won't be reached, because the memory limit is hit first), but this is what we are getting (two examples from the same job):

- (each task using 5GB of memory) Once submitted, 740 tasks ran on the (emulated) cluster: 626 on the normal partition and 114 on the hipri partition (hpnormal). The total memory used was 3,700GB: 3,130GB from the normal partition and 570GB from the hipri partition. We expected the array job to allocate 3,920GB across 784 running tasks. We weren't expecting tasks on the hipri partition to exceed 560GB.

- (each task using 16GB of memory) Once submitted, 249 tasks ran on the cluster: 212 on the normal partition and 37 on the hipri partition. The total memory used was 3,984GB: 3,392GB from the normal partition and 592GB from the hipri partition. For both partitions the memory limits are exceeded. We were expecting only 245 tasks to be allocated.

Can you please help us understand how this works? We are attaching the relevant configuration files from our emulated cluster, as well as details from an open GitLab issue that explains the problem. Our dummy array job uses this wrapper script:

    #!/bin/bash
    #SBATCH --job-name=testme
    #SBATCH --ntasks=1
    #SBATCH --mem=16G
    #SBATCH --array=1-1000
    echo "My SLURM_ARRAY_JOB_ID is $SLURM_ARRAY_JOB_ID."
    echo "My SLURM_ARRAY_TASK_ID is $SLURM_ARRAY_TASK_ID"
    echo "Executing on the machine:" $(hostname)
    uptime
    date
    hostname
    whoami
    sleep 300
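The QoS limits described in this ticket would typically be created with sacctmgr along these lines. This is a sketch under the assumption of a working slurmdbd: the QoS names and limit values are taken from the ticket, but any site-specific flags or defaults are omitted.

```shell
# Sketch: define the two QoSs with per-user TRES caps. MaxTRESPerUser is
# the sacctmgr name for the limit shown as MaxTRESPU in scontrol output.
# The -i flag answers "yes" to the commit prompt.
sacctmgr -i add qos normlimit
sacctmgr -i modify qos normlimit set MaxTRESPerUser=cpu=840,mem=3360G

sacctmgr -i add qos hpfsl
sacctmgr -i modify qos hpfsl set MaxTRESPerUser=cpu=140,mem=560G

# Verify the resulting limits
sacctmgr show qos format=Name,MaxTRESPerUser
```

Each QoS is then attached to its partition via the QoS= option on the PartitionName lines, as in the configuration above.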