| Summary: | seff showing 0% usage when requesting all memory with --mem=0 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Alex Mamach <alex.mamach> |
| Component: | Other | Assignee: | Oscar Hernández <oscar.hernandez> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | marshall, tripiana |
| Version: | 22.05.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Northwestern | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 23.02.0pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Alex Mamach
2022-12-06 12:35:42 MST
Hi Alex,

I wonder if you can take a look at the amount of memory that was reported to have been allocated to the job according to sacct. It's possible that you are using a SelectTypeParameter that does not track memory, in which case requesting --mem=0 will result in it showing that 0 memory was allocated. Here's an example of how that would look when I'm using CR_CPU:

$ scontrol show config | grep CR_CPU
SelectTypeParameters = CR_CPU
$ sbatch -n1 --mem=0 --wrap='srun sleep 10'
Submitted batch job 9560
$ sacct -j 9560 --format=jobid,alloctres%35
JobID        AllocTRES
------------ -----------------------------------
9560         billing=1,cpu=1,node=1
9560.batch   cpu=1,mem=0,node=1
9560.extern  billing=1,cpu=1,node=1
9560.0       cpu=1,mem=0,node=1

But you can see that if I change my parameter to CR_CPU_Memory and request 0 memory that the amount allocated shows as being 15678M (which is all the configured memory on the node).

$ scontrol show config | grep CR_CPU
SelectTypeParameters = CR_CPU_MEMORY
$ sbatch -n1 --mem=0 --wrap='srun sleep 10'
Submitted batch job 9561
$ sacct -j 9561 --format=jobid,alloccpus,alloctres%35
JobID         AllocCPUS AllocTRES
------------ ---------- -----------------------------------
9561                  1 billing=1,cpu=1,mem=15678M,node=1
9561.batch            1 cpu=1,mem=15678M,node=1
9561.extern           1 billing=1,cpu=1,mem=15678M,node=1
9561.0                1 cpu=1,mem=15678M,node=1

Some of my colleagues have also been looking at this a little and think it could be a problem with seff using the amount of requested memory vs allocated memory. If it's not an issue with the SelectTypeParameter then can you send the output of the sacct command that looks like this:

sacct -j 8409275 --format=jobid,alloctres%45,reqtres%45

Thanks,
Ben

Hi Ben,

Thanks for your response! Our SelectType parameters are SelectTypeParameters = CR_CORE_MEMORY, which I believe should be tracking memory.
Here's the sacct output you requested as well:

sacct -j 8409275 --format=jobid,alloctres%45,reqtres%45
JobID        AllocTRES                                     ReqTRES
------------ --------------------------------------------- ---------------------------------------------
8409275      billing=243,cpu=2,mem=249245M,node=1          billing=180,cpu=2,mem=184563M,node=1
8409275.bat+ cpu=2,mem=249245M,node=1
8409275.ext+ billing=243,cpu=2,mem=249245M,node=1

Thank you for looking at this!

That does look like it should be showing a non-zero amount for the memory. I looked at what seff uses for this data and it looks at the 'reqmem' data rather than the reqtres. Can I have you run an sacct command one more time, like this:

sacct -j 8409275 --format=jobid,reqmem

There was also a change surrounding ReqMem in 21.08.
https://github.com/SchedMD/slurm/commit/2acdf51e9e1fde8fe28dce847aa769e0f6455e60

The ticket shows you're using 22.05, but can I have you confirm by running 'slurmdbd -V'?

Was seff showing a non-zero amount for memory when you used 'mem=0' to request it all previously, or were you not attempting this before? If there was a change in behavior do you know if it accompanied any change to the system?

Thanks,
Ben

Hi Alex,

I wanted to follow up and see if you had a chance to look further into the value reported for ReqMem for this job. Let me know if you still need help with this issue.

Thanks,
Ben

Hi Ben,

Thanks for following up! I hope you had a good holiday season and new year!

Here's the output of sacct -j 8409275 --format=jobid,reqmem

JobID        ReqMem
------------ ----------
8409275      184563M
8409275.bat+
8409275.ext+

The output I got from slurmdbd -V was slurm 22.05.3

Thanks!

Hi Alex,

Thanks, I did enjoy the break. Hopefully you did as well. It's interesting that ReqMem also shows the correct amount of memory. I'm curious if this is something you can reproduce. If you submit a test job that uses "mem=0" and use seff to generate a report, do you see the same behavior for memory?

Thanks,
Ben

Hi Ben,
I did some testing on my own and I can reproduce the issue reliably. I ran the following script, and seff reports 0.00 MB of requested memory (and therefore 0.00% memory efficiency) even though sacct shows the correct amount:
#!/bin/bash
#SBATCH -A t3982
#SBATCH -p testing
#SBATCH -N 1
#SBATCH -t 60:00
#SBATCH --mem=0
#SBATCH --ntasks-per-node=2
echo {1..100000000}
sleep 300
seff 390181
Job ID: 390181
Cluster: quest
User/Group: alex/alex
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:01:57
CPU Efficiency: 14.00% of 00:13:56 core-walltime
Job Wall-clock time: 00:06:58
Memory Utilized: 17.88 GB
Memory Efficiency: 0.00% of 0.00 MB
sacct -j 390181 --format=jobid,reqmem
JobID        ReqMem
------------ ----------
390181       87097M
390181.batch
390181.exte+
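As a rough illustration of the numbers in the report above, the sketch below computes memory efficiency as utilized memory divided by a chosen basis. The function name `memory_efficiency` and the zero-basis handling are assumptions made for illustration, not seff's actual code; the conversion of 17.88 GB to MB assumes 1 GB = 1024 MB.

```python
def memory_efficiency(utilized_mb: float, basis_mb: float) -> float:
    """Return utilized memory as a percentage of basis_mb.

    A basis of 0 yields 0.0 rather than dividing by zero (an assumption
    chosen to mirror the "0.00% of 0.00 MB" line in the seff report).
    """
    if basis_mb <= 0:
        return 0.0
    return 100.0 * utilized_mb / basis_mb


# "Memory Utilized: 17.88 GB" from the seff report for job 390181,
# converted to MB assuming 1 GB = 1024 MB.
utilized_mb = 17.88 * 1024

# Basis of 0: what a stored request of 0 (from --mem=0) would give.
print(f"{memory_efficiency(utilized_mb, 0):.2f}%")

# Basis of 87097 MB: the ReqMem value sacct shows for the same job.
print(f"{memory_efficiency(utilized_mb, 87097):.2f}%")
```

With the ReqMem value that sacct reports as the basis, the same job would come out at roughly 21% rather than 0%.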
Hi Alex,

Seff was getting ReqMem from a field in the database that just stores the memory requested by the user, effectively storing 0 when --mem=0 is used.

We have just pushed a couple of commits to 23.02:

1. 7d2c32ecb4 Fix seff when jobs submitted with --mem=0
   In seff: get reqmem from TresReq (stores real value instead of 0). It is the same value displayed in sacct, so it will make seff information consistent with sacct.

2. 9e87bcd349 Improve memory efficiency computing in seff
   In seff: use value from allocTres to compute efficiency. This will make efficiency computation more accurate than before, as we will be taking into account allocated memory instead of requested memory.

Many thanks for reporting the issue.
Oscar

Hi Oscar,

Thank you so much for the update and for fixing this!

Thanks,
Alex |
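The idea behind the two commits, reading memory from the TRES strings instead of the stored request, can be sketched as follows. This is a minimal Python illustration, not the seff (Perl) implementation: the helper `tres_mem_mb`, the unit table, and the treatment of an unsuffixed value as MB are all assumptions. The input strings are the AllocTRES and ReqTRES values sacct printed for job 8409275 earlier in this ticket.

```python
from typing import Optional

# Multipliers to convert a suffixed value to MB (assumed binary units).
_UNIT_TO_MB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 ** 2}


def tres_mem_mb(tres: str) -> Optional[float]:
    """Extract the mem= entry from a TRES string and return it in MB.

    Returns None when the string has no mem entry, as in the CR_CPU
    example where AllocTRES was "billing=1,cpu=1,node=1".
    """
    for field in tres.split(","):
        name, _, value = field.partition("=")
        value = value.strip()
        if name.strip() != "mem" or not value:
            continue
        unit = value[-1].upper()
        if unit in _UNIT_TO_MB:
            return float(value[:-1]) * _UNIT_TO_MB[unit]
        return float(value)  # no suffix: treated as MB (assumption)
    return None


alloc_mb = tres_mem_mb("billing=243,cpu=2,mem=249245M,node=1")  # AllocTRES
req_mb = tres_mem_mb("billing=180,cpu=2,mem=184563M,node=1")    # ReqTRES

print(f"allocated: {alloc_mb} MB, requested: {req_mb} MB")
```

For this job the allocated value (249245M) is larger than the requested one (184563M), which is why the second commit's choice of allocated memory as the efficiency basis changes the result.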