Ticket 15569

Summary: seff showing 0% usage when requesting all memory with --mem=0
Product: Slurm
Reporter: Alex Mamach <alex.mamach>
Component: Other
Assignee: Oscar Hernández <oscar.hernandez>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: marshall, tripiana
Version: 22.05.3
Hardware: Linux
OS: Linux
Site: Northwestern
Version Fixed: 23.02.0pre1

Description Alex Mamach 2022-12-06 12:35:42 MST
When our users submit a job with the --mem=0 flag to request all memory on a node, seff reports their memory efficiency as 0%.

Example job script:

sacct -j 8409275 -B
Batch Script for 8409275
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -A p31516
#SBATCH -p long
#SBATCH -N 1
#SBATCH -t 47:30:00
#SBATCH --mem=0
#SBATCH --ntasks-per-node=2
#SBATCH --output=outlog_gnet_scrape.log

Example seff output:

seff 8409275
Job ID: 8409275
Cluster: quest
User/Group: ourUser/ourUser
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 01:11:40
CPU Efficiency: 13.91% of 08:35:04 core-walltime
Job Wall-clock time: 04:17:32
Memory Utilized: 68.65 GB
Memory Efficiency: 0.00% of 0.00 MB

If you need any additional information please let me know!

Thanks!

Alex
Comment 3 Ben Roberts 2022-12-07 09:00:35 MST
Hi Alex,

Could you check how much memory sacct reports as allocated to the job? It's possible that you are using a SelectTypeParameters value that does not track memory, in which case requesting --mem=0 results in 0 memory being shown as allocated.

Here's an example of how that would look when I'm using CR_CPU:

$ scontrol show config | grep CR_CPU
SelectTypeParameters    = CR_CPU

$ sbatch -n1 --mem=0 --wrap='srun sleep 10'
Submitted batch job 9560

$ sacct -j 9560 --format=jobid,alloctres%35
JobID                                  AllocTRES 
------------ ----------------------------------- 
9560                      billing=1,cpu=1,node=1 
9560.batch                    cpu=1,mem=0,node=1 
9560.extern               billing=1,cpu=1,node=1 
9560.0                        cpu=1,mem=0,node=1

But if I change my parameter to CR_CPU_Memory and again request --mem=0, the amount allocated shows as 15678M (which is all the configured memory on the node).

$ scontrol show config | grep CR_CPU
SelectTypeParameters    = CR_CPU_MEMORY

$ sbatch -n1 --mem=0 --wrap='srun sleep 10'
Submitted batch job 9561

$ sacct -j 9561 --format=jobid,alloccpus,alloctres%35
JobID         AllocCPUS                           AllocTRES 
------------ ---------- ----------------------------------- 
9561                  1   billing=1,cpu=1,mem=15678M,node=1 
9561.batch            1             cpu=1,mem=15678M,node=1 
9561.extern           1   billing=1,cpu=1,mem=15678M,node=1 
9561.0                1             cpu=1,mem=15678M,node=1 
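
The difference between the two configurations above can be checked with a short script. This is only a sketch, not part of Slurm: `tracks_memory` is a hypothetical helper, and it assumes that the memory-tracking options are the ones whose names end in `_Memory` (CR_Memory, CR_CPU_Memory, CR_Core_Memory, CR_Socket_Memory).

```python
def tracks_memory(config_line: str) -> bool:
    """Return True if a 'SelectTypeParameters = ...' line names a memory-tracking option.

    Hypothetical helper. Assumption: memory-tracking options end in '_MEMORY'
    (compared case-insensitively), e.g. CR_CPU_Memory or CR_Core_Memory.
    """
    # Keep only the text to the right of the first '='.
    _, _, value = config_line.partition("=")
    # Several options may be listed, separated by commas.
    options = [opt.strip().upper() for opt in value.split(",")]
    return any(opt.endswith("_MEMORY") for opt in options)


# Lines as printed by 'scontrol show config | grep CR_CPU' in the two runs above.
print(tracks_memory("SelectTypeParameters    = CR_CPU"))
print(tracks_memory("SelectTypeParameters    = CR_CPU_MEMORY"))
```

With the first line the function returns False, which matches the mem=0 rows in the first sacct output; with the second it returns True.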



Some of my colleagues have also been looking at this and think it could be a problem with seff using the requested memory rather than the allocated memory. If it's not an issue with SelectTypeParameters, can you send the output of an sacct command like this:
sacct -j 8409275 --format=jobid,alloctres%45,reqtres%45

Thanks,
Ben
Comment 4 Alex Mamach 2022-12-07 12:32:04 MST
Hi Ben,

Thanks for your response! Our setting is SelectTypeParameters = CR_CORE_MEMORY, which I believe should track memory.

Here's the sacct output you requested as well:

sacct -j 8409275 --format=jobid,alloctres%45,reqtres%45
JobID                                            AllocTRES                                       ReqTRES
------------ --------------------------------------------- ---------------------------------------------
8409275               billing=243,cpu=2,mem=249245M,node=1          billing=180,cpu=2,mem=184563M,node=1
8409275.bat+                      cpu=2,mem=249245M,node=1
8409275.ext+          billing=243,cpu=2,mem=249245M,node=1
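
For anyone comparing these two columns in a script, here is a rough sketch of pulling the mem value out of each TRES string. The helper names `parse_tres` and `mem_to_mb` are made up for illustration, and treating a bare number as megabytes is an assumption.

```python
def parse_tres(tres: str) -> dict:
    """Split a TRES string such as 'cpu=2,mem=249245M,node=1' into a dict of strings."""
    return dict(item.split("=", 1) for item in tres.split(",") if item)


def mem_to_mb(value: str) -> float:
    """Convert a memory string such as '249245M' or '15G' to megabytes."""
    units = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 ** 2}
    suffix = value[-1].upper()
    if suffix in units:
        return float(value[:-1]) * units[suffix]
    return float(value)  # assumption: a bare number is already in megabytes


# AllocTRES and ReqTRES for job 8409275, copied from the output above.
alloc = parse_tres("billing=243,cpu=2,mem=249245M,node=1")
req = parse_tres("billing=180,cpu=2,mem=184563M,node=1")

print("allocated MB:", mem_to_mb(alloc["mem"]))  # 249245.0
print("requested MB:", mem_to_mb(req["mem"]))    # 184563.0
```

Both values are non-zero, so neither column explains the "0.00 MB" that seff prints.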

Thank you for looking at this!
Comment 5 Ben Roberts 2022-12-08 08:49:32 MST
That does look like it should show a non-zero amount for the memory. I looked at what seff uses for this data, and it reads the 'reqmem' field rather than reqtres. Could you run one more sacct command, like this:
sacct -j 8409275 --format=jobid,reqmem

There was also a change surrounding ReqMem in 21.08.  
https://github.com/SchedMD/slurm/commit/2acdf51e9e1fde8fe28dce847aa769e0f6455e60

The ticket shows you're using 22.05, but can I have you confirm by running 'slurmdbd -V'?  

Was seff showing a non-zero amount for memory when you used 'mem=0' to request it all previously, or were you not attempting this before?  If there was a change in behavior do you know if it accompanied any change to the system?

Thanks,
Ben
Comment 6 Ben Roberts 2023-01-03 10:43:56 MST
Hi Alex,

I wanted to follow up and see if you had a chance to look further into the value reported for ReqMem for this job.  Let me know if you still need help with this issue.

Thanks,
Ben
Comment 7 Alex Mamach 2023-01-04 15:09:09 MST
Hi Ben,

Thanks for following up! I hope you had a good holiday season and new year!

Here's the output of sacct -j 8409275 --format=jobid,reqmem

JobID            ReqMem
------------ ----------
8409275         184563M
8409275.bat+
8409275.ext+

The output I got from slurmdbd -V was slurm 22.05.3

Thanks!
Comment 8 Ben Roberts 2023-01-05 12:45:03 MST
Hi Alex,

Thanks, I did enjoy the break.  Hopefully you did as well.

It's interesting that ReqMem also shows the correct amount of memory. I'm curious whether this is something you can reproduce. If you submit a test job that uses --mem=0 and use seff to generate a report, do you see the same behavior for memory?

Thanks,
Ben
Comment 9 Alex Mamach 2023-01-06 20:59:01 MST
Hi Ben,

I did some testing on my own and I can reproduce the issue reliably. I ran the following script, and it looks like seff is reporting 0.00 MB of requested memory despite sacct showing the correct amount:


#!/bin/bash

#SBATCH -A t3982
#SBATCH -p testing
#SBATCH -N 1
#SBATCH -t 60:00
#SBATCH --mem=0
#SBATCH --ntasks-per-node=2

echo {1..100000000}
sleep 300


seff 390181
Job ID: 390181
Cluster: quest
User/Group: alex/alex
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:01:57
CPU Efficiency: 14.00% of 00:13:56 core-walltime
Job Wall-clock time: 00:06:58
Memory Utilized: 17.88 GB
Memory Efficiency: 0.00% of 0.00 MB

sacct -j 390181 --format=jobid,reqmem
JobID            ReqMem
------------ ----------
390181           87097M
390181.batch
390181.exte+
Comment 24 Oscar Hernández 2023-02-13 09:16:21 MST
Hi Alex,

seff was getting ReqMem from a database field that stores only the memory value the user requested, which is effectively 0 when --mem=0 is used.

We have just pushed two commits to 23.02:

1. 7d2c32ecb4 Fix seff when jobs submitted with --mem=0

In seff: get reqmem from TresReq, which stores the real value instead of 0. This is the same value that sacct displays, so seff's output will be consistent with sacct.

2. 9e87bcd349 Improve memory efficiency computing in seff
 
In seff: use value from allocTres to compute efficiency. This will make efficiency computation more accurate than before, as we will be taking into account allocated memory instead of requested memory.
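
The effect of the two commits on the numbers reported earlier in this ticket can be illustrated with a small calculation. This is a sketch only and not the seff source: `mem_efficiency` is a hypothetical function, the zero guard is an assumption about how a 0 base ends up as 0.00%, and GB is taken as 1024 MB.

```python
def mem_efficiency(used_mb: float, base_mb: float) -> float:
    """Percentage of base_mb that was used; 0.0 when base_mb is 0.

    Hypothetical function. The guard against a zero base is an assumption
    made so that the --mem=0 case can be shown without a division error.
    """
    if base_mb <= 0:
        return 0.0
    return 100.0 * used_mb / base_mb


used_mb = 68.65 * 1024      # "Memory Utilized: 68.65 GB" for job 8409275
stored_request_mb = 0.0     # value held by the old field when --mem=0 is used
alloc_mb = 249245.0         # mem=249245M from AllocTRES for the same job

# Before the fix: the base is 0, so the report reads "0.00% of 0.00 MB".
print(f"{mem_efficiency(used_mb, stored_request_mb):.2f}% of {stored_request_mb:.2f} MB")

# After the fix: the base is the allocated memory.
print(f"{mem_efficiency(used_mb, alloc_mb):.2f}% of {alloc_mb / 1024:.2f} GB")
```

Under these assumptions the second line comes out as 28.20% of 243.40 GB for the original job.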

Many thanks for reporting the issue.

Oscar
Comment 25 Alex Mamach 2023-02-13 10:00:14 MST
Hi Oscar,

Thank you so much for the update and for fixing this!

Thanks,

Alex