Ticket 12193

Summary: seff reporting impossibly high memory consumption for completed jobs
Product: Slurm Reporter: Alex Mamach <alex.mamach>
Component: User Commands    Assignee: Oriol Vilarrubi <jvilarru>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: ich+schedmd
Version: 20.02.6   
Hardware: Linux   
OS: Linux   
Site: Northwestern Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf

Description Alex Mamach 2021-08-02 20:08:36 MDT
Hi,

I'm aware seff results may be inaccurate if a job is still in progress or has failed, but occasionally we see impossibly high memory consumption reported by seff (30 TB of RAM consumed on a node with 256 GB of RAM) for successfully completed jobs.

Do you know if there are any factors that could cause this kind of behavior? I'm just looking for some general advice for users who are trying to profile the resource consumption of their jobs and get odd results back from seff.

Thanks!

Alex
Comment 1 Oriol Vilarrubi 2021-08-23 08:07:16 MDT
Hello Alex,

What seff does is basically query Slurm for the data, parse it, and present it to the user, so if you see strange data, the problem is most likely the data stored in Slurm. To debug this further, could you please run seff in debug mode for one of these jobs with impossible data? To do so, pass seff the -d option, like this:

seff -d 1234

The only difference is that it will also print the raw data, so we can see whether anything in the Slurm database looks wrong.

Could you please also attach your slurm.conf? That way I can check the other relevant settings, like JobAcctGatherType.

Greetings.
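For context, JobAcctGatherType is the slurm.conf parameter selecting the plugin that samples per-job memory and CPU usage; an illustrative fragment (example values only, not this site's actual settings) looks like:

```
# Example slurm.conf accounting-gather settings (illustrative values)
JobAcctGatherType=jobacct_gather/linux   # or jobacct_gather/cgroup
JobAcctGatherFrequency=30                # sampling interval, in seconds
```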
Comment 2 Oriol Vilarrubi 2021-09-02 07:06:38 MDT
Hello Alex,

Did you have time to run seff in debug mode?

Greetings.
Comment 3 Alex Mamach 2021-09-02 20:05:11 MDT
Hi Oriol,

Sorry for the delay, I needed to comb through the jobs to find a good candidate (some of the jobs I was previously looking at began reporting 0% memory utilization instead of their earlier >100% memory utilization).

seff -d 9007814
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 9007814  lmc0633 lmc0633 TIMEOUT quest 16 1 16 104857600 1 294 14709 1214633856 0

Job ID: 9007814
Cluster: quest
User/Group: lmc0633/lmc0633
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:04:54
CPU Efficiency: 0.12% of 2-17:22:24 core-walltime
Job Wall-clock time: 04:05:09
Memory Utilized: 1.13 TB (estimated maximum)
Memory Efficiency: 1158.37% of 100.00 GB (100.00 GB/node)

I've also attached our slurm.conf
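For reference, the percentages in the seff output follow directly from the raw fields on the "Slurm data:" line (Mem and Reqmem are in KB, Cput and Walltime in seconds). A small sketch of that arithmetic, my own reconstruction rather than seff's actual Perl code:

```python
# Reconstruct seff's efficiency figures from the raw fields of
# `seff -d 9007814`. Memory fields are KB; times are seconds.
# This mirrors the math only, not the seff implementation.

mem_kb     = 1214633856   # "Mem" column
reqmem_kb  = 104857600    # "Reqmem" column (100 GB requested)
cput_s     = 294          # "Cput" column (CPU time used)
walltime_s = 14709        # "Walltime" column
ncpus      = 16

mem_eff = 100.0 * mem_kb / reqmem_kb             # -> 1158.37 %
cpu_eff = 100.0 * cput_s / (ncpus * walltime_s)  # -> 0.12 %
mem_tb  = mem_kb / 1024**3                       # -> 1.13 TB

print(f"Memory Efficiency: {mem_eff:.2f}%")
print(f"CPU Efficiency: {cpu_eff:.2f}%")
print(f"Memory Utilized: {mem_tb:.2f} TB")
```

The figures match seff's report exactly, which is consistent with seff only presenting what the accounting record contains.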
Comment 4 Alex Mamach 2021-09-02 20:05:46 MDT
Created attachment 21152 [details]
slurm.conf
Comment 5 Oriol Vilarrubi 2021-09-13 07:00:37 MDT
Hello Alex,

As I suspected, the problem is not seff itself: the raw data coming from the DB is wrong (the Mem field, which in your example is 1214633856 KB, roughly 1158 GB).

So now we know the problem lies either in the data acquisition or in the storage of that data in the DB. Do you think you could run job 9007814 again, so we can see whether this happens again?

Greetings.
Comment 6 Oriol Vilarrubi 2021-09-13 09:51:07 MDT
(In reply to Oriol Vilarrubi from comment #5)

Also, in order to better isolate the problem, could you get the data from sacct directly? The command would be something like this:
sacct -j 9007814 -o JobID,TRESUsageInMax
Comment 7 Alex Mamach 2021-09-13 14:26:02 MDT
Hi Oriol,

I'll work on tracking down the job submission script and seeing if we can re-run it.

In the meantime, here's the output from sacct you requested:

         JobID                                                                   TRESUsageInMax
--------------                            -----------------------------------------------------
       9007814
9007814.extern           cpu=00:00:04,energy=0,fs/disk=214304173,mem=1952K,pages=6,vmem=140516K
     9007814.0  cpu=02:56:20,energy=0,fs/disk=11064919239,mem=75914616K,pages=39,vmem=79772964K
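As a quick consistency check (my own sketch, not a Slurm tool): the mem value in the TRESUsageInMax string for step 9007814.0 works out to roughly 72 GB, comfortably under the 100 GB request, whereas the record seff reads claimed 1.13 TB.

```python
# Parse the mem= entry out of a sacct TRESUsageInMax string and
# convert it to GB. Values like "75914616K" are KB; this sketch
# handles only the K suffix, which is what sacct printed above.

def tres_mem_gb(tres: str) -> float:
    fields = dict(item.split("=", 1) for item in tres.split(","))
    mem = fields["mem"]
    assert mem.endswith("K"), "only KB values handled in this sketch"
    return int(mem[:-1]) / 1024**2

step0 = ("cpu=02:56:20,energy=0,fs/disk=11064919239,"
         "mem=75914616K,pages=39,vmem=79772964K")
print(f"step 9007814.0 max RSS: {tres_mem_gb(step0):.1f} GB")  # ~72.4 GB
```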
Comment 8 Alex Mamach 2021-10-04 11:51:27 MDT
Hi, I'm going to close this for now: after upgrading Slurm to 20.11 and doing some database optimizations, we haven't seen this again. Thanks for your time!
Comment 9 F. H. 2022-04-29 06:18:02 MDT
Hi, we are running into the same issue:
$ seff 680215
Job ID: 680215
Cluster: ag_gagneur
Use of uninitialized value $user in concatenation (.) or string at /bin/seff line 154, <DATA> line 602.
User/Group: /ag_gagneur
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 10
CPU Utilized: 6-16:21:26
CPU Efficiency: 33.08% of 20-04:46:30 core-walltime
Job Wall-clock time: 2-00:28:39
Memory Utilized: 501.90 GB
Memory Efficiency: 392.11% of 128.00 GB

Probably this comes from shared-memory usage.
Is there anything we can do about it?
Comment 10 F. H. 2022-04-29 06:19:49 MDT
More details:

seff -d 680215
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Use of uninitialized value $user in concatenation (.) or string at /usr/bin/seff line 147, <DATA> line 602.
Slurm data: 680215   ag_gagneur COMPLETED ag_gagneur 10 1 1 134217728 1 577286 174519 526284124 0
[...]
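The same arithmetic as in the earlier job applies here: the Mem field from `seff -d 680215` (526284124 KB) against the 128 GB Reqmem (134217728 KB) reproduces the reported figures exactly (a sketch of the math, not seff code):

```python
# Raw fields from the "Slurm data:" line of `seff -d 680215`;
# memory values are in KB.
mem_kb    = 526284124
reqmem_kb = 134217728   # 128 GB

print(f"Memory Utilized: {mem_kb / 1024**2:.2f} GB")          # 501.90 GB
print(f"Memory Efficiency: {100.0 * mem_kb / reqmem_kb:.2f}%")  # 392.11%
```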