Ticket 7704 - _handle_assoc_tres_run_secs underflow error in slurmctld.log
Summary: _handle_assoc_tres_run_secs underflow error in slurmctld.log
Status: RESOLVED DUPLICATE of ticket 7390
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 18.08.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Broderick Gardner
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-09-08 16:30 MDT by ARC Admins
Modified: 2019-11-26 09:53 MST

See Also:
Site: University of Michigan
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Gzip file of slurm.conf and 2 logfiles (1.35 MB, application/x-gzip)
2019-09-12 13:26 MDT, ARC Admins
Details

Description ARC Admins 2019-09-08 16:30:55 MDT
Good evening,

I saw the following logs in slurmctld.log and was curious about their significance (there were 3 sets of these errors).

[2019-09-08T18:15:52.215] error: _handle_assoc_tres_run_secs: job 356468: assoc 6562 TRES cpu grp_used_tres_run_secs underflow, tried to remove 2479870 seconds when only 2473570 remained.
[2019-09-08T18:15:52.215] error: _handle_assoc_tres_run_secs: job 356468: assoc 6562 TRES mem grp_used_tres_run_secs underflow, tried to remove 12696934400 seconds when only 12679424000 remained.
[2019-09-08T18:15:52.215] error: _handle_assoc_tres_run_secs: job 356468: assoc 6562 TRES node grp_used_tres_run_secs underflow, tried to remove 247987 seconds when only 241687 remained.
[2019-09-08T18:15:52.215] error: _handle_assoc_tres_run_secs: job 356468: assoc 6562 TRES billing grp_used_tres_run_secs underflow, tried to remove 1093126696 seconds when only 1091621596 remained.
 
I saw similar underflow errors via Google, but it looked like those were fixed in v15.

Thanks,
   - Matt
Comment 3 Broderick Gardner 2019-09-09 11:40:00 MDT
grp_used_tres_run_secs tracks the TRES usage of currently running jobs. It is used to implement the QOS limit GrpTRESRunMins and similar limits. It appears that the job's calculated TRES usage differs between when it starts and when it ends, so slurmctld tries to remove more run usage than is currently accounted.
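For illustration, here is a minimal C sketch of the kind of guarded subtraction behind this message. The function and parameter names are hypothetical, not the actual Slurm implementation: the error fires when a finishing job's run-seconds exceed what the association currently has accounted, and the counter is clamped at zero rather than being allowed to wrap.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical simplification of the underflow guard (not the real
 * _handle_assoc_tres_run_secs): when a job ends, its TRES run-seconds
 * are subtracted from the association's running total. If the job
 * claims more usage than is currently accounted, log an error and
 * clamp to zero instead of letting the unsigned counter wrap. */
static uint64_t remove_tres_run_secs(uint64_t grp_used, uint64_t job_secs,
                                     const char *tres_name)
{
    if (job_secs > grp_used) {
        fprintf(stderr,
                "error: TRES %s grp_used_tres_run_secs underflow, "
                "tried to remove %llu seconds when only %llu remained.\n",
                tres_name,
                (unsigned long long)job_secs,
                (unsigned long long)grp_used);
        return 0; /* clamp rather than wrap */
    }
    return grp_used - job_secs;
}
```

With the cpu values from the log above (removing 2479870 seconds when only 2473570 remained), the guard would clamp the counter to zero and report the 6300-second discrepancy.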

How often is this occurring? Just the once so far?

Please attach your slurm.conf and the slurmctld.log, at least including these error messages and the start of the job mentioned (356468). Or just all you have.

Thanks
Comment 4 ARC Admins 2019-09-12 13:26:44 MDT
Created attachment 11562 [details]
Gzip file of slurm.conf and 2 logfiles
Comment 5 ARC Admins 2019-09-12 13:29:05 MDT
Thanks Broderick.  

Today, we have 1 instance (4 log lines), but yesterday (also included) there were 124 log lines. Let us know what else we can do or provide.

Thanks,
 - Matt
Comment 6 ARC Admins 2019-09-18 12:29:32 MDT
Hi Broderick,

Have you had any insights on this error?

Thanks,
   - Matt
Comment 7 Broderick Gardner 2019-09-19 14:53:32 MDT
Sorry for the delay in responding. I was preoccupied with the Slurm User Group Meeting this week.

I have examined the logs you sent. I don't have any new insight yet, but I have determined that this is a duplicate of Bug 7390. 

I am trying to reproduce the error locally, and I'm considering the possibility of a race condition.
Comment 8 ARC Admins 2019-11-13 10:24:45 MST
Hi Broderick,

Just checking in on this ticket - any ideas?

Thanks,
   - Matt
Comment 9 Broderick Gardner 2019-11-14 16:42:58 MST
Not yet. The information about how often this occurs is indirectly useful. I'm also looking for any similarities between jobs that trigger the underflow error. A lack of any pattern would point to a race condition rather than faulty logic under specific conditions.

We have seen and fixed the opposite problem in the past, where accrued usage was leaked, but this appears completely unrelated. 

I should have another update for you soon. Additional logs and/or reports of occurrences are appreciated.

Thanks
Comment 11 Michael Hinton 2019-11-26 09:53:41 MST
Hi Matt,

This issue is showing up in a few customer clusters, so we are going to consolidate this into a single bug (bug 7390) and work from there. Feel free to participate in that bug as you have in this one.

Thanks,
Michael

*** This ticket has been marked as a duplicate of ticket 7390 ***