Ticket 3719

Summary:	Impossibly high group max tres(cpu) minutes used preventing Jobs starting when resource limit is applied to an account
Product:	Slurm	Reporter:	antony.cleave
Component:	Accounting	Assignee:	Jacob Jenson <jacob>
Status:	RESOLVED INVALID	QA Contact:
Severity:	6 - No support contract
Priority:	---
Version:	16.05.2
Hardware:	Linux
OS:	Linux
Site:	-Other-	Slinky Site:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---

Description antony.cleave 2017-04-20 09:54:04 MDT

When a resource limit of 4500000 CPU minutes is set on the mimngs account, all jobs are held with AssocGrpCPUMinutesLimit, even though according to sacct the account has used only 229473710 CPU seconds, or 3824561.83 minutes.

140217 14:05:28 [root@muse-adm02 ~]# sacct -n -X -D -A mimngs --format=JobID,CPUTimeRAW -S2015-02-01T00:00:00 | awk '{ sum += $2; } END { print sum; }'
229473710
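
As a cross-check of the arithmetic, the summed CPUTimeRAW value can be converted to minutes and compared against the limit directly. This is only a minimal sketch: the helper names `cpu_secs_to_mins` and `mins_remaining` are made up for illustration and are not part of Slurm.

```shell
#!/bin/sh
# Hypothetical helper (not a Slurm tool): convert raw CPU seconds, as summed
# from sacct's CPUTimeRAW column, into CPU minutes with two decimals.
cpu_secs_to_mins() {
    awk -v secs="$1" 'BEGIN { printf "%.2f\n", secs / 60 }'
}

# Hypothetical helper: CPU minutes left under a limit given in minutes,
# when usage is given in seconds.
mins_remaining() {
    awk -v limit="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", limit - secs / 60 }'
}

cpu_secs_to_mins 229473710         # usage reported by the sacct sum
mins_remaining 4500000 229473710   # headroom under the 4500000-minute limit
```

By this measure the account is still well under the limit, which is why the hold looks wrong.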

In the logs we see:

[2017-04-06T13:12:03.239] debug2: Job 49611 being held, the job is at or exceeds assoc 54(mimngs/(null)/(null)) group max tres(cpu) minutes of 4500000 of which 4214743 are still available but request is for 42949671752 (42949671750 already used) tres minutes (1 tres count)

Please note that this is for a 2-minute sleep job on a single core.

The message correctly states that time is still available, but it also claims that the account has already used 42949671750 minutes, which is more than the time the cluster has been online multiplied by the number of compute cores. Note that the job's own request is correct: 42949671752 - 42949671750 = 2 minutes.
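
The figures in that log line can be pulled out and checked mechanically. The sketch below assumes the message wording is exactly as quoted above; `held_line_summary` is a hypothetical helper name, and the script only does the subtraction, it does not explain how slurmctld arrives at the numbers.

```shell
#!/bin/sh
# Hypothetical helper (not a Slurm tool): read "being held" debug2 lines on
# stdin and print two differences taken from the numbers in the message:
#   limit_minus_available = limit - "still available"
#   request_minus_used    = "request is for" - "already used"
held_line_summary() {
    sed -n 's/.*minutes of \([0-9][0-9]*\) of which \([0-9][0-9]*\) are still available but request is for \([0-9][0-9]*\) (\([0-9][0-9]*\) already used).*/\1 \2 \3 \4/p' |
    awk '{ printf "limit_minus_available=%.0f request_minus_used=%.0f\n", $1 - $2, $3 - $4 }'
}
```

It can be fed the controller log, for example `grep 'being held' slurmctld.log | held_line_summary`, where the log file name is an assumption about the local setup.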

Is there a way to identify what is causing this tres mins usage and fix it?

I suspect a rogue unfinished job in the database, but I do not know where or how to start looking for it.
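
One possible starting point, offered only as a sketch, would be to list the account's job records that have no end time and compare them with what is actually running. The filter name `unfinished_jobs` is hypothetical, and the assumption that sacct prints `Unknown` in the End column for such records should be checked against the local installation.

```shell
#!/bin/sh
# Hypothetical filter (not a Slurm tool): read pipe-delimited rows of
# JobID|State|Start|End and print those whose End field is "Unknown".
unfinished_jobs() {
    awk -F'|' '$4 == "Unknown" { print $1, $2, $3 }'
}

# Assumed invocation; only runs where sacct is installed.
if command -v sacct >/dev/null 2>&1; then
    sacct -n -X -P -A mimngs -S2015-02-01T00:00:00 \
          --format=JobID,State,Start,End | unfinished_jobs
fi
```

Any job listed here that squeue no longer knows about would be a candidate for the rogue record.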

As a workaround I have had to instruct the sysadmin to disable all resource limits for the affected accounts so this does have a noticeable impact on the system.

Thanks

Antony