9477 – reconfigure while jobs are running leads to leaky TRESRunMins data

Summary: reconfigure while jobs are running leads to leaky TRESRunMins data

Status:	RESOLVED FIXED

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Accounting
Version:	20.02.3
Hardware:	Linux Linux

Severity:	2 - High Impact
Assignee:	Scott Hilton

Reported:	2020-07-28 10:40 MDT by Luke Yeager
Modified:	2020-08-06 10:25 MDT
CC List:	0 users

See Also:	9356
Site:	NVIDIA (PSLA)
Version Fixed:	20.02.4 20.11.0pre1

Attachments
job430139 (65.54 KB, image/jpeg) 2020-07-28 10:40 MDT, Luke Yeager
Reconfigure Issue Fix v1 (1.33 KB, patch) 2020-07-31 09:29 MDT, Scott Hilton

Description Luke Yeager 2020-07-28 10:40:53 MDT

Created attachment 15197
job430139

We have noticed that running 'scontrol reconfigure' while jobs are running can lead to (always leads to?) TRESRunMins data being leaked. Here is an example:

    $ sacct -XnP -j 430139 --format=start,end,timelimit,elapsed,state,nnodes
    2020-07-24T15:47:38|2020-07-24T16:28:02|01:30:00|00:40:24|FAILED|1

Check out the attached screenshot. For the user in question, only one job was running at this time. Noderunmins starts at 90 for the job, which is correct (1 node with a 01:30:00 time limit). It decreases as expected. Then, at around 4:16pm, the value jumps by about 25 min (back up to 90 min). Finally, when the job ends, those 25 min are never released, so they are leaked. The timestamp of the jump corresponds with an 'scontrol reconfigure' event, according to the controller log:

    Jul 24 16:16:00 slurm-01 slurmctld[450]: Processing RPC: REQUEST_RECONFIGURE from uid=10001
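
As a rough cross-check of the timeline above, the following Python sketch recomputes the job's elapsed time from the sacct line and works out how many node-minutes should still have been charged when the reconfigure arrived. It only does arithmetic on the values quoted in this ticket; it does not query Slurm. The helper names (`parse_hms`, `parse_sacct_line`, `minutes`) are made up for illustration, and the year of the log timestamp is an assumption taken from the sacct output, since the syslog line does not include one.

```python
from datetime import datetime, timedelta

# Values copied from the report. The syslog line carries no year, so 2020 is
# assumed from the sacct output.
SACCT_LINE = "2020-07-24T15:47:38|2020-07-24T16:28:02|01:30:00|00:40:24|FAILED|1"
RECONFIGURE_AT = datetime(2020, 7, 24, 16, 16, 0)


def parse_hms(text):
    """Parse HH:MM:SS, the only duration form that appears in this report."""
    hours, mins, secs = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=mins, seconds=secs)


def parse_sacct_line(line):
    """Split a line in the field order start,end,timelimit,elapsed,state,nnodes."""
    start, end, timelimit, elapsed, state, nnodes = line.split("|")
    return {
        "start": datetime.fromisoformat(start),
        "end": datetime.fromisoformat(end),
        "timelimit": parse_hms(timelimit),
        "elapsed": parse_hms(elapsed),
        "state": state,
        "nnodes": int(nnodes),
    }


def minutes(delta):
    """Convert a timedelta to fractional minutes."""
    return delta.total_seconds() / 60


job = parse_sacct_line(SACCT_LINE)

# Time the job had already run when the reconfigure RPC was processed.
elapsed_at_reconfigure = RECONFIGURE_AT - job["start"]

# Node-minutes that should still be reserved at that moment:
# (time limit - elapsed so far) * number of nodes.
expected_remaining = (job["timelimit"] - elapsed_at_reconfigure) * job["nnodes"]

print(f"elapsed at reconfigure: {minutes(elapsed_at_reconfigure):.1f} min")
print(f"expected node-minutes remaining: {minutes(expected_remaining):.1f}")
```

With these inputs the reconfigure lands 28 min 22 s after the job start, so about 61.6 node-minutes should have remained at that point rather than the full 90. The jump of roughly 25 min quoted above was read off a graph, so it need not match this figure exactly.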

Ben Roberts was able to reproduce this. See https://bugs.schedmd.com/show_bug.cgi?id=9356#c20.

Comment 1 Luke Yeager 2020-07-29 16:26:30 MDT

Increasing to Sev2. We don't have a good workaround for this.

Comment 2 Scott Hilton 2020-07-29 16:47:26 MDT

Luke, 

I've been able to reproduce the issue like Ben did and will be looking into it.

-Scott

Comment 3 Scott Hilton 2020-07-31 09:29:37 MDT

Created attachment 15260
Reconfigure Issue Fix v1

Luke,

Here is a patch that fixes the issue. Feel free to try it; if you do, let me know whether it works on your setup.

-Scott

Comment 7 Scott Hilton 2020-07-31 11:38:00 MDT

Created attachment 15265
Reconfigure Issue Fix v2

Targeted at 20.02

Comment 9 Scott Hilton 2020-08-04 11:30:36 MDT

Luke,

The fix will be included in the 20.02.4 release, which is coming up soon.

This is the commit ID: 8f28de91efa07984020b247f272738a93e4dd5f8

Take care,

Scott

Comment 10 Luke Yeager 2020-08-06 10:25:40 MDT

Verified as fixed in 20.02.4. Thanks!