Ticket 9477

Summary: reconfigure while jobs are running leads to leaky TRESRunMins data
Product: Slurm
Reporter: Luke Yeager <lyeager>
Component: Accounting
Assignee: Scott Hilton <scott>
Status: RESOLVED FIXED
QA Contact:
Severity: 2 - High Impact    
Priority: ---    
Version: 20.02.3   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=9356
Site: NVIDIA (PSLA)
Version Fixed: 20.02.4, 20.11.0pre1
Attachments: job430139
Reconfigure Issue Fix v1

Description Luke Yeager 2020-07-28 10:40:53 MDT
Created attachment 15197
job430139

We have noticed that running 'scontrol reconfigure' while jobs are running can lead to (always leads to?) TRESRunMins data being leaked. Here is an example:

    $ sacct -XnP -j 430139 --format=start,end,timelimit,elapsed,state,nnodes
    2020-07-24T15:47:38|2020-07-24T16:28:02|01:30:00|00:40:24|FAILED|1

See the attached screenshot. The user in question had only one job running at the time. Noderunmins starts at 90 for the job, which is correct (1 node with a 01:30:00 time limit), and it decreases as expected. Then, at around 4:16pm, the value jumps by 25min (back up to 90min). Finally, when the job ends, 25min are leaked. The jump coincides with an 'scontrol reconfigure' event, according to the controller log:

    Jul 24 16:16:00 slurm-01 slurmctld[450]: Processing RPC: REQUEST_RECONFIGURE from uid=10001

Ben Roberts was able to reproduce this. See https://bugs.schedmd.com/show_bug.cgi?id=9356#c20.
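
As a rough cross-check on the numbers above, here is a minimal Python sketch of what the node TRESRunMins value should be at a given moment. It assumes a simplified model in which a running job contributes nnodes x (timelimit - elapsed so far) node-minutes and nothing once it has ended; the controller's real bookkeeping may update at a coarser granularity. The helper names (parse_hms, expected_node_run_mins) are made up for this illustration and are not part of Slurm. The figures in the description were read off a graph, so expect only approximate agreement.

```python
from datetime import datetime, timedelta


def parse_hms(text):
    """Parse an HH:MM:SS duration as printed by sacct (no day field handled)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def expected_node_run_mins(sacct_line, at):
    """Expected node run-minutes at time `at` under the simplified model.

    `sacct_line` is one line of
    `sacct -XnP --format=start,end,timelimit,elapsed,state,nnodes` output.
    """
    start, end, timelimit, _elapsed, _state, nnodes = sacct_line.strip().split("|")
    started = datetime.fromisoformat(start)
    ended = datetime.fromisoformat(end)
    if not (started <= at < ended):
        # A job that is not running should hold no run-minutes at all.
        return 0.0
    remaining = parse_hms(timelimit) - (at - started)
    return int(nnodes) * max(remaining, timedelta(0)).total_seconds() / 60


# The sacct line and the reconfigure timestamp are taken from this ticket.
SACCT_LINE = "2020-07-24T15:47:38|2020-07-24T16:28:02|01:30:00|00:40:24|FAILED|1"
RECONFIGURE = datetime(2020, 7, 24, 16, 16, 0)

if __name__ == "__main__":
    before = expected_node_run_mins(SACCT_LINE, RECONFIGURE)
    print(f"expected just before reconfigure: {before:.1f} node-minutes")
    print(f"size of a jump back to the full limit: {90 - before:.1f} node-minutes")
```

Any value still charged to the association after the job's end time, where this model gives zero, is the leak described above.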
Comment 1 Luke Yeager 2020-07-29 16:26:30 MDT
Increasing to Sev2. We don't have a good workaround for this.
Comment 2 Scott Hilton 2020-07-29 16:47:26 MDT
Luke, 

I've been able to reproduce the issue like Ben did and will be looking into it.

-Scott
Comment 3 Scott Hilton 2020-07-31 09:29:37 MDT
Created attachment 15260
Reconfigure Issue Fix v1

Luke,

Here is a patch that fixes the issue. Feel free to try it; if you do, let me know whether it works on your setup.

-Scott
Comment 7 Scott Hilton 2020-07-31 11:38:00 MDT
Created attachment 15265
Reconfigure Issue Fix v2

Targeted at 20.02
Comment 9 Scott Hilton 2020-08-04 11:30:36 MDT
Luke,

The fix will be included in the 20.02.4 release which is coming up soon.

This is the commit ID: 8f28de91efa07984020b247f272738a93e4dd5f8

Take care,

Scott
Comment 10 Luke Yeager 2020-08-06 10:25:40 MDT
Verified as fixed in 20.02.4. Thanks!