Created attachment 15197: job430139

We have noticed that running 'scontrol reconfigure' while jobs are running can lead to (always leads to?) TRESRunMins data being leaked. Here is an example:

$ sacct -XnP -j 430139 --format=start,end,timelimit,elapsed,state,nnodes
2020-07-24T15:47:38|2020-07-24T16:28:02|01:30:00|00:40:24|FAILED|1

Check out the attached screenshot. For the user in question, only one job was running at this time. Noderunmins starts at 90 for the job - correct. It decreases as expected. Then, at around 4:16pm, the value magically increases by 25min (back up to 90min). And finally, when the job ends, 25min are leaked.

That timestamp corresponds with an 'scontrol reconfigure' event, according to the controller log:

Jul 24 16:16:00 slurm-01 slurmctld[450]: Processing RPC: REQUEST_RECONFIGURE from uid=10001

Ben Roberts was able to reproduce this. See https://bugs.schedmd.com/show_bug.cgi?id=9356#c20.
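For anyone checking the numbers in the report above, here is a rough Python sketch of what the job's remaining node-minutes should have been at the reconfigure timestamp. It assumes the accounting is simply (time limit minus elapsed) times node count, which is a simplification for illustration and not taken from the slurmctld source; the helper name expected_node_run_mins is made up for this example.

```python
from datetime import datetime


def expected_node_run_mins(start, now, timelimit_min, nnodes):
    """Hypothetical helper: node-minutes a running job should still hold.

    Assumes remaining = (time limit - elapsed) * nodes, floored at zero.
    This is an illustrative model, not Slurm's actual implementation.
    """
    elapsed_min = (now - start).total_seconds() / 60
    return max(0.0, (timelimit_min - elapsed_min) * nnodes)


# Values taken from the sacct output and controller log in the report.
start = datetime.fromisoformat("2020-07-24T15:47:38")
reconfig = datetime.fromisoformat("2020-07-24T16:16:00")
timelimit_min = 90  # 01:30:00
nnodes = 1

# Minutes the job had already run when the reconfigure RPC arrived.
elapsed_at_reconfig = (reconfig - start).total_seconds() / 60

# What NodeRunMins should have read at that moment under the model above.
remaining_at_reconfig = expected_node_run_mins(start, reconfig, timelimit_min, nnodes)

# If the counter is reset to the full limit instead, this is the excess
# that is never given back when the job ends.
excess_after_reset = timelimit_min * nnodes - remaining_at_reconfig
```

Under this model the job had run for a bit over 28 minutes at 16:16:00, so a reset to the full 90 would overcount by about that much. That is the same order as the roughly 25 minutes read off the screenshot; the exact figure presumably depends on when the counter was last updated.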
Increasing to Sev2. We don't have a good workaround for this.
Luke,
I've been able to reproduce the issue like Ben did and will be looking into it.
-Scott
Created attachment 15260: Reconfigure Issue Fix v1
Luke,
Here is a patch that fixes the issue. You are free to try it; if you do, let me know whether it works on your setup.
-Scott
Created attachment 15265: Reconfigure Issue Fix v2
Targeted at 20.02
Luke,
The fix will be included in the 20.02.4 release, which is coming up soon. This is the commit ID:
8f28de91efa07984020b247f272738a93e4dd5f8
Take care,
Scott
Verified as fixed in 20.02.4. Thanks!