Good evening, I saw the following logs in slurmctld.log and was curious what their significance is (there were 3 sets of these errors).

[2019-09-08T18:15:52.215] error: _handle_assoc_tres_run_secs: job 356468: assoc 6562 TRES cpu grp_used_tres_run_secs underflow, tried to remove 2479870 seconds when only 2473570 remained.
[2019-09-08T18:15:52.215] error: _handle_assoc_tres_run_secs: job 356468: assoc 6562 TRES mem grp_used_tres_run_secs underflow, tried to remove 12696934400 seconds when only 12679424000 remained.
[2019-09-08T18:15:52.215] error: _handle_assoc_tres_run_secs: job 356468: assoc 6562 TRES node grp_used_tres_run_secs underflow, tried to remove 247987 seconds when only 241687 remained.
[2019-09-08T18:15:52.215] error: _handle_assoc_tres_run_secs: job 356468: assoc 6562 TRES billing grp_used_tres_run_secs underflow, tried to remove 1093126696 seconds when only 1091621596 remained.

I saw similar underflow errors via Google, but it looked like those were fixed in v15. Thanks, - Matt
grp_used_tres_run_secs tracks the TRES usage of currently running jobs. It is used to enforce the QOS limit GrpTRESRunMins and similar limits. It appears that the job's calculated TRES usage changed between when it started and when it ended, so the controller is trying to remove more run usage than is currently accounted for. How often is this occurring? Just the once so far? Please attach your slurm.conf and the slurmctld.log, at least covering these error messages and the start of the job mentioned (356468). Or just all you have. Thanks
Created attachment 11562 [details] Gzip file of slurm.conf and 2 logfiles
Thanks Broderick. Today we have 1 instance (4 log lines), but yesterday (also included) there were 124 log lines. Let us know what else we can do or provide. Thanks, - Matt
Hi Broderick, Have you had any insights on this error? Thanks, - Matt
Sorry for the delay in responding. I was preoccupied with the Slurm User Group Meeting that was this week. I have examined the logs you sent. I don't have any new insight yet, but I have determined that this is a duplicate of Bug 7390. I'm trying to reproduce the error locally, and I'm considering the possibility of a race condition.
Hi Broderick, Just checking in on this ticket - any ideas? Thanks, - Matt
Not yet. The information about how often this occurs is indirectly useful. I'm also looking for any similarities between jobs that trigger the underflow error. A lack of predictability would point to a race condition rather than buggy logic under certain conditions. We have seen and fixed the opposite problem in the past, where accrued usage was leaked, but this appears completely unrelated. I should have another update for you soon. Additional logs and/or reports of occurrences are appreciated. Thanks
Hi Matt, This issue is showing up in a few customer clusters, so we are going to consolidate this into a single bug (bug 7390) and work from there. Feel free to participate in that bug as you have in this one. Thanks, Michael *** This ticket has been marked as a duplicate of ticket 7390 ***