Ticket 5341 - "more allocated time than is possible" but no reservation
Summary: "more allocated time than is possible" but no reservation
Status: RESOLVED DUPLICATE of ticket 5166
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 17.11.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
 
Reported: 2018-06-22 03:42 MDT by Sophie Créno
Modified: 2018-07-10 07:37 MDT

See Also:
Site: Institut Pasteur


Attachments
output of select * from tars_resv_table where resv_name="Downtime"; (3.93 KB, text/plain)
2018-06-22 09:01 MDT, Sophie Créno
select_*_from_cluster_resv_table before and after the rollup (7.89 KB, text/plain)
2018-07-10 07:37 MDT, Sophie Créno

Description Sophie Créno 2018-06-22 03:42:17 MDT
Hello,

We have been observing the following messages for a while now.

2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (24764191 > 17467200) for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 1
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (17467200+230400+0)(17697600) > 17467200 for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 1
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (181448290888 > 142360455600) for cluster tars(39544571) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 2
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (142360455600+1109952000+0)(143470407600) > 142360455600 for cluster tars(39544571) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 2
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (24760591 > 17467200) for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 5
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (17467200+230400+0)(17697600) > 17467200 for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 5
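(Editorial note, not part of the original ticket.) The upper bounds in these messages are simply the TRES size multiplied by the one-hour rollup window, which is why exceeding them suggests double-counted reservation usage. A quick check of the arithmetic:

```python
# The per-hour ceilings in the slurmdbd errors are TRES size * 3600 seconds.
cpu_count = 4852           # cluster tars(4852) -> tres 1 and 5 (CPU lines)
mem_mb = 39544571          # cluster tars(39544571) -> tres 2 (memory, in MB)

window = 3600              # one rollup hour, in seconds
print(cpu_count * window)  # 17467200, the CPU-seconds ceiling in the log
print(mem_mb * window)     # 142360455600, the MB-seconds ceiling in the log
```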


Here is the output of 

$ sreport cluster AccountUtilizationByUser start=06/22/18 end=now -t percent

--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-06-22T00:00:00 - 2018-06-22T10:59:59 (39600 secs)
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name       Used   Energy 
--------- --------------- --------- --------------- ---------- -------- 
     tars            root                              941.63%    0.00% 
     tars            root      root            root    750.00%    0.00% 
     tars           admin                               50.04%    0.00% 
     tars           admin      root            root     50.00%    0.00% 


I'm not 100% sure, but it seems the problem appeared under the same
circumstances Dominik used to reproduce the problem in Bug 5166:
restarting slurmctld multiple times while slurmdbd was down.

However, contrary to Bug 5166, we don't have any reservations in place:

$ scontrol show res
No reservations in the system

so I guess the procedure provided by Dominik doesn't apply, or at least
not to the <cluster>_resv_table. Could you tell us what we should do to fix
the situation?

Thanks in advance,
Comment 1 Dominik Bartkiewicz 2018-06-22 03:55:54 MDT
Hi

Could you send me the output from "sacctmgr show reservation start=06/22/18"?

Dominik
Comment 2 Sophie Créno 2018-06-22 04:22:16 MDT
$ sacctmgr show reservation start=06/22/18
   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall 
---------- --------------- ------------------------------ ------------------- ------------------- ---------- 
      tars        Downtime                       cpu=4852 2018-03-08T17:37:42 2019-03-07T19:37:37 9.3253e+06 
      tars        Downtime                       cpu=4852 2018-03-08T18:25:14 2019-03-07T19:37:37 9.3225e+06 
      tars        Downtime                       cpu=4852 2018-03-08T18:28:08 2019-03-07T19:37:37 9.3223e+06 
      tars        Downtime                       cpu=4852 2018-03-08T18:29:07 2019-03-07T19:37:37 9.3223e+06 
      tars        Downtime                       cpu=4852 2018-03-08T18:36:54 2019-03-07T19:37:37 9.3218e+06 
      tars        Downtime                       cpu=4852 2018-03-08T18:43:01 2019-03-07T19:37:37 9.3214e+06 
      tars        Downtime                       cpu=4852 2018-03-08T18:48:10 2019-03-07T19:37:37 9.3211e+06 
      tars        Downtime                       cpu=4852 2018-03-08T19:03:18 2019-03-07T19:37:37 9.3202e+06 

Oh yes, you're right, I did the scontrol show res but not the sacctmgr show res.
That corresponds to the day when we upgraded. So I guess it is a duplicate
of Bug 5166. 

In the procedure, you said:
1) stop slurmdbd
2) update <cluster>_resv_table set unused_wall=0,time_end=<> where id_resv=<resv_id> and time_start=<>; (fix the reservation record)
The next steps are only required to recalculate usage records:
3) update <cluster>_last_ran_table set hourly_rollup=<time_start>,daily_rollup=<time_start>,monthly_rollup=<time_start>; (force rollup on slurmdbd start)
4) start slurmdbd

For 2), we must use the time_start and time_end of the listed reservations, but
for 3), should we use that same time_start?

Tell me if I should ask my question on Bug 5166 instead, since this one is a duplicate.

Thanks,
Comment 3 Dominik Bartkiewicz 2018-06-22 08:49:37 MDT
Hi

It looks like you have more than one invalid row.
Could you send me the output of this MySQL query?

select * from tars_resv_table where resv_name="Downtime";

If you can wait a few days, I will provide a more automated procedure for you.
I would like to close this ticket as a duplicate of ticket 5166.

Dominik
Comment 4 Sophie Créno 2018-06-22 09:01:03 MDT
Created attachment 7159 [details]
output of select * from tars_resv_table where resv_name="Downtime";

> It looks like you have more than one invalid row.
> Could you send me the output of this MySQL query?
> 
> select * from tars_resv_table where resv_name="Downtime";
I've put the result in the attachment.


> If you can wait a few days, I will provide a more automated procedure for you.
Oh great, thank you!


> I would like to close this ticket as a duplicate of ticket 5166.
Yes, sure, go ahead, and sorry again for the duplication.
Comment 5 Dominik Bartkiewicz 2018-06-22 09:10:03 MDT
Hi

Thanks for the records.

No problem.

Dominik

*** This ticket has been marked as a duplicate of ticket 5166 ***
Comment 6 Sophie Créno 2018-07-06 04:06:40 MDT
Hello,

I don't know if I should write on this ticket since it is a duplicate, but I
don't want to pollute bug 5166.

I applied the procedure given by Dominik 3 days ago.
In slurmdbd.conf, we have

PurgeResvAfter=3months


Thus, I guess that the 

   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall 
---------- --------------- ------------------------------ ------------------- ------------------- ---------- 
      tars        Downtime                       cpu=4852 2018-03-08T17:37:42 2019-03-07T19:37:37 9.3253e+06 

had already been purged before I applied the procedure, since it was more
than 3 months old. The other reservations seem to have a correct end time:

$ sacctmgr show reservation start=03/07/18
   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall 
---------- --------------- ------------------------------ ------------------- ------------------- ---------- 
      tars        resa_dbd                       cpu=4852 2018-04-03T11:03:30 2018-04-03T11:32:47 1.7570e+03 
      tars      resa_patch                       cpu=4852 2018-05-04T11:48:28 2018-05-04T11:52:33 245.000000 
      tars      resa_patch                       cpu=4852 2018-05-04T11:52:33 2018-05-04T12:02:14 581.000000 
      tars      resa_patch                       cpu=4852 2018-05-24T09:51:16 2018-05-24T09:52:54  98.000000 
      tars      resa_patch                       cpu=4852 2018-05-24T09:52:54 2018-05-24T09:59:51 417.000000 
      tars      resa_patch                       cpu=4852 2018-06-01T13:21:11 2018-06-01T13:24:57 226.000000 
      tars      resa_patch                       cpu=4852 2018-06-01T13:24:57 2018-06-01T13:27:51 174.000000 
      tars      resa_patch                       cpu=4852 2018-06-01T17:00:02 2018-06-01T17:01:55 113.000000 
      tars      resa_patch                       cpu=4852 2018-06-01T17:01:55 2018-06-01T17:03:56 121.000000 
      tars      resa_patch                       cpu=4852 2018-06-05T13:22:57 2018-06-05T13:24:25  88.000000 
      tars      resa_patch                       cpu=4852 2018-06-05T13:24:25 2018-06-05T13:27:09 164.000000 
      tars      resa_patch                       cpu=4852 2018-06-05T13:27:09 2018-06-05T15:25:30 7.1010e+03 
      tars         resa_db                       cpu=4852 2018-06-07T13:43:18 2018-06-07T15:05:18 4.9200e+03 
      tars      resa_patch                       cpu=4852 2018-06-12T16:30:59 2018-06-12T16:34:12 193.000000 
      tars      resa_patch                       cpu=4852 2018-06-12T16:34:12 2018-06-12T16:37:06 174.000000 
      tars      resa_patch                       cpu=4852 2018-06-14T16:06:08 2018-06-14T16:09:03 175.000000 
      tars      resa_patch                       cpu=4852 2018-06-14T16:09:03 2018-06-14T16:12:26 203.000000 


However, we still have error messages in slurmdbd logs:

2018-07-03T15:29:17+02:00 tars-acct slurmdbd[8594]: slurmdbd version 17.11.7 started
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (25116535 > 17467200) for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 1
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (17467200+187200+0)(17654400) > 17467200 for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 1
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (190150683896 > 142360455600) for cluster tars(39544571) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 2
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (142360455600+888768000+0)(143249223600) > 142360455600 for cluster tars(39544571) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 2
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (25112935 > 17467200) for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 5
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (17467200+187200+0)(17654400) > 17467200 for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 5


Since the Downtime reservations had already been archived and purged, should we
still be getting these error messages? I guess the procedure couldn't modify
anything about them, since they were no longer in the database when I applied
the procedure.

Thanks in advance,
Comment 7 Dominik Bartkiewicz 2018-07-06 10:33:43 MDT
Hi


Did you force the rollup like this?

1) stop slurmdbd 
2) update <cluster>_last_ran_table set hourly_rollup=<time_start>,daily_rollup=<time_start>,monthly_rollup=<time_start>; (force rollup on slurmdbd start)
3) start slurmdbd
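(Editorial sketch, not part of the original ticket.) The forced rollup amounts to rewinding all three rollup markers in `<cluster>_last_ran_table` to a single epoch timestamp. A minimal helper that builds the UPDATE statement for a given cluster prefix and epoch value (both values below are illustrative, following the "tars" cluster naming in this ticket):

```python
# Sketch: build the SQL that forces a rollup from a given epoch start.
# The table name follows the <cluster>_last_ran_table pattern.

def force_rollup_sql(cluster: str, time_start: int) -> str:
    """Return the UPDATE statement that rewinds all three rollup markers."""
    return (
        f"update {cluster}_last_ran_table "
        f"set hourly_rollup={time_start},"
        f"daily_rollup={time_start},"
        f"monthly_rollup={time_start};"
    )

print(force_rollup_sql("tars", 1520447857))
```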

Could you send me the output of this MySQL query?

select * from tars_resv_table;

Dominik
Comment 8 Sophie Créno 2018-07-06 11:35:38 MDT
Oh no, you're right, I completely missed that. My deepest apologies, but what should I put for time_start?

Thanks,
Comment 9 Dominik Bartkiewicz 2018-07-06 13:20:24 MDT
Hi

Because some of the reservation usage records are already gone, we have no way to correctly recalculate all usage.
When all reservation data is available, time_start should be set to the start of the "Downtime" reservation; in that case, this would cause a decrease in usage.


Dominik
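(Editorial sketch, not part of the original ticket.) sacctmgr prints ISO-8601 local timestamps, while the rollup columns take epoch seconds. A small conversion helper, assuming a +01:00 offset for the March dates (the log lines show +02:00 only because they fall in summer time):

```python
from datetime import datetime, timezone, timedelta

def to_epoch(iso_ts: str, utc_offset_hours: int) -> int:
    """Convert a sacctmgr-style timestamp to epoch seconds for the SQL update."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    dt = datetime.strptime(iso_ts, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=tz)
    return int(dt.timestamp())

# TimeStart of the first Downtime row, read at +01:00:
print(to_epoch("2018-03-08T17:37:42", 1))
```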
Comment 10 Sophie Créno 2018-07-10 07:37:10 MDT
Created attachment 7258 [details]
select_*_from_cluster_resv_table before and after the rollup

Hello,

I've attached the result of

select * from tars_resv_table;

before and after the rollup, done with time_start set to 1520447857, the value
of the first line of Downtime (attachment 7159 [details]). No change as expected.

Here is what we have in the slurmdbd's log now:

2018-07-10T15:34:31+02:00 tars-acct slurmdbd[1893]: error: We have more allocated time than is possible (18325907 > 17467200) for cluster tars(4852) from 2018-04-26T01:00:00 - 2018-04-26T02:00:00 tres 5
2018-07-10T15:34:31+02:00 tars-acct slurmdbd[1893]: error: We have more time than is possible (17467200+86400+0)(17553600) > 17467200 for cluster tars(4852) from 2018-04-26T01:00:00 - 2018-04-26T02:00:00 tres 5

It seems worse than before the rollup: we have many more lines.

Thanks