Ticket 5341 - "more allocated time than is possible" but no reservation
Summary: "more allocated time than is possible" but no reservation
Status: RESOLVED DUPLICATE of ticket 5166
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 17.11.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
 
Reported: 2018-06-22 03:42 MDT by Sophie Créno
Modified: 2018-07-10 07:37 MDT

See Also:
Site: Institut Pasteur


Attachments
output of select * from tars_resv_table where resv_name="Downtime"; (3.93 KB, text/plain)
2018-06-22 09:01 MDT, Sophie Créno
select_*_from_cluster_resv_table before and after the rollup (7.89 KB, text/plain)
2018-07-10 07:37 MDT, Sophie Créno

Description Sophie Créno 2018-06-22 03:42:17 MDT
Hello,

We have been observing the following messages for a while now.

2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (24764191 > 17467200) for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 1
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (17467200+230400+0)(17697600) > 17467200 for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 1
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (181448290888 > 142360455600) for cluster tars(39544571) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 2
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (142360455600+1109952000+0)(143470407600) > 142360455600 for cluster tars(39544571) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 2
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (24760591 > 17467200) for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 5
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (17467200+230400+0)(17697600) > 17467200 for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 5
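(Editorial note, not part of the original ticket.) The upper bounds in these messages are simply the TRES size multiplied by the one-hour rollup window, which is why exceeding them suggests double-counted reservation usage. A quick check of the arithmetic:

```python
# The per-hour ceilings in the slurmdbd errors are TRES size * 3600 seconds.
cpu_count = 4852           # cluster tars(4852) -> tres 1 and 5 (CPU lines)
mem_mb = 39544571          # cluster tars(39544571) -> tres 2 (memory, in MB)

window = 3600              # one rollup hour, in seconds
print(cpu_count * window)  # 17467200, the CPU-seconds ceiling in the log
print(mem_mb * window)     # 142360455600, the MB-seconds ceiling in the log
```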


Here is the output of 

$ sreport cluster AccountUtilizationByUser start=06/22/18 end=now -t percent

--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-06-22T00:00:00 - 2018-06-22T10:59:59 (39600 secs)
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name       Used   Energy 
--------- --------------- --------- --------------- ---------- -------- 
     tars            root                              941.63%    0.00% 
     tars            root      root            root    750.00%    0.00% 
     tars           admin                               50.04%    0.00% 
     tars           admin      root            root     50.00%    0.00% 


I'm not 100% sure, but it seems the problem appeared under the same
circumstances Dominik used to reproduce the problem in Bug 5166:
restarting slurmctld multiple times while slurmdbd was down.

However, contrary to Bug 5166, we don't have any reservations in place:

$ scontrol show res
No reservations in the system

so I guess the procedure provided by Dominik doesn't apply, or at least
not to the <cluster>_resv_table. Could you tell us what we should do to fix
the situation?

Thanks in advance,
Comment 1 Dominik Bartkiewicz 2018-06-22 03:55:54 MDT
Hi

Could you send me the output from "sacctmgr show reservation start=06/22/18"?

Dominik
Comment 2 Sophie Créno 2018-06-22 04:22:16 MDT
$ sacctmgr show reservation start=06/22/18
   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall 
---------- --------------- ------------------------------ ------------------- ------------------- ---------- 
      tars        Downtime                       cpu=4852 2018-03-08T17:37:42 2019-03-07T19:37:37 9.3253e+06 
      tars        Downtime                       cpu=4852 2018-03-08T18:25:14 2019-03-07T19:37:37 9.3225e+06 
      tars        Downtime                       cpu=4852 2018-03-08T18:28:08 2019-03-07T19:37:37 9.3223e+06 
      tars        Downtime                       cpu=4852 2018-03-08T18:29:07 2019-03-07T19:37:37 9.3223e+06 
      tars        Downtime                       cpu=4852 2018-03-08T18:36:54 2019-03-07T19:37:37 9.3218e+06 
      tars        Downtime                       cpu=4852 2018-03-08T18:43:01 2019-03-07T19:37:37 9.3214e+06 
      tars        Downtime                       cpu=4852 2018-03-08T18:48:10 2019-03-07T19:37:37 9.3211e+06 
      tars        Downtime                       cpu=4852 2018-03-08T19:03:18 2019-03-07T19:37:37 9.3202e+06 

Oh yes, you're right, I did the scontrol show res but not the sacctmgr show res.
That corresponds to the day when we upgraded. So I guess it is a duplicate
of Bug 5166. 

In the procedure, you said:
1) stop slurmdbd
2) update <cluster>_resv_table set unused_wall=0,time_end=<> where id_resv=<resv_id> and time_start=<>; (fix the reservation record)
The next steps are only required to recalculate usage records:
3) update <cluster>_last_ran_table set hourly_rollup=<time_start>,daily_rollup=<time_start>,monthly_rollup=<time_start>; (force rollup on slurmdbd start)
4) start slurmdbd

For 2), we must use the time_start and time_end of the listed reservations, but
for 3), should we use that same time_start?

Tell me if I should ask my question on Bug 5166 instead, since this one is a duplicate.

Thanks,
Comment 3 Dominik Bartkiewicz 2018-06-22 08:49:37 MDT
Hi

It looks like you have more than one invalid row.
Could you send me the output of this MySQL query?

select * from tars_resv_table where resv_name="Downtime";

If you can wait a few days, I will provide a more automated procedure for you.
I would like to close this ticket as a duplicate of ticket 5166.

Dominik
Comment 4 Sophie Créno 2018-06-22 09:01:03 MDT
Created attachment 7159 [details]
output of select * from tars_resv_table where resv_name="Downtime";

> It looks like you have more than one invalid row.
> Could you send me the output of this MySQL query?
> 
> select * from tars_resv_table where resv_name="Downtime";
I've put the result in the attachment.


> If you can wait a few days, I will provide a more automated procedure for you.
Oh great, thank you!


> I would like to close this ticket as a duplicate of ticket 5166.
Yes, sure, go ahead, and sorry again for the duplication.
Comment 5 Dominik Bartkiewicz 2018-06-22 09:10:03 MDT
Hi

Thanks for the records.

No problem.

Dominik

*** This ticket has been marked as a duplicate of ticket 5166 ***
Comment 6 Sophie Créno 2018-07-06 04:06:40 MDT
Hello,

I don't know if I should write on this ticket since it is a duplicate, but I
don't want to pollute bug 5166.

I applied the procedure given by Dominik 3 days ago.
In slurmdbd.conf, we have

PurgeResvAfter=3months


Thus, I guess that the 

   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall 
---------- --------------- ------------------------------ ------------------- ------------------- ---------- 
      tars        Downtime                       cpu=4852 2018-03-08T17:37:42 2019-03-07T19:37:37 9.3253e+06 

had already been purged before I applied the procedure, since it was more
than 3 months old. The other reservations seem to have a correct end time:

$ sacctmgr show reservation start=03/07/18
   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall 
---------- --------------- ------------------------------ ------------------- ------------------- ---------- 
      tars        resa_dbd                       cpu=4852 2018-04-03T11:03:30 2018-04-03T11:32:47 1.7570e+03 
      tars      resa_patch                       cpu=4852 2018-05-04T11:48:28 2018-05-04T11:52:33 245.000000 
      tars      resa_patch                       cpu=4852 2018-05-04T11:52:33 2018-05-04T12:02:14 581.000000 
      tars      resa_patch                       cpu=4852 2018-05-24T09:51:16 2018-05-24T09:52:54  98.000000 
      tars      resa_patch                       cpu=4852 2018-05-24T09:52:54 2018-05-24T09:59:51 417.000000 
      tars      resa_patch                       cpu=4852 2018-06-01T13:21:11 2018-06-01T13:24:57 226.000000 
      tars      resa_patch                       cpu=4852 2018-06-01T13:24:57 2018-06-01T13:27:51 174.000000 
      tars      resa_patch                       cpu=4852 2018-06-01T17:00:02 2018-06-01T17:01:55 113.000000 
      tars      resa_patch                       cpu=4852 2018-06-01T17:01:55 2018-06-01T17:03:56 121.000000 
      tars      resa_patch                       cpu=4852 2018-06-05T13:22:57 2018-06-05T13:24:25  88.000000 
      tars      resa_patch                       cpu=4852 2018-06-05T13:24:25 2018-06-05T13:27:09 164.000000 
      tars      resa_patch                       cpu=4852 2018-06-05T13:27:09 2018-06-05T15:25:30 7.1010e+03 
      tars         resa_db                       cpu=4852 2018-06-07T13:43:18 2018-06-07T15:05:18 4.9200e+03 
      tars      resa_patch                       cpu=4852 2018-06-12T16:30:59 2018-06-12T16:34:12 193.000000 
      tars      resa_patch                       cpu=4852 2018-06-12T16:34:12 2018-06-12T16:37:06 174.000000 
      tars      resa_patch                       cpu=4852 2018-06-14T16:06:08 2018-06-14T16:09:03 175.000000 
      tars      resa_patch                       cpu=4852 2018-06-14T16:09:03 2018-06-14T16:12:26 203.000000 


However, we still have error messages in slurmdbd logs:

2018-07-03T15:29:17+02:00 tars-acct slurmdbd[8594]: slurmdbd version 17.11.7 started
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (25116535 > 17467200) for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 1
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (17467200+187200+0)(17654400) > 17467200 for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 1
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (190150683896 > 142360455600) for cluster tars(39544571) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 2
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (142360455600+888768000+0)(143249223600) > 142360455600 for cluster tars(39544571) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 2
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (25112935 > 17467200) for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 5
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (17467200+187200+0)(17654400) > 17467200 for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 5


Since the Downtime reservations had already been archived and purged, should we
still be getting these error messages? I guess the procedure couldn't modify
anything about them, since they were no longer in the database when I applied
the procedure.

Thanks in advance,
Comment 7 Dominik Bartkiewicz 2018-07-06 10:33:43 MDT
Hi


Did you force the rollup like this?

1) stop slurmdbd 
2) update <cluster>_last_ran_table set hourly_rollup=<time_start>,daily_rollup=<time_start>,monthly_rollup=<time_start>; (force rollup on slurmdbd start)
3) start slurmdbd
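(Editorial sketch, not part of the original ticket.) The forced rollup amounts to rewinding all three rollup markers in `<cluster>_last_ran_table` to a single epoch timestamp. A minimal helper that builds the UPDATE statement for a given cluster prefix and epoch value (both values below are illustrative, following the "tars" cluster naming in this ticket):

```python
# Sketch: build the SQL that forces a rollup from a given epoch start.
# The table name follows the <cluster>_last_ran_table pattern.

def force_rollup_sql(cluster: str, time_start: int) -> str:
    """Return the UPDATE statement that rewinds all three rollup markers."""
    return (
        f"update {cluster}_last_ran_table "
        f"set hourly_rollup={time_start},"
        f"daily_rollup={time_start},"
        f"monthly_rollup={time_start};"
    )

print(force_rollup_sql("tars", 1520447857))
```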

Could you send me the output of this MySQL query?

select * from tars_resv_table;

Dominik
Comment 8 Sophie Créno 2018-07-06 11:35:38 MDT
Oh no, you're right, I completely missed that. My deepest apologies, but what should I put for time_start?

Thanks,
Comment 9 Dominik Bartkiewicz 2018-07-06 13:20:24 MDT
Hi

Because some of the reservation usage records are already gone, we have no way to correctly recalculate all usage.
When all reservation data is available, time_start should be set to the start of the "Downtime" reservation; in that case, this would cause a decrease in usage.


Dominik
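(Editorial sketch, not part of the original ticket.) sacctmgr prints ISO-8601 local timestamps, while the rollup columns take epoch seconds. A small conversion helper, assuming a +01:00 offset for the March dates (the log lines show +02:00 only because they fall in summer time):

```python
from datetime import datetime, timezone, timedelta

def to_epoch(iso_ts: str, utc_offset_hours: int) -> int:
    """Convert a sacctmgr-style timestamp to epoch seconds for the SQL update."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    dt = datetime.strptime(iso_ts, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=tz)
    return int(dt.timestamp())

# TimeStart of the first Downtime row, read at +01:00:
print(to_epoch("2018-03-08T17:37:42", 1))
```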
Comment 10 Sophie Créno 2018-07-10 07:37:10 MDT
Created attachment 7258 [details]
select_*_from_cluster_resv_table before and after the rollup

Hello,

I've attached the result of

select * from tars_resv_table;

before and after the rollup, done with time_start set to 1520447857, the value
of the first line of Downtime (attachment 7159 [details]). No change as expected.

Here is what we have in the slurmdbd's log now:

2018-07-10T15:34:31+02:00 tars-acct slurmdbd[1893]: error: We have more allocated time than is possible (18325907 > 17467200) for cluster tars(4852) from 2018-04-26T01:00:00 - 2018-04-26T02:00:00 tres 5
2018-07-10T15:34:31+02:00 tars-acct slurmdbd[1893]: error: We have more time than is possible (17467200+86400+0)(17553600) > 17467200 for cluster tars(4852) from 2018-04-26T01:00:00 - 2018-04-26T02:00:00 tres 5

It seems worse than before the rollup: we have many more lines.

Thanks