| Summary: | "more allocated time than is possible" but no reservation | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sophie Créno <sophie.creno> |
| Component: | Accounting | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | bart |
| Version: | 17.11.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Institut Pasteur | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | output of select * from tars_resv_table where resv_name="Downtime"; | | |
| | select_*_from_cluster_resv_table before and after the rollup | | |
Hi,

Could you send me the output from "sacctmgr show reservation start=06/22/18"?

Dominik

$ sacctmgr show reservation start=06/22/18
Cluster Name TRES TimeStart TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
tars Downtime cpu=4852 2018-03-08T17:37:42 2019-03-07T19:37:37 9.3253e+06
tars Downtime cpu=4852 2018-03-08T18:25:14 2019-03-07T19:37:37 9.3225e+06
tars Downtime cpu=4852 2018-03-08T18:28:08 2019-03-07T19:37:37 9.3223e+06
tars Downtime cpu=4852 2018-03-08T18:29:07 2019-03-07T19:37:37 9.3223e+06
tars Downtime cpu=4852 2018-03-08T18:36:54 2019-03-07T19:37:37 9.3218e+06
tars Downtime cpu=4852 2018-03-08T18:43:01 2019-03-07T19:37:37 9.3214e+06
tars Downtime cpu=4852 2018-03-08T18:48:10 2019-03-07T19:37:37 9.3211e+06
tars Downtime cpu=4852 2018-03-08T19:03:18 2019-03-07T19:37:37 9.3202e+06
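The listing above shows eight "Downtime" rows sharing one TimeEnd but with different TimeStarts, so their intervals overlap and all count toward usage at once. A hypothetical sketch (not Slurm code; the helper name and placeholder epochs are mine) of flagging such overlapping rows:

```python
from collections import defaultdict

def find_suspect_rows(rows):
    """rows: (name, time_start, time_end) tuples; returns rows whose
    interval overlaps an earlier interval of the same reservation."""
    by_name = defaultdict(list)
    for name, start, end in sorted(rows, key=lambda r: (r[0], r[1])):
        by_name[name].append((start, end))
    suspects = []
    for name, ivals in by_name.items():
        prev_end = None
        for start, end in ivals:
            if prev_end is not None and start < prev_end:
                suspects.append((name, start, end))
            prev_end = end
    return suspects

rows = [
    ("Downtime", 100, 900),   # placeholder epochs, not the real values
    ("Downtime", 200, 900),   # starts before the previous row ends -> suspect
    ("resa_dbd", 300, 400),   # disjoint interval, fine
]
print(find_suspect_rows(rows))  # → [('Downtime', 200, 900)]
```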
Oh yes, you're right: I ran scontrol show res but not sacctmgr show res.
That corresponds to the day we upgraded, so I guess this is a duplicate
of Bug 5166.
In the procedure, you said:
1) stop slurmdbd
2) update <cluster>_resv_table set unused_wall=0,time_end=<> where id_resv=<resv_id> and time_start=<>; (fix the reservation record)
The next steps are only required to recalculate usage records:
3) update <cluster>_last_ran_table set hourly_rollup=<time_start>,daily_rollup=<time_start>,monthly_rollup=<time_start>; (force a rollup on slurmdbd start)
4) start slurmdbd
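Step 2 of the quoted procedure can be sketched against a stand-in table. This is a minimal sketch only: in-memory SQLite replaces MySQL, the schema is cut down to the columns the fix touches, and every value is a placeholder, not a real record from this cluster.

```python
import sqlite3

# In-memory SQLite stands in for the slurmdbd MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tars_resv_table
                (id_resv INTEGER, time_start INTEGER,
                 time_end INTEGER, unused_wall REAL)""")
# A broken row: time_end far in the future, a huge unused_wall.
conn.execute("INSERT INTO tars_resv_table VALUES (42, 1000, 99999999, 9.3e6)")

# Step 2: zero unused_wall and set the real end time, keyed on
# id_resv and time_start exactly as in the quoted UPDATE.
conn.execute("""UPDATE tars_resv_table
                   SET unused_wall = 0, time_end = ?
                 WHERE id_resv = ? AND time_start = ?""",
             (2000, 42, 1000))

row = conn.execute("SELECT time_end, unused_wall FROM tars_resv_table").fetchone()
print(row)
```

After the UPDATE, the row carries the corrected end time and a zeroed unused_wall, which is what the subsequent forced rollup recalculates usage from.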
For 2), we must use the time_start and time_end of the listed reservations, but for
3), should we use that same time_start?
Let me know if I should ask my question on Bug 5166 instead, since this one is a duplicate.
Thanks,
Hi,

It looks like you have more than one invalid row. Could you send me the output of this MySQL query?

select * from tars_resv_table where resv_name="Downtime";

If you can wait a few days, I will provide a more automated procedure for you. I would like to close this ticket as a duplicate of bug 5166.

Dominik

Created attachment 7159 [details]
output of select * from tars_resv_table where resv_name="Downtime";

> It looks like you have more than one invalid row.
> Could you send me the output of this MySQL query?
>
> select * from tars_resv_table where resv_name="Downtime";

I've put the result in the attachment.

> If you can wait a few days, I will provide a more automated procedure for you.

Oh great, thank you!

> I would like to close this ticket as a duplicate of bug 5166.

Yes, sure, proceed, and sorry again for the duplication.

Hi,

Thanks for the records. No problem.

Dominik

*** This ticket has been marked as a duplicate of ticket 5166 ***

Hello,

I don't know if I should write on this ticket since it is a duplicate, but I don't want to pollute bug 5166. I applied the procedure given by Dominik 3 days ago.

In slurmdbd.conf, we have PurgeResvAfter=3months. Thus, I guess that rows like

Cluster Name TRES TimeStart TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
tars Downtime cpu=4852 2018-03-08T17:37:42 2019-03-07T19:37:37 9.3253e+06

had already been purged before I applied the procedure, since they were more than 3 months old.
The other reservations seem to have a correct end time:

$ sacctmgr show reservation start=03/07/18
Cluster Name TRES TimeStart TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
tars resa_dbd cpu=4852 2018-04-03T11:03:30 2018-04-03T11:32:47 1.7570e+03
tars resa_patch cpu=4852 2018-05-04T11:48:28 2018-05-04T11:52:33 245.000000
tars resa_patch cpu=4852 2018-05-04T11:52:33 2018-05-04T12:02:14 581.000000
tars resa_patch cpu=4852 2018-05-24T09:51:16 2018-05-24T09:52:54 98.000000
tars resa_patch cpu=4852 2018-05-24T09:52:54 2018-05-24T09:59:51 417.000000
tars resa_patch cpu=4852 2018-06-01T13:21:11 2018-06-01T13:24:57 226.000000
tars resa_patch cpu=4852 2018-06-01T13:24:57 2018-06-01T13:27:51 174.000000
tars resa_patch cpu=4852 2018-06-01T17:00:02 2018-06-01T17:01:55 113.000000
tars resa_patch cpu=4852 2018-06-01T17:01:55 2018-06-01T17:03:56 121.000000
tars resa_patch cpu=4852 2018-06-05T13:22:57 2018-06-05T13:24:25 88.000000
tars resa_patch cpu=4852 2018-06-05T13:24:25 2018-06-05T13:27:09 164.000000
tars resa_patch cpu=4852 2018-06-05T13:27:09 2018-06-05T15:25:30 7.1010e+03
tars resa_db cpu=4852 2018-06-07T13:43:18 2018-06-07T15:05:18 4.9200e+03
tars resa_patch cpu=4852 2018-06-12T16:30:59 2018-06-12T16:34:12 193.000000
tars resa_patch cpu=4852 2018-06-12T16:34:12 2018-06-12T16:37:06 174.000000
tars resa_patch cpu=4852 2018-06-14T16:06:08 2018-06-14T16:09:03 175.000000
tars resa_patch cpu=4852 2018-06-14T16:09:03 2018-06-14T16:12:26 203.000000

However, we still have error messages in the slurmdbd logs:

2018-07-03T15:29:17+02:00 tars-acct slurmdbd[8594]: slurmdbd version 17.11.7 started
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (25116535 > 17467200) for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 1
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (17467200+187200+0)(17654400) > 17467200 for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 1
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (190150683896 > 142360455600) for cluster tars(39544571) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 2
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (142360455600+888768000+0)(143249223600) > 142360455600 for cluster tars(39544571) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 2
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (25112935 > 17467200) for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 5
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (17467200+187200+0)(17654400) > 17467200 for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 5

Since the Downtime reservations had already been archived and purged, were we supposed to still see these error messages? I guess the procedure wasn't able to modify anything about them, since they were no longer in the database when I applied it.

Thanks in advance,

Hi,

Did you force the rollup like this?

1) stop slurmdbd
2) update <cluster>_last_ran_table set hourly_rollup=<time_start>,daily_rollup=<time_start>,monthly_rollup=<time_start>; (force a rollup on slurmdbd start)
3) start slurmdbd

Could you send me the output of this MySQL query?

select * from tars_resv_table;

Dominik

Oh no, you're right, I completely missed that step. My apologies, but what should I put for time_start?

Thanks,

Hi,

Because some of the reservation usage records are already gone, we have no way to correctly recalculate all usage. If we still had all the reservation data, time_start should be set to the start of the "Downtime" reservation. In this case, this will cause a decrease in usage.
Dominik

Created attachment 7258 [details]
select_*_from_cluster_resv_table before and after the rollup

Hello,

I've attached the result of select * from tars_resv_table; before and after the rollup, done with time_start set to 1520447857, the value from the first "Downtime" line (attachment 7159 [details]). No change, as expected.

Here is what we now have in the slurmdbd log:

2018-07-10T15:34:31+02:00 tars-acct slurmdbd[1893]: error: We have more allocated time than is possible (18325907 > 17467200) for cluster tars(4852) from 2018-04-26T01:00:00 - 2018-04-26T02:00:00 tres 5
2018-07-10T15:34:31+02:00 tars-acct slurmdbd[1893]: error: We have more time than is possible (17467200+86400+0)(17553600) > 17467200 for cluster tars(4852) from 2018-04-26T01:00:00 - 2018-04-26T02:00:00 tres 5

It seems worse than before the rollup: we have many more lines.

Thanks
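The rollup columns in *_last_ran_table hold Unix epoch seconds, so a timestamp as displayed by sacctmgr has to be converted back to an epoch before it can be used in the UPDATE. A small sketch of that conversion; the helper name is mine, and UTC is assumed here only to keep it deterministic, whereas the timestamps sacctmgr prints are in the cluster's local time zone, so a real conversion must use that zone:

```python
from datetime import datetime, timezone

def to_epoch_utc(stamp):
    """Convert an ISO-8601 timestamp such as sacctmgr prints,
    interpreted as UTC, to Unix epoch seconds."""
    return int(datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
               .replace(tzinfo=timezone.utc).timestamp())

start = to_epoch_utc("2018-03-08T17:37:42")
print(start)  # → 1520530662

# A value obtained this way is what would go into, e.g.:
#   update tars_last_ran_table set hourly_rollup=<epoch>,
#          daily_rollup=<epoch>, monthly_rollup=<epoch>;
```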
Hello,

We have been observing the following messages for a while now:

2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (24764191 > 17467200) for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 1
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (17467200+230400+0)(17697600) > 17467200 for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 1
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (181448290888 > 142360455600) for cluster tars(39544571) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 2
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (142360455600+1109952000+0)(143470407600) > 142360455600 for cluster tars(39544571) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 2
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (24760591 > 17467200) for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 5
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (17467200+230400+0)(17697600) > 17467200 for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 5

Here is the output of sreport:

$ sreport cluster AccountUtilizationByUser start=06/22/18 end=now -t percent
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-06-22T00:00:00 - 2018-06-22T10:59:59 (39600 secs)
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
Cluster Account Login Proper Name Used Energy
--------- --------------- --------- --------------- ---------- --------
tars root 941.63% 0.00%
tars root root root 750.00% 0.00%
tars admin 50.04% 0.00%
tars admin root root 50.00% 0.00%

I'm not 100% sure, but it seems that the
problem appeared in the same circumstances Dominik used to reproduce it in Bug 5166: slurmctld was restarted multiple times while slurmdbd was down. Unlike Bug 5166, however, we don't have any reservations defined:

$ scontrol show res
No reservations in the system

So I guess the procedure provided by Dominik doesn't apply, or at least not to the <cluster>_resv_table. Could you tell us what we should do to fix the situation?

Thanks in advance,
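The figures in the error messages can be reproduced by hand: the rollup's ceiling for one hourly period is the TRES count times 3600 seconds, and the reported allocation exceeds it. A sketch of that arithmetic, using numbers taken from the log above:

```python
# Sanity-check the "more allocated time than is possible" figures.
cpus = 4852                  # cluster CPU count, from "tars(4852)" in the log
window = 3600                # one hourly rollup period, in seconds
limit = cpus * window
print(limit)                 # → 17467200, the ceiling quoted in the log

allocated = 24764191         # allocated core-seconds reported for tres 1
print(allocated > limit)     # → True: hence the error message

# The "more time than is possible" line adds reserved (230400) and
# idle (0) time to the allocation ceiling:
print(limit + 230400 + 0)    # → 17697600, again above the ceiling
```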