Hello,

We have been observing the following messages for a while now:

2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (24764191 > 17467200) for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 1
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (17467200+230400+0)(17697600) > 17467200 for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 1
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (181448290888 > 142360455600) for cluster tars(39544571) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 2
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (142360455600+1109952000+0)(143470407600) > 142360455600 for cluster tars(39544571) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 2
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (24760591 > 17467200) for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 5
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (17467200+230400+0)(17697600) > 17467200 for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 5

Here is the output of sreport:

$ sreport cluster AccountUtilizationByUser start=06/22/18 end=now -t percent
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-06-22T00:00:00 - 2018-06-22T10:59:59 (39600 secs)
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name       Used   Energy
--------- --------------- --------- --------------- ---------- --------
     tars            root                               941.63%    0.00%
     tars            root      root            root    750.00%    0.00%
     tars           admin                                50.04%    0.00%
     tars           admin      root            root     50.00%    0.00%

I'm not 100% sure, but it seems that the problem appeared under the same circumstances that Dominik used to reproduce the problem in Bug 5166: slurmctld was restarted multiple times while slurmdbd was down. Contrary to Bug 5166, however, we don't have any reservations:

$ scontrol show res
No reservations in the system

So I guess that the procedure provided by Dominik doesn't apply, or at least not to the <cluster>_resv_table. Could you tell us what we should do to fix the situation?

Thanks in advance,
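As an aside, the "is possible" ceilings in these messages appear to be simply the cluster's TRES count multiplied by the 3600 seconds of the hourly rollup window, which matches the figures above (this is just a sanity check on the numbers, not anything taken from the Slurm code):

```sql
-- tres 1 (CPU): 4852 CPUs over the one-hour rollup window
select 4852 * 3600;        -- 17467200, the reported CPU ceiling
-- tres 2 (memory): 39544571 units over the same window
select 39544571 * 3600;    -- 142360455600, the reported ceiling
```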
Hi,

Could you send me the output from "sacctmgr show reservation start=06/22/18"?

Dominik
$ sacctmgr show reservation start=06/22/18
   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
      tars        Downtime                       cpu=4852 2018-03-08T17:37:42 2019-03-07T19:37:37 9.3253e+06
      tars        Downtime                       cpu=4852 2018-03-08T18:25:14 2019-03-07T19:37:37 9.3225e+06
      tars        Downtime                       cpu=4852 2018-03-08T18:28:08 2019-03-07T19:37:37 9.3223e+06
      tars        Downtime                       cpu=4852 2018-03-08T18:29:07 2019-03-07T19:37:37 9.3223e+06
      tars        Downtime                       cpu=4852 2018-03-08T18:36:54 2019-03-07T19:37:37 9.3218e+06
      tars        Downtime                       cpu=4852 2018-03-08T18:43:01 2019-03-07T19:37:37 9.3214e+06
      tars        Downtime                       cpu=4852 2018-03-08T18:48:10 2019-03-07T19:37:37 9.3211e+06
      tars        Downtime                       cpu=4852 2018-03-08T19:03:18 2019-03-07T19:37:37 9.3202e+06

Oh yes, you're right: I ran "scontrol show res" but not "sacctmgr show res". Those dates correspond to the day we upgraded, so I guess this is indeed a duplicate of Bug 5166. In that procedure, you said:

1) stop slurmdbd
2) update <cluster>_resv_table set unused_wall=0,time_end=<> where id_resv=<resv_id> and time_start=<>; (fix the reservation record)
   (the next steps are only required to recalculate usage records)
3) update <cluster>_last_ran_table set hourly_rollup=<time_start>,daily_rollup=<time_start>,monthly_rollup=<time_start>; (force a rollup on slurmdbd start)
4) start slurmdbd

For 2), we must use the time_start and time_end of the listed reservations, but for 3), is it that same time_start that we should put? Tell me if I should ask my question on Bug 5166 instead, since this one is a duplicate.

Thanks,
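If it helps to see the template from step 2) with values filled in, it might look like the sketch below. The id_resv value and both epoch values are hypothetical placeholders (Slurm stores these columns as Unix epoch seconds); the real values would have to come from the cluster's resv table:

```sql
-- Hypothetical sketch of step 2): fix one bogus Downtime record.
-- id_resv=42 and both epoch values are made-up placeholders;
-- the real values must be read from tars_resv_table first.
update tars_resv_table
   set unused_wall = 0,
       time_end    = 1520500000   -- hypothetical corrected end time
 where id_resv     = 42           -- hypothetical reservation id
   and time_start  = 1520447862;  -- hypothetical original start time
```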
Hi,

It looks like you have more than one invalid row. Could you send me the output of this MySQL query?

select * from tars_resv_table where resv_name="Downtime";

If you can wait a few days, I will provide a more automated procedure for you. I would like to close this ticket as a duplicate of bug 5166.

Dominik
Created attachment 7159 [details]
output of select * from tars_resv_table where resv_name="Downtime";

> It looks like that you have more than one invalid row.
> Could you send me output of this MySQL query?
>
> select * from tars_resv_table where resv_name="Downtime";

I've put the result in the attachment.

> If you can wait few days I will provide more automated procedure for you.

Oh great, thank you!

> I would like to close this ticket as the duplicate of bug 5166.

Yes, sure, go ahead, and sorry again for the duplication.
Hi,

Thanks for the records. No problem.

Dominik

*** This ticket has been marked as a duplicate of ticket 5166 ***
Hello,

I don't know if I should write on this bug since it is a duplicate, but I don't want to pollute bug 5166.

I applied the procedure given by Dominik 3 days ago. In slurmdbd.conf, we have PurgeResvAfter=3months, so I guess that the Downtime reservations, such as

   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
      tars        Downtime                       cpu=4852 2018-03-08T17:37:42 2019-03-07T19:37:37 9.3253e+06

had already been purged before I applied the procedure, since they were more than 3 months old. The other reservations seem to have a correct end time:

$ sacctmgr show reservation start=03/07/18
   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
      tars        resa_dbd                       cpu=4852 2018-04-03T11:03:30 2018-04-03T11:32:47 1.7570e+03
      tars      resa_patch                       cpu=4852 2018-05-04T11:48:28 2018-05-04T11:52:33 245.000000
      tars      resa_patch                       cpu=4852 2018-05-04T11:52:33 2018-05-04T12:02:14 581.000000
      tars      resa_patch                       cpu=4852 2018-05-24T09:51:16 2018-05-24T09:52:54  98.000000
      tars      resa_patch                       cpu=4852 2018-05-24T09:52:54 2018-05-24T09:59:51 417.000000
      tars      resa_patch                       cpu=4852 2018-06-01T13:21:11 2018-06-01T13:24:57 226.000000
      tars      resa_patch                       cpu=4852 2018-06-01T13:24:57 2018-06-01T13:27:51 174.000000
      tars      resa_patch                       cpu=4852 2018-06-01T17:00:02 2018-06-01T17:01:55 113.000000
      tars      resa_patch                       cpu=4852 2018-06-01T17:01:55 2018-06-01T17:03:56 121.000000
      tars      resa_patch                       cpu=4852 2018-06-05T13:22:57 2018-06-05T13:24:25  88.000000
      tars      resa_patch                       cpu=4852 2018-06-05T13:24:25 2018-06-05T13:27:09 164.000000
      tars      resa_patch                       cpu=4852 2018-06-05T13:27:09 2018-06-05T15:25:30 7.1010e+03
      tars         resa_db                       cpu=4852 2018-06-07T13:43:18 2018-06-07T15:05:18 4.9200e+03
      tars      resa_patch                       cpu=4852 2018-06-12T16:30:59 2018-06-12T16:34:12 193.000000
      tars      resa_patch                       cpu=4852 2018-06-12T16:34:12 2018-06-12T16:37:06 174.000000
      tars      resa_patch                       cpu=4852 2018-06-14T16:06:08 2018-06-14T16:09:03 175.000000
      tars      resa_patch                       cpu=4852 2018-06-14T16:09:03 2018-06-14T16:12:26 203.000000

However, we still have error messages in the slurmdbd logs:

2018-07-03T15:29:17+02:00 tars-acct slurmdbd[8594]: slurmdbd version 17.11.7 started
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (25116535 > 17467200) for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 1
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (17467200+187200+0)(17654400) > 17467200 for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 1
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (190150683896 > 142360455600) for cluster tars(39544571) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 2
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (142360455600+888768000+0)(143249223600) > 142360455600 for cluster tars(39544571) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 2
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (25112935 > 17467200) for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 5
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (17467200+187200+0)(17654400) > 17467200 for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 5

Since the Downtime reservations had already been archived and purged, were we supposed to still see these error messages? I guess the procedure wasn't able to modify anything about them, since they were no longer in the database when I applied it.

Thanks in advance,
Hi,

Did you force the rollup like this?

1) stop slurmdbd
2) update <cluster>_last_ran_table set hourly_rollup=<time_start>,daily_rollup=<time_start>,monthly_rollup=<time_start>; (force a rollup on slurmdbd start)
3) start slurmdbd

Could you send me the output of this MySQL query?

select * from tars_resv_table;

Dominik
Oh no, you're right; I completely overlooked that step. My deepest apologies, but what value should I put for time_start?

Thanks,
Hi,

Because some of the reservation usage records are already gone, we have no way to correctly recalculate all usage. If we still had all the reservation data, time_start should be set to the same value as the start of the "Downtime" reservation. In this case, this will cause a decrease in usage.

Dominik
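Concretely, taking the start of the first Downtime row shown earlier, the rollup-forcing update could be written with MySQL's UNIX_TIMESTAMP() so that no manual epoch conversion is needed. This is only a sketch: it assumes the rollup columns store Unix epoch seconds, and UNIX_TIMESTAMP() interprets the literal in the session time zone, which must match the cluster's:

```sql
-- Sketch: push all three rollup markers back to the start of the
-- first Downtime reservation (2018-03-08T17:37:42), letting MySQL
-- do the epoch conversion in the session time zone.
update tars_last_ran_table
   set hourly_rollup  = UNIX_TIMESTAMP('2018-03-08 17:37:42'),
       daily_rollup   = UNIX_TIMESTAMP('2018-03-08 17:37:42'),
       monthly_rollup = UNIX_TIMESTAMP('2018-03-08 17:37:42');
```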
Created attachment 7258 [details]
select_*_from_cluster_resv_table before and after the rollup

Hello,

I've attached the result of "select * from tars_resv_table;" taken before and after the rollup, which was done with time_start set to 1520447857, the value of the first Downtime line (attachment 7159 [details]). No change in the table, as expected. Here is what we now have in the slurmdbd log:

2018-07-10T15:34:31+02:00 tars-acct slurmdbd[1893]: error: We have more allocated time than is possible (18325907 > 17467200) for cluster tars(4852) from 2018-04-26T01:00:00 - 2018-04-26T02:00:00 tres 5
2018-07-10T15:34:31+02:00 tars-acct slurmdbd[1893]: error: We have more time than is possible (17467200+86400+0)(17553600) > 17467200 for cluster tars(4852) from 2018-04-26T01:00:00 - 2018-04-26T02:00:00 tres 5

It seems worse than before the rollup: we now have many more lines like these.

Thanks