| Summary: | "more allocated time than is possible" but no reservation | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sophie Créno <sophie.creno> |
| Component: | Accounting | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | bart |
| Version: | 17.11.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Institut Pasteur | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | output of select * from tars_resv_table where resv_name="Downtime"; | | |
| | select_*_from_cluster_resv_table before and after the rollup | | |
Hi,

Could you send me the output from "sacctmgr show reservation start=06/22/18"?

Dominik

$ sacctmgr show reservation start=06/22/18
Cluster Name TRES TimeStart TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
tars Downtime cpu=4852 2018-03-08T17:37:42 2019-03-07T19:37:37 9.3253e+06
tars Downtime cpu=4852 2018-03-08T18:25:14 2019-03-07T19:37:37 9.3225e+06
tars Downtime cpu=4852 2018-03-08T18:28:08 2019-03-07T19:37:37 9.3223e+06
tars Downtime cpu=4852 2018-03-08T18:29:07 2019-03-07T19:37:37 9.3223e+06
tars Downtime cpu=4852 2018-03-08T18:36:54 2019-03-07T19:37:37 9.3218e+06
tars Downtime cpu=4852 2018-03-08T18:43:01 2019-03-07T19:37:37 9.3214e+06
tars Downtime cpu=4852 2018-03-08T18:48:10 2019-03-07T19:37:37 9.3211e+06
tars Downtime cpu=4852 2018-03-08T19:03:18 2019-03-07T19:37:37 9.3202e+06
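The listing above shows eight "Downtime" rows sharing one TimeEnd but with different TimeStarts, so their intervals overlap and all count toward usage at once. A hypothetical sketch (not Slurm code; the helper name and placeholder epochs are mine) of flagging such overlapping rows:

```python
from collections import defaultdict

def find_suspect_rows(rows):
    """rows: (name, time_start, time_end) tuples; returns rows whose
    interval overlaps an earlier interval of the same reservation."""
    by_name = defaultdict(list)
    for name, start, end in sorted(rows, key=lambda r: (r[0], r[1])):
        by_name[name].append((start, end))
    suspects = []
    for name, ivals in by_name.items():
        prev_end = None
        for start, end in ivals:
            if prev_end is not None and start < prev_end:
                suspects.append((name, start, end))
            prev_end = end
    return suspects

rows = [
    ("Downtime", 100, 900),   # placeholder epochs, not the real values
    ("Downtime", 200, 900),   # starts before the previous row ends -> suspect
    ("resa_dbd", 300, 400),   # disjoint interval, fine
]
print(find_suspect_rows(rows))  # → [('Downtime', 200, 900)]
```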
Oh yes, you're right: I ran scontrol show res but not sacctmgr show res.
That corresponds to the day we upgraded, so I guess this is a duplicate
of Bug 5166.
In the procedure, you said:
1) stop slurmdbd
2) update <cluster>_resv_table set unused_wall=0,time_end=<> where id_resv=<resv_id> and time_start=<>; (fix the reservation record)
The next steps are only required to recalculate usage records:
3) update <cluster>_last_ran_table set hourly_rollup=<time_start>,daily_rollup=<time_start>,monthly_rollup=<time_start>; (force a rollup on slurmdbd start)
4) start slurmdbd
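Step 2 of the quoted procedure can be sketched against a stand-in table. This is a minimal sketch only: in-memory SQLite replaces MySQL, the schema is cut down to the columns the fix touches, and every value is a placeholder, not a real record from this cluster.

```python
import sqlite3

# In-memory SQLite stands in for the slurmdbd MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tars_resv_table
                (id_resv INTEGER, time_start INTEGER,
                 time_end INTEGER, unused_wall REAL)""")
# A broken row: time_end far in the future, a huge unused_wall.
conn.execute("INSERT INTO tars_resv_table VALUES (42, 1000, 99999999, 9.3e6)")

# Step 2: zero unused_wall and set the real end time, keyed on
# id_resv and time_start exactly as in the quoted UPDATE.
conn.execute("""UPDATE tars_resv_table
                   SET unused_wall = 0, time_end = ?
                 WHERE id_resv = ? AND time_start = ?""",
             (2000, 42, 1000))

row = conn.execute("SELECT time_end, unused_wall FROM tars_resv_table").fetchone()
print(row)
```

After the UPDATE, the row carries the corrected end time and a zeroed unused_wall, which is what the subsequent forced rollup recalculates usage from.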
For 2), we must use the time_start and time_end of the listed reservations, but for
3), should we use that same time_start?
Let me know if I should ask my question on Bug 5166 instead, since this one is a duplicate.
Thanks,
Hi,

It looks like you have more than one invalid row. Could you send me the output of this MySQL query?

select * from tars_resv_table where resv_name="Downtime";

If you can wait a few days, I will provide a more automated procedure for you. I would like to close this ticket as a duplicate of bug 5166.

Dominik

Created attachment 7159 [details]
output of select * from tars_resv_table where resv_name="Downtime";

> It looks like you have more than one invalid row.
> Could you send me the output of this MySQL query?
>
> select * from tars_resv_table where resv_name="Downtime";

I've put the result in the attachment.

> If you can wait a few days, I will provide a more automated procedure for you.

Oh great, thank you!

> I would like to close this ticket as a duplicate of bug 5166.

Yes, sure, proceed, and sorry again for the duplication.

Hi,

Thanks for the records. No problem.

Dominik

*** This ticket has been marked as a duplicate of ticket 5166 ***

Hello,

I don't know if I should write on this ticket since it is a duplicate, but I don't want to pollute bug 5166. I applied the procedure given by Dominik 3 days ago.

In slurmdbd.conf, we have PurgeResvAfter=3months. Thus, I guess that rows like

Cluster Name TRES TimeStart TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
tars Downtime cpu=4852 2018-03-08T17:37:42 2019-03-07T19:37:37 9.3253e+06

had already been purged before I applied the procedure, since they were more than 3 months old.
The other reservations seem to have a correct end time:

$ sacctmgr show reservation start=03/07/18
Cluster Name TRES TimeStart TimeEnd UnusedWall
---------- --------------- ------------------------------ ------------------- ------------------- ----------
tars resa_dbd cpu=4852 2018-04-03T11:03:30 2018-04-03T11:32:47 1.7570e+03
tars resa_patch cpu=4852 2018-05-04T11:48:28 2018-05-04T11:52:33 245.000000
tars resa_patch cpu=4852 2018-05-04T11:52:33 2018-05-04T12:02:14 581.000000
tars resa_patch cpu=4852 2018-05-24T09:51:16 2018-05-24T09:52:54 98.000000
tars resa_patch cpu=4852 2018-05-24T09:52:54 2018-05-24T09:59:51 417.000000
tars resa_patch cpu=4852 2018-06-01T13:21:11 2018-06-01T13:24:57 226.000000
tars resa_patch cpu=4852 2018-06-01T13:24:57 2018-06-01T13:27:51 174.000000
tars resa_patch cpu=4852 2018-06-01T17:00:02 2018-06-01T17:01:55 113.000000
tars resa_patch cpu=4852 2018-06-01T17:01:55 2018-06-01T17:03:56 121.000000
tars resa_patch cpu=4852 2018-06-05T13:22:57 2018-06-05T13:24:25 88.000000
tars resa_patch cpu=4852 2018-06-05T13:24:25 2018-06-05T13:27:09 164.000000
tars resa_patch cpu=4852 2018-06-05T13:27:09 2018-06-05T15:25:30 7.1010e+03
tars resa_db cpu=4852 2018-06-07T13:43:18 2018-06-07T15:05:18 4.9200e+03
tars resa_patch cpu=4852 2018-06-12T16:30:59 2018-06-12T16:34:12 193.000000
tars resa_patch cpu=4852 2018-06-12T16:34:12 2018-06-12T16:37:06 174.000000
tars resa_patch cpu=4852 2018-06-14T16:06:08 2018-06-14T16:09:03 175.000000
tars resa_patch cpu=4852 2018-06-14T16:09:03 2018-06-14T16:12:26 203.000000

However, we still have error messages in the slurmdbd logs:

2018-07-03T15:29:17+02:00 tars-acct slurmdbd[8594]: slurmdbd version 17.11.7 started
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (25116535 > 17467200) for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 1
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (17467200+187200+0)(17654400) > 17467200 for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 1
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (190150683896 > 142360455600) for cluster tars(39544571) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 2
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (142360455600+888768000+0)(143249223600) > 142360455600 for cluster tars(39544571) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 2
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more allocated time than is possible (25112935 > 17467200) for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 5
2018-07-03T16:00:00+02:00 tars-acct slurmdbd[8594]: error: We have more time than is possible (17467200+187200+0)(17654400) > 17467200 for cluster tars(4852) from 2018-07-03T15:00:00 - 2018-07-03T16:00:00 tres 5

Since the Downtime reservations had already been archived and purged, were we supposed to still see these error messages? I guess the procedure wasn't able to modify anything about them, since they were no longer in the database when I applied it.

Thanks in advance,

Hi,

Did you force the rollup like this?

1) stop slurmdbd
2) update <cluster>_last_ran_table set hourly_rollup=<time_start>,daily_rollup=<time_start>,monthly_rollup=<time_start>; (force a rollup on slurmdbd start)
3) start slurmdbd

Could you send me the output of this MySQL query?

select * from tars_resv_table;

Dominik

Oh no, you're right, I completely missed that step. My apologies, but what should I put for time_start?

Thanks,

Hi,

Because some of the reservation usage records are already gone, we have no way to correctly recalculate all usage. If we still had all the reservation data, time_start should be set to the start of the "Downtime" reservation. In this case, this will cause a decrease in usage.
Dominik

Created attachment 7258 [details]
select_*_from_cluster_resv_table before and after the rollup

Hello,

I've attached the result of select * from tars_resv_table; before and after the rollup, done with time_start set to 1520447857, the value from the first "Downtime" line (attachment 7159 [details]). No change, as expected.

Here is what we now have in the slurmdbd log:

2018-07-10T15:34:31+02:00 tars-acct slurmdbd[1893]: error: We have more allocated time than is possible (18325907 > 17467200) for cluster tars(4852) from 2018-04-26T01:00:00 - 2018-04-26T02:00:00 tres 5
2018-07-10T15:34:31+02:00 tars-acct slurmdbd[1893]: error: We have more time than is possible (17467200+86400+0)(17553600) > 17467200 for cluster tars(4852) from 2018-04-26T01:00:00 - 2018-04-26T02:00:00 tres 5

It seems worse than before the rollup: we have many more lines.

Thanks
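The rollup columns in *_last_ran_table hold Unix epoch seconds, so a timestamp as displayed by sacctmgr has to be converted back to an epoch before it can be used in the UPDATE. A small sketch of that conversion; the helper name is mine, and UTC is assumed here only to keep it deterministic, whereas the timestamps sacctmgr prints are in the cluster's local time zone, so a real conversion must use that zone:

```python
from datetime import datetime, timezone

def to_epoch_utc(stamp):
    """Convert an ISO-8601 timestamp such as sacctmgr prints,
    interpreted as UTC, to Unix epoch seconds."""
    return int(datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
               .replace(tzinfo=timezone.utc).timestamp())

start = to_epoch_utc("2018-03-08T17:37:42")
print(start)  # → 1520530662

# A value obtained this way is what would go into, e.g.:
#   update tars_last_ran_table set hourly_rollup=<epoch>,
#          daily_rollup=<epoch>, monthly_rollup=<epoch>;
```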
Hello,

We have been observing the following messages for a while now:

2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (24764191 > 17467200) for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 1
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (17467200+230400+0)(17697600) > 17467200 for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 1
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (181448290888 > 142360455600) for cluster tars(39544571) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 2
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (142360455600+1109952000+0)(143470407600) > 142360455600 for cluster tars(39544571) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 2
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more allocated time than is possible (24760591 > 17467200) for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 5
2018-06-22T11:00:00+02:00 tars-acct slurmdbd[5889]: error: We have more time than is possible (17467200+230400+0)(17697600) > 17467200 for cluster tars(4852) from 2018-06-22T10:00:00 - 2018-06-22T11:00:00 tres 5

Here is the output of sreport:

$ sreport cluster AccountUtilizationByUser start=06/22/18 end=now -t percent
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-06-22T00:00:00 - 2018-06-22T10:59:59 (39600 secs)
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
Cluster Account Login Proper Name Used Energy
--------- --------------- --------- --------------- ---------- --------
tars root 941.63% 0.00%
tars root root root 750.00% 0.00%
tars admin 50.04% 0.00%
tars admin root root 50.00% 0.00%

I'm not 100% sure, but it seems that the
problem appeared in the same circumstances Dominik used to reproduce it in Bug 5166: slurmctld was restarted multiple times while slurmdbd was down. Unlike Bug 5166, however, we don't have any reservations defined:

$ scontrol show res
No reservations in the system

So I guess the procedure provided by Dominik doesn't apply, or at least not to the <cluster>_resv_table. Could you tell us what we should do to fix the situation?

Thanks in advance,
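The figures in the error messages can be reproduced by hand: the rollup's ceiling for one hourly period is the TRES count times 3600 seconds, and the reported allocation exceeds it. A sketch of that arithmetic, using numbers taken from the log above:

```python
# Sanity-check the "more allocated time than is possible" figures.
cpus = 4852                  # cluster CPU count, from "tars(4852)" in the log
window = 3600                # one hourly rollup period, in seconds
limit = cpus * window
print(limit)                 # → 17467200, the ceiling quoted in the log

allocated = 24764191         # allocated core-seconds reported for tres 1
print(allocated > limit)     # → True: hence the error message

# The "more time than is possible" line adds reserved (230400) and
# idle (0) time to the allocation ceiling:
print(limit + 230400 + 0)    # → 17697600, again above the ceiling
```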