| Summary: | Event purged while should not | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Regine Gaudin <regine.gaudin> |
| Component: | slurmdbd | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | jbooth |
| Version: | 22.05.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CEA | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Hi Regine,

If I understand correctly, the problem is totally on the purging of the Events, and not on how sacct -N behaves.
I'm investigating it, but if you can share your slurmdbd logs maybe they may help us.

Thanks,
Albert

> the problem is totally on the purging of the Events
Yes, I think so.
> I'm investigating it, but if you can share your slurmdbd logs maybe they may help us.
That is difficult: this is a sensitive site.

It seems that the purge age for events is not always respected, above all when the start of the event happens before the age date and the end after the age date.
For instance, today is the 23rd of March and the two following events have been purged. Why?
machine1239 DOWN* start 2023-02-10T15:27 end 2023-03-15T17:03
machine2108 DRAIN start 2022-09-22T10:53 end 2023-03-14T14:22
with
ArchiveEvents=yes
PurgeEventAfter=31days

Hi Regine,
> It seems that the purge age for events is not always respected, above all
> when the start of the event happens before the age date and the end after
> the age date.
I think that you are totally right.
The good news is that this has already been fixed in bug 13857 comment 29 as part of the 23.02 release.
Regards,
Albert

Is it possible to have apath for Slurm 22.05.7, as we upgraded just two weeks ago?

I mean a patch

Hi Regine,
> Is it possible to have apath for Slurm 22.05.7, as we upgraded just two
> weeks ago?
Unfortunately we cannot validate such a patch, so we cannot recommend it.
But the good news is that you only need to upgrade slurmdbd to fix the issue, so maybe you can do it earlier than you planned and keep the rest of the cluster on 22.05 until you can upgrade it.
Anyway, if this is ok for you I'm closing this ticket as infogiven.
But please don't hesitate to reopen it if you need further support.
Regards,
Albert
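
The purge condition discussed in this thread can be illustrated with a small sketch. It is not slurmdbd code: the two rules `purged_by_start` and `purged_by_end` are hypothetical names introduced here, and the cutoff is simplified to "now minus 31 days", which may differ from how slurmdbd actually computes it. The dates and the two events are the ones quoted in the report.

```python
from datetime import datetime, timedelta

# Values from the report; the 31-day window mirrors PurgeEventAfter=31days.
NOW = datetime(2023, 3, 23)
PURGE_AFTER = timedelta(days=31)
# Simplified assumption: the real cutoff computation in slurmdbd may differ.
CUTOFF = NOW - PURGE_AFTER

# (node, state, start, end) for the two events reported as purged.
EVENTS = [
    ("machine1239", "DOWN*", datetime(2023, 2, 10, 15, 27), datetime(2023, 3, 15, 17, 3)),
    ("machine2108", "DRAIN", datetime(2022, 9, 22, 10, 53), datetime(2023, 3, 14, 14, 22)),
]


def purged_by_start(start, end, cutoff=CUTOFF):
    """Hypothetical rule consistent with the observed behaviour: key on start."""
    return start < cutoff


def purged_by_end(start, end, cutoff=CUTOFF):
    """Hypothetical expected rule: purge only events that ended before the cutoff."""
    return end is not None and end < cutoff


for node, state, start, end in EVENTS:
    print(node, state,
          "start-rule:", purged_by_start(start, end),
          "end-rule:", purged_by_end(start, end))
```

Both events start before the cutoff but end after it, so a rule keyed on the start time removes them while a rule keyed on the end time keeps them, which matches the symptom described by the reporter.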
sacct -N run on February 10th gave incoherent results because of a bad purge of an event with start=2022-11-28 and end=February 07.
All jobs before February 07 showed the same problem.
It seems that events which did not begin less than one month ago, or did not end more than one month ago, are not handled correctly.

On February 10th:

sacct -j 8053132 -o nodelist,start,end
NodeList Start End
--------------- ------------------- -------------------
machine[4860-486+ 2023-02-07T07:59:29 2023-02-08T07:59:03
machine4860 2023-02-07T07:59:29 2023-02-08T07:59:03
machine[4860-486+ 2023-02-07T07:59:29 2023-02-08T07:59:01
machine[4860-486+ 2023-02-07T08:04:45 2023-02-08T07:59:01

while

sacct -S 2023-02-07T07:00 -E 2023-02-08T08:00 -N machine4860 -o jobid,nodelist%50
JobID NodeList
--------------- ------------------- -------------------

remained empty.

After investigation: sacct -N issues a request to the event table to get the nodes that were in the cluster at the time of the job, but the event covering that date had been purged and should not have been.

mysql on the live database (not purged) gives:

select cluster_nodes, time_start, time_end from machine_event_table where node_name='' && cluster_nodes !='';
+------------------------------------------------------------------------------------------+------------+------------+
| cluster_nodes                                                                            | time_start | time_end   |
+------------------------------------------------------------------------------------------+------------+------------+
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1675783240 | 1675783277 |
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1675783277 | 1676279185 |
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1676279185 | 1676279202 |
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1676279202 |          0 |

mysql on the archive (purged) gives:

select cluster_nodes, time_start, time_end from machine_ppi_event_table where node_name='' && cluster_nodes !='';
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1669632382 | 1675783240 |

bash-4.2$ date --date='@1669632382'
lun. nov. 28 11:46:22 CET 2022
bash-4.2$ date --date='@1675783240'
mar. févr. 7 16:20:40 CET 2023

For instance, today is the 23rd of March and the two following events have been purged. Why?
machine1239 DOWN* start 2023-02-10T15:27 end 2023-03-15T17:03
machine2108 DRAIN start 2022-09-22T10:53 end 2023-03-14T14:22

We are using:
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes
PurgeEventAfter=31days
PurgeJobAfter=93days
PurgeResvAfter=31days
PurgeStepAfter=93days
PurgeSuspendAfter=93days

We could set everything to 93 days, but the problem might be the same...
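
The lookup problem described above can be checked with a short sketch. It assumes, as a simplification, that sacct -N needs a cluster_nodes record whose time interval contains the job's start time; the real query is not shown in this report. The helper `covering_rows` and the fixed UTC+1 offset for CET are assumptions made for illustration, while the epoch values and the job start time are copied from the output above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC+1 offset, matching the CET shown by `date` in the report.
CET = timezone(timedelta(hours=1))

# (time_start, time_end) of the cluster_nodes rows quoted above; 0 means still open.
LIVE_ROWS = [
    (1675783240, 1675783277),
    (1675783277, 1676279185),
    (1676279185, 1676279202),
    (1676279202, 0),
]
ARCHIVED_ROWS = [(1669632382, 1675783240)]


def covering_rows(rows, when):
    """Return the rows whose [time_start, time_end) interval contains `when`."""
    return [(s, e) for s, e in rows if s <= when and (e == 0 or when < e)]


# Start time of job 8053132 as printed by sacct, converted to an epoch value.
job_start = int(datetime(2023, 2, 7, 7, 59, 29, tzinfo=CET).timestamp())

print("job start epoch:", job_start)
print("live table match:", covering_rows(LIVE_ROWS, job_start))
print("archive match:", covering_rows(ARCHIVED_ROWS, job_start))
```

Under these assumptions the only record covering the job's start is the one that was moved to the archive, so a lookup against the live table finds nothing for that time.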