sacct -N runs made on February 10th gave incoherent results due to a bad purge of an event with start=2022-11-28 and end=February 07. All jobs before February 07 had the same problem. It seems that events which started more than one month ago but ended less than one month ago are not handled correctly.

On February 10th:

sacct -j 8053132 -o nodelist,start,end
       NodeList               Start                 End
--------------- ------------------- -------------------
machine[4860-486+ 2023-02-07T07:59:29 2023-02-08T07:59:03
    machine4860 2023-02-07T07:59:29 2023-02-08T07:59:03
machine[4860-486+ 2023-02-07T07:59:29 2023-02-08T07:59:01
machine[4860-486+ 2023-02-07T08:04:45 2023-02-08T07:59:01

while:

sacct -S 2023-02-07T07:00 -E 2023-02-08T08:00 -N machine4860 -o jobid,nodelist%50
          JobID                                           NodeList
--------------- --------------------------------------------------

remained empty.

After investigation: sacct -N issues a query to find the nodes that were in the cluster at the time of the job, using the event table, but the event covering that date had been purged and should not have been.

MySQL on the live database (not purged) gives:

select cluster_nodes, time_start, time_end from machine_event_table where node_name='' && cluster_nodes !='';
+----------------------------------------------------------------------------------------+------------+------------+
| cluster_nodes                                                                          | time_start | time_end   |
+----------------------------------------------------------------------------------------+------------+------------+
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1675783240 | 1675783277 |
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1675783277 | 1676279185 |
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1676279185 | 1676279202 |
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1676279202 |          0 |
+----------------------------------------------------------------------------------------+------------+------------+

MySQL on the archive database (purged) gives:

select cluster_nodes, time_start, time_end from machine_ppi_event_table where node_name='' && cluster_nodes !='';
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1669632382 | 1675783240 |

bash-4.2$ date --date='@1669632382'
Mon Nov 28 11:46:22 CET 2022
bash-4.2$ date --date='@1675783240'
Tue Feb  7 16:20:40 CET 2023

For instance, today is the 23rd of March and the two following events have been purged. Why?

machine1239 DOWN*  start 2023-02-10T15:27  end 2023-03-15T17:03
machine2108 DRAIN  start 2022-09-22T10:53  end 2023-03-14T14:22

We are using:

ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes
PurgeEventAfter=31days
PurgeJobAfter=93days
PurgeResvAfter=31days
PurgeStepAfter=93days
PurgeSuspendAfter=93days

We could set everything to 93days, but the problem would likely be the same.
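The rule we would expect the purge to follow can be sketched as a small script (illustrative only, not the actual slurmdbd code; the date and the epoch value are taken from the report above): with PurgeEventAfter=31days, an event should be purged only once it has ended AND its end time is older than the 31-day cutoff. An open event (time_end=0), or one that merely started before the cutoff, should be kept.

```shell
#!/usr/bin/env bash
# Sketch of the expected purge rule (NOT the real slurmdbd logic).

now=$(date -d '2023-02-10' +%s)        # the day the sacct queries were run
cutoff=$(( now - 31 * 24 * 3600 ))     # PurgeEventAfter=31days

# Purge only events that have ended (time_end != 0) before the cutoff.
should_purge() {
    local time_end=$1
    [ "$time_end" -ne 0 ] && [ "$time_end" -lt "$cutoff" ]
}

# The archived event above ended at 1675783240 (Feb 7), only 3 days
# before the queries, so under this rule it must be kept.
if should_purge 1675783240; then
    echo "purge"
else
    echo "keep"
fi
# prints: keep
```

Under this rule the archived row (ending 2023-02-07) and both March events would all still be in the live event table.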
Hi Regine,

If I understand correctly, the problem is entirely in the purging of events, not in how sacct -N behaves. I'm investigating it, but if you can share your slurmdbd logs, they may help us.

Thanks,
Albert
> the problem is totally on the purging of the Events

Yes, I think so.

> I'm investigating it, but if you can share your slurmdbd logs maybe they may
> help us.

Difficult; it is a sensitive site.

It seems that the purge age for events is not always respected, in particular when the start of the event happens before the age date and the end after it. For instance, today is the 23rd of March and the two following events have been purged. Why?

machine1239 DOWN*  start 2023-02-10T15:27  end 2023-03-15T17:03
machine2108 DRAIN  start 2022-09-22T10:53  end 2023-03-14T14:22

with ArchiveEvents=yes and PurgeEventAfter=31days.
Hi Regine,

> It seems that the purge age for events is not always respected, in particular
> when the start of the event happens before the age date and the end after it.

I think that you are totally right. The good news is that this has already been fixed in bug 13857 comment 29, as part of the 23.02 release.

Regards,
Albert
Is it possible to have a patch for Slurm 22.05.7, as we just upgraded two weeks ago?
Hi Regine,

> Is it possible to have a patch for Slurm 22.05.7, as we just upgraded two
> weeks ago?

Unfortunately we cannot validate such a patch, so we cannot recommend it. The good news is that you only need to upgrade slurmdbd to fix the issue, so perhaps you can do that earlier than you had planned and keep the rest of the cluster on 22.05 until you can upgrade it.

If this is OK for you, I'm closing this ticket as infogiven. Please don't hesitate to reopen it if you need further support.

Regards,
Albert
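For context on why upgrading only slurmdbd is viable: Slurm's documented upgrade order is slurmdbd first, then slurmctld, then slurmd, and slurmdbd supports daemons up to two major releases older than itself. A rough sketch of that compatibility check (the release list and version pairing are illustrative assumptions, not taken from this ticket):

```shell
#!/usr/bin/env bash
# Sketch: can a newer slurmdbd serve an older slurmctld/slurmd?
# Slurm's documented rule: slurmdbd may be up to two major releases
# ahead of the other daemons. Release list is an assumption for
# illustration.

releases=(20.02 20.11 21.08 22.05 23.02)

index_of() {
    local v=$1 i
    for i in "${!releases[@]}"; do
        [ "${releases[$i]}" = "$v" ] && { echo "$i"; return; }
    done
    echo -1
}

dbd=$(index_of 23.02)     # slurmdbd upgraded first
ctld=$(index_of 22.05)    # rest of the cluster stays on 22.05

if [ $(( dbd - ctld )) -ge 0 ] && [ $(( dbd - ctld )) -le 2 ]; then
    echo "compatible"
else
    echo "incompatible"
fi
# prints: compatible
```

So a 23.02 slurmdbd (one release ahead) can serve a 22.05 cluster until the rest of the upgrade is scheduled.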