Ticket 16347

Summary:	Events purged when they should not be
Product:	Slurm	Reporter:	Regine Gaudin <regine.gaudin>
Component:	slurmdbd	Assignee:	Albert Gil <albert.gil>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	jbooth
Version:	22.05.7
Hardware:	Linux
OS:	Linux
Site:	CEA	Slinky Site:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---

Description Regine Gaudin 2023-03-23 09:32:28 MDT

sacct -N, run on February 10th, returned inconsistent results because an event with start=2022-11-28 and end=February 7th had been wrongly purged.
All jobs before February 7th showed the same problem.

It seems that events which began more than one month ago but ended less than
one month ago are not handled correctly.


On February 10th:
sacct -j 8053132 -o nodelist,start,end

       NodeList               Start                 End
--------------- ------------------- -------------------
machine[4860-486+ 2023-02-07T07:59:29 2023-02-08T07:59:03
   machine4860 2023-02-07T07:59:29 2023-02-08T07:59:03
machine[4860-486+ 2023-02-07T07:59:29 2023-02-08T07:59:01
machine[4860-486+ 2023-02-07T08:04:45 2023-02-08T07:59:01

while

sacct -S 2023-02-07T07:00 -E 2023-02-08T08:00 -N machine4860 -o jobid,nodelist%50

       JobID NodeList
--------------- ------------------- -------------------
remained empty


After investigation: sacct -N queries the event table to find the nodes that were in the cluster at the time of the job, but the event covering that date had been purged when it should not have been:

Querying MySQL on the live database (not purged) gives:
select cluster_nodes, time_start, time_end from machine_event_table where node_name='' && cluster_nodes !='';
+----------------------------------------------------------------------------------------+------------+------------+
| cluster_nodes                                                                          | time_start | time_end   |
+----------------------------------------------------------------------------------------+------------+------------+
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1675783240 | 1675783277 |
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1675783277 | 1676279185 |
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1676279185 | 1676279202 |
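| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1676279202 |          0 |
+----------------------------------------------------------------------------------------+------------+------------+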


Querying MySQL on the archive database (purged events) gives:

select cluster_nodes, time_start, time_end from machine_ppi_event_table where node_name='' && cluster_nodes !='';
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1669632382 | 1675783240 |

bash-4.2$ date --date='@1669632382'
Mon Nov 28 11:46:22 CET 2022

bash-4.2$ date --date='@1675783240'
Tue Feb  7 16:20:40 CET 2023
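
To make the link between these timestamps and the empty sacct -N output explicit, here is a minimal Python sketch. It assumes that the sacct Start time is local time (CET), that time_end = 0 means the event is still open, and it uses a fixed +01:00 offset, which only holds for these winter dates. The names to_cet, cluster_events and covering_event are hypothetical helpers for illustration and not part of Slurm.

```python
from datetime import datetime, timedelta, timezone

# Fixed +01:00 offset standing in for CET (assumption: winter dates, no DST).
CET = timezone(timedelta(hours=1), "CET")

def to_cet(epoch):
    """Convert a Unix timestamp to a timezone-aware CET datetime."""
    return datetime.fromtimestamp(epoch, tz=CET)

# (time_start, time_end, location) rows copied from the two queries above.
cluster_events = [
    (1669632382, 1675783240, "archive"),
    (1675783240, 1675783277, "live"),
    (1675783277, 1676279185, "live"),
    (1676279185, 1676279202, "live"),
    (1676279202, 0, "live"),  # time_end == 0: assumed to mean "still open"
]

def covering_event(epoch):
    """Return the cluster event whose [time_start, time_end) contains epoch."""
    for start, end, location in cluster_events:
        if start <= epoch and (end == 0 or epoch < end):
            return start, end, location
    return None

# Start time of job 8053132 as reported by sacct, assumed to be CET.
job_start_epoch = int(datetime(2023, 2, 7, 7, 59, 29, tzinfo=CET).timestamp())

print(to_cet(1669632382).isoformat())
print(to_cet(1675783240).isoformat())
print(covering_event(job_start_epoch))
```

Under these assumptions, the only cluster event that covers the job start is the one that now exists solely in the archive, which is consistent with sacct -N finding nothing in the live database.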


For instance, today is March 23rd and the following two events
have already been purged. Why?
 machine1239 DOWN* start 2023-02-10T15:27 end 2023-03-15T17:03
 machine2108 DRAIN start 2022-09-22T10:53 end 2023-03-14T14:22 


We are using the following slurmdbd.conf settings:
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes
PurgeEventAfter=31days
PurgeJobAfter=93days
PurgeResvAfter=31days
PurgeStepAfter=93days
PurgeSuspendAfter=93days
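
The arithmetic behind the question can be sketched as follows. This is only an illustration and not Slurm's purge code: it assumes the cutoff is simply "now minus PurgeEventAfter", and the two predicates purged_by_start and purged_by_end are hypothetical, written only to contrast a start-based criterion with an end-based one.

```python
from datetime import datetime, timedelta

# Assumption: the cutoff is simply "now minus PurgeEventAfter".
now = datetime(2023, 3, 23, 9, 32)   # roughly when this ticket was filed
purge_after = timedelta(days=31)     # PurgeEventAfter=31days
cutoff = now - purge_after           # falls on 2023-02-20

# (node, start, end) for the two purged events listed above.
events = [
    ("machine1239", datetime(2023, 2, 10, 15, 27), datetime(2023, 3, 15, 17, 3)),
    ("machine2108", datetime(2022, 9, 22, 10, 53), datetime(2023, 3, 14, 14, 22)),
]

def purged_by_start(start, end):
    """Hypothetical criterion: purge when the event STARTED before the cutoff."""
    return start < cutoff

def purged_by_end(start, end):
    """Hypothetical criterion: purge when the event ENDED before the cutoff."""
    return end < cutoff

for node, start, end in events:
    print(node, "start-based:", purged_by_start(start, end),
          "end-based:", purged_by_end(start, end))
```

With these numbers, both events started before the cutoff but ended after it, so a start-based criterion would purge them while an end-based one would keep them.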


We could set everything to 93days, but the problem would probably remain the same...

Comment 1 Albert Gil 2023-03-24 08:41:41 MDT

Hi Regine,

If I understand correctly, the problem lies entirely in the purging of the events, and not in how sacct -N behaves.
I'm investigating it, but if you can share your slurmdbd logs, they may help us.

Thanks,
Albert

Comment 2 Regine Gaudin 2023-03-24 08:49:55 MDT

> the problem is totally on the purging of the Events
Yes, I think so.

> I'm investigating it, but if you can share your slurmdbd logs maybe they may help us.
That is difficult: this is a sensitive site.

It seems that the purge age for events is not always respected, above all when
the event starts before the purge cutoff date and ends after it.

For instance, today is March 23rd and the following two events
have already been purged. Why?
 machine1239 DOWN* start 2023-02-10T15:27 end 2023-03-15T17:03
 machine2108 DRAIN start 2022-09-22T10:53 end 2023-03-14T14:22 
with
ArchiveEvents=yes
PurgeEventAfter=31days

Comment 4 Albert Gil 2023-03-24 09:01:42 MDT

Hi Regine,

> It seems that the purge age for events is not always respected, above all when
> the event starts before the purge cutoff date and ends after it.

I think you are totally right.
The good news is that this has already been fixed in bug 13857 comment 29 as part of the 23.02 release.

Regards,
Albert

Comment 5 Regine Gaudin 2023-03-24 09:06:37 MDT

Is it possible to have apath for Slurm 22.05.7, as we just upgraded two weeks ago?

Comment 6 Regine Gaudin 2023-03-24 09:07:25 MDT

I mean a patch

Comment 9 Albert Gil 2023-03-28 08:32:37 MDT

Hi Regine,

> Is it possible to have apath for Slurm 22.05.7, as we just upgraded two
> weeks ago?

Unfortunately, we cannot validate such a patch, so we cannot recommend it.
But the good news is that you only need to upgrade slurmdbd to fix the issue, so maybe you can do that earlier than planned and keep the rest of the cluster on 22.05 until you can upgrade it.

Anyway, if this is ok for you I'm closing this ticket as infogiven.
But please don't hesitate to reopen it if you need further support.

Regards,
Albert