sacct -N runs made on February 10th gave incoherent results due to a bad purge of an event with start=2022-11-28 and end=February 07. All jobs before February 07 had the same problem. It seems that events which started more than one month ago but ended less than one month ago are not handled correctly.

On February 10th:

sacct -j 8053132 -o nodelist,start,end
       NodeList               Start                 End
--------------- ------------------- -------------------
machine[4860-486+ 2023-02-07T07:59:29 2023-02-08T07:59:03
    machine4860 2023-02-07T07:59:29 2023-02-08T07:59:03
machine[4860-486+ 2023-02-07T07:59:29 2023-02-08T07:59:01
machine[4860-486+ 2023-02-07T08:04:45 2023-02-08T07:59:01

while:

sacct -S 2023-02-07T07:00 -E 2023-02-08T08:00 -N machine4860 -o jobid,nodelist%50
          JobID                                           NodeList
--------------- --------------------------------------------------

remained empty.

After investigation: sacct -N issues a query to find the nodes that were in the cluster at the time of the job, using the event table, but the event covering that date had been purged and should not have been.

MySQL on the live database (not purged) gives:

select cluster_nodes, time_start, time_end from machine_event_table where node_name='' && cluster_nodes !='';
+----------------------------------------------------------------------------------------+------------+------------+
| cluster_nodes                                                                          | time_start | time_end   |
+----------------------------------------------------------------------------------------+------------+------------+
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1675783240 | 1675783277 |
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1675783277 | 1676279185 |
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1676279185 | 1676279202 |
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1676279202 |          0 |
+----------------------------------------------------------------------------------------+------------+------------+

MySQL on the archive database (purged) gives:

select cluster_nodes, time_start, time_end from machine_ppi_event_table where node_name='' && cluster_nodes !='';
| machine[1000-2655,4000-6291,7000-7019,7050-7081,7100-7129,7200-7201,8000-8004,9100-9179] | 1669632382 | 1675783240 |

bash-4.2$ date --date='@1669632382'
Mon Nov 28 11:46:22 CET 2022
bash-4.2$ date --date='@1675783240'
Tue Feb  7 16:20:40 CET 2023

For instance, today is the 23rd of March and the two following events have been purged. Why?

machine1239 DOWN*  start 2023-02-10T15:27  end 2023-03-15T17:03
machine2108 DRAIN  start 2022-09-22T10:53  end 2023-03-14T14:22

We are using:

ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes
PurgeEventAfter=31days
PurgeJobAfter=93days
PurgeResvAfter=31days
PurgeStepAfter=93days
PurgeSuspendAfter=93days

We could set everything to 93days, but the problem would likely be the same.
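The rule we would expect the purge to follow can be sketched as a small script (illustrative only, not the actual slurmdbd code; the date and the epoch value are taken from the report above): with PurgeEventAfter=31days, an event should be purged only once it has ended AND its end time is older than the 31-day cutoff. An open event (time_end=0), or one that merely started before the cutoff, should be kept.

```shell
#!/usr/bin/env bash
# Sketch of the expected purge rule (NOT the real slurmdbd logic).

now=$(date -d '2023-02-10' +%s)        # the day the sacct queries were run
cutoff=$(( now - 31 * 24 * 3600 ))     # PurgeEventAfter=31days

# Purge only events that have ended (time_end != 0) before the cutoff.
should_purge() {
    local time_end=$1
    [ "$time_end" -ne 0 ] && [ "$time_end" -lt "$cutoff" ]
}

# The archived event above ended at 1675783240 (Feb 7), only 3 days
# before the queries, so under this rule it must be kept.
if should_purge 1675783240; then
    echo "purge"
else
    echo "keep"
fi
# prints: keep
```

Under this rule the archived row (ending 2023-02-07) and both March events would all still be in the live event table.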
Hi Regine,

If I understand correctly, the problem is entirely in the purging of events, not in how sacct -N behaves. I'm investigating it, but if you can share your slurmdbd logs, they may help us.

Thanks,
Albert
> the problem is totally on the purging of the Events

Yes, I think so.

> I'm investigating it, but if you can share your slurmdbd logs maybe they may
> help us.

Difficult; it is a sensitive site.

It seems that the purge age for events is not always respected, in particular when the start of the event happens before the age date and the end after it. For instance, today is the 23rd of March and the two following events have been purged. Why?

machine1239 DOWN*  start 2023-02-10T15:27  end 2023-03-15T17:03
machine2108 DRAIN  start 2022-09-22T10:53  end 2023-03-14T14:22

with ArchiveEvents=yes and PurgeEventAfter=31days.
Hi Regine,

> It seems that the purge age for events is not always respected, in particular
> when the start of the event happens before the age date and the end after it.

I think that you are totally right. The good news is that this has already been fixed in bug 13857 comment 29, as part of the 23.02 release.

Regards,
Albert
Is it possible to have a patch for Slurm 22.05.7, as we just upgraded two weeks ago?
Hi Regine,

> Is it possible to have a patch for Slurm 22.05.7, as we just upgraded two
> weeks ago?

Unfortunately we cannot validate such a patch, so we cannot recommend it. The good news is that you only need to upgrade slurmdbd to fix the issue, so perhaps you can do that earlier than you had planned and keep the rest of the cluster on 22.05 until you can upgrade it.

If this is OK for you, I'm closing this ticket as infogiven. Please don't hesitate to reopen it if you need further support.

Regards,
Albert
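For context on why upgrading only slurmdbd is viable: Slurm's documented upgrade order is slurmdbd first, then slurmctld, then slurmd, and slurmdbd supports daemons up to two major releases older than itself. A rough sketch of that compatibility check (the release list and version pairing are illustrative assumptions, not taken from this ticket):

```shell
#!/usr/bin/env bash
# Sketch: can a newer slurmdbd serve an older slurmctld/slurmd?
# Slurm's documented rule: slurmdbd may be up to two major releases
# ahead of the other daemons. Release list is an assumption for
# illustration.

releases=(20.02 20.11 21.08 22.05 23.02)

index_of() {
    local v=$1 i
    for i in "${!releases[@]}"; do
        [ "${releases[$i]}" = "$v" ] && { echo "$i"; return; }
    done
    echo -1
}

dbd=$(index_of 23.02)     # slurmdbd upgraded first
ctld=$(index_of 22.05)    # rest of the cluster stays on 22.05

if [ $(( dbd - ctld )) -ge 0 ] && [ $(( dbd - ctld )) -le 2 ]; then
    echo "compatible"
else
    echo "incompatible"
fi
# prints: compatible
```

So a 23.02 slurmdbd (one release ahead) can serve a 22.05 cluster until the rest of the upgrade is scheduled.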