Ticket 15749 - not enough job data stored in slurm database
Summary: not enough job data stored in slurm database
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 22.05.3
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
Reported: 2023-01-06 09:32 MST by Xing Huang
Modified: 2023-01-09 14:20 MST
Site: WA St. Louis


Description Xing Huang 2023-01-06 09:32:29 MST
Dear support,

I found that there is a limit on how far back I can extract job data with the "sreport" or "sacct" commands.
How can I change the default limit? For example, if I want all data to be kept for two years, so that at any time I can extract data going back two years, how can I do that? Thanks!

Best,
Xing
Comment 1 Benny Hedayati 2023-01-09 11:19:59 MST
Hi,

In order to see job information going back over longer periods, you must supply start and end time parameters.

For example, in the case of sacct the following command would allow you to see job info going back to 2021:

sacct -a -S 2021-01-09T00:00:00 -E 2023-01-09T00:00:00

In the case of sreport the start= and end= parameters must be specified.  For example this command is valid:

sreport cluster accountutilizationbyuser start=2021-01-09T00:00:00 end=2023-01-09T00:00:00
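
As a rough sketch of the same idea, the two-year window can be computed rather than typed by hand. This assumes GNU date (for the "-d '2 years ago'" syntax); the helper name window_start is made up for illustration, and the commands are only printed so that they can be reviewed before being run on a cluster.

```shell
#!/bin/sh
# Sketch: build a start/end window covering the last two years.
# Assumes GNU date; window_start is a hypothetical helper, not a Slurm tool.

# Print the timestamp for midnight N years before today.
window_start() {
    date -d "$1 years ago" +%Y-%m-%dT00:00:00
}

start=$(window_start 2)
end=$(date +%Y-%m-%dT00:00:00)

# Print the commands instead of running them, so they can be checked first.
echo "sacct -a -S $start -E $end"
echo "sreport cluster accountutilizationbyuser start=$start end=$end"
```

Note that a wider window only helps if the records still exist in the database; it cannot bring back data that is no longer stored.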

Have you supplied these parameters?

Thanks
Comment 2 Xing Huang 2023-01-09 12:16:14 MST
The point is not that I did not supply the start and end points.
It is that the database only keeps a few months of data, not all of it.
[root@mgt ~]# sacct -a -S 2022-01-09T00:00:00 -E 2022-03-09T00:00:00
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
You can see that there is no data within this time period, which I believe was not stored by Slurm.
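
One way to check how far back the accounting database actually reaches is to list job start times over a deliberately wide window and take the oldest one. The sketch below assumes a POSIX shell; earliest_start is a hypothetical helper name, and the 2015-01-01 date in the usage comment is an arbitrary early date, not a value from this ticket.

```shell
#!/bin/sh
# Sketch: find the oldest job start time in sacct output read from stdin.
# earliest_start is a hypothetical helper, not part of Slurm.
earliest_start() {
    # Keep only lines that begin with a date (drops values such as "Unknown"),
    # then sort; ISO timestamps sort correctly as plain text.
    grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}' | sort | head -n 1
}

# Usage on a cluster (not run here):
#   sacct -a -X -n -P -o Start -S 2015-01-01 | earliest_start
```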

Best,
Xing
Comment 3 Benny Hedayati 2023-01-09 12:38:21 MST
OK, I see. Can you please send me the contents of your slurmdbd.conf? You can hide the database credentials like so:
  StoragePass=xxxx
  StorageUser=xxxx

I would like to see if there might be some type of archive purge enabled.
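
If sharing the whole file is a concern, only the retention-related lines are needed for this check. A minimal sketch, assuming the file lives at /etc/slurm/slurmdbd.conf (the path varies by installation) and using a hypothetical helper name, show_retention_settings:

```shell
#!/bin/sh
# Sketch: print only the active (uncommented) Purge*/Archive* settings from a
# slurmdbd.conf, so database credentials are never included in the output.
# show_retention_settings is a hypothetical helper, not part of Slurm.
show_retention_settings() {
    grep -Ei '^[[:space:]]*(Purge|Archive)[A-Za-z]*[[:space:]]*=' "$1"
}

# Usage (the path is an assumption; adjust it to your installation):
#   show_retention_settings /etc/slurm/slurmdbd.conf
```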

Thanks
Comment 4 Xing Huang 2023-01-09 13:13:07 MST
Here is my slurm.conf.

ClusterName=chpc3
ControlMachine=mgt
ControlAddr=mgt.cluster
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
GroupUpdateForce=0
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
PrologFlags=x11
#PluginDir=
#FirstJobId=
ReturnToService=2
MaxJobCount=50000
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
Prolog=/opt/slurm/prologue
Epilog=/opt/slurm/epilogue
JobSubmitPlugins=lua,require_timelimit
JobRequeue=0
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=600
SlurmdTimeout=600
MessageTimeout=100
TCPTimeout=600
BatchStartTimeout=600
MinJobAge=300
KillWait=30
InactiveLimit=0
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
### These two options will 'distribute' jobs on nodes - Xing 03/19/2021
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
###
PriorityType=priority/multifactor
PriorityDecayHalfLife=14
PriorityUsageResetPeriod=Monthly
# Fairshare Factor
PriorityWeightFairshare=10000
# Age Factor
PriorityWeightAge=5000
PriorityMaxAge=7-0
# Job Factor
PriorityFavorSmall=NO
PriorityWeightJobSize=2000
# Partition Factor
PriorityWeightPartition=1000
PriorityFlags=CALCULATE_RUNNING
# Job Array Size
MaxArraySize = 5000
#### This would fix the missing acl group in slurm
LaunchParameters=disable_send_gids
####
#
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
#SlurmdDebug=debug
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=mgt.cluster
#AccountingStorageLoc=
#AccountingStoragePass=
AccountingStorageTRES=cpu,mem,gres/gpu
AccountingStorageUser=slurm
AccountingStorageEnforce=limits,qos
#
# COMPUTE NODES
AccountingStoreFlags=job_comment
GresTypes=gpu,vmem
________________________________
From: bugs@schedmd.com <bugs@schedmd.com>
Sent: Monday, January 9, 2023 1:38 PM
To: Huang, Xing <x.huang@wustl.edu>
Subject: [Bug 15749] not enough job data stored in slurm database


* External Email - Caution *

Comment # 3<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.schedmd.com%2Fshow_bug.cgi%3Fid%3D15749%23c3&data=05%7C01%7Cx.huang%40wustl.edu%7Ced8007d1097f4aa7cc5408daf27912d7%7C4ccca3b571cd4e6d974b4d9beb96c6d6%7C0%7C0%7C638088899069021166%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=U4Wuw%2Fjf%2Fn5GywGHtpAKNv1Tgn%2FWRV2hBWqxN1%2FCSOA%3D&reserved=0> on bug 15749<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.schedmd.com%2Fshow_bug.cgi%3Fid%3D15749&data=05%7C01%7Cx.huang%40wustl.edu%7Ced8007d1097f4aa7cc5408daf27912d7%7C4ccca3b571cd4e6d974b4d9beb96c6d6%7C0%7C0%7C638088899069021166%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=%2B83pnEwYveGWsBy6whArcBP4MHXHUzIqrTb5tXaUV%2Fc%3D&reserved=0> from Benny Hedayati<mailto:benny@schedmd.com>

Ok I see, can you please send me the contents of you slurmdbd.conf.  You can
hide the the credentials for the database like so:
  StoragePass=xxxx
  StorageUser=xxxx

I would like to see if there might be some type of archive purge enabled.

Thanks

________________________________
You are receiving this mail because:

  *   You reported the bug.

________________________________
The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail.
Comment 5 Benny Hedayati 2023-01-09 13:22:36 MST
Thanks for that. Do you also have a slurmdbd.conf? It's a different file, in the /etc folder.
Comment 6 Xing Huang 2023-01-09 13:31:13 MST
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
#StorageHost=localhost
#StoragePort=1234
StoragePass=xxxx
StorageUser=slurm
StorageLoc=slurm_acct_db

PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month


I wonder whether it is related to the last five options, and which one I need to modify.

Best,
Xing
Comment 7 Benny Hedayati 2023-01-09 14:03:48 MST
Indeed, this is because you have set these options:

PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month

Records older than these ages are purged from the database. Also, since you have commented out the following:

#ArchiveJobs=yes

Jobs that are purged are not archived.  Unfortunately, we cannot do anything in this situation to recover older jobs unless you have some other form of backup.

For the future, I suggest extending the Purge options to a longer period and uncommenting the ArchiveJobs=yes option so that jobs are archived when they are purged.

Please refer to this documentation for more detailed information on archiving:

https://slurm.schedmd.com/slurmdbd.conf.html#OPT_ArchiveJobs

and this section:

https://slurm.schedmd.com/slurmdbd.conf.html#OPT_PurgeEventAfter 

will explain what the different Purge options mean.
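
As an illustration only, a slurmdbd.conf fragment along these lines would match the two-year retention asked about in the description. The 24month values simply double the reporter's existing PurgeJobAfter setting, and the ArchiveDir path is a hypothetical example, not a recommendation from this ticket.

```
# Archive records before they are purged.
ArchiveJobs=yes
ArchiveSteps=yes
# Hypothetical directory; it must exist and be writable by SlurmUser.
ArchiveDir=/var/spool/slurm/archive

# Keep two years of records in the database.
PurgeEventAfter=24month
PurgeJobAfter=24month
PurgeResvAfter=24month
PurgeStepAfter=24month
PurgeSuspendAfter=24month
```

The new values typically take effect once slurmdbd has been restarted. Records that were already purged are not restored by this change.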

Thanks
Comment 8 Xing Huang 2023-01-09 14:14:42 MST
Thank you so much. I will make the recommended change.

Best,
Xing
Comment 9 Benny Hedayati 2023-01-09 14:20:30 MST
You're welcome.