| Summary: | not enough job data stored in slurm database | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Xing Huang <x.huang> |
| Component: | Accounting | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 22.05.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | WA St. Louis | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
|
Description
Xing Huang
2023-01-06 09:32:29 MST
Comment 1
Benny Hedayati
2023-01-09

Hi, to see job information going back over longer periods you must supply start and end time range parameters. For example, with sacct the following command would show job info going back to 2021:

sacct -a -S 2021-01-09T00:00:00 -E 2023-01-09T00:00:00

With sreport, the start= and end= parameters must be specified. For example, this command is valid:

sreport cluster accountutilizationbyuser start=2021-01-09T00:00:00 end=2023-01-09T00:00:00

Have you supplied these parameters?

Thanks

Comment 2
Xing Huang
2023-01-09

The point is not that I did not supply the start and end times. It is that the database only keeps a few months of data, not all of it.

[root@mgt ~]# sacct -a -S 2022-01-09T00:00:00 -E 2022-03-09T00:00:00
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

You can see that there is no data within this time period, which I believe was not stored by Slurm.

Best,
Xing

The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail.

Comment 3
Benny Hedayati
2023-01-09

Ok I see, can you please send me the contents of your slurmdbd.conf? You can hide the credentials for the database like so:

StoragePass=xxxx
StorageUser=xxxx

I would like to see if there might be some type of archive purge enabled.

Thanks

Comment 4
Xing Huang
2023-01-09

Here is my slurm.conf.
ClusterName=chpc3
ControlMachine=mgt
ControlAddr=mgt.cluster
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
GroupUpdateForce=0
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
PrologFlags=x11
#PluginDir=
#FirstJobId=
ReturnToService=2
MaxJobCount=50000
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
Prolog=/opt/slurm/prologue
Epilog=/opt/slurm/epilogue
JobSubmitPlugins=lua,require_timelimit
JobRequeue=0
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=600
SlurmdTimeout=600
MessageTimeout=100
TCPTimeout=600
BatchStartTimeout=600
MinJobAge=300
KillWait=30
InactiveLimit=0
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
### These two options will 'distribute' jobs on nodes - Xing 03/19/2021
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
###
PriorityType=priority/multifactor
PriorityDecayHalfLife=14
PriorityUsageResetPeriod=Monthly
# Fairshare Factor
PriorityWeightFairshare=10000
# Age Factor
PriorityWeightAge=5000
PriorityMaxAge=7-0
# Job Factor
PriorityFavorSmall=NO
PriorityWeightJobSize=2000
# Partition Factor
PriorityWeightPartition=1000
PriorityFlags=CALCULATE_RUNNING
# Job Array Size
MaxArraySize = 5000
#### This would fix the missing acl group in slurm
LaunchParameters=disable_send_gids
####
#
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
#SlurmdDebug=debug
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=mgt.cluster
#AccountingStorageLoc=
#AccountingStoragePass=
AccountingStorageTRES=cpu,mem,gres/gpu
AccountingStorageUser=slurm
AccountingStorageEnforce=limits,qos
#
# COMPUTE NODES
AccountingStoreFlags=job_comment
GresTypes=gpu,vmem

Comment 5
Benny Hedayati
2023-01-09

Thanks for that. Do you also have a slurmdbd.conf? It's a different file in the /etc folder.

Comment 6
Xing Huang
2023-01-09

# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
#StorageHost=localhost
#StoragePort=1234
StoragePass=xxxx
StorageUser=slurm
StorageLoc=slurm_acct_db
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month

I wonder if it is related to the last five options, and which one I need to modify.
Best,
Xing
Comment 7
Benny Hedayati
2023-01-09

Indeed, because you have set these options:

PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month

individual job records older than these limits are purged from the database. Also, since you have commented out the following:

#ArchiveJobs=yes

jobs that are purged are not archived. Unfortunately, we cannot do anything in this situation to recover older jobs unless you have some other form of backup.

I suggest that, going forward, you extend the Purge options to a longer period and uncomment the ArchiveJobs=yes option so that purged jobs are archived.

Please refer to this documentation for more detailed information on archiving:
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_ArchiveJobs

This section explains what the different Purge options mean:
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_PurgeEventAfter

Thanks

Comment 8
Xing Huang

Thank you so much. I will make the change recommended.

Best,
Xing

Comment 9
Benny Hedayati

You're welcome. |
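
As a rough illustration of the diagnosis in this thread, the Python sketch below scans slurmdbd.conf-style text for active Purge* options whose matching Archive* option is not enabled. It is only a sketch under stated assumptions: the function names and the sample text are hypothetical, the Archive* names other than ArchiveJobs and ArchiveSteps come from the slurmdbd.conf man page rather than from this thread, and the parser is simplified (keys are matched case-sensitively and included files are not followed).

```python
# Assumed mapping from each Purge* option to the Archive* option that would
# preserve the purged records. Verify the names against your Slurm version.
PURGE_TO_ARCHIVE = {
    "PurgeEventAfter": "ArchiveEvents",
    "PurgeJobAfter": "ArchiveJobs",
    "PurgeResvAfter": "ArchiveResvs",
    "PurgeStepAfter": "ArchiveSteps",
    "PurgeSuspendAfter": "ArchiveSuspend",
}


def parse_conf(text):
    """Return the active (uncommented) Key=Value settings as a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def unarchived_purges(settings):
    """List (purge option, retention, archive option) purged without archiving."""
    findings = []
    for purge_key, archive_key in PURGE_TO_ARCHIVE.items():
        retention = settings.get(purge_key)
        if retention is None:
            continue  # this record type is never purged
        if settings.get(archive_key, "no").lower() != "yes":
            findings.append((purge_key, retention, archive_key))
    return findings


# Hypothetical sample mirroring the purge/archive lines posted in this thread.
SAMPLE = """\
#ArchiveJobs=yes
#ArchiveSteps=yes
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
"""

if __name__ == "__main__":
    for purge_key, retention, archive_key in unarchived_purges(parse_conf(SAMPLE)):
        print(f"{purge_key}={retention}: purged without archive "
              f"(consider {archive_key}=yes)")
```

Run against the sample, the script reports every Purge* line, because both Archive* lines are commented out. Uncommenting ArchiveJobs=yes in the input removes PurgeJobAfter from the report.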