Hello,

We would like to load our archives (job_table_archive, step_table_archive...) into a database separate from the production one, so that we can keep more than 3 months of data in it (our current limit given the number of jobs per month) and compute statistics about cluster usage (by research unit, by type of QOS or GRES...) over a year, for example. At the moment we are unable to extract these figures easily, so it is difficult to foresee where we should put resources, in what quantity, and at what pace.

How should we proceed? I guess we should export the users/associations data from the production DB using "sacctmgr dump" and load the resulting file into the blank DB with "sacctmgr load", before using "sacctmgr archive load" to insert the different archive files, one by one according to the date of their content.

Will it be fine if we insert jobs belonging to users that were deleted after the archive file was created and that, as a consequence, won't be in the blank database populated with the result of the "sacctmgr dump" command?

We also wonder how we could create archives more frequently without purging the production DB. It seems to be possible given this sentence:

"""
Archive dump
Dump accounting data to file. Depending on options and slurmdbd configuration data may remain in database or be purged.
"""

Should we use a script for that, using SLURM_ARCHIVE_LAST_JOB to give the date of the most recent job to dump in the archive? Could you provide a short example/template for that?

Thanks in advance,
Sophie.

(In reply to Sophie Créno from comment #0)
> We would like to load our archives (job_table_archive,
> step_table_archive...)
> in a database separated from the production one so that we can keep more
> than 3 months in it (our current limit given the number of jobs per month)
> and can compute statistics about the cluster usage (by research unit, by type
> of QoS or GRes...) over a year for example.

I'm going to lower the severity of this ticket per <https://www.schedmd.com/support.php>:

> Severity 4 issues may also be used for recommendations for future product
> enhancements or modifications.

> At the moment, we are unable
> to extract this kind of figures easily so it's difficult to foresee where
> we should put resources, which quantity and at which pace.

Understood; many sites use data from previous machines to predict the needs of new machines.

> How should we proceed?

Do you plan to use another slurmdbd/slurmctld pair to store and query the records, or will you be using MySQL/MariaDB directly?
Hello Nate,

Thanks for your quick reply.

> I'm going to lower the severity of this ticket per
> <https://www.schedmd.com/support.php>
>> Severity 4 issues may also be used for recommendations for future product
>> enhancements or modifications.

Oh, yes, sorry, I should have reread it first.

> Do you plan to use another slurmdbd/slurmctld pair to store and query
> the records, or will you be using MySQL/MariaDB directly?

Yes, we imagined to use another slurmdbd/slurmctld pair to store and query. But we will also probably build our own SQL queries to cross data and obtain all we need given our organization (common and private partitions...).
(In reply to Sophie Créno from comment #2)
> Yes, we imagined to use another slurmdbd/slurmctld pair to store and
> query.

Looking into how best to do this.

> But we will also probably build our own SQL queries to cross data
> and obtain all we need given our organization (common and private
> partitions...).

Just so you know, the SQL table schemas can change between major releases.
>> But we will also probably build our own SQL queries to cross data
>> and obtain all we need given our organization (common and private
>> partitions...).
>
> Just so you know, the SQL table schemas can change between major releases.

Yes, I know. We will have to update our queries at the same time. That will be part of our checklist when we install a new version on the test cluster.
Hello,

Just checking in to see whether you have found time during the last couple of weeks to look into this.

Thanks in advance,
> We would like to load our archives (job_table_archive,
> step_table_archive...)
> in a database separated from the production one so that we can keep more
> than 3 months in it (our current limit given the number of jobs per month)
> and can compute statistics about the cluster usage (by research unit, by type
> of QoS or GRes...) over a year for example. At the moment, we are unable
> to extract this kind of figures easily so it's difficult to foresee where
> we should put resources, which quantity and at which pace.
> How should we proceed? I guess we should export the users/associations
> data from the production DB using "sacctmgr dump" and load the result file
> in the blank DB with "sacctmgr load" before using "sacctmgr archive load"
> to insert the different archive files, one by one according to the date
> of the content.

I'm taking over here for Nate since he got pulled into some other work. I apologize for the delayed response; I ran into a few issues while researching.

TL;DR: I discuss a few options and caveats below, but I think the easiest thing for you to do is to mysqldump the entire production database, then "sacctmgr archive load" all the archive files you currently have. After that, you can continue to "sacctmgr archive load" each new archive file whenever the production database archives and purges new data.

Long explanation:

You can export the associations with "sacctmgr dump" and load those into the new database with "sacctmgr load". I ran into one prerequisite and one problem for "sacctmgr load" to work:

(1) The user calling it has to be an admin, SlurmUser, or root in the database. Just do this as root.

(2) However, "sacctmgr load" still doesn't work if you added any QOS to the table besides the "normal" QOS (which is the default QOS in Slurm). So you will have to dump the QOS table from your production database and load it into the new database.
Example:
* Dump qos_table from my old database (slurm_2002) to a file
* Start slurmdbd, then load this into the new database as user root
* Then call sacctmgr load as user root

$ mysqldump -u marshall slurm_2002 qos_table > slurm_2002_qos_table.sql
# mysql -u marshall slurm_master_bug8910 < slurm_2002_qos_table.sql
# sacctmgr load discovery1.cfg
For cluster discovery1
< Then a bunch of output as it processes the file >
sacctmgr: Done adding cluster in usec=387935
Would you like to commit changes?
(You have 30 seconds to decide) (N/y): y

Another issue: if you have more TRES than the defaults (such as gres/gpu; check with "sacctmgr show tres"), you'll also need to back up and restore the TRES manually along with the QOS.

List of default TRES:

$ sacctmgr show tres
    Type            Name     ID
-------- --------------- ------
     cpu                      1
     mem                      2
  energy                      3
    node                      4
 billing                      5
      fs            disk      6
    vmem                      7
   pages                      8

Another option is to manually back up and restore the tres_table in the database, just like the qos_table.

Then you can run "sacctmgr archive load <archive file>" for all your archive files. You will need to wait for slurmdbd to roll up database usage before sreport is useful. Rollup happens every hour, or when slurmdbd starts up if it hasn't rolled up in the last hour. If it isn't happening and you get an error like "Cluster discovery1 not registered, not doing rollup", you may need to start a slurmctld instance that connects to slurmdbd on that node to make the cluster register.

You can continue to purge/archive at 3 months in your production slurmdbd/database. When each new archive file is created, you should be able to simply load it into the separate database. If you want to update this more often than once per month, you could set the "Purge*After" values in slurmdbd.conf to "90days" instead of "3months" to make purging/archiving happen once per day instead of once per month.

Note: Sometimes jobs aren't archived in a period where they normally would have been archived.
If this happens, then when the job does get archived you'll get a new archive file for that same time period, but with a ".1", ".2", etc. suffix to distinguish them. You can call "sacctmgr archive load" on such a file normally. This is documented under "ArchiveDir" in the slurmdbd.conf man page. (https://slurm.schedmd.com/slurmdbd.conf.html#OPT_ArchiveDir)

> Will it be fine if we insert jobs belonging to users that were
> deleted after the archive file was created and that, as a consequence,
> won't be in the blank database populated with the result of the
> "sacctmgr dump" command?

Just to clarify: you're asking what will happen if you archive load jobs from associations that have been deleted from the database?

It should be fine. The jobs will still load, and you can still query jobs from those users. However, because those associations don't exist in the new database, usage for them won't be rolled up, so certain usage reports from sreport (such as AccountUtilizationByUser) won't be the same as on your production database, which still has those associations.

As an example: I ran some jobs as user "kelsey" on May 9th, but deleted this user before calling "sacctmgr dump" in the old database.

Old slurmdbd: I delete user "kelsey" and run sacctmgr dump. As you can see, kelsey isn't in the file:

$ sacctmgr del user kelsey
Deleting users...
 kelsey
Would you like to commit changes?
(You have 30 seconds to decide) (N/y): y
$ sacctmgr dump discovery1
No filename given, using ./discovery1.cfg.
sacctmgr: Cluster - 'discovery1':Fairshare=1:QOS='normal'
sacctmgr: Parent - 'root'
sacctmgr: User - 'root':DefaultAccount='root':AdminLevel='Administrator':Fairshare=1
sacctmgr: Account - 'acct1':Description='acct1':Organization='acct1':Fairshare=1
sacctmgr: Parent - 'acct1'
sacctmgr: User - 'marshall':DefaultAccount='acct1':WCKeys='test21_41wckey1':Fairshare=1:MaxJobsAccrue=4:QOS='+qos2,+qos3'

New database: "kelsey" isn't in the database, but I can still query jobs.
# sacctmgr show users
      User   Def Acct   Def WCKey     Admin
---------- ---------- ---------- ---------
  marshall      acct1 test21_41+      None
      root       root            Administ+

# sacct -X -a -S1/1/20 --user=kelsey --format=jobid,jobname,start,end,state
       JobID    JobName               Start                 End      State
------------ ---------- ------------------- ------------------- ----------
1622_1               j1 2020-05-09T12:00:56 2020-05-09T12:00:57     FAILED
1622_2               j2 2020-05-09T12:00:57 2020-05-09T12:00:57     FAILED
1622_3               j3 2020-05-09T12:00:57 2020-05-09T12:00:57     FAILED
1622_4               j4 2020-05-09T12:00:57 2020-05-09T12:00:57     FAILED

If you want the new database to be the same as the production database, then instead of using "sacctmgr dump" you'll need to manually dump the following tables from the old database and restore them into the new database: acct_coord_table, acct_table, qos_table, tres_table, user_table. Or just dump the entire production database and restore it on the other node.

When rows are deleted from tables in the Slurm database, instead of actually removing the row from the table, Slurm sets its "deleted" column to 1 to indicate that it has been deleted. So Slurm still has records of the old users/accounts.

> We also wonder how we could create archives more frequently without
> purging the production DB. It seems to be possible given the sentence
> """
> Archive dump
> Dump accounting data to file. Depending on options and slurmdbd configuration
> data may remain in database or be purged.
> """
>
> Should we use a script for that, using SLURM_ARCHIVE_LAST_JOB to give
> the date of the most recent job to dump in the archive? Could you provide
> a short example/template for that?

You can't archive without purging. To create archives more frequently, I recommend using "days" or "hours" instead of "months" in the Purge*After options, though you still won't get any data in the archive that isn't purged. For example, if you purge/archive data older than 3 months, you won't see any jobs in the archive that are less than 3 months old.
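To illustrate the Purge*After idea, a slurmdbd.conf fragment along those lines might look like the sketch below. The directory path and the exact set of options here are assumptions for illustration only; check the slurmdbd.conf man page for your version before copying anything.

```
# Write purged records to archive files in this directory
ArchiveDir=/var/spool/slurm/archive
ArchiveJobs=yes
ArchiveSteps=yes
# "days" units make the purge/archive pass run once per day
# instead of once per month
PurgeJobAfter=90days
PurgeStepAfter=90days
```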
I don't know if you could ever archive without purging. Either the sacctmgr man page that you quoted is wrong, or it's a bug.

In slurmdbd.conf I don't have any of the Purge* settings. Without them, I get these errors on the command line and in slurmdbd.log:

$ sacctmgr archive dump directory=/home/marshall/archive/20.02 jobs steps
This may result in loss of accounting database records (if Purge* options enabled).
Are you sure you want to continue?
(You have 30 seconds to decide) (N/y): y
sacctmgr: error: slurmdbd: Error with request.
Problem dumping archive: Unspecified error

[2020-05-14T11:13:05.358] debug:  REQUEST_PERSIST_INIT: CLUSTER:discovery1 VERSION:8960 UID:1000 IP:127.0.0.1 CONN:12
[2020-05-14T11:13:06.464] error: No known unit given for purge, we are guessing mistake and returning error
[2020-05-14T11:13:06.464] error: Parsing purge discovery1_step_table
[2020-05-14T11:13:06.464] error: Processing last message from connection 12(127.0.0.1) uid(1000)

With the Purge* settings, I can purge and archive normally. I need to investigate this more.

Does this answer your questions enough for you to do what you want?
We've updated the documentation to clarify that you can't archive without purging, in commit c46e780e07ef8a29. We've also created an internal enhancement ticket to add QOS and TRES to sacctmgr dump/load.

I'm closing this as resolved/fixed per the documentation fix. Let us know if you have more questions.
Hello,

Sorry for the delay in my answer. Thanks a lot for the time you took to investigate and write that comprehensive answer regarding the procedure to set up a separate database.

So, if I understand correctly, we could:

* either keep our slurmdbd settings for archiving/purging and, in this case, keep the last 90 days in the production DB while everything older goes into the "historic" DB. We would then have to merge two sreport outputs if we wanted accounting information over 6 months, for example;

* or decrease the values of the Purge* settings so that only a few days (let's say a week, or maybe even less?) of records stay in the production DB, with all the rest in the "historic" one. We would then run sreport essentially on the latter.

Of course, as you said, in both cases we would create archive files every day and load them straight afterwards into the "historic" DB.

Do you think that keeping only 24 hours in the production DB would be sensible?

Thanks a lot,
Re-opening the ticket per your response.

Yes, those two options would both work.

> Do you think that
> keeping only 24 hours in the production DB would be sensible?

I think it depends on how your users and admins use the production database and who can access the historic database. If users don't need more than 1 day of history, then that's fine.

From Slurm's point of view, it doesn't matter. The scheduler/resource manager part of Slurm will continue to work the same, even without a database. (slurmdbd is an optional, though strongly recommended, part of Slurm.)
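For the daily load into the historic database, a small wrapper script could work. Below is a rough sketch only: the directory layout, the ".done" marker scheme, and the function name are my own assumptions for illustration, not Slurm conventions, and you should adapt it to however you copy archive files between the two nodes.

```shell
# Sketch: load newly created archive files into the "historic" database,
# oldest first. The ".done" marker files are a hypothetical bookkeeping
# scheme so already-loaded archives are skipped on the next run.

load_archives() {
    archive_dir=$1
    dry_run=${2:-1}   # pass 0 as the second argument to really call sacctmgr
    # Archive filenames embed the time period they cover, so a plain
    # lexicographic sort gives chronological order for a given table.
    for f in $(ls "$archive_dir" 2>/dev/null | sort); do
        case $f in *.done) continue ;; esac      # skip our own markers
        path="$archive_dir/$f"
        [ -e "$path.done" ] && continue          # already loaded
        if [ "$dry_run" = 1 ]; then
            echo "would run: sacctmgr -i archive load file=$path"
        else
            sacctmgr -i archive load file="$path" && touch "$path.done"
        fi
    done
}
```

You could call this from a daily cron job on the historic node after copying the new archive files over; with the dry-run flag left at its default it only prints the commands it would run, which is a safe way to check the ordering first.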
>> Do you think that keeping only 24 hours in the production DB
>> would be sensible?
>
> I think it depends on how your users and admins use the production
> database and who can access the historic database. If users don't
> need more than 1 day of history, then that's fine.

Some teams want to keep track of how much each of their members/projects consumes, so we would have to let them access the historic DB. They also sometimes need figures to apply for a grant, for example, and even a week would definitely not be enough in that case.

> From Slurm's point of view, it doesn't matter. The scheduler/resource
> manager part of Slurm will continue to work the same, even without
> a database. (slurmdbd is an optional, though strongly recommended,
> part of Slurm.)

Ah yes, that's right, but it would be so weird to me not to be able to access accounting information that I forgot that slurmdbd is indeed optional.

Thanks again for your help.
You're welcome.

Re-closing as resolved/fixed.