Ticket 13346 - Archiving/Purging from our active database
Summary: Archiving/Purging from our active database
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: 20.11.8
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Ben Roberts
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-02-07 10:36 MST by Hjalti Sveinsson
Modified: 2022-03-18 08:06 MDT

See Also:
Site: deCODE


Attachments
slurmdbd archiving purging logs (74.16 KB, application/mbox)
2022-02-10 01:41 MST, Hjalti Sveinsson

Description Hjalti Sveinsson 2022-02-07 10:36:41 MST
Hello,

We want to purge data from our active database; it is now over 600 GB.

MariaDB [(none)]> SELECT table_schema "slurm_acc",
    ->         ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) "DB Size in MB"
    -> FROM information_schema.tables
    -> GROUP BY table_schema;
+--------------------+---------------+
| slurm_acc          | DB Size in MB |
+--------------------+---------------+
| information_schema |           0.1 |
| mysql              |           0.6 |
| performance_schema |           0.0 |
| slurm_acc          |      646468.3 |
+--------------------+---------------+
4 rows in set (1.39 sec)

MariaDB [(none)]>

As you can see in the ticket we created when upgrading from 19.05.6 to 20.11.8 (https://bugs.schedmd.com/show_bug.cgi?id=12416), we have had issues in the past when enabling archiving/purging.

We have tried "sacctmgr archive dump", but it always times out after 900 seconds.

How can we best do this to make sure that we won't crash the slurmdbd service or the mariadb service?

We have decided that we only need a few months of data in the active database at a time, but we want to be able to insert the data back into the database if needed, or keep it on a separate slurmdbd archiving node.

It would be great to get input from you before we enable archiving. The last time I tried this the slurmdbd service halted, but that was back on Slurm 19.05.6.

regards,
Hjalti
Comment 1 Hjalti Sveinsson 2022-02-07 10:41:00 MST
Also see this ticket from when we tried this on 18.08.7: https://bugs.schedmd.com/show_bug.cgi?id=8873
Comment 3 Ben Roberts 2022-02-07 13:06:31 MST
Hi Hjalti,

We should be able to come up with a solution that lets you purge old records from your database without overwhelming it.  The way the Purge*After parameters work is that they perform the purge at the end of the type of period specified.  For example, if you have PurgeJobAfter=3months, then at the end of the current month Slurm will look for any jobs older than 3 months and attempt to get rid of them.  This can be a problem when the number of qualifying jobs is too large.

To address this, you can configure Slurm to purge the jobs in smaller increments so there aren't as many being purged at a time.  For example, if you have a year's worth of data and you want to purge it down to 3 months, you can start by setting PurgeJobAfter=360days.  This causes a purge to happen at the end of the day rather than at the end of the month, and it will only purge 5 days' worth of jobs.  The next day you can update it to 355 days so that it clears out 5 more days' worth of jobs that night.  You can carry on with these small increments until you reach the desired time frame, then leave it to do a nightly purge that keeps the database at the specified retention period.
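As an illustrative sketch (not an official SchedMD tool), the stepped-down schedule described above can be printed with a small shell loop; the start, target, and step values here are assumptions you would adjust to your own retention goals:

```shell
#!/bin/sh
# Print the PurgeJobAfter value to use on each successive night, stepping
# retention down from 360 days to 90 days, 5 days at a time.
start=360   # assumed starting retention (days)
target=90   # assumed final retention (~3 months)
step=5      # days of records purged per nightly pass
d=$start
while [ "$d" -ge "$target" ]; do
    echo "PurgeJobAfter=${d}days"
    d=$((d - step))
done
```

Each printed value would go into slurmdbd.conf for one night (followed by a slurmdbd restart) before moving on to the next.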

To begin this process you need to know how old your oldest job is.  Can I have you run the following query, replacing <cluster_name> with the appropriate name for your cluster?

select id_job,time_start,time_end from <cluster_name>_job_table order by time_end limit 50;


Another thing you may already be aware of, but that I should mention, is the Archive* parameters.  These cause the jobs (or other record types) to be written to a file when they are purged; you can then load that file into another slurmdbd instance and use it to retrieve information about the purged records.  This lets you access the information without keeping all the records on your production machine.  You can find information on the ArchiveJobs parameter here:
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_ArchiveJobs

Let me know if you have any questions about this.

Thanks,
Ben
Comment 4 Hjalti Sveinsson 2022-02-08 02:57:23 MST
Hi, thank you for your answer. 

Good to hear that we should be able to manage this. Will the database shrink in size when we have purged the data?

The query you gave me does not seem to return meaningful results.

Here is the output:

MariaDB [slurm_acc]> select id_job,time_start,time_end from lhpc_job_table order by time_end limit 50;
+----------+------------+----------+
| id_job   | time_start | time_end |
+----------+------------+----------+
| 42993789 |          0 |        0 |
| 42993790 |          0 |        0 |
| 42993791 |          0 |        0 |
| 42993792 |          0 |        0 |
| 42993793 |          0 |        0 |
| 42993794 |          0 |        0 |
| 42993795 |          0 |        0 |
| 42993796 |          0 |        0 |
| 42993797 |          0 |        0 |
| 42993798 |          0 |        0 |
| 42993799 |          0 |        0 |
| 42993800 |          0 |        0 |
| 42993801 |          0 |        0 |
| 42993802 |          0 |        0 |
| 42993803 |          0 |        0 |
| 42993804 |          0 |        0 |
| 42993805 |          0 |        0 |
| 42993806 |          0 |        0 |
| 42996102 |          0 |        0 |
| 42996104 |          0 |        0 |
| 42996106 |          0 |        0 |
| 42996108 |          0 |        0 |
| 42996110 |          0 |        0 |
| 42996112 |          0 |        0 |
| 42996114 |          0 |        0 |
| 42996116 |          0 |        0 |
| 42996117 |          0 |        0 |
| 42996119 |          0 |        0 |
| 42996121 |          0 |        0 |
| 42996123 |          0 |        0 |
| 42996125 |          0 |        0 |
| 42996127 |          0 |        0 |
| 42996129 |          0 |        0 |
| 42996130 |          0 |        0 |
| 42996132 |          0 |        0 |
| 42996134 |          0 |        0 |
| 49289263 |          0 |        0 |
| 49289265 |          0 |        0 |
| 49685028 |          0 |        0 |
| 49685030 |          0 |        0 |
| 49685032 |          0 |        0 |
| 49685034 |          0 |        0 |
| 49685035 |          0 |        0 |
| 49685037 |          0 |        0 |
| 49685040 |          0 |        0 |
| 49890322 |          0 |        0 |
| 49896673 |          0 |        0 |
| 50033274 |          0 |        0 |
| 50033275 |          0 |        0 |
| 50033276 |          0 |        0 |
+----------+------------+----------+
50 rows in set (0.00 sec)

----

Archiving is exactly what we want, so we would like to configure that as well.


best regards,
Hjalti Sveinsson
Comment 5 Hjalti Sveinsson 2022-02-08 03:17:19 MST
I ran some sacct queries and it seems I have found the oldest record; see below.

Nothing here:
root@ru-lhpc-head:~# sacct -S 03/02/18-00:00:00 -E 03/02/18-16:00:00 -o jobid,start,end| head -10
       JobID               Start                 End
------------ ------------------- -------------------

Nothing here:
root@ru-lhpc-head:~# sacct -S 03/02/18-00:00:00 -E 03/02/18-23:00:00 -o jobid,start,end| head -10
       JobID               Start                 End
------------ ------------------- -------------------

Nothing here:
root@ru-lhpc-head:~# sacct -S 03/02/18-00:00:00 -E 03/02/18-23:59:00 -o jobid,start,end| head -10
       JobID               Start                 End
------------ ------------------- -------------------

Voila:
root@ru-lhpc-head:~# sacct -S 03/03/18-00:00:00 -E 03/03/18-02:00:00 -o jobid,start,end| head -10
       JobID               Start                 End
------------ ------------------- -------------------
15887289     2018-03-03T00:22:35 2018-03-05T00:22:38
15892684     2018-03-03T00:37:56 2018-03-04T10:09:30
15892688     2018-03-03T00:38:02 2018-03-04T12:30:58
15892692     2018-03-03T00:38:03 2018-03-04T12:12:07
root@ru-lhpc-head:~#

Can we somehow verify that these are the oldest records?
Comment 6 Ben Roberts 2022-02-08 08:58:08 MST
Yes, as records are purged the size of the database will shrink accordingly.  That's interesting that the query returned 50 records that didn't have a start or end time.  I would like to see what sacct has to say about one of these jobs.  Can I have you send the output of the following command?
sacct -j 42993789 --format=jobid,state,exitcode,submit,eligible,start,end


For archiving, the most important thing you would want to configure is the ArchiveDir.  This is where the archive file(s) will be created when the purge of job (or other) records happens.  Once you have the archive file that is created you can move the file to your other system with a separate instance of slurmdbd running and load the files there.  You need to have the Archive option that corresponds to each type of record that is being purged if you want to archive records of that type.  For example, you need to have ArchiveJobs along with PurgeJobsAfter to have the job records archived.  You need to have ArchiveSteps along with PurgeStepsAfter to have job step records archived, etc.
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_ArchiveDir

It does look like you've probably found the oldest records if there was nothing on 3/2/18 but there was on 3/3/18.  You can also run that command looking for the entire month of February in 2018 and again for older months, but I assume you were looking around that time because that was about when the cluster came online.  Is that right?  

Thanks,
Ben
Comment 8 Hjalti Sveinsson 2022-02-09 04:18:49 MST
Regarding your first question :

root@ru-lhpc-head-db-01:/var/log/slurm# sacct -j 42993789 --format=jobid,state,exitcode,submit,eligible,start,end
       JobID      State ExitCode              Submit            Eligible               Start                 End
------------ ---------- -------- ------------------- ------------------- ------------------- -------------------
42993789        PENDING      0:0 2021-12-22T10:00:10             Unknown             Unknown             Unknown

If I do it with -D parameter:

root@ru-lhpc-head-db-01:/var/log/slurm# sacct -D -j 42993789 --format=jobid,state,exitcode,submit,eligible,start,end
       JobID      State ExitCode              Submit            Eligible               Start                 End
------------ ---------- -------- ------------------- ------------------- ------------------- -------------------
42993789         FAILED    127:0 2018-06-10T07:52:10 2018-06-10T07:52:10 2018-06-10T07:52:11 2018-06-10T07:52:15
42993789      COMPLETED      0:0 2019-03-29T01:12:55 2019-03-29T01:12:55 2019-03-29T01:20:00 2019-03-29T01:22:02
42993789      COMPLETED      0:0 2019-12-09T06:56:41 2019-12-09T06:56:41 2019-12-09T06:56:41 2019-12-09T07:03:46
42993789.ba+  COMPLETED      0:0 2019-12-09T06:56:41 2019-12-09T06:56:41 2019-12-09T06:56:41 2019-12-09T07:03:46
42993789      COMPLETED      0:0 2020-07-08T12:56:29 2020-07-08T12:56:29 2020-07-08T12:57:24 2020-07-08T13:16:19
42993789.ba+  COMPLETED      0:0 2020-07-08T12:57:24 2020-07-08T12:57:24 2020-07-08T12:57:24 2020-07-08T13:16:19
42993789      COMPLETED      0:0 2021-03-03T05:49:55 2021-03-03T05:50:02 2021-03-03T05:50:32 2021-03-03T07:36:35
42993789.ba+  COMPLETED      0:0 2021-03-03T05:50:32 2021-03-03T05:50:32 2021-03-03T05:50:32 2021-03-03T07:36:35
42993789        PENDING      0:0 2021-12-22T10:00:10             Unknown             Unknown             Unknown

----

Regarding your second remark:

Would this config make sense? Our oldest data is from 3 March 2018 (1439 days ago), and we only want to try purging 2 days of data to begin with:

# Archive info
ArchiveDir="/nfs/fs1/slurm_archive_data/"
ArchiveJobs=yes
ArchiveSteps=yes
#ArchiveEvents=yes
#ArchiveResvs=yes
#ArchiveSuspend=yes
#ArchiveUsage=yes
#ArchiveTXN=yes
PurgeEventAfter=1437days
PurgeJobAfter=1437days
PurgeStepAfter=1437days
PurgeResvAfter=1437days
PurgeSuspendAfter=1437days
PurgeTXNAfter=1437days
PurgeUsageAfter=1437days
Comment 9 Ben Roberts 2022-02-09 10:29:28 MST
It looks like the query is finding the last record in the sacct output you sent, where the eligible, start, and end times are unknown.  It looks like you have been processing enough jobs to reuse job IDs, and this is a (relatively) recent job that is still pending.  If you run the query again, ordering by submit time instead, I would expect it to line up with the job list you got from sacct.
select id_job,time_submit,time_start,time_end from lhpc_job_table order by time_submit limit 20;

The config you have looks good, assuming you don't want to archive anything but the jobs and steps.  The events, reservations, suspend, usage, and transaction information will be purged but not written to a file.

Thanks,
Ben
Comment 10 Hjalti Sveinsson 2022-02-10 01:41:14 MST
Created attachment 23397 [details]
slurmdbd archiving purging logs
Comment 11 Hjalti Sveinsson 2022-02-10 01:45:57 MST
Yesterday I enabled this with the config I sent you; it ran fine until 02:01:30:

[2022-02-09T21:40:58.584] slurmdbd version 20.11.8 started
[2022-02-10T02:01:30.335] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

from journalctl:
Feb 10 02:01:30 ru-lhpc-head-db-01.decode.is systemd[1]: slurmdbd.service: main process exited, code=exited, status=1/FAILURE
Feb 10 02:01:30 ru-lhpc-head-db-01.decode.is systemd[1]: Unit slurmdbd.service entered failed state.
Feb 10 02:01:30 ru-lhpc-head-db-01.decode.is systemd[1]: slurmdbd.service failed.

I can see an insert into the job_table that ran for 1800 seconds and then appears to have crashed slurmdbd; the delete, however, continued. See the log file I have attached to the case.

See here:
| 359649 | slurm | localhost | slurm_acc | Query | 1802 | init   | delete from "lhpc_job_table" where time_submit <= 1520121599 && time_end != 0 order by time_submit asc LIMIT 50000 | 0.000 |
| 359650 | slurm | localhost | slurm_acc | Query | 1800 | update | insert into "lhpc_job_table" (id_job, mod_time, id_array_job, id_array_task, het_job_id, het_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, flags, state_reason_prev, account, `partition`, array_task_str, array_task_pending, tres_req, work_dir) values (52829831, UNIX_TIMESTAMP(), 0, 4294967294, 0, 4294967294, 5, 1, 1065, 1042, 'None assigned', 0, 660, 1644456687, 1644456687, 0, 'pheno', 0, 0, 1, 1, 0, 4096, 0, 0, 'bioprod', 'cpu_hog', NULL, 0, '1=1,2=4096,4=1,5=1', '/nfs/odinn/datavault/damp/tmp/vol101/assocDaisy_snpindel_Q2P/LDSRQTLP/LDSRQTLP143256/mtjobs.out') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=5, id_user=1065, id_group=1042, nodelist='None assigned', id_resv=0, timelimit=660, time_submit=1644456687, time_eligible=1644456687, time_start=0, mod_time=UNIX_TIMESTAMP(), job_name='pheno', track_steps=0, id_qos=1, state=greatest(state, 0), priority=1, cpus_req=1, nodes_alloc=0, mem_req=4096, id_array_job=0, id_array_task=4294967294, het_job_id=0, het_job_offset=4294967294, flags=0, state_reason_prev=0, account='bioprod', `partition`='cpu_hog', array_task_str=NULL, array_task_pending=0, tres_req='1=1,2=4096,4=1,5=1', work_dir='/nfs/odinn/datavault/damp/tmp/vol101/assocDaisy_snpindel_Q2P/LDSRQTLP/LDSRQTLP143256/mtjobs.out' | 0.000 |
| 368626 | slurm | localhost | slurm_acc | Sleep | 0 |        | NULL | 0.000 |
| 368627 | root  | localhost | NULL      | Query | 0 | NULL   | show full processlist
Comment 12 Ben Roberts 2022-02-10 08:41:07 MST
I'm sorry to hear the record purge had a problem.  When you brought it back up, did the delete look like it completed?  Do you see a record for job 52829831, the job it was trying to insert?

Can I have you get information on the innodb settings by running this:
show variables like 'innodb%';

Since it had a problem right around the 1800-second mark, it seems like a timeout is coming into play.

Thanks,
Ben
Comment 13 Hjalti Sveinsson 2022-02-10 09:01:34 MST
Yes I see information about job 52829831 but the step data is missing.

root@ru-lhpc-head-db-01:~/lockwait# sacct -j 52829831
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
52829831     allocation               bioprod          1  COMPLETED      0:0


innodb settings:
MariaDB [(none)]> show variables like 'innodb%';
+-------------------------------------------+------------------------+
| Variable_name                             | Value                  |
+-------------------------------------------+------------------------+
| innodb_adaptive_flushing                  | ON                     |
| innodb_adaptive_flushing_method           | estimate               |
| innodb_adaptive_hash_index                | ON                     |
| innodb_adaptive_hash_index_partitions     | 1                      |
| innodb_additional_mem_pool_size           | 8388608                |
| innodb_autoextend_increment               | 8                      |
| innodb_autoinc_lock_mode                  | 2                      |
| innodb_blocking_buffer_pool_restore       | OFF                    |
| innodb_buffer_pool_instances              | 1                      |
| innodb_buffer_pool_populate               | OFF                    |
| innodb_buffer_pool_restore_at_startup     | 0                      |
| innodb_buffer_pool_shm_checksum           | ON                     |
| innodb_buffer_pool_shm_key                | 0                      |
| innodb_buffer_pool_size                   | 10737418240            |
| innodb_change_buffering                   | all                    |
| innodb_checkpoint_age_target              | 0                      |
| innodb_checksums                          | ON                     |
| innodb_commit_concurrency                 | 0                      |
| innodb_concurrency_tickets                | 500                    |
| innodb_corrupt_table_action               | assert                 |
| innodb_data_file_path                     | ibdata1:10M:autoextend |
| innodb_data_home_dir                      |                        |
| innodb_dict_size_limit                    | 0                      |
| innodb_doublewrite                        | ON                     |
| innodb_doublewrite_file                   |                        |
| innodb_fake_changes                       | OFF                    |
| innodb_fast_checksum                      | OFF                    |
| innodb_fast_shutdown                      | 1                      |
| innodb_file_format                        | Antelope               |
| innodb_file_format_check                  | ON                     |
| innodb_file_format_max                    | Antelope               |
| innodb_file_per_table                     | ON                     |
| innodb_flush_log_at_trx_commit            | 1                      |
| innodb_flush_method                       | O_DIRECT               |
| innodb_flush_neighbor_pages               | area                   |
| innodb_force_load_corrupted               | OFF                    |
| innodb_force_recovery                     | 0                      |
| innodb_ibuf_accel_rate                    | 100                    |
| innodb_ibuf_active_contract               | 1                      |
| innodb_ibuf_max_size                      | 5368692736             |
| innodb_import_table_from_xtrabackup       | 0                      |
| innodb_io_capacity                        | 200                    |
| innodb_kill_idle_transaction              | 0                      |
| innodb_large_prefix                       | OFF                    |
| innodb_lazy_drop_table                    | 0                      |
| innodb_lock_wait_timeout                  | 6000                   |
| innodb_locking_fake_changes               | ON                     |
| innodb_locks_unsafe_for_binlog            | OFF                    |
| innodb_log_block_size                     | 512                    |
| innodb_log_buffer_size                    | 8388608                |
| innodb_log_file_size                      | 1073741824             |
| innodb_log_files_in_group                 | 2                      |
| innodb_log_group_home_dir                 | ./                     |
| innodb_max_bitmap_file_size               | 104857600              |
| innodb_max_changed_pages                  | 1000000                |
| innodb_max_dirty_pages_pct                | 75                     |
| innodb_max_purge_lag                      | 0                      |
| innodb_merge_sort_block_size              | 1048576                |
| innodb_mirrored_log_groups                | 1                      |
| innodb_old_blocks_pct                     | 37                     |
| innodb_old_blocks_time                    | 0                      |
| innodb_open_files                         | 300                    |
| innodb_page_size                          | 16384                  |
| innodb_print_all_deadlocks                | OFF                    |
| innodb_purge_batch_size                   | 20                     |
| innodb_purge_threads                      | 1                      |
| innodb_random_read_ahead                  | OFF                    |
| innodb_read_ahead                         | linear                 |
| innodb_read_ahead_threshold               | 56                     |
| innodb_read_io_threads                    | 4                      |
| innodb_recovery_stats                     | OFF                    |
| innodb_recovery_update_relay_log          | OFF                    |
| innodb_replication_delay                  | 0                      |
| innodb_rollback_on_timeout                | OFF                    |
| innodb_rollback_segments                  | 128                    |
| innodb_show_locks_held                    | 10                     |
| innodb_show_verbose_locks                 | 0                      |
| innodb_simulate_comp_failures             | 0                      |
| innodb_spin_wait_delay                    | 6                      |
| innodb_stats_auto_update                  | 1                      |
| innodb_stats_method                       | nulls_equal            |
| innodb_stats_modified_counter             | 0                      |
| innodb_stats_on_metadata                  | ON                     |
| innodb_stats_sample_pages                 | 8                      |
| innodb_stats_traditional                  | ON                     |
| innodb_stats_update_need_lock             | 1                      |
| innodb_strict_mode                        | OFF                    |
| innodb_support_xa                         | ON                     |
| innodb_sync_spin_loops                    | 30                     |
| innodb_table_locks                        | ON                     |
| innodb_thread_concurrency                 | 0                      |
| innodb_thread_concurrency_timer_based     | OFF                    |
| innodb_thread_sleep_delay                 | 10000                  |
| innodb_track_changed_pages                | OFF                    |
| innodb_use_atomic_writes                  | OFF                    |
| innodb_use_fallocate                      | OFF                    |
| innodb_use_global_flush_log_at_trx_commit | ON                     |
| innodb_use_native_aio                     | ON                     |
| innodb_use_stacktrace                     | OFF                    |
| innodb_use_sys_malloc                     | ON                     |
| innodb_use_sys_stats_table                | OFF                    |
| innodb_version                            | 5.5.61-MariaDB-38.13   |
| innodb_write_io_threads                   | 4                      |
+-------------------------------------------+------------------------+
103 rows in set (0.00 sec)

>Since it had a problem right around the 1800 second mark it seems like there might be a timeout coming into play. 

Yes, I agree, but it is not the innodb_lock_wait_timeout value.  I increased it to 6000 (since I noticed the delete takes around 4000-5000 seconds to complete) and restarted the archiving/purge, but if another transaction tries to update/insert into lhpc_job_table, that transaction still times out at 1800 seconds and slurmdbd crashes with:

"[2022-02-10T12:15:33.909] fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program"
Comment 14 Ben Roberts 2022-02-10 12:56:54 MST
Hi Hjalti,

I've been looking into what other timeout might come into play here.  It looks like there is a 'lock_wait_timeout' in MySQL.  The documentation shows that running into this timeout will result in the 'ER_LOCK_WAIT_TIMEOUT' message we're seeing.
https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_lock_wait_timeout

Can you see what this is set to on your system?
show variables like 'lock_wait_timeout';

Thanks,
Ben
Comment 15 Hjalti Sveinsson 2022-02-11 04:03:57 MST
Hi, I have found the cause of the timeout issue.

It was the SLURMDBD_TIMEOUT in :

./src/plugins/accounting_storage/slurmdbd/slurmdbd_agent.c

and

./src/plugins/accounting_storage/slurmdbd/dbd_conn.c

It was set to 900, but I changed it to 1800 some time back to increase the lock wait timeout. Yesterday I changed it to 6000, rebuilt the RPM packages, and reinstalled the slurm-slurmdbd package on the database node.
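For reference, the source change described above can be sketched as follows. This is a hedged illustration: the macro name and file paths come from this comment, but the exact `#define` form is an assumption, and the stand-in file here is fabricated so the sketch runs on its own. Against a real Slurm source tree you would run only the sed line on the two files mentioned and then rebuild the RPMs.

```shell
#!/bin/sh
NEW_TIMEOUT=6000

# Stand-in for the real source file, so this sketch is self-contained.
# In a real tree you would instead edit:
#   src/plugins/accounting_storage/slurmdbd/slurmdbd_agent.c
#   src/plugins/accounting_storage/slurmdbd/dbd_conn.c
f=$(mktemp)
printf '#define SLURMDBD_TIMEOUT 900\n' > "$f"

# Bump the compiled-in timeout, then rebuild the packages from this source.
sed -i "s/#define SLURMDBD_TIMEOUT [0-9]*/#define SLURMDBD_TIMEOUT ${NEW_TIMEOUT}/" "$f"
cat "$f"   # -> #define SLURMDBD_TIMEOUT 6000
```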

Now the timeout is 6000, and while the delete is running, other transactions wait up to 6000 seconds to complete, which seems to be enough. The deletes take about 4500 seconds per 50,000 rows.

That raises another question: is there any way to improve this performance? I have calculated that at the current rate I will be able to delete/purge ~300,000 rows per day, but we are creating over ~400,000 new entries per day in our database, so at this rate we will never catch our tail.

One thing I thought of: we already have a backup of our database that I took on the 8th of February. Can we just import that backup into a separate server so we can look up old data there, and then delete data from our main database more quickly than slurmdbd does?
Comment 16 Hjalti Sveinsson 2022-02-11 08:23:26 MST
Archiving is not working for us: it takes 24 hours to purge roughly 300,000 rows from the database, and we are creating 400,000 - 500,000 per day.

With that performance we will never catch our tail.

Also, users are experiencing slowness in queries with the sacct command while archiving is running. 

We really need to decrease the size of the database, and we have gotten approval to remove anything older than 1 year.

How should we go forward?

I was wondering if this is possible:
###############
1. systemctl stop slurmdbd
2. time mysqldump -u root slurm_acc --single-transaction lhpc_job_table --where='time_start>1613055899' > /nfs/fs1/slurm_archive_data/lhpc_job_table-test.sql
3. time mysqldump -u root slurm_acc --single-transaction lhpc_step_table --where='time_start>1613055899' > /nfs/fs1/slurm_archive_data/lhpc_step_table-test.sql
4. drop table lhpc_step_table
5. time mysql -u root -p slurm_acc < /nfs/fs1/slurm_archive_data/lhpc_step_table-test.sql
6. time mysql -u root -p slurm_acc < /nfs/fs1/slurm_archive_data/lhpc_job_table-test.sql
7. systemctl start slurmdbd
###############

1. That is, stop the slurmdbd service so no new data gets created for a period of time. 
2. Dump the data that we want to keep from job_table and step_table to sql backup files. 
3. drop the tables
4. restore the tables from the sql backup files. 
5. start the slurmdbd service again

Will this work?
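One detail worth double-checking in the plan above is the epoch cutoff in the --where clauses. A small sketch (GNU date assumed) for converting between epoch seconds and dates:

```shell
#!/bin/sh
# 1613055899 is the epoch cutoff from the mysqldump --where clauses above.
cutoff=1613055899
date -u -d "@${cutoff}" +%Y-%m-%dT%H:%M:%SZ   # -> 2021-02-11T15:04:59Z

# To compute a fresh "keep one year" cutoff instead of hard-coding one:
fresh_cutoff=$(date -u -d '1 year ago' +%s)
echo "time_start>${fresh_cutoff}"
```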
Comment 17 Ben Roberts 2022-02-11 11:17:42 MST
Hi Hjalti,

I'm glad you were able to find the right timeout and increase it to get things working.  I've been looking at your plan to truncate the tables manually.  Deleting the data outside of the normal purge/archive process makes me nervous, and it is not a supported method. However, I understand why you want to delete it this way.  Looking at the tables, I don't see foreign keys involved (except between themselves), so removing data from those tables shouldn't cause problems.

That being said, I would recommend testing it on your backup system first to make sure that you don't run into anything unexpected.  One more table that could possibly grow large is the events table; you might check it while you're doing this.

I am also concerned about the amount of time slurmdbd would be down.  You can have slurmdbd down for a while, and slurmctld will queue messages destined for slurmdbd until it comes back online, but with the number of jobs you are running every day it will not take long before you max out your MaxDBDMsgs.  Have you configured this to be something greater than the default?
https://slurm.schedmd.com/slurm.conf.html#OPT_MaxDBDMsgs
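As a back-of-envelope sketch for sizing MaxDBDMsgs: the record volume below is the upper end of the job-creation rate reported earlier in this ticket, and the maintenance window is an assumed ~3 hours; both are placeholders to adjust, and generous headroom is needed since steps and other message types are queued too.

```shell
#!/bin/sh
records_per_day=500000   # assumed: ~500k new job records/day, per this ticket
window_hours=3           # assumed slurmdbd downtime during the maintenance
needed=$((records_per_day * window_hours / 24))
echo "MaxDBDMsgs should comfortably exceed $needed"
```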

Thanks,
Ben
Comment 18 Hjalti Sveinsson 2022-02-11 15:22:30 MST
Hi Ben,

Earlier today I tried the method I described, on a test slurmdbd server that has a full copy of our Slurm database (I took the mysqldump just a few days ago and restored it there).

It seems to have worked the way I wanted, and I timed each step.

1. systemctl stop slurmdbd -> 1 sec
2. time mysqldump -u root slurm_acc --single-transaction lhpc_job_table --where='time_start>1613055899' > /nfs/fs1/slurm_archive_data/lhpc_job_table-test.sql -> 16m21s
3. time mysqldump -u root slurm_acc --single-transaction lhpc_step_table --where='time_start>1613055899' > /nfs/fs1/slurm_archive_data/lhpc_step_table-test.sql -> 97m28s
4. drop table lhpc_step_table -> few seconds
5. drop table lhpc_job_table -> few seconds
6. time mysql -u root -p slurm_acc < /nfs/fs1/slurm_archive_data/lhpc_step_table-test.sql -> 32m25s
7. time mysql -u root -p slurm_acc < /nfs/fs1/slurm_archive_data/lhpc_job_table-test.sql -> 46m56s
8. systemctl start slurmdbd -> 1 sec

So the whole process takes roughly 3 hours, and it brought the size of slurm_acc down from over 600 GB to only 70 GB.

MariaDB [slurm_acc]> SELECT table_schema "slurm_acc",
    -> ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) "DB Size in MB"
    -> FROM information_schema.tables
    -> GROUP BY table_schema;
+--------------------+---------------+
| slurm_acc          | DB Size in MB |
+--------------------+---------------+
| information_schema |           0.1 |
| mysql              |           0.6 |
| performance_schema |           0.0 |
| slurm_acc          |       70370.7 |
+--------------------+---------------+
4 rows in set (0.41 sec)

I have checked that I can look up data with the sacct command, and everything works just fine.
Comment 20 Hjalti Sveinsson 2022-02-14 04:30:00 MST
Hi again, 

The method I described did work very well on the test system. Is there any way for me to confirm that all records I want for the period I select are in the database?

I am thinking of an SQL query, e.g. one for each month I restored.  I took data for the last 12 months and used mysqldump to dump it via:

Backup:
time mysqldump -u root slurm_acc --single-transaction lhpc_job_table --where='time_start>1613055899' > /nfs/fs1/slurm_archive_data/lhpc_job_table-test.sql; 

time mysqldump -u root slurm_acc --single-transaction lhpc_step_table --where='time_start>1613055899' > /nfs/fs1/slurm_archive_data/lhpc_step_table-test.sql

Restore after dropping tables in mysql cli:
time mysql -u root -p slurm_acc < /nfs/fs1/slurm_archive_data/lhpc_step_table-test.sql; 
time mysql -u root -p slurm_acc < /nfs/fs1/slurm_archive_data/lhpc_job_table-test.sql
Comment 21 Ben Roberts 2022-02-14 10:29:00 MST
I'm glad to hear that your test went well.  I think you're on the right track for confirming that the records you expect are still there.  I would pick a few time windows and get a count of how many records were there before and after the truncation.  For example, to look at how many records there were in the job and step tables for the month of December 2021 you could run these.

select count(*) from lhpc_job_table where time_start>1638316800 and time_end<1640908800;

select count(*) from lhpc_step_table where time_start>1638316800 and time_end<1640908800;

Note that the time stamps I used are for midnight GMT.  
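Those epoch cutoffs can be sanity-checked by converting them to UTC dates; a quick sketch:

```python
from datetime import datetime, timezone

# Convert the epoch cutoffs used in the queries above to UTC dates.
for ts in (1638316800, 1640908800):
    print(datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M"))
# -> 2021-12-01 00:00 and 2021-12-31 00:00
```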

With the number of records you have, you can decide whether to query full months or use smaller increments.  Counting large numbers of records should return faster than returning all the information in those records, but it still takes system resources.  I would focus on the number of records around the boundary between the records you deleted and the ones you kept.  The more recent records are probably going to be ok.

The thing I think is most likely to have happened is that you have some orphaned job step records.  Since job and step records are kept or discarded based on their own start times, it's possible for a job record to be discarded while some of its steps, which started later in the job's life, fell after the cutoff and were kept.  You could see this for jobs that started just before the point where you began keeping records but have steps that started after it.  I would expect the number of jobs like this to be small, and the only effect should be a potentially confusing report about a job step.
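The orphaned-step scenario can be sketched with toy rows (all values invented):

```python
# Toy data: job and step rows with their own start epochs (job_db_inx, time_start).
jobs  = [(100, 999), (101, 1005)]    # job 100 started before the cutoff
steps = [(100, 1002), (101, 1006)]   # but one of its steps started after it
cutoff = 1000                        # purge boundary on time_start

# Purging keeps only rows whose own start time is past the cutoff.
kept_jobs  = {inx for inx, start in jobs if start > cutoff}
kept_steps = [(inx, start) for inx, start in steps if start > cutoff]

# Step rows whose parent job row was discarded are orphaned.
orphans = [(inx, start) for inx, start in kept_steps if inx not in kept_jobs]
print(orphans)  # -> [(100, 1002)]
```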

Thanks,
Ben
Comment 24 Hjalti Sveinsson 2022-02-15 15:19:56 MST
Okay I see. 

I just tested your query:

---
job table: 

on prod system:
MariaDB [slurm_acc]> select count(*) from lhpc_job_table where time_start>1639180800 and time_end<1639353600;
+----------+
| count(*) |
+----------+
|   688489 |
+----------+
1 row in set (9 min 53.21 sec)

on test system:
MariaDB [slurm_acc]> select count(*) from lhpc_job_table where time_start>1639180800 and time_end<1639353600;
+----------+
| count(*) |
+----------+
|   687523 |
+----------+
1 row in set (11.81 sec)

so the test system is 966 records short in the job table.

---
step table:

on test system:
MariaDB [slurm_acc]> select count(*) from lhpc_step_table where time_start>1638316800 and time_end<1640908800;
+----------+
| count(*) |
+----------+
|  6239857 |
+----------+
1 row in set (2 min 44.08 sec)

prod system:
MariaDB [slurm_acc]> select count(*) from lhpc_step_table where time_start>1638316800 and time_end<1640908800;
+----------+
| count(*) |
+----------+
|  6243263 |
+----------+
1 row in set (1 hour 23 min 42.53 sec)
Comment 25 Hjalti Sveinsson 2022-02-15 15:20:30 MST
So there seems to be a bit of a difference; do you know why that might be?
Comment 26 Ben Roberts 2022-02-16 10:56:37 MST
It's possible that there are some jobs that didn't have a valid start time or end time.  If you remember, in comment 4 there were at least 50 jobs that had start and end times of 0, probably more.  I wonder if there were jobs with incomplete records that were cleaned up by the truncation you did but are still showing up in the production database.  I would like to see details for some of the jobs that are missing from your test system.  Can you identify a handful of missing jobs and then get the complete database record for them?

I would narrow the window of time you're looking at so you have fewer job records to examine, maybe a few hours.  Assuming you find a discrepancy in the number of records, you can then run the same query and compare the job_db_inx of the jobs on the two systems.  Using 'diff' should identify the records of interest.  Then you can pick several of those records and run:
  select * from <cluster_name>_job_table where job_db_inx=<value>;
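The diff step can also be done with plain set subtraction on the two exported index lists; a sketch with invented job_db_inx values:

```python
# Hypothetical job_db_inx lists exported from each system for the same window.
prod_inx = {446014236, 446014237, 446014240, 446014241}
test_inx = {446014236, 446014240}

# Records present on production but missing from the test restore.
missing_on_test = sorted(prod_inx - test_inx)
print(missing_on_test)  # -> [446014237, 446014241]

# Each of these indexes can then be inspected individually with:
#   select * from <cluster_name>_job_table where job_db_inx=<value>;
```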

Let me know if you run into any problems gathering that information.

Thanks,
Ben
Comment 27 Hjalti Sveinsson 2022-02-17 04:55:31 MST
Hi, thank you for this.  I am not sure how to do this, because if I search with "sacct --start", doesn't the start time need to be in the database?  Then I won't find the jobs with a missing start time/end time.

How should I do the query?
Comment 28 Ben Roberts 2022-02-17 08:52:48 MST
Sorry I didn't explain what I was thinking better.  Since the database query you're doing is returning a different number of records I was thinking that you could get a sampling of jobs that were missing from the test database.  If you run the same query on the production and test databases and print out just the job_db_inx field it will let you run a 'diff' on the two lists to see where the difference is.  Here's an example where it's just a 12 hour window of time.
select job_db_inx from lhpc_job_table where time_start>1639180800 and time_end<1639224000;

Once you run 'diff' on the two lists you should have several jobs to look at.  At that point you can run this on several jobs selected at random:
select * from lhpc_job_table where job_db_inx=<value>;

Thanks,
Ben
Comment 29 Hjalti Sveinsson 2022-02-18 03:50:10 MST
Okay, I am not sure if I did this in the correct manner, but here is what I've got:

Ran sql query: 

ON TEST:
select job_db_inx from lhpc_job_table where time_start>1639180800 and time_end<1639224000;

67154 rows in set (2 min 53.11 sec)

ON PRD:

select job_db_inx from lhpc_job_table where time_start>1639180800 and time_end<1639224000;

66368 rows in set (9 min 46.10 sec)

-> so I can see that for some reason there are more rows on the TEST system for this time period.

Then I ran this to get everything from these jobs:

TEST:
[root@ru-lhpc-db-archive ~]# time mysql -u root slurm_acc -e 'select * from lhpc_job_table where job_db_inx>446011437 and job_db_inx<458044528' > /root/archive-compare.txt

real    2m10.808s
user    1m31.721s
sys     0m6.410s

And on PROD
root@ru-lhpc-head-db-01:~# time mysql -u root slurm_acc -e 'select * from lhpc_job_table where job_db_inx>446011437 and job_db_inx<461353091' > /root/head-db-compare.txt

real    6m33.387s
user    1m55.344s
sys     0m9.044s

-----

Then I ran compare with diff:

[root@ru-lhpc-db-archive ~]# diff archive-compare.txt head-db-compare.txt > /u0/mysql/slurm_acc/db_diff.out
[root@ru-lhpc-db-archive ~]# ls -ltar /u0/mysql/slurm_acc/db_diff.out
-rw-r--r--. 1 root root 1139571943 Feb 18 10:39 /u0/mysql/slurm_acc/db_diff.out
[root@ru-lhpc-db-archive ~]# ls -lthar /u0/mysql/slurm_acc/db_diff.out
-rw-r--r--. 1 root root 1.1G Feb 18 10:39 /u0/mysql/slurm_acc/db_diff.out

1 gigabyte of diffs; that is a lot.

I checked the first diff line:

2787a2788,2789
> 446014236     1639222586      0       bioprod NULL    NULL    0       0       NULL    1       0       NULL    0       0       est_weight      5       0       4294967294      NULL    40655720        1       0       0       1065    1042    0       4294967294      1065    0       NULL
    4096    None assigned   0       NULL    cpu_hog 1       4       5       1639184623      0       0       1639222586      0                       /nfs/odinn/datavault/damp/tmp/vol101/phenotypesDaisy/PhenoCWQT/Proteomics_PC0_6918_183_CCK_CCKN_07082019        NULL    0               1=1,2=4096,4=1,5=1

looked up the job:

on PROD:
root@ru-lhpc-head-db-01:~# sacct -j 40655720
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
40655720     est_weight    cpu_hog    bioprod          1 CANCELLED+      0:0

on TEST:
[root@ru-lhpc-db-archive ~]# sacct -D -j 40655720 -o start,end
              Start                 End
------------------- -------------------
2021-02-20T16:34:30 2021-02-20T16:34:31
2021-02-20T16:34:30 2021-02-20T16:34:31

on PROD:
root@ru-lhpc-head-db-01:~# sacct -D -j 40655720 -o start,end
              Start                 End
------------------- -------------------
2018-06-03T02:40:17 2018-06-03T03:20:10
2019-03-19T15:03:48 2019-03-19T15:12:39
2019-12-02T14:05:10 2019-12-02T14:45:08
2019-12-02T14:05:10 2019-12-02T14:45:08
2020-06-29T11:56:18 2020-06-29T12:00:50
2020-06-29T11:56:18 2020-06-29T12:00:50
2021-02-20T16:34:30 2021-02-20T16:34:31
2021-02-20T16:34:30 2021-02-20T16:34:31
2021-12-11T11:36:26 2021-12-11T11:36:26
root@ru-lhpc-head-db-01:~# sacct -j 40655720 -l
       JobID     JobIDRaw    JobName  Partition  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks  AllocCPUS    Elapsed      State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov     ReqMem ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask    AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask   AveDiskWrite    ReqTRES  AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutAve TRESUsageOutTot
------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -------------- ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- ------------------- ------------------- --------------- ---------------
40655720     40655720     est_weight    cpu_hog                                                                                                                                                                                                               1   00:00:00 CANCELLED+      0:0                  Unknown       Unknown       Unknown        4Gn              0         

So this job was apparently not restored in the TEST environment.
Comment 30 Hjalti Sveinsson 2022-02-18 05:09:01 MST
I just checked more jobs:

Took the first 13 jobs in the diff list:

[root@ru-lhpc-db-archive ~]# head -20  /u0/mysql/slurm_acc/db_diff.out | grep '>' | awk '{print $21}' | xargs | sed 's/ /,/g'
40655720,40655721,40655724,40655725,40655728,40655729,40655732,40655733,40655736,40655737,40656657,40656658,40659108

On prod system:

root@ru-lhpc-head-db-01:~# sacct -j 40655720,40655721,40655724,40655725,40655728,40655729,40655732,40655733,40655736,40655737,40656657,40656658,40659108 -o start,end,state
              Start                 End      State
------------------- ------------------- ----------
2021-12-11T11:36:26 2021-12-11T11:36:26 CANCELLED+
2021-12-11T11:36:26 2021-12-11T11:36:26 CANCELLED+
2021-12-11T11:36:26 2021-12-11T11:36:26 CANCELLED+
2021-12-11T11:36:26 2021-12-11T11:36:26 CANCELLED+
2021-12-11T11:36:24 2021-12-11T11:36:24 CANCELLED+
2021-12-11T11:36:24 2021-12-11T11:36:24 CANCELLED+
2021-12-11T11:36:24 2021-12-11T11:36:24 CANCELLED+
2021-12-11T11:36:24 2021-12-11T11:36:24 CANCELLED+
2021-12-11T11:36:23 2021-12-11T11:36:23 CANCELLED+
2021-12-11T11:36:23 2021-12-11T11:36:23 CANCELLED+
2021-12-11T11:36:22 2021-12-11T11:36:22 CANCELLED+
2021-12-11T11:36:22 2021-12-11T11:36:22 CANCELLED+
2021-12-11T11:36:22 2021-12-11T11:36:22 CANCELLED+

All of these jobs were cancelled. 

If I look them up on the TEST system they are not found:

[root@ru-lhpc-db-archive ~]# sacct -j 40655720,40655721,40655724,40655725,40655728,40655729,40655732,40655733,40655736,40655737,40656657,40656658,40659108 -o start,end
              Start                 End
------------------- -------------------
2021-02-20T16:34:30 2021-02-20T16:34:31
2021-02-20T16:34:30 2021-02-20T16:34:31
2021-02-20T16:34:31 2021-02-20T16:34:31
2021-02-20T16:34:31 2021-02-20T16:34:31
2021-02-20T16:34:33 2021-02-20T16:34:33
2021-02-20T16:34:33 2021-02-20T16:34:33
2021-02-20T16:34:33 2021-02-20T16:34:34
2021-02-20T16:34:33 2021-02-20T16:34:34
2021-02-20T16:34:37 2021-02-20T16:34:38
2021-02-20T16:34:37 2021-02-20T16:34:38
2021-02-20T16:34:38 2021-02-20T16:34:39
2021-02-20T16:34:38 2021-02-20T16:34:39
2021-02-20T16:34:40 2021-02-20T16:34:41
2021-02-20T16:34:40 2021-02-20T16:34:41
2021-02-20T16:34:41 2021-02-20T16:34:43
2021-02-20T16:34:41 2021-02-20T16:34:43
2021-02-20T16:34:46 2021-02-20T16:34:46
2021-02-20T16:34:46 2021-02-20T16:34:46
2021-02-20T16:34:47 2021-02-20T16:34:47
2021-02-20T16:34:47 2021-02-20T16:34:47
2021-02-20T17:36:26 2021-02-20T17:36:26
2021-02-20T17:36:26 2021-02-20T17:36:26
2021-02-20T17:36:26 2021-02-20T17:36:27
2021-02-20T17:36:26 2021-02-20T17:36:27
2021-02-20T17:12:16 2021-02-20T17:12:20
2021-02-20T17:12:16 2021-02-20T17:12:20

So it seems that jobs that were CANCELLED were not backed up when I did the backup and restore.

Any clue why?
Comment 31 Ben Roberts 2022-02-18 14:22:57 MST
I see a few things in the data you collected.  The first is that the start and end times for the 13 jobs you listed on your production system are from December 2021, while the start and end times for the same list on your test system are from February 2021.  I remember seeing that you had enough jobs cycling through that job ids have been reused, so I think the jobs being listed on the two systems are not actually the same jobs.  Rather, they're different jobs with the same jobid.  Using the job_db_inx is a way to make sure we're looking at the same records.

What I was hoping to get from the queries I sent was all the details from the database for a few of the jobs that showed up on one system and not the other.  The first query I showed with the job_db_inx was to identify several jobs that existed on one system and not the other.  Then you could get all the details of just a few jobs from the database.  I would modify that command slightly to make the output easier to read:
select * from lhpc_job_table where job_db_inx=<value>\G

From your diff file you should be able to see the job_db_inx for the jobs you identified and I'd still like to see how those database records look.

Until I have that information I can do some conjecture about why there are differences.  Jobs have several time stamps for different things, but the ones we have been focusing on for now have been the submit time and the start time.  Every job will have a submit time in order to exist on the system, but some jobs are canceled before they start, so they don't necessarily have a start time.  What may have happened is that when you dumped the data from the database you added a constraint that it was for jobs with a start time greater than X.  This would have excluded jobs that didn't have a start time and they would not be present any more when you re-imported the information to your production database.  Depending on how you moved job information to your test system, these jobs may still be there.  
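The effect described above can be illustrated with a toy filter (rows and values invented): a `time_start > X` constraint silently drops never-started jobs, while `time_submit > X` keeps them.

```python
# Toy job rows: (job_db_inx, time_submit, time_start).
# A time_start of 0 means the job was cancelled before it ever started.
rows = [
    (1, 1638316800, 1638316900),
    (2, 1638317000, 0),            # cancelled before starting
    (3, 1638318000, 1638318100),
]
cutoff = 1613055899

# A --where='time_start>...' style constraint keeps only started jobs,
kept_by_start = [inx for inx, submit, start in rows if start > cutoff]
print(kept_by_start)               # -> [1, 3]  (job 2 is silently dropped)

# whereas filtering on time_submit keeps every submitted job.
kept_by_submit = [inx for inx, submit, start in rows if submit > cutoff]
print(kept_by_submit)              # -> [1, 2, 3]
```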

You mentioned in comment 15 that you imported data to your test system on Feb 8.  I don't quite understand why jobs from December wouldn't show up on the test system.  For your production system, it's possible that if you lowered the PurgeJobAfter time to less than 365 days that it could have cleaned out the jobs from Feb 20 of last year, which would account for that difference.  Do you have details of how you exported the data to import on your test system?

Thanks,
Ben
Comment 32 Ben Roberts 2022-02-24 13:19:11 MST
Hi Hjalti,

I wanted to follow up and see if you have had a chance to look at the database records after the truncation again.  Let me know if you still need help with this ticket.

Thanks,
Ben
Comment 33 Hjalti Sveinsson 2022-02-28 04:17:47 MST
Hi Ben, 

To answer your first question, this is what I used to generate the job output on both the TEST and PROD systems:

TEST:
mysql -u root slurm_acc -e 'select * from lhpc_job_table where job_db_inx>446011437 and job_db_inx<458044528' > /root/archive-compare.txt

PROD:
mysql -u root slurm_acc -e 'select * from lhpc_job_table where job_db_inx>446011437 and job_db_inx<461353091' > /root/head-db-compare.txt

[root@ru-lhpc-db-archive ~]# diff job_db_index_1639180800_1639224000.txt_sorted_TEST.txt job_db_index_1639180800_1639224000.txt_sorted_PROD.txt | grep '>' | wc -l
1942

>You mentioned in comment 15 that you imported data to your test system on Feb 8.  I don't quite understand why jobs from December wouldn't show up on the test system.  For your production system, it's possible that if you lowered the PurgeJobAfter time to less than 365 days that it could have cleaned out the jobs from Feb 20 of last year, which would account for that difference.  Do you have details of how you exported the data to import on your test system?

I think the reason here is that after I restored the data from the PROD system on the TEST system I did take a backup using 'time_start' see comment here -> https://bugs.schedmd.com/show_bug.cgi?id=13346#c18 

And by doing that I only got jobs that have a start time, and not the jobs that were cancelled before they started.

-----

However, I am retrying what you suggested.

1. I have done a FULL restore of the mysql database from PROD -> TEST, with all data from the Feb 8 backup.
2. Did this on test and prod systems:
mysql -u root slurm_acc -e 'select job_db_inx from lhpc_job_table where time_start>1639180800 and time_end<1639224000' > /root/job_db_index_1639180800_1639224000.txt 
3. cat /root/job_db_index_1639180800_1639224000.txt | grep -v job | sort > /root/job_db_index_1639180800_1639224000.txt_sorted.txt
4. mv /root/job_db_index_1639180800_1639224000.txt_sorted.txt /root/job_db_index_1639180800_1639224000.txt_sorted_TEST.txt 
5. did the same on PROD system and transferred the file over. 
6. diff job_db_index_1639180800_1639224000.txt_sorted_TEST.txt job_db_index_1639180800_1639224000.txt_sorted_PROD.txt | grep '>' | wc -l
1942 # so 1942 more jobs on the PROD system.

-----

I am thinking I should perhaps instead of using time_start to take the backup from the tables since then I will have jobs also that don't have a starting time(time_start). 

But the step_table does not have a time_submit column, so how do we take a backup of the step table?
Comment 34 Hjalti Sveinsson 2022-02-28 04:31:50 MST
Okay I see now what is going wrong:

I checked the last line from my sql query and that gave me:

on TEST:
[root@ru-lhpc-db-archive ~]# tail -1 job_db_index_1639180800_1639224000.txt_sorted_TEST.txt
458352736

On PROD:
root@ru-lhpc-head-db-01:~# tail -1 /root/job_db_index_1639180800_1639224000.txt_sorted.txt
463681162

--

When I look up these job_db_inx I find the info:

TEST:
MariaDB [slurm_acc]> select time_submit from lhpc_job_table where job_db_inx=458352736\G
*************************** 1. row ***************************
time_submit: 1644326434

Which is GMT: Tuesday, 8. February 2022 13:20:34 
The day I took the FULL DB backup.

PROD: 
MariaDB [slurm_acc]> select time_submit from lhpc_job_table where job_db_inx=463681162\G
*************************** 1. row ***************************
time_submit: 1646045305

GMT: Monday, 28. February 2022 10:48:25
Today when I made the query. 

---

So the SQL command is not selecting what I intended.  I am using this command:

mysql -u root slurm_acc -e 'select job_db_inx from lhpc_job_table where time_start>1639180800 and time_end<1639224000' > /root/job_db_index_1639180800_1639224000.txt
Comment 35 Hjalti Sveinsson 2022-02-28 04:35:11 MST
What I meant to say in comment 33:

>I am thinking I should perhaps instead of using time_start to take the backup from the tables since then I will have jobs also that don't have a starting time(time_start). 

Was: "I am thinking I should perhaps use time_submit instead of time_start to take the backup from the tables, since then I will also have jobs that don't have a starting time (time_start)."
Comment 36 Ben Roberts 2022-02-28 11:14:11 MST
Hi Hjalti,

I have to apologize, it looks like I made a copy/paste error in comment 28.  I meant to have you look for jobs with a submit time that was within the range you specified, but instead I pasted an example that used the start time and end time.  I'm sure this added to the confusion instead of helping things move forward so, again, I apologize.

You're right that using the submit time for jobs will make sure you are comparing apples to apples.  When you are looking at the job table there will be a submit time and a start time because there is frequently a delay between when a job is submitted and when it starts.  The step table doesn't have a submit time because a job step will begin running as soon as it is created, so the start time is essentially the submit time.

Let me try again to show the queries that I would like to see:
select job_db_inx from lhpc_job_table where time_submit>1639180800 and time_submit<1639224000

Doing this on both the production and test systems will give a list of job_db_inxs that you can get a diff on.  Hopefully this list will be smaller than the 1942 jobs you got before, or ideally the two lists would be the same.  If there is still a difference then we can look at the details of a handful of these jobs like this (replacing the correct index for <job_db_inx>):
select * from lhpc_job_table where job_db_inx=<job_db_inx>\G

The step table should be related to the job table.  If there are differences in the job table then I would expect to see the same job ids as being different for the step table.  To confirm that you should be able to do a similar query, using the start time as mentioned above.
select job_db_inx from lhpc_step_table where time_start>1639180800 and time_start<1639224000

See if using the submit time gives more consistent results.

Thanks,
Ben
Comment 37 Hjalti Sveinsson 2022-03-01 08:30:28 MST
Good news, 

I did the command you asked me to do on both the test and prod systems, and the results are the same:

Test system:
select job_db_inx from lhpc_job_table where time_submit>1639180800 and time_submit<1639224000;

71901 rows in set (1 min 2.50 sec)

Prod system:
select job_db_inx from lhpc_job_table where time_submit>1639180800 and time_submit<1639224000

71901 rows in set (5 min 27.62 sec)

[root@ru-lhpc-db-archive ~]# cat /root/comp_test_1639180800_1639224000.txt | wc -l
71902
[root@ru-lhpc-db-archive ~]# cat /root/comp_prod_1639180800_1639224000.txt | wc -l
71902
[root@ru-lhpc-db-archive ~]# diff /root/comp_test_1639180800_1639224000.txt /root/comp_prod_1639180800_1639224000.txt
[root@ru-lhpc-db-archive ~]#

So everything looks good for this time period. 

By the way this morning on the TEST system I did the following: 

Took a backup of the job_table using time_submit instead of time_start so I would get all jobs:
time mysqldump -u root slurm_acc --single-transaction lhpc_job_table --where='time_submit>1613055899' > /nfs/fs1/slurm_archive_data/lhpc_job_table-test.sql

Took a backup of the step_table using a time_start slightly earlier than the time_submit I used when I took the backup of the job_table:

time mysqldump -u root slurm_acc --single-transaction lhpc_step_table --where='time_start>1612515754' > /nfs/fs1/slurm_archive_data/lhpc_step_table-test.sql

Then I dropped the job_table and step_table and restored them from these two backup files.

The job_db_inx diff was done after doing the backup/restore.
Comment 38 Hjalti Sveinsson 2022-03-01 08:32:06 MST
So maybe this is the way to go?  Would you recommend this approach?  Should I do more tests to see if any data is missing?
Comment 39 Ben Roberts 2022-03-01 10:51:32 MST
I'm glad to hear that this is looking the way we expect now that I've sent the right query.  My apologies again for delaying this outcome.  I think this is the right approach, and now that we have positive confirmation that the records match up after the dump and import, I think we can move forward with plans to perform this on your production system.  Do you have a time in mind for making this change?

Thanks,
Ben
Comment 40 Hjalti Sveinsson 2022-03-03 07:12:23 MST
I agree about going through with this on our PROD system.  I did one additional test, which was to upgrade Slurm from 20.11.x to 21.08.x and start slurmdbd -Dvv in the foreground to see if everything was okay.  Everything went fine.

I will check with my in-house staff when it would be best to do the backup/restore of the tables, since the accounting database will be offline in the meantime.
Comment 41 Ben Roberts 2022-03-03 08:36:26 MST
I'm glad to hear the test upgrade went well too.  I'm sure it went much faster than it would have before the purge of data.  If you'll be doing this maintenance soon I can leave the ticket open to make sure it goes well.  If it's going to be a few weeks or more then I might close the ticket and you can update us if something comes up.

Thanks,
Ben
Comment 42 Hjalti Sveinsson 2022-03-07 03:09:53 MST
I will be performing the PROD export/import tomorrow morning, so please keep this issue open so I can let you know how it went or if anything comes up.
Comment 43 Ben Roberts 2022-03-07 09:09:25 MST
Sounds good.  I'll wait to hear the results of the purge tomorrow.

Thanks,
Ben
Comment 44 Hjalti Sveinsson 2022-03-18 04:49:01 MDT
Hi, you can close this issue. Everything went according to plan! 

Thanks a lot for all your help!

rgds,
Hjalti
Comment 45 Ben Roberts 2022-03-18 08:06:06 MDT
I'm glad to hear everything went well.  Let us know if there's anything we can do to help in the future.

Thanks,
Ben