Hi,

We upgraded Slurm from 23.11.9 to 24.05.3 this weekend and I am now seeing slurmdbd crash with messages like this:

Sep 2 11:14:16 frankie slurmdbd[3149202]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5334 IP:100.64.0.68 CONN:13
Sep 2 11:14:27 frankie slurmdbd[3149202]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5334 IP:100.64.0.68 CONN:11
Sep 2 11:14:29 frankie slurmdbd[3149202]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:13
Sep 2 11:14:29 frankie slurmctld[3149203]: Warning: Note very large processing time from slurmdbd agent: full loop: usec=900101111 began=10:59:29.587
Sep 2 11:14:30 frankie slurmdbd[3149202]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 2 11:52:40 frankie slurmdbd[215156]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:11
Sep 2 11:52:40 frankie slurmdbd[215156]: error: A new registration for cluster cluster CONN:11 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 11:52:40 frankie slurmdbd[215156]: debug: accounting_storage/as_mysql: as_mysql_job_start: Need to reroll usage from Mon Sep 02 10:59:29 2024 Job 16012286 from cluster started then and we are just now hearing about it.
Sep 2 11:52:41 frankie slurmdbd[215156]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

I have this in my.cnf:

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log-error=/var/log/mariadb/mariadb.log
pid-file=/run/mariadb/mariadb.pid
innodb_buffer_pool_size=5096M
innodb_log_file_size=256M
innodb_lock_wait_timeout=900
innodb_flush_method = fsync
innodb_doublewrite = 0 # ZFS is transactional
innodb_use_native_aio = 0
innodb_read_io_threads = 10
innodb_write_io_threads = 10

I'm not sure exactly where to start looking for the cause; for now I'm just restarting slurmdbd automatically when it crashes and collecting logs. Is there another timeout somewhere I need to make longer?

griznog
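One way to see which query is holding the lock at the moment ER_LOCK_WAIT_TIMEOUT fires is to ask InnoDB directly about open transactions and lock waits. A minimal sketch, assuming MariaDB's information_schema InnoDB tables are available on this version (they are read-only queries, so safe to run while the problem is happening):

# Long-running InnoDB transactions and the query each is executing
mysql -e "SELECT trx_id, trx_started, trx_state, trx_query
          FROM information_schema.INNODB_TRX ORDER BY trx_started;"

# Who is waiting on whom
mysql -e "SELECT r.trx_id AS waiting_trx, r.trx_query AS waiting_query,
                 b.trx_id AS blocking_trx, b.trx_query AS blocking_query
          FROM information_schema.INNODB_LOCK_WAITS w
          JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
          JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;"

Capturing this output alongside the "show processlist" when a crash is imminent should point at the blocking statement.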
This seems like a weird coincidence: it appears to crash on roughly 16-minute boundaries. Crashes and the last two messages before each crash:

# grep slurmdbd /var/log/messages | grep -B 2 fatal
Sep 2 11:14:29 frankie slurmdbd[3149202]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:13
Sep 2 11:14:29 frankie slurmctld[3149203]: Warning: Note very large processing time from slurmdbd agent: full loop: usec=900101111 began=10:59:29.587
Sep 2 11:14:30 frankie slurmdbd[3149202]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 11:52:40 frankie slurmdbd[215156]: error: A new registration for cluster cluster CONN:11 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 11:52:40 frankie slurmdbd[215156]: debug: accounting_storage/as_mysql: as_mysql_job_start: Need to reroll usage from Mon Sep 02 10:59:29 2024 Job 16012286 from cluster started then and we are just now hearing about it.
Sep 2 11:52:41 frankie slurmdbd[215156]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 12:08:01 frankie slurmdbd[652261]: error: A new registration for cluster cluster CONN:13 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 12:08:01 frankie slurmdbd[652261]: debug: accounting_storage/as_mysql: as_mysql_job_start: Need to reroll usage from Mon Sep 02 10:59:29 2024 Job 16012286 from cluster started then and we are just now hearing about it.
Sep 2 12:08:01 frankie slurmdbd[652261]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 12:24:02 frankie slurmdbd[1106228]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5464 IP:100.64.0.68 CONN:11
Sep 2 12:24:02 frankie slurmdbd[1106228]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5498 IP:100.64.0.68 CONN:15
Sep 2 12:24:02 frankie slurmdbd[1106228]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 2 12:24:02 frankie slurmdbd[1106228]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:00:27 frankie slurmdbd[1518230]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:11
Sep 2 13:00:27 frankie slurmdbd[1518230]: error: A new registration for cluster cluster CONN:11 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:00:29 frankie slurmdbd[1518230]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:16:02 frankie slurmdbd[2470789]: error: A new registration for cluster cluster CONN:13 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:16:02 frankie slurmdbd[2470789]: debug: accounting_storage/as_mysql: as_mysql_job_complete: Need to reroll usage from Mon Sep 02 12:45:27 2024 Job 14797864 from cluster ended then and we are just now hearing about it.
Sep 2 13:16:03 frankie slurmdbd[2470789]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:32:02 frankie slurmdbd[2879814]: error: A new registration for cluster cluster CONN:13 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:32:02 frankie slurmdbd[2879814]: debug: accounting_storage/as_mysql: as_mysql_job_complete: Need to reroll usage from Mon Sep 02 12:45:27 2024 Job 14797864 from cluster ended then and we are just now hearing about it.
Sep 2 13:32:03 frankie slurmdbd[2879814]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:48:02 frankie slurmdbd[3298025]: error: A new registration for cluster cluster CONN:11 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:48:02 frankie slurmdbd[3298025]: debug: accounting_storage/as_mysql: as_mysql_job_complete: Need to reroll usage from Mon Sep 02 12:45:27 2024 Job 14797864 from cluster ended then and we are just now hearing about it.
Sep 2 13:48:03 frankie slurmdbd[3298025]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
This seems to be associated with a periodic roll-up:

Sep 2 13:00:27 frankie slurmdbd[1518230]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:11
Sep 2 13:00:27 frankie slurmdbd[1518230]: error: A new registration for cluster cluster CONN:11 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:00:29 frankie slurmdbd[1518230]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:16:02 frankie slurmdbd[2470789]: error: A new registration for cluster cluster CONN:13 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:16:02 frankie slurmdbd[2470789]: debug: accounting_storage/as_mysql: as_mysql_job_complete: Need to reroll usage from Mon Sep 02 12:45:27 2024 Job 14797864 from cluster ended then and we are just now hearing about it.
Sep 2 13:16:03 frankie slurmdbd[2470789]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:32:02 frankie slurmdbd[2879814]: error: A new registration for cluster cluster CONN:13 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:32:02 frankie slurmdbd[2879814]: debug: accounting_storage/as_mysql: as_mysql_job_complete: Need to reroll usage from Mon Sep 02 12:45:27 2024 Job 14797864 from cluster ended then and we are just now hearing about it.
Sep 2 13:32:03 frankie slurmdbd[2879814]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:48:02 frankie slurmdbd[3298025]: error: A new registration for cluster cluster CONN:11 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:48:02 frankie slurmdbd[3298025]: debug: accounting_storage/as_mysql: as_mysql_job_complete: Need to reroll usage from Mon Sep 02 12:45:27 2024 Job 14797864 from cluster ended then and we are just now hearing about it.
Sep 2 13:48:03 frankie slurmdbd[3298025]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 14:04:02 frankie slurmdbd[3701401]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:15
Sep 2 14:04:02 frankie slurmdbd[3701401]: error: A new registration for cluster cluster CONN:15 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 14:04:03 frankie slurmdbd[3701401]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

I just ran 'sacctmgr list runaway jobs', which listed 54 jobs. When I let it try to clean those up, slurmdbd crashed again. Once it was restarted, I now see no more runaway jobs listed.
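For reference, the runaway-jobs check mentioned above looks roughly like this (a sketch; as I understand it, answering yes at the prompt fixes the listed jobs and forces usage to be re-rolled from their start times, which is why it can kick off heavy database work):

# List job records the database still considers running but the controller
# no longer knows about; sacctmgr prompts before fixing them.
sacctmgr show runawayjobs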
That bought us an hour before the next crash, which logged:

Sep 2 15:21:08 frankie slurmdbd[4142220]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:10
Sep 2 15:21:08 frankie slurmctld[4142221]: Warning: Note very large processing time from slurmdbd agent: full loop: usec=900090354 began=15:06:08.574
Sep 2 15:21:09 frankie slurmdbd[4142220]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Hello griznog,

Could you send us the output of this command? It will show what the database is currently working on.

> sudo mysql -e "show processlist"
MariaDB [(none)]> show processlist;
+----+-------------+-----------+---------------+---------+------+--------------------------+------------------+----------+
| Id | User        | Host      | db            | Command | Time | State                    | Info             | Progress |
+----+-------------+-----------+---------------+---------+------+--------------------------+------------------+----------+
|  1 | system user |           | NULL          | Daemon  | NULL | InnoDB purge coordinator | NULL             |    0.000 |
|  2 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|  3 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|  4 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|  5 | system user |           | NULL          | Daemon  | NULL | InnoDB shutdown handler  | NULL             |    0.000 |
| 11 | slurm       | localhost | slurm_acct_db | Sleep   |  122 |                          | NULL             |    0.000 |
| 14 | slurm       | localhost | slurm_acct_db | Sleep   |    0 |                          | NULL             |    0.000 |
| 60 | root        | localhost | NULL          | Query   |    0 | Init                     | show processlist |    0.000 |
+----+-------------+-----------+---------------+---------+------+--------------------------+------------------+----------+
8 rows in set (0.000 sec)
I realized my recurring crashes were happening at innodb_lock_wait_timeout intervals, so I increased that to 1200 to test that theory.
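A minimal sketch of bumping that timeout, assuming MariaDB and that slurmdbd gets restarted so its connections pick up the new value (1200 here is just the test value from above, not a recommendation):

# my.cnf ([mysqld] section) -- persists across MariaDB restarts:
#   innodb_lock_wait_timeout = 1200

# Change it at runtime without restarting MariaDB; only connections opened
# afterwards (e.g. after restarting slurmdbd) see the new global value.
mysql -e "SET GLOBAL innodb_lock_wait_timeout = 1200;"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';"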
What version is your MariaDB? When slurmdbd crashes again, could you send another "show processlist" as well?

Where is your slurmdbd located? Is it on top of the database or with the slurmctld?
griznog,

> error: A new registration for cluster cluster CONN:13 just came in, but I am
> already talking to that cluster (CONN:9), closing other connection.

This error means that there is a second slurmctld attempting to connect with the slurmdbd. Do you run a backup slurmctld?
(In reply to Benjamin Witham from comment #8)
> griznog,
>
> > error: A new registration for cluster cluster CONN:13 just came in, but I am
> > already talking to that cluster (CONN:9), closing other connection.
>
> This error means that there is a second slurmctld attempting to connect with
> the slurmdbd. Do you run a backup slurmctld?

Not on purpose; I don't have a backup controller. Is there any way to see what host is trying to connect?
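For what it's worth, the REQUEST_PERSIST_INIT debug lines earlier in the thread already carry the peer address (note the two different sources, IP:100.64.0.68 and IP:100.64.15.254). A hedged way to see which hosts currently hold connections to slurmdbd, assuming the default DbdPort of 6819 (adjust if DbdPort is set differently) and that ss is installed:

# Established TCP connections to slurmdbd's port; the remote addresses show
# which hosts are registering with the dbd.
ss -tnp state established '( sport = :6819 )'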
(In reply to Benjamin Witham from comment #7)
> What version is your MariaDB? When slurmdbd crashes again, could you send
> another "show processlist" as well?
>
> Where is your slurmdbd located? Is it on top of the database or with the
> slurmctld?

MariaDB is from mariadb-server-10.3.39-1.module+el8.8.0+1452+2a7eab68.x86_64 on Rocky Linux 8.10. slurmctld, slurmdbd, and mariadb all run on the same host.

I now have my "is it running" check running every 5 seconds, and if/when it detects a crash it will dump a "show processlist" before restarting slurmdbd. Will post when I have some output from that.

griznog
A new, possibly related data point: slurmctld hung and had to be restarted. The last thing it logged before hanging was:

Sep 02 18:00:35 frankie slurmctld[3132700]: slurmctld: error: _handle_qos_tres_run_secs: job 16012402: QOS normal TRES mem user used limit tres_run_secs underflow, tried to remove 2457600 s>
Sep 02 18:00:35 frankie slurmctld[3132700]: slurmctld: error: _handle_assoc_tres_run_secs: job 16012402: assoc 4254 TRES mem grp_used_tres_run_secs underflow, tried to remove 2457600 second>
Sep 02 18:00:45 frankie slurmctld[3132700]: slurmctld: debug: _job_requeue_op: JobId=16012969_3830(16016876) state 0x8000 reason 0 priority 64
Sep 02 18:00:45 frankie slurmctld[3132700]: slurmctld: preempted JobId=16012969_3830(16016876) has been requeued to reclaim resources for JobId=Invalid
Sep 02 18:10:30 frankie slurmctld[3132700]: slurmctld: server_thread_count over limit (256), waiting
10 crashes since yesterday afternoon. The one common thing for all 10 is this process:

469 slurm localhost slurm_acct_db Query 3645 Updating
delete from "cluster_job_script_table" where hash_inx not in(select script_hash_inx from "cluster_jo    0.000
Hi,

> 469 slurm localhost slurm_acct_db Query 3645 Updating
> delete from "cluster_job_script_table" where hash_inx not in(select
> script_hash_inx from "cluster_jo    0.000

This query is directly related to the purging of the job table, which is issued together with the rollup at the configured frequency.

While we figure this out, could you try removing/commenting the job purging option in slurmdbd.conf?

PurgeJobAfter=#

Restart slurmdbd after the change. If that query is the problematic one, disabling the purge should stabilize the situation.

Kind regards,
Oscar
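For a rough sense of how much work that DELETE represents, the rows it would target can be counted ahead of time. A sketch only: it completes the truncated query above on the assumption that the subquery reads from cluster_job_table (the cluster here is literally named "cluster", hence the table prefix); verify the table and column names against your schema before running. It is read-only, but may itself take a while on tables this size.

mysql -e "SELECT COUNT(*) AS orphaned_scripts
          FROM slurm_acct_db.cluster_job_script_table s
          WHERE s.hash_inx NOT IN
                (SELECT j.script_hash_inx FROM slurm_acct_db.cluster_job_table j);"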
(In reply to Oscar Hernández from comment #14)
> While we figure this out, could you try removing/commenting the job purging
> option in slurmdbd.conf?
>
> PurgeJobAfter=#
>
> Restart slurmdbd after the change. If that query is the problematic one,
> disabling the purge should stabilize the situation.

Done. For the sake of completeness, all my purge settings after that change are:

PurgeEventAfter=12months
#PurgeJobAfter=12months
PurgeResvAfter=12months
PurgeStepAfter=12months
PurgeSuspendAfter=12months
PurgeTXNAfter=12months
PurgeUsageAfter=12months
Hi. How has your slurmdbd stability been since the purge change you listed in comment 16?

Would you please supply the following?

Table row counts for all slurm db tables:

# mysql -e "select table_name, table_rows from information_schema.tables where table_schema='<StorageLoc>'"

where <StorageLoc> is the value from your slurmdbd.conf (it's "slurm_acct_db" by default if not set).

What's the load on the node running slurmctld+slurmdbd+mariadb? Is it generally busy (e.g., load average from uptime)?

Could you supply a few "sdiag" outputs separated by about 15min each?

The latest slurmdbd.log
(In reply to Chad Vizino from comment #17)
> Hi. How has your slurmdbd stability been since the purge change you listed
> in comment 16?

Haven't had a crash since making that change.

> Would you please supply the following?
>
> Table row counts for all slurm db tables:
>
> # mysql -e "select table_name, table_rows from information_schema.tables
> where table_schema='<StorageLoc>'"
>
> where <StorageLoc> is the value from your slurmdbd.conf (it's
> "slurm_acct_db" by default if not set).

# mysql -e "select table_name, table_rows from information_schema.tables where table_schema='slurm_acct_db'"
+---------------------------------+------------+
| table_name                      | table_rows |
+---------------------------------+------------+
| acct_coord_table                |          0 |
| acct_table                      |          2 |
| clus_res_table                  |         30 |
| cluster_assoc_table             |       4830 |
| cluster_assoc_usage_day_table   |     104067 |
| cluster_assoc_usage_hour_table  |    1947233 |
| cluster_assoc_usage_month_table |      10167 |
| cluster_event_table             |       5424 |
| cluster_job_env_table           |    2314074 |
| cluster_job_script_table        |    2095407 |
| cluster_job_table               |    4581519 |
| cluster_last_ran_table          |          1 |
| cluster_resv_table              |         56 |
| cluster_step_table              |    9229297 |
| cluster_suspend_table           |          0 |
| cluster_table                   |          2 |
| cluster_usage_day_table         |       6487 |
| cluster_usage_hour_table        |     162407 |
| cluster_usage_month_table       |        261 |
| cluster_wckey_table             |        358 |
| cluster_wckey_usage_day_table   |      70111 |
| cluster_wckey_usage_hour_table  |    1919084 |
| cluster_wckey_usage_month_table |       7612 |
| convert_version_table           |          0 |
| federation_table                |          0 |
| frankie_assoc_table             |          2 |
| frankie_assoc_usage_day_table   |          0 |
| frankie_assoc_usage_hour_table  |          0 |
| frankie_assoc_usage_month_table |          0 |
| frankie_event_table             |         18 |
| frankie_job_env_table           |          0 |
| frankie_job_script_table        |          0 |
| frankie_job_table               |          0 |
| frankie_last_ran_table          |          1 |
| frankie_resv_table              |          0 |
| frankie_step_table              |          0 |
| frankie_suspend_table           |          0 |
| frankie_usage_day_table         |         32 |
| frankie_usage_hour_table        |        728 |
| frankie_usage_month_table       |          0 |
| frankie_wckey_table             |          0 |
| frankie_wckey_usage_day_table   |          0 |
| frankie_wckey_usage_hour_table  |          0 |
| frankie_wckey_usage_month_table |          0 |
| qos_table                       |         11 |
| res_table                       |         30 |
| table_defs_table                |         48 |
| tres_table                      |          9 |
| txn_table                       |       6801 |
| user_table                      |        705 |
+---------------------------------+------------+

> What's the load on the node running slurmctld+slurmdbd+mariadb? Is it
> generally busy (e.g., load average from uptime)?

Load is pretty low, generally < 10, and the system has 256G of memory and 16 cores (AMD EPYC 7302P). There are spikes when provisioning a lot of nodes, like when bringing up the system, but crashes haven't been correlated with any of that activity.

> Could you supply a few "sdiag" outputs separated by about 15min each?
>
> The latest slurmdbd.log

Might be next week before I can follow up with this; I will be busy/traveling the rest of this week.
Finally got a moment to look back at this: we haven't had a crash since removing the job purge setting from our config. I don't think we are particularly concerned about having that disabled, at least not in the short term. sdiag output is below in case it's interesting. Do you still want me to collect some of these on an interval, or is there something else we should try? I'm content to leave it as is for the moment and try enabling job purges again at a later date, when there's more free time (yes, I'm fantasizing) to troubleshoot if the crashes return.

griznog

*******************************************************
sdiag output at Wed Sep 11 15:07:43 2024 (1726092463)
Data since      Tue Sep 10 17:00:01 2024 (1726012801)
*******************************************************
Server thread count:  2
RPC queue enabled:    0
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 247
Jobs started:   5097
Jobs completed: 5113
Jobs canceled:  33
Jobs failed:    0

Job states ts:  Wed Sep 11 15:07:18 2024 (1726092438)
Jobs pending:   50781
Jobs running:   1701

Main schedule statistics (microseconds):
        Last cycle:   153161
        Max cycle:    234707
        Total cycles: 7260
        Mean cycle:   86572
        Mean depth cycle:  8
        Cycles per minute: 5
        Last queue length: 20776

Main scheduler exit:
        End of job queue:7029
        Hit default_queue_depth:231
        Hit sched_max_job_start: 0
        Blocked on licenses: 0
        Hit max_rpc_cnt: 0
        Timeout (max_sched_time): 0

Backfilling stats
        Total backfilled jobs (since last slurm start): 44902
        Total backfilled jobs (since last stats cycle start): 4061
        Total backfilled heterogeneous job components: 0
        Total cycles: 2526
        Last cycle when: Wed Sep 11 15:07:16 2024 (1726092436)
        Last cycle: 1114782
        Max cycle:  3971832
        Mean cycle: 1442228
        Last depth cycle: 20776
        Last depth cycle (try sched): 105
        Depth Mean: 20493
        Depth Mean (try depth): 117
        Last queue length: 20776
        Queue length mean: 20646
        Last table size: 92
        Mean table size: 88

        Backfill exit
                End of job queue:2505
                Hit bf_max_job_start: 0
                Hit bf_max_job_test: 0
                System state changed:21
                Hit table size limit (bf_node_space_size): 0
                Timeout (bf_max_time): 0

Latency for 1000 calls to gettimeofday(): 24 microseconds

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO ( 2009) count:883828 ave_time:202 total_time:179267027
        REQUEST_JOB_INFO_SINGLE ( 2021) count:414934 ave_time:27260 total_time:11311255692
        REQUEST_NODE_INFO_SINGLE ( 2040) count:235570 ave_time:20089 total_time:4732551063
        REQUEST_NODE_INFO ( 2007) count:137995 ave_time:12889 total_time:1778650314
        REQUEST_BUILD_INFO ( 2001) count:132022 ave_time:5316 total_time:701918589
        REQUEST_STEP_COMPLETE ( 5016) count:120422 ave_time:53692 total_time:6465793187
        MESSAGE_EPILOG_COMPLETE ( 6012) count:119995 ave_time:209812 total_time:25176479597
        REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:119664 ave_time:63669 total_time:7618940301
        REQUEST_COMPLETE_PROLOG ( 6018) count:118704 ave_time:152226 total_time:18069890315
        REQUEST_JOB_INFO ( 2003) count:62355 ave_time:170717 total_time:10645073450
        REQUEST_SUBMIT_BATCH_JOB ( 4003) count:43375 ave_time:17831 total_time:773455706
        REQUEST_FED_INFO ( 2049) count:42566 ave_time:83 total_time:3557515
        REQUEST_JOB_USER_INFO ( 2039) count:41948 ave_time:54922 total_time:2303896573
        REQUEST_STATS_INFO ( 2035) count:24072 ave_time:103 total_time:2492326
        REQUEST_PING ( 1008) count:12036 ave_time:49 total_time:592830
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:9036 ave_time:27172 total_time:245527248
        REQUEST_UPDATE_NODE ( 3002) count:7434 ave_time:59883 total_time:445172324
        REQUEST_KILL_JOB ( 5032) count:1791 ave_time:8954 total_time:16038386
        REQUEST_JOB_READY ( 4019) count:427 ave_time:19019 total_time:8121294
        REQUEST_JOB_STEP_CREATE ( 5001) count:370 ave_time:2364 total_time:874982
        REQUEST_HET_JOB_ALLOC_INFO ( 4027) count:267 ave_time:25971 total_time:6934265
        REQUEST_RESOURCE_ALLOCATION ( 4001) count:229 ave_time:54037 total_time:12374641
        REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:222 ave_time:39376 total_time:8741580
        REQUEST_KILL_JOBS ( 5041) count:144 ave_time:2674 total_time:385114
        REQUEST_CANCEL_JOB_STEP ( 5005) count:87 ave_time:40019 total_time:3481715
        REQUEST_JOB_ALLOCATION_INFO ( 4014) count:24 ave_time:35460 total_time:851059
        ACCOUNTING_UPDATE_MSG (10001) count:18 ave_time:37 total_time:673
        REQUEST_REBOOT_NODES ( 1015) count:15 ave_time:2742 total_time:41136
        REQUEST_PRIORITY_FACTORS ( 2026) count:5 ave_time:27411 total_time:137057
        REQUEST_SHARE_INFO ( 2022) count:4 ave_time:4486 total_time:17946
        REQUEST_UPDATE_JOB ( 3001) count:4 ave_time:20210 total_time:80842
        REQUEST_JOB_STEP_INFO ( 2005) count:4 ave_time:2472 total_time:9889
        REQUEST_CRONTAB ( 2200) count:2 ave_time:22745 total_time:45490
        REQUEST_PERSIST_INIT ( 6500) count:1 ave_time:100 total_time:100
        REQUEST_JOB_REQUEUE ( 5023) count:1 ave_time:238 total_time:238
        REQUEST_UPDATE_CRONTAB ( 2202) count:1 ave_time:78031 total_time:78031

Remote Procedure Call statistics by user
        root ( 0) count:1022375 ave_time:62838 total_time:64244737343
        telegraf ( 943) count:472883 ave_time:18671 total_time:8829303372
        eduardo.hirata ( 5484) count:109501 ave_time:9705 total_time:1062798840
        michaela.mueller ( 5498) count:83035 ave_time:15897 total_time:1320024617
        yue.yu ( 5570) count:65539 ave_time:6635 total_time:434904428
        soorya.pradeep ( 5334) count:58284 ave_time:6058 total_time:353136056
        ariana.peck ( 5572) count:54007 ave_time:9991 total_time:539601313
        ziwen.liu ( 5464) count:49845 ave_time:9773 total_time:487160078
        yttria.aniseia ( 5593) count:44610 ave_time:11092 total_time:494832934
        ilan.silva ( 5531) count:39870 ave_time:9810 total_time:391161126
        bradley.guilliams ( 5733) count:39866 ave_time:18764 total_time:748068232
        madhav.mantri ( 5598) count:33004 ave_time:13373 total_time:441370440
        appsmgr ( 5225) count:29824 ave_time:94763 total_time:2826236778
        vincent.turon-lagot ( 5508) count:27000 ave_time:5611 total_time:151498519
        ritwicq.arjyal ( 5509) count:25745 ave_time:8914 total_time:229494745
        siyu.he ( 5567) count:21929 ave_time:9585 total_time:210205104
        kevin.zhao ( 5648) count:18733 ave_time:70577 total_time:1322129782
        neal.soderquist ( 5608) count:18148 ave_time:7746 total_time:140577051
        andy.zhou ( 5510) count:17947 ave_time:7419 total_time:133165243
        jonathan.schwartz ( 5615) count:17881 ave_time:39902 total_time:713497795
        ryan.ward ( 5458) count:17328 ave_time:27041 total_time:468569262
        jaeyoon.lee ( 5523) count:16182 ave_time:22420 total_time:362810355
        padmini.deosthale ( 5538) count:15680 ave_time:6073 total_time:95240059
        doug.henze ( 5520) count:15189 ave_time:8997 total_time:136669702
        jared.nigg ( 5663) count:13996 ave_time:9035 total_time:126454527
        taylla.theodoro ( 5740) count:13696 ave_time:51190 total_time:701108467
        reza.paraan ( 5569) count:13181 ave_time:5494 total_time:72417739
        kithmini.herath ( 5707) count:11636 ave_time:56192 total_time:653857930
        utz.ermel ( 5614) count:10102 ave_time:37755 total_time:381404807
        carolina.arias ( 5397) count:9070 ave_time:12121 total_time:109941243
        eileen.wang ( 5463) count:8731 ave_time:22102 total_time:192976011
        jason.perera ( 5761) count:8267 ave_time:8700 total_time:71926559
        benjamin.iovino ( 5736) count:7960 ave_time:6553 total_time:52165384
        yang-joon.kim ( 5223) count:7750 ave_time:6906 total_time:53525915
        leah.dorman ( 5459) count:7506 ave_time:15224 total_time:114276500
        svc.seqbot ( 5534) count:6945 ave_time:8032 total_time:55787499
        ilakkiyan.jeyakumar ( 5293) count:6763 ave_time:14849 total_time:100424009
        seth.hinz ( 5716) count:6316 ave_time:10265 total_time:64834005
        william.law ( 5515) count:6255 ave_time:31893 total_time:199496714
        saugat.kandel ( 5637) count:6123 ave_time:33114 total_time:202761680
        paul.lebel ( 5096) count:6019 ave_time:16583 total_time:99817591
        jordao.bragantini ( 5196) count:5728 ave_time:35732 total_time:204674294
        ivan.ivanov ( 5011) count:5348 ave_time:3908 total_time:20905113
        akilandeswari.b ( 5527) count:5011 ave_time:10867 total_time:54456319
        luise.seeker ( 5604) count:4720 ave_time:7199 total_time:33983218
        mallak.ali ( 5713) count:4591 ave_time:7408 total_time:34010282
        merlin.lange ( 5089) count:4292 ave_time:15198 total_time:65233675
        eric.waltari ( 5044) count:3934 ave_time:3732 total_time:14684067
        sudip.khadka ( 5588) count:3855 ave_time:20219 total_time:77944873
        tiger.lao ( 5560) count:3763 ave_time:17386 total_time:65426775
        siddhansh.agarwal ( 5313) count:3401 ave_time:57219 total_time:194603592
        phil.smoot ( 5649) count:3294 ave_time:21810 total_time:71844545
        xiang.zhao ( 5291) count:2674 ave_time:20007 total_time:53499385
        daniel.serwas ( 5574) count:2393 ave_time:13323 total_time:31883514
        keir.balla ( 5341) count:1968 ave_time:8102 total_time:15945718
        julia.peukes ( 5724) count:1832 ave_time:7189 total_time:13171589
        amanda.seng ( 5477) count:1444 ave_time:4615 total_time:6665350
        teun.huijben ( 5702) count:1011 ave_time:76395 total_time:77235797
        gibraan.rahman ( 5622) count:1000 ave_time:10876 total_time:10876341
        alishba.imran ( 5712) count:778 ave_time:11861 total_time:9228580
        dihan.zheng ( 5778) count:589 ave_time:21955 total_time:12931511
        nicholas.bratvold ( 5742) count:498 ave_time:12762 total_time:6355538
        allyson.ryan ( 5741) count:452 ave_time:2161 total_time:976947
        craig.kapfer ( 5320) count:362 ave_time:15262 total_time:5524857
        rowan.cassius ( 5728) count:304 ave_time:6387 total_time:1941763
        kyle.awayan ( 5206) count:285 ave_time:5253 total_time:1497227
        kuo-feng.weng ( 5525) count:246 ave_time:9998 total_time:2459674
        john.pak ( 5067) count:160 ave_time:22770 total_time:3643246
        laina.hall ( 5709) count:127 ave_time:13353 total_time:1695944
        duo.peng ( 5314) count:122 ave_time:80168 total_time:9780611
        george.crowley ( 5519) count:122 ave_time:18238 total_time:2225113
        madhurya.sekhar ( 5645) count:73 ave_time:90174 total_time:6582755
        kyle.swanson ( 5777) count:66 ave_time:29173 total_time:1925454
        deepika.sundarraman ( 5494) count:51 ave_time:2339 total_time:119332
        carlos.gonzalez ( 5537) count:49 ave_time:4032 total_time:197585
        josh.elias ( 5068) count:44 ave_time:9452 total_time:415896
        shalin.mehta ( 5004) count:36 ave_time:4252 total_time:153074
        john.hanks ( 5220) count:35 ave_time:7162 total_time:250682
        zhouzerui.liu ( 5661) count:29 ave_time:59944 total_time:1738392
        van.phan ( 5336) count:29 ave_time:57138 total_time:1657023
        saransh.kaul ( 5050) count:25 ave_time:16315 total_time:407894
        dari.kimanius ( 5646) count:25 ave_time:48489 total_time:1212232
        haleigh.miller ( 5782) count:22 ave_time:36474 total_time:802431
        slurm ( 5224) count:19 ave_time:40 total_time:773
        kyle.harrington ( 5502) count:16 ave_time:14174 total_time:226794
        james.burgess ( 5592) count:14 ave_time:22933 total_time:321063
        sarah.lin ( 5113) count:8 ave_time:3028 total_time:24226
        talon.chandler ( 5333) count:6 ave_time:169612 total_time:1017677

Pending RPC statistics
        No pending RPCs
(In reply to griznog from comment #19)
> Finally got a moment to look back at this: we haven't had a crash since
> removing the job purge setting from our config. I don't think we are
> particularly concerned about having that disabled, at least not in the short
> term. sdiag output is below in case it's interesting. Do you still want me
> to collect some of these on an interval, or is there something else we
> should try? I'm content to leave it as is for the moment and try enabling
> job purges again at a later date, when there's more free time (yes, I'm
> fantasizing) to troubleshoot if the crashes return.

Thanks for sending the snapshot. If you're not getting the fatals due to ER_LOCK_WAIT_TIMEOUT with the present settings, that's good for now, but your job table is going to continue to grow, making future purges problematic.

A couple of things:

Can you reenable your job purges but increase the purge time beyond 12 months (maybe 24 months)? If the db server is keeping up and the purges are under control, you could slowly decrease it, a month or so at a time, back to 12 months.

How much RAM does your db server machine have? Can you increase innodb_buffer_pool_size at all? We recommend up to 50% of the machine's RAM; that may help a bit with performance.

https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build
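A sketch of what those two suggestions could look like in config form, assuming the 256G of RAM mentioned in comment 18 (the exact values are illustrative, not tuned recommendations):

# my.cnf ([mysqld]) -- raise the buffer pool toward ~50% of RAM; 128G is an
# example value for a 256G machine.
innodb_buffer_pool_size = 128G

# slurmdbd.conf -- re-enable job purging with a longer retention, then step it
# down toward 12months once purges are keeping up.
PurgeJobAfter=24months

Both changes require restarting the respective daemon (mariadb and slurmdbd) to take effect.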
(In reply to griznog from comment #19)
> Finally got a moment to look back at this: we haven't had a crash since
> removing the job purge setting from our config. I don't think we are
> particularly concerned about having that disabled, at least not in the short
> term.

Ok, closing for now. Feel free to reopen or start a new ticket if problems return.