Hi,

We upgraded Slurm from 23.11.9 to 24.05.3 this weekend and I am now seeing slurmdbd crash with messages like this:

Sep 2 11:14:16 frankie slurmdbd[3149202]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5334 IP:100.64.0.68 CONN:13
Sep 2 11:14:27 frankie slurmdbd[3149202]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5334 IP:100.64.0.68 CONN:11
Sep 2 11:14:29 frankie slurmdbd[3149202]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:13
Sep 2 11:14:29 frankie slurmctld[3149203]: Warning: Note very large processing time from slurmdbd agent: full loop: usec=900101111 began=10:59:29.587
Sep 2 11:14:30 frankie slurmdbd[3149202]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 2 11:52:40 frankie slurmdbd[215156]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:11
Sep 2 11:52:40 frankie slurmdbd[215156]: error: A new registration for cluster cluster CONN:11 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 11:52:40 frankie slurmdbd[215156]: debug: accounting_storage/as_mysql: as_mysql_job_start: Need to reroll usage from Mon Sep 02 10:59:29 2024 Job 16012286 from cluster started then and we are just now hearing about it.
Sep 2 11:52:41 frankie slurmdbd[215156]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

I have this in my.cnf:

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log-error=/var/log/mariadb/mariadb.log
pid-file=/run/mariadb/mariadb.pid
innodb_buffer_pool_size=5096M
innodb_log_file_size=256M
innodb_lock_wait_timeout=900
innodb_flush_method = fsync
innodb_doublewrite = 0 # ZFS is transactional
innodb_use_native_aio = 0
innodb_read_io_threads = 10
innodb_write_io_threads = 10

I'm not sure exactly where to start looking for the cause; for now I'm just restarting slurmdbd automatically when it crashes and collecting logs. Is there another timeout somewhere I need to make longer?

griznog
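One way to see which query is holding the lock at the moment ER_LOCK_WAIT_TIMEOUT fires is to ask InnoDB directly about open transactions and lock waits. A minimal sketch, assuming MariaDB's information_schema InnoDB tables are available on this version (they are read-only queries, so safe to run while the problem is happening):

# Long-running InnoDB transactions and the query each is executing
mysql -e "SELECT trx_id, trx_started, trx_state, trx_query
          FROM information_schema.INNODB_TRX ORDER BY trx_started;"

# Who is waiting on whom
mysql -e "SELECT r.trx_id AS waiting_trx, r.trx_query AS waiting_query,
                 b.trx_id AS blocking_trx, b.trx_query AS blocking_query
          FROM information_schema.INNODB_LOCK_WAITS w
          JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
          JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;"

Capturing this output alongside the "show processlist" when a crash is imminent should point at the blocking statement.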
This seems like a weird coincidence: it appears to crash on roughly 16-minute boundaries. Crashes and the last two messages before each crash:

# grep slurmdbd /var/log/messages | grep -B 2 fatal
Sep 2 11:14:29 frankie slurmdbd[3149202]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:13
Sep 2 11:14:29 frankie slurmctld[3149203]: Warning: Note very large processing time from slurmdbd agent: full loop: usec=900101111 began=10:59:29.587
Sep 2 11:14:30 frankie slurmdbd[3149202]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 11:52:40 frankie slurmdbd[215156]: error: A new registration for cluster cluster CONN:11 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 11:52:40 frankie slurmdbd[215156]: debug: accounting_storage/as_mysql: as_mysql_job_start: Need to reroll usage from Mon Sep 02 10:59:29 2024 Job 16012286 from cluster started then and we are just now hearing about it.
Sep 2 11:52:41 frankie slurmdbd[215156]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 12:08:01 frankie slurmdbd[652261]: error: A new registration for cluster cluster CONN:13 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 12:08:01 frankie slurmdbd[652261]: debug: accounting_storage/as_mysql: as_mysql_job_start: Need to reroll usage from Mon Sep 02 10:59:29 2024 Job 16012286 from cluster started then and we are just now hearing about it.
Sep 2 12:08:01 frankie slurmdbd[652261]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 12:24:02 frankie slurmdbd[1106228]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5464 IP:100.64.0.68 CONN:11
Sep 2 12:24:02 frankie slurmdbd[1106228]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5498 IP:100.64.0.68 CONN:15
Sep 2 12:24:02 frankie slurmdbd[1106228]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Sep 2 12:24:02 frankie slurmdbd[1106228]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:00:27 frankie slurmdbd[1518230]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:11
Sep 2 13:00:27 frankie slurmdbd[1518230]: error: A new registration for cluster cluster CONN:11 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:00:29 frankie slurmdbd[1518230]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:16:02 frankie slurmdbd[2470789]: error: A new registration for cluster cluster CONN:13 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:16:02 frankie slurmdbd[2470789]: debug: accounting_storage/as_mysql: as_mysql_job_complete: Need to reroll usage from Mon Sep 02 12:45:27 2024 Job 14797864 from cluster ended then and we are just now hearing about it.
Sep 2 13:16:03 frankie slurmdbd[2470789]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:32:02 frankie slurmdbd[2879814]: error: A new registration for cluster cluster CONN:13 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:32:02 frankie slurmdbd[2879814]: debug: accounting_storage/as_mysql: as_mysql_job_complete: Need to reroll usage from Mon Sep 02 12:45:27 2024 Job 14797864 from cluster ended then and we are just now hearing about it.
Sep 2 13:32:03 frankie slurmdbd[2879814]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:48:02 frankie slurmdbd[3298025]: error: A new registration for cluster cluster CONN:11 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:48:02 frankie slurmdbd[3298025]: debug: accounting_storage/as_mysql: as_mysql_job_complete: Need to reroll usage from Mon Sep 02 12:45:27 2024 Job 14797864 from cluster ended then and we are just now hearing about it.
Sep 2 13:48:03 frankie slurmdbd[3298025]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
This seems to be associated with a periodic roll-up:

Sep 2 13:00:27 frankie slurmdbd[1518230]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:11
Sep 2 13:00:27 frankie slurmdbd[1518230]: error: A new registration for cluster cluster CONN:11 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:00:29 frankie slurmdbd[1518230]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:16:02 frankie slurmdbd[2470789]: error: A new registration for cluster cluster CONN:13 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:16:02 frankie slurmdbd[2470789]: debug: accounting_storage/as_mysql: as_mysql_job_complete: Need to reroll usage from Mon Sep 02 12:45:27 2024 Job 14797864 from cluster ended then and we are just now hearing about it.
Sep 2 13:16:03 frankie slurmdbd[2470789]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:32:02 frankie slurmdbd[2879814]: error: A new registration for cluster cluster CONN:13 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:32:02 frankie slurmdbd[2879814]: debug: accounting_storage/as_mysql: as_mysql_job_complete: Need to reroll usage from Mon Sep 02 12:45:27 2024 Job 14797864 from cluster ended then and we are just now hearing about it.
Sep 2 13:32:03 frankie slurmdbd[2879814]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 13:48:02 frankie slurmdbd[3298025]: error: A new registration for cluster cluster CONN:11 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 13:48:02 frankie slurmdbd[3298025]: debug: accounting_storage/as_mysql: as_mysql_job_complete: Need to reroll usage from Mon Sep 02 12:45:27 2024 Job 14797864 from cluster ended then and we are just now hearing about it.
Sep 2 13:48:03 frankie slurmdbd[3298025]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
--
Sep 2 14:04:02 frankie slurmdbd[3701401]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:15
Sep 2 14:04:02 frankie slurmdbd[3701401]: error: A new registration for cluster cluster CONN:15 just came in, but I am already talking to that cluster (CONN:9), closing other connection.
Sep 2 14:04:03 frankie slurmdbd[3701401]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

I just ran 'sacctmgr list runaway jobs', which listed 54 jobs. When I let it try to clean those up, slurmdbd crashed again. Once it was restarted, I now see no more runaway jobs listed.
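For reference, the runaway-jobs check mentioned above looks roughly like this (a sketch; as I understand it, answering yes at the prompt fixes the listed jobs and forces usage to be re-rolled from their start times, which is why it can kick off heavy database work):

# List job records the database still considers running but the controller
# no longer knows about; sacctmgr prompts before fixing them.
sacctmgr show runawayjobs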
That bought us an hour before the next crash, which logged:

Sep 2 15:21:08 frankie slurmdbd[4142220]: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:10496 UID:5224 IP:100.64.15.254 CONN:10
Sep 2 15:21:08 frankie slurmctld[4142221]: Warning: Note very large processing time from slurmdbd agent: full loop: usec=900090354 began=15:06:08.574
Sep 2 15:21:09 frankie slurmdbd[4142220]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
Hello griznog,

Could you send us the output of this command? It will show what the database is currently working on.

> sudo mysql -e "show processlist"
MariaDB [(none)]> show processlist;
+----+-------------+-----------+---------------+---------+------+--------------------------+------------------+----------+
| Id | User        | Host      | db            | Command | Time | State                    | Info             | Progress |
+----+-------------+-----------+---------------+---------+------+--------------------------+------------------+----------+
|  1 | system user |           | NULL          | Daemon  | NULL | InnoDB purge coordinator | NULL             |    0.000 |
|  2 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|  3 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|  4 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|  5 | system user |           | NULL          | Daemon  | NULL | InnoDB shutdown handler  | NULL             |    0.000 |
| 11 | slurm       | localhost | slurm_acct_db | Sleep   |  122 |                          | NULL             |    0.000 |
| 14 | slurm       | localhost | slurm_acct_db | Sleep   |    0 |                          | NULL             |    0.000 |
| 60 | root        | localhost | NULL          | Query   |    0 | Init                     | show processlist |    0.000 |
+----+-------------+-----------+---------------+---------+------+--------------------------+------------------+----------+
8 rows in set (0.000 sec)
I realized my recurring crashes were happening at innodb_lock_wait_timeout intervals, so I increased that to 1200 to test that theory.
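A minimal sketch of bumping that timeout, assuming MariaDB and that slurmdbd gets restarted so its connections pick up the new value (1200 here is just the test value from above, not a recommendation):

# my.cnf ([mysqld] section) -- persists across MariaDB restarts:
#   innodb_lock_wait_timeout = 1200

# Change it at runtime without restarting MariaDB; only connections opened
# afterwards (e.g. after restarting slurmdbd) see the new global value.
mysql -e "SET GLOBAL innodb_lock_wait_timeout = 1200;"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';"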
What version is your MariaDB? When slurmdbd crashes again, could you send another "show processlist" as well?

Where is your slurmdbd located? Is it on top of the database or with the slurmctld?
griznog,

> error: A new registration for cluster cluster CONN:13 just came in, but I am
> already talking to that cluster (CONN:9), closing other connection.

This error means that there is a second slurmctld attempting to connect with the slurmdbd. Do you run a backup slurmctld?
(In reply to Benjamin Witham from comment #8)
> griznog,
>
> > error: A new registration for cluster cluster CONN:13 just came in, but I am
> > already talking to that cluster (CONN:9), closing other connection.
>
> This error means that there is a second slurmctld attempting to connect with
> the slurmdbd. Do you run a backup slurmctld?

Not on purpose; I don't have a backup controller. Is there any way to see what host is trying to connect?
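For what it's worth, the REQUEST_PERSIST_INIT debug lines earlier in the thread already carry the peer address (note the two different sources, IP:100.64.0.68 and IP:100.64.15.254). A hedged way to see which hosts currently hold connections to slurmdbd, assuming the default DbdPort of 6819 (adjust if DbdPort is set differently) and that ss is installed:

# Established TCP connections to slurmdbd's port; the remote addresses show
# which hosts are registering with the dbd.
ss -tnp state established '( sport = :6819 )'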
(In reply to Benjamin Witham from comment #7)
> What version is your MariaDB? When slurmdbd crashes again, could you send
> another "show processlist" as well?
>
> Where is your slurmdbd located? Is it on top of the database or with the
> slurmctld?

MariaDB is from mariadb-server-10.3.39-1.module+el8.8.0+1452+2a7eab68.x86_64 on Rocky Linux 8.10. slurmctld, slurmdbd, and mariadb all run on the same host.

I now have my "is it running" check running every 5 seconds, and if/when it detects a crash it will dump a "show processlist" before restarting slurmdbd. Will post when I have some output from that.

griznog
A new, possibly related data point: slurmctld hung and had to be restarted. The last thing it logged before hanging was:

Sep 02 18:00:35 frankie slurmctld[3132700]: slurmctld: error: _handle_qos_tres_run_secs: job 16012402: QOS normal TRES mem user used limit tres_run_secs underflow, tried to remove 2457600 s>
Sep 02 18:00:35 frankie slurmctld[3132700]: slurmctld: error: _handle_assoc_tres_run_secs: job 16012402: assoc 4254 TRES mem grp_used_tres_run_secs underflow, tried to remove 2457600 second>
Sep 02 18:00:45 frankie slurmctld[3132700]: slurmctld: debug: _job_requeue_op: JobId=16012969_3830(16016876) state 0x8000 reason 0 priority 64
Sep 02 18:00:45 frankie slurmctld[3132700]: slurmctld: preempted JobId=16012969_3830(16016876) has been requeued to reclaim resources for JobId=Invalid
Sep 02 18:10:30 frankie slurmctld[3132700]: slurmctld: server_thread_count over limit (256), waiting
10 crashes since yesterday afternoon. The one common thing for all 10 is this process:

469 slurm localhost slurm_acct_db Query 3645 Updating
delete from "cluster_job_script_table" where hash_inx not in(select script_hash_inx from "cluster_jo    0.000
Hi,

> 469 slurm localhost slurm_acct_db Query 3645 Updating
> delete from "cluster_job_script_table" where hash_inx not in(select
> script_hash_inx from "cluster_jo    0.000

This query is directly related to the purging of the job table, which is issued together with the rollup at the configured frequency.

While we figure this out, could you try removing/commenting the job purging option in slurmdbd.conf?

PurgeJobAfter=#

Restart slurmdbd after the change. If that query is the problematic one, disabling the purge should stabilize the situation.

Kind regards,
Oscar
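For a rough sense of how much work that DELETE represents, the rows it would target can be counted ahead of time. A sketch only: it completes the truncated query above on the assumption that the subquery reads from cluster_job_table (the cluster here is literally named "cluster", hence the table prefix); verify the table and column names against your schema before running. It is read-only, but may itself take a while on tables this size.

mysql -e "SELECT COUNT(*) AS orphaned_scripts
          FROM slurm_acct_db.cluster_job_script_table s
          WHERE s.hash_inx NOT IN
                (SELECT j.script_hash_inx FROM slurm_acct_db.cluster_job_table j);"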
(In reply to Oscar Hernández from comment #14)
> While we figure this out, could you try removing/commenting the job purging
> option in slurmdbd.conf?
>
> PurgeJobAfter=#
>
> Restart slurmdbd after the change. If that query is the problematic one,
> disabling the purge should stabilize the situation.

Done. For the sake of completeness, all my purge settings after that change are:

PurgeEventAfter=12months
#PurgeJobAfter=12months
PurgeResvAfter=12months
PurgeStepAfter=12months
PurgeSuspendAfter=12months
PurgeTXNAfter=12months
PurgeUsageAfter=12months
Hi. How has your slurmdbd stability been since the purge change you listed in comment 16?

Would you please supply the following?

Table row counts for all slurm db tables:

# mysql -e "select table_name, table_rows from information_schema.tables where table_schema='<StorageLoc>'"

where <StorageLoc> is the value from your slurmdbd.conf (it's "slurm_acct_db" by default if not set).

What's the load on the node running slurmctld+slurmdbd+mariadb? Is it generally busy (e.g., load average from uptime)?

Could you supply a few "sdiag" outputs separated by about 15min each?

The latest slurmdbd.log
(In reply to Chad Vizino from comment #17)
> Hi. How has your slurmdbd stability been since the purge change you listed
> in comment 16?

Haven't had a crash since making that change.

> Would you please supply the following?
>
> Table row counts for all slurm db tables:
>
> # mysql -e "select table_name, table_rows from information_schema.tables
> where table_schema='<StorageLoc>'"
>
> where <StorageLoc> is the value from your slurmdbd.conf (it's
> "slurm_acct_db" by default if not set).

# mysql -e "select table_name, table_rows from information_schema.tables where table_schema='slurm_acct_db'"
+---------------------------------+------------+
| table_name                      | table_rows |
+---------------------------------+------------+
| acct_coord_table                |          0 |
| acct_table                      |          2 |
| clus_res_table                  |         30 |
| cluster_assoc_table             |       4830 |
| cluster_assoc_usage_day_table   |     104067 |
| cluster_assoc_usage_hour_table  |    1947233 |
| cluster_assoc_usage_month_table |      10167 |
| cluster_event_table             |       5424 |
| cluster_job_env_table           |    2314074 |
| cluster_job_script_table        |    2095407 |
| cluster_job_table               |    4581519 |
| cluster_last_ran_table          |          1 |
| cluster_resv_table              |         56 |
| cluster_step_table              |    9229297 |
| cluster_suspend_table           |          0 |
| cluster_table                   |          2 |
| cluster_usage_day_table         |       6487 |
| cluster_usage_hour_table        |     162407 |
| cluster_usage_month_table       |        261 |
| cluster_wckey_table             |        358 |
| cluster_wckey_usage_day_table   |      70111 |
| cluster_wckey_usage_hour_table  |    1919084 |
| cluster_wckey_usage_month_table |       7612 |
| convert_version_table           |          0 |
| federation_table                |          0 |
| frankie_assoc_table             |          2 |
| frankie_assoc_usage_day_table   |          0 |
| frankie_assoc_usage_hour_table  |          0 |
| frankie_assoc_usage_month_table |          0 |
| frankie_event_table             |         18 |
| frankie_job_env_table           |          0 |
| frankie_job_script_table        |          0 |
| frankie_job_table               |          0 |
| frankie_last_ran_table          |          1 |
| frankie_resv_table              |          0 |
| frankie_step_table              |          0 |
| frankie_suspend_table           |          0 |
| frankie_usage_day_table         |         32 |
| frankie_usage_hour_table        |        728 |
| frankie_usage_month_table       |          0 |
| frankie_wckey_table             |          0 |
| frankie_wckey_usage_day_table   |          0 |
| frankie_wckey_usage_hour_table  |          0 |
| frankie_wckey_usage_month_table |          0 |
| qos_table                       |         11 |
| res_table                       |         30 |
| table_defs_table                |         48 |
| tres_table                      |          9 |
| txn_table                       |       6801 |
| user_table                      |        705 |
+---------------------------------+------------+

> What's the load on the node running slurmctld+slurmdbd+mariadb? Is it
> generally busy (e.g., load average from uptime)?

Load is pretty low, generally < 10, and the system has 256G of memory and 16 cores (AMD EPYC 7302P). There are spikes when provisioning a lot of nodes, like when bringing up the system, but crashes haven't been correlated with any of that activity.

> Could you supply a few "sdiag" outputs separated by about 15min each?
>
> The latest slurmdbd.log

Might be next week before I can follow up with this; I will be busy/traveling the rest of this week.
Finally got a moment to look back at this: we haven't had a crash since removing the job purge setting from our config. I don't think we are particularly concerned about having that disabled, at least not in the short term. sdiag output is below in case it's interesting. Do you still want me to collect some of these on an interval, or is there something else we should try? I'm content to leave it as is for the moment and try enabling job purges again at a later date, when there's more free time (yes, I'm fantasizing) to troubleshoot if the crashes return.

griznog

*******************************************************
sdiag output at Wed Sep 11 15:07:43 2024 (1726092463)
Data since      Tue Sep 10 17:00:01 2024 (1726012801)
*******************************************************
Server thread count:  2
RPC queue enabled:    0
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 247
Jobs started:   5097
Jobs completed: 5113
Jobs canceled:  33
Jobs failed:    0

Job states ts:  Wed Sep 11 15:07:18 2024 (1726092438)
Jobs pending:   50781
Jobs running:   1701

Main schedule statistics (microseconds):
        Last cycle:   153161
        Max cycle:    234707
        Total cycles: 7260
        Mean cycle:   86572
        Mean depth cycle:  8
        Cycles per minute: 5
        Last queue length: 20776

Main scheduler exit:
        End of job queue:7029
        Hit default_queue_depth:231
        Hit sched_max_job_start: 0
        Blocked on licenses: 0
        Hit max_rpc_cnt: 0
        Timeout (max_sched_time): 0

Backfilling stats
        Total backfilled jobs (since last slurm start): 44902
        Total backfilled jobs (since last stats cycle start): 4061
        Total backfilled heterogeneous job components: 0
        Total cycles: 2526
        Last cycle when: Wed Sep 11 15:07:16 2024 (1726092436)
        Last cycle: 1114782
        Max cycle:  3971832
        Mean cycle: 1442228
        Last depth cycle: 20776
        Last depth cycle (try sched): 105
        Depth Mean: 20493
        Depth Mean (try depth): 117
        Last queue length: 20776
        Queue length mean: 20646
        Last table size: 92
        Mean table size: 88

        Backfill exit
                End of job queue:2505
                Hit bf_max_job_start: 0
                Hit bf_max_job_test: 0
                System state changed:21
                Hit table size limit (bf_node_space_size): 0
                Timeout (bf_max_time): 0

Latency for 1000 calls to gettimeofday(): 24 microseconds

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO ( 2009) count:883828 ave_time:202 total_time:179267027
        REQUEST_JOB_INFO_SINGLE ( 2021) count:414934 ave_time:27260 total_time:11311255692
        REQUEST_NODE_INFO_SINGLE ( 2040) count:235570 ave_time:20089 total_time:4732551063
        REQUEST_NODE_INFO ( 2007) count:137995 ave_time:12889 total_time:1778650314
        REQUEST_BUILD_INFO ( 2001) count:132022 ave_time:5316 total_time:701918589
        REQUEST_STEP_COMPLETE ( 5016) count:120422 ave_time:53692 total_time:6465793187
        MESSAGE_EPILOG_COMPLETE ( 6012) count:119995 ave_time:209812 total_time:25176479597
        REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:119664 ave_time:63669 total_time:7618940301
        REQUEST_COMPLETE_PROLOG ( 6018) count:118704 ave_time:152226 total_time:18069890315
        REQUEST_JOB_INFO ( 2003) count:62355 ave_time:170717 total_time:10645073450
        REQUEST_SUBMIT_BATCH_JOB ( 4003) count:43375 ave_time:17831 total_time:773455706
        REQUEST_FED_INFO ( 2049) count:42566 ave_time:83 total_time:3557515
        REQUEST_JOB_USER_INFO ( 2039) count:41948 ave_time:54922 total_time:2303896573
        REQUEST_STATS_INFO ( 2035) count:24072 ave_time:103 total_time:2492326
        REQUEST_PING ( 1008) count:12036 ave_time:49 total_time:592830
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:9036 ave_time:27172 total_time:245527248
        REQUEST_UPDATE_NODE ( 3002) count:7434 ave_time:59883 total_time:445172324
        REQUEST_KILL_JOB ( 5032) count:1791 ave_time:8954 total_time:16038386
        REQUEST_JOB_READY ( 4019) count:427 ave_time:19019 total_time:8121294
        REQUEST_JOB_STEP_CREATE ( 5001) count:370 ave_time:2364 total_time:874982
        REQUEST_HET_JOB_ALLOC_INFO ( 4027) count:267 ave_time:25971 total_time:6934265
        REQUEST_RESOURCE_ALLOCATION ( 4001) count:229 ave_time:54037 total_time:12374641
        REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:222 ave_time:39376 total_time:8741580
        REQUEST_KILL_JOBS ( 5041) count:144 ave_time:2674 total_time:385114
        REQUEST_CANCEL_JOB_STEP ( 5005) count:87 ave_time:40019 total_time:3481715
        REQUEST_JOB_ALLOCATION_INFO ( 4014) count:24 ave_time:35460 total_time:851059
        ACCOUNTING_UPDATE_MSG (10001) count:18 ave_time:37 total_time:673
        REQUEST_REBOOT_NODES ( 1015) count:15 ave_time:2742 total_time:41136
        REQUEST_PRIORITY_FACTORS ( 2026) count:5 ave_time:27411 total_time:137057
        REQUEST_SHARE_INFO ( 2022) count:4 ave_time:4486 total_time:17946
        REQUEST_UPDATE_JOB ( 3001) count:4 ave_time:20210 total_time:80842
        REQUEST_JOB_STEP_INFO ( 2005) count:4 ave_time:2472 total_time:9889
        REQUEST_CRONTAB ( 2200) count:2 ave_time:22745 total_time:45490
        REQUEST_PERSIST_INIT ( 6500) count:1 ave_time:100 total_time:100
        REQUEST_JOB_REQUEUE ( 5023) count:1 ave_time:238 total_time:238
        REQUEST_UPDATE_CRONTAB ( 2202) count:1 ave_time:78031 total_time:78031

Remote Procedure Call statistics by user
        root ( 0) count:1022375 ave_time:62838 total_time:64244737343
        telegraf ( 943) count:472883 ave_time:18671 total_time:8829303372
        eduardo.hirata ( 5484) count:109501 ave_time:9705 total_time:1062798840
        michaela.mueller ( 5498) count:83035 ave_time:15897 total_time:1320024617
        yue.yu ( 5570) count:65539 ave_time:6635 total_time:434904428
        soorya.pradeep ( 5334) count:58284 ave_time:6058 total_time:353136056
        ariana.peck ( 5572) count:54007 ave_time:9991 total_time:539601313
        ziwen.liu ( 5464) count:49845 ave_time:9773 total_time:487160078
        yttria.aniseia ( 5593) count:44610 ave_time:11092 total_time:494832934
        ilan.silva ( 5531) count:39870 ave_time:9810 total_time:391161126
        bradley.guilliams ( 5733) count:39866 ave_time:18764 total_time:748068232
        madhav.mantri ( 5598) count:33004 ave_time:13373 total_time:441370440
        appsmgr ( 5225) count:29824 ave_time:94763 total_time:2826236778
        vincent.turon-lagot ( 5508) count:27000 ave_time:5611 total_time:151498519
        ritwicq.arjyal ( 5509) count:25745 ave_time:8914 total_time:229494745
        siyu.he ( 5567) count:21929 ave_time:9585 total_time:210205104
        kevin.zhao ( 5648) count:18733 ave_time:70577 total_time:1322129782
        neal.soderquist ( 5608) count:18148 ave_time:7746 total_time:140577051
        andy.zhou ( 5510) count:17947 ave_time:7419 total_time:133165243
        jonathan.schwartz ( 5615) count:17881 ave_time:39902 total_time:713497795
        ryan.ward ( 5458) count:17328 ave_time:27041 total_time:468569262
        jaeyoon.lee ( 5523) count:16182 ave_time:22420 total_time:362810355
        padmini.deosthale ( 5538) count:15680 ave_time:6073 total_time:95240059
        doug.henze ( 5520) count:15189 ave_time:8997 total_time:136669702
        jared.nigg ( 5663) count:13996 ave_time:9035 total_time:126454527
        taylla.theodoro ( 5740) count:13696 ave_time:51190 total_time:701108467
        reza.paraan ( 5569) count:13181 ave_time:5494 total_time:72417739
        kithmini.herath ( 5707) count:11636 ave_time:56192 total_time:653857930
        utz.ermel ( 5614) count:10102 ave_time:37755 total_time:381404807
        carolina.arias ( 5397) count:9070 ave_time:12121 total_time:109941243
        eileen.wang ( 5463) count:8731 ave_time:22102 total_time:192976011
        jason.perera ( 5761) count:8267 ave_time:8700 total_time:71926559
        benjamin.iovino ( 5736) count:7960 ave_time:6553 total_time:52165384
        yang-joon.kim ( 5223) count:7750 ave_time:6906 total_time:53525915
        leah.dorman ( 5459) count:7506 ave_time:15224 total_time:114276500
        svc.seqbot ( 5534) count:6945 ave_time:8032 total_time:55787499
        ilakkiyan.jeyakumar ( 5293) count:6763 ave_time:14849 total_time:100424009
        seth.hinz ( 5716) count:6316 ave_time:10265 total_time:64834005
        william.law ( 5515) count:6255 ave_time:31893 total_time:199496714
        saugat.kandel ( 5637) count:6123 ave_time:33114 total_time:202761680
        paul.lebel ( 5096) count:6019 ave_time:16583 total_time:99817591
        jordao.bragantini ( 5196) count:5728 ave_time:35732 total_time:204674294
        ivan.ivanov ( 5011) count:5348 ave_time:3908 total_time:20905113
        akilandeswari.b ( 5527) count:5011 ave_time:10867 total_time:54456319
        luise.seeker ( 5604) count:4720 ave_time:7199 total_time:33983218
        mallak.ali ( 5713) count:4591 ave_time:7408 total_time:34010282
        merlin.lange ( 5089) count:4292 ave_time:15198 total_time:65233675
        eric.waltari ( 5044) count:3934 ave_time:3732 total_time:14684067
        sudip.khadka ( 5588) count:3855 ave_time:20219 total_time:77944873
        tiger.lao ( 5560) count:3763 ave_time:17386 total_time:65426775
        siddhansh.agarwal ( 5313) count:3401 ave_time:57219 total_time:194603592
        phil.smoot ( 5649) count:3294 ave_time:21810 total_time:71844545
        xiang.zhao ( 5291) count:2674 ave_time:20007 total_time:53499385
        daniel.serwas ( 5574) count:2393 ave_time:13323 total_time:31883514
        keir.balla ( 5341) count:1968 ave_time:8102 total_time:15945718
        julia.peukes ( 5724) count:1832 ave_time:7189 total_time:13171589
        amanda.seng ( 5477) count:1444 ave_time:4615 total_time:6665350
        teun.huijben ( 5702) count:1011 ave_time:76395 total_time:77235797
        gibraan.rahman ( 5622) count:1000 ave_time:10876 total_time:10876341
        alishba.imran ( 5712) count:778 ave_time:11861 total_time:9228580
        dihan.zheng ( 5778) count:589 ave_time:21955 total_time:12931511
        nicholas.bratvold ( 5742) count:498 ave_time:12762 total_time:6355538
        allyson.ryan ( 5741) count:452 ave_time:2161 total_time:976947
        craig.kapfer ( 5320) count:362 ave_time:15262 total_time:5524857
        rowan.cassius ( 5728) count:304 ave_time:6387 total_time:1941763
        kyle.awayan ( 5206) count:285 ave_time:5253 total_time:1497227
        kuo-feng.weng ( 5525) count:246 ave_time:9998 total_time:2459674
        john.pak ( 5067) count:160 ave_time:22770 total_time:3643246
        laina.hall ( 5709) count:127 ave_time:13353 total_time:1695944
        duo.peng ( 5314) count:122 ave_time:80168 total_time:9780611
        george.crowley ( 5519) count:122 ave_time:18238 total_time:2225113
        madhurya.sekhar ( 5645) count:73 ave_time:90174 total_time:6582755
        kyle.swanson ( 5777) count:66 ave_time:29173 total_time:1925454
        deepika.sundarraman ( 5494) count:51 ave_time:2339 total_time:119332
        carlos.gonzalez ( 5537) count:49 ave_time:4032 total_time:197585
        josh.elias ( 5068) count:44 ave_time:9452 total_time:415896
        shalin.mehta ( 5004) count:36 ave_time:4252 total_time:153074
        john.hanks ( 5220) count:35 ave_time:7162 total_time:250682
        zhouzerui.liu ( 5661) count:29 ave_time:59944 total_time:1738392
        van.phan ( 5336) count:29 ave_time:57138 total_time:1657023
        saransh.kaul ( 5050) count:25 ave_time:16315 total_time:407894
        dari.kimanius ( 5646) count:25 ave_time:48489 total_time:1212232
        haleigh.miller ( 5782) count:22 ave_time:36474 total_time:802431
        slurm ( 5224) count:19 ave_time:40 total_time:773
        kyle.harrington ( 5502) count:16 ave_time:14174 total_time:226794
        james.burgess ( 5592) count:14 ave_time:22933 total_time:321063
        sarah.lin ( 5113) count:8 ave_time:3028 total_time:24226
        talon.chandler ( 5333) count:6 ave_time:169612 total_time:1017677

Pending RPC statistics
        No pending RPCs
(In reply to griznog from comment #19)
> Finally got a moment to look back at this: we haven't had a crash since
> removing the job purge setting from our config. I don't think we are
> particularly concerned about having that disabled, at least not in the short
> term. sdiag output is below in case it's interesting. Do you still want me
> to collect some of these on an interval, or is there something else we
> should try? I'm content to leave it as is for the moment and try enabling
> job purges again at a later date, when there's more free time (yes, I'm
> fantasizing) to troubleshoot if the crashes return.

Thanks for sending the snapshot. If you're not getting the fatals due to ER_LOCK_WAIT_TIMEOUT with the present settings, that's good for now, but your job table is going to continue to grow, making future purges problematic.

A couple of things:

Can you reenable your job purges but increase the purge time beyond 12 months (maybe 24 months)? If the db server is keeping up and the purges are under control, you could slowly decrease it, a month or so at a time, back to 12 months.

How much RAM does your db server machine have? Can you increase innodb_buffer_pool_size at all? We recommend up to 50% of the machine's RAM; that may help a bit with performance.

https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build
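A sketch of what those two suggestions could look like in config form, assuming the 256G of RAM mentioned in comment 18 (the exact values are illustrative, not tuned recommendations):

# my.cnf ([mysqld]) -- raise the buffer pool toward ~50% of RAM; 128G is an
# example value for a 256G machine.
innodb_buffer_pool_size = 128G

# slurmdbd.conf -- re-enable job purging with a longer retention, then step it
# down toward 12months once purges are keeping up.
PurgeJobAfter=24months

Both changes require restarting the respective daemon (mariadb and slurmdbd) to take effect.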
(In reply to griznog from comment #19)
> Finally got a moment to look back at this: we haven't had a crash since
> removing the job purge setting from our config. I don't think we are
> particularly concerned about having that disabled, at least not in the short
> term.

Ok, closing for now. Feel free to reopen or start a new ticket if problems return.