Ticket 13505

Summary: fatal: _mysql_query_internal: unable to resolve deadlock
Product: Slurm
Reporter: Marco Induni <marco.induni>
Component: Database
Assignee: Tim McMullan <mcmullan>
Status: RESOLVED CANNOTREPRODUCE
Severity: 3 - Medium Impact
Version: 20.11.8
Hardware: Linux
OS: Linux
Site: CSCS - Swiss National Supercomputing Centre
Attachments: Output of command: show engine innodb status
Active node slurmdbd log
Active node messages log
Backup standby messages log
slurmdbd configuration file
show variables output

Description Marco Induni 2022-02-24 02:54:38 MST
Dear support,
last night our slurmdbd crashed with the following message:

slurmdbd[1572]: fatal: _mysql_query_internal: unable to resolve deadlock with attempts 10/10: 1213 Deadlock found when trying to get lock; try restarting transaction
Please call 'show engine innodb status;' in MySQL/MariaDB and open a bug report with SchedMD.
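
(For reference, one way to capture the diagnostic the log asks for, using the mysql client; \G gives vertical output, and the output file name is only illustrative:)

# mysql -e "SHOW ENGINE INNODB STATUS\G" > innodb-status.log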

We have a configuration with two nodes, one active and the other on standby, but both went down.

Attached you will find the output of the command suggested in the log.

I've restarted both slurmdbd processes and the situation is back to normal, but we would appreciate your guidance on how to avoid such a problem in the future.


Thank you 
Kind regards,

Marco Induni


Here are the versions running:
# rpm -qa | grep -iE "maria|galera" | sort
galera-4-26.4.8-1.el7.centos.x86_64
MariaDB-client-10.4.19-1.el7.centos.x86_64
MariaDB-common-10.4.19-1.el7.centos.x86_64
MariaDB-compat-10.4.19-1.el7.centos.x86_64
MariaDB-devel-10.4.19-1.el7.centos.x86_64
MariaDB-server-10.4.19-1.el7.centos.x86_64
MariaDB-shared-10.4.19-1.el7.centos.x86_64
Comment 1 Marco Induni 2022-02-24 02:56:12 MST
Created attachment 23612 [details]
Output of command: show engine innodb status
Comment 2 Tim McMullan 2022-02-24 12:35:51 MST
Would you also attach the slurmdbd log from around the time of the fatal (assuming there is anything there) as well as your slurmdbd.conf (but please redact the database password)?
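
(For anyone following along: in slurmdbd.conf the password is the StoragePass line, so the redacted copy would look something like this:)

StoragePass=REDACTED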

Thanks!
--Tim
Comment 3 Marco Induni 2022-02-25 02:39:57 MST
Created attachment 23627 [details]
Active node slurmdbd log
Comment 4 Marco Induni 2022-02-25 02:40:21 MST
Created attachment 23629 [details]
Active node messages log
Comment 5 Marco Induni 2022-02-25 02:40:54 MST
Created attachment 23630 [details]
Backup standby messages log
Comment 6 Marco Induni 2022-02-25 02:41:37 MST
Hi Tim, as requested, attached you will find the logs and configuration.

Kind regards.
Marco
Comment 7 Marco Induni 2022-02-25 02:41:57 MST
Created attachment 23631 [details]
slurmdbd configuration file
Comment 9 Tim McMullan 2022-02-25 07:31:46 MST
Thank you for the additional logging. Unfortunately, what we want to look for in the 'show engine innodb status' output isn't present, so it's less clear what is going on here.

Would you also mind attaching the output of "SHOW VARIABLES;" from the database?

Thanks!
--Tim
Comment 10 Marco Induni 2022-02-25 08:26:58 MST
Created attachment 23635 [details]
show variables output

Hi Tim,
attached is the output of:
mysql --table -e "show variables" > mysql-show-variables.log

Best regards,
Marco
Comment 11 Tim McMullan 2022-03-01 07:30:30 MST
Thanks for this output Marco,

I've been looking through the logs and it's still not conclusive what caused the hang, but it looks like an archive/purge run was in progress around that time, and it seems to be fairly slow. Those operations can hold locks for a while and *might* be related. There are some improvements coming in 22.05 that should help speed these up.
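
(For context, archive/purge behavior is driven by slurmdbd.conf; a minimal sketch with illustrative retention values, not a recommendation for this site:)

ArchiveDir=/var/spool/slurmdbd/archive   # hypothetical path
ArchiveJobs=yes
ArchiveSteps=yes
PurgeJobAfter=12months
PurgeStepAfter=6months
PurgeEventAfter=12months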

The first thing I would try here is increasing MariaDB's deadlock detection timer; it's currently at the default of 50 seconds. Would you be able to change it to 100 seconds? The value is in microseconds, so the config line would be something like deadlock_timeout_long=100000000.
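
(A minimal sketch of where that line would go, assuming the server reads a [mysqld] section from /etc/my.cnf or a file under /etc/my.cnf.d/; the exact path depends on the installation:)

[mysqld]
# MariaDB long deadlock search timeout, in microseconds (default 50000000 = 50 s)
deadlock_timeout_long=100000000

(The setting takes effect after a server restart; on versions where the variable is dynamic, it can also be set at runtime with SET GLOBAL.)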

Thanks!
--Tim
Comment 12 Marco Induni 2022-03-10 03:06:22 MST
Dear Tim,

as agreed, I've updated the deadlock timeout to deadlock_timeout_long=100000000.

Since the event happened just once, I think we can close this ticket for the moment; I will reopen it or create a new one if the same problem hits the system again.


Thank you for the support and all the best.

Marco Induni
Comment 13 Tim McMullan 2022-03-10 06:42:10 MST
(In reply to Marco Induni from comment #12)
> Dear Tim,
> 
> as agreed, I've updated the deadlock timeout to deadlock_timeout_long=100000000.
> 
> Since the event happened just once, I think we can close this ticket for the
> moment; I will reopen it or create a new one if the same problem hits the
> system again.
> 
> Thank you for the support and all the best.
> 
> Marco Induni

Thanks for the update, Marco. Please let us know if it happens again!
--Tim