Ticket 13505

Summary: fatal: _mysql_query_internal: unable to resolve deadlock
Product: Slurm
Reporter: Marco Induni <marco.induni>
Component: Database
Assignee: Tim McMullan <mcmullan>
Status: RESOLVED CANNOTREPRODUCE
Severity: 3 - Medium Impact
Version: 20.11.8
Hardware: Linux
OS: Linux
Site: CSCS - Swiss National Supercomputing Centre
Attachments: Output of command: show engine innodb status
Active node slurmdbd log
Active node messages log
Backup standby messages log
slurmdbd configuration file
show variables output

Description Marco Induni 2022-02-24 02:54:38 MST
Dear support,
last night our slurmdbd crashed with the following message:

slurmdbd[1572]: fatal: _mysql_query_internal: unable to resolve deadlock with attempts 10/10: 1213 Deadlock found when trying to get lock; try restarting transaction
Please call 'show engine innodb status;' in MySQL/MariaDB and open a bug report with SchedMD.
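
(For reference, one way to capture the diagnostic the log asks for, using the mysql client; \G gives vertical output, and the output file name is only illustrative:)

# mysql -e "SHOW ENGINE INNODB STATUS\G" > innodb-status.log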

We have a configuration with two nodes, one active and the other on standby, but both went down.

Attached you will find the output of the command suggested in the log.

I've restarted both slurmdbd processes and the situation is back to normal, but we would appreciate your guidance on how to avoid such a problem in the future.


Thank you 
Kind regards,

Marco Induni


Here are the versions running:
# rpm -qa | grep -iE "maria|galera" | sort
galera-4-26.4.8-1.el7.centos.x86_64
MariaDB-client-10.4.19-1.el7.centos.x86_64
MariaDB-common-10.4.19-1.el7.centos.x86_64
MariaDB-compat-10.4.19-1.el7.centos.x86_64
MariaDB-devel-10.4.19-1.el7.centos.x86_64
MariaDB-server-10.4.19-1.el7.centos.x86_64
MariaDB-shared-10.4.19-1.el7.centos.x86_64
Comment 1 Marco Induni 2022-02-24 02:56:12 MST
Created attachment 23612 [details]
Output of command: show engine innodb status
Comment 2 Tim McMullan 2022-02-24 12:35:51 MST
Would you also attach the slurmdbd log from around the time of the fatal (assuming there is anything there) as well as your slurmdbd.conf (but please redact the database password)?
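
(For anyone following along: in slurmdbd.conf the password is the StoragePass line, so the redacted copy would look something like this:)

StoragePass=REDACTED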

Thanks!
--Tim
Comment 3 Marco Induni 2022-02-25 02:39:57 MST
Created attachment 23627 [details]
Active node slurmdbd log
Comment 4 Marco Induni 2022-02-25 02:40:21 MST
Created attachment 23629 [details]
Active node messages log
Comment 5 Marco Induni 2022-02-25 02:40:54 MST
Created attachment 23630 [details]
Backup standby messages log
Comment 6 Marco Induni 2022-02-25 02:41:37 MST
Hi Tim, as requested, attached you will find the logs and configuration.

Kind regards.
Marco
Comment 7 Marco Induni 2022-02-25 02:41:57 MST
Created attachment 23631 [details]
slurmdbd configuration file
Comment 9 Tim McMullan 2022-02-25 07:31:46 MST
Thank you for the additional logging. Unfortunately, what we want to look for in the 'show engine innodb status' output isn't present, so it's less clear what is going on here.

Would you also mind attaching the output of "SHOW VARIABLES;" from the database?

Thanks!
--Tim
Comment 10 Marco Induni 2022-02-25 08:26:58 MST
Created attachment 23635 [details]
show variables output

Hi Tim,
attached is the output of:
mysql --table -e "show variables" > mysql-show-variables.log

Best regards,
Marco
Comment 11 Tim McMullan 2022-03-01 07:30:30 MST
Thanks for this output Marco,

I've been looking through the logs and it's still not conclusive what caused the hang, but it looks like an archive/purge run was in progress around that time, and it seems to be fairly slow. Those operations can hold locks for a while and *might* be related. There are some improvements coming in 22.05 that should help speed these up.
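
(For context, archive/purge behavior is driven by slurmdbd.conf; a minimal sketch with illustrative retention values, not a recommendation for this site:)

ArchiveDir=/var/spool/slurmdbd/archive   # hypothetical path
ArchiveJobs=yes
ArchiveSteps=yes
PurgeJobAfter=12months
PurgeStepAfter=6months
PurgeEventAfter=12months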

The first thing I would try here is increasing MariaDB's deadlock detection timer; it's currently at the default of 50 seconds. Would you be able to change it to 100 seconds? The value is in microseconds, so the config line would be something like deadlock_timeout_long=100000000.
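
(A minimal sketch of where that line would go, assuming the server reads a [mysqld] section from /etc/my.cnf or a file under /etc/my.cnf.d/; the exact path depends on the installation:)

[mysqld]
# MariaDB long deadlock search timeout, in microseconds (default 50000000 = 50 s)
deadlock_timeout_long=100000000

(The setting takes effect after a server restart; on versions where the variable is dynamic, it can also be set at runtime with SET GLOBAL.)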

Thanks!
--Tim
Comment 12 Marco Induni 2022-03-10 03:06:22 MST
Dear Tim,

as agreed, I've updated the deadlock timeout to deadlock_timeout_long=100000000.

Since the event happened just once, I think we can close this ticket for the moment; I will reopen it or create a new one if the same problem hits the system again.


Thank you for the support and all the best.

Marco Induni
Comment 13 Tim McMullan 2022-03-10 06:42:10 MST
(In reply to Marco Induni from comment #12)
> Dear Tim,
> 
> as agreed, I've updated the deadlock timeout to deadlock_timeout_long=100000000.
> 
> Since the event happened just once, I think we can close this ticket for the
> moment; I will reopen it or create a new one if the same problem hits the
> system again.
> 
> Thank you for the support and all the best.
> 
> Marco Induni

Thanks for the update, Marco. Please let us know if it happens again!
--Tim