Ticket 11480

Summary: Slurmctld crashes fatal: locks.c:125 lock_slurmctld: pthread_rwlock_wrlock(): Resource deadlock avoided
Product: Slurm
Reporter: Jimmy Hui <jhui>
Component: slurmctld
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED
Severity: 2 - High Impact
CC: cinek, hiroshi.kobayashi
Version: 20.02.5
Hardware: Linux
OS: Linux
Site: Roche/PHCIX
Version Fixed: 20.11.7 21.08pre1
Attachments: Slurmctld and slurmdb logs

Description Jimmy Hui 2021-04-28 23:34:25 MDT
Created attachment 19180
Slurmctld and slurmdb logs

Hi,

Our slurmctld crashed with this error. I have attached the slurmctld and slurmdb logs.

[2021-04-29T02:53:21.967] fatal: locks.c:125 lock_slurmctld: pthread_rwlock_wrlock(): Resource deadlock avoided
Comment 3 Marcin Stolarek 2021-04-29 05:24:13 MDT
Jimmy, 

I think I found the root cause of the issue. I'm passing the patch to our QA queue now. Let me know if you want to apply it locally, before the review. Am I correct that the issue happened only once and doesn't affect cluster operations?

cheers,
Marcin
Comment 4 Jimmy Hui 2021-04-29 09:12:06 MDT
Hi,

This also happened on our other cluster the day before, and also on 02/02/21. It affected operations because slurmctld never failed over to our backup node.

[root@EUMASTER:/var/log ] $ grep lock_slurmctld slurmctld.log
[2021-02-01T16:08:04.185] fatal: locks.c:125 lock_slurmctld: pthread_rwlock_wrlock(): Resource deadlock avoided
[2021-04-28T06:44:02.302] fatal: locks.c:125 lock_slurmctld: pthread_rwlock_wrlock(): Resource deadlock avoided
Comment 7 Marcin Stolarek 2021-04-29 23:13:53 MDT
Jimmy, 

The issue is fixed by 6b6b3879e97208a0[1], which was merged into the Slurm 20.11 branch and will be released in 20.11.7.

This should be easy to backport to 20.02 as well, let me know if you need any help with it.

In case of no reply, I'll close the bug as fixed.

cheers,
Marcin


[1] https://github.com/SchedMD/slurm/commit/6b6b3879e97208a041c104df1ccf2574a60ecf27
Comment 8 Jimmy Hui 2021-04-30 10:12:22 MDT
Hi Marcin,

Is a patch available to install? If so, can you provide details on applying the fix?
Comment 9 Marcin Stolarek 2021-04-30 10:26:48 MDT
Jimmy,

Yes, you can find the patch/commit under the link in comment 7.

You need to apply it and rebuild Slurm. The patch is relevant only for slurmctld.

cheers,
Marcin
Comment 10 Marcin Stolarek 2021-05-03 10:51:36 MDT
Jimmy,

Do you need any help? Have you applied the patch, and can you confirm that it fixed the issue?

In case of no reply, I'll close the bug as fixed.

cheers,
Marcin
Comment 11 Marcin Stolarek 2021-05-07 01:16:58 MDT
I'm closing the bug report as fixed. Should you have any questions, please reopen it.

cheers,
Marcin
Comment 12 Jason Booth 2021-10-25 11:43:34 MDT
*** Ticket 12731 has been marked as a duplicate of this ticket. ***