Created attachment 19180 [details] Slurmctld and slurmdb logs Hi, Our slurmctld crashed with this error. I have attached the slurmctld and slurmdb logs. [2021-04-29T02:53:21.967] fatal: locks.c:125 lock_slurmctld: pthread_rwlock_wrlock(): Resource deadlock avoided
Jimmy, I think I found the root cause of the issue. I'm passing the patch to our QA queue now. Let me know if you want to apply it locally, before the review. Am I correct that the issue happened only once and doesn't affect cluster operations? cheers, Marcin
Hi, This also happened on our other cluster the day before, and also on 02/02/21. It affected operations because the controller never failed over to our backup node. [root@EUMASTER:/var/log ] $ grep lock_slurmctld slurmctld.log [2021-02-01T16:08:04.185] fatal: locks.c:125 lock_slurmctld: pthread_rwlock_wrlock(): Resource deadlock avoided [2021-04-28T06:44:02.302] fatal: locks.c:125 lock_slurmctld: pthread_rwlock_wrlock(): Resource deadlock avoided
Jimmy, The issue is fixed by 6b6b3879e97208a0[1], which was merged to the Slurm 20.11 branch and will be released in 20.11.7. This should be easy to backport to 20.02 as well; let me know if you need any help with it. In case of no reply, I'll close the bug as fixed. cheers, Marcin [1]https://github.com/SchedMD/slurm/commit/6b6b3879e97208a041c104df1ccf2574a60ecf27
Hi Marcin, Is a patch available for install? If so, can you provide details on applying the fix?
Jimmy, Yes, you can find the patch/commit under the link in comment 7. You need to apply it and rebuild Slurm. The patch is relevant only for slurmctld. cheers, Marcin
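One possible way to do this (a sketch, not an official SchedMD procedure: the install prefix and configure flags below are examples, and the `.patch` suffix is GitHub's standard way of serving a commit as a patch file):

```shell
# Hedged sketch: fetch the fix as a patch and apply it to a local
# Slurm source tree before rebuilding. Adjust the prefix and
# configure flags to match your site's existing build.
commit=6b6b3879e97208a041c104df1ccf2574a60ecf27
patch_url="https://github.com/SchedMD/slurm/commit/${commit}.patch"

apply_slurm_fix() {
    # Run from the top of your Slurm source tree.
    curl -LO "$patch_url"                  # GitHub serves any commit as a patch
    patch -p1 < "${commit}.patch"          # may need minor fuzz on 20.02
    ./configure --prefix=/usr/local/slurm  # example flags only
    make -j"$(nproc)" && make install
}
```

After installing, restart slurmctld on the controller (and backup controller); since the patch only touches slurmctld, compute nodes do not need the rebuilt binaries.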
Jimmy, Do you need help? Did you apply the patch, so you can confirm that it fixed the issue? In case of no reply, I'll close the bug as fixed. cheers, Marcin
I'm closing the bug report as fixed. Should you have any questions, please reopen. cheers, Marcin
*** Ticket 12731 has been marked as a duplicate of this ticket. ***