| Summary: | Slurmctld crashes fatal: locks.c:125 lock_slurmctld: pthread_rwlock_wrlock(): Resource deadlock avoided | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jimmy Hui <jhui> |
| Component: | slurmctld | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | cinek, hiroshi.kobayashi |
| Version: | 20.02.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Roche/PHCIX | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 20.11.7 21.08pre1 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | Slurmctld and slurmdb logs | ||
Jimmy,

I think I found the root cause of the issue. I'm passing the patch to our QA queue now. Let me know if you want to apply it locally, before the review.

Am I correct that the issue happened only once and doesn't affect cluster operations?

cheers,
Marcin

Hi,

This also happened in our other cluster the day before and also on 02/02/21. This affected operations because it never failed over to our backup node.

    [root@EUMASTER:/var/log ] $ grep lock_slurmctld slurmctld.log
    [2021-02-01T16:08:04.185] fatal: locks.c:125 lock_slurmctld: pthread_rwlock_wrlock(): Resource deadlock avoided
    [2021-04-28T06:44:02.302] fatal: locks.c:125 lock_slurmctld: pthread_rwlock_wrlock(): Resource deadlock avoided

Jimmy,

The issue is fixed by 6b6b3879e97208a0[1], which was merged to the Slurm 20.11 branch and will be released in 20.11.7. This should be easy to backport to 20.02 as well; let me know if you need any help with it.

In case of no reply, I'll close the bug as fixed.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/6b6b3879e97208a041c104df1ccf2574a60ecf27

Hi Marcin,

Is a patch available to install? If there is a patch, can you provide details on applying the fix?

Jimmy,

Yes, you can find the patch/commit under the link in comment 7. You need to apply it and rebuild Slurm. The patch is relevant only for slurmctld.

cheers,
Marcin

Jimmy,

Do you need help? Did you apply the patch, so you can confirm that it fixed the issue?

In case of no reply, I'll close the bug as fixed.

cheers,
Marcin

I'm closing the bug report as fixed. Should you have any questions, please reopen.

cheers,
Marcin

*** Ticket 12731 has been marked as a duplicate of this ticket. ***
Created attachment 19180 [details]
Slurmctld and slurmdb logs

Hi,

Our slurmctld crashed with this error. I have attached the slurmctld and slurmdb logs.

    [2021-04-29T02:53:21.967] fatal: locks.c:125 lock_slurmctld: pthread_rwlock_wrlock(): Resource deadlock avoided