Ticket 12435

Summary: A possible deadlock bug in the function PMI_KVS_Commit
Product: Slurm Reporter: Ryan <ryancaicse>
Component: slurmdbd Assignee: Danny Auble <da>
Status: RESOLVED FIXED QA Contact:
Severity: C - Contributions    
Priority: ---    
Version: 21.08.0   
Hardware: All   
OS: All   
Site: -Other- Slinky Site: ---
Version Fixed: 21.08.4 22.05.0pre1
Attachments: The patch

Description Ryan 2021-09-06 02:53:27 MDT
Hi developers, thank you for looking into this. The lock kvs_mutex may not be released on the path taken when !kvs_set.kvs_comm_ptr[kvs_set.kvs_comm_recs] evaluates to true (line 1290). The relevant code is linked below. This could lead to a deadlock if PMI_KVS_Commit is called again, or if another function tries to acquire the same lock.


https://github.com/SchedMD/slurm/blob/4801c60da4784346c4bf830a0b198364012e44ee/contribs/pmi/pmi.c#L1265-L1292
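
The pattern described above can be sketched in isolation. This is a minimal illustration, not the Slurm source and not the patch from attachment 21262: demo_mutex, commit_leaky, commit_fixed, demo_mutex_is_held and the simulate_alloc_failure parameter are all hypothetical names invented for the example. It assumes a default (non-recursive) POSIX mutex, and it shows one common way to fix the problem, which is to route every exit through a single unlock.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a module-level lock such as kvs_mutex. */
static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Buggy shape: the failure path returns while demo_mutex is still held. */
int commit_leaky(int simulate_alloc_failure)
{
	void *rec;

	pthread_mutex_lock(&demo_mutex);
	rec = simulate_alloc_failure ? NULL : malloc(16);
	if (!rec)
		return -1;	/* BUG: demo_mutex is never unlocked */
	free(rec);
	pthread_mutex_unlock(&demo_mutex);
	return 0;
}

/* Fixed shape: every exit after the lock goes through the unlock. */
int commit_fixed(int simulate_alloc_failure)
{
	int rc = 0;
	void *rec;

	pthread_mutex_lock(&demo_mutex);
	rec = simulate_alloc_failure ? NULL : malloc(16);
	if (!rec) {
		rc = -1;
		goto unlock;
	}
	free(rec);
unlock:
	pthread_mutex_unlock(&demo_mutex);
	return rc;
}

/* Returns 1 if demo_mutex is currently locked, 0 if it is free. */
int demo_mutex_is_held(void)
{
	int rc = pthread_mutex_trylock(&demo_mutex);

	if (rc == 0) {
		/* We got the lock, so it was free; release it again. */
		pthread_mutex_unlock(&demo_mutex);
		return 0;
	}
	return rc == EBUSY;
}
```

After commit_fixed(1) fails, demo_mutex_is_held() reports 0. After commit_leaky(1) fails, it reports 1, and a subsequent pthread_mutex_lock on demo_mutex from the same thread would block forever, which is the deadlock scenario reported here.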
Comment 1 Ryan 2021-09-14 05:08:54 MDT
Created attachment 21262
The patch
Comment 2 Ryan 2021-11-03 02:27:24 MDT
Hi Tim, could you please take a look at my patch for this issue?
Comment 4 Danny Auble 2021-11-03 14:28:15 MDT
Comment on attachment 21262
The patch

Thanks Ryan, this is now in 21.08.4+ commit 1d7e69bf72.

Thanks for finding and fixing this. It was easy to see the mistake. Even though this fixes the issue, I would strongly suggest using PMI v2 :).
Comment 5 Danny Auble 2021-11-03 14:28:55 MDT
Please reopen if anything else is needed on this.

Thanks again for the patch and sorry it took so long to get to.
Comment 6 Ryan 2021-11-03 20:45:50 MDT
Thanks!