Summary: | Remove use of pthread_cancel in PMIx plugin | ||
---|---|---|---|
Product: | Slurm | Reporter: | Tim Wickberg <tim> |
Component: | PMIx | Assignee: | Tim Wickberg <tim> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | artpol84 |
Version: | 17.11.5 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5103 | ||
Site: | SchedMD | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | 17.11.6 18.08.0-pre2 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: | remove pthread_cancel in pmix plugin |
Hi, Tim. Patch looks good, thank you! This looks like a cleaner solution indeed. And it is true that it was modelled after PMI2. Out of curiosity, given that we ensure that thread is exited using fds and flags, how do you think it would affect? > Out of curiosity, given that we ensure that thread is exited using fds and
> flags, how do you think it would affect?
I believe the overall behavior when it comes time to shut everything down is the same as before, but without the race condition that can lead to a stuck lock.
Thanks for taking a look through it, this should be in the 17.11.6 release when out.
- Tim
That’s what I’m asking. Once we identified that thread sat exit flag there should be no mallocs. So I wouldn’t envision any locking issues. So I’d like to see if I’m missing something. The race is pretty subtle, but there are some recent changes to glibc that seem to make it much more likely to trigger. Any time you use pthread_cancel on a thread that's been marked as PTHREAD_CANCEL_ASYNCHRONOUS, there are three libc functions you're allowed to be in the middle of. Anything else leads to undefined behavior - what we're usually seeing is that we can cancel the thread while it's busy in a malloc routine holding certain internal locks. When that happens, nothing ever releases those locks, and other threads will eventually be blocked trying to take control of them. The pthreads man page page discusses this a bit better - the term you're looking for is "async-cancel-safe". So - we're working to strip out all PTHREAD_CANCEL_ASYNCHRONOUS threads; this patch (which I'll commit shortly) is handling that inside our pmix plugin. - Tim Artem - thanks for taking a look at the patch. This is now in commit e5f03971b, and will be in 17.11.6 when released. - Tim |
Created attachment 6742 [details] remove pthread_cancel in pmix plugin Hey Artem - Do you mind taking a look at the attached patch? We've noticed some issues related to the use of pthread_cancel elsewhere in the code, and it looks like this affects the PMIx plugin as well. What happens is the thread can get cancelled while it holds one of the malloc internal locks, which then deadlocks the rest of the slurmstepd threads and keeps the step from completing correctly. I believe this patch removes it but maintains all the current functionality - it looks like some of this was modeled after PMI2 which will get some similar cleanup as well. thanks, - Tim