Ticket 5119

Summary: Remove use of pthread_cancel in PMIx plugin
Product: Slurm Reporter: Tim Wickberg <tim>
Component: PMIxAssignee: Tim Wickberg <tim>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: artpol84
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5103
Site: SchedMD Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 17.11.6 18.08.0-pre2 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: remove pthread_cancel in pmix plugin

Description Tim Wickberg 2018-05-01 13:26:11 MDT
Created attachment 6742 [details]
remove pthread_cancel in pmix plugin

Hey Artem -

Do you mind taking a look at the attached patch? 

We've noticed some issues related to the use of pthread_cancel elsewhere in the code, and it looks like this affects the PMIx plugin as well.

What happens is the thread can get cancelled while it holds one of the malloc internal locks, which then deadlocks the rest of the slurmstepd threads and keeps the step from completing correctly.

I believe this patch removes it but maintains all the current functionality - it looks like some of this was modeled after PMI2 which will get some similar cleanup as well.

thanks,
- Tim
Comment 1 Artem Polyakov 2018-05-01 14:10:58 MDT
Hi, Tim.

Patch looks good, thank you! This looks like a cleaner solution indeed. And it is true that it was modelled after PMI2.
Out of curiosity, given that we ensure that thread is exited using fds and flags, how do you think it would affect?
Comment 2 Tim Wickberg 2018-05-01 18:31:18 MDT
> Out of curiosity, given that we ensure that thread is exited using fds and
> flags, how do you think it would affect?

I believe the overall behavior when it comes time to shut everything down is the same as before, but without the race condition that can lead to a stuck lock.

Thanks for taking a look through it, this should be in the 17.11.6 release when out.

- Tim
Comment 3 Artem Polyakov 2018-05-01 18:33:54 MDT
That’s what I’m asking. Once we identified that thread sat exit flag there should be no mallocs. So I wouldn’t envision any locking issues. So I’d like to see if I’m missing something.
Comment 4 Tim Wickberg 2018-05-02 10:32:46 MDT
The race is pretty subtle, but there are some recent changes to glibc that seem to make it much more likely to trigger.

Any time you use pthread_cancel on a thread that's been marked as PTHREAD_CANCEL_ASYNCHRONOUS, there are three libc functions you're allowed to be in the middle of. Anything else leads to undefined behavior - what we're usually seeing is that we can cancel the thread while it's busy in a malloc routine holding certain internal locks.

When that happens, nothing ever releases those locks, and other threads will eventually be blocked trying to take control of them.

The pthreads man page page discusses this a bit better - the term you're looking for is "async-cancel-safe".

So - we're working to strip out all PTHREAD_CANCEL_ASYNCHRONOUS threads; this patch (which I'll commit shortly) is handling that inside our pmix plugin.

- Tim
Comment 5 Tim Wickberg 2018-05-02 10:41:34 MDT
Artem - thanks for taking a look at the patch.

This is now in commit e5f03971b, and will be in 17.11.6 when released.

- Tim