Ticket 5103

Summary: Remove unsafe use of pthread_cancel with PTHREAD_CANCEL_ASYNCHRONOUS
Product: Slurm Reporter: Tim Wickberg <tim>
Component: OtherAssignee: Tim Wickberg <tim>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: ab2080, ahkumar, amcdonough, asa188, brian.gilmer, bschwark, chris, damien.francois, dmjacobsen, jfbotts, kaizaad, marshall, naveed, pablo.llopis, plazonic, regine.gaudin, ryan_cox, slurm-support, wfeinstein
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4281
https://bugs.schedmd.com/show_bug.cgi?id=5119
https://bugs.schedmd.com/show_bug.cgi?id=16063
Site: SchedMD Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 17.11.6 18.08.0-pre2 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: change _watch_tasks to avoid use of pthread_cancel()
mpi/pmix log entries

Description Tim Wickberg 2018-04-26 10:55:33 MDT
There are a number of locations where, to speed up process shutdown, threads have been marked as PTHREAD_CANCEL_ASYNCHRONOUS.

However, this is only safe when using async-cancel-safe functions (of which there are a total of three in libc). Use of malloc() can lead to deadlock.

Even with the thread set as PTHREAD_CANCEL_DEFERRED, use of any libc functions marked as cancellation points (see pthreads(7)) within a pthread_mutex_lock will mean that lock is not released, and can lead to deadlock.

Currently, this is leading to issues with the extern slurmstepd. Other locations appear to be significantly less likely, at least in part as the slurmstepd processes are started and stopped orders of magnitude more frequently than slurmctld/slurmdbd, but should be addressed as well.
Comment 1 Tim Wickberg 2018-04-26 11:01:07 MDT
*** Ticket 4733 has been marked as a duplicate of this ticket. ***
Comment 2 Tim Wickberg 2018-04-26 11:02:53 MDT
*** Ticket 4810 has been marked as a duplicate of this ticket. ***
Comment 3 Tim Wickberg 2018-04-26 11:05:26 MDT
*** Ticket 4690 has been marked as a duplicate of this ticket. ***
Comment 5 Tim Wickberg 2018-04-27 23:20:10 MDT
Created attachment 6718 [details]
change _watch_tasks to avoid use of pthread_cancel()

Hi folks -

This is a first pass at removing what we currently believe are the race condition triggering the extern (or normal user) step deadlocks.

If you're able to test this I would certainly appreciate any feedback, positive or negative.

There are still similar issues in the IPMI and PMIx plugins, and we will be working to clear those up at some point, although those seem to be much rarer in practice.

- Tim
Comment 8 Brian F Gilmer 2018-04-28 14:48:09 MDT
Hi Tim,

We checked the logs today and we saw 2500 instances of slurmstepd spinning in mpi/pmix code.  We did not see any of the other slurmstepd deadlock.  I had the site save off a backtrace and excerpts from the slurmd log.  When the site has processed them I will attach them to the bug.
Comment 9 Brian F Gilmer 2018-04-28 14:52:29 MDT
Created attachment 6722 [details]
mpi/pmix log entries

Log entries for the slurmstepd in mpi/pmix
Comment 16 Tim Wickberg 2018-04-30 16:40:21 MDT
Just an update - 

We did land a series of commits similar to attachment 6718 [details]  that should address this in most instances, and will be in 17.11.6 when released. We do still have some work left on the PMIx plugin; if you don't use that plugin we believe those patches are sufficient to prevent this from happening.

We'll attach the PMIx cleanup here as well when available. If you're not in a rush, I would suggest waiting until 17.11.6 is released (should be within a week). Or you can use that patch in the meantime, or cherry-pick commits 
1675ada0a, a7c8964e, and 3be9e1ee0 from git.

- Tim
Comment 17 Christopher Samuel 2018-04-30 21:07:29 MDT
On 1/5/18 8:40 am, bugs@schedmd.com wrote:

> Just an update -

Thanks Tim, we'll wait for 17.11.6 to land to test, we only see this
currently when one of my colleagues runs a series of test jobs (so far,
fingers crossed!).

All the best,
Chris
Comment 18 Tim Wickberg 2018-05-01 17:33:27 MDT
*** Ticket 5121 has been marked as a duplicate of this ticket. ***
Comment 20 Tim Wickberg 2018-05-03 00:12:13 MDT
I ended up splitting the PMIx portion of this off on commit e5f03971b / bug 5119, and that's available if you'd like to apply that. We expect to release 17.11.6 soon with all of these fixes rolled in shortly, and I am going ahead and tagging this as resolved.

If you're still noticing issues after that, please file a separate issue to discuss that further.

Thanks for you patience on all of this, the underlying bug proved quite difficult to isolate which unfortunately delayed this quite a bit longer than we'd like.

- Tim
Comment 21 Dominik Bartkiewicz 2018-05-03 13:18:31 MDT
*** Ticket 5111 has been marked as a duplicate of this ticket. ***
Comment 22 Marshall Garey 2018-05-16 09:23:30 MDT
*** Ticket 5177 has been marked as a duplicate of this ticket. ***
Comment 23 Alejandro Sanchez 2018-07-02 04:32:07 MDT
*** Ticket 5320 has been marked as a duplicate of this ticket. ***
Comment 25 Marshall Garey 2018-09-04 11:17:24 MDT
*** Ticket 5545 has been marked as a duplicate of this ticket. ***
Comment 26 Regine Gaudin 2019-07-29 03:38:34 MDT
Hi

CEA is falling in this bug encountred in 17.11.6 and fixed in 17.11.7 
+ -- Fix slurmstepd deadlock in stepd cleanup caused by race condition in
+    the jobacct_gather fini() interfaces introduced in 17.11.6.

Would it be possible to have access to the patch ?
Thanks
Regine Gaudin
Comment 27 Naveed Near-Ansari 2019-07-29 03:38:51 MDT
I will we be out of the officeUntil August 5th.  If you need assistance, please email help-hpc@caltech.edu.  There will be others around that can deal with anything that may come up.
Comment 28 Christopher Samuel 2019-07-29 09:48:02 MDT
(In reply to Regine Gaudin from comment #26)

> CEA is falling in this bug encountred in 17.11.6 and fixed in 17.11.7 
> + -- Fix slurmstepd deadlock in stepd cleanup caused by race condition in
> +    the jobacct_gather fini() interfaces introduced in 17.11.6.
> 
> Would it be possible to have access to the patch ?

Answering as someone who hit this bug when I was in Australia and got your reply to this resolved bugzilla entry.

17.11.9 was the last version of 17.11 released (that branch is now obsolete now 19.05 is out):

https://www.schedmd.com/archives.php

All the best,
Chris
Comment 29 Jason Booth 2019-07-29 10:19:29 MDT
Hi Regine and Chris -

 The last version of 17.11 was 17.11.13-2 and as Chris has mentioned the 17.11 release is no longer supported. 

 In regards to your request for patches, there were multiple issues and a series of commits that fixed the issue, so it would be advisable to upgrade at this point in time.

-Jason