Summary: | Remove unsafe use of pthread_cancel with PTHREAD_CANCEL_ASYNCHRONOUS | ||
---|---|---|---|
Product: | Slurm | Reporter: | Tim Wickberg <tim> |
Component: | Other | Assignee: | Tim Wickberg <tim> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | ab2080, ahkumar, amcdonough, asa188, brian.gilmer, bschwark, chris, damien.francois, dmjacobsen, jfbotts, kaizaad, marshall, naveed, pablo.llopis, plazonic, regine.gaudin, ryan_cox, slurm-support, wfeinstein |
Version: | 17.11.5 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: |
https://bugs.schedmd.com/show_bug.cgi?id=4281 https://bugs.schedmd.com/show_bug.cgi?id=5119 https://bugs.schedmd.com/show_bug.cgi?id=16063 |
||
Site: | SchedMD | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | 17.11.6 18.08.0-pre2 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: |
change _watch_tasks to avoid use of pthread_cancel()
mpi/pmix log entries |
Description
Tim Wickberg
2018-04-26 10:55:33 MDT
*** Ticket 4733 has been marked as a duplicate of this ticket. ***
*** Ticket 4810 has been marked as a duplicate of this ticket. ***
*** Ticket 4690 has been marked as a duplicate of this ticket. ***
Created attachment 6718 [details]
change _watch_tasks to avoid use of pthread_cancel()
Hi folks -
This is a first pass at removing what we currently believe is the race condition triggering the extern (or normal user) step deadlocks.
If you're able to test this I would certainly appreciate any feedback, positive or negative.
There are still similar issues in the IPMI and PMIx plugins, and we will be working to clear those up at some point, although those seem to be much rarer in practice.
- Tim
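
For readers unfamiliar with the problem: with PTHREAD_CANCEL_ASYNCHRONOUS, pthread_cancel() can stop the target thread at an arbitrary instruction, including while it holds a lock inside a library call, and any other thread that later needs that lock can hang. The sketch below shows one common alternative, assuming a watcher thread that polls on an interval: the owner sets a flag under a mutex, signals a condition variable, and joins the thread. This is not the attached patch; `watcher_t`, `watcher_main`, `watcher_start`, `watcher_stop`, and the one-second interval are hypothetical names and values chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* All names here are hypothetical; none are taken from the Slurm source. */
typedef struct {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	pthread_t tid;
	bool shutdown;	/* set under lock to ask the thread to exit */
	int polls;	/* completed polling passes (placeholder for real work) */
} watcher_t;

static void *watcher_main(void *arg)
{
	watcher_t *w = arg;

	pthread_mutex_lock(&w->lock);
	while (!w->shutdown) {
		struct timespec deadline;

		clock_gettime(CLOCK_REALTIME, &deadline);
		deadline.tv_sec += 1;	/* assumed polling interval */
		/* Wakes at the deadline, or earlier when watcher_stop() signals. */
		pthread_cond_timedwait(&w->cond, &w->lock, &deadline);
		if (!w->shutdown)
			w->polls++;	/* periodic work would go here */
	}
	pthread_mutex_unlock(&w->lock);
	return NULL;	/* the thread only ever exits here, with no lock held */
}

/* Initialize the state and start the watcher thread. Returns 0 on success. */
int watcher_start(watcher_t *w)
{
	int rc;

	assert(w != NULL);
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->cond, NULL);
	w->shutdown = false;
	w->polls = 0;
	rc = pthread_create(&w->tid, NULL, watcher_main, w);
	if (rc != 0) {
		pthread_cond_destroy(&w->cond);
		pthread_mutex_destroy(&w->lock);
	}
	return rc;
}

/* Ask the watcher to exit, wait for it, then release its resources. */
int watcher_stop(watcher_t *w)
{
	int rc;

	assert(w != NULL);
	pthread_mutex_lock(&w->lock);
	w->shutdown = true;
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->lock);

	rc = pthread_join(w->tid, NULL);	/* no pthread_cancel() needed */
	pthread_cond_destroy(&w->cond);
	pthread_mutex_destroy(&w->lock);
	return rc;
}
```

A caller would pair `watcher_start(&w)` with `watcher_stop(&w)` at teardown. Because the flag is only read and written under the mutex, a stop request issued before the thread first takes the lock is still seen, and the timed wait means shutdown does not have to wait out the full polling interval.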
Hi Tim,
We checked the logs today and saw 2500 instances of slurmstepd spinning in mpi/pmix code. We did not see any of the other slurmstepd deadlocks. I had the site save off a backtrace and excerpts from the slurmd log. When the site has processed them I will attach them to the bug.
Created attachment 6722 [details]
mpi/pmix log entries
Log entries for the slurmstepd in mpi/pmix
Just an update -
We did land a series of commits similar to attachment 6718 [details] that should address this in most instances, and will be in 17.11.6 when released. We do still have some work left on the PMIx plugin; if you don't use that plugin we believe those patches are sufficient to prevent this from happening.
We'll attach the PMIx cleanup here as well when available. If you're not in a rush, I would suggest waiting until 17.11.6 is released (should be within a week). Or you can use that patch in the meantime, or cherry-pick commits
1675ada0a, a7c8964e, and 3be9e1ee0 from git.
- Tim
On 1/5/18 8:40 am, bugs@schedmd.com wrote:
> Just an update -
Thanks Tim, we'll wait for 17.11.6 to land to test; we only see this currently when one of my colleagues runs a series of test jobs (so far, fingers crossed!).
All the best,
Chris

*** Ticket 5121 has been marked as a duplicate of this ticket. ***

I ended up splitting the PMIx portion of this off in commit e5f03971b / bug 5119, and that's available if you'd like to apply it.
We expect to release 17.11.6 shortly with all of these fixes rolled in, and I am going ahead and tagging this as resolved. If you're still noticing issues after that, please file a separate issue to discuss that further.
Thanks for your patience on all of this; the underlying bug proved quite difficult to isolate, which unfortunately delayed this quite a bit longer than we'd like.
- Tim

*** Ticket 5111 has been marked as a duplicate of this ticket. ***
*** Ticket 5177 has been marked as a duplicate of this ticket. ***
*** Ticket 5320 has been marked as a duplicate of this ticket. ***
*** Ticket 5545 has been marked as a duplicate of this ticket. ***

Hi,
CEA is running into this bug, encountered in 17.11.6 and fixed in 17.11.7:
+ -- Fix slurmstepd deadlock in stepd cleanup caused by race condition in
+ the jobacct_gather fini() interfaces introduced in 17.11.6.
Would it be possible to have access to the patch?
Thanks,
Regine Gaudin

I will be out of the office until August 5th. If you need assistance, please email help-hpc@caltech.edu. There will be others around that can deal with anything that may come up.

(In reply to Regine Gaudin from comment #26)
> CEA is running into this bug, encountered in 17.11.6 and fixed in 17.11.7:
> + -- Fix slurmstepd deadlock in stepd cleanup caused by race condition in
> + the jobacct_gather fini() interfaces introduced in 17.11.6.
>
> Would it be possible to have access to the patch?
Answering as someone who hit this bug when I was in Australia and got your reply to this resolved bugzilla entry.
17.11.9 was the last version of 17.11 released (that branch is obsolete now that 19.05 is out): https://www.schedmd.com/archives.php
All the best,
Chris

Hi Regine and Chris -
The last version of 17.11 was 17.11.13-2 and, as Chris has mentioned, the 17.11 release is no longer supported. Regarding your request for patches: there were multiple issues and a series of commits that fixed them, so it would be advisable to upgrade at this point in time.
-Jason |