Ticket 3166 - eio_handle_mainloop/eio_signal_shutdown race and error
Summary: eio_handle_mainloop/eio_signal_shutdown race and error
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd (show other tickets)
Version: 17.02.x
Hardware: Linux Linux
: 3 - Medium Impact
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-10-12 14:31 MDT by Aaron Knister
Modified: 2017-02-01 22:48 MST (History)
0 users

See Also:
Site: NASA - NCCS
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.02.0-rc2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Sleep statements cause test failures (1.00 KB, patch)
2017-01-26 10:13 MST, Moe Jette
Details | Diff
Patch for version 16.05 (981 bytes, patch)
2017-02-01 15:49 MST, Moe Jette
Details | Diff

Note You need to log in before you can comment on or make changes to this ticket.
Description Aaron Knister 2016-10-12 14:31:41 MDT
There's an issue we have as a result of a strange interaction between the hydra MPI launcher and SSSD that causes a file descriptor related to SSSD to get open()'d on fd 0 of srun. srun of course thinks it's stdin and attempts to read it and pass it to stepd. Because it's a socket, srun ends up reading a ridiculous amount of data and passing it to stepd and the remote tasks(s). The remote tasks aren't expecting it and so at job termination we get an error that looks like this:

srun: debug:  IO error on node 0
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 0

This is a simple reproducer:

dd if=/dev/zero bs=1k | srun sleep 5

What I think I narrowed it down to is that there's somewhat of a race condition between eio_handle_mainloop and eio_signal_shutdown. If eio_handle_mainloop exists before eio_signal_shutdown is called the socket is never shutdown() meaning there's data sent to the socket by srun that hasn't been dequeued by stepd and fed to sleep because, well, sleep doesn't read anything from stdin. When stepd exits, the next read() attempted by srun will return with ECONNRESET as a result of the unread, pending data in stepd before it quit and the kernel closed the socket.

I've come up with a seemingly functional fix that ensures the eio objects get set to shutdown and a subsequent shutdown() called should eio_handle_mainloop exit prior to eio_signal_shutdown being called. The proposed fix is here:

https://github.com/aaronknister/slurm/commits/connection_reset_fix

I'm happy to write regression tests if this fix is accepted/desirable.

-Aaron
Comment 1 Tim Wickberg 2016-10-12 14:57:04 MDT
Thanks for opening this.

The main question I had was whether changing SHUT_RD to SHUT_RDWR is sufficient? At least in my quick testing that seems to fix it, but I can't tell if that may be causing some other subtle problem.

In your review, have you seen any reason that srun would close only one direction of the socket here?

It also sounds like this still leads to that data being sent across - do you have a proposed fix for that, or are you only concerned with the bad exit code?
Comment 2 Aaron Knister 2016-10-12 16:49:30 MDT
Hi Tim,

That was actually my first attempt at fixing this. It appeared to work ... until it didn't. I then realized the timing issue between eio_handle_mainloop and eio_signal_shutdown. If eio_handle_mainloop exits before eio_signal_shutdown gets called, then shutdown() will never get called and ECONNRESET is received by srun.

-Aaron
Comment 3 Aaron Knister 2016-11-10 10:07:19 MST
Tim,

Just checking in and wondering if you've had a chance to give this any more thought?

-Aaron
Comment 4 Danny Auble 2016-12-14 10:14:28 MST
Aaron, we haven't had that much time to look into this yet.  I'll try to get something by the end of the year.  If this is causing a bigger issue please let me know and I can try to get to it sooner.
Comment 5 Aaron Knister 2016-12-14 19:33:25 MST
Hi Danny,

That sounds great to me. It's frankly a minor annoyance-- by and large not a show stopper. If we could get the fix rolled in before we upgrade to 16.05 that would be just dandy if not that's OK too. I think we're looking at going to 16.05 in January or February. Bruce or Lyn could confirm the timeline.

Thanks!

-Aaron
Comment 6 Aaron Knister 2016-12-14 19:34:44 MST
I just saw that the Version is set to 17.02. I'm not sure if that's set in stone, but if a fix doesn't make it into 16.05 (which I would understand given the nature of the change) that's OK with me.
Comment 7 Moe Jette 2017-01-18 13:24:39 MST
Aaron,

Thank you for your tracking down and providing these patches. I have committed your patch (without much of the logging and with some minor format changes) to version 17.02 (to be released next month).
https://github.com/SchedMD/slurm/commit/c08f892252304962a802ad06cf2039e49f6859ab

I also added a very simple test based upon your reproducer in this commit:
https://github.com/SchedMD/slurm/commit/40e89091548a69b850f3d04db21a691076c351a2

We are very conservative about making changes into production versions of Slurm in order to maximize stability, so I wasn't anxious to add this to version 16.05, especially with version 17.02 just a month away from release.

Thanks again!
Comment 8 Aaron Knister 2017-01-18 14:17:58 MST
Thanks Moe!
Comment 10 Moe Jette 2017-01-26 10:13:50 MST
Created attachment 3989 [details]
Sleep statements cause test failures

Aaron,

We've observed some failures in our test suite with this patch. Specifically, stdout and/or stderr is getting truncated and/or the interleaving out output gets corrupted. Personally I never saw any failures, but other staff at SchedMD did.

Applying this patch adds some well placed sleeps in the slurm code and causes many of tests in our test suite to fail, especially 9.1, 9.2, 9.3 and 9.5. Note this is not a fix, but causes the problem to become more apparent. I'll attach a change to your original patch that seems to make the problem go away.
Comment 11 Moe Jette 2017-01-26 10:34:01 MST
Here's a one-line patch that makes the I/O errors go away:
https://github.com/SchedMD/slurm/commit/435f96a685c059226cf62d32918a5c61b39ae128
Comment 12 Aaron Knister 2017-01-26 12:28:15 MST
Hi Moe,

Sorry for the trouble and thanks for the fix! 

-Aaron
Comment 13 Moe Jette 2017-01-27 14:56:35 MST
(In reply to Moe Jette from comment #11)
> Here's a one-line patch that makes the I/O errors go away:
> https://github.com/SchedMD/slurm/commit/
> 435f96a685c059226cf62d32918a5c61b39ae128

While this patch eliminates the possibility lost output at job end, it restores the original problem which you reported.

We are on the verge of releasing Slurm version 17.02.0-rc1 (release candidate 1) and my plate is more than full run now, so this is my plan:

1. Comment out the vestiges of your original patch. This bug has apparently existed for some time and I don't want to trade that problem for a new one (lost output at job termination).

2. Disable the new test for this issue (test1.115) and note for now that is it known to be broken with a reference back to this bug.

3. Revisit the bug as time permits, likely in March.

Here are the original commits:
https://github.com/SchedMD/slurm/commit/c08f892252304962a802ad06cf2039e49f6859ab
https://github.com/SchedMD/slurm/commit/40e89091548a69b850f3d04db21a691076c351a2
https://github.com/SchedMD/slurm/commit/435f96a685c059226cf62d32918a5c61b39ae128

Here's the commit disabling the change:
https://github.com/SchedMD/slurm/commit/6dcce7b50380aa6dc9654dd9b9c880093cad0352
Comment 14 Moe Jette 2017-02-01 15:49:34 MST
Created attachment 4007 [details]
Patch for version 16.05

I approached this problem in several ways and believe the attached patch will work best for version 16.05. This avoids the possible loss of stdout/stderr to be transmitted at application termination, which existed in your original patch. If stdin is abandoned at application end a message may be logged by srun with the -v/--verbose option that looks like this:
_server_read: Dropped pending I/O for terminated task


That message may or may not be printed, depending upon timing.

This is the solution that I applied to version 17.02, along with a test. The commit to version 17.02 with this final change is
https://github.com/SchedMD/slurm/commit/debc605ad853e694fbc19ab356d5e6e099fdc122
Comment 15 Moe Jette 2017-02-01 15:51:07 MST
I'm closing the ticket based upon the latest patch for version 16.05 and the new code which will be included in version 17.02 when released. Feel free to re-open the ticket if you encounter more problems with this.
Comment 16 Aaron Knister 2017-02-01 22:48:47 MST
Wow! Thanks Moe, you rock!

I had started looking into this and was trying to understand why the patch I concocted was causing output corruption. Suffice it to say I hadn't made much progress, but your fix makes a ton of sense. Thanks again!