| Summary: | eio_handle_mainloop/eio_signal_shutdown race and error | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Aaron Knister <aaron.s.knister> |
| Component: | slurmstepd | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 17.02.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NASA - NCCS | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 17.02.0-rc2 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: |
Sleep statements cause test failures
Patch for version 16.05 |
||
|
Description
Aaron Knister
2016-10-12 14:31:41 MDT
Thanks for opening this. The main question I had was whether changing SHUT_RD to SHUT_RDWR is sufficient? At least in my quick testing that seems to fix it, but I can't tell if that may be causing some other subtle problem. In your review, have you seen any reason that srun would close only one direction of the socket here? It also sounds like this still leads to that data being sent across - do you have a proposed fix for that, or are you only concerned with the bad exit code? Hi Tim, That was actually my first attempt at fixing this. It appeared to work ... until it didn't. I then realized the timing issue between eio_handle_mainloop and eio_signal_shutdown. If eio_handle_mainloop exits before eio_signal_shutdown gets called, then shutdown() will never get called and ECONNRESET is received by srun. -Aaron Tim, Just checking in and wondering if you've had a chance to give this any more thought? -Aaron Aaron, we haven't had that much time to look into this yet. I'll try to get something by the end of the year. If this is causing a bigger issue please let me know and I can try to get to it sooner. Hi Danny, That sounds great to me. It's frankly a minor annoyance-- by and large not a show stopper. If we could get the fix rolled in before we upgrade to 16.05 that would be just dandy if not that's OK too. I think we're looking at going to 16.05 in January or February. Bruce or Lyn could confirm the timeline. Thanks! -Aaron I just saw that the Version is set to 17.02. I'm not sure if that's set in stone, but if a fix doesn't make it into 16.05 (which I would understand given the nature of the change) that's OK with me. Aaron, Thank you for your tracking down and providing these patches. I have committed your patch (without much of the logging and with some minor format changes) to version 17.02 (to be released next month). https://github.com/SchedMD/slurm/commit/c08f892252304962a802ad06cf2039e49f6859ab I also added a very simple test based upon your reproducer in this commit: https://github.com/SchedMD/slurm/commit/40e89091548a69b850f3d04db21a691076c351a2 We are very conservative about making changes into production versions of Slurm in order to maximize stability, so I wasn't anxious to add this to version 16.05, especially with version 17.02 just a month away from release. Thanks again! Thanks Moe! Created attachment 3989 [details]
Sleep statements cause test failures
Aaron,
We've observed some failures in our test suite with this patch. Specifically, stdout and/or stderr is getting truncated and/or the interleaving out output gets corrupted. Personally I never saw any failures, but other staff at SchedMD did.
Applying this patch adds some well placed sleeps in the slurm code and causes many of tests in our test suite to fail, especially 9.1, 9.2, 9.3 and 9.5. Note this is not a fix, but causes the problem to become more apparent. I'll attach a change to your original patch that seems to make the problem go away.
Here's a one-line patch that makes the I/O errors go away: https://github.com/SchedMD/slurm/commit/435f96a685c059226cf62d32918a5c61b39ae128 Hi Moe, Sorry for the trouble and thanks for the fix! -Aaron (In reply to Moe Jette from comment #11) > Here's a one-line patch that makes the I/O errors go away: > https://github.com/SchedMD/slurm/commit/ > 435f96a685c059226cf62d32918a5c61b39ae128 While this patch eliminates the possibility lost output at job end, it restores the original problem which you reported. We are on the verge of releasing Slurm version 17.02.0-rc1 (release candidate 1) and my plate is more than full run now, so this is my plan: 1. Comment out the vestiges of your original patch. This bug has apparently existed for some time and I don't want to trade that problem for a new one (lost output at job termination). 2. Disable the new test for this issue (test1.115) and note for now that is it known to be broken with a reference back to this bug. 3. Revisit the bug as time permits, likely in March. Here are the original commits: https://github.com/SchedMD/slurm/commit/c08f892252304962a802ad06cf2039e49f6859ab https://github.com/SchedMD/slurm/commit/40e89091548a69b850f3d04db21a691076c351a2 https://github.com/SchedMD/slurm/commit/435f96a685c059226cf62d32918a5c61b39ae128 Here's the commit disabling the change: https://github.com/SchedMD/slurm/commit/6dcce7b50380aa6dc9654dd9b9c880093cad0352 Created attachment 4007 [details] Patch for version 16.05 I approached this problem in several ways and believe the attached patch will work best for version 16.05. This avoids the possible loss of stdout/stderr to be transmitted at application termination, which existed in your original patch. If stdin is abandoned at application end a message may be logged by srun with the -v/--verbose option that looks like this: _server_read: Dropped pending I/O for terminated task That message may or may not be printed, depending upon timing. This is the solution that I applied to version 17.02, along with a test. The commit to version 17.02 with this final change is https://github.com/SchedMD/slurm/commit/debc605ad853e694fbc19ab356d5e6e099fdc122 I'm closing the ticket based upon the latest patch for version 16.05 and the new code which will be included in version 17.02 when released. Feel free to re-open the ticket if you encounter more problems with this. Wow! Thanks Moe, you rock! I had started looking into this and was trying to understand why the patch I concocted was causing output corruption. Suffice it to say I hadn't made much progress, but your fix makes a ton of sense. Thanks again! |