Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate crash introduced in #4382 #4583

Closed
AndresGuedez opened this issue Oct 2, 2018 · 3 comments · Fixed by #4587
Closed

Investigate crash introduced in #4382 #4583

AndresGuedez opened this issue Oct 2, 2018 · 3 comments · Fixed by #4587
Assignees
Labels
Milestone

Comments

@AndresGuedez
Copy link
Contributor

Description:
Code merged in #4382 is causing a SEGV in source/common/network/connection_impl.cc:586.

My comment from #4581:

As @junr03 pointed out, it seems the delayed close timeout callback is being issued after a ConnectionImpl::closeSocket() call which resets the connection_stats_ pointer has occurred. This points to an interesting race since destruction of the ConnectionImpl should lead to disarming the timeout, so it's possible the onDelayedCloseTimeout() cb is triggering while the ConnectionImpl is in the deferred deletion stage.

@AndresGuedez
Copy link
Contributor Author

@junr03 Can you provide steps to reproduce? I just want to verify my assumptions are correct about the timing that caused this.

A comprehensive fix should be a nullptr check for connection_stats_ in ConnectionImpl::onDelayedCloseTimeout() along with disabling the delayed close timer in ConnectionImpl::closeSocket().

@junr03
Copy link
Member

junr03 commented Oct 2, 2018

I agree with your timing comment (re: the cb being triggered while we are in deferred deletion), and with your proposal to fix.

Unfortunately this was happening in production boxes, so I don't have local repro steps.

Thanks for looking into it @AndresGuedez

@mattklein123
Copy link
Member

@AndresGuedez per @junr03 your assessment is correct. I verified via the crashing call stack and code inspection. I agree the right fix is disabling the timer in closeSocket(). I would recommend creating a crashing test and then fixing the bug.

AndresGuedez added a commit to AndresGuedez/envoy that referenced this issue Oct 2, 2018
Fixes a segfault introduced in envoyproxy#4382 due to a connection tear down race condition when the delayed
close timer triggers after connection state has been reset via closeSocket().

Signed-off-by: Andres Guedez <[email protected]>
@mattklein123 mattklein123 modified the milestones: 1.8.0, 1.9.0 Oct 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants