-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upstream: avoid reset after end_stream #14106
Conversation
Signed-off-by: Snow Pettersen <[email protected]>
@alyssawilk This attempts to fix the issue with double ends that were happening in one of the integration tests in #13095 that I mentioned to you over Slack a while ago. I wasn't able to figure out why we get a end_stream=true and a reset, but at least this prevents it from reaching the router. LMK what you think |
Interestingly this seems to trigger an ASAN heap-use-after-free when I try to modify the |
Signed-off-by: Snow Pettersen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this (and for figuring out asan!)
One thought I think isn't covered by tests today but I'd rather fix in advance...
@@ -87,6 +87,11 @@ void TcpUpstream::resetStream() { | |||
|
|||
void TcpUpstream::onUpstreamData(Buffer::Instance& data, bool end_stream) { | |||
upstream_request_->decodeData(data, end_stream); | |||
// This ensures that if we get a reset after end_stream we won't propagate two | |||
// "end streams" to the upstream_request_. | |||
if (end_stream) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
once we have upstream filters, will it be possible to get an upstream "end stream "before we send headers? I think it could be the case, so I think we're going to want to do null pointer checks for upstream_request in few other places, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good point, I'll go over all the functions
Signed-off-by: Snow Pettersen <[email protected]>
Signed-off-by: Snow Pettersen <[email protected]>
Signed-off-by: Snow Pettersen <[email protected]>
@@ -45,6 +46,11 @@ void TcpUpstream::encodeData(Buffer::Instance& data, bool end_stream) { | |||
|
|||
Envoy::Http::Status TcpUpstream::encodeHeaders(const Envoy::Http::RequestHeaderMap&, | |||
bool end_stream) { | |||
if (!upstream_request_) { | |||
// TODO(snowp): Should this return something else in this case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, seems like an error case to me.
Signed-off-by: Snow Pettersen <[email protected]>
Signed-off-by: Snow Pettersen <[email protected]>
// Verifies that if we send headers after the upstream has been reset nothing crashes. | ||
TEST_F(TcpUpstreamTest, HeadersAfterReset) { | ||
tcp_upstream_->onEvent(Network::ConnectionEvent::LocalClose); | ||
EXPECT_TRUE(tcp_upstream_->encodeHeaders(request_, false).ok()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this fail given the status error?
Signed-off-by: Snow Pettersen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. Thanks for driving this one though!
@@ -480,8 +480,8 @@ void UpstreamRequest::clearRequestEncoder() { | |||
// Before clearing the encoder, unsubscribe from callbacks. | |||
if (upstream_) { | |||
parent_.callbacks()->removeDownstreamWatermarkCallbacks(downstream_watermark_manager_); | |||
parent_.callbacks()->dispatcher().deferredDelete(std::move(upstream_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive by: @snowp @alyssawilk did you check carefully that the destructor of the various upstreams isn't referencing anything that can go away? We have had subtle bugs in this area and in general the use of deferred delete can be a little precarious.
Also, question: is this change defensive or is there a real case that causes this? It seems like we shouldn't be calling reset at all after we receive end stream? Why doesn't end_stream cause upstream_ to get nulled out, etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original motivation for this PR was to fix an issue I ran into in #13095 where a reset would follow an end_stream=true callback and both of these events would be propagated to the router/UpstreamRequest. This ran into an issue when resetting the upstream_request_ as part of the router, hence the deferred deletions. It does feel like there's something weird going on here, but this seemed to be the easiest fix to what was happening on the other PR. Happy to discuss more (and even revert if we think this is the wrong way to handle it).
The remaining changes here are defensive changes in anticipation of the upstream FM changes, at least I don't have a clear sense of when this would be a problem, maybe @alyssawilk does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: deferred delete, it's mainly just making sure the upstream can be deleted after the router part. Is there any reference in the destructor that might cause a timing issue? For example timers getting cancelled, etc. This can be somewhat subtle and needs to be audited.
re: the change itself, it does seem to me like this is not the correct fix, so we can at least open a tracking issue to try to figure out what state is not getting reset properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through both the HTTP and TCP upstream neither of them has a dtor that does anything but a null check, and the only thing that would be destructed as part of it is the ConnectionData
handle. The TcpConnectionData
seems to indicate that it's resilient to deferred deletion ordering so I suspect this is fine:
~TcpConnectionData() override {
// Generally it is the case that TcpConnectionData will be destroyed before the
// ActiveTcpClient. Because ordering on the deferred delete list is not guaranteed in the
// case of a disconnect, make sure parent_ is valid before doing clean-up.
if (parent_) {
parent_->clearCallbacks();
}
}
I definitely had not done my due diligence here so, so thanks for pointing this out.
I'll file an issue and try to dig a bit deeper if I can find the time
* master: (70 commits) upstream: avoid reset after end_stream in TCP HTTP upstream (envoyproxy#14106) bazelci: add fuzz coverage (envoyproxy#14179) dependencies: allowlist CVE-2020-8277 to prevent false positives. (envoyproxy#14228) cleanup: replace ad-hoc [0, 1] value types with UnitFloat (envoyproxy#14081) Update docs for skywalking tracer (envoyproxy#14210) Fix some errors in the switch statement when decode dubbo response (envoyproxy#14207) Windows: enable tests and envoy-static.exe pdb file (envoyproxy#13688) http: add Kill Request HTTP filter (envoyproxy#14170) dependencies: fix release_dates error behavior. (envoyproxy#14216) thrift filter: support skip decoding data after metadata in the thrift message (envoyproxy#13592) update cares (envoyproxy#14213) docs: clarify behavior of hedge_on_per_try_timeout (envoyproxy#12983) repokitteh: add support for randomized auto-assign. (envoyproxy#14185) [grpc] validate grpc config for illegal characters (envoyproxy#14129) server: Return nullopt when process_context is nullptr (envoyproxy#14181) [Windows] Fix thrift proxy tests (envoyproxy#13220) kafka: add missing unit tests (envoyproxy#14195) doc: mention gperftools explicitly in PPROF.md (envoyproxy#14199) Removed `--use-fake-symbol-table` option. (envoyproxy#14178) filter contract: clarification around local replies (envoyproxy#14193) ... Signed-off-by: Michael Puncel <[email protected]>
…nvoyproxy#14106)" This reverts commit 1c06967. Signed-off-by: Snow Pettersen <[email protected]>
…14106)" (#14551) This reverts commit 1c06967. Signed-off-by: Snow Pettersen <[email protected]>
Signed-off-by: Snow Pettersen [email protected]
For an explanation of how to fill out the fields, please see the relevant section
in PULL_REQUESTS.md
Risk Level: Medium
Testing: New UT
Docs Changes: n/a
Release Notes: n/a