-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
network: delayed conn close #4382
network: delayed conn close #4382
Conversation
Mitigate client read/close race issues on downstream HTTP connections by adding a new connection close type 'FlushWriteAndDelay'. This new close type flushes the write buffer on a connection but does not immediately close after emptying the buffer (unlike ConnectionCloseType::FlushWrite). A timer has been added to track delayed closes for both 'FlushWrite' and 'FlushWriteAndDelay'. Upon triggering, the socket will be closed and the connection will be cleaned up. Delayed close processing can be disabled by setting the newly added HCM 'delayed_close_timeout' config option to 0. Risk Level: Medium (changes common case behavior for closing of downstream HTTP connections) Testing: Unit tests and integration tests added. Fixes envoyproxy#2929. Signed-off-by: Andres Guedez <[email protected]>
HTTP/2 integration tests added for delayed close processing. Signed-off-by: Andres Guedez <[email protected]>
Signed-off-by: Andres Guedez <[email protected]>
Increase the timeout values for the common case (no timeout trigerred) tests to prevent test flakiness. Signed-off-by: Andres Guedez <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks for fixing this! Mostly style nits.
// | ||
// A value of 0 will completely disable delayed close processing, and the downstream connection's | ||
// socket will be closed immediately after the write flush is completed. | ||
google.protobuf.Duration delayed_close_timeout = 25 [(gogoproto.stdduration) = true]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a default value? What happens if delayed_close_timeout
is not set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a default timeout value of 1000 ms. I've amended the comment to note this.
include/envoy/network/connection.h
Outdated
@@ -63,7 +63,8 @@ class ConnectionCallbacks { | |||
*/ | |||
enum class ConnectionCloseType { | |||
FlushWrite, // Flush pending write data before raising ConnectionEvent::LocalClose | |||
NoFlush // Do not flush any pending data and immediately raise ConnectionEvent::LocalClose | |||
NoFlush, // Do not flush any pending data and immediately raise ConnectionEvent::LocalClose | |||
FlushWriteAndDelay |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a similar comment to the above two lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
close_with_flush_ = true; | ||
RELEASE_ASSERT(type == ConnectionCloseType::FlushWrite || | ||
type == ConnectionCloseType::FlushWriteAndDelay, | ||
""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale behind making this a RELEASE_ASSERT
? What's the practical implication if it was an ASSERT and happened in production?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using the RELEASE_ASSERT to provide runtime protection against bugs introduced in this logic, since they could lead to issues such as resource leaks and/or memory corruption.
I just re-read the style guide and it does mention that ASSERT should be the common case for this type of check, so I can certainly change it if that's the convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convention is generally to use ASSERT for this, if you feel strongly about RELEASE_ASSERT I'll defer to senior maintainer for the decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to ASSERT() per style guidelines.
type == ConnectionCloseType::FlushWriteAndDelay, | ||
""); | ||
delayed_close_ = true; | ||
bool delayed_close_timeout_set = delayedCloseTimeout().count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to see the explicit boolean expression here, ie, delayedCloseTimeout().count() > 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
// Force a closeSocket() after the write buffer is flushed if the close_type calls for it or if | ||
// no delayed close timeout is set. | ||
close_after_flush_ = type == ConnectionCloseType::FlushWrite || !delayed_close_timeout_set; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super nit, but I think this expression is more readable if you reverse the two halves of the or, since it separates the = and == more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
std::unique_ptr<Network::ConnectionImpl> server_connection(new Network::ConnectionImpl( | ||
dispatcher, std::make_unique<ConnectionSocketImpl>(0, nullptr, nullptr), | ||
TransportSocketPtr(transport_socket), true)); | ||
server_connection->setDelayedCloseTimeout(std::chrono::milliseconds(500)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we safe from flakes here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The timer's callback is manually invoked at the end of the test. The 'setDelayedCloseTimeout()' call is required to enable creation of the timer, but the actual value does not matter (as long as it's >0). I've added a clarifying comment.
TEST_P(ConnectionImplTest, FlushWriteAndDelayCloseTimerTriggerTest) { | ||
setUpBasicConnection(); | ||
connect(); | ||
server_connection_->setDelayedCloseTimeout(std::chrono::milliseconds(500)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question about flakes here. How do we know 500ms is the right number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is safe. Since the client connection does not issue a close() during this test, the timer always triggers. I've added a clarifying comment and reduced the timeout value since there is no need to wait so long.
@@ -33,6 +33,7 @@ using testing::InSequence; | |||
using testing::Invoke; | |||
using testing::InvokeWithoutArgs; | |||
using testing::Return; | |||
using testing::ReturnNew; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
version_); | ||
|
||
connection.run(); | ||
EXPECT_TRUE(response.find("SETTINGS expected") != std::string::npos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the form EXPECT_THAT(response, HasSubstr("SETTINGS expected"))
more readable when trying to understand how the test works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out the matcher. Done.
version_); | ||
|
||
connection.run(); | ||
EXPECT_TRUE(response.find("SETTINGS expected") != std::string::npos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, the HasSubstr matcher will make this read nicely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
The CircleCI mac failure is due to the lack of early connection close detection in macOS (see #4299). Per the conversation in that issue, I am disabling the failing test on macOS. |
// The delayed close timeout is for downstream connections managed by the HTTP connection manager. | ||
// It is defined as a grace period after connection close processing has been locally initiated | ||
// during which Envoy will flush the write buffers for the connection and await the peer to close | ||
// the connection (i.e., issue a TCP FIN/RST). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case flush the write buffer and delay close, we should issue TCP FIN but not RST, the expected behavior is issue FIN after flush writing, and RST on timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've clarified this comment since this is referring to the client shutting down the connection and Envoy receiving the FIN/RST.
With regards to Envoy issuing FIN vs RST, the behavior is for a shutdown(_, SHUT_WR)
to be issued when the flush write is successfully completed. If the delayed close timeout triggers, a close()
is issued on the socket, which will trigger FIN or RST depending on the state of the socket/connection.
What benefit are we trying to get out of always issuing RST on timeout? It will likely lead to faster reclamation of resources which is nice, but I think Envoy's behavior regarding FIN vs RST should be applied uniformly across timeout/error conditions and therefore should be tracked as a new issue.
Signed-off-by: Andres Guedez <[email protected]>
Signed-off-by: Andres Guedez <[email protected]>
Signed-off-by: Andres Guedez <[email protected]>
close_with_flush_ = true; | ||
RELEASE_ASSERT(type == ConnectionCloseType::FlushWrite || | ||
type == ConnectionCloseType::FlushWriteAndDelay, | ||
""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convention is generally to use ASSERT for this, if you feel strongly about RELEASE_ASSERT I'll defer to senior maintainer for the decision.
.circleci/config.yml
Outdated
@@ -10,7 +10,6 @@ jobs: | |||
resource_class: xlarge | |||
working_directory: /source | |||
environment: | |||
BAZEL_REMOTE_CACHE: https://storage.googleapis.com/envoy-circleci-bazel-cache/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's plan to revert these changes before this gets merged. If we do disable this for all of Envoy, I'd like to do it in a separate PR and consult with maintainers. I think you can leave it in for now with this comment in place so senior maintainer gets a heads up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return new Buffer::WatermarkBuffer(below_low, above_high); | ||
})); | ||
|
||
auto timer = new Event::MockTimer(&dispatcher); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That explanation makes sense. I get nervous when I see bare new calls, especially without a comment. I think this test might be doing the opposite of the pattern I usually see, which is to have a unique_ptr
here in the test body which is std::move
d into the right place to be returned by Envoy::MockBufferFactory::create()
but only after you steal a bare pointer out of the unique_ptr
to use when setting up any the gmock expectations. This is a clear case of dangerous cheating the unique_ptr
, but it's ok in this limited test case example since we know the unique pointer scope lives as long as the body of this test. I think it leaves the object life cycle a bit clearer defined than if you don't use unique_ptr
here.
If this doesn't work in your case, please add a comment explaining the object lifecycle at the bare new
s. Happy to chat more about this in person tomorrow too if it helps.
Signed-off-by: Andres Guedez <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks. Tagging @zuercher for senior review.
@dnoe -- thanks for the review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this on.
include/envoy/network/connection.h
Outdated
@@ -86,6 +88,8 @@ class Connection : public Event::DeferredDeletable, public FilterManager { | |||
Stats::Gauge& write_current_; | |||
// Counter* as this is an optional counter. Bind errors will not be tracked if this is nullptr. | |||
Stats::Counter* bind_errors_; | |||
// Optional counter. Delayed close semantics are only used by HTTP connections. | |||
Stats::Counter* delayed_close_timeouts_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove the reference to HTTP, since this isn't necessarily HTTP-specific. I think the comment can just mirror the one for bind_errors_.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed the delayed close timeout is not inherently HTTP specific, but the intent of the comment is to point out that as currently implemented it only affects HTTP connections, since the H{1,2} codecs are the only callers of setDelayedCloseTimeout()
on a connection to enable it.
I'm fine removing this however if you don't think the note/clarification is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My feeling is that it's just a comment we'll have to delete (or will be incorrect) when someone inevitably enables this for a different purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Cleaned up the comment.
include/envoy/network/connection.h
Outdated
|
||
/** | ||
* Set the timeout for delayed connection close()s. | ||
* This is only used for downstream connections processing HTTP/1 and HTTP/2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, remove this reference (or at least the "only" qualifier).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document (probably on close
) the relationship between this timeout value and the various close options. I had assumed that ConnectionCloseType::FlushWrite behavior was meant to be unchanged, but really it's going to wait up to the delayed close timeout, if set, to complete its flush before closing.
@@ -535,6 +559,13 @@ bool ConnectionImpl::bothSidesHalfClosed() { | |||
return read_end_stream_ && write_end_stream_ && write_buffer_->length() == 0; | |||
} | |||
|
|||
void ConnectionImpl::onDelayedCloseTimeout() { | |||
ENVOY_CONN_LOG(debug, "triggered delayed close", *this); | |||
ASSERT(connection_stats_->delayed_close_timeouts_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer that you test if delayed_close_timeouts_
is null (similar to how bind_errors_
optionality is handled). Or else, ASSERT the pointer is non-null when the delayed close timeout is configured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if (delayed_close_timer_) { | ||
// It's ok to disable even if the timer has already fired. | ||
delayed_close_timer_->disableTimer(); | ||
delayed_close_timer_.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: don't need to reset -- it's about to be destroyed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
}; | ||
} // namespace Network | ||
}; // namespace Envoy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: undo these -- looks like fix format got confused at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tools/check_format.py fails when this is reverted -- 'FixNamespaceComments' appears to be required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I downloaded this file and it seems to pass format checks locally without the stray namespace comments. Has it failed CI's format check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted comments. I was looking at the wrong lines... Thanks.
test/integration/http_integration.cc
Outdated
@@ -149,18 +149,36 @@ IntegrationCodecClient::startRequest(const Http::HeaderMap& headers) { | |||
return {encoder, std::move(response)}; | |||
} | |||
|
|||
void IntegrationCodecClient::waitForDisconnect() { | |||
bool IntegrationCodecClient::waitForDisconnect(uint32_t time_to_wait) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: use std::chrono::millisconds here to match the other integration tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
test/integration/http_integration.cc
Outdated
connection_->dispatcher().run(Event::Dispatcher::RunType::Block); | ||
if (wait_timer_triggered && !disconnected_) { | ||
return false; | ||
} | ||
EXPECT_TRUE(disconnected_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should disable the timer if it hasn't fired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
test/integration/integration_test.cc
Outdated
|
||
response->waitForEndStream(); | ||
// The delayed close timeout should trigger since client is not closing the connection. | ||
EXPECT_TRUE(codec_client_->waitForDisconnect(750)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you given this the runs_per_test treatment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've done extensive flake testing and did not hit any issues. I've adjusted the timeouts though to make it even more resistant since the CI environment is considerably more flake prone than my machine.
Signed-off-by: Andres Guedez <[email protected]>
Signed-off-by: Andres Guedez <[email protected]>
Signed-off-by: Andres Guedez <[email protected]>
Signed-off-by: Andres Guedez <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. One small thing, but otherwise looks good to me.
@@ -128,6 +141,7 @@ void ConnectionImpl::close(ConnectionCloseType type) { | |||
// Create and activate a timer which will immediately close the connection if triggered. | |||
// A config value of 0 disables the timeout. | |||
if (delayed_close_timeout_set) { | |||
ASSERT(connection_stats_->delayed_close_timeouts_ != nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant in setDelayedCloseTimeout.I don't want someone to enable this feature on a connection and then only later get an assertion about this stat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
connection_stats_
is set via setConnectionStats()
, so asserting on the counter != nullptr in setDelayedCloseTimeout()
will introduce an ordering dependency (which is not currently met). I think it would be preferable to keep these two functions independent of each other.
Unfortunately, initialization of ConnectionImpl
s is spread out between codec creation (allocation and call to setDelayedCloseTimeout
) and ConnectionManagerImpl::initializeReadFilterCallbacks()
where setConnectionStats()
is called. Therefore, adding an initFinished()
function where the ASSERT could be safely issued will require some work.
I can mark this with a TODO so we can move it into the appropriate place when such a function is added. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just make setting connection_state_->delayed_close_timeouts_ optional? (E.g. don't assert, just don't increment if it's null?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, done.
Signed-off-by: Andres Guedez <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@zuercher -- Thanks for reviewing! |
I don't have that commit in my branch; I'll merge and kick off new CI runs. |
@mattklein123 Please take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice. Love seeing this implemented and the flush timeout leak fixed also. A few small questions/comments.
Http1Settings settings, | ||
std::chrono::milliseconds delayed_close_timeout) | ||
: ConnectionImpl(connection, HTTP_REQUEST), callbacks_(callbacks), codec_settings_(settings) { | ||
connection_.setDelayedCloseTimeout(delayed_close_timeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reasoning for doing this set in the codec? It seems like something that is unrelated to the codec itself and should be done by higher layer code? Same question for the HTTP/2 codec below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not related to the codec directly but this particular delayed_close_timeout is HTTP only since it is managed via the HCM configuration. This seemed to be the most appropriate place to set the timeout for HTTP connections just after they are constructed.
Since the timeout value is intended to be network filter specific, I wouldn't expect any layer higher than the corresponding filter to issue the call to setDelayedCloseTimeout().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it should be done in the filter. Why not do it in the HCM filter and avoid codec modifications?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed it's best to contain it within the HCM. I revisited the code and moved setting the timeout to the HCM config.
|
||
// No need to continue if a FlushWrite/FlushWriteAndDelay has already been issued and there is a | ||
// pending delayed close. | ||
if (delayed_close_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this happen in practice? Mostly just wondering about the flow that causes this. Perhaps flesh out the comment with some examples of cases where this happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I initially tried an ASSERT() which triggered in multiple tests during a //test/... run. I just tracked down a failure during an integration_test and added it to the comment. Based on that example and code inspection, this conditional can evaluate true under fairly standard use cases such as when Envoy initiates the close() sequence during codec processing.
} | ||
|
||
delayed_close_ = true; | ||
bool delayed_close_timeout_set = delayedCloseTimeout().count() > 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: const
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Signed-off-by: Andres Guedez <[email protected]>
9d86332
Instead, constrain setting the delayed close timeout to the HCM config, which owns it. Signed-off-by: Andres Guedez <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, one small change and I think you also need a master merge.
@@ -301,6 +302,7 @@ Http::ServerConnectionPtr | |||
HttpConnectionManagerConfig::createCodec(Network::Connection& connection, | |||
const Buffer::Instance& data, | |||
Http::ServerConnectionCallbacks& callbacks) { | |||
connection.setDelayedCloseTimeout(delayed_close_timeout_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would recommend setting this in ConnectionManagerImpl::initializeReadFilterCallbacks
. It's earlier than here and also near where we setup some other connection level stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Signed-off-by: Andres Guedez <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Can you merge master and we can 🚢? Thank you!
Signed-off-by: Andres Guedez <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, sorry, one more doc thing.
// | ||
// A value of 0 will completely disable delayed close processing, and the downstream connection's | ||
// socket will be closed immediately after the write flush is completed. | ||
google.protobuf.Duration delayed_close_timeout = 26 [(gogoproto.stdduration) = true]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I forget this from before. Can you add this to the release notes also with a link to this field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Signed-off-by: Andres Guedez <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
@mattklein123 Thanks for the review! |
This reverts commit aa9478f. Signed-off-by: Jose Nino <[email protected]>
This PR reverts #4382. When deploying at Lyft we noticed crashes on here where we might be derefencing the connection_stats_ pointer after the point has been reset. Note: this PR keeps the changes to the API made in the original PR but tags the field as not implemented. This is what we have done in the past for reverts that involve changes that change the API. Signed-off-by: Jose Nino <[email protected]>
Re-enable the changes reverted in 9d32e5c, which were originally merged as part of envoyproxy#4382. Signed-off-by: Andres Guedez <[email protected]>
Fixes a segfault introduced in envoyproxy#4382 due to a connection tear down race condition when the delayed close timer triggers after connection state has been reset via closeSocket(). Signed-off-by: Andres Guedez <[email protected]>
Re-enable the changes reverted in 9d32e5c, which were originally merged as part of #4382. Signed-off-by: Andres Guedez <[email protected]>
This PR reverts envoyproxy#4382. When deploying at Lyft we noticed crashes on here where we might be derefencing the connection_stats_ pointer after the point has been reset. Note: this PR keeps the changes to the API made in the original PR but tags the field as not implemented. This is what we have done in the past for reverts that involve changes that change the API. Signed-off-by: Jose Nino <[email protected]> Signed-off-by: Aaltan Ahmad <[email protected]>
…ix (envoyproxy#4587) Re-enable the changes reverted in 9d32e5c, which were originally merged as part of envoyproxy#4382. Signed-off-by: Andres Guedez <[email protected]> Signed-off-by: Aaltan Ahmad <[email protected]>
network: Support delayed close of downstream conns (#2929)
Mitigate client read/close race issues on downstream HTTP connections by adding a new connection
close type 'FlushWriteAndDelay'. This new close type flushes the write buffer on a connection but
does not immediately close after emptying the buffer (unlike ConnectionCloseType::FlushWrite).
A timer has been added to track delayed closes for both 'FlushWrite' and 'FlushWriteAndDelay'. Upon
triggering, the socket will be closed and the connection will be cleaned up.
Delayed close processing can be disabled by setting the newly added HCM 'delayed_close_timeout'
config option to 0.
Risk Level: Medium (changes common case behavior for closing of downstream HTTP connections)
Testing: Unit tests and integration tests added.
Fixes #2929.
Signed-off-by: Andres Guedez [email protected]