-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xDS: gRPC connection failure shouldn't make Envoy continue startup #8152
Conversation
void HttpSubscriptionImpl::handleFailure(Config::ConfigUpdateFailureReason reason, | ||
const EnvoyException* e) { | ||
|
||
switch (reason) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code here is exactly same as the onConfigUpdateFailed
in grpc_mux_subscription_impl
except logs. Can we refactor them in to a utility and pass ApiType
and use it in logs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO this is not necessary to use a common function to handle the errors for HTTP and gRPC subscription. It's possible to have different error handling for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why would it be different once the message is received, but up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GrpcMuxSubscriptionImpl
and HttpSubscriptionImpl
share Config::Subscription
as common interface, which only defines subscription related API, no API for get stats_
and callbacks_
, or disableInitFetchTimeoutTimer()
. That's, beside start()
and updateResource()
, they are not expected to be same, although they have similar implementation.
I guess it's not a good idea to define a new interface for error handling or use template to generalize the type. Maybe they can share a common base class which define the error handling, but for this PR I want to limit the change scope :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to dedupe here, but agree we can push this to a later PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please hold off on anything like this until #7293 is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The reason being that GrpcMuxSubscriptionImpl is about to be entirely replaced)
// any initial CDS/LDS discovery response, so here calls onConfigUpdateFailed() | ||
// even reason is ConnectionFailure. After the test case fixed, | ||
// onConfigUpdateFailed() shouldn't be called for ConnectionFailure. | ||
callbacks_.onConfigUpdateFailed(reason, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But should we fix that case now? Otherwise of http subscriptions even if the connection failed, it will call onConfigUpdateFailed method which is incorrect/inconsistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest use an another PR to fix "//test/integration:hotrestart_test". I'm not fully understand that test case right now, no sure what's the best way to modify that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR #8162 created for fixing "//test/integration:hotrestart_test", that's a small config change, just remove dynamic_resources, after that accepted, I will update code here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR #8162 merged. This is also updated.
// If init_fetch_timeout is non-zero, server will continue startup after it timeout | ||
return; | ||
} | ||
|
||
callbacks_.onConfigUpdateFailed(reason, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: call this method if reason != ConnectionFailure may be more readable than return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting const value at left side of ==
is considered a better style. It doesn't cause any difficulty for readability usually, because reader always need to see both side of ==
to understand the condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. My comment was not clear may be. I am not talking about const on left of ==,
I am suggesting the following instead if return - Not a big deal though
if (Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure != reason) {
callbacks_.onConfigUpdateFailed(reason, e);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO positive logic is preferred, the comment in if
block describe is helpful for understanding what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@l8huang can you actually switch this to reason == the const value
? I agree that what you have there is "better style" for the reason you give, but it's actually quite jarring as most of the Envoy code base doesn't do this, and the "conform to local practices" style argument wins out IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will do
handleFailure(e); | ||
void HttpSubscriptionImpl::onFetchFailure(Config::ConfigUpdateFailureReason reason, | ||
const EnvoyException* e) { | ||
handleFailure(reason, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought onFetchFailure is always a ConnectionFailure? If so, we do not have to change onFetchFailure signature and just call handleFailure with ConnectionFailure reason here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, please see old code in source/common/http/rest_api_fetcher.cc
.
@lambdai for first pass. @ramaraochavali I thought we had already done this, do you have an idea of what is different? |
@htuch we fixed it for EDS earlier. This fix is trying to generalize for all subscriptions. |
break; | ||
case Config::ConfigUpdateFailureReason::FetchTimedout: | ||
ENVOY_LOG(warn, "REST config: initial fetch timeout for {}", path_); | ||
stats_.init_fetch_timeout_.inc(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: reset timer if not nullptr, or simply disableInitFetchTimeoutTimer() ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
disableInitFetchTimeoutTimer()
added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically it's safe when the callback need to handle fewer conditions(connection failure)
The questions is, is there any callback expecting connection failure signal, such as clean up? I think the answer is no.
source/common/upstream/eds.cc
Outdated
@@ -251,13 +251,8 @@ bool EdsClusterImpl::updateHostsPerLocality( | |||
return false; | |||
} | |||
|
|||
void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason reason, | |||
void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a log line here? It's a critical milestone to trigger an init complete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add a ASSERT(reason != Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert added.
Currently, if gRPC config stream disconnected while Envoy waiting for initial xDS response, xDS implementations' onConfigUpdateFailed() will allow Envoy startup to continue. This may cause Envoy begins taking traffics while route/cluster/endpoint config are still missing and return "404 NR" or "503 NR". This change makes Envoy waiting for initial xDS response until initial_fetch_timeout if specified. Signed-off-by: lhuang8 <[email protected]>
Signed-off-by: lhuang8 <[email protected]>
@lambdai PR updated, could you please take a review? |
/lgtm |
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo a tiny nit.
void HttpSubscriptionImpl::handleFailure(Config::ConfigUpdateFailureReason reason, | ||
const EnvoyException* e) { | ||
|
||
switch (reason) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to dedupe here, but agree we can push this to a later PR.
// If init_fetch_timeout is non-zero, server will continue startup after it timeout | ||
return; | ||
} | ||
|
||
callbacks_.onConfigUpdateFailed(reason, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@l8huang can you actually switch this to reason == the const value
? I agree that what you have there is "better style" for the reason you give, but it's actually quite jarring as most of the Envoy code base doesn't do this, and the "conform to local practices" style argument wins out IMHO.
Signed-off-by: lhuang8 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code updated according to comment, please take a look.
// If init_fetch_timeout is non-zero, server will continue startup after it timeout | ||
return; | ||
} | ||
|
||
callbacks_.onConfigUpdateFailed(reason, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@l8huang Nice change; looking at your code and description, this certainly seems like a better way to handle the situation! I have tried to adapt it into my ongoing xDS PR, #7293. Could you take a quick look at to make sure I have the right idea? (The flow is that DeltaSubscriptionState::handleEstablishmentFailure() calls DeltaSubscriptionImpl::onConfigUpdateFailed()). |
@fredlas LGTM |
Thank you! :) |
…nvoyproxy#8152) Currently, if gRPC config stream disconnected while Envoy waiting for initial xDS response, xDS implementations' onConfigUpdateFailed() will allow Envoy startup to continue. This may cause Envoy begins taking traffics while route/cluster/endpoint config are still missing and return "404 NR" or "503 NR". This change makes Envoy waiting for initial xDS response until initial_fetch_timeout if specified. Risk Level: Medium Testing: existing test cases updated Fixes envoyproxy#8046 Signed-off-by: lhuang8 <[email protected]>
…nvoyproxy#8152) Currently, if gRPC config stream disconnected while Envoy waiting for initial xDS response, xDS implementations' onConfigUpdateFailed() will allow Envoy startup to continue. This may cause Envoy begins taking traffics while route/cluster/endpoint config are still missing and return "404 NR" or "503 NR". This change makes Envoy waiting for initial xDS response until initial_fetch_timeout if specified. Risk Level: Medium Testing: existing test cases updated Fixes envoyproxy#8046 Signed-off-by: lhuang8 <[email protected]>
…nvoyproxy#8152) Currently, if gRPC config stream disconnected while Envoy waiting for initial xDS response, xDS implementations' onConfigUpdateFailed() will allow Envoy startup to continue. This may cause Envoy begins taking traffics while route/cluster/endpoint config are still missing and return "404 NR" or "503 NR". This change makes Envoy waiting for initial xDS response until initial_fetch_timeout if specified. Risk Level: Medium Testing: existing test cases updated Fixes envoyproxy#8046 Signed-off-by: lhuang8 <[email protected]>
xDS: gRPC connection failure shouldn't make Envoy continue startup
Description:
Currently, if gRPC config stream disconnected while Envoy waiting for
initial xDS response, xDS implementations' onConfigUpdateFailed() will
allow Envoy startup to continue. This may cause Envoy begins taking
traffics while route/cluster/endpoint config are still missing and
return "404 NR" or "503 NR".
This change makes Envoy waiting for initial xDS response until
initial_fetch_timeout if specified.
Risk Level: Medium
Testing: existing test cases updated
Fixes #8046
Signed-off-by: lhuang8 [email protected]