Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xDS: gRPC connection failure shouldn't make Envoy continue startup #8152

Merged
merged 3 commits into from
Sep 19, 2019

Conversation

l8huang
Copy link
Contributor

@l8huang l8huang commented Sep 4, 2019

xDS: gRPC connection failure shouldn't make Envoy continue startup

Description:
Currently, if gRPC config stream disconnected while Envoy waiting for
initial xDS response, xDS implementations' onConfigUpdateFailed() will
allow Envoy startup to continue. This may cause Envoy begins taking
traffics while route/cluster/endpoint config are still missing and
return "404 NR" or "503 NR".

This change makes Envoy waiting for initial xDS response until
initial_fetch_timeout if specified.

Risk Level: Medium
Testing: existing test cases updated
Fixes #8046

Signed-off-by: lhuang8 [email protected]

void HttpSubscriptionImpl::handleFailure(Config::ConfigUpdateFailureReason reason,
const EnvoyException* e) {

switch (reason) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here is exactly same as the onConfigUpdateFailed in grpc_mux_subscription_impl except logs. Can we refactor them in to a utility and pass ApiType and use it in logs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this is not necessary to use a common function to handle the errors for HTTP and gRPC subscription. It's possible to have different error handling for them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why would it be different once the message is received, but up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GrpcMuxSubscriptionImpl and HttpSubscriptionImpl share Config::Subscription as common interface, which only defines subscription related API, no API for get stats_ and callbacks_, or disableInitFetchTimeoutTimer(). That's, beside start() and updateResource(), they are not expected to be same, although they have similar implementation.

I guess it's not a good idea to define a new interface for error handling or use template to generalize the type. Maybe they can share a common base class which define the error handling, but for this PR I want to limit the change scope :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to dedupe here, but agree we can push this to a later PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please hold off on anything like this until #7293 is merged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The reason being that GrpcMuxSubscriptionImpl is about to be entirely replaced)

// any initial CDS/LDS discovery response, so here calls onConfigUpdateFailed()
// even reason is ConnectionFailure. After the test case fixed,
// onConfigUpdateFailed() shouldn't be called for ConnectionFailure.
callbacks_.onConfigUpdateFailed(reason, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But should we fix that case now? Otherwise of http subscriptions even if the connection failed, it will call onConfigUpdateFailed method which is incorrect/inconsistent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest use an another PR to fix "//test/integration:hotrestart_test". I'm not fully understand that test case right now, no sure what's the best way to modify that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #8162 created for fixing "//test/integration:hotrestart_test", that's a small config change, just remove dynamic_resources, after that accepted, I will update code here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #8162 merged. This is also updated.

// If init_fetch_timeout is non-zero, server will continue startup after it timeout
return;
}

callbacks_.onConfigUpdateFailed(reason, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: call this method if reason != ConnectionFailure may be more readable than return?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting const value at left side of == is considered a better style. It doesn't cause any difficulty for readability usually, because reader always need to see both side of == to understand the condition.

Copy link
Contributor

@ramaraochavali ramaraochavali Sep 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. My comment was not clear may be. I am not talking about const on left of ==,

I am suggesting the following instead if return - Not a big deal though

if (Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure != reason) {
callbacks_.onConfigUpdateFailed(reason, e);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO positive logic is preferred, the comment in if block describe is helpful for understanding what's going on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@l8huang can you actually switch this to reason == the const value? I agree that what you have there is "better style" for the reason you give, but it's actually quite jarring as most of the Envoy code base doesn't do this, and the "conform to local practices" style argument wins out IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will do

handleFailure(e);
void HttpSubscriptionImpl::onFetchFailure(Config::ConfigUpdateFailureReason reason,
const EnvoyException* e) {
handleFailure(reason, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought onFetchFailure is always a ConnectionFailure? If so, we do not have to change onFetchFailure signature and just call handleFailure with ConnectionFailure reason here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, please see old code in source/common/http/rest_api_fetcher.cc.

@htuch
Copy link
Member

htuch commented Sep 5, 2019

@lambdai for first pass. @ramaraochavali I thought we had already done this, do you have an idea of what is different?

@htuch htuch self-assigned this Sep 5, 2019
@ramaraochavali
Copy link
Contributor

@htuch we fixed it for EDS earlier. This fix is trying to generalize for all subscriptions.

break;
case Config::ConfigUpdateFailureReason::FetchTimedout:
ENVOY_LOG(warn, "REST config: initial fetch timeout for {}", path_);
stats_.init_fetch_timeout_.inc();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: reset timer if not nullptr, or simply disableInitFetchTimeoutTimer() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disableInitFetchTimeoutTimer() added

Copy link
Contributor

@lambdai lambdai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically it's safe when the callback need to handle fewer conditions(connection failure)

The questions is, is there any callback expecting connection failure signal, such as clean up? I think the answer is no.

@@ -251,13 +251,8 @@ bool EdsClusterImpl::updateHostsPerLocality(
return false;
}

void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason reason,
void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a log line here? It's a critical milestone to trigger an init complete

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add a ASSERT(reason != Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert added.

Currently, if gRPC config stream disconnected while Envoy waiting for
initial xDS response, xDS implementations' onConfigUpdateFailed() will
allow Envoy startup to continue. This may cause Envoy begins taking
traffics while route/cluster/endpoint config are still missing and
return "404 NR" or "503 NR".

This change makes Envoy waiting for initial xDS response until
initial_fetch_timeout if specified.

Signed-off-by: lhuang8 <[email protected]>
@l8huang
Copy link
Contributor Author

l8huang commented Sep 9, 2019

@lambdai PR updated, could you please take a review?

@lambdai
Copy link
Contributor

lambdai commented Sep 9, 2019

/lgtm
@htuch Can you do the final review?

@stale
Copy link

stale bot commented Sep 16, 2019

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 16, 2019
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo a tiny nit.

void HttpSubscriptionImpl::handleFailure(Config::ConfigUpdateFailureReason reason,
const EnvoyException* e) {

switch (reason) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to dedupe here, but agree we can push this to a later PR.

// If init_fetch_timeout is non-zero, server will continue startup after it timeout
return;
}

callbacks_.onConfigUpdateFailed(reason, e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@l8huang can you actually switch this to reason == the const value? I agree that what you have there is "better style" for the reason you give, but it's actually quite jarring as most of the Envoy code base doesn't do this, and the "conform to local practices" style argument wins out IMHO.

@stale stale bot removed stale stalebot believes this issue/PR has not been touched recently labels Sep 18, 2019
Copy link
Contributor Author

@l8huang l8huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code updated according to comment, please take a look.

// If init_fetch_timeout is non-zero, server will continue startup after it timeout
return;
}

callbacks_.onConfigUpdateFailed(reason, e);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will do

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@htuch htuch merged commit d42e14e into envoyproxy:master Sep 19, 2019
@fredlas
Copy link
Contributor

fredlas commented Sep 19, 2019

@l8huang Nice change; looking at your code and description, this certainly seems like a better way to handle the situation! I have tried to adapt it into my ongoing xDS PR, #7293.

Could you take a quick look at
https://github.com/envoyproxy/envoy/blob/1ac8fb432906be165014623e6aa86dcb432a1d23/source/common/config/delta_subscription_impl.cc
and
https://github.com/envoyproxy/envoy/blob/1ac8fb432906be165014623e6aa86dcb432a1d23/source/common/config/delta_subscription_state.cc

to make sure I have the right idea? (The flow is that DeltaSubscriptionState::handleEstablishmentFailure() calls DeltaSubscriptionImpl::onConfigUpdateFailed()).

@l8huang
Copy link
Contributor Author

l8huang commented Sep 20, 2019

@fredlas LGTM

@fredlas
Copy link
Contributor

fredlas commented Sep 20, 2019

Thank you! :)

danzh2010 pushed a commit to danzh2010/envoy that referenced this pull request Sep 24, 2019
…nvoyproxy#8152)

Currently, if gRPC config stream disconnected while Envoy waiting for
initial xDS response, xDS implementations' onConfigUpdateFailed() will
allow Envoy startup to continue. This may cause Envoy begins taking
traffics while route/cluster/endpoint config are still missing and
return "404 NR" or "503 NR".

This change makes Envoy waiting for initial xDS response until
initial_fetch_timeout if specified.

Risk Level: Medium
Testing: existing test cases updated
Fixes envoyproxy#8046

Signed-off-by: lhuang8 <[email protected]>
danzh2010 pushed a commit to danzh2010/envoy that referenced this pull request Oct 4, 2019
…nvoyproxy#8152)

Currently, if gRPC config stream disconnected while Envoy waiting for
initial xDS response, xDS implementations' onConfigUpdateFailed() will
allow Envoy startup to continue. This may cause Envoy begins taking
traffics while route/cluster/endpoint config are still missing and
return "404 NR" or "503 NR".

This change makes Envoy waiting for initial xDS response until
initial_fetch_timeout if specified.

Risk Level: Medium
Testing: existing test cases updated
Fixes envoyproxy#8046

Signed-off-by: lhuang8 <[email protected]>
danzh2010 pushed a commit to danzh2010/envoy that referenced this pull request Oct 4, 2019
…nvoyproxy#8152)

Currently, if gRPC config stream disconnected while Envoy waiting for
initial xDS response, xDS implementations' onConfigUpdateFailed() will
allow Envoy startup to continue. This may cause Envoy begins taking
traffics while route/cluster/endpoint config are still missing and
return "404 NR" or "503 NR".

This change makes Envoy waiting for initial xDS response until
initial_fetch_timeout if specified.

Risk Level: Medium
Testing: existing test cases updated
Fixes envoyproxy#8046

Signed-off-by: lhuang8 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lds: Envoy starts listening ports before it receives the first RDS response
5 participants