xDS: gRPC connection failure shouldn't make Envoy continue startup #8152

l8huang · 2019-09-04T23:57:52Z

xDS: gRPC connection failure shouldn't make Envoy continue startup

Description:
Currently, if the gRPC config stream disconnects while Envoy is waiting
for the initial xDS response, the xDS implementations'
onConfigUpdateFailed() allows Envoy startup to continue. This can cause
Envoy to begin taking traffic while route/cluster/endpoint config is
still missing, and to return "404 NR" or "503 NR".

This change makes Envoy wait for the initial xDS response until
initial_fetch_timeout, if specified.
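
A minimal sketch of the intended behavior, not the actual Envoy code: the enum and the `InitGate` class below are hypothetical stand-ins used only for illustration. The idea is that a dropped connection alone no longer marks the initial fetch as finished, while a first config update or an initial fetch timeout still does.

```cpp
#include <cassert>

// Hypothetical stand-in for Envoy's failure reasons (illustration only).
enum class ConfigUpdateFailureReason { ConnectionFailure, FetchTimedout, UpdateRejected };

// Hypothetical helper that tracks whether the initial-fetch phase of one
// xDS subscription is finished, i.e. whether startup may continue.
class InitGate {
public:
  // A first successful response always completes initialization.
  void onConfigUpdate() { ready_ = true; }

  // A failure completes initialization unless it is only a dropped connection.
  void onFailure(ConfigUpdateFailureReason reason) {
    if (reason == ConfigUpdateFailureReason::ConnectionFailure) {
      // Keep waiting: the stream can be retried, and the initial fetch
      // timeout (if configured) will later arrive as FetchTimedout.
      return;
    }
    ready_ = true;
  }

  bool ready() const { return ready_; }

private:
  bool ready_{false};
};
```

With this shape, a `ConnectionFailure` leaves `ready()` false, and only a later `FetchTimedout` (or a real update) lets startup proceed.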

Risk Level: Medium
Testing: existing test cases updated
Fixes #8046

Signed-off-by: lhuang8 [email protected]

ramaraochavali · 2019-09-05T09:56:13Z

source/common/config/http_subscription_impl.cc

+void HttpSubscriptionImpl::handleFailure(Config::ConfigUpdateFailureReason reason,
+                                         const EnvoyException* e) {
+
+  switch (reason) {


The code here is exactly the same as onConfigUpdateFailed in grpc_mux_subscription_impl, except for the logs. Can we refactor them into a utility that takes an ApiType and uses it in the logs?
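
One possible shape for such a utility, as a rough sketch only. `handleSubscriptionFailure`, `FailureStats`, the log strings, and the plain string used for the API type are all assumptions made for illustration; they are not Envoy APIs, and the real stats and logging macros differ.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins (illustration only).
enum class ConfigUpdateFailureReason { ConnectionFailure, FetchTimedout, UpdateRejected };

struct FailureStats {
  int update_failure{0};
  int update_rejected{0};
  int init_fetch_timeout{0};
};

// Shared failure handling for both subscription types. `api_type` is only
// used to prefix log lines. Returns true if the failure was forwarded.
bool handleSubscriptionFailure(const std::string& api_type, ConfigUpdateFailureReason reason,
                               FailureStats& stats, std::vector<std::string>& log,
                               const std::function<void(ConfigUpdateFailureReason)>& on_failed) {
  switch (reason) {
  case ConfigUpdateFailureReason::ConnectionFailure:
    stats.update_failure++;
    log.push_back(api_type + " config: connection failure");
    // Not forwarded, so startup keeps waiting for a response or a timeout.
    return false;
  case ConfigUpdateFailureReason::FetchTimedout:
    stats.init_fetch_timeout++;
    log.push_back(api_type + " config: initial fetch timeout");
    break;
  case ConfigUpdateFailureReason::UpdateRejected:
    stats.update_rejected++;
    log.push_back(api_type + " config: update rejected");
    break;
  }
  on_failed(reason);
  return true;
}
```

Passing the stats, log sink, and callback as parameters avoids needing a new interface on `Config::Subscription`, at the cost of a longer argument list.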

IMHO it is not necessary to use a common function to handle errors for both HTTP and gRPC subscriptions. They may need different error handling.

Not sure why it would be different once the message is received, but up to you.

GrpcMuxSubscriptionImpl and HttpSubscriptionImpl share Config::Subscription as a common interface, which only defines the subscription-related API; it has no API for getting stats_ and callbacks_, or for disableInitFetchTimeoutTimer(). That is, besides start() and updateResource(), they are not expected to be the same, although their implementations are similar.

I guess it's not a good idea to define a new interface for error handling or to use a template to generalize over the type. Maybe they could share a common base class that defines the error handling, but for this PR I want to limit the scope of the change :)

I think it would be nice to dedupe here, but agree we can push this to a later PR.

Please hold off on anything like this until #7293 is merged.

(The reason being that GrpcMuxSubscriptionImpl is about to be entirely replaced)

ramaraochavali · 2019-09-05T09:57:36Z

source/common/config/http_subscription_impl.cc

+  // any initial CDS/LDS discovery response, so here calls onConfigUpdateFailed()
+  // even reason is ConnectionFailure. After the test case fixed,
+  // onConfigUpdateFailed() shouldn't be called for ConnectionFailure.
+  callbacks_.onConfigUpdateFailed(reason, e);


But should we fix that case now? Otherwise, for HTTP subscriptions, onConfigUpdateFailed will still be called even when the connection failed, which is incorrect/inconsistent?

I suggest using another PR to fix "//test/integration:hotrestart_test". I don't fully understand that test case right now, and I'm not sure what the best way to modify it is.

PR #8162 was created to fix "//test/integration:hotrestart_test". It's a small config change that just removes dynamic_resources. After that is accepted, I will update the code here.

PR #8162 merged. This is also updated.

ramaraochavali · 2019-09-05T10:01:12Z

source/common/config/grpc_mux_subscription_impl.cc

+    // If init_fetch_timeout is non-zero, server will continue startup after it timeout
+    return;
+  }
+
  callbacks_.onConfigUpdateFailed(reason, e);


nit: calling this method only if reason != ConnectionFailure may be more readable than an early return?

Putting the const value on the left side of == is considered better style. It usually doesn't hurt readability, because the reader always needs to look at both sides of == to understand the condition.

Sorry, maybe my comment was not clear. I am not talking about the const on the left of ==.

I am suggesting the following instead of the return. Not a big deal though.

if (Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure != reason) { callbacks_.onConfigUpdateFailed(reason, e); }

IMO positive logic is preferred, and the comment in the if block is helpful for understanding what's going on.

@l8huang can you actually switch this to reason == the const value? I agree that what you have there is "better style" for the reason you give, but it's actually quite jarring as most of the Envoy code base doesn't do this, and the "conform to local practices" style argument wins out IMHO.

ok, will do
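
For reference, the three variants discussed in this thread behave identically; they differ only in style. The sketch below uses a hypothetical stand-in enum and returns a bool in place of actually calling callbacks_.onConfigUpdateFailed(), so the function names are illustrative and the exact original code is inferred from the comments above.

```cpp
#include <cassert>

// Hypothetical stand-in for Envoy's failure reasons (illustration only).
enum class ConfigUpdateFailureReason { ConnectionFailure, FetchTimedout, UpdateRejected };

// Each function returns true when the failure would be forwarded to
// callbacks_.onConfigUpdateFailed() in the real code.

// Early return with the constant on the left of ==.
bool forwardsConstantLeft(ConfigUpdateFailureReason reason) {
  if (ConfigUpdateFailureReason::ConnectionFailure == reason) {
    return false;
  }
  return true;
}

// Guarded call using !=, as suggested in the nit.
bool forwardsGuarded(ConfigUpdateFailureReason reason) {
  bool forwarded = false;
  if (ConfigUpdateFailureReason::ConnectionFailure != reason) {
    forwarded = true;
  }
  return forwarded;
}

// Early return with the variable on the left of ==, matching the
// prevailing style in the code base.
bool forwardsVariableLeft(ConfigUpdateFailureReason reason) {
  if (reason == ConfigUpdateFailureReason::ConnectionFailure) {
    return false;
  }
  return true;
}
```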

ramaraochavali · 2019-09-05T10:06:50Z

source/common/config/http_subscription_impl.cc

-  handleFailure(e);
+void HttpSubscriptionImpl::onFetchFailure(Config::ConfigUpdateFailureReason reason,
+                                          const EnvoyException* e) {
+  handleFailure(reason, e);


I thought onFetchFailure is always a ConnectionFailure? If so, we do not have to change the onFetchFailure signature and can just call handleFailure with the ConnectionFailure reason here?

No, please see the old code in source/common/http/rest_api_fetcher.cc.

htuch · 2019-09-05T17:31:01Z

@lambdai for first pass. @ramaraochavali I thought we had already done this, do you have an idea of what is different?

ramaraochavali · 2019-09-06T03:03:13Z

@htuch we fixed it for EDS earlier. This fix is trying to generalize for all subscriptions.

lambdai · 2019-09-06T05:51:27Z

source/common/config/http_subscription_impl.cc

+    break;
+  case Config::ConfigUpdateFailureReason::FetchTimedout:
+    ENVOY_LOG(warn, "REST config: initial fetch timeout for {}", path_);
+    stats_.init_fetch_timeout_.inc();


nit: reset timer if not nullptr, or simply disableInitFetchTimeoutTimer() ?

disableInitFetchTimeoutTimer() added
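
A rough sketch of what "reset timer if not nullptr" amounts to. `FakeTimer` and `SubscriptionSketch` are hypothetical stand-ins for illustration; the real timer type and member layout in Envoy may differ.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an event timer (illustration only).
struct FakeTimer {
  bool enabled{true};
  void disableTimer() { enabled = false; }
};

// Hypothetical subscription that owns an optional initial-fetch timer.
class SubscriptionSketch {
public:
  SubscriptionSketch() : init_fetch_timeout_timer_(new FakeTimer()) {}

  // Safe to call repeatedly: does nothing once the timer is gone.
  void disableInitFetchTimeoutTimer() {
    if (init_fetch_timeout_timer_) {
      init_fetch_timeout_timer_->disableTimer();
      init_fetch_timeout_timer_.reset();
    }
  }

  bool hasTimer() const { return init_fetch_timeout_timer_ != nullptr; }

private:
  std::unique_ptr<FakeTimer> init_fetch_timeout_timer_;
};
```

Calling the helper from every terminal path (first update, rejection, timeout) keeps the null check in one place.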

lambdai

Theoretically it's safe when the callback needs to handle fewer conditions (no connection failure).

The question is: is there any callback that expects the connection failure signal, e.g. for cleanup? I think the answer is no.

lambdai · 2019-09-06T06:03:02Z

source/common/upstream/eds.cc

@@ -251,13 +251,8 @@ bool EdsClusterImpl::updateHostsPerLocality(
  return false;
 }

-void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason reason,
+void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason,


Can we add a log line here? This is a critical milestone, since it triggers init completion.

Also add an ASSERT(reason != Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure)?

assert added.
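
A small sketch of the callback shape being discussed, using the standard `assert` in place of Envoy's ASSERT macro. `EdsClusterSketch`, the completion hook, and the log string are hypothetical and only illustrate the invariant that ConnectionFailure should no longer reach this callback.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for Envoy's failure reasons (illustration only).
enum class ConfigUpdateFailureReason { ConnectionFailure, FetchTimedout, UpdateRejected };

// Hypothetical cluster that finishes initialization when an update fails.
class EdsClusterSketch {
public:
  explicit EdsClusterSketch(std::function<void()> on_init_complete)
      : on_init_complete_(std::move(on_init_complete)) {}

  void onConfigUpdateFailed(ConfigUpdateFailureReason reason) {
    // The subscription layer is expected to filter out connection failures.
    assert(reason != ConfigUpdateFailureReason::ConnectionFailure);
    (void)reason; // Avoid an unused-parameter warning when asserts are compiled out.
    last_log_ = "EDS update failed; completing initialization";
    on_init_complete_();
  }

  const std::string& lastLog() const { return last_log_; }

private:
  std::function<void()> on_init_complete_;
  std::string last_log_;
};
```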

l8huang · 2019-09-09T19:15:11Z

@lambdai PR updated, could you please take a look?

lambdai · 2019-09-09T20:05:47Z

/lgtm
@htuch Can you do the final review?

stale · 2019-09-16T21:03:24Z

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

htuch

LGTM modulo a tiny nit.

htuch · 2019-09-18T02:14:49Z

source/common/config/http_subscription_impl.cc

+void HttpSubscriptionImpl::handleFailure(Config::ConfigUpdateFailureReason reason,
+                                         const EnvoyException* e) {
+
+  switch (reason) {


I think it would be nice to dedupe here, but agree we can push this to a later PR.

htuch · 2019-09-18T02:17:54Z

source/common/config/grpc_mux_subscription_impl.cc

+    // If init_fetch_timeout is non-zero, server will continue startup after it timeout
+    return;
+  }
+
  callbacks_.onConfigUpdateFailed(reason, e);


@l8huang can you actually switch this to reason == the const value? I agree that what you have there is "better style" for the reason you give, but it's actually quite jarring as most of the Envoy code base doesn't do this, and the "conform to local practices" style argument wins out IMHO.

l8huang

Code updated according to comment, please take a look.

l8huang · 2019-09-18T18:55:21Z

source/common/config/grpc_mux_subscription_impl.cc

+    // If init_fetch_timeout is non-zero, server will continue startup after it timeout
+    return;
+  }
+
  callbacks_.onConfigUpdateFailed(reason, e);


ok, will do

htuch

Thanks!

fredlas · 2019-09-19T21:45:09Z

@l8huang Nice change; looking at your code and description, this certainly seems like a better way to handle the situation! I have tried to adapt it into my ongoing xDS PR, #7293.

Could you take a quick look at
https://github.com/envoyproxy/envoy/blob/1ac8fb432906be165014623e6aa86dcb432a1d23/source/common/config/delta_subscription_impl.cc
and
https://github.com/envoyproxy/envoy/blob/1ac8fb432906be165014623e6aa86dcb432a1d23/source/common/config/delta_subscription_state.cc

to make sure I have the right idea? (The flow is that DeltaSubscriptionState::handleEstablishmentFailure() calls DeltaSubscriptionImpl::onConfigUpdateFailed()).

l8huang · 2019-09-20T20:53:30Z

@fredlas LGTM

fredlas · 2019-09-20T20:55:37Z

Thank you! :)

l8huang mentioned this pull request Sep 5, 2019

lds: Envoy starts listening ports before it receives the first RDS response #8046

Closed

ramaraochavali reviewed Sep 5, 2019

htuch requested a review from lambdai September 5, 2019 17:30

htuch assigned lambdai Sep 5, 2019

htuch self-assigned this Sep 5, 2019

lambdai reviewed Sep 6, 2019

lambdai mentioned this pull request Sep 6, 2019

config: full delta xDS (including ADS) support #7293

Merged

add assert and unify HTTP and gRPC error handling for connection failure

ab71a09

Signed-off-by: lhuang8 <[email protected]>

l8huang force-pushed the fix-8046 branch from 1a3629d to ab71a09 Compare September 6, 2019 22:22

stale bot added the stale label Sep 16, 2019

htuch suggested changes Sep 18, 2019

stale bot removed the stale label Sep 18, 2019

update if condition style

e530371

Signed-off-by: lhuang8 <[email protected]>

l8huang commented Sep 18, 2019

htuch approved these changes Sep 19, 2019

htuch merged commit d42e14e into envoyproxy:master Sep 19, 2019
