add x-envoy-immediate-health-check-fail header support #1570
Conversation
This feature adds the ability for data plane processing to cause a host to be considered active health check failed. This is currently only used by the router filter and the health check filter, but could be extended to other protocols later. Fixes #1423
@htuch will trade you for the other review tomorrow. cc @alyssawilk also.
Overall design looks good. I'm wondering if there is a way to simplify the ownership graph.
upstream host has failed :ref:`active health checking <arch_overview_health_checking>` (if the
cluster has been :ref:`configured <config_cluster_manager_cluster_hc>` for active health checking).
This can be used to fast fail an upstream host via standard data plane processing without waiting
for the next health check interval. See the :ref:`health checking overview
Can you comment on how the host can become considered healthy again (i.e. next health check)?
Alternately, for load shedding we have a hard-coded duration for it to expire. It'd be nice to have either a set expiry time or an optional TTL as the header value.
Yeah, for load shedding, IMO I would treat that as more of an outlier ejection event vs. an active health check event. I think that would be a useful feature, but I think it is out of scope for this PR.
Fair enough. I was mainly wondering if, for backward compatibility, we should make the value extensible. As I understand Envoy versioning, we could always change the requirements for the value as a breaking change gated on an Envoy version update - is this about right?
The current code doesn't look at the value of the header at all, so in the future we can do anything we want. But yes, in general we could have a deprecation window, etc. In this case, though, I would recommend that if we want to start using the header value we think about forward compatibility. I will open a follow-up issue to think about this more.
Sounds good, thanks!
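To make the documented behavior concrete, here is a rough sketch of the data-plane flow discussed in this thread (an illustrative assumption, not Envoy's actual router filter code; the container types, `onUpstreamResponseHeaders`, and the `setUnhealthy()` hook are stand-ins):

```cpp
// Illustrative sketch only: not Envoy's actual router filter code. The types and
// method names below are assumptions made for the example.
#include <iostream>
#include <map>
#include <string>

// Header added by this PR.
static const std::string kImmediateHealthCheckFail = "x-envoy-immediate-health-check-fail";

struct UpstreamHost {
  // Hypothetical hook that fast-fails the host's active health check state
  // (compare the HealthCheckHostMonitor discussion later in the review).
  void setUnhealthy() { active_hc_failed = true; }
  bool active_hc_failed{false};
};

// Called on the data plane when upstream response headers arrive.
void onUpstreamResponseHeaders(const std::map<std::string, std::string>& headers,
                               UpstreamHost& host) {
  if (headers.count(kImmediateHealthCheckFail) > 0) {
    // Fail the host right away instead of waiting for the next active health
    // check interval to notice it is unhealthy.
    host.setUnhealthy();
  }
}

int main() {
  UpstreamHost host;
  onUpstreamResponseHeaders({{kImmediateHealthCheckFail, "true"}}, host);
  std::cout << "host failed: " << host.active_hc_failed << "\n"; // prints "host failed: 1"
}
```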
Note that the filter will automatically set the :ref:`x-envoy-immediate-health-check-fail
<config_http_filters_router_x-envoy-immediate-health-check-fail>` header if the
:ref:`/healthcheck/fail <operations_admin_interface_healthcheck_fail>` admin endpoint has been
called.
Is there a way to unset this state? I.e. call into the admin endpoint to become healthy again? I'm thinking specifically of using this feature for load shedding.
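As a hedged illustration of the documented behavior quoted above (a sketch, not the actual health check filter implementation; `handleHealthCheckRequest` and `fail_mode` are made-up names), once the admin fail flag has been flipped the filter can answer health check probes with a failing status and attach the header so that calling Envoys fast-fail this host:

```cpp
// Sketch only: a simplified stand-in for the health check filter's behavior once the
// /healthcheck/fail admin endpoint has flipped the fail flag. Names are assumptions.
#include <map>
#include <string>

struct LocalHealthCheckResponse {
  int status_code{200};
  std::map<std::string, std::string> headers;
};

// fail_mode would be driven by the /healthcheck/fail admin state.
LocalHealthCheckResponse handleHealthCheckRequest(bool fail_mode) {
  LocalHealthCheckResponse response;
  if (fail_mode) {
    response.status_code = 503;
    // Tell the calling Envoy to immediately treat this host as active health check failed.
    response.headers["x-envoy-immediate-health-check-fail"] = "true";
  }
  return response;
}

int main() {
  const LocalHealthCheckResponse response = handleHealthCheckRequest(/*fail_mode=*/true);
  return response.status_code == 503 ? 0 : 1; // exits 0 when the failing path was taken
}
```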
 * special HTTP header is received, the data plane may decide to fast fail a host to avoid waiting
 * for the full HC interval to elapse before determining the host is active HC failed.
 */
class HealthCheckerSink {
I'm wondering if there is a clearer term than "sink" here, maybe HealthCheckMonitor or something on the "watching" theme. My mental model is that we're looking for events from upstream that indicate that it's time to go unhealthy.
Sure, that's fine. FWIW, I just replicated the naming from the outlier detector stuff, which basically works the same way with the same naming. How about HealthCheckHostMonitor and DetectorHostMonitor?
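For reference, a minimal sketch of what the renamed interface could look like, following the names proposed above (the exact signatures and the plain `= 0`/`using` spelling are assumptions made for a self-contained example, not the final code):

```cpp
#include <memory>

namespace Upstream {

/**
 * A per-host monitor that lets data plane code (e.g. the router or health check filter)
 * report that an upstream host should immediately be treated as having failed active
 * health checking, without waiting for the next health check interval.
 */
class HealthCheckHostMonitor {
public:
  virtual ~HealthCheckHostMonitor() = default;

  // Mark the host as immediately unhealthy. Expected to be callable from worker threads.
  virtual void setUnhealthy() = 0;
};

using HealthCheckHostMonitorPtr = std::unique_ptr<HealthCheckHostMonitor>;

} // namespace Upstream
```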
void HealthCheckerImplBase::setUnhealthyCrossThread(const HostSharedPtr& host) {
  // The threading here is complex. The cluster owns the only strong reference to the health
  // checker. It might go away when we post to the main thread. We capture a weak reference and
  // make sure it is still valid when we get to the other thread. Additionally, the host/session
Would be helpful to be clearer here on which thread is which when referring to "other thread". I.e. we have a worker thread (presumably monitoring the passive header check) communicating with the main thread?
  // may also be gone by then so we check that also.
  std::weak_ptr<HealthCheckerImplBase> weak_this = shared_from_this();
  dispatcher_.post([weak_this, host]() -> void {
    std::shared_ptr<HealthCheckerImplBase> shared_this = weak_this.lock();
This logic seems complicated. Some of this is inherent due to ownership structure, but I wonder if it could be simplified by having the main thread (in the post body) do the cluster -> health checker resolution, and only passing in the cluster? Or host+cluster -> health checker resolution?
The weak pointers seem valid, but it becomes fairly hard to reason about the combination of shared_ptr and then multiple weak_ptrs in the sink.
Again, FWIW, this is basically identical to the logic we use for outlier detection: https://github.com/lyft/envoy/blob/master/source/common/upstream/outlier_detection_impl.cc#L220
I agree the logic is complicated, but I'm not quite sure how to make it simpler. Even if we pass a cluster, we still need to pass a weak_ptr, because the code relies on clusters going away on the main thread immediately. I figured it was better to keep all the shared_ptr/weak_ptr logic internal to this thing, like we did in the outlier detector? Let me try adding some more comments.
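For readers following the ownership discussion, here is a self-contained sketch of the pattern being described. A plain function queue stands in for the main-thread dispatcher (it is not Envoy's real Event::Dispatcher API), and the shared/weak pointer dance mirrors the quoted code:

```cpp
// Standalone sketch of the cross-thread pattern discussed above; simplified stand-in types.
#include <functional>
#include <iostream>
#include <memory>
#include <queue>

// Minimal "dispatcher": workers post callbacks, the main thread drains them later.
class FakeDispatcher {
public:
  void post(std::function<void()> cb) { callbacks_.push(std::move(cb)); }
  void drain() {
    while (!callbacks_.empty()) {
      callbacks_.front()();
      callbacks_.pop();
    }
  }

private:
  std::queue<std::function<void()>> callbacks_;
};

struct Host {
  void setUnhealthy() { healthy = false; }
  bool healthy{true};
};
using HostSharedPtr = std::shared_ptr<Host>;

class HealthChecker : public std::enable_shared_from_this<HealthChecker> {
public:
  explicit HealthChecker(FakeDispatcher& dispatcher) : dispatcher_(dispatcher) {}

  // Called from a worker thread. The cluster owns the only strong reference to the health
  // checker, so we capture a weak reference and re-check it on the main thread: the cluster
  // (and with it the checker) may have been destroyed before the callback runs. The host is
  // captured as a shared_ptr so it stays alive for the duration of the callback.
  void setUnhealthyCrossThread(const HostSharedPtr& host) {
    std::weak_ptr<HealthChecker> weak_this = shared_from_this();
    dispatcher_.post([weak_this, host]() {
      std::shared_ptr<HealthChecker> shared_this = weak_this.lock();
      if (shared_this == nullptr) {
        return; // The health checker went away before the main thread got to us.
      }
      host->setUnhealthy();
    });
  }

private:
  FakeDispatcher& dispatcher_;
};

int main() {
  FakeDispatcher dispatcher;
  auto host = std::make_shared<Host>();

  {
    auto checker = std::make_shared<HealthChecker>(dispatcher);
    checker->setUnhealthyCrossThread(host); // posted while the checker is still alive
  } // checker destroyed here, as if the cluster had been removed

  dispatcher.drain(); // the callback runs, notices the checker is gone, and bails out
  std::cout << "host healthy: " << host->healthy << "\n"; // prints "host healthy: 1"
}
```

Draining the queue after the checker has been destroyed shows why the weak_ptr check matters: the posted callback simply returns instead of touching a dead health checker.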
@htuch @alyssawilk PR updated per comments.
LGTM
-   * Set the host's outlier detector. Outlier detectors are assumed to be thread safe, however
-   * a new outlier detector must be installed before the host is used across threads. Thus,
+   * Set the host's health checker monitor. Monitors are assumed to be thread safe, however
+   * a new monitor must be installed before the host is used across threads. Thus,
    * this routine should only be called on the main thread before the host is used across threads.
Maybe add an ASSERT verifying we're on the main thread in the implementation?
There is no great way to do this because of how we use hosts in tests, etc. I think I'm going to skip this for now. I will make a note to look at it in a follow up.
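For context, one way such an assertion could look (a hypothetical sketch of the suggestion above, not something added in this PR; as noted, host usage in tests makes this awkward in practice):

```cpp
// Hypothetical sketch of the suggested main-thread assertion; not part of this PR.
#include <cassert>
#include <thread>

class HostImpl {
public:
  explicit HostImpl(std::thread::id main_thread_id) : main_thread_id_(main_thread_id) {}

  void setHealthCheckerMonitor(/* HealthCheckHostMonitorPtr monitor */) {
    // Guard against installing a new monitor from a worker thread after the host has
    // already been shared across threads.
    assert(std::this_thread::get_id() == main_thread_id_);
    // ... install the monitor ...
  }

private:
  std::thread::id main_thread_id_;
};

int main() {
  HostImpl host(std::this_thread::get_id());
  host.setHealthCheckerMonitor(); // fine: we are on the thread that created the host
}
```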
@@ -63,7 +63,7 @@ class DetectorHostSink {
   virtual double successRate() const PURE;
 };

-typedef std::unique_ptr<DetectorHostSink> DetectorHostSinkPtr;
+typedef std::unique_ptr<DetectorHostMonitor> DetectorHostMonitorPtr;
Yeah, I think this nomenclature is easier to parse for me.
// 1) We capture a weak reference to the health checker and post it from worker thread to main
//    thread.
// 2) On the main thread, we make sure it is still valid (as the cluster may have been destroyed).
// 3) Additionally, the host/session may also be gone by then so we check that also.
Thanks, this is clear to me now.
Was broken by interaction of envoyproxy#1521 and envoyproxy#1570.