
add x-envoy-immediate-health-check-fail header support #1570

Merged · 3 commits · Aug 30, 2017
Conversation

mattklein123 (Member):

This feature adds the ability for data plane processing to cause a host
to be considered to have failed active health checking. This is currently
only used by the router filter and the health check filter, but could be
extended to other protocols later.

Fixes #1423

@mattklein123 (Member, Author):

@htuch will trade you for the other review tomorrow.

cc @alyssawilk also.

htuch (Member) left a comment:

Overall design looks good. I'm wondering if there is a way to simplify the ownership graph.

upstream host has failed :ref:`active health checking <arch_overview_health_checking>` (if the
cluster has been :ref:`configured <config_cluster_manager_cluster_hc>` for active health checking).
This can be used to fast fail an upstream host via standard data plane processing without waiting
for the next health check interval. See the :ref:`health checking overview
Member:

Can you comment on how the host can become considered healthy again (i.e. next health check)?

Contributor:

Alternatively, for load shedding we have a hard-coded duration for it to expire. It'd be nice to have either a set expiry time or an optional TTL as the header value.

Member (Author):

Yeah, for load shedding I would treat that as more of an outlier ejection event than an active health check event. I think that would be a useful feature, but it's out of scope for this PR.

Contributor:

Fair enough. I was mainly wondering if, for backward compatibility, we should make the value extensible. As I understand Envoy versioning, we could always change the requirements for the value as a breaking change gated on an Envoy version update. Is that about right?

Member (Author):

The current code doesn't look at the value of the header at all, so in the future we can do anything we want. But yes, in general we could have a deprecation window, etc. In this case, though, if we want to start using the header value, we should think about forward compatibility. I will open a follow-up issue to think about this more.

Contributor:

Sounds good, thanks!

Member (Author):

Note that the filter will automatically set the :ref:`x-envoy-immediate-health-check-fail
<config_http_filters_router_x-envoy-immediate-health-check-fail>` header if the
:ref:`/healthcheck/fail <operations_admin_interface_healthcheck_fail>` admin endpoint has been
called.
Member:

Is there a way to unset this state? I.e. call into the admin endpoint to become healthy again? I'm thinking specifically of using this feature for load shedding.

* special HTTP header is received, the data plane may decide to fast fail a host to avoid waiting
* for the full HC interval to elapse before determining the host is active HC failed.
*/
class HealthCheckerSink {
Member:

I'm wondering if there is a clearer term than "sink" here, maybe HealthCheckMonitor or something on the "watching" theme. My mental model is that we're looking for events from upstream that indicate that it's time to go unhealthy.

Member (Author):

Sure, that's fine. FWIW, I just replicated the naming from the outlier detector, which basically works the same way with the same naming. How about HealthCheckHostMonitor and DetectorHostMonitor?

void HealthCheckerImplBase::setUnhealthyCrossThread(const HostSharedPtr& host) {
// The threading here is complex. The cluster owns the only strong reference to the health
// checker. It might go away when we post to the main thread. We capture a weak reference and
// make sure it is still valid when we get to the other thread. Additionally, the host/session
Member:

It would be helpful to be clearer here about which thread is which when referring to the "other thread". I.e., we have a worker thread (presumably observing the passive header check) communicating with the main thread?

// may also be gone by then so we check that also.
std::weak_ptr<HealthCheckerImplBase> weak_this = shared_from_this();
dispatcher_.post([weak_this, host]() -> void {
std::shared_ptr<HealthCheckerImplBase> shared_this = weak_this.lock();
Member:

This logic seems complicated. Some of this is inherent due to ownership structure, but I wonder if it could be simplified by having the main thread (in the post body) do the cluster -> health checker resolution, and only passing in the cluster? Or host+cluster -> health checker resolution?

The weak pointers seem valid, but it becomes fairly hard to reason about the combination of shared_ptr and then multiple weak_ptrs in the sink.

Member (Author):

Again, FWIW, this is basically identical to the logic we use for outlier detection: https://github.com/lyft/envoy/blob/master/source/common/upstream/outlier_detection_impl.cc#L220

I agree the logic is complicated, but I'm not quite sure how to make it simpler. Even if we pass a cluster, we need to pass a weak_ptr, because the code relies on clusters going away on the main thread immediately. I figured it was better to keep all the shared_ptr/weak_ptr logic internal to this class, as we did in the outlier detector. Let me try adding some more comments.

@mattklein123 (Member, Author):
@htuch @alyssawilk PR updated per comments.

htuch (Member) left a comment:

LGTM

- * Set the host's outlier detector. Outlier detectors are assumed to be thread safe, however
- * a new outlier detector must be installed before the host is used across threads. Thus,
+ * Set the host's health checker monitor. Monitors are assumed to be thread safe, however
+ * a new monitor must be installed before the host is used across threads. Thus,
  * this routine should only be called on the main thread before the host is used across threads.
Member:

Maybe add an ASSERT verifying we're on the main thread in the implementation?

Member (Author):

There is no great way to do this because of how we use hosts in tests, etc. I think I'm going to skip this for now and will make a note to look at it in a follow-up.

@@ -63,7 +63,7 @@ class DetectorHostSink {
   virtual double successRate() const PURE;
 };

-typedef std::unique_ptr<DetectorHostSink> DetectorHostSinkPtr;
+typedef std::unique_ptr<DetectorHostMonitor> DetectorHostMonitorPtr;
Member:

Yeah, I think this nomenclature is easier to parse for me.

// 1) We capture a weak reference to the health checker and post it from worker thread to main
// thread.
// 2) On the main thread, we make sure it is still valid (as the cluster may have been destroyed).
// 3) Additionally, the host/session may also be gone by then so we check that also.
Member:

Thanks, this is clear to me now.

@mattklein123 mattklein123 merged commit 669f23c into master Aug 30, 2017
@mattklein123 mattklein123 deleted the hc_sink branch August 30, 2017 19:58
htuch added a commit to htuch/envoy that referenced this pull request Aug 31, 2017
mattklein123 pushed a commit that referenced this pull request Aug 31, 2017
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Signed-off-by: Jose Nino <[email protected]>
Signed-off-by: JP Simard <[email protected]>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Signed-off-by: Jose Nino <[email protected]>
Signed-off-by: JP Simard <[email protected]>