
Proxy should have active health checks to confirm the state of destinations #228

Closed · Tratcher opened this issue Jun 4, 2020 · 20 comments · Fixed by #459
Labels: Partner Blocking: Island Gateway · Type: Bug (Something isn't working) · User Story (Used for divisional .NET planning)

Comments

@Tratcher (Member) commented Jun 4, 2020

We have the IHealthProbeWorker and associated infrastructure, but it's all internal and nobody ever starts it.

This was forked from the original partner team code before health checks were completely implemented. Some missing pieces:

  • An IHostedService that resolves the IHealthProbeWorker and starts and stops it when needed.
  • DI extensions for registering the health services (a rough sketch of both pieces follows below).
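
A minimal sketch of the two missing pieces, assuming IHealthProbeWorker exposes StartAsync/StopAsync (the member names, the concrete HealthProbeWorker registration, and the AddHealthProbes extension are assumptions for illustration, not the current API):

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

// Starts and stops the probe worker with the host lifetime.
internal class HealthProbeHostedService : IHostedService
{
    private readonly IHealthProbeWorker _worker;

    public HealthProbeHostedService(IHealthProbeWorker worker)
    {
        _worker = worker;
    }

    public Task StartAsync(CancellationToken cancellationToken) => _worker.StartAsync(cancellationToken);

    public Task StopAsync(CancellationToken cancellationToken) => _worker.StopAsync(cancellationToken);
}

public static class HealthProbeServiceCollectionExtensions
{
    // Hypothetical DI extension registering the health probe services.
    public static IServiceCollection AddHealthProbes(this IServiceCollection services)
    {
        services.AddSingleton<IHealthProbeWorker, HealthProbeWorker>();
        services.AddHostedService<HealthProbeHostedService>();
        return services;
    }
}

App startup would then just call services.AddHealthProbes().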

@davidni

Tratcher added the "Type: Bug" label on Jun 4, 2020
@davidni (Contributor) commented Jun 5, 2020

@Tratcher We've made substantial changes in this area since the hard fork:

  • Debouncing of failures
  • Wired up with signals to react to config changes immediately
  • Logging and metrics reporting
  • Fixed exception handling

When are you planning to tackle this? I don't have the bandwidth to prepare a PR until at least late next week.

@Tratcher (Member, Author) commented Jun 5, 2020

We haven't scheduled it yet; I only filed it for tracking once we realized it was incomplete.

@davidni (Contributor) commented Jun 6, 2020

FYI, I've started to clean up and refactor our internal implementation in preparation for submitting a PR here. The earliest ETA remains late next week.

@Tratcher (Member, Author) commented:

When looking at the design, also consider #304 for pulling health state from another service.

@alnikola (Contributor) commented Sep 22, 2020

Design proposal

Abstract

From time to time, destinations can become unhealthy and start failing requests for various reasons. To prevent request failures and maintain a good quality of service, YARP must monitor destination health and stop sending traffic to destinations that have become unhealthy until they recover.

Requirements

  • Monitor destination health and promptly stop proxying traffic to destinations that have become unhealthy
  • Keep the health monitoring overhead on the main proxy flow as small as possible
  • Run health checks even when no traffic is currently going to destinations, to reduce the probability of failing the next client request
  • Support an external destination health data source which can either completely substitute for the built-in health checks or be combined with them (see Health checking should allow for health data to come from a 3rd service #304)

Components

  • PassiveHealthCheckMiddleware - intercepts successful and failed requests flowing through the pipeline and notifies IPassiveHealthWatcher if the passive health check is enabled for the given cluster.
    Policy strategy examples: "take last status", "mark as unhealthy after the first failure, mark as healthy after all successes in an X time period", "ignore some unsuccessful response codes", etc.
  • IPassiveHealthWatcher - reacts to request proxying results reported by the PassiveHealthCheckMiddleware and passes them to the IPassiveHealthCheckPolicy configured for the given cluster to determine a destination's passive health check status. The passive watcher gets called for both successfully and unsuccessfully proxied requests.
  • IPassiveHealthCheckPolicy - calculates a destination's new passive health check status and updates the DestinationDynamicState.Health.Passive property if the new value is different from the old one. Some policies can additionally leverage historic destination health storage for more advanced health evaluation algorithms. Assigning Unhealthy to DestinationDynamicState.Health.Passive also sets Unhealthy on DestinationDynamicState.Health.Current, which temporarily deactivates the destination by removing it from the HealthyDestinations collection. Subsequently, the destination gets registered with IReactivationScheduler to be reactivated after the configured period.
  • IReactivationScheduler - brings back destinations previously deactivated by the passive health check by setting their DestinationDynamicState.Health.Passive to Healthy after the configured ReactivationPeriod has elapsed.
  • GlobalProxyInfo.IsReady - flag indicating proxy readiness to serve traffic.
  • IActiveHealthMonitor - proactively pings destinations once per configured period and calls the IActiveHealthCheckPolicy configured for the cluster with the probing results to determine a destination's active health check status. It works in two modes: a forced initial check and periodic probing, as defined by the configuration. On startup, IProxyConfigProvider calls IActiveHealthMonitor.ForceCheckAll to determine the initial health status of all configured destinations. This call is made right after the initial IProxyConfig has been successfully applied, but before setting the GlobalProxyInfo.IsReady flag. In parallel with running the initial health check, IActiveHealthMonitor also listens for GlobalProxyInfo.IsReady changes, and once it is set, it switches to periodic probing mode. It first pulls in all the Clusters stored in IClusterManager and subscribes to IClusterManager.Items signal changes, then starts periodic probing on Clusters with active health checks enabled and reacts to changes in the IClusterManager.Items collection.
  • IActiveHealthCheckPolicy - calculates a destination's new active health check status and updates the DestinationDynamicState.Health.Active property if the new value is different from the old one. Some policies can additionally leverage historic destination health storage for more advanced health evaluation algorithms. Assigning Unhealthy to DestinationDynamicState.Health.Active also sets Unhealthy on DestinationDynamicState.Health.Current, which removes the destination from the HealthyDestinations collection.
  • DestinationDynamicState.Health.Current - the current destination health status, calculated as a boolean AND of the Passive and Active property values (see the sketch after this list). Additionally, it can be directly overwritten by IExternalDestinationHealthListener applying an external health snapshot that arrived at its endpoint. Setting Health.Current to Unhealthy removes the given destination from ClusterDynamicState.HealthyDestinations, which prevents it from receiving any client traffic.
  • IExternalDestinationHealthListener - opens an HTTP endpoint on YARP itself which can be called by 3rd-party health data providers (see Health checking should allow for health data to come from a 3rd service #304) to explicitly set a destination's health status. If the newly received destination health status is different from the current value of DestinationDynamicState.Health.Active, it sets this property to the new value. It gets registered with an extension method which also registers the corresponding endpoint on Kestrel. It completely replaces the active health check subsystem.
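
To make the Health.Current derivation concrete, a minimal sketch of the boolean-AND composition described above (the DestinationHealth enum values are assumptions based on this proposal, not final API):

public enum DestinationHealth
{
    Healthy,
    Unhealthy
}

public class CompositeDestinationHealth
{
    public CompositeDestinationHealth(DestinationHealth active, DestinationHealth passive)
    {
        Active = active;
        Passive = passive;

        // "Boolean AND": the destination counts as healthy only if both checks report Healthy.
        Current = (active == DestinationHealth.Healthy && passive == DestinationHealth.Healthy)
            ? DestinationHealth.Healthy
            : DestinationHealth.Unhealthy;
    }

    public DestinationHealth Active { get; }

    public DestinationHealth Passive { get; }

    public DestinationHealth Current { get; }
}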

Health check policies

The default implementation provides the following active and passive health check policies.

Passive

  • Failure rate (share of proxied requests that failed in the given time period) - upper limit. A rough sketch of this policy follows below.
  • Minimum throughput (number of proxied requests successfully completed in the given time period) - lower limit

Active

  • Any 5xx failure
  • Probe failure rate (share of probes that failed in the given time period) - upper limit
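
For illustration, a rough sketch of the passive failure-rate policy against the IPassiveHealthCheckPolicy shape proposed in the API section below. The cumulative counters, the hard-coded threshold and the minimum sample size are simplifications (a real policy would use a sliding time window and the "PassiveHealthCheck.FailureRate.*" metadata shown in the configuration example), and the actual state update is only indicated in comments:

using System.Collections.Concurrent;
using Microsoft.AspNetCore.Http;

public class FailureRatePassiveHealthCheckPolicy : IPassiveHealthCheckPolicy
{
    // Simplification: cumulative per-destination counters instead of a sliding time window.
    private readonly ConcurrentDictionary<DestinationInfo, (long Total, long Failed)> _counters =
        new ConcurrentDictionary<DestinationInfo, (long Total, long Failed)>();

    private const double FailureRateLimit = 0.3; // Would come from cluster metadata in practice.
    private const long MinSampleSize = 10;

    public void RequestFailed(ClusterConfig cluster, DestinationInfo destination, HttpContext context, IProxyErrorFeature error)
        => Evaluate(destination, failed: true);

    public void RequestSucceeded(ClusterConfig cluster, DestinationInfo destination, HttpContext context)
        => Evaluate(destination, failed: false);

    private void Evaluate(DestinationInfo destination, bool failed)
    {
        var counters = _counters.AddOrUpdate(
            destination,
            _ => (1, failed ? 1L : 0L),
            (_, c) => (c.Total + 1, c.Failed + (failed ? 1 : 0)));

        if (counters.Total >= MinSampleSize && (double)counters.Failed / counters.Total > FailureRateLimit)
        {
            // Here the policy would set DestinationDynamicState.Health.Passive to Unhealthy,
            // which also flips Health.Current and hands the destination to IReactivationScheduler.
        }
    }
}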

Configuration

All but one of the active and passive health check settings are set at the cluster level. The only exception is an optional "Destination/Health" element specifying a separate active health check endpoint. The actual active health check probing URI is composed from "Destination/Address" + (optional) "Destination/Health" + "Cluster/HealthChecks/Path".
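
As a worked example, under the "destination (or alternate) prefix + cluster health path" composition discussed later in this thread, and using the configuration sample from the API section below: cluster1/destination1, which specifies Health "https://localhost:4040/", would be probed at https://localhost:4040/Health, while cluster1/destination2, which has no Health element, would be probed at http://localhost:10010/Health.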

Active health check settings

  • Enabled - indicates whether active health checks are enabled. Default false.
  • Interval - the period between health check requests. Default "00:01:00".
  • Timeout - the health check request timeout. Default "00:00:10".
  • Policy - the name of the policy evaluating destination health status based on probing results.
  • Path - the health check path used on all of the cluster's destinations.

Passive health check settings

  • Enabled - indicates whether passive health checks are enabled. Default false.
  • Policy - the name of the policy evaluating destination health based on the successes and failures of proxied requests.
  • ReactivationPeriod - the period after which an Unhealthy destination's passive health status is reset back to Healthy. Default "00:05:00".

Data Flow

(Diagram: health checks with reactivation)

API

Configuration

{
    "Clusters": {
        "cluster1": {
            "LoadBalancing": {
                "Mode": "Random"
            },
            "HealthChecks": {
                "Active": {
                    "Enabled": "true",
                    "Interval": "00:05:00",
                    "Timeout": "00:00:30",
                    "Policy": "FailOnAny5xx",
                    "Path": "/Heath"
                },
                "Passive": {
                    "Enabled": "true",
                    "Policy": "FailureRate",
                    "ReactivationPeriod": "00:05:00"
                }
            },
            "Destinations": {
                "cluster1/destination1": {
                    "Address": "https://localhost:10000/",
                    "Health": "https://localhost:4040/"
                },
                "cluster1/destination2": {
                    "Address": "http://localhost:10010/"
                }
            },
            "Metadata": {
                "PassiveHealthCheck.FailureRate.Period": "00:01:00",
                "PassiveHealthCheck.FailureRate.Percentage": "0.3"
            }
        }
    }
}

Code

// Abstractions
    public class ActiveHealthCheckOptions
    {
        public bool Enabled { get; set; }

        public TimeSpan Interval { get; set; }

        public TimeSpan Timeout { get; set; }

        public string Policy { get; set; }

        public string Path { get; set; }
    }

    public class PassiveHealthCheckOptions
    {
        public bool Enabled { get; set; }

        public string Policy { get; set; }

        public TimeSpan ReactivationPeriod { get; set; }
    }

    public class HealthCheckOptions
    {
        public ActiveHealthCheckOptions Active { get; set; }

        public PassiveHealthCheckOptions Passive { get; set; }
    }

    public class Cluster
    {
        // ...

        public HealthCheckOptions HealthChecks { get; set; }
    }

// Runtime model
    public class CompositeDestinationHealth
    {
        public CompositeDestinationHealth(DestinationHealth active, DestinationHealth passive)
        {
            // ...
        }

        public CompositeDestinationHealth(DestinationHealth externalCurrent)
        {
            // ...
        }

        public DestinationHealth Active { get; }

        public DestinationHealth Passive { get; }

        public DestinationHealth Current { get; }
    }

    public class DestinationDynamicState
    {
        public CompositeDestinationHealth Health { get; }
    }

    public interface IModelChangeListener
    {
        void OnClustersChanged(ClusterInfo cluster);
    }

// Middleware
    internal class HealthCheckMiddleware
    {
        public HealthCheckMiddleware(RequestDelegate next, IPassiveHealthWatcher watcher, IOperationLogger<HealthCheckMiddleware> operationLogger)
        {
            // ...
        }

        public async Task Invoke(HttpContext context)
        {
            // ...
        }
    }

// Active health checks
    public interface IActiveHealthMonitor : IDisposable
    {
        Task ForceCheckAll(IEnumerable<ClusterInfo> allClusters);
    }

    public interface IActiveHealthCheckPolicy
    {
        void ProbingFailed(ClusterConfig cluster, DestinationInfo destination, HttpStatusCode code, Exception exception);

        void ProbingSucceeded(ClusterConfig cluster, DestinationInfo destination);
    }

    public abstract class BaseActiveHealthMonitor : IActiveHealthMonitor, IModelChangeListener
    {
    }

// Passive health checks
    public interface IPassiveHealthWatcher
    {
        void RequestProxied(ClusterConfig cluster, DestinationInfo destination, HttpContext context, IProxyErrorFeature? error);
    }

    public interface IPassiveHealthCheckPolicy
    {
        void RequestFailed(ClusterConfig cluster, DestinationInfo destination, HttpContext context, IProxyErrorFeature error);

        void RequestSucceeded(ClusterConfig cluster, DestinationInfo destination, HttpContext context);
    }

    public interface IReactivationScheduler
    {
        void ScheduleMarkingHealthy(ClusterConfig cluster, DestinationInfo destination);
    }
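
Not part of the proposal itself, but as an illustration of how a custom policy could be plugged in, a hedged sketch of DI registration via TryAddEnumerable (the extension method name and the by-name policy lookup are hypothetical):

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection.Extensions;

public static class HealthCheckPolicyServiceCollectionExtensions
{
    // Registers an additional passive policy alongside the built-in ones, assuming the
    // proxy resolves IEnumerable<IPassiveHealthCheckPolicy> and picks one per cluster config.
    public static IServiceCollection AddPassiveHealthCheckPolicy<TPolicy>(this IServiceCollection services)
        where TPolicy : class, IPassiveHealthCheckPolicy
    {
        services.TryAddEnumerable(ServiceDescriptor.Singleton<IPassiveHealthCheckPolicy, TPolicy>());
        return services;
    }
}

Usage would then look like services.AddPassiveHealthCheckPolicy<FailureRatePassiveHealthCheckPolicy>().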

@Tratcher (Member, Author) commented:

Is there also a requirement for the reverse scenario, pushing health data to an external service? Or at least publicly exposing it so such an operation could be performed? It seems like #306 would also require access to health data.

IPassiveHealthWatcher - reacts to requests proxying results reported by the ProxyInvokerMiddleware

There's no technical reason it needs to be part of ProxyInvokerMiddleware, right? I think the only unique knowledge ProxyInvokerMiddleware has right now is which DestinationInfo was finally selected, and that's data we should find a way to preserve regardless. IPassiveHealthWatcher should be inserted into the pipeline with its own middleware to maximize customization flexibility, even up to being able to remove/replace the whole middleware.

IDestinationHealthEvaluationStrategy

In the diagram, both IActiveHealthMonitor and IPassiveHealthWatcher should be pointing at the same IDestinationHealthEvaluationStrategy, right?

it updates Health property on the DestinationInfo.DynamicStateSignal

We'll need to discuss the public usage of signals here. We've avoided their use in public APIs so far.

@Tratcher (Member, Author) commented:

IPassiveHealthWatcher seems like it should be able to examine responses in more detail to decide what it considers a failure. Should it be passed the HttpContext? E.g. the destination could respond with a 503 or a 400, and that wouldn't generate an IProxyErrorFeature today; IProxyErrorFeature primarily captures exceptions.

@Tratcher (Member, Author) commented:

Do the active health probes use the same per-cluster HttpClient as the proxied requests?

  • Pro:
    • Re-uses existing connections, configuration (e.g. client certificates)
    • More accurately reflects the proxied request experience
    • Reacts to cluster config changes, central point of config
  • Con:
    • The health checks may be queued behind other requests and take longer to run.

Where do we configure the active health check endpoint?

  • At the cluster level supply a path like /health-checks.
  • At the destination level allow an alternate prefix (scheme, host, port, path) so you can specify that health checks are done on a different scheme/port.
  • Active health checks are sent to the destination (or alternate) prefix + the cluster health path.

@samsp-msft (Contributor) commented Sep 24, 2020

  • Con:

    • The health checks may be queued behind other requests and take longer to run.

Is there a timeout for the health checks? If so, the above may actually be a good thing - the host is obviously overloaded if it can't answer - or would that lead to more trouble?

Where do we configure the active health check endpoint?

  • At the cluster level supply a path like /health-checks.
  • At the destination level allow an alternate prefix (scheme, host, port, path) so you can specify that health checks are done on a different scheme/port.
  • Active health checks are sent to the destination (or alternate) prefix + the cluster health path.

I like the idea of having an optional path scheme on the destination - possibly as a relative URI - so the port and path can be changed if necessary.

@Tratcher (Member, Author) commented Sep 24, 2020

For the config schema, I'm expecting health checks to have their own per-cluster section:

        "cluster1": {
            "LoadBalancing": {
                "Mode": "Random"
            },
            "HealthChecks": {
                "Enabled": "true",
                "Interval", "00:05:00"
                "Timeout", "00:00:30"
                "Strategy": "TakeLatest",
                "Path": "/Heath",
            }
            "Destinations": {
                "cluster1/destination1": {
                    "Address": "https://localhost:10001/",
                    "Health": "http://localhost:4040/",
                }
            }
        }

See some of the existing settings we have in the system:
https://github.com/microsoft/reverse-proxy/blob/master/src/ReverseProxy/Configuration/Contract/HealthCheckData.cs

Edit: Not sure about the schema for enabling/disabling passive health checks, or what config they might need. Any config there seems like it would be strategy specific and likely exposed via cluster metadata.

@samsp-msft (Contributor) commented:

If a path is set on the health checks and a host on the destination, do they get appended, or should a path on the destination be considered an override?

@Tratcher (Member, Author) commented:

I'd recommend they append for consistency. That's how the cluster path would apply to the normal destination prefix.

@davidni (Contributor) commented Sep 25, 2020

Thinking about previous discussions about "gradual warm-up" of destinations as well as soft vs. hard failure modes, would it perhaps make sense to make the health signal a double instead of a boolean? The AND could be implemented as a multiply, and the final health value for a destination could be multiplied with its weight to affect final routing decisions.

This would enable interesting scenarios (e.g. reduce the weight of a destination if its latency is increasing, but don't remove it from rotation entirely, to mitigate a cascading failure / snowball effect). To be clear, this is more future-looking, and we don't need it right away.
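
A tiny sketch of the idea, purely illustrative and not a proposed API: the boolean AND becomes a multiplication, and the load-balancing weight scales with the resulting health factor.

public static class WeightedHealth
{
    // Health expressed as a factor in [0.0, 1.0]; 1.0 = fully healthy, 0.0 = out of rotation.
    public static double CompositeHealth(double active, double passive) => active * passive;

    // A degraded destination keeps a reduced share of traffic instead of being removed entirely.
    public static double EffectiveWeight(double configuredWeight, double compositeHealth)
        => configuredWeight * compositeHealth;
}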

@alnikola (Contributor) commented Sep 25, 2020

That's an interesting idea. Though, I was thinking about a different approach to gradual warm-up where we keep destination weight and health status separate. That way, a new class GradualWarmUpPolicy : IDestinationActivationPolicy would be yet another recipient of the destination's health change event, and it would react to the "Unhealthy -> Healthy" transition by gradually increasing the destination's weight and thus the load.
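
A rough sketch of what such a warm-up ramp could look like (GradualWarmUpPolicy itself is omitted; the linear ramp and all parameter names here are hypothetical):

using System;

public static class GradualWarmUp
{
    // After an "Unhealthy -> Healthy" transition, the weight ramps linearly from a small
    // fraction of the configured value back to 100% over the warm-up window.
    public static double WeightDuringWarmUp(
        double configuredWeight,
        DateTime reactivatedAtUtc,
        DateTime nowUtc,
        TimeSpan warmUpDuration,
        double initialFraction = 0.1)
    {
        var progress = (nowUtc - reactivatedAtUtc).TotalSeconds / warmUpDuration.TotalSeconds;
        progress = Math.Clamp(progress, 0.0, 1.0);
        return configuredWeight * (initialFraction + (1.0 - initialFraction) * progress);
    }
}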

@Tratcher (Member, Author) commented:

@davidni, that's a good point, health isn't necessarily binary. I've seen a number of systems that use at least three states (healthy, degraded, offline). If we didn't want to go with a full double, we might still consider using several buckets, including one for slow start. This may have interesting effects in the stack, like maybe load balancing prefers healthy endpoints, but session affinity allows any state but offline.

@alnikola (Contributor) commented:

Health state buckets seem like a good compromise between binary and double values.

@Tratcher (Member, Author) commented:

IActiveHealthMonitor depends on IClusterManager and Signals for config change notifications, neither of which are currently public. We'll need a public solution here for IActiveHealthMonitor to be a viable extensibility point.

@Tratcher (Member, Author) commented:

I'm trying to understand the separation of responsibilities between IPassiveHealthWatcher vs IPassiveHealthCheckPolicy and IActiveHealthMonitor vs IActiveHealthCheckPolicy.

IPassiveHealthWatcher is passed the entire request/response, makes some determination about that being a success or failure, looks up the policy for the current cluster, and then calls IPassiveHealthCheckPolicy with the same parameters. Determining success or failure seems like it should be a concern for IPassiveHealthCheckPolicy, not IPassiveHealthWatcher? Would it make more sense if IPassiveHealthWatcher's sole responsibility was to look up the policy for a cluster and call it, then have the policy determine success or failure? That would also help with our non-binary ideas around health states.

IActiveHealthMonitor has a similar concern. IActiveHealthMonitor's responsibility is scheduling and sending the health probes, not determining if they succeeded or failed. Here again I think it'd make more sense to pass the whole response to the cluster specific IActiveHealthCheckPolicy and let it determine success/failure/health. The response may also contain hints like "retry-after" that could influence the health state algorithm beyond the scope of a single probe.
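
To illustrate the shape being suggested, an alternative policy surface where the watcher is a pure dispatcher and the policy receives the full proxying result (this is a hypothetical variant of the proposal's interface, not an agreed design):

// Alternative to the RequestFailed/RequestSucceeded split: the policy itself decides
// what counts as a failure (status codes, IProxyErrorFeature, latency, retry-after, ...).
public interface IPassiveHealthCheckPolicy
{
    void RequestProxied(ClusterConfig cluster, DestinationInfo destination, HttpContext context, IProxyErrorFeature? error);
}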

@alnikola (Contributor) commented Sep 29, 2020

IActiveHealthMonitor depends on IClusterManager and Signals for config change notifications, neither of which are currently public. We'll need a public solution here for IActiveHealthMonitor to be a viable extensibility point.

Valid point. To avoid exposing IClusterManager and signals, I propose propagating Cluster changes via push instead of pulling the cluster collection from IClusterManager.
Specifically:

  1. Add a new interface IModelChangeListener with an OnClustersChanged(Cluster cluster) method. It can be implemented by several services, so all implementations will be registered via TryAddEnumerable.
  2. Add IModelChangeListener as a base interface of IActiveHealthMonitor.
  3. Add a virtual method OnItemChanged(T item) to ItemManagerBase<T> that is called right after the UpdateSignal call in ItemManagerBase<T>.GetOrCreateItem.
  4. Override OnItemChanged in ClusterManager with code calling OnClustersChanged on all registered IModelChangeListener implementations.

In code

public interface IModelChangeListener
{
    void OnClustersChanged(Cluster cluster);
}

public interface IActiveHealthMonitor : IModelChangeListener, IDisposable
{
    Task ForceCheckAll(IEnumerable<ClusterInfo> allClusters);
}

internal abstract class ItemManagerBase<T>
{
    // ...
    public T GetOrCreateItem(string itemId, Action<T> setupAction)
    {
        // ...
        if (!existed)
        {
            _items.Add(itemId, item);
            UpdateSignal();
            OnItemChanged(item); // New code
        }
        // ...
    }

    protected virtual void OnItemChanged(T item)
    {
        // Do nothing by default.
    }
}

internal sealed class ClusterManager : ItemManagerBase<Cluster>
{
    private readonly IModelChangeListener[] _listeners;
    // ...
    protected override void OnItemChanged(Cluster item)
    {
        foreach (var listener in _listeners)
        {
            listener.OnClustersChanged(item);
        }
    }
}

@alnikola (Contributor) commented:

I'm trying to understand the separation of responsibilities between IPassiveHealthWatcher vs IPassiveHealthCheckPolicy and IActiveHealthMonitor vs IActiveHealthCheckPolicy.

IPassiveHealthWatcher is passed the entire request/response, makes some determination about that being a success or failure, looks up the policy for the current cluster, and then calls IPassiveHealthCheckPolicy with the same parameters. Determining success or failure seems like it should be a concern for IPassiveHealthCheckPolicy, not IPassiveHealthWatcher? Would it make more sense if IPassiveHealthWatcher's sole responsibility was to look up the policy for a cluster and call it, then have the policy determine success or failure? That would also help with our non-binary ideas around health states.

IActiveHealthMonitor has a similar concern. IActiveHealthMonitor's responsibility is scheduling and sending the health probes, not determining if they succeeded or failed. Here again I think it'd make more sense to pass the whole response to the cluster specific IActiveHealthCheckPolicy and let it determine success/failure/health. The response may also contain hints like "retry-after" that could influence the health state algorithm beyond the scope of a single probe.

My idea was a separation of concerns between the policies on one side and the Watcher/Monitor on the other. Meaning, PassiveHealthWatcher and ActiveHealthMonitor encapsulate the logic of running a check operation and registering its result (though for the Watcher it is essentially a simple pass-through), but they don't know how that result affects destination health. Whereas the policies encapsulate the calculation of a new destination health status based on the current status and the check result, but they are not aware of how the check operation itself is executed.

That said, I see your point that passing the entire request/response message or a probing result to the policies would make them more flexible, but I'm not sure it justifies the increased coupling between the check logic and the health status evaluation.

samsp-msft changed the title from "Finish wiring up health checks" to "Proxy should have active health checks to confirm the state of destinations" on Oct 21, 2020
samsp-msft added the "User Story" label on Oct 21, 2020
alnikola added a commit that referenced this issue Oct 27, 2020
From time to time, destinations can become unhealthy and start failing requests for various reasons. To prevent request failures and maintain a good quality of service, YARP must monitor destination health and stop sending traffic to destinations that have become unhealthy until they recover.

This PR implements active and passive health check mechanisms where the former periodically probes destinations with dedicated HTTP requests and the latter watches for client request proxying results.

Fixes #228