Proxy should have active health checks to confirm the state of destinations #228
@Tratcher We've made substantial changes in this area since the hard fork:
When are you planning to tackle this? I don't have the bandwidth to prepare a PR until at least late next week.
We haven't scheduled it yet; I only filed it for tracking once we realized it was incomplete.
FYI, I've started to clean up and refactor our internal implementation in prep for submitting a PR here. Earliest ETA remains late next week.
When looking at the design also consider #304 for pulling health state from another service.
Design proposal
Abstract
From time to time, destinations can become unhealthy and start failing request processing for various reasons. To prevent request failures and maintain a good quality of service, YARP must monitor destination health status and stop sending traffic to destinations that have become unhealthy until they have recovered.
Requirements
Components
Health check policies
The default implementation provides the following active and passive health check policies.
Passive
Active
Configuration
All but one of the active and passive health check settings are set at the cluster level. The only exception is an optional "Destination/Health" element specifying a separate active health check endpoint. The actual active health check probing URI is composed as "Destination/Address" + (optional) "Destination/Health" + "Cluster/HealthChecks/Path".
Active health check settings
Passive health check settings
Data Flow
API
Configuration
{
"Clusters": {
"cluster1": {
"LoadBalancing": {
"Mode": "Random"
},
"HealthChecks": {
"Active": {
"Enabled": "true",
"Interval": "00:05:00",
"Timeout": "00:00:30",
"Policy": "FailOnAny5xx",
"Path": "/Heath"
},
"Passive": {
"Enabled": "true",
"Policy": "FailureRate",
"ReactivationPeriod": "00:05:00"
}
},
"Destinations": {
"cluster1/destination1": {
"Address": "https://localhost:10000/",
"Health": "https://localhost:4040/"
},
"cluster1/destination2": {
"Address": "http://localhost:10010/"
}
},
"Metadata": {
"PassiveHealthCheck.FailureRate.Period": "00:01:00",
"PassiveHealthCheck.FailureRate.Percentage": "0.3"
}
}
}
}
Code
// Abstractions
public class ActiveHealthCheckOptions
{
public bool Enabled { get; set; }
public TimeSpan Interval { get; set; }
public TimeSpan Timeout { get; set; }
public string Policy { get; set; }
public string Path { get; set; }
}
public class PassiveHealthCheckOptions
{
public bool Enabled { get; set; }
public string Policy { get; set; }
public TimeSpan ReactivationPeriod { get; set; }
}
public class HealthCheckOptions
{
public ActiveHealthCheckOptions Active { get; set; }
public PassiveHealthCheckOptions Passive { get; set; }
}
public class Cluster
{
// ...
public HealthCheckOptions HealthChecks { get; set; }
}
// Runtime model
public class CompositeDestinationHealth
{
public CompositeDestinationHealth(DestinationHealth active, DestinationHealth passive)
{
// ...
}
public CompositeDestinationHealth(DestinationHealth externalCurrent)
{
// ...
}
public DestinationHealth Active { get; }
public DestinationHealth Passive { get; }
public DestinationHealth Current { get; }
}
public class DestinationDynamicState
{
public CompositeDestinationHealth Health { get; }
}
public interface IModelChangeListener
{
void OnClustersChanged(ClusterInfo cluster);
}
// Middleware
internal class HealthCheckMiddleware
{
public HealthCheckMiddleware(RequestDelegate next, IPassiveHealthWatcher watcher, IOperationLogger<HealthCheckMiddleware> operationLogger)
{
// ...
}
public async Task Invoke(HttpContext context)
{
// ...
}
}
// Active health checks
public interface IActiveHealthMonitor : IDisposable
{
Task ForceCheckAll(IEnumerable<ClusterInfo> allClusters);
}
public interface IActiveHealthCheckPolicy
{
void ProbingFailed(ClusterConfig cluster, DestinationInfo destination, HttpStatusCode code, Exception exception);
void ProbingSucceeded(ClusterConfig cluster, DestinationInfo destination);
}
public abstract class BaseActiveHealthMonitor : IActiveHealthMonitor, IModelChangeListener
{
}
// Passive health checks
public interface IPassiveHealthWatcher
{
void RequestProxied(ClusterConfig cluster, DestinationInfo destination, HttpContext context, IProxyErrorFeature? error);
}
public interface IPassiveHealthCheckPolicy
{
void RequestFailed(ClusterConfig cluster, DestinationInfo destination, HttpContext context, IProxyErrorFeature error);
void RequestSucceeded(ClusterConfig cluster, DestinationInfo destination, HttpContext context);
}
public interface IReactivationScheduler
{
void ScheduleMarkingHealthy(ClusterConfig cluster, DestinationInfo destination);
}
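For illustration only, here is a rough sketch of what a failure-rate passive policy could look like against the abstractions above. The rolling-window bookkeeping is simplified to plain counters, and the metadata accessor, the default threshold, and the MarkUnhealthy helper are assumptions for the sketch, not real YARP APIs.

```csharp
using System.Collections.Concurrent;
using System.Globalization;
using Microsoft.AspNetCore.Http;

// Hypothetical sketch only: a failure-rate passive policy built on the proposed
// IPassiveHealthCheckPolicy / IReactivationScheduler abstractions.
internal sealed class FailureRatePassiveHealthCheckPolicy : IPassiveHealthCheckPolicy
{
    private readonly IReactivationScheduler _reactivationScheduler;

    // Per-destination counters; a real implementation would use a rolling time window
    // sized by the "PassiveHealthCheck.FailureRate.Period" metadata value.
    private readonly ConcurrentDictionary<DestinationInfo, (long Total, long Failed)> _counters =
        new ConcurrentDictionary<DestinationInfo, (long Total, long Failed)>();

    public FailureRatePassiveHealthCheckPolicy(IReactivationScheduler reactivationScheduler)
    {
        _reactivationScheduler = reactivationScheduler;
    }

    public void RequestFailed(ClusterConfig cluster, DestinationInfo destination, HttpContext context, IProxyErrorFeature error)
    {
        var (total, failed) = _counters.AddOrUpdate(destination, (1, 1), (_, c) => (c.Total + 1, c.Failed + 1));

        if ((double)failed / total >= GetFailureRateThreshold(cluster))
        {
            MarkUnhealthy(destination);
            // Take the destination out of rotation and schedule it to be re-marked healthy later.
            _reactivationScheduler.ScheduleMarkingHealthy(cluster, destination);
        }
    }

    public void RequestSucceeded(ClusterConfig cluster, DestinationInfo destination, HttpContext context)
    {
        _counters.AddOrUpdate(destination, (1, 0), (_, c) => (c.Total + 1, c.Failed));
    }

    private static double GetFailureRateThreshold(ClusterConfig cluster)
    {
        // Assumption: cluster metadata is exposed as a string dictionary, as in the example config.
        if (cluster.Metadata != null &&
            cluster.Metadata.TryGetValue("PassiveHealthCheck.FailureRate.Percentage", out var raw) &&
            double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out var value))
        {
            return value;
        }
        return 0.5; // arbitrary default for the sketch
    }

    private static void MarkUnhealthy(DestinationInfo destination)
    {
        // Placeholder: the design above models this via DestinationDynamicState / CompositeDestinationHealth.
    }
}
```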
Is there also a requirement for the reverse scenario, pushing health data to an external service? Or at least publicly exposing it so such an operation could be performed? It seems like #306 would also require access to health data.
There's no technical reason it needs to be part of ProxyInvokerMiddleware, right? I think the only unique knowledge ProxyInvokerMiddleware has right now is which DestinationInfo was finally selected, and that's data we should find a way to preserve regardless. IPassiveHealthWatcher should be inserted into the pipeline with its own middleware to maximize customization flexibility, even up to being able to remove/replace the whole middleware.
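To make that suggestion concrete, a minimal sketch of such a standalone middleware might look like the following. The IProxiedDestinationFeature used to recover the selected cluster/destination is purely hypothetical; preserving that data is exactly the open question mentioned above.

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

// Hypothetical feature used only for this sketch to carry the selected cluster/destination.
public interface IProxiedDestinationFeature
{
    ClusterConfig Cluster { get; }
    DestinationInfo Destination { get; }
}

// Hypothetical sketch of a dedicated passive-health middleware.
internal sealed class PassiveHealthCheckMiddleware
{
    private readonly RequestDelegate _next;
    private readonly IPassiveHealthWatcher _watcher;

    public PassiveHealthCheckMiddleware(RequestDelegate next, IPassiveHealthWatcher watcher)
    {
        _next = next;
        _watcher = watcher;
    }

    public async Task Invoke(HttpContext context)
    {
        // Let the rest of the pipeline (including the proxying middleware) run first.
        await _next(context);

        // Report the outcome to the watcher if a destination actually handled the request.
        var proxied = context.Features.Get<IProxiedDestinationFeature>();
        if (proxied != null)
        {
            var error = context.Features.Get<IProxyErrorFeature>();
            _watcher.RequestProxied(proxied.Cluster, proxied.Destination, context, error);
        }
    }
}
```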
In the diagram, both IActiveHealthMonitor and IPassiveHealthWatcher should be pointing at the same IDestinationHealthEvaluationStrategy, right?
We'll need to discuss the public usage of signals here. We've avoided their use in public APIs so far.
IPassiveHealthWatcher seems like it should be able to examine responses in more detail to decide what it considers a failure. Should it be passed the HttpContext? E.g. the destination could respond with a 503 or a 400, and that wouldn't generate an IProxyErrorFeature today; IProxyErrorFeature primarily captures exceptions.
Do the active health probes use the same per-cluster HttpClient as the proxied requests?
Where do we configure the active health check endpoint?
Is there a timeout for the health checks? If so, the above may actually be a good thing - the host is obviously overloaded if it can't answer - or would that lead to more trouble?
I like the idea of having an optional path scheme on the destination - possibly as a relative URI - so the port and path can be changed if necessary.
For the config schema, I'm expecting health checks to have their own per-cluster section:
See some of the existing settings we have in the system:
Edit: Not sure about the schema for enabling/disabling passive health checks, or what config they might need. Any config there seems like it would be strategy-specific and likely exposed via cluster metadata.
If a path is set on the health checks and a host on the destination, do they get appended, or should a path on the destination be considered an override?
I'd recommend they append for consistency. That's how the cluster path would apply to the normal destination prefix.
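As a rough sketch of that append behavior (a hypothetical helper, not the proposed API), assuming the destination's Health address replaces its regular Address as the probe base:

```csharp
using System;

// Hypothetical helper illustrating the "append" behavior discussed above.
public static class ProbeUriSketch
{
    public static Uri Compose(string destinationAddress, string destinationHealth, string healthCheckPath)
    {
        // Assumption: "Health" (when present) is the probe base instead of "Address".
        var baseAddress = string.IsNullOrEmpty(destinationHealth) ? destinationAddress : destinationHealth;
        var path = healthCheckPath ?? string.Empty;
        return new Uri(baseAddress.TrimEnd('/') + "/" + path.TrimStart('/'));
    }
}

// With the example config above:
// Compose("https://localhost:10000/", "https://localhost:4040/", "/Health") -> https://localhost:4040/Health
// Compose("http://localhost:10010/", null, "/Health")                       -> http://localhost:10010/Health
```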
Thinking about previous discussions about "gradual warm-up" of destinations as well as soft vs. hard failure modes, would it perhaps make sense to make the health signal a double instead of a boolean? This would enable interesting scenarios (e.g. reduce the weight of a destination if its latency is increasing, but don't remove it from rotation entirely, to mitigate cascading failure / snowball effects). To be clear, this is more future-looking, and we don't need it right away.
That's an interesting idea. Though, I was thinking about a different approach to gradual warm-up where we kept destination weight and health status separately. That way, a new
@davidni, that's a good point, health isn't necessarily binary. I've seen a number of systems that use at least three states (healthy, degraded, offline). If we didn't want to go with a full double we may still consider using several buckets, including one for slow start. This may have interesting effects in the stack, like maybe load balancing prefers healthy endpoints, but session affinity allows any state but offline.
Health state buckets seem like a good compromise between binary and double values.
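A minimal sketch of what such buckets could look like, with names taken from the states mentioned above plus a slow-start bucket (illustrative only, not a proposed API):

```csharp
// Illustrative health-state buckets: a middle ground between a boolean flag and a
// continuous double-valued health signal. Names and semantics are assumptions from this thread.
public enum DestinationHealthState
{
    Healthy,    // full participation in load balancing
    SlowStart,  // newly added or just reactivated; receives a reduced share of traffic
    Degraded,   // deprioritized by load balancing but still acceptable for session affinity
    Offline     // excluded from routing entirely
}
```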
IActiveHealthMonitor depends on IClusterManager and Signals for config change notifications, neither of which is currently public. We'll need a public solution here for IActiveHealthMonitor to be a viable extensibility point.
I'm trying to understand the separation of responsibilities between IPassiveHealthWatcher vs IPassiveHealthCheckPolicy and IActiveHealthMonitor vs IActiveHealthCheckPolicy. IPassiveHealthWatcher is passed the entire request/response, makes some determination about that being a success or failure, looks up the policy for the current cluster, and then calls IPassiveHealthCheckPolicy with the same parameters. Determining success or failure seems like it should be a concern for IPassiveHealthCheckPolicy, not IPassiveHealthWatcher? Would it make more sense if IPassiveHealthWatcher's sole responsibility was to look up the policy for a cluster and call it, then have the policy determine success or failure? That would also help with our non-binary ideas around health states. IActiveHealthMonitor has a similar concern. IActiveHealthMonitor's responsibility is scheduling and sending the health probes, not determining if they succeeded or failed. Here again I think it'd make more sense to pass the whole response to the cluster-specific IActiveHealthCheckPolicy and let it determine success/failure/health. The response may also contain hints like "retry-after" that could influence the health state algorithm beyond the scope of a single probe.
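In code, that suggestion might translate into alternative policy shapes along these lines (hypothetical signatures, not part of the proposal; the "2" suffix only distinguishes them from the interfaces defined above):

```csharp
using System;
using System.Net.Http;
using Microsoft.AspNetCore.Http;

// Hypothetical alternative shapes: the monitor/watcher only schedules probes or forwards
// proxied requests, and the policy sees the full outcome and decides what it means for health.
public interface IActiveHealthCheckPolicy2
{
    // Receives the raw probe outcome, including the response (e.g. status code, Retry-After).
    void ProbingCompleted(ClusterConfig cluster, DestinationInfo destination, HttpResponseMessage? response, Exception? exception);
}

public interface IPassiveHealthCheckPolicy2
{
    // Receives every proxied request and classifies success/failure itself.
    void RequestProxied(ClusterConfig cluster, DestinationInfo destination, HttpContext context, IProxyErrorFeature? error);
}
```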
Valid point. To avoid exposing
In code
public interface IModelChangeListener
{
void OnClustersChanged(Cluster cluster);
}
public interface IActiveHealthMonitor : IModelChangeListener, IDisposable
{
Task ForceCheckAll(IEnumerable<ClusterInfo> allClusters);
}
internal abstract class ItemManagerBase<T>
{
// ...
public T GetOrCreateItem(string itemId, Action<T> setupAction)
{
// ...
if (!existed)
{
_items.Add(itemId, item);
UpdateSignal();
OnItemChanged(item); // New code
}
return item;
}
protected virtual void OnItemChanged(T item)
{
// Do nothing by default.
}
// ...
}
internal sealed class ClusterManager : ItemManagerBase<Cluster>
{
private readonly IModelChangeListener[] _listeners;
// ...
protected override void OnItemChanged(Cluster item)
{
foreach (var listener in _listeners)
{
listener.OnClustersChanged(item);
}
}
}
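For completeness, here is a sketch of how a monitor might consume these notifications, assuming the updated IActiveHealthMonitor shape above; all scheduling and probing details are placeholders, not a real implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch: a monitor built on the updated IActiveHealthMonitor
// (which includes IModelChangeListener and IDisposable).
internal sealed class SampleActiveHealthMonitor : IActiveHealthMonitor
{
    private readonly IActiveHealthCheckPolicy _policy;

    public SampleActiveHealthMonitor(IActiveHealthCheckPolicy policy)
    {
        _policy = policy;
    }

    public void OnClustersChanged(Cluster cluster)
    {
        // Assumption: inspect cluster.HealthChecks?.Active and start, stop, or reschedule
        // the periodic probing timer for this cluster accordingly.
    }

    public Task ForceCheckAll(IEnumerable<ClusterInfo> allClusters)
    {
        // Probe every destination of every cluster once and report each result to _policy
        // via ProbingSucceeded / ProbingFailed.
        return Task.CompletedTask; // placeholder
    }

    public void Dispose()
    {
        // Stop timers and cancel outstanding probes.
    }
}
```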
My idea was a separation of concerns between policies on one side and the watcher/monitor components on the other. Having said that, I see your point that passing the entire request/response message or a probing result to policies can make it more flexible, but I'm not sure it justifies the increased coupling between the check logic and the health status evaluation.
From time to time, destinations can become unhealthy and start failing request processing for various reasons. To prevent request failures and maintain a good quality of service, YARP must monitor destination health status and stop sending traffic to destinations that have become unhealthy until they recover. This PR implements active and passive health check mechanisms, where the former periodically probes destinations with dedicated HTTP requests and the latter watches the results of proxying client requests. Fixes #228
We have the IHealthProbeWorker and associated infrastructure, but it's all internal and nobody ever starts it.
This was forked from the original partner team code before health checks were completely implemented. Some missing pieces:
@davidni