Consul ESM anti-flapping protection proposal #50

ezaquarii · 2019-10-10T09:11:29Z

ESM anti flapping protection

Problem

The ESM monitors services periodically (polling). In case of intermittent infrastructure outages, services might repeatedly come online and offline, that has several up-stream implications that are worth allowing service providers and Consul admin's to manage.

Excessive noise - Probably the most visible problem is just a lot of noise, alarm spam, log trace
Client reconnection churn - Initial connection to a service is typically quite expensive, in many cases (https services) it can be significantly more expensive then the query cost. A flapping service can multiple that cost.
Infrastructure churn - Discovery, Load Balancing or a Service Mesh control plane must carry the cost of registering, deregistering flapping end-points continuously.

Some Typical causes of a flapping service:

Switchs with cabling issue - e.g. loose or broken cabling
Switch experiences frame broadcast storms - e.g. abusive software at the network level
Routing is temporarily broken - e.g. problems with BGP or stability when trying to update routes during high load
data center bring-up causes intermittent failures

Solution

Establish thresholds (kind of low-pass filters) that will prevent confusing status updates.
Few consecutive status confirmations would be needed to flip service health status.

A simple Schmitt Trigger should provide enough protection against
spurious failures and recoveries.

https://en.wikipedia.org/wiki/Schmitt_trigger

Number of failed checks and passing changes
would set the thresholds. 2 thresholds should be set via health-check definition: one for failed, one for passed.

A service would be marked as failed only when a number of checks failed, crossing the failure threshold.
A service would be marked as healthy only when a number of subsequent checks pass, crossing a passing threshold.

Exact trigger strategy should be hidden behind and interface, allowing diverse strategies
tuned to various service needs.

Implementation 1 - Consul + ESM

Consul ESM modification

Check updates

Consul ESM agent maintains checks via Agent.watchHealthChecks()=>CheckRunner.UpdateChecks(checks api.HealthChecks) method.
It adds, removes and updates currently running checks according to check definition received from
Consul.

Check modification

CheckTCP
CheckHTTP
CheckDocker
CheckGRPC

We propse checks to be extended making them stateful, so checks can track recent service changes
using proposed Schmitt Trigger (or any other health evaluation mechanism).

We propse that CheckNotifier remains unchanged, propagating status as before, contrary to other
solution proposed here: https://github.com/hashicorp/consul/pull/5739/files

Vendored consul library update

Consul ESM vendors in Consul library. This vendored library is somewhat out of sync with main
Consul agent code.

This library should be either upgraded or forked at some point. Implementing this feature
would cause effectively to fork an outdated codebase.

Key questions:

can we keep Consul and ESM version in sync and distribute/deploy in pairs?
is it possible to move the check definition modifications to consul library and maintain it there?

Consul Modification

In order to update health check definition schema to carry additional properties, health check
definition served by consul must be extended to propagate additional settings.

Health check definitions to modify:

consul/api/health.go
consul-esm/vendor/hashicorp/api/health.go (lib)

Agent.Checks() => map of AgentCheck => HealthCheckDefinition

Impact

consul-esm - need to update checks to carry thresholds and current state counters
consul - need to update health check definition schema to carry additional properties
health check registration API must expose method to add those extra properties

By default, thresholds should be disabled, preserving old behavior in existing tests.

Implementation 2 - only ESM, no Consul change

ESM modification

The ESM covers relatively new ground in that it expects users to register health
checks for services not running locally, because of this, unlike localhost / consul
agent based health checks, its more likely to experience network performance issues
and temporary failures. A purely ESM solution would solve this need.

To avoid modifying main consul server, the functionality can be first implemented
in Consul ESM daemon only.

This is essentially identical with ESM implementation #1, with 2 exceptions:

instead of configuring the threshold for individual checks, we draw global thresholds from a local daemon config file
no health check schema definition, as the config is not retrieved from Consul

Pros:

faster rollout
changes decoupled from main consul codebase (no impact there)
lower risk

Cons:

ESM configuration is less flexible
Check configuration must be deployed alongside ESM daemons
Configuration must be kept consistent across whole network

Consul modification

None in this variant.

Impact

Only ESM daemon is impacted:

adding check thresholds
extending ESM configuration with threshold settings

A/C

users can register ESM health checks with separate failure/recovery thresholds
existing use cases not changed - thresholds are optional and older checks are not altered
thresholds can be set for individual health checks

The text was updated successfully, but these errors were encountered:

ezaquarii · 2019-10-10T09:12:10Z

Moved from previous ticket #49

Hey @ezaquarii , thanks for the proposal.

I've just dived into working on a new feature for consul-template so I don't have time to review this thoroughly at the moment but I do have a couple initial comments.

On the master branch we have recently migrated to go-modules. In the process we eliminated the vendored dependencies and updated everything to current. So the vendored consul issue is no longer a problem.

My initial instinct is to go with "Implementation 2". Keeping things simpler and self contained for the first version of this seems better to me.

banks · 2019-10-10T11:22:43Z

The other think to link up here is that something very close to this proposal is already being worked on in hashicorp/consul#5739.

That PR being merged doesn't necessarily solve to ESM case though since thresholding there is being applied in the Consul agent before sending the the server which makes sense to me.

So I don't think that changes anything here other than to confirm that proposal 2 has a precedent in the agent case being worked on and that we should probably re-use the same check configuration and similar implementation in ESM @eikenb when we look at this.

cbroglie · 2020-05-07T00:17:49Z

I found this issue when I came here to file a feature request for supporting the failures_before_critical and success_before_passing fields from hashicorp/consul#5739 in consul-esm. Proposal 2 works for me.

lornasong added the enhancement label Dec 2, 2019

edevil mentioned this issue Aug 12, 2020

Introduce fail/pass check thresholds to prevent flapping #78

Merged

lornasong linked a pull request Sep 14, 2020 that will close this issue

Introduce fail/pass check thresholds to prevent flapping #78

Merged

lornasong closed this as completed in #78 Sep 14, 2020

lornasong added this to the 0.5.0 milestone Dec 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consul ESM anti-flapping protection proposal #50

Consul ESM anti-flapping protection proposal #50

ezaquarii commented Oct 10, 2019

ezaquarii commented Oct 10, 2019

banks commented Oct 10, 2019

cbroglie commented May 7, 2020

Consul ESM anti-flapping protection proposal #50

Consul ESM anti-flapping protection proposal #50

Comments

ezaquarii commented Oct 10, 2019

ESM anti flapping protection

Problem

Solution

Implementation 1 - Consul + ESM

Consul ESM modification

Check updates

Check modification

Vendored consul library update

Consul Modification

Impact

Implementation 2 - only ESM, no Consul change

ESM modification

Consul modification

Impact

A/C

ezaquarii commented Oct 10, 2019

banks commented Oct 10, 2019

cbroglie commented May 7, 2020