Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consul ESM anti-flapping protection proposal #50

Closed
ezaquarii opened this issue Oct 10, 2019 · 3 comments · Fixed by #78
Closed

Consul ESM anti-flapping protection proposal #50

ezaquarii opened this issue Oct 10, 2019 · 3 comments · Fixed by #78
Milestone

Comments

@ezaquarii
Copy link

ESM anti flapping protection

Problem

The ESM monitors services periodically (polling). In case of intermittent infrastructure outages, services might repeatedly come online and offline, that has several up-stream implications that are worth allowing service providers and Consul admin's to manage.

  • Excessive noise - Probably the most visible problem is just a lot of noise, alarm spam, log trace
  • Client reconnection churn - Initial connection to a service is typically quite expensive, in many cases (https services) it can be significantly more expensive then the query cost. A flapping service can multiple that cost.
  • Infrastructure churn - Discovery, Load Balancing or a Service Mesh control plane must carry the cost of registering, deregistering flapping end-points continuously.

Some Typical causes of a flapping service:

  • Switchs with cabling issue - e.g. loose or broken cabling
  • Switch experiences frame broadcast storms - e.g. abusive software at the network level
  • Routing is temporarily broken - e.g. problems with BGP or stability when trying to update routes during high load
  • data center bring-up causes intermittent failures

Solution

Establish thresholds (kind of low-pass filters) that will prevent confusing status updates.
Few consecutive status confirmations would be needed to flip service health status.

A simple Schmitt Trigger should provide enough protection against
spurious failures and recoveries.

https://en.wikipedia.org/wiki/Schmitt_trigger

Number of failed checks and passing changes
would set the thresholds. 2 thresholds should be set via health-check definition: one for failed, one for passed.

  • A service would be marked as failed only when a number of checks failed, crossing the failure threshold.
  • A service would be marked as healthy only when a number of subsequent checks pass, crossing a passing threshold.

Exact trigger strategy should be hidden behind and interface, allowing diverse strategies
tuned to various service needs.

Implementation 1 - Consul + ESM

Consul ESM modification

Check updates

Consul ESM agent maintains checks via Agent.watchHealthChecks()=>CheckRunner.UpdateChecks(checks api.HealthChecks) method.
It adds, removes and updates currently running checks according to check definition received from
Consul.

Check modification

  • CheckTCP
  • CheckHTTP
  • CheckDocker
  • CheckGRPC

We propse checks to be extended making them stateful, so checks can track recent service changes
using proposed Schmitt Trigger (or any other health evaluation mechanism).

We propse that CheckNotifier remains unchanged, propagating status as before, contrary to other
solution proposed here: https://github.com/hashicorp/consul/pull/5739/files

Vendored consul library update

Consul ESM vendors in Consul library. This vendored library is somewhat out of sync with main
Consul agent code.

This library should be either upgraded or forked at some point. Implementing this feature
would cause effectively to fork an outdated codebase.

Key questions:

  1. can we keep Consul and ESM version in sync and distribute/deploy in pairs?
  2. is it possible to move the check definition modifications to consul library and maintain it there?

Consul Modification

In order to update health check definition schema to carry additional properties, health check
definition served by consul must be extended to propagate additional settings.

Health check definitions to modify:

  • consul/api/health.go
  • consul-esm/vendor/hashicorp/api/health.go (lib)

Agent.Checks() => map of AgentCheck => HealthCheckDefinition

Impact

  • consul-esm - need to update checks to carry thresholds and current state counters
  • consul - need to update health check definition schema to carry additional properties
  • health check registration API must expose method to add those extra properties

By default, thresholds should be disabled, preserving old behavior in existing tests.

Implementation 2 - only ESM, no Consul change

ESM modification

The ESM covers relatively new ground in that it expects users to register health
checks for services not running locally, because of this, unlike localhost / consul
agent based health checks, its more likely to experience network performance issues
and temporary failures. A purely ESM solution would solve this need.

To avoid modifying main consul server, the functionality can be first implemented
in Consul ESM daemon only.

This is essentially identical with ESM implementation #1, with 2 exceptions:

  1. instead of configuring the threshold for individual checks, we draw global thresholds from a local daemon config file
  2. no health check schema definition, as the config is not retrieved from Consul

Pros:

  • faster rollout
  • changes decoupled from main consul codebase (no impact there)
  • lower risk

Cons:

  • ESM configuration is less flexible
  • Check configuration must be deployed alongside ESM daemons
  • Configuration must be kept consistent across whole network

Consul modification

None in this variant.

Impact

Only ESM daemon is impacted:

  1. adding check thresholds
  2. extending ESM configuration with threshold settings

A/C

  1. users can register ESM health checks with separate failure/recovery thresholds
  2. existing use cases not changed - thresholds are optional and older checks are not altered
  3. thresholds can be set for individual health checks
@ezaquarii
Copy link
Author

Moved from previous ticket #49

Hey @ezaquarii , thanks for the proposal.

I've just dived into working on a new feature for consul-template so I don't have time to review this thoroughly at the moment but I do have a couple initial comments.

  1. On the master branch we have recently migrated to go-modules. In the process we eliminated the vendored dependencies and updated everything to current. So the vendored consul issue is no longer a problem.
  2. My initial instinct is to go with "Implementation 2". Keeping things simpler and self contained for the first version of this seems better to me.

@banks
Copy link
Member

banks commented Oct 10, 2019

The other think to link up here is that something very close to this proposal is already being worked on in hashicorp/consul#5739.

That PR being merged doesn't necessarily solve to ESM case though since thresholding there is being applied in the Consul agent before sending the the server which makes sense to me.

So I don't think that changes anything here other than to confirm that proposal 2 has a precedent in the agent case being worked on and that we should probably re-use the same check configuration and similar implementation in ESM @eikenb when we look at this.

@cbroglie
Copy link
Contributor

cbroglie commented May 7, 2020

I found this issue when I came here to file a feature request for supporting the failures_before_critical and success_before_passing fields from hashicorp/consul#5739 in consul-esm. Proposal 2 works for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants