You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The ESM monitors services periodically (polling). In case of intermittent infrastructure outages, services might repeatedly come online and offline, that has several up-stream implications that are worth allowing service providers and Consul admin's to manage.
Excessive noise - Probably the most visible problem is just a lot of noise, alarm spam, log trace
Client reconnection churn - Initial connection to a service is typically quite expensive, in many cases (https services) it can be significantly more expensive then the query cost. A flapping service can multiple that cost.
Infrastructure churn - Discovery, Load Balancing or a Service Mesh control plane must carry the cost of registering, deregistering flapping end-points continuously.
Some Typical causes of a flapping service:
Switchs with cabling issue - e.g. loose or broken cabling
Switch experiences frame broadcast storms - e.g. abusive software at the network level
Routing is temporarily broken - e.g. problems with BGP or stability when trying to update routes during high load
data center bring-up causes intermittent failures
Solution
Establish thresholds (kind of low-pass filters) that will prevent confusing status updates.
Few consecutive status confirmations would be needed to flip service health status.
A simple Schmitt Trigger should provide enough protection against
spurious failures and recoveries.
Number of failed checks and passing changes
would set the thresholds. 2 thresholds should be set via health-check definition: one for failed, one for passed.
A service would be marked as failed only when a number of checks failed, crossing the failure threshold.
A service would be marked as healthy only when a number of subsequent checks pass, crossing a passing threshold.
Exact trigger strategy should be hidden behind and interface, allowing diverse strategies
tuned to various service needs.
Implementation 1 - Consul + ESM
Consul ESM modification
Check updates
Consul ESM agent maintains checks via Agent.watchHealthChecks()=>CheckRunner.UpdateChecks(checks api.HealthChecks) method.
It adds, removes and updates currently running checks according to check definition received from
Consul.
Check modification
CheckTCP
CheckHTTP
CheckDocker
CheckGRPC
We propse checks to be extended making them stateful, so checks can track recent service changes
using proposed Schmitt Trigger (or any other health evaluation mechanism).
Consul ESM vendors in Consul library. This vendored library is somewhat out of sync with main
Consul agent code.
This library should be either upgraded or forked at some point. Implementing this feature
would cause effectively to fork an outdated codebase.
Key questions:
can we keep Consul and ESM version in sync and distribute/deploy in pairs?
is it possible to move the check definition modifications to consul library and maintain it there?
Consul Modification
In order to update health check definition schema to carry additional properties, health check
definition served by consul must be extended to propagate additional settings.
Health check definitions to modify:
consul/api/health.go
consul-esm/vendor/hashicorp/api/health.go (lib)
Agent.Checks() => map of AgentCheck => HealthCheckDefinition
Impact
consul-esm - need to update checks to carry thresholds and current state counters
consul - need to update health check definition schema to carry additional properties
health check registration API must expose method to add those extra properties
By default, thresholds should be disabled, preserving old behavior in existing tests.
Implementation 2 - only ESM, no Consul change
ESM modification
The ESM covers relatively new ground in that it expects users to register health
checks for services not running locally, because of this, unlike localhost / consul
agent based health checks, its more likely to experience network performance issues
and temporary failures. A purely ESM solution would solve this need.
To avoid modifying main consul server, the functionality can be first implemented
in Consul ESM daemon only.
This is essentially identical with ESM implementation #1, with 2 exceptions:
instead of configuring the threshold for individual checks, we draw global thresholds from a local daemon config file
no health check schema definition, as the config is not retrieved from Consul
Pros:
faster rollout
changes decoupled from main consul codebase (no impact there)
lower risk
Cons:
ESM configuration is less flexible
Check configuration must be deployed alongside ESM daemons
Configuration must be kept consistent across whole network
Consul modification
None in this variant.
Impact
Only ESM daemon is impacted:
adding check thresholds
extending ESM configuration with threshold settings
A/C
users can register ESM health checks with separate failure/recovery thresholds
existing use cases not changed - thresholds are optional and older checks are not altered
thresholds can be set for individual health checks
The text was updated successfully, but these errors were encountered:
I've just dived into working on a new feature for consul-template so I don't have time to review this thoroughly at the moment but I do have a couple initial comments.
On the master branch we have recently migrated to go-modules. In the process we eliminated the vendored dependencies and updated everything to current. So the vendored consul issue is no longer a problem.
My initial instinct is to go with "Implementation 2". Keeping things simpler and self contained for the first version of this seems better to me.
The other think to link up here is that something very close to this proposal is already being worked on in hashicorp/consul#5739.
That PR being merged doesn't necessarily solve to ESM case though since thresholding there is being applied in the Consul agent before sending the the server which makes sense to me.
So I don't think that changes anything here other than to confirm that proposal 2 has a precedent in the agent case being worked on and that we should probably re-use the same check configuration and similar implementation in ESM @eikenb when we look at this.
I found this issue when I came here to file a feature request for supporting the failures_before_critical and success_before_passing fields from hashicorp/consul#5739 in consul-esm. Proposal 2 works for me.
ESM anti flapping protection
Problem
The ESM monitors services periodically (polling). In case of intermittent infrastructure outages, services might repeatedly come online and offline, that has several up-stream implications that are worth allowing service providers and Consul admin's to manage.
Some Typical causes of a flapping service:
Solution
Establish thresholds (kind of low-pass filters) that will prevent confusing status updates.
Few consecutive status confirmations would be needed to flip service health status.
A simple Schmitt Trigger should provide enough protection against
spurious failures and recoveries.
https://en.wikipedia.org/wiki/Schmitt_trigger
Number of failed checks and passing changes
would set the thresholds. 2 thresholds should be set via health-check definition: one for failed, one for passed.
Exact trigger strategy should be hidden behind and interface, allowing diverse strategies
tuned to various service needs.
Implementation 1 - Consul + ESM
Consul ESM modification
Check updates
Consul ESM agent maintains checks via
Agent.watchHealthChecks()
=>CheckRunner.UpdateChecks(checks api.HealthChecks)
method.It adds, removes and updates currently running checks according to check definition received from
Consul.
Check modification
CheckTCP
CheckHTTP
CheckDocker
CheckGRPC
We propse checks to be extended making them stateful, so checks can track recent service changes
using proposed Schmitt Trigger (or any other health evaluation mechanism).
We propse that
CheckNotifier
remains unchanged, propagating status as before, contrary to othersolution proposed here: https://github.com/hashicorp/consul/pull/5739/files
Vendored consul library update
Consul ESM vendors in Consul library. This vendored library is somewhat out of sync with main
Consul agent code.
This library should be either upgraded or forked at some point. Implementing this feature
would cause effectively to fork an outdated codebase.
Key questions:
Consul Modification
In order to update health check definition schema to carry additional properties, health check
definition served by consul must be extended to propagate additional settings.
Health check definitions to modify:
consul/api/health.go
consul-esm/vendor/hashicorp/api/health.go
(lib)Agent.Checks()
=> map ofAgentCheck
=>HealthCheckDefinition
Impact
consul-esm
- need to update checks to carry thresholds and current state countersconsul
- need to update health check definition schema to carry additional propertiesBy default, thresholds should be disabled, preserving old behavior in existing tests.
Implementation 2 - only ESM, no Consul change
ESM modification
The ESM covers relatively new ground in that it expects users to register health
checks for services not running locally, because of this, unlike localhost / consul
agent based health checks, its more likely to experience network performance issues
and temporary failures. A purely ESM solution would solve this need.
To avoid modifying main consul server, the functionality can be first implemented
in Consul ESM daemon only.
This is essentially identical with ESM implementation #1, with 2 exceptions:
Pros:
Cons:
Consul modification
None in this variant.
Impact
Only ESM daemon is impacted:
A/C
The text was updated successfully, but these errors were encountered: