outlier: tooling for success rate ejection #618

Merged Mar 31, 2017 (17 commits)
37 changes: 31 additions & 6 deletions docs/intro/arch_overview/outlier.rst
Original file line number Diff line number Diff line change
Expand Up @@ -76,15 +76,19 @@ being ejected and for what reasons. The log uses a JSON format with one object p
"upstream_url": "...",
"action": "...",
"type": "...",
"num_ejections": "..."
"num_ejections": "...",
"enforced": "...",
"host_success_rate": "...",
"cluster_success_rate_average": "...",
"cluster_success_rate_ejection_threshold": "..."
}

time
The time that the event took place.

secs_since_last_action
The time in seconds since the last action (either an ejection or unejection)
took place. This time will be -1 for the first ejection given there is no
took place. This value will be ``-1`` for the first ejection given there is no
action before the first ejection.

cluster
Expand All @@ -98,12 +102,33 @@ action
brought back into service.

type
If ``action`` is ``eject``, species the type of ejection that took place. Currently this can
only be ``5xx``.
If ``action`` is ``eject``, specifies the type of ejection that took place. Currently type can
be either ``5xx`` or ``SuccessRate``.

num_ejections
The number of times the host has been ejected (local to that Envoy and gets reset if the host
gets removed from the upstream cluster for any reason and then re-added).
If ``action`` is ``eject``, specifies the number of times the host has been ejected
(local to that Envoy and gets reset if the host gets removed from the upstream cluster for any
reason and then re-added).

enforced
If ``action`` is ``eject``, specifies if the ejection was enforced. ``true`` means the host was ejected.
``false`` means the event was logged but the host was not actually ejected.

host_success_rate
If ``action`` is ``eject``, and ``type`` is ``SuccessRate``, specifies the host's success rate
at the time of the ejection event on a ``0-100`` range.

.. _arch_overview_outlier_detection_ejection_event_logging_cluster_success_rate_average:

cluster_success_rate_average
If ``action`` is ``eject``, and ``type`` is ``SuccessRate``, specifies the average success
rate of the hosts in the cluster at the time of the ejection event on a ``0-100`` range.

.. _arch_overview_outlier_detection_ejection_event_logging_cluster_success_rate_ejection_threshold:

cluster_success_rate_ejection_threshold
If ``action`` is ``eject``, and ``type`` is ``SuccessRate``, specifies the success rate ejection
threshold at the time of the ejection event.

Configuration reference
-----------------------
46 changes: 30 additions & 16 deletions docs/operations/admin.rst
Expand Up @@ -19,22 +19,36 @@ modify different aspects of the server.

List out all configured :ref:`cluster manager <arch_overview_cluster_manager>` clusters. This
information includes all discovered upstream hosts in each cluster along with per host statistics.
This is useful for debugging service discovery issues. The per host statistics include:

.. csv-table::
:header: Name, Type, Description
:widths: 1, 1, 2

cx_total, Counter, Total connections
cx_active, Gauge, Total active connections
cx_connect_fail, Counter, Total connection failures
rq_total, Counter, Total requests
rq_timeout, Counter, Total timed out requests
rq_active, Gauge, Total active requests
healthy, String, The health status of the host. See below
weight, Integer, Load balancing weight (1-100)
zone, String, Service zone
canary, Boolean, Whether the host is a canary
This is useful for debugging service discovery issues.

Cluster wide information
- :ref:`circuit breakers<config_cluster_manager_cluster_circuit_breakers>` settings for all priorities.

- Information about :ref:`outlier detection<arch_overview_outlier_detection>` if a detector is installed. Currently
:ref:`success rate average<arch_overview_outlier_detection_ejection_event_logging_cluster_success_rate_average>`,
and :ref:`ejection threshold<arch_overview_outlier_detection_ejection_event_logging_cluster_success_rate_ejection_threshold>`
are presented. Both of these values could be ``-1`` if there was not enough data to calculate them in the last
:ref:`interval<config_cluster_manager_cluster_outlier_detection_interval_ms>`.

Per host statistics
.. csv-table::
:header: Name, Type, Description
:widths: 1, 1, 2

cx_total, Counter, Total connections
cx_active, Gauge, Total active connections
cx_connect_fail, Counter, Total connection failures
rq_total, Counter, Total requests
rq_timeout, Counter, Total timed out requests
rq_active, Gauge, Total active requests
healthy, String, The health status of the host. See below
weight, Integer, Load balancing weight (1-100)
zone, String, Service zone
canary, Boolean, Whether the host is a canary
success_rate, Double, "Request success rate (0-100). -1 if there was not enough
:ref:`request volume<config_cluster_manager_cluster_outlier_detection_success_rate_request_volume>`
Review comment (Member): Did you actually look at these docs rendered? This is almost definitely not correct and will look broken.

Reply (Member Author): Yeah, I rendered them and they look correct.

in the :ref:`interval<config_cluster_manager_cluster_outlier_detection_interval_ms>`
to calculate it"

Host health status
A host is either healthy or unhealthy because of one or more different failing health states.
72 changes: 49 additions & 23 deletions include/envoy/upstream/outlier_detection.h
Expand Up @@ -48,34 +48,16 @@ class DetectorHostSink {
* @return the last time this host was unejected, if the host has been unejected previously.
*/
virtual const Optional<SystemTime>& lastUnejectionTime() PURE;
};

typedef std::unique_ptr<DetectorHostSink> DetectorHostSinkPtr;

enum class EjectionType { Consecutive5xx, SuccessRate };

/**
* Sink for outlier detection event logs.
*/
class EventLogger {
public:
virtual ~EventLogger() {}

/**
* Log an ejection event.
* @param host supplies the host that generated the event.
* @param type supplies the type of the event.
* @return the success rate of the host in the last calculated interval, in the range 0-100.
* -1 means that the host did not have enough request volume to calculate success rate
* or the cluster did not have enough hosts to run through success rate outlier ejection.
*/
virtual void logEject(HostDescriptionConstSharedPtr host, EjectionType type) PURE;

/**
* Log an unejection event.
* @param host supplies the host that generated the event.
*/
virtual void logUneject(HostDescriptionConstSharedPtr host) PURE;
virtual double successRate() const PURE;
};

typedef std::shared_ptr<EventLogger> EventLoggerSharedPtr;
typedef std::unique_ptr<DetectorHostSink> DetectorHostSinkPtr;

/**
* Interface for an outlier detection engine. Uses per host data to determine which hosts in a
Expand All @@ -95,9 +77,53 @@ class Detector {
* changes state (either ejected or brought back in) due to outlier status.
*/
virtual void addChangedStateCb(ChangeStateCb cb) PURE;

/**
* Returns the average success rate of the hosts in the Detector for the last aggregation
* interval.
* @return the average success rate, or -1 if there were not enough hosts with enough request
* volume to proceed with success rate based outlier ejection.
*/
virtual double successRateAverage() const PURE;

/**
* Returns the success rate threshold used in the last interval. The threshold is used to eject
* hosts based on their success rate.
* @return the threshold, or -1 if there were not enough hosts with enough request volume to
* proceed with success rate based outlier ejection.
*/
virtual double successRateEjectionThreshold() const PURE;
};

typedef std::shared_ptr<Detector> DetectorSharedPtr;

enum class EjectionType { Consecutive5xx, SuccessRate };

/**
* Sink for outlier detection event logs.
*/
class EventLogger {
public:
virtual ~EventLogger() {}

/**
* Log an ejection event.
* @param host supplies the host that generated the event.
* @param detector supplies the detector that is doing the ejection.
* @param type supplies the type of the event.
* @param enforced is true if the ejection took place; false, if only logging took place.
*/
virtual void logEject(HostDescriptionConstSharedPtr host, Detector& detector, EjectionType type,
bool enforced) PURE;

/**
* Log an unejection event.
* @param host supplies the host that generated the event.
*/
virtual void logUneject(HostDescriptionConstSharedPtr host) PURE;
};

typedef std::shared_ptr<EventLogger> EventLoggerSharedPtr;
Review comment (Contributor): nit: new line after type def


} // Outlier
} // Upstream
7 changes: 7 additions & 0 deletions include/envoy/upstream/upstream.h
Expand Up @@ -4,6 +4,7 @@
#include "envoy/network/connection.h"
#include "envoy/ssl/context.h"
#include "envoy/upstream/load_balancer_type.h"
#include "envoy/upstream/outlier_detection.h"
#include "envoy/upstream/resource_manager.h"

namespace Upstream {
Expand Down Expand Up @@ -300,6 +301,12 @@ class Cluster : public virtual HostSet {
*/
virtual ClusterInfoConstSharedPtr info() const PURE;

/**
* @return a pointer to the cluster's outlier detector. If an outlier detector has not been
* installed, returns a nullptr.
*/
virtual const Outlier::Detector* outlierDetector() const PURE;
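
Because `outlierDetector()` may return a nullptr, and the two new `Detector` accessors return -1 when there was not enough data, a caller such as the `/clusters` admin handler has three cases to distinguish. The following is a minimal sketch of that check, not the code from this PR: the `Detector` class is a cut-down stand-in for `Outlier::Detector`, and `FixedDetector`, `describeOutlierDetector`, and the returned strings are hypothetical names invented for illustration.

```cpp
#include <string>

// Hypothetical stand-in exposing only the two accessors this PR adds to
// Outlier::Detector.
class Detector {
public:
  virtual ~Detector() {}
  virtual double successRateAverage() const = 0;
  virtual double successRateEjectionThreshold() const = 0;
};

// Hypothetical detector that returns fixed values, for illustration only.
class FixedDetector : public Detector {
public:
  FixedDetector(double average, double threshold)
      : average_(average), threshold_(threshold) {}
  double successRateAverage() const override { return average_; }
  double successRateEjectionThreshold() const override { return threshold_; }

private:
  const double average_;
  const double threshold_;
};

// Classifies the detector state. The wording of the returned strings is
// invented for this sketch and is not Envoy output.
std::string describeOutlierDetector(const Detector* detector) {
  // No detector installed on the cluster.
  if (detector == nullptr) {
    return "no outlier detector installed";
  }
  // -1 is the documented sentinel for "not enough data in the last interval".
  if (detector->successRateAverage() < 0 ||
      detector->successRateEjectionThreshold() < 0) {
    return "not enough data in the last interval";
  }
  return "success rate data available";
}
```

Checking for nullptr first keeps the sentinel test from dereferencing a missing detector; treating any negative value as the sentinel is a simplification of this sketch.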

/**
* Initialize the cluster. This will be called either immediately at creation or after all primary
* clusters have been initialized (determined via initializePhase()).
75 changes: 61 additions & 14 deletions source/common/upstream/outlier_detection_impl.cc
Expand Up @@ -79,7 +79,8 @@ DetectorImpl::DetectorImpl(const Cluster& cluster, const Json::Object& json_conf
: config_(json_config), dispatcher_(dispatcher), runtime_(runtime), time_source_(time_source),
stats_(generateStats(cluster.info()->statsScope())),
interval_timer_(dispatcher.createTimer([this]() -> void { onIntervalTimer(); })),
event_logger_(event_logger) {}
event_logger_(event_logger), success_rate_average_(-1), success_rate_ejection_threshold_(-1) {
}

DetectorImpl::~DetectorImpl() {
for (auto host : host_sinks_) {
Expand Down Expand Up @@ -185,7 +186,11 @@ void DetectorImpl::ejectHost(HostSharedPtr host, EjectionType type) {
runCallbacks(host);

if (event_logger_) {
event_logger_->logEject(host, type);
event_logger_->logEject(host, *this, type, true);
}
} else {
if (event_logger_) {
event_logger_->logEject(host, *this, type, false);
}
}
} else {
Expand Down Expand Up @@ -240,7 +245,7 @@ void DetectorImpl::onConsecutive5xxWorker(HostSharedPtr host) {
// outliers.
const double Utility::SUCCESS_RATE_STDEV_FACTOR = 1.9;

double Utility::successRateEjectionThreshold(
Utility::EjectionPair Utility::successRateEjectionThreshold(
double success_rate_sum, const std::vector<HostSuccessRatePair>& valid_success_rate_hosts) {
// This function is using mean and standard deviation as statistical measures for outlier
// detection. First the mean is calculated by dividing the sum of success rate data over the
Expand All @@ -266,7 +271,7 @@ double Utility::successRateEjectionThreshold(
variance /= valid_success_rate_hosts.size();
double stdev = std::sqrt(variance);

return mean - (SUCCESS_RATE_STDEV_FACTOR * stdev);
return {mean, (mean - (SUCCESS_RATE_STDEV_FACTOR * stdev))};
}
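
The mean and threshold computation above can be sketched as a standalone function. This is an illustrative version under stated assumptions, not the code from this PR: `EjectionPair` here is a local stand-in whose field names mirror the ones used in the diff, `kSuccessRateStdevFactor` copies the value of `SUCCESS_RATE_STDEV_FACTOR`, the function takes plain success rates instead of `HostSuccessRatePair` objects, and the variance loop is an assumption because that part of the hunk is collapsed in this view.

```cpp
#include <cmath>
#include <vector>

// Local stand-in for Utility::EjectionPair; field names mirror the diff.
struct EjectionPair {
  double success_rate_average_;
  double ejection_threshold_;
};

// Same value as Utility::SUCCESS_RATE_STDEV_FACTOR in the diff.
const double kSuccessRateStdevFactor = 1.9;

// Returns the mean success rate and the ejection threshold
// (mean - factor * population standard deviation).
// Assumes success_rates is non-empty.
EjectionPair successRateEjectionThreshold(const std::vector<double>& success_rates) {
  double sum = 0;
  for (double rate : success_rates) {
    sum += rate;
  }
  const double mean = sum / success_rates.size();

  // Population variance: average of squared deviations from the mean.
  double variance = 0;
  for (double rate : success_rates) {
    const double diff = rate - mean;
    variance += diff * diff;
  }
  variance /= success_rates.size();
  const double stdev = std::sqrt(variance);

  return {mean, mean - (kSuccessRateStdevFactor * stdev)};
}
```

For five hosts with success rates 100, 100, 100, 100, and 50, the mean is 90 and the standard deviation is 20, so the threshold is 90 - 1.9 * 20 = 52. The host at 50 falls below the threshold and would be a candidate for ejection.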

void DetectorImpl::processSuccessRateEjections() {
Expand All @@ -277,6 +282,10 @@ void DetectorImpl::processSuccessRateEjections() {
std::vector<HostSuccessRatePair> valid_success_rate_hosts;
double success_rate_sum = 0;

// Reset the Detector's success rate average and ejection threshold.
success_rate_average_ = -1;
success_rate_ejection_threshold_ = -1;

// Exit early if there are not enough hosts.
if (host_sinks_.size() < success_rate_minimum_hosts) {
return;
Expand All @@ -286,7 +295,6 @@ void DetectorImpl::processSuccessRateEjections() {
valid_success_rate_hosts.reserve(host_sinks_.size());

for (const auto& host : host_sinks_) {
host.second->updateCurrentSuccessRateBucket();
// Don't do work if the host is already ejected.
if (!host.first->healthFlagGet(Host::HealthFlag::FAILED_OUTLIER_CHECK)) {
Optional<double> host_success_rate =
Expand All @@ -296,15 +304,18 @@ void DetectorImpl::processSuccessRateEjections() {
valid_success_rate_hosts.emplace_back(
HostSuccessRatePair(host.first, host_success_rate.value()));
success_rate_sum += host_success_rate.value();
host.second->successRate(host_success_rate.value());
}
}
}

if (valid_success_rate_hosts.size() >= success_rate_minimum_hosts) {
double ejection_threshold =
Utility::EjectionPair ejection_pair =
Utility::successRateEjectionThreshold(success_rate_sum, valid_success_rate_hosts);
success_rate_average_ = ejection_pair.success_rate_average_;
success_rate_ejection_threshold_ = ejection_pair.ejection_threshold_;
for (const auto& host_success_rate_pair : valid_success_rate_hosts) {
if (host_success_rate_pair.success_rate_ < ejection_threshold) {
if (host_success_rate_pair.success_rate_ < success_rate_ejection_threshold_) {
stats_.ejections_success_rate_.inc();
ejectHost(host_success_rate_pair.host_, EjectionType::SuccessRate);
}
Expand All @@ -317,6 +328,12 @@ void DetectorImpl::onIntervalTimer() {

for (auto host : host_sinks_) {
checkHostForUneject(host.first, host.second, now);

// Need to update the writer bucket to keep the data valid.
host.second->updateCurrentSuccessRateBucket();
// Refresh host success rate stat for the /clusters endpoint. If there is a new valid value, it
// will get updated in processSuccessRateEjections().
host.second->successRate(-1);
}

processSuccessRateEjections();
Expand All @@ -330,25 +347,55 @@ void DetectorImpl::runCallbacks(HostSharedPtr host) {
}
}

void EventLoggerImpl::logEject(HostDescriptionConstSharedPtr host, EjectionType type) {
void EventLoggerImpl::logEject(HostDescriptionConstSharedPtr host, Detector& detector,
EjectionType type, bool enforced) {
// TODO(mattklein123): Log friendly host name (e.g., instance ID or DNS name).
// clang-format off
static const std::string json =
static const std::string json_5xx =
std::string("{{") +
"\"time\": \"{}\", " +
"\"secs_since_last_action\": \"{}\", " +
"\"cluster\": \"{}\", " +
"\"upstream_url\": \"{}\", " +
"\"action\": \"eject\", " +
"\"type\": \"{}\", " +
"\"num_ejections\": {}" +
"\"num_ejections\": \"{}\", " +
"\"enforced\": \"{}\"" +
"}}\n";

static const std::string json_success_rate =
std::string("{{") +
"\"time\": \"{}\", " +
"\"secs_since_last_action\": \"{}\", " +
"\"cluster\": \"{}\", " +
"\"upstream_url\": \"{}\", " +
"\"action\": \"eject\", " +
"\"type\": \"{}\", " +
"\"num_ejections\": \"{}\", " +
"\"enforced\": \"{}\", " +
"\"host_success_rate\": \"{}\", " +
"\"cluster_average_success_rate\": \"{}\", " +
"\"cluster_success_rate_ejection_threshold\": \"{}\"" +
"}}\n";
// clang-format on
SystemTime now = time_source_.currentSystemTime();
file_->write(fmt::format(json, AccessLogDateTimeFormatter::fromTime(now),
secsSinceLastAction(host->outlierDetector().lastUnejectionTime(), now),
host->cluster().name(), host->address()->asString(), typeToString(type),
host->outlierDetector().numEjections()));

switch (type) {
case EjectionType::Consecutive5xx:
file_->write(fmt::format(json_5xx, AccessLogDateTimeFormatter::fromTime(now),
secsSinceLastAction(host->outlierDetector().lastUnejectionTime(), now),
host->cluster().name(), host->address()->asString(),
typeToString(type), host->outlierDetector().numEjections(), enforced));
break;
case EjectionType::SuccessRate:
file_->write(fmt::format(json_success_rate, AccessLogDateTimeFormatter::fromTime(now),
secsSinceLastAction(host->outlierDetector().lastUnejectionTime(), now),
host->cluster().name(), host->address()->asString(),
typeToString(type), host->outlierDetector().numEjections(), enforced,
host->outlierDetector().successRate(), detector.successRateAverage(),
detector.successRateEjectionThreshold()));
Review comment (Member): add break here for next person that comes along.

break;
}
}

void EventLoggerImpl::logUneject(HostDescriptionConstSharedPtr host) {