Support slow Start mode in Envoy #13176

Merged
merged 75 commits into from
Sep 30, 2021
4173b08
Support slow Start mode in Envoy
Sep 18, 2020
2f8dad0
Support slow Start mode in Envoy
Sep 18, 2020
627c910
Introduce creation_time field into host description
Sep 22, 2020
ed27cb7
Propagate timeSource to edf lb
Sep 30, 2020
80cd8eb
Fix weight adjustment formula
Oct 1, 2020
161cbaf
Draft: Track hosts in slow start mode in edf lb
Oct 1, 2020
d7f395c
Track hosts in slow start mode in edf lb
Oct 5, 2020
944607e
Merge remote-tracking branch 'origin/master'
Nov 27, 2020
f966f8b
switch to btree_set for tracking hosts in slow start
Dec 8, 2020
f61216c
Fix logical statement
Dec 8, 2020
6bfb2e0
Parametrize time bias
Dec 10, 2020
a02698d
Fix logger inheritance
Dec 14, 2020
23e517e
Add config validation for slow start in orig_dst_cluster lb
Dec 16, 2020
a4f697d
Merge remote-tracking branch 'origin/master'
Dec 16, 2020
d0f2cd2
Fix logic when tracking hosts in slow start
Jan 22, 2021
3ab3951
Adding tests
Feb 3, 2021
e5a8534
Add support for "first passing HC" slow start mode
Feb 9, 2021
e9e93ea
Fix comparator, add test for runtime updates
Feb 10, 2021
a365f6e
Merge remote-tracking branch 'origin/main'
Feb 15, 2021
2156dc3
Cleanup
Feb 15, 2021
fe0e551
Fix CI
Feb 16, 2021
38f792a
Fix CI
Feb 16, 2021
bbc3fda
Fix more CI
Feb 17, 2021
7d1cdb4
Some docs, some CI fixes...
Feb 17, 2021
18f0463
Fix clang
Feb 18, 2021
3cf6f9a
Update documentation
Feb 19, 2021
1510abb
Revert extra formatting
Feb 22, 2021
c038daf
Apply review comments
Mar 8, 2021
c4b8f8b
Apply review comments
Mar 8, 2021
bd87893
Fix format
Mar 8, 2021
0963656
Merge remote-tracking branch 'origin/main' into main
Mar 8, 2021
33737f8
Fix spelling in docs
Mar 10, 2021
0cbdbe7
Fix spelling
Mar 10, 2021
43b2f54
Fix build, apply some review comments
Mar 15, 2021
4e8b9d7
Get rid of endpoint warming policy
Mar 29, 2021
78be70e
Remove unused import
Mar 29, 2021
dc1bb99
Fix tests, clarify docs
Mar 30, 2021
5d5d231
Clarify docs
Mar 30, 2021
fdbbd5f
remove extra space
Mar 30, 2021
b371ece
Apply review comment and fix build
Apr 7, 2021
1602a7b
Update formula, docs and clean up
Apr 16, 2021
bf32ee5
Update API+docs with new formula
Apr 27, 2021
9d96d4b
Merge remote-tracking branch 'origin/main'
Apr 28, 2021
f1670a9
Introduce aggression parameter
Apr 28, 2021
7a495e0
Fix docs format
Apr 29, 2021
941a43e
Fix math bug and add basic test
Apr 29, 2021
3467ca4
add more tests
Apr 29, 2021
bd467d6
Apply review comments, finish tests for RR
May 5, 2021
7d8022d
Slow start support in LR and initial test
May 6, 2021
49cd453
More tests for LR slow start
May 11, 2021
c6f2b86
Refactor duplicated code
May 17, 2021
514dabf
Update slow start example table
May 18, 2021
96d7b76
Bump memory limit per cluster
May 19, 2021
6a98431
Merge remote-tracking branch 'origin/main'
May 19, 2021
74557b9
Applied review comments
Aug 16, 2021
1a23da6
Merge remote-tracking branch 'origin/main' into HEAD
Aug 27, 2021
3e4f49a
Fix merge errors
Aug 27, 2021
4a2a508
Fix weird formatting
Aug 27, 2021
875c763
Fix proto and extra formatting
Aug 27, 2021
ccc9338
Move out slow start config from common lb config
Aug 27, 2021
19d288d
Apply more comments and fix some tests
Sep 3, 2021
2766a4f
fix doc and format
Sep 3, 2021
a2b1261
Fix mock default behaviour
Sep 6, 2021
5e18212
Update diagram with example
Sep 6, 2021
2e9d0ff
fix asan
Sep 8, 2021
2002d00
Bump memory limit
Sep 10, 2021
2128535
Apply review comment
Sep 10, 2021
224daa2
Fix graph and spelling in docs
Sep 10, 2021
b3c5c43
Merge remote-tracking branch 'origin/main' into HEAD
Sep 15, 2021
4d3efe7
apply review comments
Sep 15, 2021
3ed1ff4
Fix doc format
Sep 16, 2021
e4a3c84
Apply review comments
Sep 28, 2021
5c587e9
Merge branch 'main' into slow-start
Sep 28, 2021
7f4b258
fix format
Sep 28, 2021
9ed50d9
Fix merge error
Sep 30, 2021
17 changes: 16 additions & 1 deletion api/envoy/config/cluster/v3/cluster.proto
@@ -440,7 +440,7 @@ message Cluster {
}

// Common configuration for all load balancer implementations.
- // [#next-free-field: 8]
+ // [#next-free-field: 9]
message CommonLbConfig {
option (udpa.annotations.versioning).previous_message_type =
"envoy.api.v2.Cluster.CommonLbConfig";
@@ -508,6 +508,18 @@ message Cluster {
google.protobuf.UInt32Value hash_balance_factor = 2 [(validate.rules).uint32 = {gte: 100}];
}

enum EndpointWarmingPolicy {
NO_WAIT = 0;
Review comment (Member): Please add comment to enum values.

WAIT_FOR_FIRST_PASSING_HC = 1;
}

// Configuration for slow start mode.
Review comment (Member): Can you write some Envoy docs for this and link from here? I'd suggest translating the design doc into RST and then cleaning that up a bit for end users.

// [#next-free-field: 3]
message SlowStartConfig {
google.protobuf.UInt32Value slow_start_window = 1;
Review comment (Member): Please add comment to fields
Reply (Member Author): Thanks for the review, @htuch. I will fix the API and docs once the PR is in a more mature state.

EndpointWarmingPolicy endpoint_warming_policy = 2;
}

// Configures the :ref:`healthy panic threshold <arch_overview_load_balancing_panic_threshold>`.
// If not specified, the default is 50%.
// To disable panic mode, set to 0%.
@@ -565,6 +577,9 @@ message Cluster {

// Common Configuration for all consistent hashing load balancers (MaglevLb, RingHashLb, etc.)
ConsistentHashingLbConfig consistent_hashing_lb_config = 7;

// Configuration for slow start mode.
mattklein123 marked this conversation as resolved.
SlowStartConfig slow_start_config = 8;
}

message RefreshRate {
17 changes: 16 additions & 1 deletion api/envoy/config/cluster/v4alpha/cluster.proto
@@ -445,8 +445,20 @@ message Cluster {
bool use_http_header = 1;
}

enum EndpointWarmingPolicy {
WAIT_FOR_FIRST_PASSING_HC = 0;
NO_WAIT = 1;
}

// Configuration for slow start mode.
// [#next-free-field: 3]
message SlowStartConfig {
google.protobuf.UInt32Value slow_start_window = 1;
EndpointWarmingPolicy endpoint_warming_policy = 2;
}

// Common configuration for all load balancer implementations.
- // [#next-free-field: 8]
+ // [#next-free-field: 9]
message CommonLbConfig {
option (udpa.annotations.versioning).previous_message_type =
"envoy.config.cluster.v3.Cluster.CommonLbConfig";
@@ -571,6 +583,9 @@ message Cluster {

// Common Configuration for all consistent hashing load balancers (MaglevLb, RingHashLb, etc.)
ConsistentHashingLbConfig consistent_hashing_lb_config = 7;

// Configuration for slow start mode.
SlowStartConfig slow_start_config = 8;
}

message RefreshRate {
17 changes: 16 additions & 1 deletion generated_api_shadow/envoy/config/cluster/v3/cluster.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions include/envoy/upstream/host_description.h
@@ -148,6 +148,11 @@ class HostDescription {
* Set the current priority.
*/
virtual void priority(uint32_t) PURE;

/**
* @return timestamp in milliseconds of when host was created.
*/
virtual const uint64_t creationTimeMs() const PURE;
};

using HostDescriptionConstSharedPtr = std::shared_ptr<const HostDescription>;
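
To make the purpose of `creationTimeMs()` concrete, here is a minimal sketch of how a load balancer might derive a host's age and compare it with a slow start window. `HostStub`, `hostAgeMs`, `inSlowStart`, and `hostAgeExample` are hypothetical names used only for illustration; they are not part of this PR, and the real `HostDescription` interface is much larger.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Upstream::HostDescription. Only the accessor
// introduced in this hunk is modelled.
struct HostStub {
  uint64_t creation_time_ms;
  uint64_t creationTimeMs() const { return creation_time_ms; }
};

// Age of the host in milliseconds. Clamped to 0 when the clock reads earlier
// than the creation time, so the unsigned subtraction cannot wrap around.
uint64_t hostAgeMs(const HostStub& host, uint64_t now_ms) {
  const uint64_t created = host.creationTimeMs();
  return now_ms >= created ? now_ms - created : 0;
}

// A host counts as "in slow start" while its age is below the window.
// A window of 0 is treated as "slow start disabled" (an assumption).
bool inSlowStart(const HostStub& host, uint64_t now_ms, uint64_t window_ms) {
  return window_ms > 0 && hostAgeMs(host, now_ms) < window_ms;
}

// Usage: a host created at t = 1000 ms with a 60 s window.
void hostAgeExample() {
  const HostStub host{1000};
  assert(inSlowStart(host, 2000, 60000));   // 1 s old: still warming up
  assert(!inSlowStart(host, 61000, 60000)); // 60 s old: window has elapsed
}
```

Note that the stub overrides later in this diff all return 0, so every host would look equally old until a real timestamp is wired in.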
3 changes: 3 additions & 0 deletions source/common/upstream/edf_scheduler.h
@@ -73,6 +73,9 @@ template <class C> class EdfScheduler {
*/
bool empty() const { return queue_.empty(); }

// todo(nezdolik) this needs to be integer
double currentTimeMs() const { return current_time_; }

private:
struct EdfEntry {
double deadline_;
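
For readers unfamiliar with earliest-deadline-first scheduling, the following toy scheduler shows why a lower weight means fewer picks, which is the mechanism slow start relies on. It is a sketch under the assumption that each entry's deadline is the current virtual time plus `1 / weight`; `ToyEdfScheduler` and its members are hypothetical and much simpler than the `EdfScheduler` template touched here. The scheduler's clock is a virtual time, not wall-clock milliseconds, which may be relevant to the `todo` on `currentTimeMs()` above.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Toy earliest-deadline-first scheduler (hypothetical, illustration only).
class ToyEdfScheduler {
public:
  // Inserts an entry whose deadline is current virtual time + 1/weight.
  void add(double weight, const std::string& name) {
    assert(weight > 0.0);
    queue_.push(Entry{current_time_ + 1.0 / weight, order_++, weight, name});
  }

  // Pops the entry with the earliest deadline, advances virtual time to that
  // deadline, and re-inserts the entry with its unchanged weight.
  std::string pick() {
    assert(!queue_.empty());
    const Entry entry = queue_.top();
    queue_.pop();
    current_time_ = entry.deadline;
    add(entry.weight, entry.name);
    return entry.name;
  }

  double currentTime() const { return current_time_; }

private:
  struct Entry {
    double deadline;
    unsigned long order; // insertion order, used to break deadline ties
    double weight;
    std::string name;
    bool operator>(const Entry& other) const {
      return deadline != other.deadline ? deadline > other.deadline : order > other.order;
    }
  };

  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> queue_;
  double current_time_ = 0.0;
  unsigned long order_ = 0;
};
```

With weights 3 and 1, the heavier entry is picked three times as often as the lighter one over a full cycle, so scaling a new host's weight down directly reduces its share of requests.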
24 changes: 19 additions & 5 deletions source/common/upstream/load_balancer_impl.cc
@@ -66,6 +66,11 @@ bool hostWeightsAreEqual(const HostVector& hosts) {
return true;
}

bool noHostsAreInSlowStart() {
// todo(nezdolik) fix this
return true;
}

} // namespace

std::pair<uint32_t, LoadBalancerBase::HostAvailability>
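
The `noHostsAreInSlowStart()` placeholder above always returns `true`. One way the check could eventually look, sketched with plain structs rather than Envoy types, is shown below. `HostRecord`, the extra parameters, and `slowStartScanExample` are assumptions for illustration; the commit list suggests the PR later tracks such hosts in a dedicated set instead of scanning them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a host; only the creation timestamp is modelled.
struct HostRecord {
  uint64_t creation_time_ms;
};

// Returns true when no host is younger than the slow start window.
// A window of 0 is treated as "slow start disabled" (an assumption).
bool noHostsAreInSlowStart(const std::vector<HostRecord>& hosts, uint64_t now_ms,
                           uint64_t window_ms) {
  if (window_ms == 0) {
    return true;
  }
  // Comparing against creation + window avoids subtracting unsigned values.
  return std::none_of(hosts.begin(), hosts.end(), [&](const HostRecord& host) {
    return now_ms < host.creation_time_ms + window_ms;
  });
}

// Usage: two hosts created at 0 ms and 1000 ms, window of 3000 ms.
void slowStartScanExample() {
  const std::vector<HostRecord> hosts{{0}, {1000}};
  assert(noHostsAreInSlowStart(hosts, 5000, 3000));  // both older than 3 s
  assert(!noHostsAreInSlowStart(hosts, 3500, 3000)); // second host is 2.5 s old
}
```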
@@ -720,10 +725,11 @@ void EdfLoadBalancerBase::refresh(uint32_t priority) {
auto& scheduler = scheduler_[source] = Scheduler{};
refreshHostSource(source);

- // Check if the original host weights are equal and skip EDF creation if they are. When all
- // original weights are equal we can rely on unweighted host pick to do optimal round robin and
- // least-loaded host selection with lower memory and CPU overhead.
- if (hostWeightsAreEqual(hosts)) {
+ // Check if the original host weights are equal and no hosts are in slow start mode, in that
+ // case EDF creation is skipped. When all original weights are equal and no hosts are in slow
+ // start mode we can rely on unweighted host pick to do optimal round robin and least-loaded
+ // host selection with lower memory and CPU overhead.
+ if (hostWeightsAreEqual(hosts) && noHostsAreInSlowStart()) {
nezdolik marked this conversation as resolved.
// Skip edf creation.
return;
}
@@ -736,11 +742,19 @@ void EdfLoadBalancerBase::refresh(uint32_t priority) {
// We should probably change this to refresh at all times. See the comment in
// BaseDynamicClusterImpl::updateDynamicHostList about this.
for (const auto& host : hosts) {
+ auto host_weight = hostWeight(*host);
+ // todo(nezdolik) propagate slow_start_config and endpoint_warming_policy to edf lb base, add
+ // abs to formula
+ if (scheduler.edf_->currentTimeMs() - host->creationTimeMs() > 60) {
+ // todo(nezdolik) parametrize this
+ host_weight *= 0.1;
+ }
+
  // We use a fixed weight here. While the weight may change without
  // notification, this will only be stale until this host is next picked,
  // at which point it is reinserted into the EdfScheduler with its new
  // weight in chooseHost().
- scheduler.edf_->add(hostWeight(*host), host);
+ scheduler.edf_->add(host_weight, host);
}

// Cycle through hosts to achieve the intended offset behavior.
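
The weight adjustment in this hunk is still a draft: the window (60) and the factor (0.1) are hard-coded, and the comparison reduces the weight of hosts older than the threshold, which appears to be the opposite of the slow start intent (a later commit in the list is titled "Fix logical statement"). The sketch below shows one plausible shape for the intended behaviour, assuming a linear ramp with a 10% floor. The function name, the ramp, and the floor are illustrative assumptions, not the formula the PR finally adopted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Scales a host's configured weight while the host is inside the slow start
// window. Linear ramp and 10% floor are assumptions made for this sketch.
double slowStartWeight(double configured_weight, uint64_t age_ms, uint64_t window_ms) {
  assert(configured_weight > 0.0);
  if (window_ms == 0 || age_ms >= window_ms) {
    return configured_weight; // slow start disabled or already finished
  }
  const double min_factor = 0.1; // mirrors the hard-coded 0.1 in the draft
  const double time_factor = static_cast<double>(age_ms) / static_cast<double>(window_ms);
  return configured_weight * std::max(min_factor, time_factor);
}
```

For a host with weight 10 and a 60 s window, this yields roughly 1 at creation, 5 after 30 s, and the full 10 once the window has elapsed.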
24 changes: 21 additions & 3 deletions source/common/upstream/load_balancer_impl.h
@@ -461,7 +461,16 @@ class LeastRequestLoadBalancer : public EdfLoadBalancerBase,
least_request_config.has_value() && least_request_config->has_active_request_bias()
? std::make_unique<Runtime::Double>(least_request_config->active_request_bias(),
runtime)
- : nullptr) {
+ : nullptr),
+ // todo(nezdolik) move this to base class
+ endpoint_warming_policy(common_config.has_slow_start_config()
+ ? common_config.slow_start_config().endpoint_warming_policy()
+ : envoy::config::cluster::v3::Cluster::CommonLbConfig::NO_WAIT),
+ // todo(nezdolik) move this to base class
+ slow_start_window(common_config.has_slow_start_config()
+ ? PROTOBUF_GET_WRAPPED_OR_DEFAULT(common_config.slow_start_config(),
+ slow_start_window, 0)
+ : 0) {
initialize();
}
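
The initializers above resolve two optional settings to defaults: the policy falls back to `NO_WAIT` and the window to 0 when `slow_start_config` is absent or the wrapped value is unset. The same pattern is sketched below with `std::optional` standing in for protobuf presence checks. All type and function names here are hypothetical mirrors of the proto messages in this diff, not generated code, and the sketch assumes C++17.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical plain-struct mirrors of the proto definitions in this diff.
enum class EndpointWarmingPolicy { NoWait, WaitForFirstPassingHc };

struct SlowStartConfig {
  std::optional<uint32_t> slow_start_window; // unset wrapper means "use default"
  EndpointWarmingPolicy endpoint_warming_policy = EndpointWarmingPolicy::NoWait;
};

struct ResolvedSlowStart {
  uint32_t window;
  EndpointWarmingPolicy policy;
  bool enabled() const { return window > 0; }
};

// Collapses "config missing" and "field unset" into concrete defaults, as the
// two ternaries in the constructor do.
ResolvedSlowStart resolveSlowStart(const std::optional<SlowStartConfig>& config) {
  if (!config.has_value()) {
    return {0, EndpointWarmingPolicy::NoWait};
  }
  return {config->slow_start_window.value_or(0), config->endpoint_warming_policy};
}

// Usage: absent config disables slow start; an explicit window enables it.
void resolveSlowStartExample() {
  assert(!resolveSlowStart(std::nullopt).enabled());
  SlowStartConfig config;
  config.slow_start_window = 30;
  assert(resolveSlowStart(config).window == 30);
}
```

Resolving once in a single helper also addresses the `todo` about moving this logic to the base class, since both EDF-based balancers could share it.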

@@ -521,18 +530,27 @@ class LeastRequestLoadBalancer : public EdfLoadBalancerBase,
double active_request_bias_{};

const std::unique_ptr<Runtime::Double> active_request_bias_runtime_;
const envoy::config::cluster::v3::Cluster::CommonLbConfig::EndpointWarmingPolicy
endpoint_warming_policy;
const uint32_t slow_start_window;
};

/**
* Random load balancer that picks a random host out of all hosts.
*/
- class RandomLoadBalancer : public ZoneAwareLoadBalancerBase {
+ class RandomLoadBalancer : public ZoneAwareLoadBalancerBase,
+ Logger::Loggable<Logger::Id::upstream> {
  public:
  RandomLoadBalancer(const PrioritySet& priority_set, const PrioritySet* local_priority_set,
  ClusterStats& stats, Runtime::Loader& runtime, Random::RandomGenerator& random,
  const envoy::config::cluster::v3::Cluster::CommonLbConfig& common_config)
  : ZoneAwareLoadBalancerBase(priority_set, local_priority_set, stats, runtime, random,
- common_config) {}
+ common_config) {
+ if (common_config.has_slow_start_config()) {
Review comment (Member): Then does this slow-start config even belong in the common LB config? Seems to defeat the purpose. You ought to just add the slow-start config message to the supported LBs and avoid these checks.
Reply (Member Author): have not started this one
+ // todo(nezdolik) maybe use error status
+ ENVOY_LOG(warn, "Slow start mode is not supported for random lb");
+ }
+ }

// Upstream::LoadBalancerBase
HostConstSharedPtr chooseHostOnce(LoadBalancerContext* context) override;
7 changes: 7 additions & 0 deletions source/common/upstream/logical_host.h
@@ -57,6 +57,10 @@ class LogicalHost : public HostImpl {
return HostImpl::healthCheckAddress();
}

const uint64_t creationTimeMs() const override {
return 0;
}

private:
const Network::TransportSocketOptionsSharedPtr override_transport_socket_options_;
mutable absl::Mutex address_lock_;
@@ -104,6 +108,9 @@ class RealHostDescription : public HostDescription {
// checking.
NOT_IMPLEMENTED_GCOVR_EXCL_LINE;
}
const uint64_t creationTimeMs() const override {
return 0;
}
uint32_t priority() const override { return logical_host_->priority(); }
void priority(uint32_t) override { NOT_IMPLEMENTED_GCOVR_EXCL_LINE; }

3 changes: 3 additions & 0 deletions source/common/upstream/maglev_lb.cc
@@ -113,6 +113,9 @@ MaglevLoadBalancer::MaglevLoadBalancer(
if (!Primes::isPrime(table_size_)) {
throw EnvoyException("The table size of maglev must be prime number");
}
if (common_config.has_slow_start_config()) {
throw EnvoyException("Slow start mode is not supported for maglev lb");
}
}

MaglevLoadBalancerStats MaglevLoadBalancer::generateStats(Stats::Scope& scope) {
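
This diff repeats the same "not supported" check in several load balancers, throwing in Maglev, ring hash, and subset but only logging a warning in the random balancer. A single validation helper would make the behaviour uniform; the sketch below assumes that only round robin and least request support slow start, as the commit list and review comments suggest. `LbPolicy`, `supportsSlowStart`, `validateSlowStart`, and `validationExample` are hypothetical names, and `std::invalid_argument` stands in for `EnvoyException`.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical enumeration of the load balancers touched by this diff.
enum class LbPolicy { RoundRobin, LeastRequest, Random, RingHash, Maglev };

// Assumption: only the EDF-based balancers implement slow start.
bool supportsSlowStart(LbPolicy policy) {
  return policy == LbPolicy::RoundRobin || policy == LbPolicy::LeastRequest;
}

// Rejects a slow start config on any balancer that cannot honour it.
void validateSlowStart(LbPolicy policy, bool has_slow_start_config) {
  if (has_slow_start_config && !supportsSlowStart(policy)) {
    throw std::invalid_argument("Slow start mode is not supported for this lb");
  }
}

// Usage: round robin accepts the config, Maglev rejects it.
void validationExample() {
  validateSlowStart(LbPolicy::RoundRobin, true);
  bool rejected = false;
  try {
    validateSlowStart(LbPolicy::Maglev, true);
  } catch (const std::invalid_argument&) {
    rejected = true;
  }
  assert(rejected);
}
```

The reviewers' alternative, attaching the config message only to the balancers that support it, removes the need for such runtime checks altogether.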
11 changes: 9 additions & 2 deletions source/common/upstream/original_dst_cluster.h
@@ -51,8 +51,15 @@ class OriginalDstCluster : public ClusterImplBase {
*/
class LoadBalancer : public Upstream::LoadBalancer {
public:
- LoadBalancer(const std::shared_ptr<OriginalDstCluster>& parent)
- : parent_(parent), host_map_(parent->getCurrentHostMap()) {}
+ LoadBalancer(
+ const std::shared_ptr<OriginalDstCluster>&
+ parent /*, const envoy::config::cluster::v3::Cluster::CommonLbConfig& common_config*/)
+ : parent_(parent), host_map_(parent->getCurrentHostMap()) {
+ // todo(nezdolik) fix this
+ // if (common_config.has_slow_start_config()) {
+ // throw EnvoyException("Slow start mode is not supported for original dst lb");
+ // }
+ }

// Upstream::LoadBalancer
HostConstSharedPtr chooseHost(LoadBalancerContext* context) override;
3 changes: 3 additions & 0 deletions source/common/upstream/ring_hash_lb.cc
@@ -43,6 +43,9 @@ RingHashLoadBalancer::RingHashLoadBalancer(
throw EnvoyException(fmt::format("ring hash: minimum_ring_size ({}) > maximum_ring_size ({})",
min_ring_size_, max_ring_size_));
}
if (common_config.has_slow_start_config()) {
Review comment (Member): Yeah, I really don't think this belongs in the common config. Let's just add the message to the supported LBs so we don't need to do this.

throw EnvoyException("Slow start mode is not supported for ring hash lb");
}
}

RingHashLoadBalancerStats RingHashLoadBalancer::generateStats(Stats::Scope& scope) {
4 changes: 4 additions & 0 deletions source/common/upstream/subset_lb.cc
@@ -41,6 +41,10 @@ SubsetLoadBalancer::SubsetLoadBalancer(
scale_locality_weight_(subsets.scaleLocalityWeight()), list_as_any_(subsets.listAsAny()) {
ASSERT(subsets.isEnabled());

if (common_config.has_slow_start_config()) {
Review comment (Member): Subset + RoundRobin/LeastRequest is supported.
Reply (Member Author): fixed

throw EnvoyException("Slow start mode is not supported for subset lb");
}

if (fallback_policy_ != envoy::config::cluster::v3::Cluster::LbSubsetConfig::NO_FALLBACK) {
HostPredicate predicate;
if (fallback_policy_ == envoy::config::cluster::v3::Cluster::LbSubsetConfig::ANY_ENDPOINT) {
3 changes: 3 additions & 0 deletions source/common/upstream/upstream_impl.h
@@ -234,6 +234,9 @@ class HostImpl : public HostDescriptionImpl,
void weight(uint32_t new_weight) override;
bool used() const override { return used_; }
void used(bool new_used) override { used_ = new_used; }
const uint64_t creationTimeMs() const override {
return 0;
}

protected:
static Network::ClientConnectionPtr