Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver/allocator: remove dupe logging and fix load delta min #98866

Closed

Conversation

kvoli
Copy link
Collaborator

@kvoli kvoli commented Mar 17, 2023

Previously, the store rebalancer would log load-based replica transfers successfully brought... even the store was not below the max threshold.
The store rebalancer also logs when there are no available replica
rebalancing actions remaining but the store still above the desired load
threshold: ran out of replicas worth transferring and load. This log
line was duplicated in both the post range rebalancing phase and in
exiting the range rebalancing phase.

This commit stops duplication by removing the exit log line and ensures
that the load-based replica transfers successfully brought... log line
occurs only when actually successful.

There were a few references to old variable values, or in terms of QPS
in the load based best target logic. The logic also was duplicate and
non-uniform in approach. The domain store list was also incorrectly
including all stores in the store map passed in, which is incorrect as
the domain of possible stores should be just the candidate+existing
stores. The domain store list is used to calculate the mean, so
including additional stores that are not candidates (only occurred for
lease transfers) results in a poor comparison point.

This commit updates the comments within bestStoreToMinimizeDelta to be
more accurate. The logic is also slightly updated to reuse functions.
This function is one of the most important allocator logic for load
based target scoring.

The incorrect store list creation is also updated, so that the store
list contains only the candidate+existing store.

Previously, the store rebalancer would attempt to transfer leases or
relocate replicas of a range so long as the load of the range (estimated
using the local leaseholder replica) was greater than 0.1% of the
store's load.

This commit updates the value to be higher for both lease rebalancing
and range rebalancing, specifically:

lease rebalancing: 0.1% -> 0.5%
range rebalancing: 0.1% -> 2.0%

The documented rationale, as for why these values exist is also the
reason for the increase: decrease unnecessary churn. Only the hottest
ranges are considered, when the impact of a rebalance action for these
ranges is estimated to be less than a small fraction of the store's
total load, it is indicative that the distribution of replica load is
not skewed. Letting the replicate queue balance replica and lease counts
is a better option in such distributions.

Making the minimum for range rebalancing higher is also added, as
multiple replicas for a range is inherently more costly than moving a
lease.

Resolves: #98867
Resolves: #98870

Release note: None

@blathers-crl
Copy link

blathers-crl bot commented Mar 17, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@kvoli kvoli force-pushed the 230316.store-rebalancer-cleanup branch from 36728a5 to c404bf6 Compare March 17, 2023 16:44
@kvoli kvoli changed the title kvserver: remove duplicate logging in s-rebalancer kvserver/allocator: remove dupe logging and harden Mar 17, 2023
kvoli added 3 commits March 17, 2023 16:48
Previously, the store rebalancer would log `load-based replica transfers
successfully brought...` even when the store was not below the max
threshold. The store rebalancer also logs when there are no available
replica rebalancing actions remaining but the store still above the
desired load threshold: `ran out of replicas worth transferring and
load`. This log line was duplicated in both the post range rebalancing
phase and in exiting the range rebalancing phase.

This commit stops duplication by removing the exit log line and ensures
that the `load-based replica transfers successfully brought...` log line
occurs only when actually successful.

Resolves: cockroachdb#98867

Release note: None
There were a few references to old variable values, or in terms of QPS
in the load based best target logic. The logic also was duplicate and
non-uniform in approach. The domain store list was also incorrectly
including all stores in the store map passed in, which is incorrect as
the domain of possible stores should be just the candidate+existing
stores. The domain store list is used to calculate the mean, so
including additional stores that are not candidates (only occurred for
lease transfers) results in a poor comparison point.

This commit updates the comments within `bestStoreToMinimizeDelta` to be
more accurate. The logic is also slightly updated to reuse functions.
This function is one of the most important allocator logic for load
based target scoring.

The incorrect store list creation is also updated, so that the store
list contains only the candidate+existing store.

Resolves: cockroachdb#98870

Release note: None
Previously, the store rebalancer would attempt to transfer leases or
relocate replicas of a range so long as the load of the range (estimated
using the local leaseholder replica) was greater than 0.1% of the
store's load.

This commit updates the value to be higher for both lease rebalancing
and range rebalancing, specifically:

lease rebalancing: 0.1% -> 0.5%
range rebalancing: 0.1% -> 2.0%

The documented rationale, as for why these values exist is also the
reason for the increase: decrease unnecessary churn. Only the hottest
ranges are considered, when the impact of a rebalance action for these
ranges is estimated to be less than a small fraction of the store's
total load, it is indicative that the distribution of replica load is
not skewed. Letting the replicate queue balance replica and lease counts
is a better option in such distributions.

Making the minimum for range rebalancing higher is also added, as
multiple replicas for a range is inherently more costly than moving a
lease.

Release note: None
@kvoli kvoli force-pushed the 230316.store-rebalancer-cleanup branch from c404bf6 to 9561afb Compare March 17, 2023 17:10
@kvoli kvoli changed the title kvserver/allocator: remove dupe logging and harden kvserver/allocator: remove dupe logging and fix load delta min Mar 17, 2023
@kvoli kvoli self-assigned this Mar 17, 2023
@kvoli kvoli closed this Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants