kvserver: replicate queue should be more responsive #106101

Open
erikgrinaker opened this issue Jul 4, 2023 · 1 comment
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Jul 4, 2023

The replicate queue often takes a very long time to correct problems. For example, as seen in #106100, if a lease is picked up outside of the lease preferences, it can take many minutes before the problem is corrected. This tends to be the case for most policies enforced by the replicate queue.

We should make the queue more responsive. A few random ideas:

  • Eagerly enqueue ranges in response to cluster events, such as nodes going offline/online, lease movement, zone config changes, etc.
  • Speed up the enqueue rate (consider removing the 10-minute scanner interval and relying instead on a minimum interval between enqueues of the same range, if necessary).
  • Increase queue concurrency, possibly dynamically based on node CPUs.
  • Make better use of prioritization, so that important actions execute sooner.
  • Add a cluster setting to control the processing rate.
  • Add a builtin to enqueue all ranges in a specific queue.
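
To make a few of these ideas concrete, here is a minimal Go sketch of event-driven enqueueing with a per-range minimum interval and priority ordering. It is not the actual kvserver queue code: `eagerQueue`, `maybeEnqueue`, `next`, `demoOrder`, and the priority values are all hypothetical names and numbers made up for illustration.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// RangeID is a hypothetical stand-in for a range identifier.
type RangeID int

// item is a queued range; a higher priority is processed sooner.
type item struct {
	id       RangeID
	priority float64
}

// itemHeap is a max-heap of items ordered by priority.
type itemHeap []item

func (h itemHeap) Len() int           { return len(h) }
func (h itemHeap) Less(i, j int) bool { return h[i].priority > h[j].priority }
func (h itemHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *itemHeap) Push(x any)        { *h = append(*h, x.(item)) }
func (h *itemHeap) Pop() any {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

// eagerQueue accepts enqueues triggered by cluster events and
// rate-limits them per range instead of relying on a periodic scan.
type eagerQueue struct {
	minInterval time.Duration
	lastEnqueue map[RangeID]time.Time
	pending     itemHeap
}

func newEagerQueue(minInterval time.Duration) *eagerQueue {
	return &eagerQueue{
		minInterval: minInterval,
		lastEnqueue: map[RangeID]time.Time{},
	}
}

// maybeEnqueue adds the range unless it was already enqueued less than
// minInterval ago. It reports whether the range was added. For brevity
// it does not deduplicate a range that is still pending after the
// interval has elapsed.
func (q *eagerQueue) maybeEnqueue(id RangeID, priority float64, now time.Time) bool {
	if last, ok := q.lastEnqueue[id]; ok && now.Sub(last) < q.minInterval {
		return false
	}
	q.lastEnqueue[id] = now
	heap.Push(&q.pending, item{id: id, priority: priority})
	return true
}

// next pops the highest-priority pending range, if any.
func (q *eagerQueue) next() (RangeID, bool) {
	if q.pending.Len() == 0 {
		return 0, false
	}
	return heap.Pop(&q.pending).(item).id, true
}

// demoOrder simulates three events and returns the processing order.
func demoOrder() []RangeID {
	q := newEagerQueue(10 * time.Second)
	now := time.Unix(0, 0)
	q.maybeEnqueue(1, 1.0, now)                  // e.g. routine rebalancing
	q.maybeEnqueue(2, 5.0, now)                  // e.g. lease preference violation
	q.maybeEnqueue(2, 5.0, now.Add(time.Second)) // dropped: within minInterval
	q.maybeEnqueue(3, 3.0, now)                  // e.g. node came back online
	var order []RangeID
	for id, ok := q.next(); ok; id, ok = q.next() {
		order = append(order, id)
	}
	return order
}

func main() {
	fmt.Println(demoOrder())
}
```

In this sketch the range with the highest priority is processed first regardless of enqueue order, and the repeated enqueue of range 2 is rejected by the minimum-interval check. A real implementation would also need concurrency control and a bound on queue size, which are left out here.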

Jira issue: CRDB-29400

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-distribution Relating to rebalancing and leasing. T-kv KV Team labels Jul 4, 2023
@andrewbaptist
Collaborator

A few things to note here. The replicate queue could run a lot faster, but a few things hold it back:

  1. The replicate queue currently runs over both leases and replicas. This is not really necessary, as the checks and handling are quite different. Replicas are expensive to move, so taking minutes to scan over them is usually fine, since most of the time is spent transferring them, not finding them.
  2. Leases are generally moved for three reasons: 1) load, 2) imbalance, 3) preference violation. Load-based moves are already quick and imbalance moves do not need to be fast, but preference-violation moves are slow because they use the imbalance mechanism. As mentioned in the first point, if we had a separate mechanism to handle imbalance vs. constraint violation, the constraint-violation check could be triggered against every range as soon as any constraint changes.
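
As a rough illustration of the second point, a dedicated preference check could be as cheap as the Go sketch below, which scans every range and returns only those whose leaseholder sits outside the preferred regions. The `rangeState` type, its fields, the function names, and the region names are hypothetical and do not correspond to real kvserver types; actual lease preferences are expressed as constraints in zone configs rather than a list of regions.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeState is a hypothetical, simplified view of a range: where its
// lease currently is and which regions are preferred for the lease.
type rangeState struct {
	leaseRegion      string
	preferredRegions []string
}

// violatesLeasePreference reports whether the lease sits outside all
// preferred regions. A range with no preferences never violates them.
func violatesLeasePreference(r rangeState) bool {
	if len(r.preferredRegions) == 0 {
		return false
	}
	for _, p := range r.preferredRegions {
		if p == r.leaseRegion {
			return false
		}
	}
	return true
}

// preferenceViolations checks every range and returns the IDs that
// should be enqueued right away, independent of any imbalance scan.
func preferenceViolations(ranges map[int]rangeState) []int {
	var ids []int
	for id, r := range ranges {
		if violatesLeasePreference(r) {
			ids = append(ids, id)
		}
	}
	sort.Ints(ids) // map iteration order is random; sort for stable output
	return ids
}

// demoViolations runs the check over a small made-up set of ranges.
func demoViolations() []int {
	ranges := map[int]rangeState{
		1: {leaseRegion: "us-east", preferredRegions: []string{"us-east"}},
		2: {leaseRegion: "us-west", preferredRegions: []string{"us-east"}},
		3: {leaseRegion: "eu-west"},
		4: {leaseRegion: "eu-west", preferredRegions: []string{"us-east", "us-west"}},
	}
	return preferenceViolations(ranges)
}

func main() {
	fmt.Println(demoViolations())
}
```

Because the check only compares metadata and moves no data, it could plausibly run on every constraint change, while the slower imbalance scan keeps its existing cadence.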
