release-23.2.11-rc: kvserver: enqueue decom ranges at an interval behind a setting #130413

Merged


kvoli
Collaborator

@kvoli kvoli commented Sep 10, 2024

Backport 2/2 commits from #130117 on behalf of @kvoli.

/cc @cockroachdb/release


Introduce the ranges.decommissioning gauge metric, which represents
the number of ranges with at least one replica on a decommissioning
node.

The metric is reported by the leaseholder, or if there is no valid
leaseholder, the first live replica in the descriptor, similar to
(under|over)-replication metrics.

The metric can be used to approximately identify the distribution of
decommissioning work remaining across nodes, as the leaseholder replica
is responsible for triggering the replacement of decommissioning
replicas for its own range.

Informs: #130085
Release note (ops change): The ranges.decommissioning metric is added,
representing the number of ranges which have a replica on a
decommissioning node.
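
The counting rule described above can be illustrated with a minimal Go sketch. The types and functions here (`NodeID`, `Replica`, `Range`, `reporter`, `decommissioningGauge`) are hypothetical stand-ins invented for illustration, not kvserver's actual types or code. The sketch only shows the idea that each range is counted once, by its leaseholder or, failing that, by the first live replica in the descriptor.

```go
package main

import "fmt"

// NodeID, Replica and Range are hypothetical, simplified stand-ins.
type NodeID int

type Replica struct {
	Node NodeID
	Live bool
}

type Range struct {
	Replicas    []Replica // in descriptor order
	Leaseholder NodeID    // 0 means "no valid leaseholder" in this sketch
}

// reporter picks the node that reports metrics for r: the leaseholder if
// there is one, otherwise the first live replica in the descriptor.
func reporter(r Range) (NodeID, bool) {
	if r.Leaseholder != 0 {
		return r.Leaseholder, true
	}
	for _, rep := range r.Replicas {
		if rep.Live {
			return rep.Node, true
		}
	}
	return 0, false
}

// decommissioningGauge returns, per reporting node, the number of ranges
// with at least one replica on a decommissioning node.
func decommissioningGauge(ranges []Range, decommissioning map[NodeID]bool) map[NodeID]int {
	counts := make(map[NodeID]int)
	for _, r := range ranges {
		hasDecom := false
		for _, rep := range r.Replicas {
			if decommissioning[rep.Node] {
				hasDecom = true
				break
			}
		}
		if !hasDecom {
			continue
		}
		// Count the range once, on the node responsible for reporting it.
		if n, ok := reporter(r); ok {
			counts[n]++
		}
	}
	return counts
}

func main() {
	ranges := []Range{
		{Replicas: []Replica{{1, true}, {2, true}, {3, true}}, Leaseholder: 1},
		{Replicas: []Replica{{3, false}, {2, true}, {4, true}}}, // no leaseholder
		{Replicas: []Replica{{1, true}, {2, true}, {4, true}}, Leaseholder: 2},
	}
	// Node 3 is decommissioning in this example.
	fmt.Println(decommissioningGauge(ranges, map[NodeID]bool{3: true}))
}
```

In this example, nodes 1 and 2 each report one range. That matches the intended use of the metric as a rough per-node view of the decommissioning work still to be done.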


When kv.enqueue_in_replicate_queue_on_problem.interval is set to a
positive value, leaseholder replicas of ranges which have
decommissioning replicas will be enqueued into the replicate queue once
per interval.

When kv.enqueue_in_replicate_queue_on_problem.interval is set to 0,
no enqueueing on decommissioning will take place, outside of the
regular replica scanner.

A recommended value for users enabling the enqueue (non-zero) is 15
minutes, e.g.,

SET CLUSTER SETTING
kv.enqueue_in_replicate_queue_on_problem.interval = '900s';

Resolves: #130085
Release note (ops change): The ranges.decommissioning metric is added,
representing the number of ranges which have a replica on a
decommissioning node.


Release justification: Low-risk observability change, plus a behavior change that is disabled by default and, when enabled, alleviates a class of decommission stalls.

Introduce the `ranges.decommissioning` gauge metric, which represents
the number of ranges with at least one replica on a decommissioning
node.

The metric is reported by the leaseholder, or if there is no valid
leaseholder, the first live replica in the descriptor, similar to
(under|over)-replication metrics.

The metric can be used to approximately identify the distribution of
decommissioning work remaining across nodes, as the leaseholder replica
is responsible for triggering the replacement of decommissioning
replicas for its own range.

Informs: cockroachdb#130085
Release note (ops change): The `ranges.decommissioning` metric is added,
representing the number of ranges which have a replica on a
decommissioning node.

When `kv.enqueue_in_replicate_queue_on_problem.interval` is set to a
positive value, leaseholder replicas of ranges which have
decommissioning replicas will be enqueued into the replicate queue once
per interval.

When `kv.enqueue_in_replicate_queue_on_problem.interval` is set to 0,
no enqueueing on decommissioning will take place, outside of the regular
replica scanner.

A recommended value for users enabling the enqueue (non-zero) is at
least 15 minutes, e.g.,

```
SET CLUSTER SETTING
kv.enqueue_in_replicate_queue_on_problem.interval = '900s';
```
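
The gating described above can be sketched as a small Go predicate. `shouldEnqueue` and `recommendedInterval` are hypothetical names invented for illustration and do not reflect how kvserver implements the check. Treating negative intervals as disabled is also an assumption, since the description only covers zero and positive values.

```go
package main

import (
	"fmt"
	"time"
)

// recommendedInterval mirrors the 15 minute ('900s') value suggested above.
const recommendedInterval = 15 * time.Minute

// shouldEnqueue is a hypothetical helper: it reports whether a replica
// should be enqueued into the replicate queue outside of the regular
// replica scanner.
func shouldEnqueue(
	interval, sinceLastEnqueue time.Duration,
	isLeaseholder, hasDecommissioningReplica bool,
) bool {
	// An interval of 0 disables the behavior entirely (assumption: negative
	// values are treated the same way).
	if interval <= 0 {
		return false
	}
	// Only leaseholder replicas of ranges with decommissioning replicas
	// qualify.
	if !isLeaseholder || !hasDecommissioningReplica {
		return false
	}
	// Enqueue at most once per interval.
	return sinceLastEnqueue >= interval
}

func main() {
	// Disabled setting: never enqueue on decommissioning.
	fmt.Println(shouldEnqueue(0, time.Hour, true, true))
	// Enabled and a full interval has elapsed: enqueue.
	fmt.Println(shouldEnqueue(recommendedInterval, 15*time.Minute, true, true))
	// Enabled but only 5 minutes since the last enqueue: wait.
	fmt.Println(shouldEnqueue(recommendedInterval, 5*time.Minute, true, true))
}
```

Because the zero value short-circuits before any other check, a cluster that never sets the setting sees no change in behavior, in line with the opt-in requirement for backports.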

Resolves: cockroachdb#130085
Informs: cockroachdb#130199
Release note: None
@kvoli kvoli added the backport label Sep 10, 2024
@kvoli kvoli self-assigned this Sep 10, 2024
@kvoli kvoli requested a review from a team as a code owner September 10, 2024 13:11

blathers-crl bot commented Sep 10, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious issues or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL. For more information as to how that review should be conducted, please consult the backport policy.

If your backport adds new functionality, please ensure that the following additional criteria are satisfied:

  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this backport.

@cockroach-teamcity
Member

This change is Reviewable

@kvoli
Collaborator Author

kvoli commented Sep 10, 2024

The Extended CI failure is TestPlanDataDriven, which is unrelated to this change. The release branch is also not yet frozen, so I'll go ahead and merge. TYFTR!

@kvoli kvoli merged commit c70f434 into cockroachdb:release-23.2.11-rc Sep 10, 2024
5 of 6 checks passed

3 participants