Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: support checking allocator action and target by range #94024

Merged
merged 1 commit into from
Jan 10, 2023

Conversation

AlexTalks
Copy link
Contributor

@AlexTalks AlexTalks commented Dec 21, 2022

This change exposes support via a store for checking the allocator
action and upreplication target (if applicable) for any range descriptor.
The range does not need to have a replica on the given store, nor is it
required to evaluate given the current state of the cluster (i.e. the
store's configured StorePool), as a store pool override can be
provided in order to evaluate possible future states.

Depends on #94114.

Part of #91570.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@AlexTalks AlexTalks force-pushed the dpf_split_check_range_action branch 2 times, most recently from e71b3e7 to 0242526 Compare December 21, 2022 01:59
@AlexTalks AlexTalks force-pushed the dpf_split_check_range_action branch from 0242526 to 715b268 Compare December 22, 2022 06:39
@AlexTalks AlexTalks marked this pull request as ready for review December 22, 2022 06:41
@AlexTalks AlexTalks requested a review from a team as a code owner December 22, 2022 06:41
@AlexTalks AlexTalks requested a review from a team December 22, 2022 06:41
@AlexTalks AlexTalks requested a review from a team as a code owner December 22, 2022 06:41
@AlexTalks AlexTalks requested a review from kvoli December 22, 2022 06:42
@AlexTalks AlexTalks force-pushed the dpf_split_check_range_action branch from 715b268 to 3e2c134 Compare December 22, 2022 22:42
@dhartunian dhartunian removed the request for review from a team January 3, 2023 15:31
@AlexTalks AlexTalks force-pushed the dpf_split_check_range_action branch from 3e2c134 to b931b0f Compare January 6, 2023 02:32
@AlexTalks AlexTalks force-pushed the dpf_split_check_range_action branch from b931b0f to f6dde83 Compare January 7, 2023 09:28
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Jan 7, 2023
This adds support for the evaluation of the decommission readiness of a
node (or set of nodes), by simulating their liveness to have the
DECOMMISSIONING status and utilizing the allocator to ensure that we are
able to perform any actions needed to repair the range. This supports a
"strict" mode, in which case we expect all ranges to only need
replacement or removal due to the decommissioning status, or a more
permissive "non-strict" mode, which allows for other actions needed, as
long as they do not encounter errors in finding a suitable allocation
target. The non-strict mode allows us to permit situations where a range
may have more than one action needed to repair it, such as a range that
needs to reach its replication factor before the decommissioning replica
can be replaced, or a range that needs to finalize an atomic replication
change.

Depends on cockroachdb#94024.

Part of cockroachdb#91571

Release note: None
Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 8 of 8 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @AlexTalks)

@AlexTalks

This comment was marked as duplicate.

@craig

This comment was marked as duplicate.

@AlexTalks

This comment was marked as duplicate.

@AlexTalks AlexTalks force-pushed the dpf_split_check_range_action branch from f6dde83 to b89e59d Compare January 10, 2023 03:30
@AlexTalks

This comment was marked as duplicate.

@AlexTalks

This comment was marked as duplicate.

@craig

This comment was marked as duplicate.

This change exposes support via a store for checking the allocator
action and upreplication target (if applicable) for any range descriptor.
The range does not need to have a replica on the given store, nor is it
required to evaluate given the current state of the cluster (i.e. the
store's configured StorePool), as a store pool override can be
provided in order to evaluate possible future states.

Depends on cockroachdb#94114.

Part of cockroachdb#91570.

Release note: None
@AlexTalks AlexTalks force-pushed the dpf_split_check_range_action branch from b89e59d to 670761d Compare January 10, 2023 04:37
@AlexTalks
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 10, 2023

This PR was included in a batch that successfully built, but then failed to merge into master. It will not be retried.

Additional information:

@AlexTalks
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 10, 2023

Build failed:

@AlexTalks
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 10, 2023

Build succeeded:

@craig craig bot merged commit 4643153 into cockroachdb:master Jan 10, 2023
@AlexTalks AlexTalks deleted the dpf_split_check_range_action branch January 10, 2023 08:46
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Jan 10, 2023
This adds support for the evaluation of the decommission readiness of a
node (or set of nodes), by simulating their liveness to have the
DECOMMISSIONING status and utilizing the allocator to ensure that we are
able to perform any actions needed to repair the range. This supports a
"strict" mode, in which case we expect all ranges to only need
replacement or removal due to the decommissioning status, or a more
permissive "non-strict" mode, which allows for other actions needed, as
long as they do not encounter errors in finding a suitable allocation
target. The non-strict mode allows us to permit situations where a range
may have more than one action needed to repair it, such as a range that
needs to reach its replication factor before the decommissioning replica
can be replaced, or a range that needs to finalize an atomic replication
change.

Depends on cockroachdb#94024.

Part of cockroachdb#91568

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Jan 19, 2023
This adds support for the evaluation of the decommission readiness of a
node (or set of nodes), by simulating their liveness to have the
DECOMMISSIONING status and utilizing the allocator to ensure that we are
able to perform any actions needed to repair the range. This supports a
"strict" mode, in which case we expect all ranges to only need
replacement or removal due to the decommissioning status, or a more
permissive "non-strict" mode, which allows for other actions needed, as
long as they do not encounter errors in finding a suitable allocation
target. The non-strict mode allows us to permit situations where a range
may have more than one action needed to repair it, such as a range that
needs to reach its replication factor before the decommissioning replica
can be replaced, or a range that needs to finalize an atomic replication
change.

Depends on cockroachdb#94024.

Part of cockroachdb#91568

Release note: None
craig bot pushed a commit that referenced this pull request Jan 20, 2023
…95564 #95583 #95606

90222: server: add api for decommission pre-flight checks r=AlexTalks a=AlexTalks

While we have an API for checking the status of an in-progress decommission, we did not previously have an API to execute sanity checks prior to requesting a node to move into the `DECOMMISSIONING` state. This adds an API to do just that, intended to be called by the CLI prior to issuing a subsequent `Decommission` RPC request.

Fixes #91568.

Release note: None

91458: roachtest/mixed-versions-compat: use corpus for master r=fqazi a=fqazi

Informs: #91350

The CI build scripts take advantage of the branch name to uplaod the corpus to GCS. Unfortunately, we have no way of know if the current build is master inside the roachtest. To address this, this patch supports fetching the master corpus as a fallback.

Release note: None

92826: multiregionccl: add a cold start latency test  r=ajwerner a=ajwerner

This commit adds a test which creates an MR serverless cluster and then
boots the sql pods in each region while disallowing connectivity to other
regions. It also simulates latency to make sure the routing logic works
and to provide a somewhat realistic picture of what to expect.

Epic: CRDB-18596

Release note: None

93758: server: evaluate decommission pre-checks r=kvoli a=AlexTalks

This adds support for the evaluation of the decommission readiness of a
node (or set of nodes), by simulating their liveness to have the
DECOMMISSIONING status and utilizing the allocator to ensure that we are
able to perform any actions needed to repair the range. This supports a
"strict" mode, in which case we expect all ranges to only need
replacement or removal due to the decommissioning status, or a more
permissive "non-strict" mode, which allows for other actions needed, as
long as they do not encounter errors in finding a suitable allocation
target. The non-strict mode allows us to permit situations where a range
may have more than one action needed to repair it, such as a range that
needs to reach its replication factor before the decommissioning replica
can be replaced, or a range that needs to finalize an atomic replication
change.

Depends on #94024.

Part of #91568

95007: admission: CPU slot adjustment and utilization metrics r=irfansharif a=sumeerbhola

Our existing metrics are gauges (total and used slots) which don't give us insight into what is happening at smaller time scales. This creates uncertainty when we observe admission queueing but the gauge samples show total slots consistenly greater than used slots. Additionally, if total slots is steady during queuing, it doesn't tell us whether that was because of roughly matching increments or decrements, or no increments/decrements.

The following metrics are added:
- admission.granter.slots_exhausted_duration.kv: cumulative duration when the slots were exhausted. This can give insight into how much exhaustion was occurring. It is insufficient to tell us whether 0.5sec/sec of exhaustion is due to a long 500ms of exhaustion and then non-exhaustion or alternating 1ms of exhaustion and non-exhaustion. But this is an improvement over what we have.
- admission.granter.slot_adjuster_{increments,decrements}.kv: Counts the increments and decrements of the total slots.
- admission.granter.cpu_load_{short,long}_period_duration.kv: cumulative duration of short and long ticks, as indicated by the period in the CPULoad callback. We don't expect long period ticks when admission control is active (and we explicitly disable enforcement during long period ticks), but it helps us eliminate some hypothesis during incidents (e.g. long period ticks alternating with short period ticks causing a slow down in how fast we increment slots). Additionally, the sum of the rate of these two, if significantly < 1, would indicate that CPULoad frequency is lower than expected, say due to CPU overload.

Fixes #92673

Epic: none

Release note: None

95145: sql/stats: include partial stats in results of statsCache.GetTableStats r=rytaft a=michae2

We were not including partial stats in the list of table statistics returned by `statsCache.GetTableStats`. This was fine for the optimizer, which currently cannot use partial stats directly, but it was a problem for backup.

We'd like to use partial stats directly in the optimizer eventually, so this commit goes ahead and adds them to the results of `GetTableStats`. The optimizer then must filter them out. To streamline this we add some helper functions.

Finally, in an act of overzealous refactoring, this commit also changes `MergedStatistics` and `ForecastTableStatistics` to accept partial statistics and full statistics mixed together in the same input list. This simplifies the code that calls these functions.

Fixes: #95056
Part of: #93983

Epic: CRDB-19449

Release note: None

95387: kvserver: separate loadstats cpu nanos to raft/req r=andrewbaptist a=kvoli

Previously, loadstats tracked replica raft/request cpu nanos per second separately but returned both summed together in `load.ReplicaLoadStats`. This patch separates `RaftCPUNanosPerSecond` and
`RequestCPUNanosPerSecond` in the returned `load.ReplicaLoadStats` so that they may be used independently.

Informs #95380

Release note: None

95557: sql: Fix testing_optimizer_disable_rule_probability usage with vtables r=cucaroach a=cucaroach

If a vtable scan query tries to use the dummy "0" column the exec
builder errors out, this typically won't happen thanks to prune columns
normalization rules and those rules are marked as "essential" but the
logic allowing those rules to be applied was flawed.

Epic: CRDB-20535
Informs: #94890
Release note: None


95559: rpc: fix comment r=andreimatei a=andreimatei

This copy-pasta comment was mentioning a KV node, which was not right.

Release note: None
Epic: None

95564: cpustopwatch: s/grunning.Difference/grunning.Elapsed r=irfansharif a=irfansharif

`grunning.Elapsed()` is the API to use when measuring the running time spent doing some piece of work, with measurements from the start and end. This only exists due to `grunning.Time()`'s non-monotonicity, a bug in our runtime patch: #95529. The bug results in slight {over,under}-estimation of the running time (the latter breaking monotonicity), but is livable with our current uses of this library, including the one here in cpustopwatch. `grunning.Elapsed()` papers over this bug by 0-ing out when `grunning.Time()`stamps regress. This is unlike `grunning.Difference()` which would return the absolute value of the regression -- not what we want here.

Release note: None

95583: sql: fix cluster setting propagation flake take 2 r=cucaroach a=cucaroach

Previously we tried to fix this with one retry but that was
insufficient.  Extend it to all queries in this section of the test.

Release note: None
Epic: CRDB-20535


95606: backupccl: deflake TestScheduleChainingLifecycle r=adityamaru a=msbutler

This patch will skip the test if the machine clock is close to midnight, and increases the frequency of incremental backups to run every 2 minutes.

Previously, the backup schedule in this test used the following crontab recurrence: '*/5 * * * *'. In english, this means "run a full backup now, and then run a full backup every day at midnight, and an incremental every 5 minutes. This test also relies on an incremental backup running after the first full backup. But, what happens if the first full backup gets scheduled to run within 5 minutes of midnight? A second full backup may get scheduled before the expected incremental backup, breaking the invariant the test expects.

Fixes #88575 #95186

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Feb 6, 2023
This change adds tests using `store.AllocatorCheckRange(..)` to
investigate the behavior or attempting to decommission a node when doing
so would cause constraints to be broken. This test is needed because it
was uncovered recently that a partially constrained range (i.e. a range
configured with `num_replicas = 3, constraint = '{<some constraint>:
1}'` may not have the constraint fully respected if the only node that
satisfies said constraint is being decommissioned.

Part of cockroachdb#94809

Depends on cockroachdb#94024.

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Feb 8, 2023
This change adds tests using `store.AllocatorCheckRange(..)` to
investigate the behavior or attempting to decommission a node when doing
so would cause constraints to be broken. This test is needed because it
was uncovered recently that a partially constrained range (i.e. a range
configured with `num_replicas = 3, constraint = '{<some constraint>:
1}'` may not have the constraint fully respected if the only node that
satisfies said constraint is being decommissioned.

Part of cockroachdb#94809

Depends on cockroachdb#94024.

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Feb 9, 2023
This change adds tests using `store.AllocatorCheckRange(..)` to
investigate the behavior or attempting to decommission a node when doing
so would cause constraints to be broken. This test is needed because it
was uncovered recently that a partially constrained range (i.e. a range
configured with `num_replicas = 3, constraint = '{<some constraint>:
1}'` may not have the constraint fully respected if the only node that
satisfies said constraint is being decommissioned.

Part of cockroachdb#94809

Depends on cockroachdb#94024.

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Feb 15, 2023
This change adds tests using `store.AllocatorCheckRange(..)` to
investigate the behavior or attempting to decommission a node when doing
so would cause constraints to be broken. This test is needed because it
was uncovered recently that a partially constrained range (i.e. a range
configured with `num_replicas = 3, constraint = '{<some constraint>:
1}'` may not have the constraint fully respected if the only node that
satisfies said constraint is being decommissioned.

Part of cockroachdb#94809

Depends on cockroachdb#94024.

Release note: None
craig bot pushed a commit that referenced this pull request Feb 15, 2023
94810: kvserver: fix constraint analysis on replica replacement r=AlexTalks a=AlexTalks

This changes the allocator to be fully aware of the replica that is
beingn replaced when allocating new target replicas due to a dead or
decommissioning node. This ensures that if constraints were respected
before the dead or decommissioning node left the cluster, they will
still be respected afterwards, particularly in the case at issue, which
is when partial constraints are set on a range (e.g. `num_replicas = 3,
<some_constraint>: 1`). This is achieved by rejecting candidate
stores to allocate the replacement on when they do not satisfy a
constraint that was satisfied by the existing store. In doing so, this
means that some node decommissions that would previously be able to
execute will now not allow the user to decommission the node and violate
the configured constraints.

Fixes #94809.

Depends on #94024.

Release note (bug fix): Decommissions that would violate constraints set
on a subset of replicas for a range (e.g. `num_replicas = 3,
<constraint>: 1`) will no longer be able to execute, respecting
constraints during and after the decommission.

Co-authored-by: Alex Sarkesian <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants