Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: modernize AdminRelocateRange #75676

Conversation

aayushshah15
Copy link
Contributor

Previously, AdminRelocateRange's contract was to transfer the lease away to
the first voting replica in the input slice of targets. However, this was
mostly a result of the fact that leaseholder replicas weren't allowed to remove
themselves. Thus, the caller had to specifically tell AdminRelocateRange where
it wanted the lease to be, and the AdminRelocateRange logic would ensure to
transfer the lease away to this replica either before it removed the
leaseholder or after it was done performing all the replica relocations.

There are two internal callers of AdminRelocateRange, and this property of
the AdminRelocateRange API is unfortunate for both of them.

  1. The merge queue takes extra effort to avoid a lease transfer when calling
    into AdminRelocateRange.
    // AdminRelocateRange moves the lease to the first target in the list, so
    // sort the existing leaseholder there to leave it unchanged.
    lease, _ := lhsRepl.GetLease()
    for i := range voterTargets {
    if t := voterTargets[i]; t.NodeID == lease.Replica.NodeID && t.StoreID == lease.Replica.StoreID {
    if i > 0 {
    voterTargets[0], voterTargets[i] = voterTargets[i], voterTargets[0]
    }
    break
    }
    }
  2. The store rebalancer performs a lease transfer based on very naive logic
    that is not optimal.
    // Pick the voter with the least QPS to be leaseholder;
    // RelocateRange transfers the lease to the first provided target.
    //
    // TODO(aayush): Does this logic need to exist? This logic does not take
    // lease preferences into account. So it is already broken in a way.

With #74077, leaseholder replicas will be able to remove themselves by
transferring the lease away to a suitable replica after they enter the joint
configuration of the replication change. This patch makes it such that, going
forward, AdminRelocateRange will only perform an explicit lease transfer if
the caller specifically asks for it. The patch also contains a migration path
for this functionality.

Once #74077 lands, a future patch will change the store rebalancer and the
merge queue behavior to leverage this new functionality.

Release note: None

@aayushshah15 aayushshah15 requested a review from a team as a code owner January 28, 2022 22:38
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the 20220128_avoidLeaseTransferInAdminRelocateRange branch from 6400a7c to ac02157 Compare January 28, 2022 23:02
@aayushshah15
Copy link
Contributor Author

aayushshah15 commented Jan 28, 2022

@shralex and @nvanbenschoten: This patch does what I've individually spoken to both of you about. #74077 depends on this since without it, AdminRelocateRange is going to lead to redundant lease transfers (since in that patch, we're potentially doing lease transfers inside AdminChangeReplicas). Once that patch lands, I will change the merge queue and the store rebalancer behavior to set this newly added TransferLeaseToFirstVoter to false, which will be a good simplification since we can then address the 2 points in the PR description.

@aayushshah15 aayushshah15 force-pushed the 20220128_avoidLeaseTransferInAdminRelocateRange branch 4 times, most recently from eb783a4 to 66e981e Compare January 29, 2022 17:48
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 9 of 11 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)

@aayushshah15 aayushshah15 force-pushed the 20220128_avoidLeaseTransferInAdminRelocateRange branch from 66e981e to 6525423 Compare February 1, 2022 23:16
@aayushshah15
Copy link
Contributor Author

@nvanbenschoten, do you mind giving this another quick look?

The previous revision tried to redundantly transfer the lease to voterTargets[0] even if it had already done so before while relocating some replicas.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this new change. My understanding of relocateOne is that it should only return a lease target if either len(ops) == 0 or we are removing the current leaseholder. So I don't see how this code change makes a difference. Are you referring to code redundancy or operational redundancy?

I think this code would be easier to write/read if we either pulled the len(ops) == 0 handling entirely out of relocateOne, or pushed the transferLeaseToFirstVoter handling entirely into relocateOne.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)

@aayushshah15 aayushshah15 force-pushed the 20220128_avoidLeaseTransferInAdminRelocateRange branch from 6525423 to 0ff129c Compare February 2, 2022 16:40
@aayushshah15
Copy link
Contributor Author

My understanding of relocateOne is that it should only return a lease target if either len(ops) == 0 or we are removing the current leaseholder

Yep that's correct. The previous revision was attempting a lease transfer at the end (if transferLeaseToFirstVoter was true) even if leaseTarget was nil. This meant that in the latter scenario (i.e. if we'd already removed the current leaseholder before), we were attempting the lease transfer redundantly.

pushed the transferLeaseToFirstVoter handling entirely into relocateOne.

Done. Much better.

Previously, `AdminRelocateRange`'s contract was to transfer the lease away to
the first voting replica in the input slice of targets. However, this was
mostly a result of the fact that leaseholder replicas weren't allowed to remove
themselves. Thus, the caller had to specifically tell `AdminRelocateRange`
where it wanted the lease to be, and the `AdminRelocateRange` logic would
ensure to transfer the lease away to this replica either before it removed the
leaseholder or after it was done performing all the replica relocations.

There are two internal callers of `AdminRelocateRange`, and this property of
the `AdminRelocateRange` API is unfortunate for both of them.

1. The merge queue takes extra effort to avoid a lease transfer when calling
into `AdminRelocateRange`.

2. The store rebalancer performs a lease transfer based on very naive logic
that is not optimal.

With cockroachdb#74077, leaseholder replicas will be able to remove themselves by
transferring the lease away to a suitable replica after they enter the joint
configuration of the replication change. This patch makes it such that, going
forward, `AdminRelocateRange` will only perform an explicit lease transfer if
the caller specifically asks for it. The patch also contains a migration path
for this functionality.

Once cockroachdb#74077 lands, a future patch will change the store rebalancer and the
merge queue behavior to leverage this new functionality.

Release note: None
@aayushshah15 aayushshah15 force-pushed the 20220128_avoidLeaseTransferInAdminRelocateRange branch from 0ff129c to 949658b Compare February 2, 2022 16:54
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)

@aayushshah15
Copy link
Contributor Author

TFTR

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 2, 2022

Build succeeded:

@craig craig bot merged commit a0f0df7 into cockroachdb:master Feb 2, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Feb 3, 2022
…elocateRange`

This commit makes it such that the `mergeQueue` no longer has to take special
care to avoid a lease transfer when calling into `AdminRelocateRange`. This was
enabled by cockroachdb#75676.

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Feb 3, 2022
…elocateRange`

This commit makes it such that the `mergeQueue` no longer has to take special
care to avoid a lease transfer when calling into `AdminRelocateRange`. This was
enabled by cockroachdb#75676.

The preceding logic can be cleaned up in the 22.2 cycle.

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Feb 7, 2022
…ateRange`

This commit makes it such that the `mergeQueue` no longer has to take special
care to avoid a lease transfer when calling into `AdminRelocateRange`. This was
enabled by cockroachdb#75676.

The preceding logic can be cleaned up in the 22.2 cycle.

Release note: None
craig bot pushed a commit that referenced this pull request Feb 8, 2022
…76205

75541: rfc: kv lock observability via `crdb_locks` r=AlexTalks a=AlexTalks

_Link to [RFC text](https://github.com/AlexTalks/cockroach/blob/rfc_crdb_locks/docs/RFCS/20220104_crdb_locks.md)._

This RFC proposes a feature for observing locks (and contention on
locks) via a new virtual table called `crdb_internal.crdb_locks`.

Release note: None

75738: roachtest/follower_reads: unskip follower-reads/mixed-version/single-region r=andreimatei a=andreimatei

It should have been unskipped with the rest of the follower reads tests
when #69817 was closed.

Release note: None

75898: sql: consider virtual PK columns as stored in the optimizer r=mgartner a=mgartner

Allowing virtual primary key columns created a contradiction of two
hard-coded assumptions of the optimizer:

  1. Primary key columns are written to the primary index (and all
     secondary indexes).
  2. Virtual columns are not written to the primary index.

To prevent this contradiction from causing bugs while planning, the opt
catalog presents columns that are marked as virtual PK columns in the
descriptor as stored columns to the optimizer. This is valid because
virtual computed PK columns behave exactly like stored computed PK
columns: they are written to all indexes.

The test catalog has been updated to mimic this behavior.

Fixes #75147
Fixes #75874

Release note (bug fix): A bug has been fixed that caused internal errors
when querying tables with virtual columns in the primary key. This bug
was only present since version 22.1.0-alpha.1 and does not appear in any
production releases.

75912: util/tracing: don't return recordings for non-recording spans r=andreimatei a=andreimatei

This patch makes sp.GetRecording() return nil if the span was not
recording. This is already what the documentation on GetRecording() was
implying, but the actual behavior was different: a non-recording span
(that's not a no-op span) would return a non-nil recording consisting
only of the the span's metadata. This behavior is not necessarily
unreasonable, but it turns out it's a bit dangerous from a performance
perspective - there's a cost (e.g. some allocations) for generating this
recording, and it was too easy for unsuspecting callers to pay that when
they didn't want to. In particular, DistSQL was indiscriminently calling
GetRecording() (through its GetTraceData* utilities), and paying this
measurable cost unintentionally.

Eliminating this cost is particularly important when the Tracer is
configured in the "active spans registry" tracing mode; when not in that
mode, all the spans would be no-ops and the GetRecording calls were
shortcircuited anyway. This is seen in BenchmarkTracing//trace=on,
particularly on "3node/scan" which was the benchmark with the bigest
difference between tracing on and off.

name                              old time/op    new time/op    delta
Tracing/1node/scan/trace=on-32       214µs ± 4%     206µs ± 5%    ~     (p=0.222 n=5+5)
Tracing/1node/insert/trace=on-32     338µs ± 5%     334µs ± 4%    ~     (p=0.690 n=5+5)
Tracing/3node/scan/trace=on-32       515µs ± 3%     484µs ± 3%  -6.11%  (p=0.008 n=5+5)
Tracing/3node/insert/trace=on-32     635µs ± 1%     627µs ± 1%    ~     (p=0.095 n=5+5)

name                              old alloc/op   new alloc/op   delta
Tracing/1node/scan/trace=on-32      27.2kB ± 2%    25.6kB ± 2%  -5.94%  (p=0.008 n=5+5)
Tracing/1node/insert/trace=on-32    45.2kB ± 2%    45.5kB ± 2%    ~     (p=0.690 n=5+5)
Tracing/3node/scan/trace=on-32      83.2kB ± 3%    79.7kB ± 5%    ~     (p=0.095 n=5+5)
Tracing/3node/insert/trace=on-32     102kB ± 2%     103kB ± 1%    ~     (p=0.548 n=5+5)

name                              old allocs/op  new allocs/op  delta
Tracing/1node/scan/trace=on-32         242 ± 1%       229 ± 0%  -5.37%  (p=0.000 n=5+4)
Tracing/1node/insert/trace=on-32       338 ± 1%       339 ± 1%    ~     (p=0.563 n=5+5)
Tracing/3node/scan/trace=on-32         819 ± 1%       788 ± 6%    ~     (p=0.135 n=5+5)
Tracing/3node/insert/trace=on-32       871 ± 0%       880 ± 0%  +0.99%  (p=0.016 n=4+5)

Release note: None

75981: kvserver: clean-up the interaction between the mergeQueue and `AdminRelocateRange` r=aayushshah15 a=aayushshah15

This commit makes it such that the `mergeQueue` no longer has to take special
care to avoid a lease transfer when calling into `AdminRelocateRange`. This was
enabled by #75676.

Release note: None


75999: kvserver: stop reconciling lease count imbalances in the `StoreRebalancer` r=aayushshah15 a=aayushshah15

Generally, the `StoreRebalancer` doesn't try to attempt lease transfers for
replicas that account for less than 0.1% of the store's total QPS. However, it
allows such lease transfers if the store has a higher than average number of
leases. This is fraught because of the following reasons:

1. It doesn't adhere to any padding. It will do these tiny lease transfers
until the store has less than or equal to the average number of leases.
2. Additionally, "average" here means the average across the cluster, not just
the stores that have replicas for the range. This is clearly not good, as this
doesn't translate well to heterogeneously loaded clusters.
3. These lease transfers also don't adhere to
`kv.allocator.min_lease_transfer_interval`. This means they are not rate
limited like the lease transfers done by the replicateQueue.

This patch fixes this behavior. This patch is similar to #64559.

Noticed while running a tpcc workload on a multi-region cluster, but on a
database constrained to a single region. This was done to simulate a
"heterogeneously loaded cluster".

Release note: None


76078: opt: plan inner lookup joins on virtual column indexes in more cases r=mgartner a=mgartner

ExtractJoinEqualities now reuses computed columns instead of
synthesizing new columns when it creates projections that exactly match
a computed column expression of a base table. This allows
GenerateLookupJoinsWithVirtualCols to generate lookup joins in more
cases. This also paves the way for exploring  anti- and semi-lookup
joins on indexes with virtual columns and expression indexes.

Fixes #75872

Release note (performance improvement): The optimizer now plans inner
lookup joins using expression indexes in more cases, resulting in more
efficient query plans.

76193: optbuilder: do not create invalid casts when building CASE expressions r=mgartner a=mgartner

The optbuilder no longer creates invalid casts when building CASE
expressions that have branches of different types. CASE expressions that
previously caused internal errors now result in user-facing errors. A
similar change was made recently to UNION expressions in #75219.

Note that there is more work to be done to be fully consistent with
Postgres's CASE typing behavior, see #75365.

Fixes #75365

Release note (bug fix): CASE expressions with branches that result in
types that cannot be cast to a common type now result in a user-facing
error instead of an internal error.

76200: cloud: bump orchestrator to v21.2.5 r=celiala a=Azhng

Release note: None

76205: roachtest: update 22.1 version map to v21.2.5 r=celiala a=Azhng

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Azhng <[email protected]>
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022
…ateRange`

This commit makes it such that the `mergeQueue` no longer has to take special
care to avoid a lease transfer when calling into `AdminRelocateRange`. This was
enabled by cockroachdb#75676.

The preceding logic can be cleaned up in the 22.2 cycle.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants