kvserver: ignore leaseholder replica type in DistSender #85140
Conversation
Force-pushed from deadd68 to 8d4913b
Force-pushed from 8d4913b to 9677035
The DistSender could fail to prioritize a newly discovered leaseholder from a `NotLeaseHolderError` if the leaseholder had a non-`VOTER` replica type. Instead, it would continue to try replicas in order until possibly exhausting the transport and backing off, leading to increased tail latencies. This applies in particular to 22.1, where we allowed `VOTER_INCOMING` replicas to acquire the lease (see 22b4fb5).

The primary reason is that `grpcTransport.MoveToFront()` would fail to compare the new leaseholder replica descriptor with the one in its range descriptor. There are two reasons why this can happen:

1. `ReplicaDescriptor.ReplicaType` is a pointer, where the zero value `nil` is equivalent to `VOTER`. The equality comparison used in `MoveToFront()` is `==`, but pointer equality compares the memory address rather than the value.

2. The transport keeps using the range descriptor it was created with, and does not update it as we receive updated range descriptors. This means that the transport may e.g. have a `nil` replica type while the leaseholder has a `VOTER_INCOMING` replica type.

This patch fixes both issues by adding `ReplicaDescriptor.IsSame()`, which compares replica identities while ignoring the type.

Release note (bug fix): Fixed a bug where new leaseholders (with a `VOTER_INCOMING` type) would not always be detected properly during query execution, leading to occasional increased tail latencies due to unnecessary internal retries.
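To make the first failure mode concrete, here is a minimal, self-contained Go sketch. The types below are simplified stand-ins, not the actual `roachpb` definitions; it shows how `==` on a descriptor with a pointer-typed replica type compares pointers rather than semantic identity, and how an identity-only comparison avoids the problem.

```go
package main

import "fmt"

// ReplicaType is a simplified stand-in for the real enum; nil is treated as VOTER.
type ReplicaType int32

const (
	VOTER ReplicaType = iota
	VOTER_INCOMING
)

// ReplicaDescriptor is a simplified stand-in for roachpb.ReplicaDescriptor.
type ReplicaDescriptor struct {
	NodeID    int32
	StoreID   int32
	ReplicaID int32
	Type      *ReplicaType // pointer field: == compares the pointers, not the values
}

// IsSame compares replica identity while ignoring the type, mirroring the fix.
func (r ReplicaDescriptor) IsSame(o ReplicaDescriptor) bool {
	return r.NodeID == o.NodeID && r.StoreID == o.StoreID && r.ReplicaID == o.ReplicaID
}

func main() {
	incoming := VOTER_INCOMING

	// Same replica identity, but one copy is stale (nil type) while the
	// leaseholder reported by the NotLeaseHolderError carries VOTER_INCOMING.
	stale := ReplicaDescriptor{NodeID: 1, StoreID: 1, ReplicaID: 3}
	fresh := ReplicaDescriptor{NodeID: 1, StoreID: 1, ReplicaID: 3, Type: &incoming}

	fmt.Println(stale == fresh)      // false: the pointer fields differ
	fmt.Println(stale.IsSame(fresh)) // true: identity matches, type ignored
}
```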
Force-pushed from 9677035 to 0251570
Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pavelkalinnikov)
// IsSame returns true if the two replica descriptors refer to the same replica,
// ignoring the replica type.
func (r ReplicaDescriptor) IsSame(o ReplicaDescriptor) bool {
	return r.NodeID == o.NodeID && r.StoreID == o.StoreID && r.ReplicaID == o.ReplicaID
}
What are the relations between node, store, and replica?
Is `ReplicaID` local to a pair of (`NodeID`, `StoreID`)? Can it be compared without any of the two?
One-to-many all the way down. One node can have many stores (disks), one store can have many replicas.
The node ID must be unique within the cluster. The store ID must be unique per node. The replica ID must be unique within the range. The range ID is sort of implied here, in that the replica descriptor is stored within a range descriptor.
In order to route a request to a replica we need to know all four. A replica ID won't move between nodes/stores -- if we have to move it, we'll create a new replica (with a new ID) on a different node, populate it, and then delete the old one.
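To make that scoping concrete, here is a rough sketch using hypothetical simplified types (stand-in declarations, not the real `roachpb` definitions):

```go
// Sketch of the ID hierarchy described above.
package sketch

type (
	NodeID    int32 // unique within the cluster
	StoreID   int32 // unique per node (one store per disk)
	RangeID   int64 // implied by the enclosing range descriptor
	ReplicaID int32 // unique within a range
)

// Routing a request to a replica needs the full "address"; the range is
// implied by the descriptor the replica is listed in.
type ReplicaDescriptor struct {
	NodeID    NodeID
	StoreID   StoreID
	ReplicaID ReplicaID
}

// A range descriptor lists the replicas of one range.
type RangeDescriptor struct {
	RangeID  RangeID
	Replicas []ReplicaDescriptor
}
```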
You're right in that we technically only need to compare the replica ID here, since we're operating within the context of a single range. We check the node ID and store ID just to make sure we have the right "address" for it, I suppose.
Ok, so: (rangeID, replicaID) -> (nodeID, storeID), and rangeID is implied by the caller of `IsSame` (otherwise, due to locality of replicaID, it would be comparing apples to tables). Hence, is it enough to just compare `ReplicaID`s? From what you said it follows that if rangeID and replicaID are the same then nodeID and storeID are the same too.
Ah, you answered this already. We were racing with the last 2 comments :)
In principle it should be, yeah. But it's cheap enough to check the node and store too, so I don't see a strong reason not to. If there should be a mismatch for whatever reason, then there's no real point in trying to contact the replica anyway, because the transport will send it to the wrong place.
TFTR! bors r=arulajmani
This PR was included in a batch that was canceled, it will be automatically retried.
Build failed (retrying...)
Build failed (retrying...)
Build failed (retrying...)
Build failed (retrying...)
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from 0251570 to blathers/backport-release-22.1-85140: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
The DistSender could fail to prioritize a newly discovered leaseholder from a `NotLeaseHolderError` if the leaseholder had a non-`VOTER` replica type. Instead, it would continue to try replicas in order until possibly exhausting the transport and backing off, leading to increased tail latencies. This applies in particular to 22.1, where we allowed `VOTER_INCOMING` replicas to acquire the lease (see 22b4fb5).

The primary reason is that `grpcTransport.MoveToFront()` would fail to compare the new leaseholder replica descriptor with the one in its range descriptor. There are two reasons why this can happen:

1. `ReplicaDescriptor.ReplicaType` is a pointer, where the zero value `nil` is equivalent to `VOTER`. The equality comparison used in `MoveToFront()` is `==`, but pointer equality compares the memory address rather than the value.

2. The transport keeps using the range descriptor it was created with, and does not update it as we receive updated range descriptors. This means that the transport may e.g. have a `nil` replica type while the leaseholder has a `VOTER_INCOMING` replica type.

This patch fixes both issues by adding `ReplicaDescriptor.IsSame()`, which compares replica identities while ignoring the type.

Resolves #85060.
Touches #74546.

Release note (bug fix): Fixed a bug where new leaseholders (with a `VOTER_INCOMING` type) would not always be detected properly during query execution, leading to occasional increased tail latencies due to unnecessary internal retries.
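To tie the two issues together, here is a hedged sketch of the kind of move-to-front lookup described above. The `transport` type and descriptor below are hypothetical stand-ins; the real `grpcTransport` and `roachpb.ReplicaDescriptor` differ in detail.

```go
// Sketch only: illustrates why the leaseholder lookup must compare replica
// identity rather than the full descriptor.
package sketch

type ReplicaType int32

type ReplicaDescriptor struct {
	NodeID, StoreID, ReplicaID int32
	Type                       *ReplicaType // nil is equivalent to VOTER
}

// IsSame compares replica identity while ignoring the type.
func (r ReplicaDescriptor) IsSame(o ReplicaDescriptor) bool {
	return r.NodeID == o.NodeID && r.StoreID == o.StoreID && r.ReplicaID == o.ReplicaID
}

// transport holds the replica list captured when it was created; it is not
// refreshed as newer range descriptors arrive.
type transport struct {
	replicas []ReplicaDescriptor
	next     int // index of the next replica to try
}

// moveToFront prioritizes the newly discovered leaseholder. A plain ==
// comparison would miss it whenever the stored descriptor and the leaseholder
// disagree on the pointer-typed replica type (e.g. nil vs. VOTER_INCOMING);
// matching by identity does not.
func (t *transport) moveToFront(leaseholder ReplicaDescriptor) {
	for i := t.next; i < len(t.replicas); i++ {
		if t.replicas[i].IsSame(leaseholder) {
			t.replicas[i], t.replicas[t.next] = t.replicas[t.next], t.replicas[i]
			return
		}
	}
}
```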