Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gossip: Track latency by nodeID rather than addr #95701

Merged
merged 1 commit into from
Jan 25, 2023

Conversation

andrewbaptist
Copy link
Collaborator

Previously the latency to remote nodes was tracked by address rather than the node's id. This could result in a few problems. First, the remote address could be reused across nodes. This could result in incorrect information. Additionally, places that used this information (such as the allocator) needed to unnecessarily map the node id to address just to do a lookup.

Finally in preparation for dialback on heartbeat #84289 the use of the OriginAddr field in the PingRequest will change to be the actual address that a node should use to dial back. Currently this field is not set correctly.

Epic: none
Release note: None

@blathers-crl
Copy link

blathers-crl bot commented Jan 23, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andrewbaptist andrewbaptist force-pushed the 230123.cleanup_latency branch 11 times, most recently from 8fbfbd6 to d7f078a Compare January 24, 2023 15:48
@andrewbaptist andrewbaptist marked this pull request as ready for review January 24, 2023 16:31
@andrewbaptist andrewbaptist requested a review from a team as a code owner January 24, 2023 16:31
@andrewbaptist andrewbaptist requested a review from a team January 24, 2023 16:31
@andrewbaptist andrewbaptist requested review from a team as code owners January 24, 2023 16:31
@andrewbaptist andrewbaptist requested review from msirek and kvoli January 24, 2023 16:31
@msirek msirek removed their request for review January 25, 2023 02:34
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup, thanks!


func (i ReplicaInfo) addr() string {
return i.NodeDesc.Address.String()
Tiers []roachpb.Tier
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason we're getting rid of the node descriptor here? I mean, this is probably fine if we don't need it, but we could also just end up having to reintroduce it later when we need some other info.

In any case, we'll need to update a bunch of doc comments that refer to the node descriptor.

@@ -189,7 +181,7 @@ func localityMatch(a, b []roachpb.Tier) int {

// A LatencyFunc returns the latency from this node to a remote
// address and a bool indicating whether the latency is valid.
type LatencyFunc func(string) (time.Duration, bool)
type LatencyFunc func(roachpb.NodeID) (time.Duration, bool)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc domment: s/remote address/remote node/

@@ -149,7 +150,7 @@ func (r *RemoteClockMonitor) Metrics() *RemoteClockMetrics {
// Latency returns the exponentially weighted moving average latency to the
// given node address. Returns true if the measurement is valid, or false if
// we don't have enough samples to compute a reliable average.
func (r *RemoteClockMonitor) Latency(addr string) (time.Duration, bool) {
func (r *RemoteClockMonitor) Latency(addr roachpb.NodeID) (time.Duration, bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/addr/id/

if mr.rpcContext.RemoteClocks != nil {
currentAverages = mr.rpcContext.RemoteClocks.AllLatencies()
}
for nodeID, entry := range isLiveMap {
address, err := mr.gossip.GetNodeIDAddress(nodeID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get rid of mr.gossip now as well, it's no longer used afaict.

@@ -165,7 +165,7 @@ func (hs *HeartbeatService) Ping(ctx context.Context, args *PingRequest) (*PingR
serverOffset := args.Offset
// The server offset should be the opposite of the client offset.
serverOffset.Offset = -serverOffset.Offset
hs.remoteClockMonitor.UpdateOffset(ctx, args.OriginAddr, serverOffset, 0 /* roundTripLatency */)
hs.remoteClockMonitor.UpdateOffset(ctx, args.OriginNodeID, serverOffset, 0 /* roundTripLatency */)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think OriginNodeID can sometimes be zero, should we guard against that here and/or in UpdateOffset?

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked at the allocator/storepool/asim diffs :lgtm: on those.

:lgtm:

Reviewed 19 of 19 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist)

@andrewbaptist andrewbaptist force-pushed the 230123.cleanup_latency branch from 4f28895 to 6127b67 Compare January 25, 2023 17:55
@andrewbaptist andrewbaptist requested review from a team as code owners January 25, 2023 17:55
@andrewbaptist andrewbaptist force-pushed the 230123.cleanup_latency branch from 6127b67 to c4b310a Compare January 25, 2023 18:04
Previously the latency to remote nodes was tracked by address rather
than the node's id. This could result in a few problems. First, the
remote address could be reused across nodes. This could result in
incorrect information. Additionally, places that used this information
(such as the allocator) needed to unnecessarily map the node id to
address just to do a lookup.

Finally in preparation for dialback on heartbeat cockroachdb#84289 the use of the
OriginAddr field in the PingRequest is deprecated. That PR will add back
another field that can be used to lookup the Locality correct field to
use.

Epic: none
Release note: None
@andrewbaptist andrewbaptist force-pushed the 230123.cleanup_latency branch from c4b310a to 5cf0823 Compare January 25, 2023 18:18
Copy link
Collaborator Author

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR - I removed mr.gossip as well.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @kvoli)


pkg/kv/kvclient/kvcoord/replica_slice.go line 28 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Is there a particular reason we're getting rid of the node descriptor here? I mean, this is probably fine if we don't need it, but we could also just end up having to reintroduce it later when we need some other info.

In any case, we'll need to update a bunch of doc comments that refer to the node descriptor.

The goal was to be able to populate this information directly from the heartbeat ping message eventually. They will carry the Locality information on the initial request and be used for dialback. This could also be used here to break the gossip dependency, however, I'm not sure that this should be done since we can only look it up there if we have a connection to that node (which we might not). So for now I'm leaving it depending on gossip even though that information is slightly more stale.

I will update all the doc comments for this.


pkg/kv/kvclient/kvcoord/replica_slice.go line 184 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Doc domment: s/remote address/remote node/

Done


pkg/rpc/clock_offset.go line 153 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

s/addr/id/

Done


pkg/rpc/heartbeat.go line 168 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I think OriginNodeID can sometimes be zero, should we guard against that here and/or in UpdateOffset?

Good catch! I don't know if this would have caused a problem, but it is confusing to have latency recorded for "node 0"

@andrewbaptist
Copy link
Collaborator Author

bors r=kvoli,erikgrinaker

@craig craig bot merged commit 2105cd4 into cockroachdb:master Jan 25, 2023
@craig
Copy link
Contributor

craig bot commented Jan 25, 2023

Build succeeded:

@andrewbaptist andrewbaptist deleted the 230123.cleanup_latency branch December 15, 2023 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants