-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gossip: Track latency by nodeID rather than addr #95701
gossip: Track latency by nodeID rather than addr #95701
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
8fbfbd6
to
d7f078a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice cleanup, thanks!
|
||
func (i ReplicaInfo) addr() string { | ||
return i.NodeDesc.Address.String() | ||
Tiers []roachpb.Tier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason we're getting rid of the node descriptor here? I mean, this is probably fine if we don't need it, but we could also just end up having to reintroduce it later when we need some other info.
In any case, we'll need to update a bunch of doc comments that refer to the node descriptor.
@@ -189,7 +181,7 @@ func localityMatch(a, b []roachpb.Tier) int { | |||
|
|||
// A LatencyFunc returns the latency from this node to a remote | |||
// address and a bool indicating whether the latency is valid. | |||
type LatencyFunc func(string) (time.Duration, bool) | |||
type LatencyFunc func(roachpb.NodeID) (time.Duration, bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc domment: s/remote address/remote node/
pkg/rpc/clock_offset.go
Outdated
@@ -149,7 +150,7 @@ func (r *RemoteClockMonitor) Metrics() *RemoteClockMetrics { | |||
// Latency returns the exponentially weighted moving average latency to the | |||
// given node address. Returns true if the measurement is valid, or false if | |||
// we don't have enough samples to compute a reliable average. | |||
func (r *RemoteClockMonitor) Latency(addr string) (time.Duration, bool) { | |||
func (r *RemoteClockMonitor) Latency(addr roachpb.NodeID) (time.Duration, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/addr/id/
if mr.rpcContext.RemoteClocks != nil { | ||
currentAverages = mr.rpcContext.RemoteClocks.AllLatencies() | ||
} | ||
for nodeID, entry := range isLiveMap { | ||
address, err := mr.gossip.GetNodeIDAddress(nodeID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can get rid of mr.gossip
now as well, it's no longer used afaict.
@@ -165,7 +165,7 @@ func (hs *HeartbeatService) Ping(ctx context.Context, args *PingRequest) (*PingR | |||
serverOffset := args.Offset | |||
// The server offset should be the opposite of the client offset. | |||
serverOffset.Offset = -serverOffset.Offset | |||
hs.remoteClockMonitor.UpdateOffset(ctx, args.OriginAddr, serverOffset, 0 /* roundTripLatency */) | |||
hs.remoteClockMonitor.UpdateOffset(ctx, args.OriginNodeID, serverOffset, 0 /* roundTripLatency */) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think OriginNodeID
can sometimes be zero, should we guard against that here and/or in UpdateOffset
?
d7f078a
to
4f28895
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked at the allocator/storepool/asim diffs on those.
Reviewed 19 of 19 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist)
4f28895
to
6127b67
Compare
6127b67
to
c4b310a
Compare
Previously the latency to remote nodes was tracked by address rather than the node's id. This could result in a few problems. First, the remote address could be reused across nodes. This could result in incorrect information. Additionally, places that used this information (such as the allocator) needed to unnecessarily map the node id to address just to do a lookup. Finally in preparation for dialback on heartbeat cockroachdb#84289 the use of the OriginAddr field in the PingRequest is deprecated. That PR will add back another field that can be used to lookup the Locality correct field to use. Epic: none Release note: None
c4b310a
to
5cf0823
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR - I removed mr.gossip as well.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @kvoli)
pkg/kv/kvclient/kvcoord/replica_slice.go
line 28 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Is there a particular reason we're getting rid of the node descriptor here? I mean, this is probably fine if we don't need it, but we could also just end up having to reintroduce it later when we need some other info.
In any case, we'll need to update a bunch of doc comments that refer to the node descriptor.
The goal was to be able to populate this information directly from the heartbeat ping message eventually. They will carry the Locality information on the initial request and be used for dialback. This could also be used here to break the gossip dependency, however, I'm not sure that this should be done since we can only look it up there if we have a connection to that node (which we might not). So for now I'm leaving it depending on gossip even though that information is slightly more stale.
I will update all the doc comments for this.
pkg/kv/kvclient/kvcoord/replica_slice.go
line 184 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Doc domment: s/remote address/remote node/
Done
pkg/rpc/clock_offset.go
line 153 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
s/addr/id/
Done
pkg/rpc/heartbeat.go
line 168 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I think
OriginNodeID
can sometimes be zero, should we guard against that here and/or inUpdateOffset
?
Good catch! I don't know if this would have caused a problem, but it is confusing to have latency recorded for "node 0"
bors r=kvoli,erikgrinaker |
Build succeeded: |
Previously the latency to remote nodes was tracked by address rather than the node's id. This could result in a few problems. First, the remote address could be reused across nodes. This could result in incorrect information. Additionally, places that used this information (such as the allocator) needed to unnecessarily map the node id to address just to do a lookup.
Finally in preparation for dialback on heartbeat #84289 the use of the OriginAddr field in the PingRequest will change to be the actual address that a node should use to dial back. Currently this field is not set correctly.
Epic: none
Release note: None