-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: add cross-region snapshot byte metrics to StoreMetrics #104111
kvserver: add cross-region snapshot byte metrics to StoreMetrics #104111
Conversation
0cb5550
to
7382441
Compare
7382441
to
6ba2566
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a brief look, really great! Lets walk though in more detail tomorrow.
Reviewed 1 of 1 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/allocator/storepool/store_pool.go
line 1392 at r2 (raw file):
func (sp *StorePool) GetNodeLocalityWithString(nodeID roachpb.NodeID) localityWithString { nodeLocality := localityWithString{ str: "",
nit: remove the extra initialization beyond localityWithString
6ba2566
to
c0e047f
Compare
Previously, kvoli (Austen) wrote…
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff! Left some suggestions but none are significant changes. Lmk if you have any questions.
Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
-- commits
line 17 at r3:
The cross-region traffic is between stores residing in different region localities within a cluster.
Perhaps reword to "cross region snapshot traffic, between stores in a cluster.
-- commits
line 20 at r3:
nit: reword to "this commit adds two new store metrics - "
-- commits
line 21 at r3:
Use the metric metadata name instead of the golang struct field name e.g.
range.snapshots.cross-region.sent-bytes
and range.snapshots.cross-region.rcvd-bytes
-- commits
line 22 at r3:
nit: reword to make it clear these are aggregates per store, it is inferred by the counter but could be explicit e.g. "track the aggregate snapshot bytes sent from and received at a store".
-- commits
line 25 at r3:
Is this supposed to mention cross-region batch activities or snapshots?
-- commits
line 33 at r3:
Same as above for referring to metrics in PRs/commits.
pkg/kv/kvserver/replica_command.go
line 3148 at r3 (raw file):
} maybeIncrementCrossRegionBatchMetrics := func(inc int64) {
Why name this ...CrossRegionBatchMetrics
as opposed to ...CrossRegionSnapshotMetrics
?
pkg/kv/kvserver/replica_command.go
line 3151 at r3 (raw file):
storePool := r.store.cfg.StorePool if storePool == nil { log.Eventf(ctx, "store pool is nil")
When will the storepool be nil? Does this only occur in tests?
pkg/kv/kvserver/replica_command.go
line 3155 at r3 (raw file):
isCrossRegion, err := storePool.IsCrossRegion(req.CoordinatorReplica, req.RecipientReplica) if err != nil { log.Eventf(ctx, "%v", err)
This could be spammy and would be better logged as a warning if necessary.
Consider logging at a vmodule level of 2 and adding context to the error.
if err != nil {
// We are unable to determine if the snapshot was sent to a store in a different
// region or not. This can occur when ...
log.VEventf(ctx, 2, "Unable to determine if batch is cross region %v", err)
}
pkg/kv/kvserver/allocator/storepool/store_pool.go
line 1408 at r3 (raw file):
// GetNodeLocality returns the locality information for the given node. func (sp *StorePool) GetNodeLocality(nodeID roachpb.NodeID) roachpb.Locality {
Should these methods also be exported on the AllocatorStorePool
interface? I'm not certain they necessarily need to, what do you think?
Also I only see IsCrossRegion
being used outside this pkg, do the other two methods need to be exported?
pkg/kv/kvserver/replica_learner_test.go
line 2211 at r1 (raw file):
} } types := [4]string{".unknown", ".recovery", ".rebalancing", ""}
nit: move types inline with the loop
for _, v := range [4]string{...} { ... }
Is the reason you are using the metric names as constant strings here, as opposed to from the metadata in
cockroach/pkg/kv/kvserver/metrics.go
Line 890 in 3ff847a
Name: "range.snapshots.rcvd-bytes", |
kvserver_test
pkg?
pkg/kv/kvserver/replica_learner_test.go
line 2413 at r3 (raw file):
} require.Equal(t, receiverMapExpected, receiverMapDelta)
Nice test extension! Would it be possible to add a case where region
isn't set, but other locality attributes are such as zone
and dc
- then assert that the cross-region metrics are unchanged? It may be helpful to get a feel for how noisy the log messages added above are as well.
fe98fc7
to
b685267
Compare
Previously, kvoli (Austen) wrote…
Done. |
Previously, kvoli (Austen) wrote…
Done. |
Previously, kvoli (Austen) wrote…
Done. |
Previously, kvoli (Austen) wrote…
Done. |
Previously, kvoli (Austen) wrote…
Oops 😅 |
Previously, kvoli (Austen) wrote…
Done. |
Previously, kvoli (Austen) wrote…
Done. |
Previously, kvoli (Austen) wrote…
Discussed offline - this was for sanity check, but it should never happen. I've removed the check. |
Previously, kvoli (Austen) wrote…
Done. |
Previously, kvoli (Austen) wrote…
Done. |
bf3d71d
to
8635c72
Compare
This commit refactors `getSnapshotBytesMetrics` in `replica_learner_test` to return a `map[string]snapshotBytesMetrics` instead of `map[SnapShotRequest_Priority]snapshotBytesMetrics`. This allows us to include and compare different types of snapshot metrics, removing the constraint of being limited to `SnapShotRequest_Priority`. This commit does not change any existing functionality, and the main purpose is to make future commits cleaner. Part of: cockroachdb#104124 Release note: none
8635c72
to
3f1351e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewbaptist would you be able to review as well? This code was last touched heavily by you for delegated snapshots.
Reviewed 6 of 6 files at r7, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @wenyihu6)
Previously, wenyihu6 (Wenyi Hu) wrote…
Done.
I know its very much implied by the name but adding "...aggregate cross-region snapshot bytes..." would be clearer.
-- commits
line 25 at r7:
nit: "Resolves" is probably a better linking term, since this wasn't a bug - just a feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This will cause a few conflicts in my PR for #102629 - but I will fix that up after this merges. That PR does also remove the UNKNOWN metric since it is never populated, so we could simplify this by removing this here as it is always checked against 0, but its not important to do.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @wenyihu6)
544f83a
to
ad867c7
Compare
Previously, there were no metrics to observe cross-region snapshot traffic between stores within a cluster. To improve this issue, this commit adds two new store metrics - `range.snapshots.cross-region.sent-bytes` and `range.snapshots.cross-region.rcvd-bytes`. These metrics track the aggregate of snapshot bytes sent from and received at a store across different regions. Resolves: cockroachdb#104124 Release note (ops change): Two new store metrics - `range.snapshots.cross-region.sent-bytes` and `range.snapshots.cross-region.rcvd-bytes` - are now added to track the aggregate of snapshot bytes sent from and received at a store across different regions. Note that these metrics require nodes’ localities to include a “region” tier key. If a node lacks this key but is involved in cross-region batch activities, an error message will be logged.
ad867c7
to
aaf7049
Compare
Previously, kvoli (Austen) wrote…
Done. |
@andrewbaptist Would you mind taking care of removing the UNKNOWN metrics? It seems to align more closely with your PR. But please let me know whether it would be cleaner in terms of merge conflicts if I create a separate PR for this. |
TFTRs! bors r=kvoli, andrewbaptist |
Build failed (retrying...): |
Build succeeded: |
Previously, we [added](cockroachdb#104111) cross-region snapshot byte metrics to track the aggregate of snapshot bytes sent from and received at a store across different regions. We should add metrics to track cross-zone snapshots as well. To improve this issue, this commit adds two new store metrics - ``` range.snapshots.cross-zone.sent-bytes range.snapshots.cross-zone.rcvd-bytes ``` These metrics track the aggregate of snapshot bytes sent from and received at a store across different zones within the same region if the zone and region tier keys are properly configured across nodes. To ensure accurate metrics and consistent error logging, it is important to follow the assumption when configuring locality tiers across nodes: 1. For cross-region metrics, region tier keys must be present. 2. For cross-zone metrics, zone tier keys must be present. It is also essential to maintain consistency in the zone tier key across nodes. (e.g. using different keys, such as “az” and “zone”, can lead to unexpected behavior). 3. Within a node locality, region and zone tier keys should be unique. 4. Ensure consistent configuration of region and zone tiers across nodes. If all nodes configure both region and zone tiers: Cross-region and cross-zone metrics can be used to track following information: a. Aggregate of cross-region, cross-zone activities: `range.snapshots.cross-region.(sent|received)-bytes` b. Aggregate of same-region, cross-zone activities: `range.snapshots.cross-zone.(sent|received)-bytes` c. Aggregate of same-region, same-zone activities:`range.snapshots.(sent|received)-bytes` - a - b d. Cross-region, same zone activities will be considered as misconfigured, and an error will be logged. If all nodes have zone tiers configured, but not regions: cross-zone metrics will still be accurate. Problems arise if some nodes have region tier keys configured while others do not. If all nodes have region tiers configured: cross-region metrics will be accurate regardless of the nodes’ zone tier setup. Part of: cockroachdb#103983 Release note (ops change): Two new store metrics - `range.snapshots.cross-zone.sent-bytes` and `range.snapshots.cross-zone.rcvd-bytes` - are now added to track the aggregate of snapshot bytes sent from and received at a store across different zones. For accurate metrics, follow these assumptions: - Configure region and zone tier keys consistently across nodes. - Within a node locality, ensure unique region and zone tier keys. - Maintain consistent configuration of region and zone tiers across nodes.
Previously, we [added](cockroachdb#104111) cross-region snapshot byte metrics to track the aggregate of snapshot bytes sent from and received at a store across different regions. We should add metrics to track cross-zone snapshots as well. To improve this issue, this commit adds two new store metrics - ``` range.snapshots.cross-zone.sent-bytes range.snapshots.cross-zone.rcvd-bytes ``` These metrics track the aggregate of snapshot bytes sent from and received at a store across different zones within the same region if the zone and region tier keys are properly configured across nodes. To ensure accurate metrics and consistent error logging, it is important to follow the assumption when configuring locality tiers across nodes: 1. For cross-region metrics, region tier keys must be present. 2. For cross-zone metrics, zone tier keys must be present. It is also essential to maintain consistency in the zone tier key across nodes. (e.g. using different keys, such as “az” and “zone”, can lead to unexpected behavior). 3. Within a node locality, region and zone tier keys should be unique. 4. Ensure consistent configuration of region and zone tiers across nodes. If all nodes configure both region and zone tiers: Cross-region and cross-zone metrics can be used to track following information: a. Aggregate of cross-region, cross-zone activities: `range.snapshots.cross-region.(sent|received)-bytes` b. Aggregate of same-region, cross-zone activities: `range.snapshots.cross-zone.(sent|received)-bytes` c. Aggregate of same-region, same-zone activities: `range.snapshots.(sent|received)-bytes` - a - b d. Cross-region, same zone activities will be considered as misconfigured, and an error will be logged. If all nodes have zone tiers configured, but not regions: cross-zone metrics will still be accurate. Problems arise if some nodes have region tier keys configured while others do not. If all nodes have region tiers configured: cross-region metrics will be accurate regardless of the nodes’ zone tier setup. Part of: cockroachdb#103983 Release note (ops change): Two new store metrics - `range.snapshots.cross-zone.sent-bytes` and `range.snapshots.cross-zone.rcvd-bytes` - are now added to track the aggregate of snapshot bytes sent from and received at a store across different zones. For accurate metrics, follow these assumptions: - Configure region and zone tier keys consistently across nodes. - Within a node locality, ensure unique region and zone tier keys. - Maintain consistent configuration of region and zone tiers across nodes.
104417: kvserver: add cross-zone snapshot byte metrics to StoreMetrics r=kvoli a=wenyihu6 **kvserver: refactor getSnapshotBytesMetrics** This commit refactors `getSnapshotBytesMetrics` function in `replica_learner_test`. Instead of rigidly defining an array of metrics information to be extracted within the function, the refactoring now allows the caller to pass a metrics slice array to retrieve specific metrics of the caller’s choice. This commit does not change any existing functionality, and the main purpose is to make future commits cleaner. Part of: #103983 Release note: none --- **kvserver: add cross-zone snapshot byte metrics to StoreMetrics** Previously, we [added](#104111) cross-region snapshot byte metrics to track the aggregate of snapshot bytes sent from and received at a store across different regions. We should add metrics to track cross-zone snapshots as well. To improve this issue, this commit adds two new store metrics - ``` range.snapshots.cross-zone.sent-bytes range.snapshots.cross-zone.rcvd-bytes ``` These metrics track the aggregate of snapshot bytes sent from and received at a store across different zones within the same region if the zone and region tier keys are properly configured across nodes. To ensure accurate metrics and consistent error logging, it is important to follow the assumption when configuring locality tiers across nodes: 1. For cross-region metrics, region tier keys must be present. 2. For cross-zone metrics, zone tier keys must be present. It is also essential to maintain consistency in the zone tier key across nodes. (e.g. using different keys, such as “az” and “zone”, can lead to unexpected behavior). 3. Within a node locality, region and zone tier keys should be unique. 4. Ensure consistent configuration of region and zone tiers across nodes. If all nodes configure both region and zone tiers: Cross-region and cross-zone metrics can be used to track following information: a. Aggregate of cross-region, cross-zone activities: `range.snapshots.cross-region.(sent|received)-bytes` b. Aggregate of same-region, cross-zone activities: `range.snapshots.cross-zone.(sent|received)-bytes` c. Aggregate of same-region, same-zone activities: `range.snapshots.(sent|received)-bytes` - a - b d. Cross-region, same zone activities will be considered as misconfigured, and an error will be logged. If all nodes have zone tiers configured, but not regions: cross-zone metrics will still be accurate. Problems arise if some nodes have region tier keys configured while others do not. If all nodes have region tiers configured: cross-region metrics will be accurate regardless of the nodes’ zone tier setup. Part of: #103983 Release note (ops change): Two new store metrics - `range.snapshots.cross-zone.sent-bytes` and `range.snapshots.cross-zone.rcvd-bytes` - are now added to track the aggregate of snapshot bytes sent from and received at a store across different zones. For accurate metrics, follow these assumptions: - Configure region and zone tier keys consistently across nodes. - Within a node locality, ensure unique region and zone tier keys. - Maintain consistent configuration of region and zone tiers across nodes. Co-authored-by: Wenyi <[email protected]>
kvserver: refactor getSnapshotBytesMetrics
This commit refactors
getSnapshotBytesMetrics
inreplica_learner_test
to return a
map[string]snapshotBytesMetrics
instead ofmap[SnapShotRequest_Priority]snapshotBytesMetrics
. This allows us to includeand compare different types of snapshot metrics, removing the constraint of
being limited to
SnapShotRequest_Priority
. This commit does not change anyexisting functionality, and the main purpose is to make future commits cleaner.
Part of: #104124
Release note: none
kvserver: add cross-region snapshot byte metrics to StoreMetrics
Previously, there were no metrics to observe cross-region snapshot traffic
between stores within a cluster.
To improve this issue, this commit adds two new store metrics -
range.snapshots.cross-region.sent-bytes
andrange.snapshots.cross-region.rcvd-bytes
. These metrics track the aggregate ofsnapshot bytes sent from and received at a store across different regions.
Resolves: #104124
Release note (ops change): Two new store metrics -
range.snapshots.cross-region.sent-bytes
andrange.snapshots.cross-region.rcvd-bytes
- are now added to track the aggregateof snapshot bytes sent from and received at a store across different regions.
Note that these metrics require nodes’ localities to include a “region” tier
key. If a node lacks this key but is involved in cross-region snapshot activities,
an error message will be logged.