Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: add cross-region snapshot byte metrics to StoreMetrics #104111

Merged
merged 2 commits into from
Jun 5, 2023

Conversation

wenyihu6
Copy link
Contributor

@wenyihu6 wenyihu6 commented May 30, 2023

kvserver: refactor getSnapshotBytesMetrics

This commit refactors getSnapshotBytesMetrics in replica_learner_test
to return a map[string]snapshotBytesMetrics instead of
map[SnapShotRequest_Priority]snapshotBytesMetrics. This allows us to include
and compare different types of snapshot metrics, removing the constraint of
being limited to SnapShotRequest_Priority. This commit does not change any
existing functionality, and the main purpose is to make future commits cleaner.

Part of: #104124
Release note: none


kvserver: add cross-region snapshot byte metrics to StoreMetrics

Previously, there were no metrics to observe cross-region snapshot traffic
between stores within a cluster.

To improve this issue, this commit adds two new store metrics -
range.snapshots.cross-region.sent-bytes and
range.snapshots.cross-region.rcvd-bytes. These metrics track the aggregate of
snapshot bytes sent from and received at a store across different regions.

Resolves: #104124

Release note (ops change): Two new store metrics -
range.snapshots.cross-region.sent-bytes and
range.snapshots.cross-region.rcvd-bytes - are now added to track the aggregate
of snapshot bytes sent from and received at a store across different regions.
Note that these metrics require nodes’ localities to include a “region” tier
key. If a node lacks this key but is involved in cross-region snapshot activities,
an error message will be logged.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@wenyihu6 wenyihu6 force-pushed the refactor-testcase-structure branch 7 times, most recently from 0cb5550 to 7382441 Compare May 30, 2023 21:49
@wenyihu6 wenyihu6 changed the title [DNM] testing some refactoring to make future commits cleaner kvserver: add cross-region snapshot byte metrics to StoreMetrics May 30, 2023
@wenyihu6 wenyihu6 requested review from kvoli and removed request for kvoli May 30, 2023 21:51
@wenyihu6 wenyihu6 force-pushed the refactor-testcase-structure branch from 7382441 to 6ba2566 Compare May 30, 2023 21:55
@wenyihu6 wenyihu6 requested a review from kvoli May 30, 2023 21:57
Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a brief look, really great! Lets walk though in more detail tomorrow.

Reviewed 1 of 1 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/kv/kvserver/allocator/storepool/store_pool.go line 1392 at r2 (raw file):

func (sp *StorePool) GetNodeLocalityWithString(nodeID roachpb.NodeID) localityWithString {
	nodeLocality := localityWithString{
		str:      "",

nit: remove the extra initialization beyond localityWithString

@wenyihu6 wenyihu6 force-pushed the refactor-testcase-structure branch from 6ba2566 to c0e047f Compare May 31, 2023 12:37
@wenyihu6
Copy link
Contributor Author

pkg/kv/kvserver/allocator/storepool/store_pool.go line 1392 at r2 (raw file):

Previously, kvoli (Austen) wrote…

nit: remove the extra initialization beyond localityWithString

Done.

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! Left some suggestions but none are significant changes. Lmk if you have any questions.

Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)


-- commits line 17 at r3:
The cross-region traffic is between stores residing in different region localities within a cluster.

Perhaps reword to "cross region snapshot traffic, between stores in a cluster.


-- commits line 20 at r3:
nit: reword to "this commit adds two new store metrics - "


-- commits line 21 at r3:
Use the metric metadata name instead of the golang struct field name e.g.

range.snapshots.cross-region.sent-bytes and range.snapshots.cross-region.rcvd-bytes


-- commits line 22 at r3:
nit: reword to make it clear these are aggregates per store, it is inferred by the counter but could be explicit e.g. "track the aggregate snapshot bytes sent from and received at a store".


-- commits line 25 at r3:
Is this supposed to mention cross-region batch activities or snapshots?


-- commits line 33 at r3:
Same as above for referring to metrics in PRs/commits.


pkg/kv/kvserver/replica_command.go line 3148 at r3 (raw file):

	}

	maybeIncrementCrossRegionBatchMetrics := func(inc int64) {

Why name this ...CrossRegionBatchMetrics as opposed to ...CrossRegionSnapshotMetrics?


pkg/kv/kvserver/replica_command.go line 3151 at r3 (raw file):

		storePool := r.store.cfg.StorePool
		if storePool == nil {
			log.Eventf(ctx, "store pool is nil")

When will the storepool be nil? Does this only occur in tests?


pkg/kv/kvserver/replica_command.go line 3155 at r3 (raw file):

			isCrossRegion, err := storePool.IsCrossRegion(req.CoordinatorReplica, req.RecipientReplica)
			if err != nil {
				log.Eventf(ctx, "%v", err)

This could be spammy and would be better logged as a warning if necessary.

Consider logging at a vmodule level of 2 and adding context to the error.

if err != nil {
    // We are unable to determine if the snapshot was sent to a store in a different 
   // region or not. This can occur when ...
   log.VEventf(ctx, 2, "Unable to determine if batch is cross region %v", err)
}

pkg/kv/kvserver/allocator/storepool/store_pool.go line 1408 at r3 (raw file):

// GetNodeLocality returns the locality information for the given node.
func (sp *StorePool) GetNodeLocality(nodeID roachpb.NodeID) roachpb.Locality {

Should these methods also be exported on the AllocatorStorePool interface? I'm not certain they necessarily need to, what do you think?

Also I only see IsCrossRegion being used outside this pkg, do the other two methods need to be exported?


pkg/kv/kvserver/replica_learner_test.go line 2211 at r1 (raw file):

		}
	}
	types := [4]string{".unknown", ".recovery", ".rebalancing", ""}

nit: move types inline with the loop

for _, v := range [4]string{...} {  ... }

Is the reason you are using the metric names as constant strings here, as opposed to from the metadata in

Name: "range.snapshots.rcvd-bytes",
because they are private to the kvserver_test pkg?


pkg/kv/kvserver/replica_learner_test.go line 2413 at r3 (raw file):

	}
	require.Equal(t, receiverMapExpected, receiverMapDelta)

Nice test extension! Would it be possible to add a case where region isn't set, but other locality attributes are such as zone and dc - then assert that the cross-region metrics are unchanged? It may be helpful to get a feel for how noisy the log messages added above are as well.

@wenyihu6 wenyihu6 force-pushed the refactor-testcase-structure branch 3 times, most recently from fe98fc7 to b685267 Compare June 1, 2023 11:30
@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 1, 2023

-- commits line 17 at r3:

Previously, kvoli (Austen) wrote…

The cross-region traffic is between stores residing in different region localities within a cluster.

Perhaps reword to "cross region snapshot traffic, between stores in a cluster.

Done.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 1, 2023

-- commits line 20 at r3:

Previously, kvoli (Austen) wrote…

nit: reword to "this commit adds two new store metrics - "

Done.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 1, 2023

-- commits line 21 at r3:

Previously, kvoli (Austen) wrote…

Use the metric metadata name instead of the golang struct field name e.g.

range.snapshots.cross-region.sent-bytes and range.snapshots.cross-region.rcvd-bytes

Done.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 1, 2023

-- commits line 22 at r3:

Previously, kvoli (Austen) wrote…

nit: reword to make it clear these are aggregates per store, it is inferred by the counter but could be explicit e.g. "track the aggregate snapshot bytes sent from and received at a store".

Done.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 1, 2023

-- commits line 25 at r3:

Previously, kvoli (Austen) wrote…

Is this supposed to mention cross-region batch activities or snapshots?

Oops 😅

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 1, 2023

-- commits line 33 at r3:

Previously, kvoli (Austen) wrote…

Same as above for referring to metrics in PRs/commits.

Done.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 1, 2023

pkg/kv/kvserver/replica_command.go line 3148 at r3 (raw file):

Previously, kvoli (Austen) wrote…

Why name this ...CrossRegionBatchMetrics as opposed to ...CrossRegionSnapshotMetrics?

Done.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 1, 2023

pkg/kv/kvserver/replica_command.go line 3151 at r3 (raw file):

Previously, kvoli (Austen) wrote…

When will the storepool be nil? Does this only occur in tests?

Discussed offline - this was for sanity check, but it should never happen. I've removed the check.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 1, 2023

pkg/kv/kvserver/replica_command.go line 3155 at r3 (raw file):

Previously, kvoli (Austen) wrote…

This could be spammy and would be better logged as a warning if necessary.

Consider logging at a vmodule level of 2 and adding context to the error.

if err != nil {
    // We are unable to determine if the snapshot was sent to a store in a different 
   // region or not. This can occur when ...
   log.VEventf(ctx, 2, "Unable to determine if batch is cross region %v", err)
}

Done.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 1, 2023

pkg/kv/kvserver/replica_learner_test.go line 2211 at r1 (raw file):

Previously, kvoli (Austen) wrote…

nit: move types inline with the loop

for _, v := range [4]string{...} {  ... }

Is the reason you are using the metric names as constant strings here, as opposed to from the metadata in

Name: "range.snapshots.rcvd-bytes",
because they are private to the kvserver_test pkg?

Done.

@wenyihu6 wenyihu6 force-pushed the refactor-testcase-structure branch from bf3d71d to 8635c72 Compare June 2, 2023 12:48
This commit refactors `getSnapshotBytesMetrics` in `replica_learner_test`
to return a `map[string]snapshotBytesMetrics` instead of
`map[SnapShotRequest_Priority]snapshotBytesMetrics`. This allows us to include
and compare different types of snapshot metrics, removing the constraint of
being limited to `SnapShotRequest_Priority`. This commit does not change any
existing functionality, and the main purpose is to make future commits cleaner.

Part of: cockroachdb#104124
Release note: none
@wenyihu6 wenyihu6 force-pushed the refactor-testcase-structure branch from 8635c72 to 3f1351e Compare June 2, 2023 13:58
@kvoli kvoli requested a review from andrewbaptist June 5, 2023 13:06
Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: @andrewbaptist would you be able to review as well? This code was last touched heavily by you for delegated snapshots.

Reviewed 6 of 6 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @wenyihu6)


-- commits line 22 at r3:

Previously, wenyihu6 (Wenyi Hu) wrote…

Done.

I know its very much implied by the name but adding "...aggregate cross-region snapshot bytes..." would be clearer.


-- commits line 25 at r7:
nit: "Resolves" is probably a better linking term, since this wasn't a bug - just a feature.

@kvoli kvoli marked this pull request as ready for review June 5, 2023 13:17
@kvoli kvoli requested a review from a team as a code owner June 5, 2023 13:17
Copy link
Collaborator

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: - This will cause a few conflicts in my PR for #102629 - but I will fix that up after this merges. That PR does also remove the UNKNOWN metric since it is never populated, so we could simplify this by removing this here as it is always checked against 0, but its not important to do.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @wenyihu6)

@wenyihu6 wenyihu6 force-pushed the refactor-testcase-structure branch 2 times, most recently from 544f83a to ad867c7 Compare June 5, 2023 16:13
Previously, there were no metrics to observe cross-region snapshot traffic
between stores within a cluster.

To improve this issue, this commit adds two new store metrics -
`range.snapshots.cross-region.sent-bytes` and
`range.snapshots.cross-region.rcvd-bytes`. These metrics track the aggregate of
snapshot bytes sent from and received at a store across different regions.

Resolves: cockroachdb#104124

Release note (ops change): Two new store metrics -
`range.snapshots.cross-region.sent-bytes` and
`range.snapshots.cross-region.rcvd-bytes` - are now added to track the aggregate
of snapshot bytes sent from and received at a store across different regions.
Note that these metrics require nodes’ localities to include a “region” tier
key. If a node lacks this key but is involved in cross-region batch activities,
an error message will be logged.
@wenyihu6 wenyihu6 force-pushed the refactor-testcase-structure branch from ad867c7 to aaf7049 Compare June 5, 2023 16:23
@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 5, 2023

-- commits line 22 at r3:

Previously, kvoli (Austen) wrote…

I know its very much implied by the name but adding "...aggregate cross-region snapshot bytes..." would be clearer.

Done.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 5, 2023

@andrewbaptist Would you mind taking care of removing the UNKNOWN metrics? It seems to align more closely with your PR. But please let me know whether it would be cleaner in terms of merge conflicts if I create a separate PR for this.

@wenyihu6
Copy link
Contributor Author

wenyihu6 commented Jun 5, 2023

TFTRs!

bors r=kvoli, andrewbaptist

@craig
Copy link
Contributor

craig bot commented Jun 5, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jun 5, 2023

Build succeeded:

@craig craig bot merged commit 89887ee into cockroachdb:master Jun 5, 2023
wenyihu6 added a commit to wenyihu6/cockroach that referenced this pull request Jun 9, 2023
Previously, we [added](cockroachdb#104111) cross-region snapshot byte metrics to track the aggregate
of snapshot bytes sent from and received at a store across different regions. We
should add metrics to track cross-zone snapshots as well.

To improve this issue, this commit adds two new store metrics -
```
range.snapshots.cross-zone.sent-bytes
range.snapshots.cross-zone.rcvd-bytes
```
These metrics track the aggregate of snapshot bytes sent from and received at a
store across different zones within the same region if the zone and region tier keys are properly configured across nodes.

To ensure accurate metrics and consistent error logging, it is important to follow the assumption when configuring locality tiers across nodes:
1. For cross-region metrics, region tier keys must be present.
2. For cross-zone metrics, zone tier keys must be present. It is also essential to maintain consistency in the zone tier key across nodes. (e.g. using different keys, such as “az” and “zone”, can lead to unexpected behavior).
3. Within a node locality, region and zone tier keys should be unique.
4. Ensure consistent configuration of region and zone tiers across nodes.

If all nodes configure both region and zone tiers:
Cross-region and cross-zone metrics can be used to track following information:
a. Aggregate of cross-region, cross-zone activities: `range.snapshots.cross-region.(sent|received)-bytes`
b. Aggregate of same-region, cross-zone activities: `range.snapshots.cross-zone.(sent|received)-bytes`
c. Aggregate of same-region, same-zone activities:`range.snapshots.(sent|received)-bytes` - a - b
d. Cross-region, same zone activities will be considered as misconfigured, and an error will be logged.

If all nodes have zone tiers configured, but not regions: cross-zone metrics will still be accurate. Problems arise if some nodes have region tier keys configured while others do not.

If all nodes have region tiers configured: cross-region metrics will be accurate regardless of the nodes’ zone tier setup.

Part of: cockroachdb#103983

Release note (ops change): Two new store metrics -
`range.snapshots.cross-zone.sent-bytes` and
`range.snapshots.cross-zone.rcvd-bytes` - are now added to track the aggregate
of snapshot bytes sent from and received at a store across different zones.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.
wenyihu6 added a commit to wenyihu6/cockroach that referenced this pull request Jun 9, 2023
Previously, we [added](cockroachdb#104111) cross-region snapshot byte metrics to track the
aggregate of snapshot bytes sent from and received at a store across different
regions. We should add metrics to track cross-zone snapshots as well.

To improve this issue, this commit adds two new store metrics -
```
range.snapshots.cross-zone.sent-bytes
range.snapshots.cross-zone.rcvd-bytes
```
These metrics track the aggregate of snapshot bytes sent from and received at a
store across different zones within the same region if the zone and region tier
keys are properly configured across nodes.

To ensure accurate metrics and consistent error logging, it is important to
follow the assumption when configuring locality tiers across nodes:
1. For cross-region metrics, region tier keys must be present.
2. For cross-zone metrics, zone tier keys must be present. It is also essential
to maintain consistency in the zone tier key across nodes. (e.g. using different
keys, such as “az” and “zone”, can lead to unexpected behavior).
3. Within a node locality, region and zone tier keys should be unique.
4. Ensure consistent configuration of region and zone tiers across nodes.

If all nodes configure both region and zone tiers:
Cross-region and cross-zone metrics can be used to track following information:
a. Aggregate of cross-region, cross-zone activities:
`range.snapshots.cross-region.(sent|received)-bytes`
b. Aggregate of same-region, cross-zone activities:
`range.snapshots.cross-zone.(sent|received)-bytes`
c. Aggregate of same-region, same-zone activities:
`range.snapshots.(sent|received)-bytes` - a - b
d. Cross-region, same zone activities will be considered as misconfigured, and
an error will be logged.

If all nodes have zone tiers configured, but not regions: cross-zone metrics
will still be accurate. Problems arise if some nodes have region tier keys
configured while others do not.

If all nodes have region tiers configured: cross-region metrics will be accurate
regardless of the nodes’ zone tier setup.

Part of: cockroachdb#103983

Release note (ops change): Two new store metrics -
`range.snapshots.cross-zone.sent-bytes` and
`range.snapshots.cross-zone.rcvd-bytes` - are now added to track the aggregate
of snapshot bytes sent from and received at a store across different zones.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.
craig bot pushed a commit that referenced this pull request Jun 10, 2023
104417: kvserver: add cross-zone snapshot byte metrics to StoreMetrics r=kvoli a=wenyihu6

**kvserver: refactor getSnapshotBytesMetrics**

This commit refactors `getSnapshotBytesMetrics` function in
`replica_learner_test`. Instead of rigidly defining an array of metrics
information to be extracted within the function, the refactoring now allows the
caller to pass a metrics slice array to retrieve specific metrics of the
caller’s choice.

This commit does not change any existing functionality, and the main purpose is
to make future commits cleaner.

Part of: #103983 
Release note: none

---
**kvserver: add cross-zone snapshot byte metrics to StoreMetrics**

Previously, we [added](#104111) cross-region snapshot byte metrics to track the
aggregate of snapshot bytes sent from and received at a store across different
regions. We should add metrics to track cross-zone snapshots as well.


To improve this issue, this commit adds two new store metrics -
```
range.snapshots.cross-zone.sent-bytes 
range.snapshots.cross-zone.rcvd-bytes 
```
These metrics track the aggregate of snapshot bytes sent from and received at a
store across different zones within the same region if the zone and region tier
keys are properly configured across nodes.
	
To ensure accurate metrics and consistent error logging, it is important to
follow the assumption when configuring locality tiers across nodes:
1. For cross-region metrics, region tier keys must be present.
2. For cross-zone metrics, zone tier keys must be present. It is also essential
to maintain consistency in the zone tier key across nodes. (e.g. using different
keys, such as “az” and “zone”, can lead to unexpected behavior).
3. Within a node locality, region and zone tier keys should be unique.
4. Ensure consistent configuration of region and zone tiers across nodes.

If all nodes configure both region and zone tiers:
Cross-region and cross-zone metrics can be used to track following information:
a. Aggregate of cross-region, cross-zone activities:
`range.snapshots.cross-region.(sent|received)-bytes`
b. Aggregate of same-region, cross-zone activities:
`range.snapshots.cross-zone.(sent|received)-bytes`
c. Aggregate of same-region, same-zone activities:
`range.snapshots.(sent|received)-bytes` - a - b
d. Cross-region, same zone activities will be considered as misconfigured, and
an error will be logged.

If all nodes have zone tiers configured, but not regions: cross-zone metrics
will still be accurate. Problems arise if some nodes have region tier keys
configured while others do not.

If all nodes have region tiers configured: cross-region metrics will be accurate
regardless of the nodes’ zone tier setup.

Part of: #103983

Release note (ops change): Two new store metrics -
`range.snapshots.cross-zone.sent-bytes` and
`range.snapshots.cross-zone.rcvd-bytes` - are now added to track the aggregate
of snapshot bytes sent from and received at a store across different zones.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.

Co-authored-by: Wenyi <[email protected]>
@wenyihu6 wenyihu6 deleted the refactor-testcase-structure branch October 30, 2023 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv: add metrics for cross-region snapshots
4 participants