Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: add metrics for cross-region, cross-zone batch requests / responses, snapshots, raft activities #103983

Closed
6 tasks done
wenyihu6 opened this issue May 27, 2023 · 3 comments · Fixed by #105122
Closed
6 tasks done
Assignees
Labels
A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@wenyihu6
Copy link
Contributor

wenyihu6 commented May 27, 2023

Is your feature request related to a problem? Please describe.

Currently, it is difficult to observe cross-region (and cross-zone) traffic for
batch requests / responses, raft activities, and snapshots. This limitation
becomes problematic when we need to assess the volume of cross-region traffic
handled by nodes. In addition, this also allows us to evaluate potential
optimization. For example, consistent follower reads should reduce cross-region
traffic.

Describe the solution you'd like

One solution is to collect byte count metrics across different levels.

Batch requests sent and batch responses received at a node

  • We can capture these metrics at the DistSender level, which runs on the
    gateway node receiving the SQL queries. Additionally, we should also collect metrics
    at the destination range node, which is responsible for the accessed data.
  • Cross Region
  • Cross Zone

Snapshot bytes sent from and received at a store

  • Cross Region (This has been mentioned in another issue.)
  • Cross Zone

Raft messages sent from and received at a store through RaftTransport

  • Cross Region
  • Cross Zone

Limitation of the solution.

  • When tracking the transmission of bytes between regions and zones, we currently
    rely on the byte count of the uncompressed data rather than the actual physical
    bytes sent or received. While this approach is acceptable when the compression
    factor remains consistent, it may hinder the accuracy of these metrics in
    reflecting the true cross-region traffic volume.

  • Ideally, we would want to aggregate the cross-region metrics based on the
    originating or forwarding region of the messages. This would allow us to assess
    the workload of individual regions. However, this tracking across multiple
    regions may lead to high cardinality.

Jira issue: CRDB-28287

@blathers-crl
Copy link

blathers-crl bot commented May 27, 2023

Hi @wenyihu6, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label May 27, 2023
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue May 27, 2023
Previously, there were no metrics to observe cross-region traffic in batch
requests / responses. This limitation becomes problematic when trying to
evaluate cross-region traffic managed by nodes.

To improve this issue, this commit adds two new metrics to DistSenderMetrics –
CrossRegionBatchRequestBytes and CrossRegionBatchResponseBytes. These metrics
track the byte count for batch requests sent and responses received in
cross-region batches within DistSender.

Note that DistSender resides on the gateway node that receives SQL queries, so
DistSender metrics are updated when DistSender sends cross-region batches to
another node.

Part of: cockroachdb#103983
Release note (ops change): Two new metrics - CrossRegionBatchRequestBytes,
CrossRegionBatchResponseBytes - are now added to DistSender metrics.
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue May 27, 2023
Previously, we introduced two new metrics to DistSender to help improve
observability of cross-region traffic on nodes. However, the issue of limited
observability on the receiver node still exists.

To solve this issue, this commit introduces two new metrics to NodeMetrics -
CrossRegionBatchRequestBytes and CrossRegionBatchResponseBytes. These metrics
track the byte count for batch requests sent and responses received in
cross-region batches within the receiver node.

Note that the node represents the destination range node, so node metrics are
updated when the node receives cross-region batches.

Fixes: cockroachdb#103983
Release note (ops change): Two new metrics - CrossRegionBatchRequestBytes,
CrossRegionBatchResponseBytes - are now added to Node metrics.
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue May 30, 2023
Previously, there were no metrics to observe cross-region traffic in batch
requests / responses. This limitation becomes problematic when trying to
evaluate cross-region traffic managed by nodes.

To improve this issue, this commit adds two new metrics to DistSenderMetrics –
CrossRegionBatchRequestBytes and CrossRegionBatchResponseBytes. These metrics
track the byte count for batch requests sent and responses received in
cross-region batches within DistSender.

Note that DistSender resides on the gateway node that receives SQL queries, so
DistSender metrics are updated when DistSender sends cross-region batches to
another node.

Part of: cockroachdb#103983
Release note (ops change): Two new metrics - CrossRegionBatchRequestBytes,
CrossRegionBatchResponseBytes - are now added to DistSender metrics.
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue May 30, 2023
Previously, we introduced two new metrics to DistSender to help improve
observability of cross-region traffic on nodes. However, the issue of limited
observability on the receiver node still exists.

To solve this issue, this commit introduces two new metrics to NodeMetrics -
CrossRegionBatchRequestBytes and CrossRegionBatchResponseBytes. These metrics
track the byte count for batch requests sent and responses received in
cross-region batches within the receiver node.

Note that the node represents the destination range node, so node metrics are
updated when the node receives cross-region batches.

Fixes: cockroachdb#103983
Release note (ops change): Two new metrics - CrossRegionBatchRequestBytes,
CrossRegionBatchResponseBytes - are now added to Node metrics.
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 6, 2023
Previously, there were no metrics to observe cross-region traffic in batch
requests / responses. This limitation becomes problematic when trying to
evaluate cross-region traffic managed by nodes.

To improve this issue, this commit adds two new metrics to DistSenderMetrics –
CrossRegionBatchRequestBytes and CrossRegionBatchResponseBytes. These metrics
track the byte count for batch requests sent and responses received in
cross-region batches within DistSender. DistSender resides on the gateway node
that receives SQL queries, so DistSender metrics are updated when DistSender
sends cross-region batches to another node.

Note: These metrics require nodes’ localities to include a tier with the key
“region”. If a node does not have this key but participates in cross-region
batch activities, metrics will remain unchanged, and an error message will be
logged. The region of each node is determined by using the locality field and
the “region” tier value. Unfortunately, the “region” key is hard coded here.
Ideally, we would prefer a more flexible approach to determine node locality.

Part of: cockroachdb#103983

Release note (ops change): Two new metrics - CrossRegionBatchRequestBytes,
CrossRegionBatchResponseBytes - are now added to DistSender metrics. Note that
these metrics require nodes’ localities to include a “region” tier key. If a
node lacks this key but is involved in cross-region batch activities, an error
message will be logged.
@wenyihu6 wenyihu6 changed the title kv: add metrics for cross-region batch requests kv: add metrics for cross-region, cross-zone batch requests / responses, snapshots, raft activities Jun 6, 2023
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 6, 2023
This commit refactors `getSnapshotBytesMetrics` function in
`replica_learner_test`. Instead of rigidly defining an array of metrics
information to be extracted within the function, the refactoring now allows the
caller to pass a metrics slice array to retrieve specific metrics of the
caller’s choice.

This commit does not change any existing functionality, and the main purpose is
to make future commits cleaner.

Part of: cockroachdb#103983
Release note: none
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 22, 2023
Previously, there were no metrics to observe cross-region, cross-zone traffic in
raft messages requests sent / received at each store.

To improve this issue, this commit adds six new store metrics -

```
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes"
```

The first two metrics track the total byte of raft messages received and sent in
a store. Additionally, there are four metrics to track the aggregate byte count
of cross-region, cross-zone Raft messages sent and received in the store.

Note that these metrics capture the byte count of requests immediately upon
message reception and just prior to message transmission. In the case of
messages containing heartbeats or heartbeat_resps, they capture the byte count
of requests with coalesced heartbeats.

To facilitate metrics updating, this commit also introduces a new raft message
handler interface `OutgoingRaftMessageHandler`. This interface captures outgoing
messages right before they are sent to `raftSendQueue`. Note that the message
may not be successfully queued if the outgoing queue is full.

Resolves: cockroachdb#103983

Release note (ops change): Six new metrics -
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes" - are now added to store metrics.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.
- Cross-region but same zone activities should be impossible.
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 22, 2023
Previously, the metrics updates testing knobs were embedded within the logic of
`DistSender.Send()` and `Node.Batch()`, which could confuse the readers of the
code. This commit addresses the issue by removing end to end tests and replacing
them with more focused unit tests. In addition, this commit also includes some
refactoring of function parameters.

Note that this commit does not change any existing functionality.

Part of: cockroachdb#103983

Release note: None
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 22, 2023
The commit renames `RaftMessageHandler` to `IncomingRaftMessageHandler`. Another
PR is introducing a new interface `OutgoingRaftMessageHandler`, dedicated to
managing messages sent. The main purpose of this PR is to make the future PR
cleaner by handling the renaming process. Note that this commit does not change
any existing functionality.

Part of: cockroachdb#103983
Related: cockroachdb#105122
Release Note: None
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 25, 2023
The commit renames `RaftMessageHandler`, `Listen`, and `Stop` to
`IncomingRaftMessageHandler`, `ListenIncomingRaftMessages`, and
`StopIncomingRaftMessages`. Another PR is introducing a new interface
`OutgoingRaftMessageHandler`, dedicated to managing messages sent. The main
purpose of this PR is to make the future PR cleaner by handling the renaming
process. Note that this commit does not change any existing functionality.

Part of: cockroachdb#103983

Related: cockroachdb#105122

Release Note: None
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 25, 2023
The commit renames `RaftMessageHandler`, `Listen`, and `Stop` to
`IncomingRaftMessageHandler`, `ListenIncomingRaftMessages`, and
`StopIncomingRaftMessages`. Another PR is introducing a new interface
`OutgoingRaftMessageHandler`, dedicated to managing messages sent. The main
purpose of this PR is to make the future PR cleaner by handling the renaming
process. Note that this commit does not change any existing functionality.

Part of: cockroachdb#103983

Related: cockroachdb#105122

Release Note: None
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 25, 2023
Previously, there were no metrics to observe cross-region, cross-zone traffic in
raft messages requests sent / received at each store.

To improve this issue, this commit adds six new store metrics -

```
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes"
```

The first two metrics track the total byte of raft messages received and sent in
a store. Additionally, there are four metrics to track the aggregate byte count
of cross-region, cross-zone Raft messages sent and received in the store.

Note that these metrics capture the byte count of requests immediately upon
message reception and just prior to message transmission. In the case of
messages containing heartbeats or heartbeat_resps, they capture the byte count
of requests with coalesced heartbeats.

To facilitate metrics updating, this commit also introduces a new raft message
handler interface `OutgoingRaftMessageHandler`. This interface captures outgoing
messages right before they are sent to `raftSendQueue`. Note that the message
may not be successfully queued if the outgoing queue is full.

Resolves: cockroachdb#103983

Release note (ops change): Six new metrics -
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes" - have now been added for each store.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.
craig bot pushed a commit that referenced this issue Jun 26, 2023
105123: kvserver: rename RaftMessageHandler to IncomingRaftMessageHandler r=tbg,kvoli a=wenyihu6

The commit renames `RaftMessageHandler`, `Listen`, and `Stop` to
`IncomingRaftMessageHandler`, `ListenIncomingRaftMessages`, and
`StopIncomingRaftMessages`. Another PR is introducing a new interface
`OutgoingRaftMessageHandler`, dedicated to managing messages sent. The main
purpose of this PR is to make the future PR cleaner by handling the renaming
process. Note that this commit does not change any existing functionality.

Part of: #103983

Related: #105122

Release Note: None

105519: kvserver: fix and re-enable slow lease application logging r=erikgrinaker a=erikgrinaker

Resolves #97209.

Epic: none
Release note: None

Co-authored-by: Wenyi <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 26, 2023
Prior to this commit, `CompareWithLocality` returned two boolean values
indicating whether two localities were cross-region and cross-zone. However,
this required callers to perform additional cross-comparison of these boolean
values to make meaningful metrics updates.

To simplify this, this commit introduces a new enum type
`LocalityComparisonType`. It provides four locality comparison results: cross
region, same region cross zone, same region same zone, and undefined
(indicating error behavior). This refactoring allow the caller to directly use
the comparison result without additional operations.

In addition, this commit also updates the logic to classify activities between
different regions as cross-regional, regardless of the zone tiers’
configuration. Initially, cross-region but same-zone tiers activities were
flagged as misconfiguration. After some discussion, we have decided that
regions should be non-overlapping. Hence, same zone tiers from different
regions should still be considered as different zones.

Part of: cockroachdb#103983

Release note: None
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 26, 2023
Previously, the metrics updates testing knobs were embedded within the logic of
`DistSender.Send()` and `Node.Batch()`, which could confuse the readers of the
code. This commit addresses the issue by removing end to end tests and replacing
them with more focused unit tests. In addition, this commit also includes some
refactoring of function parameters.

Note that this commit does not change any existing functionality.

Part of: cockroachdb#103983

Release note: None
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 27, 2023
Previously, there were no metrics to observe cross-region, cross-zone traffic in
raft messages requests sent / received at each store.

To improve this issue, this commit adds six new store metrics -

```
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes"
```

The first two metrics track the total byte of raft messages received and sent in
a store. Additionally, there are four metrics to track the aggregate byte count
of cross-region, cross-zone Raft messages sent and received in the store.

Note that these metrics capture the byte count of requests immediately upon
message reception and just prior to message transmission. In the case of
messages containing heartbeats or heartbeat_resps, they capture the byte count
of requests with coalesced heartbeats.

To facilitate metrics updating, this commit also introduces a new raft message
handler interface `OutgoingRaftMessageHandler`. This interface captures outgoing
messages right before they are sent to `raftSendQueue`. Note that the message
may not be successfully queued if the outgoing queue is full.

Resolves: cockroachdb#103983

Release note (ops change): Six new metrics -
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes" - have now been added for each store.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.

# Conflicts:
#	pkg/kv/kvserver/client_raft_test.go
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 29, 2023
Previously, there were no metrics to observe cross-region, cross-zone traffic in
raft messages requests sent / received at each store.

To improve this issue, this commit adds six new store metrics -

```
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes"
```

The first two metrics track the total byte of raft messages received and sent in
a store. Additionally, there are four metrics to track the aggregate byte count
of cross-region, cross-zone Raft messages sent and received in the store.

Note that these metrics capture the byte count of requests immediately upon
message reception and just prior to message transmission. In the case of
messages containing heartbeats or heartbeat_resps, they capture the byte count
of requests with coalesced heartbeats.

To facilitate metrics updating, this commit also introduces a new raft message
handler interface `OutgoingRaftMessageHandler`. This interface captures outgoing
messages right before they are sent to `raftSendQueue`. Note that the message
may not be successfully queued if the outgoing queue is full.

Resolves: cockroachdb#103983

Release note (ops change): Six new metrics -
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes" - have now been added for each store.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.

# Conflicts:
#	pkg/kv/kvserver/client_raft_test.go
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 30, 2023
Prior to this commit, `CompareWithLocality` returned two boolean values
indicating whether two localities were cross-region and cross-zone. However,
this required callers to perform additional cross-comparison of these boolean
values to make meaningful metrics updates.

To simplify this, this commit introduces a new enum type
`LocalityComparisonType`. It provides four locality comparison results: cross
region, same region cross zone, same region same zone, and undefined
(indicating error behavior). This refactoring allow the caller to directly use
the comparison result without additional operations.

In addition, this commit also updates the logic to classify activities between
different regions as cross-regional, regardless of the zone tiers’
configuration. Initially, cross-region but same-zone tiers activities were
flagged as misconfiguration. After some discussion, we have decided that
regions should be non-overlapping. Hence, same zone tiers from different
regions should still be considered as different zones.

In addition, this commit also includes some refactoring of function
parameters.

Note that this commit does not change any existing functionality.

Part of: cockroachdb#103983

Release note: None
wenyihu6 added a commit to wenyihu6/cockroach that referenced this issue Jun 30, 2023
Prior to this commit, `CompareWithLocality` returned two boolean values
indicating whether two localities were cross-region and cross-zone. However,
this required callers to perform additional cross-comparison of these boolean
values to make meaningful metrics updates.

To simplify this, this commit introduces a new enum type
`LocalityComparisonType`. It provides four locality comparison results: cross
region, same region cross zone, same region same zone, and undefined
(indicating error behavior). This refactoring allow the caller to directly use
the comparison result without additional operations.

In addition, this commit also updates the logic to classify activities between
different regions as cross-regional, regardless of the zone tiers’
configuration. Initially, cross-region but same-zone tiers activities were
flagged as misconfiguration. After some discussion, we have decided that
regions should be non-overlapping. Hence, same zone tiers from different
regions should still be considered as different zones.

In addition, this commit also includes some refactoring of function
parameters.

Note that this commit does not change any existing functionality.

Part of: cockroachdb#103983

Release note: None
craig bot pushed a commit that referenced this issue Jun 30, 2023
105267: kv: refactor CompareWithLocality to use enum r=tbg,kvoli a=wenyihu6

Prior to this commit, `CompareWithLocality` returned two boolean values
indicating whether two localities were cross-region and cross-zone. However,
this required callers to perform additional cross-comparison of these boolean
values to make meaningful metrics updates.

To simplify this, this commit introduces a new enum type
`LocalityComparisonType`. It provides four locality comparison results: cross
region, same region cross zone, same region same zone, and undefined
(indicating error behavior). This refactoring allow the caller to directly use
the comparison result without additional operations.

In addition, this commit also updates the logic to classify activities between
different regions as cross-regional, regardless of the zone tiers’
configuration. Initially, cross-region but same-zone tiers activities were
flagged as misconfiguration. After some discussion, we have decided that
regions should be non-overlapping. Hence, same zone tiers from different
regions should still be considered as different zones.

In addition, this commit also includes some refactoring of function 
parameters.

Note that this commit does not change any existing functionality.

Part of: #103983

Release note: None

Co-authored-by: Wenyi <[email protected]>
craig bot pushed a commit that referenced this issue Jun 30, 2023
105122: kvserver: add x-region, x-zone Raft msg metrics to Store r=tbg a=wenyihu6

Previously, there were no metrics to observe cross-region, cross-zone traffic in
raft messages requests sent / received at each store.

To improve this issue, this commit adds six new store metrics -

```
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes"
```

The first two metrics track the total byte of raft messages received and sent in
a store. Additionally, there are four metrics to track the aggregate byte count
of cross-region, cross-zone Raft messages sent and received in the store.

Note that these metrics capture the byte count of requests immediately upon
message reception and just prior to message transmission. In the case of
messages containing heartbeats or heartbeat_resps, they capture the byte count
of requests with coalesced heartbeats.

To facilitate metrics updating, this commit also introduces a new raft message
handler interface `OutgoingRaftMessageHandler`. This interface captures outgoing
messages right before they are sent to `raftSendQueue`. Note that the message
may not be successfully queued if the outgoing queue is full.

Resolves: #103983

Release note (ops change): Six new metrics -
"raft.rcvd.bytes"
"raft.sent.bytes"
"raft.rcvd.cross_region.bytes"
"raft.sent.cross_region.bytes"
"raft.rcvd.cross_zone.bytes"
"raft.sent.cross_zone.bytes" - have now been added for each store.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.

Co-authored-by: Wenyi <[email protected]>
@craig craig bot closed this as completed in 5479d92 Jun 30, 2023
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Jul 26, 2023
As part of cockroachdb#103983, metrics were added to track when
requests cross region or zone boundaries, however verbose logging was
added to the necessary locality comparisons. Whenever a cluster does not
have locality tiers defined - as is the default on new clusters, as well
as in tests - an error would be generated, and this error would be
logged (with a full stack trace) at the same level as the transport
debugging information, on every request. This change reduces that
verbosity and removes the unnecessary stack trace from being printed on
these locally generated and handled errors.

Epic: none

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant