-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: add metrics for cross-region, cross-zone batch requests / responses, snapshots, raft activities #103983
Closed
6 tasks done
Labels
A-kv-observability
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
Comments
Hi @wenyihu6, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
blathers-crl
bot
added
the
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
label
May 27, 2023
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
May 27, 2023
Previously, there were no metrics to observe cross-region traffic in batch requests / responses. This limitation becomes problematic when trying to evaluate cross-region traffic managed by nodes. To improve this issue, this commit adds two new metrics to DistSenderMetrics – CrossRegionBatchRequestBytes and CrossRegionBatchResponseBytes. These metrics track the byte count for batch requests sent and responses received in cross-region batches within DistSender. Note that DistSender resides on the gateway node that receives SQL queries, so DistSender metrics are updated when DistSender sends cross-region batches to another node. Part of: cockroachdb#103983 Release note (ops change): Two new metrics - CrossRegionBatchRequestBytes, CrossRegionBatchResponseBytes - are now added to DistSender metrics.
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
May 27, 2023
Previously, we introduced two new metrics to DistSender to help improve observability of cross-region traffic on nodes. However, the issue of limited observability on the receiver node still exists. To solve this issue, this commit introduces two new metrics to NodeMetrics - CrossRegionBatchRequestBytes and CrossRegionBatchResponseBytes. These metrics track the byte count for batch requests sent and responses received in cross-region batches within the receiver node. Note that the node represents the destination range node, so node metrics are updated when the node receives cross-region batches. Fixes: cockroachdb#103983 Release note (ops change): Two new metrics - CrossRegionBatchRequestBytes, CrossRegionBatchResponseBytes - are now added to Node metrics.
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
May 30, 2023
Previously, there were no metrics to observe cross-region traffic in batch requests / responses. This limitation becomes problematic when trying to evaluate cross-region traffic managed by nodes. To improve this issue, this commit adds two new metrics to DistSenderMetrics – CrossRegionBatchRequestBytes and CrossRegionBatchResponseBytes. These metrics track the byte count for batch requests sent and responses received in cross-region batches within DistSender. Note that DistSender resides on the gateway node that receives SQL queries, so DistSender metrics are updated when DistSender sends cross-region batches to another node. Part of: cockroachdb#103983 Release note (ops change): Two new metrics - CrossRegionBatchRequestBytes, CrossRegionBatchResponseBytes - are now added to DistSender metrics.
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
May 30, 2023
Previously, we introduced two new metrics to DistSender to help improve observability of cross-region traffic on nodes. However, the issue of limited observability on the receiver node still exists. To solve this issue, this commit introduces two new metrics to NodeMetrics - CrossRegionBatchRequestBytes and CrossRegionBatchResponseBytes. These metrics track the byte count for batch requests sent and responses received in cross-region batches within the receiver node. Note that the node represents the destination range node, so node metrics are updated when the node receives cross-region batches. Fixes: cockroachdb#103983 Release note (ops change): Two new metrics - CrossRegionBatchRequestBytes, CrossRegionBatchResponseBytes - are now added to Node metrics.
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 6, 2023
Previously, there were no metrics to observe cross-region traffic in batch requests / responses. This limitation becomes problematic when trying to evaluate cross-region traffic managed by nodes. To improve this issue, this commit adds two new metrics to DistSenderMetrics – CrossRegionBatchRequestBytes and CrossRegionBatchResponseBytes. These metrics track the byte count for batch requests sent and responses received in cross-region batches within DistSender. DistSender resides on the gateway node that receives SQL queries, so DistSender metrics are updated when DistSender sends cross-region batches to another node. Note: These metrics require nodes’ localities to include a tier with the key “region”. If a node does not have this key but participates in cross-region batch activities, metrics will remain unchanged, and an error message will be logged. The region of each node is determined by using the locality field and the “region” tier value. Unfortunately, the “region” key is hard coded here. Ideally, we would prefer a more flexible approach to determine node locality. Part of: cockroachdb#103983 Release note (ops change): Two new metrics - CrossRegionBatchRequestBytes, CrossRegionBatchResponseBytes - are now added to DistSender metrics. Note that these metrics require nodes’ localities to include a “region” tier key. If a node lacks this key but is involved in cross-region batch activities, an error message will be logged.
wenyihu6
changed the title
kv: add metrics for cross-region batch requests
kv: add metrics for cross-region, cross-zone batch requests / responses, snapshots, raft activities
Jun 6, 2023
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 6, 2023
This commit refactors `getSnapshotBytesMetrics` function in `replica_learner_test`. Instead of rigidly defining an array of metrics information to be extracted within the function, the refactoring now allows the caller to pass a metrics slice array to retrieve specific metrics of the caller’s choice. This commit does not change any existing functionality, and the main purpose is to make future commits cleaner. Part of: cockroachdb#103983 Release note: none
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 22, 2023
Previously, there were no metrics to observe cross-region, cross-zone traffic in raft messages requests sent / received at each store. To improve this issue, this commit adds six new store metrics - ``` "raft.rcvd.bytes" "raft.sent.bytes" "raft.rcvd.cross_region.bytes" "raft.sent.cross_region.bytes" "raft.rcvd.cross_zone.bytes" "raft.sent.cross_zone.bytes" ``` The first two metrics track the total byte of raft messages received and sent in a store. Additionally, there are four metrics to track the aggregate byte count of cross-region, cross-zone Raft messages sent and received in the store. Note that these metrics capture the byte count of requests immediately upon message reception and just prior to message transmission. In the case of messages containing heartbeats or heartbeat_resps, they capture the byte count of requests with coalesced heartbeats. To facilitate metrics updating, this commit also introduces a new raft message handler interface `OutgoingRaftMessageHandler`. This interface captures outgoing messages right before they are sent to `raftSendQueue`. Note that the message may not be successfully queued if the outgoing queue is full. Resolves: cockroachdb#103983 Release note (ops change): Six new metrics - "raft.rcvd.bytes" "raft.sent.bytes" "raft.rcvd.cross_region.bytes" "raft.sent.cross_region.bytes" "raft.rcvd.cross_zone.bytes" "raft.sent.cross_zone.bytes" - are now added to store metrics. For accurate metrics, follow these assumptions: - Configure region and zone tier keys consistently across nodes. - Within a node locality, ensure unique region and zone tier keys. - Maintain consistent configuration of region and zone tiers across nodes. - Cross-region but same zone activities should be impossible.
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 22, 2023
Previously, the metrics updates testing knobs were embedded within the logic of `DistSender.Send()` and `Node.Batch()`, which could confuse the readers of the code. This commit addresses the issue by removing end to end tests and replacing them with more focused unit tests. In addition, this commit also includes some refactoring of function parameters. Note that this commit does not change any existing functionality. Part of: cockroachdb#103983 Release note: None
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 22, 2023
The commit renames `RaftMessageHandler` to `IncomingRaftMessageHandler`. Another PR is introducing a new interface `OutgoingRaftMessageHandler`, dedicated to managing messages sent. The main purpose of this PR is to make the future PR cleaner by handling the renaming process. Note that this commit does not change any existing functionality. Part of: cockroachdb#103983 Related: cockroachdb#105122 Release Note: None
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 25, 2023
The commit renames `RaftMessageHandler`, `Listen`, and `Stop` to `IncomingRaftMessageHandler`, `ListenIncomingRaftMessages`, and `StopIncomingRaftMessages`. Another PR is introducing a new interface `OutgoingRaftMessageHandler`, dedicated to managing messages sent. The main purpose of this PR is to make the future PR cleaner by handling the renaming process. Note that this commit does not change any existing functionality. Part of: cockroachdb#103983 Related: cockroachdb#105122 Release Note: None
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 25, 2023
The commit renames `RaftMessageHandler`, `Listen`, and `Stop` to `IncomingRaftMessageHandler`, `ListenIncomingRaftMessages`, and `StopIncomingRaftMessages`. Another PR is introducing a new interface `OutgoingRaftMessageHandler`, dedicated to managing messages sent. The main purpose of this PR is to make the future PR cleaner by handling the renaming process. Note that this commit does not change any existing functionality. Part of: cockroachdb#103983 Related: cockroachdb#105122 Release Note: None
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 25, 2023
Previously, there were no metrics to observe cross-region, cross-zone traffic in raft messages requests sent / received at each store. To improve this issue, this commit adds six new store metrics - ``` "raft.rcvd.bytes" "raft.sent.bytes" "raft.rcvd.cross_region.bytes" "raft.sent.cross_region.bytes" "raft.rcvd.cross_zone.bytes" "raft.sent.cross_zone.bytes" ``` The first two metrics track the total byte of raft messages received and sent in a store. Additionally, there are four metrics to track the aggregate byte count of cross-region, cross-zone Raft messages sent and received in the store. Note that these metrics capture the byte count of requests immediately upon message reception and just prior to message transmission. In the case of messages containing heartbeats or heartbeat_resps, they capture the byte count of requests with coalesced heartbeats. To facilitate metrics updating, this commit also introduces a new raft message handler interface `OutgoingRaftMessageHandler`. This interface captures outgoing messages right before they are sent to `raftSendQueue`. Note that the message may not be successfully queued if the outgoing queue is full. Resolves: cockroachdb#103983 Release note (ops change): Six new metrics - "raft.rcvd.bytes" "raft.sent.bytes" "raft.rcvd.cross_region.bytes" "raft.sent.cross_region.bytes" "raft.rcvd.cross_zone.bytes" "raft.sent.cross_zone.bytes" - have now been added for each store. For accurate metrics, follow these assumptions: - Configure region and zone tier keys consistently across nodes. - Within a node locality, ensure unique region and zone tier keys. - Maintain consistent configuration of region and zone tiers across nodes.
craig bot
pushed a commit
that referenced
this issue
Jun 26, 2023
105123: kvserver: rename RaftMessageHandler to IncomingRaftMessageHandler r=tbg,kvoli a=wenyihu6 The commit renames `RaftMessageHandler`, `Listen`, and `Stop` to `IncomingRaftMessageHandler`, `ListenIncomingRaftMessages`, and `StopIncomingRaftMessages`. Another PR is introducing a new interface `OutgoingRaftMessageHandler`, dedicated to managing messages sent. The main purpose of this PR is to make the future PR cleaner by handling the renaming process. Note that this commit does not change any existing functionality. Part of: #103983 Related: #105122 Release Note: None 105519: kvserver: fix and re-enable slow lease application logging r=erikgrinaker a=erikgrinaker Resolves #97209. Epic: none Release note: None Co-authored-by: Wenyi <[email protected]> Co-authored-by: Erik Grinaker <[email protected]>
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 26, 2023
Prior to this commit, `CompareWithLocality` returned two boolean values indicating whether two localities were cross-region and cross-zone. However, this required callers to perform additional cross-comparison of these boolean values to make meaningful metrics updates. To simplify this, this commit introduces a new enum type `LocalityComparisonType`. It provides four locality comparison results: cross region, same region cross zone, same region same zone, and undefined (indicating error behavior). This refactoring allow the caller to directly use the comparison result without additional operations. In addition, this commit also updates the logic to classify activities between different regions as cross-regional, regardless of the zone tiers’ configuration. Initially, cross-region but same-zone tiers activities were flagged as misconfiguration. After some discussion, we have decided that regions should be non-overlapping. Hence, same zone tiers from different regions should still be considered as different zones. Part of: cockroachdb#103983 Release note: None
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 26, 2023
Previously, the metrics updates testing knobs were embedded within the logic of `DistSender.Send()` and `Node.Batch()`, which could confuse the readers of the code. This commit addresses the issue by removing end to end tests and replacing them with more focused unit tests. In addition, this commit also includes some refactoring of function parameters. Note that this commit does not change any existing functionality. Part of: cockroachdb#103983 Release note: None
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 27, 2023
Previously, there were no metrics to observe cross-region, cross-zone traffic in raft messages requests sent / received at each store. To improve this issue, this commit adds six new store metrics - ``` "raft.rcvd.bytes" "raft.sent.bytes" "raft.rcvd.cross_region.bytes" "raft.sent.cross_region.bytes" "raft.rcvd.cross_zone.bytes" "raft.sent.cross_zone.bytes" ``` The first two metrics track the total byte of raft messages received and sent in a store. Additionally, there are four metrics to track the aggregate byte count of cross-region, cross-zone Raft messages sent and received in the store. Note that these metrics capture the byte count of requests immediately upon message reception and just prior to message transmission. In the case of messages containing heartbeats or heartbeat_resps, they capture the byte count of requests with coalesced heartbeats. To facilitate metrics updating, this commit also introduces a new raft message handler interface `OutgoingRaftMessageHandler`. This interface captures outgoing messages right before they are sent to `raftSendQueue`. Note that the message may not be successfully queued if the outgoing queue is full. Resolves: cockroachdb#103983 Release note (ops change): Six new metrics - "raft.rcvd.bytes" "raft.sent.bytes" "raft.rcvd.cross_region.bytes" "raft.sent.cross_region.bytes" "raft.rcvd.cross_zone.bytes" "raft.sent.cross_zone.bytes" - have now been added for each store. For accurate metrics, follow these assumptions: - Configure region and zone tier keys consistently across nodes. - Within a node locality, ensure unique region and zone tier keys. - Maintain consistent configuration of region and zone tiers across nodes. # Conflicts: # pkg/kv/kvserver/client_raft_test.go
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 29, 2023
Previously, there were no metrics to observe cross-region, cross-zone traffic in raft messages requests sent / received at each store. To improve this issue, this commit adds six new store metrics - ``` "raft.rcvd.bytes" "raft.sent.bytes" "raft.rcvd.cross_region.bytes" "raft.sent.cross_region.bytes" "raft.rcvd.cross_zone.bytes" "raft.sent.cross_zone.bytes" ``` The first two metrics track the total byte of raft messages received and sent in a store. Additionally, there are four metrics to track the aggregate byte count of cross-region, cross-zone Raft messages sent and received in the store. Note that these metrics capture the byte count of requests immediately upon message reception and just prior to message transmission. In the case of messages containing heartbeats or heartbeat_resps, they capture the byte count of requests with coalesced heartbeats. To facilitate metrics updating, this commit also introduces a new raft message handler interface `OutgoingRaftMessageHandler`. This interface captures outgoing messages right before they are sent to `raftSendQueue`. Note that the message may not be successfully queued if the outgoing queue is full. Resolves: cockroachdb#103983 Release note (ops change): Six new metrics - "raft.rcvd.bytes" "raft.sent.bytes" "raft.rcvd.cross_region.bytes" "raft.sent.cross_region.bytes" "raft.rcvd.cross_zone.bytes" "raft.sent.cross_zone.bytes" - have now been added for each store. For accurate metrics, follow these assumptions: - Configure region and zone tier keys consistently across nodes. - Within a node locality, ensure unique region and zone tier keys. - Maintain consistent configuration of region and zone tiers across nodes. # Conflicts: # pkg/kv/kvserver/client_raft_test.go
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 30, 2023
Prior to this commit, `CompareWithLocality` returned two boolean values indicating whether two localities were cross-region and cross-zone. However, this required callers to perform additional cross-comparison of these boolean values to make meaningful metrics updates. To simplify this, this commit introduces a new enum type `LocalityComparisonType`. It provides four locality comparison results: cross region, same region cross zone, same region same zone, and undefined (indicating error behavior). This refactoring allow the caller to directly use the comparison result without additional operations. In addition, this commit also updates the logic to classify activities between different regions as cross-regional, regardless of the zone tiers’ configuration. Initially, cross-region but same-zone tiers activities were flagged as misconfiguration. After some discussion, we have decided that regions should be non-overlapping. Hence, same zone tiers from different regions should still be considered as different zones. In addition, this commit also includes some refactoring of function parameters. Note that this commit does not change any existing functionality. Part of: cockroachdb#103983 Release note: None
wenyihu6
added a commit
to wenyihu6/cockroach
that referenced
this issue
Jun 30, 2023
Prior to this commit, `CompareWithLocality` returned two boolean values indicating whether two localities were cross-region and cross-zone. However, this required callers to perform additional cross-comparison of these boolean values to make meaningful metrics updates. To simplify this, this commit introduces a new enum type `LocalityComparisonType`. It provides four locality comparison results: cross region, same region cross zone, same region same zone, and undefined (indicating error behavior). This refactoring allow the caller to directly use the comparison result without additional operations. In addition, this commit also updates the logic to classify activities between different regions as cross-regional, regardless of the zone tiers’ configuration. Initially, cross-region but same-zone tiers activities were flagged as misconfiguration. After some discussion, we have decided that regions should be non-overlapping. Hence, same zone tiers from different regions should still be considered as different zones. In addition, this commit also includes some refactoring of function parameters. Note that this commit does not change any existing functionality. Part of: cockroachdb#103983 Release note: None
craig bot
pushed a commit
that referenced
this issue
Jun 30, 2023
105267: kv: refactor CompareWithLocality to use enum r=tbg,kvoli a=wenyihu6 Prior to this commit, `CompareWithLocality` returned two boolean values indicating whether two localities were cross-region and cross-zone. However, this required callers to perform additional cross-comparison of these boolean values to make meaningful metrics updates. To simplify this, this commit introduces a new enum type `LocalityComparisonType`. It provides four locality comparison results: cross region, same region cross zone, same region same zone, and undefined (indicating error behavior). This refactoring allow the caller to directly use the comparison result without additional operations. In addition, this commit also updates the logic to classify activities between different regions as cross-regional, regardless of the zone tiers’ configuration. Initially, cross-region but same-zone tiers activities were flagged as misconfiguration. After some discussion, we have decided that regions should be non-overlapping. Hence, same zone tiers from different regions should still be considered as different zones. In addition, this commit also includes some refactoring of function parameters. Note that this commit does not change any existing functionality. Part of: #103983 Release note: None Co-authored-by: Wenyi <[email protected]>
craig bot
pushed a commit
that referenced
this issue
Jun 30, 2023
105122: kvserver: add x-region, x-zone Raft msg metrics to Store r=tbg a=wenyihu6 Previously, there were no metrics to observe cross-region, cross-zone traffic in raft messages requests sent / received at each store. To improve this issue, this commit adds six new store metrics - ``` "raft.rcvd.bytes" "raft.sent.bytes" "raft.rcvd.cross_region.bytes" "raft.sent.cross_region.bytes" "raft.rcvd.cross_zone.bytes" "raft.sent.cross_zone.bytes" ``` The first two metrics track the total byte of raft messages received and sent in a store. Additionally, there are four metrics to track the aggregate byte count of cross-region, cross-zone Raft messages sent and received in the store. Note that these metrics capture the byte count of requests immediately upon message reception and just prior to message transmission. In the case of messages containing heartbeats or heartbeat_resps, they capture the byte count of requests with coalesced heartbeats. To facilitate metrics updating, this commit also introduces a new raft message handler interface `OutgoingRaftMessageHandler`. This interface captures outgoing messages right before they are sent to `raftSendQueue`. Note that the message may not be successfully queued if the outgoing queue is full. Resolves: #103983 Release note (ops change): Six new metrics - "raft.rcvd.bytes" "raft.sent.bytes" "raft.rcvd.cross_region.bytes" "raft.sent.cross_region.bytes" "raft.rcvd.cross_zone.bytes" "raft.sent.cross_zone.bytes" - have now been added for each store. For accurate metrics, follow these assumptions: - Configure region and zone tier keys consistently across nodes. - Within a node locality, ensure unique region and zone tier keys. - Maintain consistent configuration of region and zone tiers across nodes. Co-authored-by: Wenyi <[email protected]>
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Jul 26, 2023
As part of cockroachdb#103983, metrics were added to track when requests cross region or zone boundaries, however verbose logging was added to the necessary locality comparisons. Whenever a cluster does not have locality tiers defined - as is the default on new clusters, as well as in tests - an error would be generated, and this error would be logged (with a full stack trace) at the same level as the transport debugging information, on every request. This change reduces that verbosity and removes the unnecessary stack trace from being printed on these locally generated and handled errors. Epic: none Release note: None
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-kv-observability
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
Is your feature request related to a problem? Please describe.
Currently, it is difficult to observe cross-region (and cross-zone) traffic for
batch requests / responses, raft activities, and snapshots. This limitation
becomes problematic when we need to assess the volume of cross-region traffic
handled by nodes. In addition, this also allows us to evaluate potential
optimization. For example, consistent follower reads should reduce cross-region
traffic.
Describe the solution you'd like
One solution is to collect byte count metrics across different levels.
Batch requests sent and batch responses received at a node
gateway node receiving the SQL queries. Additionally, we should also collect metrics
at the destination range node, which is responsible for the accessed data.
Snapshot bytes sent from and received at a store
Raft messages sent from and received at a store through RaftTransport
Limitation of the solution.
When tracking the transmission of bytes between regions and zones, we currently
rely on the byte count of the uncompressed data rather than the actual physical
bytes sent or received. While this approach is acceptable when the compression
factor remains consistent, it may hinder the accuracy of these metrics in
reflecting the true cross-region traffic volume.
Ideally, we would want to aggregate the cross-region metrics based on the
originating or forwarding region of the messages. This would allow us to assess
the workload of individual regions. However, this tracking across multiple
regions may lead to high cardinality.
Jira issue: CRDB-28287
The text was updated successfully, but these errors were encountered: