c2c: gather perf metrics from prometheus #97465

Merged
merged 3 commits into from
Feb 27, 2023


@msbutler msbutler commented Feb 22, 2023

c2c roachtest performance metrics are now gathered by a prom/grafana instance running locally on the roachprod cluster. This change allows us to gather and process any metric exposed on the crdb prom endpoint. Specifically, we now gather capacity_used, replication_logical_bytes, and replication_sst_bytes at various points during the c2c roachtest, allowing us to measure:

  • Initial Scan Throughput: initial scan size / initial scan duration
  • Workload Throughput: data ingested during workload / workload duration
  • Cutover Throughput: (data ingested between cutover time and cutover cmd) / (cutover process duration)

where the size of these operations can be measured as either physical replicated bytes, logical ingested bytes, or physical ingested bytes on the source cluster.
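
The three throughput definitions above share one shape: a byte-counter delta divided by a phase duration. A minimal Go sketch of that arithmetic follows. The names `throughputMBPerSecPerNode` and `bytesPerMB`, the per-node normalization, and the choice of 1 MB = 2^20 bytes are assumptions made for illustration, not taken from the patch, and the numbers in `main` are made up.

```go
package main

import (
	"fmt"
	"time"
)

// bytesPerMB is an assumption: the patch may define a megabyte differently.
const bytesPerMB = 1 << 20

// throughputMBPerSecPerNode (hypothetical helper) converts a byte delta
// observed over a phase into MB per second per node.
func throughputMBPerSecPerNode(deltaBytes float64, elapsed time.Duration, nodes int) (float64, error) {
	if elapsed <= 0 || nodes <= 0 {
		return 0, fmt.Errorf("invalid duration (%s) or node count (%d)", elapsed, nodes)
	}
	sizeMB := deltaBytes / bytesPerMB
	return sizeMB / elapsed.Seconds() / float64(nodes), nil
}

func main() {
	// Illustrative values only: a metric such as replication_logical_bytes
	// sampled at the start and end of a phase.
	startBytes := 0.0
	endBytes := 2400.0 * bytesPerMB

	tp, err := throughputMBPerSecPerNode(endBytes-startBytes, 10*time.Minute, 4)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.2f MB/s/node\n", tp)
}
```

The same helper would apply to each phase (initial scan, workload, cutover) and to each size measure by swapping in the corresponding metric delta and duration.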

This patch also fixes a recent bug which mislabeled src cluster throughput as initial scan throughput.

Informs #89176

Release note: None

@msbutler msbutler requested a review from a team as a code owner February 22, 2023 15:08
@msbutler msbutler self-assigned this Feb 22, 2023
@msbutler msbutler requested review from herkolategan and smg260 and removed request for a team February 22, 2023 15:08
@msbutler

From the logs of a c2c/tpcc/warehouses=500/duration=10/cutover=5 roachtest run:

14:57:38 cluster_to_cluster.go:241: InitialScan Perf:
14:57:38 cluster_to_cluster.go:243:     Duration Minutes : 14.87
14:57:38 cluster_to_cluster.go:243:     Size_LogicalMegabytes : 37654.72
14:57:38 cluster_to_cluster.go:243:     Size_PhysicalMegabytes : 10311.20
14:57:38 cluster_to_cluster.go:243:     Size_PhysicalReplicatedMegabytes : 31825.59
14:57:38 cluster_to_cluster.go:243:     Throughput_LogicalMegabytes_MB/S/Node : 10.55
14:57:38 cluster_to_cluster.go:243:     Throughput_PhysicalMegabytes_MB/S/Node : 2.89
14:57:38 cluster_to_cluster.go:243:     Throughput_PhysicalReplicatedMegabytes_MB/S/Node : 8.92
14:57:38 cluster_to_cluster.go:241: Workload Perf:
14:57:38 cluster_to_cluster.go:243:     Duration Minutes : 10.00
14:57:38 cluster_to_cluster.go:243:     Size_LogicalMegabytes : 490.68
14:57:38 cluster_to_cluster.go:243:     Size_PhysicalMegabytes : 165.46
14:57:38 cluster_to_cluster.go:243:     Size_PhysicalReplicatedMegabytes : 834.27
14:57:38 cluster_to_cluster.go:243:     Throughput_LogicalMegabytes_MB/S/Node : 0.20
14:57:38 cluster_to_cluster.go:243:     Throughput_PhysicalMegabytes_MB/S/Node : 0.07
14:57:38 cluster_to_cluster.go:243:     Throughput_PhysicalReplicatedMegabytes_MB/S/Node : 0.35
14:57:38 cluster_to_cluster.go:241: Cutover Perf:
14:57:38 cluster_to_cluster.go:243:     Duration Minutes : 0.66
14:57:38 cluster_to_cluster.go:243:     Size_LogicalMegabytes : 203.06
14:57:38 cluster_to_cluster.go:243:     Size_PhysicalMegabytes : 63.95
14:57:38 cluster_to_cluster.go:243:     Size_PhysicalReplicatedMegabytes : 221.79
14:57:38 cluster_to_cluster.go:243:     Throughput_LogicalMegabytes_MB/S/Node : 1.29
14:57:38 cluster_to_cluster.go:243:     Throughput_PhysicalMegabytes_MB/S/Node : 0.41
14:57:38 cluster_to_cluster.go:243:     Throughput_PhysicalReplicatedMegabytes_MB/S/Node : 1.41

@msbutler msbutler force-pushed the butler-bench-alt branch 3 times, most recently from a4e563b to 48c0e08 Compare February 23, 2023 15:51
if err != nil {
	t.L().Errorf("Could not query prom %s", err.Error())
}
metricSnap[name] = sumOverLabel(point, stat.LabelName)
Collaborator


Is it possible to put this summation into the Query like you can in the grafana graph?

Collaborator Author


Sadly, not with this collector.CollectPoint() api. I'll leave a todo and file an issue.
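
For context on the exchange above, here is a rough sketch of what a client-side summation over a label could look like. The `labeledValues` type is a simplified, hypothetical stand-in for the collected point, and this `sumOverLabel` is not the patch's implementation; the real types and signature may differ. The server-side alternative the reviewer asks about would be a PromQL aggregation such as `sum(...)` inside the query itself.

```go
package main

import "fmt"

// labeledValues is a hypothetical, simplified shape for one queried metric:
// label name -> label value -> sample value.
type labeledValues map[string]map[string]float64

// sumOverLabel adds up the samples recorded under every value of the given
// label, e.g. one sample per node when the label is "node".
func sumOverLabel(samples labeledValues, label string) float64 {
	var total float64
	for _, v := range samples[label] {
		total += v
	}
	return total
}

func main() {
	// Illustrative per-node values for a metric such as capacity_used.
	capacityUsed := labeledValues{
		"node": {"1": 100, "2": 150, "3": 250},
	}
	fmt.Println(sumOverLabel(capacityUsed, "node")) // prints 500
}
```

A label that is absent from the map yields a sum of zero here; a real implementation might prefer to report that case as an error.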

Epic: None
This patch streamlines how we remove RU limiting for roachtests that use
tenants. For the c2c tests specifically, we now remove the limits on the dst
cluster tenant as soon as the replication stream begins.

Release note: None
@msbutler

TFTR!

bors r=stevendanna

craig bot commented Feb 27, 2023

Build failed (retrying...):

craig bot commented Feb 27, 2023

Build succeeded:

@craig craig bot merged commit f1a4c63 into cockroachdb:master Feb 27, 2023
@msbutler msbutler deleted the butler-bench-alt branch March 19, 2023 22:37