Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.1: server: support multi-span statistics endpoint #101877

Merged
merged 1 commit into from
Apr 26, 2023

Conversation

blathers-crl[bot]
Copy link

@blathers-crl blathers-crl bot commented Apr 19, 2023

Backport 1/1 commits from #101378 on behalf of @THardy98.

/cc @cockroachdb/release


Epic: None
Extends: #96223

This PR extends the implementation of our SpanStats RPC endpoint to fetch stats for multiple spans at once. By extending the endpoint, we amortize the cost of the RPC's node fanout across all requested spans, whereas previously, we were issuing a fanout per span requested. Additionally, this change batches KV layer requests for ranges fully contained by the span, instead of issuing a request per fully contained range.

Note that we do not deprecate the start_key and end_key fields as they're used to determine whether an old node is calling out to a node using the new proto format.

The changes here explicitly do not support mixed-version clusters.


BENCHMARK RESULTS

Here are some benchmark results from running:

BENCHTIMEOUT=72h PKG=./pkg/server BENCHES=BenchmarkSpanStats ./scripts/bench HEAD^ HEAD

Note that HEAD is actually a temp change to revert to old logic (request per span) and HEAD^ is the new logic (multi-span request). As such the increases in latency/memory are actually reductions.

name                                                                                                              old time/op    new time/op    delta
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_10_spans_with_25_ranges_each-24      10.3ms ± 2%    24.5ms ± 2%   +137.38%  (p=0.000 n=10+10)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_10_spans_with_50_ranges_each-24      17.1ms ± 2%    31.3ms ± 1%    +83.29%  (p=0.000 n=10+10)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_10_spans_with_100_ranges_each-24     30.5ms ± 2%   102.7ms ± 3%   +236.55%  (p=0.000 n=10+10)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_100_spans_with_25_ranges_each-24      1.75s ± 5%     2.10s ± 2%    +19.89%  (p=0.000 n=10+8)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_100_spans_with_50_ranges_each-24      3.00s ± 1%     3.43s ± 1%    +14.35%  (p=0.000 n=8+9)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_100_spans_with_100_ranges_each-24     5.01s ± 1%     5.53s ± 1%    +10.44%  (p=0.000 n=9+9)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_200_spans_with_25_ranges_each-24      9.66s ± 1%    10.63s ± 1%    +10.10%  (p=0.000 n=9+9)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_200_spans_with_50_ranges_each-24      15.2s ± 1%     16.2s ± 0%     +6.61%  (p=0.000 n=9+9)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_200_spans_with_100_ranges_each-24     17.4s ± 1%     18.6s ± 1%     +7.31%  (p=0.000 n=9+9)

name                                                                                                              old alloc/op   new alloc/op   delta
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_10_spans_with_25_ranges_each-24      3.91MB ± 2%   18.55MB ± 1%   +374.43%  (p=0.000 n=9+9)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_10_spans_with_50_ranges_each-24      6.95MB ± 2%   21.18MB ± 1%   +204.85%  (p=0.000 n=8+8)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_10_spans_with_100_ranges_each-24     13.3MB ± 1%   134.6MB ± 1%   +912.92%  (p=0.000 n=8+8)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_100_spans_with_25_ranges_each-24     1.99GB ± 4%    2.27GB ± 4%    +14.11%  (p=0.000 n=8+9)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_100_spans_with_50_ranges_each-24     4.16GB ± 2%    4.43GB ± 3%     +6.57%  (p=0.000 n=9+10)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_100_spans_with_100_ranges_each-24    7.50GB ± 1%    7.75GB ± 1%     +3.27%  (p=0.000 n=10+10)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_200_spans_with_25_ranges_each-24     11.8GB ± 0%    12.4GB ± 0%     +4.70%  (p=0.000 n=7+9)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_200_spans_with_50_ranges_each-24     21.1GB ± 2%    21.6GB ± 1%     +2.70%  (p=0.000 n=10+10)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_200_spans_with_100_ranges_each-24    25.8GB ± 0%    26.4GB ± 0%     +2.29%  (p=0.000 n=8+10)

name                                                                                                              old allocs/op  new allocs/op  delta
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_10_spans_with_25_ranges_each-24       26.9k ± 0%     90.1k ± 2%   +235.04%  (p=0.000 n=9+9)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_10_spans_with_50_ranges_each-24       51.8k ± 3%    114.9k ± 1%   +121.89%  (p=0.000 n=8+8)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_10_spans_with_100_ranges_each-24       106k ± 3%     1426k ± 1%  +1240.14%  (p=0.000 n=8+8)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_100_spans_with_25_ranges_each-24      23.2M ± 5%     23.9M ± 3%     +3.19%  (p=0.003 n=9+9)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_100_spans_with_50_ranges_each-24      48.7M ± 2%     49.4M ± 2%       ~     (p=0.075 n=10+10)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_100_spans_with_100_ranges_each-24     87.9M ± 1%     88.6M ± 1%       ~     (p=0.075 n=10+10)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_200_spans_with_25_ranges_each-24       140M ± 0%      142M ± 0%     +1.04%  (p=0.000 n=8+9)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_200_spans_with_50_ranges_each-24       248M ± 1%      249M ± 1%     +0.65%  (p=0.001 n=10+10)
SpanStats/3node/BenchmarkSpanStats_-_span_stats_for_3_node_cluster,_collecting_200_spans_with_100_ranges_each-24      306M ± 1%      308M ± 0%     +0.57%  (p=0.002 n=10+9)

Some notable improvements particularly with requests for spans with fewer ranges. After a point, the raw number of ranges becomes the bottleneck, despite reducing the number of fanouts. Not sure if there is a better way to fetch range statistics but I think the improvement here is enough for this PR. If improvements for fetching range statistics are identified, they can be done in a follow up PR and backported.


Release note (sql change): span statistics are unavailable on mixed-version clusters


Release justification: performance improvement fetching span statistics in bulk

Extends: #96223

This PR extends the implementation of our SpanStats RPC endpoint to fetch stats
for multiple spans at once. By extending the endpoint, we amortize the cost of
the RPC's node fanout across all requested spans, whereas previously, we were
issuing a fanout per span requested.  Additionally, this change batches KV
layer requests for ranges fully contained by the span, instead of issuing a
request per fully contained range.

Release note (sql change): span statistics are unavailable on mixed-version
clusters
@blathers-crl blathers-crl bot requested review from a team as code owners April 19, 2023 21:41
@blathers-crl blathers-crl bot requested a review from a team April 19, 2023 21:41
@blathers-crl blathers-crl bot requested a review from a team as a code owner April 19, 2023 21:41
@blathers-crl blathers-crl bot requested a review from a team April 19, 2023 21:41
@blathers-crl blathers-crl bot requested a review from a team as a code owner April 19, 2023 21:41
@blathers-crl
Copy link
Author

blathers-crl bot commented Apr 19, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl blathers-crl bot requested a review from a team as a code owner April 19, 2023 21:41
@blathers-crl blathers-crl bot requested a review from a team April 19, 2023 21:41
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Apr 19, 2023
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.1-101378 branch from cff1a58 to b0a7afc Compare April 19, 2023 21:41
@blathers-crl blathers-crl bot requested review from srosenberg and smg260 and removed request for a team April 19, 2023 21:41
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.1-101378 branch from ece2060 to 84dbb3e Compare April 19, 2023 21:41
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@THardy98 THardy98 removed request for a team April 20, 2023 12:49
Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @j82w, @THardy98, and @zachlite)

@THardy98
Copy link

TYFR :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants