-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[wip,dnm] tracing: peephole optimizations #62227
Closed
irfansharif
wants to merge
16
commits into
cockroachdb:master
from
irfansharif:210318.tracing-peephole-optimizations
Closed
[wip,dnm] tracing: peephole optimizations #62227
irfansharif
wants to merge
16
commits into
cockroachdb:master
from
irfansharif:210318.tracing-peephole-optimizations
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Fixes cockroachdb#62081. It had rotted after we removed the trace.mode setting. It now makes use of a debug-only cluster setting that if enabled, installs real spans all sql statements. We'll defer the actual investigation into the cause of the slow-down in future PRs. $ make bench PKG=./pkg/bench TESTFLAGS='-benchtime=10000x -count 20' \ BENCHES='BenchmarkTracing' BENCHTIMEOUT=50m > benchmark-tracing.txt $ cat benchmark-tracing.txt | grep -v tracing=t | sed 's/tracing=f//' > old.txt $ cat benchmark-tracing.txt | grep -v tracing=f | sed 's/tracing=t//' > new.txt $ benchstat old.txt new.txt name old time/op new time/op delta Tracing/Cockroach//Scan1-24 382µs ± 2% 412µs ± 8% +7.68% (p=0.000 n=10+10) Tracing/Cockroach//Insert-24 579µs ± 2% 609µs ± 6% +5.10% (p=0.000 n=10+10) Tracing/MultinodeCockroach//Scan1-24 799µs ± 2% 885µs ± 1% +10.76% (p=0.000 n=10+9) Tracing/MultinodeCockroach//Insert-24 1.09ms ± 1% 1.14ms ± 2% +5.04% (p=0.000 n=9+10) name old alloc/op new alloc/op delta Tracing/Cockroach//Scan1-24 25.3kB ± 5% 32.0kB ± 4% +26.55% (p=0.000 n=10+10) Tracing/Cockroach//Insert-24 38.0kB ± 6% 42.2kB ± 5% +11.02% (p=0.000 n=10+10) Tracing/MultinodeCockroach//Scan1-24 77.7kB ± 8% 97.1kB ±12% +25.05% (p=0.000 n=10+10) Tracing/MultinodeCockroach//Insert-24 107kB ± 8% 115kB ± 7% +7.66% (p=0.023 n=10+10) name old allocs/op new allocs/op delta Tracing/Cockroach//Scan1-24 256 ± 1% 280 ± 1% +9.10% (p=0.000 n=9+9) Tracing/Cockroach//Insert-24 301 ± 2% 321 ± 1% +6.64% (p=0.000 n=10+10) Tracing/MultinodeCockroach//Scan1-24 787 ± 2% 921 ± 2% +17.04% (p=0.000 n=9+9) Tracing/MultinodeCockroach//Insert-24 748 ± 1% 805 ± 2% +7.61% (p=0.000 n=9+9) Release note: None
Instead of the incoming ctx. Going through the ctx package is expensive, and log tags are unlikely to change between parent and child spans. --- Actually, drop log tags altogether. Release note: None
Release note: None
- Reduce lock contention by only deleting root spans from registry. - Don't materialize span recordings unless absolutely needed. - Compute span duration outside of mutex Release note: None
Tracer.AlwaysTrace is redundant. We can also avoid doing opentracing stuff if we have no external sinks configured. Release note: None
Release note: None
irfansharif
changed the title
210318.tracing peephole optimizations
[wip,dnm] tracing: peephole optimizations
Mar 19, 2021
This reverts commit 1631298. Release note: None
irfansharif
force-pushed
the
210318.tracing-peephole-optimizations
branch
5 times, most recently
from
March 22, 2021 19:47
4b10a0f
to
bc9ff32
Compare
- use an optimizedCtx for collapsed spans - avoid stepping through interface for carrier - mutate incoming ctx's metadata.MD directly - disable emptry registry check, we broke it a few commits ago ([wip,dnm] tracing: move from interface type; avoid string.ToLower) - skip some common span tags Release note: None
irfansharif
force-pushed
the
210318.tracing-peephole-optimizations
branch
from
March 22, 2021 20:46
bc9ff32
to
f0f6d58
Compare
15 tasks
irfansharif
added a commit
to irfansharif/cockroach
that referenced
this pull request
Mar 23, 2021
Fixes cockroachdb#62081. It had rotted after we removed the trace.mode setting. It now makes use of a testing knob if enabled, installs real spans all sql statements. We'll defer the actual investigation into the cause of the slow-down in future PRs (prototyped in cockroachdb#62227). $ make bench PKG=./pkg/bench TESTFLAGS='-benchtime=10000x -count 20' \ BENCHES='BenchmarkTracing' BENCHTIMEOUT=50m > benchmark-tracing.txt $ cat benchmark-tracing.txt | grep -v tracing=t | sed 's/tracing=f\///' > old.txt $ cat benchmark-tracing.txt | grep -v tracing=f | sed 's/tracing=t\///' > new.txt $ benchstat old.txt new.txt name old time/op new time/op delta Tracing/Cockroach/Scan1-24 382µs ± 2% 412µs ± 8% +7.68% (p=0.000 n=10+10) Tracing/Cockroach/Insert-24 579µs ± 2% 609µs ± 6% +5.10% (p=0.000 n=10+10) Tracing/MultinodeCockroach/Scan1-24 799µs ± 2% 885µs ± 1% +10.76% (p=0.000 n=10+9) Tracing/MultinodeCockroach/Insert-24 1.09ms ± 1% 1.14ms ± 2% +5.04% (p=0.000 n=9+10) name old alloc/op new alloc/op delta Tracing/Cockroach/Scan1-24 25.3kB ± 5% 32.0kB ± 4% +26.55% (p=0.000 n=10+10) Tracing/Cockroach/Insert-24 38.0kB ± 6% 42.2kB ± 5% +11.02% (p=0.000 n=10+10) Tracing/MultinodeCockroach/Scan1-24 77.7kB ± 8% 97.1kB ±12% +25.05% (p=0.000 n=10+10) Tracing/MultinodeCockroach/Insert-24 107kB ± 8% 115kB ± 7% +7.66% (p=0.023 n=10+10) name old allocs/op new allocs/op delta Tracing/Cockroach/Scan1-24 256 ± 1% 280 ± 1% +9.10% (p=0.000 n=9+9) Tracing/Cockroach/Insert-24 301 ± 2% 321 ± 1% +6.64% (p=0.000 n=10+10) Tracing/MultinodeCockroach/Scan1-24 787 ± 2% 921 ± 2% +17.04% (p=0.000 n=9+9) Tracing/MultinodeCockroach/Insert-24 748 ± 1% 805 ± 2% +7.61% (p=0.000 n=9+9) Release note: None
The sum of these changes look promising-ish. Here are 5 runs with kv95.
I'll work on polish in separate PRs. |
craig bot
pushed a commit
that referenced
this pull request
Mar 23, 2021
62118: sql: revive BenchmarkTracing r=irfansharif a=irfansharif Fixes #62081. It had rotted after we removed the trace.mode setting. It now makes use of a testing knob if enabled, installs real spans all sql statements. We'll defer the actual investigation into the cause of the slow-down in future PRs (prototyped in #62227). $ make bench PKG=./pkg/bench TESTFLAGS='-benchtime=10000x -count 20' \ BENCHES='BenchmarkTracing' BENCHTIMEOUT=50m > benchmark-tracing.txt $ cat benchmark-tracing.txt | grep -v tracing=t | sed 's/tracing=f//' > old.txt $ cat benchmark-tracing.txt | grep -v tracing=f | sed 's/tracing=t//' > new.txt $ benchstat old.txt new.txt name old time/op new time/op delta Tracing/Cockroach//Scan1-24 382µs ± 2% 412µs ± 8% +7.68% (p=0.000 n=10+10) Tracing/Cockroach//Insert-24 579µs ± 2% 609µs ± 6% +5.10% (p=0.000 n=10+10) Tracing/MultinodeCockroach//Scan1-24 799µs ± 2% 885µs ± 1% +10.76% (p=0.000 n=10+9) Tracing/MultinodeCockroach//Insert-24 1.09ms ± 1% 1.14ms ± 2% +5.04% (p=0.000 n=9+10) name old alloc/op new alloc/op delta Tracing/Cockroach//Scan1-24 25.3kB ± 5% 32.0kB ± 4% +26.55% (p=0.000 n=10+10) Tracing/Cockroach//Insert-24 38.0kB ± 6% 42.2kB ± 5% +11.02% (p=0.000 n=10+10) Tracing/MultinodeCockroach//Scan1-24 77.7kB ± 8% 97.1kB ±12% +25.05% (p=0.000 n=10+10) Tracing/MultinodeCockroach//Insert-24 107kB ± 8% 115kB ± 7% +7.66% (p=0.023 n=10+10) name old allocs/op new allocs/op delta Tracing/Cockroach//Scan1-24 256 ± 1% 280 ± 1% +9.10% (p=0.000 n=9+9) Tracing/Cockroach//Insert-24 301 ± 2% 321 ± 1% +6.64% (p=0.000 n=10+10) Tracing/MultinodeCockroach//Scan1-24 787 ± 2% 921 ± 2% +17.04% (p=0.000 n=9+9) Tracing/MultinodeCockroach//Insert-24 748 ± 1% 805 ± 2% +7.61% (p=0.000 n=9+9) Release note: None 62193: geo/geomfn: add point in polygon optimization to st_intersects r=otan a=andyyang890 This patch improves the performance of st_intersects for the common use case of testing whether a (multi)polygon and a (multi)point intersect. Release note: None Co-authored-by: irfan sharif <[email protected]> Co-authored-by: Andy Yang <[email protected]>
irfansharif
added a commit
to irfansharif/cockroach
that referenced
this pull request
Apr 1, 2021
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently what cockroachdb#59379 ended up measuring (and finding the overhead acceptable) for was the performance hit we would incur for propagating stats payloads for statements already using real tracing spans. Since then, the landscape has changed. Notably we introduced cockroachdb#61777, which made it so that we were only using real tracing spans for sampled statements. This was done after performance analysis in cockroachdb#59424 showed that the use of real tracing spans in all statements resulted in tremendous overhead, for no real benefit. What this now leaves us with is a sampling rate that was tuned by only considering the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture for what the default rate should be. --- To find out what the overhead for sampled statements are currently, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between each run is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling entirely (and tracing), and measure the throughput degradation as we vary the sampling rate. ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #1 | 69817.90 | 69406.37 0.01 / read / #1 | 69300.35 -0.74% | 68717.23 -0.99% 0.10 / read / #1 | 67743.35 -2.97% | 67601.81 -2.60% 0.00 / write / #1 | 3672.55 | 3653.63 0.01 / write / #1 | 3647.65 -0.68% | 3615.90 -1.03% 0.10 / write / #1 | 3567.20 -2.87% | 3558.90 -2.59% ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #2 | 69440.80 68893.24 0.01 / read / #2 | 69481.55 +0.06% 69463.13 +0.82% (probably in the noise margin) 0.10 / read / #2 | 67841.80 -2.30% 66992.55 -2.76% 0.00 / write / #2 | 3652.45 3625.24 0.01 / write / #2 | 3657.55 -0.14% 3654.34 +0.80% 0.10 / write / #2 | 3570.75 -2.24% 3526.04 -2.74% The results above suggest that the current default rate of 10% is too high, and a 1% rate is much more acceptable. --- The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in cockroachdb#62227), but they're much too invasive to backport to 21.1. It's unfortunate that we only discovered the overhead this late in the development cycle. It was due to two major reasons: - cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by propagating real tracing spans across rpc boundaries). We had done sanity checking for the tracing overhead before this point, but failed to realize that cockroachdb#59992 would merit re-analysis. - The test that alerted us to the degradation (tpccbench) had be persistently failing for a myriad of other reasons, so we didn't learn until too late that tracing was the latest offendor. tpccbench also doesn't deal with VM overload well (something cockroachdb#62361 hopes to address), and after tracing was enabled for realsies, this was the dominant failure mode. This resulted in perf data not making it's way to roachperf, which further hid possible indicators we had a major regression on our hands. We also didn't have a healthy process looking at roachperf on a continual basis, something we're looking to rectify going forward. We would've picked up on this regression had we been closely monitoring the kv95 charts. Release note: None
craig bot
pushed a commit
that referenced
this pull request
Apr 2, 2021
62998: sql: lower default sampling rate to 1% r=irfansharif a=irfansharif We arrived at the previous default rate of 10% back in #59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently what #59379 ended up measuring (and finding the overhead acceptable) for was the performance hit we would incur for propagating stats payloads for statements already using real tracing spans. Since then, the landscape has changed. Notably we introduced #61777, which made it so that we were only using real tracing spans for sampled statements. This was done after performance analysis in #59424 showed that the use of real tracing spans in all statements resulted in tremendous overhead, for no real benefit. What this now leaves us with is a sampling rate that was tuned by only considering the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture for what the default rate should be. --- To find out what the overhead for sampled statements are currently, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between each run is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling entirely (and tracing), and measure the throughput degradation as we vary the sampling rate. ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #1 | 69817.90 | 69406.37 0.01 / read / #1 | 69300.35 -0.74% | 68717.23 -0.99% 0.10 / read / #1 | 67743.35 -2.97% | 67601.81 -2.60% 0.00 / write / #1 | 3672.55 | 3653.63 0.01 / write / #1 | 3647.65 -0.68% | 3615.90 -1.03% 0.10 / write / #1 | 3567.20 -2.87% | 3558.90 -2.59% ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #2 | 69440.80 68893.24 0.01 / read / #2 | 69481.55 +0.06% 69463.13 +0.82% (probably in the noise margin) 0.10 / read / #2 | 67841.80 -2.30% 66992.55 -2.76% 0.00 / write / #2 | 3652.45 3625.24 0.01 / write / #2 | 3657.55 -0.14% 3654.34 +0.80% 0.10 / write / #2 | 3570.75 -2.24% 3526.04 -2.74% The results above suggest that the current default rate of 10% is too high, and a 1% rate is much more acceptable. --- The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in #62227), but they're much too invasive to backport to 21.1. Release note: None Co-authored-by: irfan sharif <[email protected]>
irfansharif
added a commit
to irfansharif/cockroach
that referenced
this pull request
Apr 2, 2021
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently what cockroachdb#59379 ended up measuring (and finding the overhead acceptable) for was the performance hit we would incur for propagating stats payloads for statements already using real tracing spans. Since then, the landscape has changed. Notably we introduced cockroachdb#61777, which made it so that we were only using real tracing spans for sampled statements. This was done after performance analysis in cockroachdb#59424 showed that the use of real tracing spans in all statements resulted in tremendous overhead, for no real benefit. What this now leaves us with is a sampling rate that was tuned by only considering the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture for what the default rate should be. --- To find out what the overhead for sampled statements are currently, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between each run is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling entirely (and tracing), and measure the throughput degradation as we vary the sampling rate. ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #1 | 69817.90 | 69406.37 0.01 / read / #1 | 69300.35 -0.74% | 68717.23 -0.99% 0.10 / read / #1 | 67743.35 -2.97% | 67601.81 -2.60% 0.00 / write / #1 | 3672.55 | 3653.63 0.01 / write / #1 | 3647.65 -0.68% | 3615.90 -1.03% 0.10 / write / #1 | 3567.20 -2.87% | 3558.90 -2.59% ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #2 | 69440.80 68893.24 0.01 / read / #2 | 69481.55 +0.06% 69463.13 +0.82% (probably in the noise margin) 0.10 / read / #2 | 67841.80 -2.30% 66992.55 -2.76% 0.00 / write / #2 | 3652.45 3625.24 0.01 / write / #2 | 3657.55 -0.14% 3654.34 +0.80% 0.10 / write / #2 | 3570.75 -2.24% 3526.04 -2.74% The results above suggest that the current default rate of 10% is too high, and a 1% rate is much more acceptable. --- The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in cockroachdb#62227), but they're much too invasive to backport to 21.1. It's unfortunate that we only discovered the overhead this late in the development cycle. It was due to two major reasons: - cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by propagating real tracing spans across rpc boundaries). We had done sanity checking for the tracing overhead before this point, but failed to realize that cockroachdb#59992 would merit re-analysis. - The test that alerted us to the degradation (tpccbench) had be persistently failing for a myriad of other reasons, so we didn't learn until too late that tracing was the latest offendor. tpccbench also doesn't deal with VM overload well (something cockroachdb#62361 hopes to address), and after tracing was enabled for realsies, this was the dominant failure mode. This resulted in perf data not making it's way to roachperf, which further hid possible indicators we had a major regression on our hands. We also didn't have a healthy process looking at roachperf on a continual basis, something we're looking to rectify going forward. We would've picked up on this regression had we been closely monitoring the kv95 charts. Release note: None
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.