sql: only create real spans when session tracing/sampling #61777
Conversation
Looks like the failure is an OOB on this line: cockroach/pkg/sql/sem/builtins/builtins.go Line 3633 in c7f9851
A non-nil span with an expected recording has none.
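For context, the crash vector is indexing into a recording that can legitimately be empty. A minimal self-contained sketch of the defensive shape of the fix (RecordedSpan and traceIDFromRecording here are illustrative stand-ins, not the actual builtin code):

```go
package main

import "fmt"

// RecordedSpan is a stand-in for a single entry in a span's recording.
type RecordedSpan struct{ TraceID uint64 }

// traceIDFromRecording returns (0, false) for an empty recording instead of
// panicking on recording[0], which is what the builtin used to do.
func traceIDFromRecording(recording []RecordedSpan) (uint64, bool) {
	if len(recording) == 0 {
		return 0, false
	}
	return recording[0].TraceID, true
}

func main() {
	if _, ok := traceIDFromRecording(nil); !ok {
		fmt.Println("empty recording: no trace ID (previously an out-of-bounds panic)")
	}
}
```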
86415df to 5cdd4b0 (Compare)
fa2136b to a948908 (Compare)
Looking just at the
My first guess is that a
a948908 to 9127386 (Compare)
Heh, I saw your comment too late. I was just about to ask for your (or @yuzefovich's) second pair of eyes. I don't think the colflow outbox code correctly deals with the possibility that the span available to it is a no-op one. What do you suppose should be done? Does it make sense to force real span creation at outboxes? I suppose not, which is exactly the behavior #61380 looks to change.
cockroach/pkg/sql/execinfrapb/data.go Lines 254 to 255 in 70dd877
This is where the empty TraceData is being set.
Looks like we were willing to propagate empty traces before. We'll need the following diff, I think:

```diff
diff --git a/pkg/sql/colflow/colrpc/outbox.go b/pkg/sql/colflow/colrpc/outbox.go
index f9395956b0..b97d494943 100644
--- a/pkg/sql/colflow/colrpc/outbox.go
+++ b/pkg/sql/colflow/colrpc/outbox.go
@@ -298,7 +298,7 @@ func (o *Outbox) sendMetadata(ctx context.Context, stream flowStreamClient, errT
 			o.span.RecordStructured(s)
 		}
 	}
-	if trace := execinfra.GetTraceData(ctx); trace != nil {
+	if trace := execinfra.GetTraceData(ctx); trace != nil && len(trace) != 0 {
 		msg.Data.Metadata = append(msg.Data.Metadata, execinfrapb.RemoteProducerMetadata{
 			Value: &execinfrapb.RemoteProducerMetadata_TraceData_{
 				TraceData: &execinfrapb.RemoteProducerMetadata_TraceData{
```
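One small note on the guard in the diff above: in Go, len of a nil slice is 0, so the len(trace) != 0 check alone would cover both the nil and the empty case. A tiny self-contained illustration (the trace values are made up):

```go
package main

import "fmt"

func main() {
	var nilTrace []byte    // no trace data at all
	emptyTrace := []byte{} // non-nil but empty, e.g. what a no-op span could yield
	// Both are skipped by a bare length check; the explicit nil check is redundant.
	fmt.Println(len(nilTrace) != 0, len(emptyTrace) != 0) // false false
}
```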
7159e29 to b5f52f9 (Compare)
I'm not sure what to say here yet, but I'd encourage you to take a look at modifying
Also, I'm not quite sure how remote child spans interact with root spans, but I imagine we do need to have "real" spans in the outboxes when the execution stats are being collected. Yeah, #61380 was trying to avoid doing that but also to fix the misuse of the API (where we started and stopped spans only for the purposes of the stats propagation). Overall, I think I might not have the right terminology and understanding of the impact here; possibly Alfonso will have some suggestions/clarifications.
f721a58 to 2b70653 (Compare)
2b70653 to f278b1f (Compare)
This should now be ready for a real look. It's a surprisingly small diff, but it was a bit unwieldy to get right (most of sql is unfamiliar to me). Given we're hoping to backport this to release-21.1, we should give this a close look and maybe let it bake for a few days.
Looks good to me, but I'm not an expert on the SQL side, so my LGTM is really just for the tracing package changes.
Reviewed 1 of 1 files at r1, 11 of 11 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @yuzefovich)
pkg/sql/exec_util.go, line 1656 at r2 (raw file):

```go
	"session tracing", tracing.WithForceRealSpan())
sp.SetVerbose(true)
st.ex.state.Ctx = newTxnCtx
```
Does it make sense to use a ctxHolder here instead of overwriting the ctx directly?
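For anyone unfamiliar with the suggestion: the conn executor already has a small holder that can temporarily "hijack" the context it hands out and later restore the original. A simplified, self-contained sketch of that pattern (the field and method names here are illustrative, not the exact ones in conn_executor.go):

```go
package main

import (
	"context"
	"fmt"
)

type ctxKey string

// ctxHolder sketches the hijack/unhijack pattern: the base context is kept
// around so that the hijacked one can be dropped again later.
type ctxHolder struct {
	connCtx     context.Context
	hijackedCtx context.Context
}

func (h *ctxHolder) ctx() context.Context {
	if h.hijackedCtx != nil {
		return h.hijackedCtx
	}
	return h.connCtx
}

func (h *ctxHolder) hijack(ctx context.Context) { h.hijackedCtx = ctx }
func (h *ctxHolder) unhijack()                  { h.hijackedCtx = nil }

func main() {
	h := ctxHolder{connCtx: context.Background()}
	// Stand-in for a context carrying a real (verbose) session-tracing span.
	tracingCtx := context.WithValue(context.Background(), ctxKey("mode"), "session-tracing")

	h.hijack(tracingCtx)
	fmt.Println(h.ctx().Value(ctxKey("mode"))) // session-tracing
	h.unhijack()
	fmt.Println(h.ctx().Value(ctxKey("mode"))) // <nil>
}
```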
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
```go
if sp := tracing.SpanFromContext(ctx); sp != nil {
	ih.sp = sp
	ih.finishSpan = false
```
I don't think we want to remove finishSpan here. cc @yuzefovich. The case I'm thinking about specifically is when statement tracing threshold is enabled. A span is created at a higher level so the instrumentation helper should not Finish it, but it is still worth it to get the span's recording and record the stats. We should probably add a regression test for this case to instrumentation_test.go.
pkg/sql/logictest/testdata/logic_test/contention_event, line 30 at r2 (raw file):
```
statement ok
SET TRACING = on
```
This is more of a macro question but I thought that the idea of the background tracing work was that we would always have some sort of tracing enabled, so having to turn on session tracing to get contention events feels wrong. How should I think about this?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @irfansharif)
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
I don't think we want to remove finishSpan here. cc @yuzefovich. The case I'm thinking about specifically is when statement tracing threshold is enabled. A span is created at a higher level so the instrumentation helper should not Finish it, but it is still worth it to get the span's recording and record the stats. We should probably add a regression test for this case to instrumentation_test.go.
Note that the first commit is the revert of #61532 which introduced finishSpan. AFAICT the only remaining piece from that PR is a few changes in execCmd. The revert looks good to me, but I wonder why you chose to do a "by-hand" revert?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)
pkg/sql/exec_util.go, line 1656 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Does it make sense to use a ctxHolder here instead of overwriting the ctx directly?
Originally I wanted to avoid a larger diff than necessary, but meh, done. PTAL.
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Note that the first commit is the revert of #61532 which introduced finishSpan. AFAICT the only remaining piece from that PR is a few changes in execCmd. The revert looks good to me, but I wonder why you chose to do a "by-hand" revert?
I probably caused more confusion than necessary. I wanted to keep the comments introduced in that commit; everything else was mechanical. So if I'm reading everything correctly, I don't think there's anything to do with this review thread?
pkg/sql/logictest/testdata/logic_test/contention_event, line 30 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
This is more of a macro question but I thought that the idea of the background tracing work was that we would always have some sort of tracing enabled, so having to turn on session tracing to get contention events feels wrong. How should I think about this?
The way to think about it now is that we'll only get contention events for sampled statements/txns. The sql.txn_stats.sample_rate knob lets us traverse a spectrum between "always on tracing" as you're thinking of it, and "always off". For these tests, SET TRACING = on is effectively doing the same thing as sql.txn_stats.sample_rate = 1.0.
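To make the spectrum concrete, here's a hedged sketch of what a rate-based per-txn sampling decision looks like (the helper below is illustrative; the real knob is the sql.txn_stats.sample_rate cluster setting):

```go
package main

import (
	"fmt"
	"math/rand"
)

// shouldSample illustrates the rate semantics: 0.0 never samples (no real
// tracing spans), 1.0 always samples, and 0.01 samples roughly 1% of txns.
func shouldSample(rng *rand.Rand, rate float64) bool {
	return rate > 0 && rng.Float64() < rate
}

func main() {
	rng := rand.New(rand.NewSource(42))
	sampled := 0
	for i := 0; i < 100000; i++ {
		if shouldSample(rng, 0.01) {
			sampled++
		}
	}
	fmt.Printf("sampled %d of 100000 txns (~1%%)\n", sampled)
}
```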
GitHub is down (https://www.githubstatus.com), and my latest fixups aren't showing up in this PR.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
I probably caused more confusion than necessary. I wanted to keep the comments introduced in that commit; everything else was mechanical. So if I'm reading everything correctly, I don't think there's anything to do with this review thread?
Yeah, I think the only thing that might be worth doing is explicitly mentioning the desire to keep the comments in the revert commit message.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
OK. So we should re-introduce finish span in another PR.
Yeah, I think the only thing that might be worth doing is explicitly mentioning the desire to keep the comments in the revert commit message.
Alfonso, which "finish span" are you thinking about? The finishSpan boolean was introduced in #61532 only because ih.sp could be of two flavours - the txn span (which we don't want to finish) and the sampling span (which we explicitly created in Setup and want to finish).
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Yeah, I think the only thing that might be worth doing is explicitly mentioning the desire to keep the comments in the revert commit message.
Alfonso, which "finish span" are you thinking about? The finishSpan boolean was introduced in #61532 only because ih.sp could be of two flavours - the txn span (which we don't want to finish) and the sampling span (which we explicitly created in Setup and want to finish).
Right, and I think we still have a case where we want to collect stats from a higher level span without finishing it. Concretely, when tracing is enabled for the statement or txn threshold cases. So in the if statement below, we want to set ih.sp, return needsFinish=true but set finishSpan=false. Does that make sense? I feel like all this state can probably be simplified.
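A hedged sketch of the split being suggested, with illustrative names rather than the actual instrumentationHelper fields: record from a higher-level span without owning its lifetime, and only finish spans the helper created itself.

```go
package main

import "fmt"

// span is a stand-in for a tracing span.
type span struct{ name string }

func (s *span) Finish() { fmt.Println("finished", s.name) }

// setup sketches the suggestion: needsFinish means "there is end-of-statement
// work to do" (collect the recording / stats), while finishSpan means the
// helper is also responsible for finishing the span it recorded from.
func setup(higherLevelSpan *span, sampled bool) (sp *span, needsFinish, finishSpan bool) {
	if higherLevelSpan != nil {
		// Statement/txn-threshold tracing: reuse the existing span for stats,
		// but let whoever created it finish it.
		return higherLevelSpan, true, false
	}
	if sampled {
		// Sampled statement: the helper creates its own span and must finish it.
		return &span{name: "sampled-stmt"}, true, true
	}
	return nil, false, false
}

func main() {
	sp, needsFinish, finishSpan := setup(&span{name: "txn-threshold"}, false)
	fmt.Println(sp.name, needsFinish, finishSpan) // txn-threshold true false
	if needsFinish && finishSpan {
		sp.Finish() // not reached here: the higher-level span outlives the helper
	}
}
```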
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Right, and I think we still have a case where we want to collect stats from a higher level span without finishing it. Concretely, when tracing is enabled for the statement or txn threshold cases. So in the if statement below, we want to set ih.sp, return needsFinish=true but set finishSpan=false. Does that make sense? I feel like all this state can probably be simplified.
I see, yeah, it makes sense, but that would be a new behavior, so I think it is orthogonal to the revert commit and the main commit of this PR.
8d4f5dd to 4e48293 (Compare)
Looks like the txnState's Ctx lifetime is such that it's cleaned up before SessionTracing is. Specifically, we finishSQLTxn before we StopTracing:

cockroach/pkg/sql/conn_executor.go, line 832 in 1b6c7f9:

```go
ex.state.finishSQLTxn()
```

cockroach/pkg/sql/conn_executor.go, lines 882 to 884 in 1b6c7f9:

```go
if err := ex.sessionTracing.StopTracing(); err != nil {
	log.Warningf(ctx, "error stopping tracing: %s", err)
}
```
Does it make sense to use a ctxHolder here instead of overwriting the ctx directly?
The above makes this unwieldy. Yes, ideally we'd like to "unhijack" the context we just injected. But given it's cleaned up before we get to it, we won't have anything to clean up. This isn't true for the other usage of ctxHolder, where its lifetime is tied only to SessionTracer. For that reason, I'll keep things as they were before and hijack Ctx directly instead of doing it through a ctxHolder. In the future, we can probably clean up the SessionTracer's lifecycle before we finishSQLTxn. When we do so, we can start using a ctxHolder instead. I'm not doing that in this PR to keep the blast radius as small as possible.
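To make "hijack Ctx directly" concrete, here's a simplified self-contained sketch of the swap being described (the span/ctx plumbing below is a stand-in; the real code goes through tracing.WithForceRealSpan() and sp.SetVerbose(true), as quoted earlier in the review):

```go
package main

import (
	"context"
	"fmt"
)

type span struct {
	real    bool
	verbose bool
}

type spanKey struct{}

func withSpan(ctx context.Context, sp *span) context.Context {
	return context.WithValue(ctx, spanKey{}, sp)
}

func spanFrom(ctx context.Context) *span {
	sp, _ := ctx.Value(spanKey{}).(*span)
	return sp
}

func main() {
	// By default the txn's Ctx carries a no-op span.
	txnCtx := withSpan(context.Background(), &span{})

	// When session tracing is enabled, create a real verbose span and
	// overwrite the txn's Ctx with one carrying it.
	txnCtx = withSpan(txnCtx, &span{real: true, verbose: true})

	fmt.Printf("%+v\n", *spanFrom(txnCtx)) // {real:true verbose:true}
}
```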
TFTR!
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
This drastically reduces the memory overhead for tracing we're observing in cockroachdb#59424. This commit does a few disparate things to make it happen:
1. We now access the tracing span through txnState.Ctx exclusively. This gives us a single point to hijack, which we'll later do. By default txns are initialized with a no-op span. If later on session tracing is enabled, we'll create a real (verbose) span and swap it out with the txn's no-op one. This gives us the same semantics as earlier, and on the plus side, we're not re-using the same tracing span when session tracing is toggled.
2. Harden tracing methods to work with no-op spans. Specifically GetRecording and TraceID.
3. Remove a crash vector through crdb_internal.trace_id. It was previously reaching into the first recording to retrieve a trace ID. But it's not guaranteed that recordings are non-empty. This could be used to induce panics in the server.
This PR will need to get backported to 21.1. Fixes cockroachdb#59424.
Release note: None
4e48293 to d7407d5 (Compare)
GitHub statuses are being goofy, and bors is confused. bors r+
Build succeeded:
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently, what cockroachdb#59379 ended up measuring (and finding the overhead acceptable for) was the performance hit we would incur for propagating stats payloads for statements already using real tracing spans.

Since then, the landscape has changed. Notably we introduced cockroachdb#61777, which made it so that we were only using real tracing spans for sampled statements. This was done after performance analysis in cockroachdb#59424 showed that the use of real tracing spans in all statements resulted in tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only considering the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture for what the default rate should be.

---

To find out what the overhead for sampled statements currently is, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so it should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between each run is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling entirely (and tracing), and measure the throughput degradation as we vary the sampling rate.

                    |      ops/sec     |      ops/sec
 rate / op    / grp |  median    diff  |   mean     diff
--------------------|------------------|------------------
 0.00 / read  / #1  | 69817.90         | 69406.37
 0.01 / read  / #1  | 69300.35  -0.74% | 68717.23  -0.99%
 0.10 / read  / #1  | 67743.35  -2.97% | 67601.81  -2.60%
 0.00 / write / #1  |  3672.55         |  3653.63
 0.01 / write / #1  |  3647.65  -0.68% |  3615.90  -1.03%
 0.10 / write / #1  |  3567.20  -2.87% |  3558.90  -2.59%

                    |      ops/sec     |      ops/sec
 rate / op    / grp |  median    diff  |   mean     diff
--------------------|------------------|------------------
 0.00 / read  / #2  | 69440.80         | 68893.24
 0.01 / read  / #2  | 69481.55  +0.06% | 69463.13  +0.82%  (probably in the noise margin)
 0.10 / read  / #2  | 67841.80  -2.30% | 66992.55  -2.76%
 0.00 / write / #2  |  3652.45         |  3625.24
 0.01 / write / #2  |  3657.55  -0.14% |  3654.34  +0.80%
 0.10 / write / #2  |  3570.75  -2.24% |  3526.04  -2.74%

The results above suggest that the current default rate of 10% is too high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in cockroachdb#62227), but they're much too invasive to backport to 21.1.

It's unfortunate that we only discovered the overhead this late in the development cycle. It was due to two major reasons:
- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by propagating real tracing spans across rpc boundaries). We had done sanity checking for the tracing overhead before this point, but failed to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had been persistently failing for a myriad of other reasons, so we didn't learn until too late that tracing was the latest offender. tpccbench also doesn't deal with VM overload well (something cockroachdb#62361 hopes to address), and after tracing was enabled for realsies, this was the dominant failure mode. This resulted in perf data not making its way to roachperf, which further hid possible indicators that we had a major regression on our hands. We also didn't have a healthy process looking at roachperf on a continual basis, something we're looking to rectify going forward. We would've picked up on this regression had we been closely monitoring the kv95 charts.

Release note: None
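As a sanity check on how the diff columns above are derived, e.g. the -0.74% median read diff at rate 0.01 in group #1:

```go
package main

import "fmt"

func main() {
	baseline := 69817.90 // median read ops/sec at rate 0.00, group #1
	sampled := 69300.35  // median read ops/sec at rate 0.01, group #1
	fmt.Printf("%.2f%%\n", (sampled-baseline)/baseline*100) // -0.74%
}
```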
62998: sql: lower default sampling rate to 1% r=irfansharif a=irfansharif

Release note: None

Co-authored-by: irfan sharif <[email protected]>
This drastically reduces the memory overhead for tracing we're observing in #59424. This commit does a few disparate things to make it happen:
1. We now access the tracing span through txnState.Ctx exclusively. This gives us a single point to hijack, which we'll later do. By default txns are initialized with a no-op span. If later on session tracing is enabled, we'll create a real (verbose) span and swap it out with the txn's no-op one. This gives us the same semantics as earlier, and on the plus side, we're not re-using the same tracing span when session tracing is toggled.
2. Harden tracing methods to work with no-op spans. Specifically GetRecording and TraceID.
3. Remove a crash vector through crdb_internal.trace_id. It was previously reaching into the first recording to retrieve a trace ID. But it's not guaranteed that recordings are non-empty. This could be used to induce panics in the server.
This PR will need to get backported to 21.1. Fixes #59424.
Release note: None
+cc @cockroachdb/kv-east.
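As an aside on item 2 of the description ("harden tracing methods to work with no-op spans"): the idea is that recording/trace-ID accessors tolerate a span that never allocated a real recording, rather than panicking or requiring callers to special-case it. A self-contained sketch of that shape (the types are stand-ins, not the actual tracing package):

```go
package main

import "fmt"

type recordedSpan struct{ traceID uint64 }

// noopSpan models a span that was never promoted to a real one: its
// accessors return zero values instead of requiring callers to special-case it.
type noopSpan struct{}

func (noopSpan) GetRecording() []recordedSpan { return nil }
func (noopSpan) TraceID() uint64              { return 0 }

func main() {
	var sp noopSpan
	fmt.Println(len(sp.GetRecording()), sp.TraceID()) // 0 0
}
```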