Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing: wrongly emitting trace events for session's first txn #59203

Closed
irfansharif opened this issue Jan 20, 2021 · 9 comments · Fixed by #64766
Closed

tracing: wrongly emitting trace events for session's first txn #59203

irfansharif opened this issue Jan 20, 2021 · 9 comments · Fixed by #64766
Assignees
Labels
A-tracing Relating to tracing in CockroachDB. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@irfansharif
Copy link
Contributor

See #59193. We bisected a failure in TestExecBuild/local/autocommit_nonmetamorphic to #58897, which simply enabled always-on tracing. We're now doubly emitting trace data when we shouldn't be.

@irfansharif irfansharif added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-tracing Relating to tracing in CockroachDB. labels Jan 20, 2021
@irfansharif irfansharif changed the title tracing: doubly emitting trace events with always-on tracing. tracing: doubly emitting trace events with always-on tracing Jan 20, 2021
irfansharif pushed a commit that referenced this issue Jan 26, 2021
We are still seeing memory issues on tpccbench/nodes=6/cpu=16/multi-az
which need to be investigated. Turn off background tracing while we do.

Touches #58298. We're also reverting an earlier commit as part of this
one (d252400). This revert is needed given
we've not yet addressed an underlying bug (#59203).

Release note: None
@irfansharif
Copy link
Contributor Author

irfansharif commented Jan 26, 2021

I don't think we're double emitting trace events. I think it's more that we're not resetting trace events in a span correctly. We used to expect the following data driven output for the following query:

statement ok
SET TRACING=ON;
  WITH cte AS (INSERT INTO ab VALUES (3, 3), (4, 4) RETURNING a, b)
  INSERT INTO ab (SELECT a*10, b*10 FROM cte);
SET TRACING=OFF

query TT
SELECT operation, message FROM [SHOW KV TRACE FOR SESSION]
WHERE message       LIKE '%r$rangeid: sending batch%'
  AND message   NOT LIKE '%PushTxn%'
  AND message   NOT LIKE '%QueryTxn%'
  AND operation NOT LIKE '%async%'
----
dist sender send  r35: sending batch 2 CPut to (n1,s1):1
dist sender send  r35: sending batch 2 CPut to (n1,s1):1
dist sender send  r35: sending batch 1 EndTxn to (n1,s1):1

Which fails with:

testdata/autocommit_nonmetamorphic:797: SELECT operation, message FROM [SHOW KV TRACE FOR SESSION]
WHERE message       LIKE '%r35: sending batch%'
  AND message   NOT LIKE '%PushTxn%'
  AND message   NOT LIKE '%QueryTxn%'
  AND operation NOT LIKE '%async%'
expected:
    dist sender send  r35: sending batch 2 CPut to (n1,s1):1
    dist sender send  r35: sending batch 2 CPut to (n1,s1):1
    dist sender send  r35: sending batch 1 EndTxn to (n1,s1):1
but found (query options: "") :
    dist sender send  r35: sending batch 1 DelRng to (n1,s1):1
    dist sender send  r35: sending batch 2 CPut to (n1,s1):1
    dist sender send  r35: sending batch 2 CPut to (n1,s1):1
    dist sender send  r35: sending batch 1 EndTxn to (n1,s1):1

Where's the DelRng coming from? Probably some previous trace. I double checked to see that running that query+trace capture in isolation (without any of the preceding trace captures) doesn't run into the same error. So we're definitely not resetting the span correctly, somewhere.

@irfansharif
Copy link
Contributor Author

irfansharif commented Jan 26, 2021

I think it's got to do with how we expect traces to behave with ROLLBACK.

# LogicTest: local !metamorphic

statement ok
CREATE TABLE ab (a INT PRIMARY KEY, b INT, FAMILY f1 (a, b))

# Get the range id.
let $rangeid
SELECT range_id FROM [ SHOW RANGES FROM TABLE ab ]

# Populate table descriptor cache.
query II
SELECT * FROM ab
----

# No auto-commit inside a transaction.
statement ok
BEGIN

statement ok
SET TRACING=ON;
  INSERT INTO ab VALUES (4, 4), (5, 5);
SET TRACING=OFF

query TT
SELECT operation, message FROM [SHOW KV TRACE FOR SESSION]
WHERE message     LIKE '%r$rangeid: sending batch%'
  AND message NOT LIKE '%PushTxn%'
  AND message NOT LIKE '%QueryTxn%'
----
dist sender send  r35: sending batch 2 CPut to (n1,s1):1

statement ok
ROLLBACK

statement ok
SET TRACING=ON;
  INSERT INTO ab VALUES (6, 6), (7, 7) RETURNING a, b;
SET TRACING=OFF

query TT
SELECT operation, message FROM [SHOW KV TRACE FOR SESSION]
WHERE message     LIKE '%r$rangeid: sending batch%'
  AND message NOT LIKE '%PushTxn%'
  AND message NOT LIKE '%QueryTxn%'
----
dist sender send  r35: sending batch 2 CPut, 1 EndTxn to (n1,s1):1

This is what I expect to behave like, but it doesn't, the "dist sender send r35: sending batch 2 CPut to (n1,s1):1" recording from the rolled back statement still shows up in future traces, despite "resetting" the trace using SET TRACING = on

 --- FAIL: TestExecBuild/local (0.00s)
        --- FAIL: TestExecBuild/local/tracing (0.23s)
            logic.go:2306: let $rangeid = 35
            logic.go:2265:

                testdata/tracing:49: SELECT operation, message FROM [SHOW KV TRACE FOR SESSION]
                WHERE message     LIKE '%r35: sending batch%'
                  AND message NOT LIKE '%PushTxn%'
                  AND message NOT LIKE '%QueryTxn%'
                expected:
                    dist sender send  r35: sending batch 2 CPut, 1 EndTxn to (n1,s1):1
                but found (query options: "") :
                    dist sender send  r35: sending batch 2 CPut to (n1,s1):1
                    dist sender send  r35: sending batch 2 CPut, 1 EndTxn to (n1,s1):1

@irfansharif
Copy link
Contributor Author

irfansharif commented Jan 27, 2021

I see that we're grabbing onto the first txn's span in the session here:

cockroach/pkg/sql/exec_util.go

Lines 1483 to 1485 in 854246a

// firstTxnSpan is the span of the first txn that was active when session
// tracing was enabled.
firstTxnSpan *tracing.Span

And since we're not really ever "consuming" span data from the aborted txn, it'll show up in future traces. We've had this behaviour for a while, and not sure if it's a sensible default. Handy repro #59458.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 27, 2021
...for sessions with aborted txns.

Attempt to repro cockroachdb#59203

Release note: None
asubiotto pushed a commit that referenced this issue Jan 27, 2021
We are still seeing memory issues on tpccbench/nodes=6/cpu=16/multi-az
which need to be investigated. Turn off background tracing while we do.

Touches #58298. We're also reverting an earlier commit as part of this
one (d252400). This revert is needed given
we've not yet addressed an underlying bug (#59203).

Release note: None
@irfansharif
Copy link
Contributor Author

yo @asubiotto, I think this is in your ballpark. I don't know if we'll have time to do anything about it, or if it's worthwhile for 21.1, but I do think it's a "bug".

@asubiotto
Copy link
Contributor

Hmm, thanks for raising this. We'll hopefully get to trace-related fixes during the stability period.

@irfansharif irfansharif changed the title tracing: doubly emitting trace events with always-on tracing tracing: wrongly emitting trace events with always-on tracing Jan 28, 2021
@irfansharif irfansharif changed the title tracing: wrongly emitting trace events with always-on tracing tracing: wrongly emitting trace events for session's first txn Jan 28, 2021
@irfansharif
Copy link
Contributor Author

Are we still planning on addressing this?

@asubiotto
Copy link
Contributor

It's a nice to have for 21.1 so not certain we'll definitely have it done for 21.1. We'll take a look at addressing it once we've completed higher-priority items.

@irfansharif
Copy link
Contributor Author

irfansharif commented Mar 11, 2021

I'm having to get into the weeds of this SessionTracer guy to work through the implications of #61777. Specifically, if we're not creating real spans unless absolutely needed, we can't rely on the assumptions of this code:

cockroach/pkg/sql/exec_util.go

Lines 1645 to 1656 in a86cf54

// If we're inside a transaction, start recording on the txn span.
if _, ok := st.ex.machine.CurState().(stateNoTxn); !ok {
sp := tracing.SpanFromContext(st.ex.state.Ctx)
if sp == nil {
return errors.Errorf("no txn span for SessionTracing")
}
// We want to clear out any existing recordings so they don't show up in
// future traces.
sp.ResetRecording()
sp.SetVerbose(true)
st.firstTxnSpan = sp
}

I think we'll want to hijack the txn's span + context in the same manner we do for the conn below. Does that sound right to you? +cc @cockroachdb/kv-east / @tbg.

@asubiotto
Copy link
Contributor

asubiotto commented Mar 11, 2021

That sounds reasonable to me, but this code is new to me too. (cc @andreimatei in case you want to chime in)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tracing Relating to tracing in CockroachDB. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants