Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

execbuilder: unexpectedly changed KV usage due to tracing commit #60672

Closed
tbg opened this issue Feb 17, 2021 · 5 comments · Fixed by #64766
Closed

execbuilder: unexpectedly changed KV usage due to tracing commit #60672

tbg opened this issue Feb 17, 2021 · 5 comments · Fixed by #64766
Assignees
Labels
A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@tbg
Copy link
Member

tbg commented Feb 17, 2021

The test case here:

query TT
SELECT operation, message FROM [SHOW KV TRACE FOR SESSION]
WHERE message LIKE '%r$rangeid: sending batch%'
AND message NOT LIKE '%PushTxn%'
AND message NOT LIKE '%QueryTxn%'
----
dist sender send r36: sending batch 2 CPut, 1 EndTxn to (n1,s1):1

changed as a result of a commit that did not touch anything but tracing, namely this one (SHA is still unstable at time of writing) from #59992:

tracing: propagate non-recording spans across rpc boundaries

I adapted the test case to match the new expectation:

-- FAIL: TestExecBuild (0.17s)
    test_log_scope.go:73: test logs captured to: /tmp/logTestExecBuild382866614
    test_log_scope.go:74: use -show-logs to present logs inline
    --- FAIL: TestExecBuild/local (0.00s)
        --- FAIL: TestExecBuild/local/autocommit_nonmetamorphic (0.16s)
            logic.go:2324: let $rangeid = 36
            logic.go:2283: 
                 
                testdata/autocommit_nonmetamorphic:107: SELECT operation, message FROM [SHOW KV TRACE FOR SESSION]
                WHERE message     LIKE '%r36: sending batch%'
                  AND message NOT LIKE '%PushTxn%'
                  AND message NOT LIKE '%QueryTxn%'
                expected:
                    dist sender send  r36: sending batch 2 CPut, 1 EndTxn to (n1,s1):1
                    
                but found (query options: "") :
                    dist sender send  r36: sending batch 2 CPut to (n1,s1):1
                    dist sender send  r36: sending batch 2 CPut, 1 EndTxn to (n1,s1):1
            logic.go:3109: 
                testdata/autocommit_nonmetamorphic:115: error while processing
            logic.go:3109: testdata/autocommit_nonmetamorphic:115: too many errors encountered, skipping the rest of the input
    logic.go:3334: -- test log scope end --

based on the assumption that the extra request was always there, but did not get traced.

Since this is the 1PC optimization, it's worth double checking what's going on here.

@tbg tbg added A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Feb 17, 2021
@tbg tbg assigned RaduBerinde and asubiotto and unassigned RaduBerinde Feb 17, 2021
@tbg
Copy link
Member Author

tbg commented Feb 17, 2021

Actually upon rewriting the file I get more random diffs:

diff --git a/pkg/sql/opt/exec/execbuilder/testdata/autocommit_nonmetamorphic b/pkg/sql/opt/exec/execbuilder/testdata/autocommit_nonmetamorphic
index 7191149f25..0a016b3606 100644
--- a/pkg/sql/opt/exec/execbuilder/testdata/autocommit_nonmetamorphic
+++ b/pkg/sql/opt/exec/execbuilder/testdata/autocommit_nonmetamorphic
@@ -110,6 +110,7 @@ WHERE message     LIKE '%r$rangeid: sending batch%'
   AND message NOT LIKE '%PushTxn%'
   AND message NOT LIKE '%QueryTxn%'
 ----
+dist sender send  r36: sending batch 2 CPut to (n1,s1):1
 dist sender send  r36: sending batch 2 CPut, 1 EndTxn to (n1,s1):1
 
 # TODO(radu): allow non-side-effecting projections.
@@ -133,6 +134,7 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
   AND operation NOT LIKE '%async%'
 ----
 dist sender send  r36: sending batch 2 CPut to (n1,s1):1
+dist sender send  r36: sending batch 2 CPut to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
 
 # Insert with RETURNING statement with side-effects should not auto-commit.
@@ -157,6 +159,7 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
   AND operation NOT LIKE '%async%'
 ----
 dist sender send  r36: sending batch 2 CPut to (n1,s1):1
+dist sender send  r36: sending batch 2 CPut to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
 
 # Another way to test the scenario above: generate an error and ensure that the
@@ -192,6 +195,7 @@ WHERE message     LIKE '%r$rangeid: sending batch%'
   AND message NOT LIKE '%PushTxn%'
   AND message NOT LIKE '%QueryTxn%'
 ----
+dist sender send  r36: sending batch 2 CPut to (n1,s1):1
 dist sender send  r36: sending batch 1 Put, 1 EndTxn to (n1,s1):1
 
 # Multi-row upsert should auto-commit.
@@ -213,6 +217,7 @@ WHERE message     LIKE '%r$rangeid: sending batch%'
   AND message NOT LIKE '%PushTxn%'
   AND message NOT LIKE '%QueryTxn%'
 ----
+dist sender send  r36: sending batch 2 CPut to (n1,s1):1
 dist sender send  r36: sending batch 2 Put, 1 EndTxn to (n1,s1):1
 
 # No auto-commit inside a transaction.
@@ -261,6 +266,7 @@ WHERE message     LIKE '%r$rangeid: sending batch%'
   AND message NOT LIKE '%PushTxn%'
   AND message NOT LIKE '%QueryTxn%'
 ----
+dist sender send  r36: sending batch 2 Put to (n1,s1):1
 dist sender send  r36: sending batch 2 Put, 1 EndTxn to (n1,s1):1
 
 # TODO(radu): allow non-side-effecting projections.
@@ -284,6 +290,7 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
   AND operation NOT LIKE '%async%'
 ----
 dist sender send  r36: sending batch 2 Put to (n1,s1):1
+dist sender send  r36: sending batch 2 Put to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
 
 # Upsert with RETURNING statement with side-effects should not auto-commit.
@@ -308,6 +315,7 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
   AND operation NOT LIKE '%async%'
 ----
 dist sender send  r36: sending batch 2 Put to (n1,s1):1
+dist sender send  r36: sending batch 2 Put to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
 
 # Another way to test the scenario above: generate an error and ensure that the
@@ -343,6 +351,7 @@ WHERE message     LIKE '%r$rangeid: sending batch%'
   AND message NOT LIKE '%PushTxn%'
   AND message NOT LIKE '%QueryTxn%'
 ----
+dist sender send  r36: sending batch 2 Put to (n1,s1):1
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 2 Put, 1 EndTxn to (n1,s1):1
 
@@ -394,6 +403,8 @@ WHERE message     LIKE '%r$rangeid: sending batch%'
   AND message NOT LIKE '%QueryTxn%'
 ----
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
+dist sender send  r36: sending batch 2 Put to (n1,s1):1
+dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 2 Put, 1 EndTxn to (n1,s1):1
 
 # TODO(radu): allow non-side-effecting projections.
@@ -418,6 +429,8 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
 ----
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 2 Put to (n1,s1):1
+dist sender send  r36: sending batch 1 Scan to (n1,s1):1
+dist sender send  r36: sending batch 2 Put to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
 
 # Update with RETURNING statement with side-effects should not auto-commit.
@@ -443,6 +456,8 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
 ----
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 2 Put to (n1,s1):1
+dist sender send  r36: sending batch 1 Scan to (n1,s1):1
+dist sender send  r36: sending batch 2 Put to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
 
 # Another way to test the scenario above: generate an error and ensure that the
@@ -478,6 +493,8 @@ WHERE message     LIKE '%r$rangeid: sending batch%'
   AND message NOT LIKE '%PushTxn%'
   AND message NOT LIKE '%QueryTxn%'
 ----
+dist sender send  r36: sending batch 1 Scan to (n1,s1):1
+dist sender send  r36: sending batch 2 Put to (n1,s1):1
 dist sender send  r36: sending batch 1 DelRng, 1 EndTxn to (n1,s1):1
 
 # Multi-row delete should auto-commit.
@@ -499,6 +516,8 @@ WHERE message     LIKE '%r$rangeid: sending batch%'
   AND message NOT LIKE '%PushTxn%'
   AND message NOT LIKE '%QueryTxn%'
 ----
+dist sender send  r36: sending batch 1 Scan to (n1,s1):1
+dist sender send  r36: sending batch 2 Put to (n1,s1):1
 dist sender send  r36: sending batch 1 DelRng, 1 EndTxn to (n1,s1):1
 
 # No auto-commit inside a transaction.
@@ -547,6 +566,7 @@ WHERE message     LIKE '%r$rangeid: sending batch%'
   AND message NOT LIKE '%PushTxn%'
   AND message NOT LIKE '%QueryTxn%'
 ----
+dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 2 Del, 1 EndTxn to (n1,s1):1
 
@@ -570,6 +590,7 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
   AND message   NOT LIKE '%QueryTxn%'
   AND operation NOT LIKE '%async%'
 ----
+dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 2 Del to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
@@ -595,6 +616,7 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
   AND message   NOT LIKE '%QueryTxn%'
   AND operation NOT LIKE '%async%'
 ----
+dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 2 Del to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
@@ -644,6 +666,7 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
   AND message   NOT LIKE '%QueryTxn%'
   AND operation NOT LIKE '%async%'
 ----
+dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 2 CPut to (n1,s1):1
 dist sender send  r36: sending batch 2 Scan to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
@@ -667,6 +690,7 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
   AND message   NOT LIKE '%QueryTxn%'
   AND operation NOT LIKE '%async%'
 ----
+dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 1 Put to (n1,s1):1
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
@@ -692,6 +716,7 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
   AND message   NOT LIKE '%QueryTxn%'
   AND operation NOT LIKE '%async%'
 ----
+dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 1 Del to (n1,s1):1
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
@@ -720,6 +745,7 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
   AND operation NOT LIKE '%async%'
 ----
 dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
+dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 1 Del, 1 EndTxn to (n1,s1):1
 
@@ -749,6 +775,7 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
   AND message   NOT LIKE '%QueryTxn%'
   AND operation NOT LIKE '%async%'
 ----
+dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 2 CPut to (n1,s1):1
 dist sender send  r36: sending batch 2 CPut to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
@@ -774,6 +801,7 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
   AND message   NOT LIKE '%QueryTxn%'
   AND operation NOT LIKE '%async%'
 ----
+dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 2 CPut to (n1,s1):1
 dist sender send  r36: sending batch 2 CPut to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1

tbg added a commit to tbg/cockroach that referenced this issue Feb 17, 2021
We were previously propagating RPCs across boundaries only when they
were verbose. We now propagate any span (except the noop span)
regardless of current verbosity.
This ensures that SQL (which always creates real, non-verbose spans by
default) can trace the entirety of its operations.

This enables the background collection of contention metadata in 21.1
and more generally, the idea of always-on tracing. Note that we had
previously considered always-on tracing to mean avoiding the noop span
altogether; due to memory retention issues[1][2] encountered we are
taking a step back from that idea for the 21.1 release.

Somehow creating additional trace spans seems to have fouled up some
tracing-based tests. I can't boil that ocean here, so I filed a
follow-up issue: cockroachdb#60672

[1]: cockroachdb#59370
[2]: cockroachdb#59431

Release note: None
@yuzefovich
Copy link
Member

This seems like a dup of #59203.

tbg added a commit to tbg/cockroach that referenced this issue Feb 17, 2021
We were previously propagating RPCs across boundaries only when they
were verbose. We now propagate any span (except the noop span)
regardless of current verbosity.
This ensures that SQL (which always creates real, non-verbose spans by
default) can trace the entirety of its operations.

This enables the background collection of contention metadata in 21.1
and more generally, the idea of always-on tracing. Note that we had
previously considered always-on tracing to mean avoiding the noop span
altogether; due to memory retention issues[1][2] encountered we are
taking a step back from that idea for the 21.1 release.

Somehow creating additional trace spans seems to have fouled up some
tracing-based tests. I can't boil that ocean here, so I filed a
follow-up issue: cockroachdb#60672

[1]: cockroachdb#59370
[2]: cockroachdb#59431

Release note: None
@tbg
Copy link
Member Author

tbg commented Feb 18, 2021

I got another diff as a result of #60725:

commit d0fbf6fd9d88cadab57da0a50ca0096c27ef33c4
Author: Tobias Grieger <[email protected]>
Date:   Thu Feb 18 12:04:00 2021 +0100

    tracing: harden against use-after-Finish
    
    `net/trace`'s `Trace` is notoriously panicky when used after a call to
    `.Finish()`. We don't like such calls and try to avoid them, but they
    are hard to conclusively disprove. Besides, the opposite - never calling
    `.Finish()` is even worse, so in general we'd like to err on the side of
    use-after-free and want that to be safe.
    
    This commit does that: we move `Span` to `spanInner` and `*Span` becomes
    a simple wrapper that turns most operations into no-ops once the first
    call to `.Finish` has been initiated.
    
    While I was there, I also removed `IsBlackHole()` which wasn't really
    useful any more.
    
    This fixes the following crash:
    
    https://github.com/cockroachdb/cockroach/issues/58489#issuecomment-781263005
    
    which was surfaced by a recent switch in that test to enable tracing
    throughout (precisely to find these crashes!).
    
    This commit added additional diffs in `TestExecBuild`, presumably
    because some logging occurs to a finished Span, and that logging is now
    dropped. We already know there are issues there, and I mentioned this
    change in tracking issue #60672 as well.
    
    Release note (bug fix): fixed a bug that could result in crashes during
    tracing when using the `trace.debug.enable` cluster setting.

diff --git a/pkg/sql/opt/exec/execbuilder/testdata/autocommit_nonmetamorphic b/pkg/sql/opt/exec/execbuilder/testdata/autocommit_nonmetamorphic
index 0a016b3606..33570a162d 100644
--- a/pkg/sql/opt/exec/execbuilder/testdata/autocommit_nonmetamorphic
+++ b/pkg/sql/opt/exec/execbuilder/testdata/autocommit_nonmetamorphic
@@ -668,7 +668,6 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
 ----
 dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 2 CPut to (n1,s1):1
-dist sender send  r36: sending batch 2 Scan to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
 
 query B
@@ -693,7 +692,6 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
 dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 1 Put to (n1,s1):1
-dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
 
 query B
@@ -719,7 +717,6 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
 dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 1 Del to (n1,s1):1
-dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 1 EndTxn to (n1,s1):1
 
 # Test with a single cascade, which should use autocommit.
@@ -746,7 +743,6 @@ WHERE message       LIKE '%r$rangeid: sending batch%'
 ----
 dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
 dist sender send  r36: sending batch 1 DelRng to (n1,s1):1
-dist sender send  r36: sending batch 1 Scan to (n1,s1):1
 dist sender send  r36: sending batch 1 Del, 1 EndTxn to (n1,s1):1
 
 # -----------------------
diff --git a/pkg/sql/opt/exec/execbuilder/testdata/partial_index_nonmetamorphic b/pkg/sql/opt/exec/execbuilder/testdata/partial_index_nonmetamorphic
index e45575fce1..f4e618e0e8 100644
--- a/pkg/sql/opt/exec/execbuilder/testdata/partial_index_nonmetamorphic
+++ b/pkg/sql/opt/exec/execbuilder/testdata/partial_index_nonmetamorphic
@@ -905,7 +905,6 @@ query T kvtrace
 DELETE FROM t57085_p WHERE p = 1;
 ----
 DelRange /Table/56/1/1 - /Table/56/1/1/#
-Scan /Table/57/{1-2}
 Del /Table/57/2/1/10/0
 Del /Table/57/1/10/0
 Del /Table/57/1/20/0

tbg added a commit to tbg/cockroach that referenced this issue Feb 18, 2021
`net/trace`'s `Trace` is notoriously panicky when used after a call to
`.Finish()`. We don't like such calls and try to avoid them, but they
are hard to conclusively disprove. Besides, the opposite - never calling
`.Finish()` is even worse, so in general we'd like to err on the side of
use-after-free and want that to be safe.

This commit does that: we move `Span` to `spanInner` and `*Span` becomes
a simple wrapper that turns most operations into no-ops once the first
call to `.Finish` has been initiated.

While I was there, I also removed `IsBlackHole()` which wasn't really
useful any more.

This fixes the following crash:

cockroachdb#58489 (comment)

which was surfaced by a recent switch in that test to enable tracing
throughout (precisely to find these crashes!).

This commit added additional diffs in `TestExecBuild`, presumably
because some logging occurs to a finished Span, and that logging is now
dropped. We already know there are issues there, and I mentioned this
change in tracking issue cockroachdb#60672 as well.

Release note (bug fix): fixed a bug that could result in crashes during
tracing when using the `trace.debug.enable` cluster setting.
@tbg
Copy link
Member Author

tbg commented Feb 18, 2021

Changing (tracing.Span).Finish() to fail if s.done() should be an easy way to find the use-after-Finish that's probably behind these diffs.

@yuzefovich yuzefovich self-assigned this Apr 20, 2021
@yuzefovich
Copy link
Member

#61321 fixed the diff caused by #60725, but the original duplication of some stuff is still present. I'm looking into that.

@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants