Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

teamcity: show_trace logic test fails due to unexpected txn push #33720

Closed
cockroach-teamcity opened this issue Jan 14, 2019 · 7 comments
Closed
Assignees
Labels
A-sql-executor SQL txn logic A-testing Testing tools and infrastructure C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

The following tests appear to have failed on master (test): TestLogic/local: TestLogic/local/show_trace, TestLogic/local, TestLogic

You may want to check for open issues.

#1092793:

TestLogic/local: TestLogic/local/show_trace
--- FAIL: test/TestLogic: TestLogic/local: TestLogic/local/show_trace (1.830s)
logic.go:2301: 
	 
	testdata/logic_test/show_trace:280: SELECT operation, regexp_replace(message, '(\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d\.)?\d\d\d\d\d+', '...PK...') as message
	  FROM [SHOW KV TRACE FOR SESSION]
	WHERE message NOT LIKE '%Z/%'
	  AND tag NOT LIKE '%intExec=%'
	  AND tag NOT LIKE '%scExec%'
	  AND tag NOT LIKE '%IndexBackfiller%'
	expected:
	    dist sender send  querying next range at /Table/2/1/53/"kv2"/3/1
	    dist sender send  r6: sending batch 1 Get to (n1,s1):1
	    dist sender send  querying next range at /Table/3/1/55/2/1
	    dist sender send  r6: sending batch 1 Get to (n1,s1):1
	    dist sender send  r7: sending batch 1 EndTxn, 1 QueryIntent to (n1,s1):1
	    sql txn           Scan /Table/55/{1-2}
	    dist sender send  querying next range at /Table/55/1
	    dist sender send  r20: sending batch 1 Scan to (n1,s1):1
	    sql txn           fetched: /kv2/primary/...PK.../k/v -> /1/2
	    sql txn           Put /Table/55/1/...PK.../0 -> /TUPLE/1:1:Int/1/1:2:Int/4
	    sql txn           fetched: /kv2/primary/...PK.../k/v -> /2/3
	    sql txn           Put /Table/55/1/...PK.../0 -> /TUPLE/1:1:Int/2/1:2:Int/5
	    dist sender send  querying next range at /Table/55/1/...PK.../0
	    dist sender send  r20: sending batch 2 Put, 1 EndTxn to (n1,s1):1
	    sql txn           fast path completed
	    sql txn           rows affected: 2
	    
	but found (query options: "") :
	    dist sender send  querying next range at /Table/2/1/53/"kv2"/3/1
	    dist sender send  r6: sending batch 1 Get to (n1,s1):1
	    dist sender send  querying next range at /Table/3/1/55/2/1
	    dist sender send  r6: sending batch 1 Get to (n1,s1):1
	    dist sender send  r7: sending batch 1 EndTxn, 1 QueryIntent to (n1,s1):1
	    sql txn           Scan /Table/55/{1-2}
	    dist sender send  querying next range at /Table/55/1
	    dist sender send  r20: sending batch 1 Scan to (n1,s1):1
	    dist sender send  querying next range at /Table/SystemConfigSpan/Start
	    dist sender send  r6: sending batch 1 PushTxn to (n1,s1):1
	    dist sender send  querying next range at /Table/55/1/...PK.../0
	    dist sender send  r20: sending batch 2 ResolveIntent to (n1,s1):1
	    sql txn           fetched: /kv2/primary/...PK.../k/v -> /1/2
	    sql txn           Put /Table/55/1/...PK.../0 -> /TUPLE/1:1:Int/1/1:2:Int/4
	    sql txn           fetched: /kv2/primary/...PK.../k/v -> /2/3
	    sql txn           Put /Table/55/1/...PK.../0 -> /TUPLE/1:1:Int/2/1:2:Int/5
	    dist sender send  querying next range at /Table/55/1/...PK.../0
	    dist sender send  r20: sending batch 2 Put, 1 EndTxn to (n1,s1):1
	    sql txn           fast path completed
	    sql txn           rows affected: 2
	    
	
logic.go:2333: 
	testdata/logic_test/show_trace:305: error while processing
logic.go:2334: testdata/logic_test/show_trace:305: too many errors encountered, skipping the rest of the input
------- Stdout: -------
=== PAUSE TestLogic/local/show_trace



TestLogic
--- FAIL: test/TestLogic (366.430s)
test_log_scope.go:81: test logs captured to: /tmp/logTestLogic558571843
test_log_scope.go:62: use -show-logs to present logs inline



TestLogic/local
--- FAIL: test/TestLogic: TestLogic/local (0.420s)





Please assign, take a look and update the issue accordingly.

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Jan 14, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 14, 2019
@jordanlewis
Copy link
Member

@nvanbenschoten @andreimatei the problem here is that there's an extra PushTxn. What's the right thing to do?

@andreimatei
Copy link
Contributor

I guess the right thing to do is... start logging to see what that intent is, and then take it from there. As long as all intents are on one range (as I guess they should be for that small table), each commit shouldn't leave intents behind. But maybe the intent that the update is running into is around the schema (which I guess would also be indicated by the preceding querying next range at /Table/SystemConfigSpan/Start line.
Or bisect and see where it started being flaky and blame that loser. I can't say I really want the issue at the moment :)

@tbg
Copy link
Member

tbg commented Jan 15, 2019 via email

@knz
Copy link
Contributor

knz commented Jan 16, 2019

what's the key /Table/SystemConfigSpan/Start? why was there a scan at that key? Is this expected in a txn push scenario?

@knz knz added A-sql-executor SQL txn logic A-testing Testing tools and infrastructure labels Jan 16, 2019
@knz knz changed the title teamcity: failed test: TestLogic teamcity: show_trace logic test fails due to unexpected txn push Jan 16, 2019
@andreimatei
Copy link
Contributor

(I don't think we're supposed to change issue titles for test flakes cause then TC will open new issues. Right?)

SystemConfigSpan/Start is the anchor key schema change transactions. So what I think happened here is that the update with the bad trace ran into an intent left over from the previous create table ... as... txn.
And I now realize that that's a multi-range txn (artificially so, since it needs to be anchored at on the SystemConfigSpan since it's a schema change and those need to gossip the SystemConfigSpan and I think they do it in a really bad way) and so I guess it's not unexpected for it to leave intents around since their cleanup is async (I don't know why Tobi brought in the txn record GC; I don't think that plays a role).
And so I think this failure simply tells us that the async cleanup was slow.
Unclear to me what to do about it. Have I told you I hate these trace tests yet? :)

@tbg
Copy link
Member

tbg commented Jan 17, 2019

That sounds reasonable (when I wrote my comment, I hadn't looked into the failure mode). The test could run the same query once without the trace to clean up any errant intents, assuming it is guaranteed that the schema change txn is done at that point. Of course that's just a band aid, tests of that kind are bound to be flaky.

@andreimatei andreimatei assigned knz and unassigned andreimatei Jan 22, 2019
@andreimatei
Copy link
Contributor

I don't know what to do here other than delete the test.
passing to @knz

craig bot pushed a commit that referenced this issue Mar 8, 2019
35521: sql: reduce non-determinism in the show_trace logic tests r=knz a=knz

Fixes #33720.
First commit from #35519.

The test is really about asserting the KV operations sent on behalf of
SQL statements. The distsender traffic is really irrelevant. Since
that's where most of the test flakes / non-determinism is coming from,
simply remove it.

Release note: None

35548: sqlbase: avoid a race in sqlbase.CancelChecker r=knz a=knz

Fixes  #35539.

Prior to the distsql-ification of planNode execution, all the
execution steps in a local query were interleaved (using
coroutine-style concurrency), so that the calls to
`(*CancelChecker).Check()` were all sequential.

With distsql now it's possible for different planNode `Next()` or
`startExec()` methods to be running in concurrent goroutines on
multiple cores, i.e. truly parallel.

This in turn requires atomic access to the counter in the cancel
checker. Without atomic access, there is a race condition and the
possibility that the cancel checker does not work well (some
increments can be performed two times, which could cause the condition
of the check to occasionally fail).

Found with SQLSmith.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig craig bot closed this as completed in #35521 Mar 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-executor SQL txn logic A-testing Testing tools and infrastructure C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

5 participants