teamcity: show_trace logic test fails due to unexpected txn push #33720

cockroach-teamcity · 2019-01-14T19:45:06Z

The following tests appear to have failed on master (test): TestLogic/local: TestLogic/local/show_trace, TestLogic/local, TestLogic

You may want to check for open issues.

#1092793:

TestLogic/local: TestLogic/local/show_trace
--- FAIL: test/TestLogic: TestLogic/local: TestLogic/local/show_trace (1.830s)
logic.go:2301: 
	 
	testdata/logic_test/show_trace:280: SELECT operation, regexp_replace(message, '(\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d\.)?\d\d\d\d\d+', '...PK...') as message
	  FROM [SHOW KV TRACE FOR SESSION]
	WHERE message NOT LIKE '%Z/%'
	  AND tag NOT LIKE '%intExec=%'
	  AND tag NOT LIKE '%scExec%'
	  AND tag NOT LIKE '%IndexBackfiller%'
	expected:
	    dist sender send  querying next range at /Table/2/1/53/"kv2"/3/1
	    dist sender send  r6: sending batch 1 Get to (n1,s1):1
	    dist sender send  querying next range at /Table/3/1/55/2/1
	    dist sender send  r6: sending batch 1 Get to (n1,s1):1
	    dist sender send  r7: sending batch 1 EndTxn, 1 QueryIntent to (n1,s1):1
	    sql txn           Scan /Table/55/{1-2}
	    dist sender send  querying next range at /Table/55/1
	    dist sender send  r20: sending batch 1 Scan to (n1,s1):1
	    sql txn           fetched: /kv2/primary/...PK.../k/v -> /1/2
	    sql txn           Put /Table/55/1/...PK.../0 -> /TUPLE/1:1:Int/1/1:2:Int/4
	    sql txn           fetched: /kv2/primary/...PK.../k/v -> /2/3
	    sql txn           Put /Table/55/1/...PK.../0 -> /TUPLE/1:1:Int/2/1:2:Int/5
	    dist sender send  querying next range at /Table/55/1/...PK.../0
	    dist sender send  r20: sending batch 2 Put, 1 EndTxn to (n1,s1):1
	    sql txn           fast path completed
	    sql txn           rows affected: 2
	    
	but found (query options: "") :
	    dist sender send  querying next range at /Table/2/1/53/"kv2"/3/1
	    dist sender send  r6: sending batch 1 Get to (n1,s1):1
	    dist sender send  querying next range at /Table/3/1/55/2/1
	    dist sender send  r6: sending batch 1 Get to (n1,s1):1
	    dist sender send  r7: sending batch 1 EndTxn, 1 QueryIntent to (n1,s1):1
	    sql txn           Scan /Table/55/{1-2}
	    dist sender send  querying next range at /Table/55/1
	    dist sender send  r20: sending batch 1 Scan to (n1,s1):1
	    dist sender send  querying next range at /Table/SystemConfigSpan/Start
	    dist sender send  r6: sending batch 1 PushTxn to (n1,s1):1
	    dist sender send  querying next range at /Table/55/1/...PK.../0
	    dist sender send  r20: sending batch 2 ResolveIntent to (n1,s1):1
	    sql txn           fetched: /kv2/primary/...PK.../k/v -> /1/2
	    sql txn           Put /Table/55/1/...PK.../0 -> /TUPLE/1:1:Int/1/1:2:Int/4
	    sql txn           fetched: /kv2/primary/...PK.../k/v -> /2/3
	    sql txn           Put /Table/55/1/...PK.../0 -> /TUPLE/1:1:Int/2/1:2:Int/5
	    dist sender send  querying next range at /Table/55/1/...PK.../0
	    dist sender send  r20: sending batch 2 Put, 1 EndTxn to (n1,s1):1
	    sql txn           fast path completed
	    sql txn           rows affected: 2
	    
	
logic.go:2333: 
	testdata/logic_test/show_trace:305: error while processing
logic.go:2334: testdata/logic_test/show_trace:305: too many errors encountered, skipping the rest of the input
------- Stdout: -------
=== PAUSE TestLogic/local/show_trace



TestLogic
--- FAIL: test/TestLogic (366.430s)
test_log_scope.go:81: test logs captured to: /tmp/logTestLogic558571843
test_log_scope.go:62: use -show-logs to present logs inline



TestLogic/local
--- FAIL: test/TestLogic: TestLogic/local (0.420s)

Please assign, take a look and update the issue accordingly.


jordanlewis · 2019-01-15T15:55:25Z

@nvanbenschoten @andreimatei the problem here is that there's an extra PushTxn. What's the right thing to do?
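To make the divergence easier to see, here is a minimal sketch that diffs an excerpt of the two traces from the report above. The trace lines are copied from the failure output; the helper name `extra_lines` is hypothetical and only for illustration, not part of the logic test framework.

```python
import difflib

# Excerpt of the "expected" trace from the report, around the point of divergence.
expected = [
    "sql txn           Scan /Table/55/{1-2}",
    "dist sender send  querying next range at /Table/55/1",
    "dist sender send  r20: sending batch 1 Scan to (n1,s1):1",
    "sql txn           fetched: /kv2/primary/...PK.../k/v -> /1/2",
]

# The same excerpt from the "but found" trace.
found = [
    "sql txn           Scan /Table/55/{1-2}",
    "dist sender send  querying next range at /Table/55/1",
    "dist sender send  r20: sending batch 1 Scan to (n1,s1):1",
    "dist sender send  querying next range at /Table/SystemConfigSpan/Start",
    "dist sender send  r6: sending batch 1 PushTxn to (n1,s1):1",
    "dist sender send  querying next range at /Table/55/1/...PK.../0",
    "dist sender send  r20: sending batch 2 ResolveIntent to (n1,s1):1",
    "sql txn           fetched: /kv2/primary/...PK.../k/v -> /1/2",
]


def extra_lines(expected, found):
    """Return the lines that appear in `found` but have no match in `expected`."""
    matcher = difflib.SequenceMatcher(a=expected, b=found, autojunk=False)
    extra = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            extra.extend(found[j1:j2])
    return extra


extra = extra_lines(expected, found)
for line in extra:
    print(line)
```

On this excerpt the only difference is a block of four extra `dist sender send` lines: the lookup at `/Table/SystemConfigSpan/Start`, the `PushTxn`, and the follow-up `ResolveIntent` on r20.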

andreimatei · 2019-01-15T18:01:26Z

I guess the right thing to do is... start logging to see what that intent is, and then take it from there. As long as all intents are on one range (as I guess they should be for that small table), each commit shouldn't leave intents behind. But maybe the intent that the update is running into is around the schema (which I guess would also be indicated by the preceding `querying next range at /Table/SystemConfigSpan/Start` line).
Or bisect and see where it started being flaky and blame that loser. I can't say I really want the issue at the moment :)

tbg · 2019-01-15T18:05:28Z

If the txn isn't 1PC, the intent can be observed by others, and if that's true there can be a push if the txn record is GC'ed before the waiter observes it. I haven't looked at the flake in detail but maybe that explains something.


knz · 2019-01-16T11:02:52Z

What's the key `/Table/SystemConfigSpan/Start`? Why was there a scan at that key? Is this expected in a txn push scenario?

andreimatei · 2019-01-16T16:09:09Z

(I don't think we're supposed to change issue titles for test flakes cause then TC will open new issues. Right?)

SystemConfigSpan/Start is the anchor key of schema change transactions. So what I think happened here is that the update with the bad trace ran into an intent left over from the previous `create table ... as...` txn.
And I now realize that that's a multi-range txn (artificially so, since it needs to be anchored on the SystemConfigSpan: it's a schema change, and those need to gossip the SystemConfigSpan, which I think they do in a really bad way), so I guess it's not unexpected for it to leave intents around, since their cleanup is async. (I don't know why Tobi brought in the txn record GC; I don't think that plays a role.)
And so I think this failure simply tells us that the async cleanup was slow.
Unclear to me what to do about it. Have I told you I hate these trace tests yet? :)

tbg · 2019-01-17T08:29:25Z

That sounds reasonable (when I wrote my comment, I hadn't looked into the failure mode). The test could run the same query once without the trace to clean up any errant intents, assuming it is guaranteed that the schema change txn is done at that point. Of course that's just a band-aid; tests of that kind are bound to be flaky.

andreimatei · 2019-01-22T16:54:54Z

I don't know what to do here other than delete the test.
Passing to @knz.

35521: sql: reduce non-determinism in the show_trace logic tests r=knz a=knz

Fixes #33720.
First commit from #35519.

The test is really about asserting the KV operations sent on behalf of SQL statements. The distsender traffic is really irrelevant. Since that's where most of the test flakes / non-determinism is coming from, simply remove it.

Release note: None

35548: sqlbase: avoid a race in sqlbase.CancelChecker r=knz a=knz

Fixes #35539.

Prior to the distsql-ification of planNode execution, all the execution steps in a local query were interleaved (using coroutine-style concurrency), so that the calls to `(*CancelChecker).Check()` were all sequential.

With distsql now it's possible for different planNode `Next()` or `startExec()` methods to be running in concurrent goroutines on multiple cores, i.e. truly parallel. This in turn requires atomic access to the counter in the cancel checker.

Without atomic access, there is a race condition and the possibility that the cancel checker does not work well (some increments can be performed two times, which could cause the condition of the check to occasionally fail).

Found with SQLSmith.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
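As an illustration of the idea in #35521 (dropping the distsender traffic from the compared trace), here is a small sketch applied to an excerpt of the two traces from this issue. How the PR actually changes the test is not shown in this thread; the helper `without_dist_sender` and the `(operation, message)` tuple layout are assumptions made for this example only.

```python
# Operation name used by the distsender rows in the trace output above.
DIST_SENDER = "dist sender send"

# (operation, message) rows copied from the "expected" trace in this issue.
expected = [
    ("sql txn", "Scan /Table/55/{1-2}"),
    ("dist sender send", "querying next range at /Table/55/1"),
    ("dist sender send", "r20: sending batch 1 Scan to (n1,s1):1"),
    ("sql txn", "fetched: /kv2/primary/...PK.../k/v -> /1/2"),
    ("sql txn", "Put /Table/55/1/...PK.../0 -> /TUPLE/1:1:Int/1/1:2:Int/4"),
]

# The same stretch of the "but found" trace, including the unexpected push.
found = [
    ("sql txn", "Scan /Table/55/{1-2}"),
    ("dist sender send", "querying next range at /Table/55/1"),
    ("dist sender send", "r20: sending batch 1 Scan to (n1,s1):1"),
    ("dist sender send", "querying next range at /Table/SystemConfigSpan/Start"),
    ("dist sender send", "r6: sending batch 1 PushTxn to (n1,s1):1"),
    ("dist sender send", "querying next range at /Table/55/1/...PK.../0"),
    ("dist sender send", "r20: sending batch 2 ResolveIntent to (n1,s1):1"),
    ("sql txn", "fetched: /kv2/primary/...PK.../k/v -> /1/2"),
    ("sql txn", "Put /Table/55/1/...PK.../0 -> /TUPLE/1:1:Int/1/1:2:Int/4"),
]


def without_dist_sender(rows):
    """Keep only the rows that were not emitted by the distsender."""
    return [row for row in rows if row[0] != DIST_SENDER]


kv_expected = without_dist_sender(expected)
kv_found = without_dist_sender(found)

print("raw traces equal:     ", expected == found)
print("filtered traces equal:", kv_expected == kv_found)
```

On this excerpt the raw traces differ only in distsender rows, so the filtered traces compare equal, which is consistent with the PR description's claim that most of the non-determinism comes from that traffic.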

cockroach-teamcity added this to the 2.2 milestone Jan 14, 2019

cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 14, 2019

jordanlewis assigned andreimatei Jan 15, 2019

knz added A-sql-executor SQL txn logic A-testing Testing tools and infrastructure labels Jan 16, 2019

knz changed the title ~~teamcity: failed test: TestLogic~~ teamcity: show_trace logic test fails due to unexpected txn push Jan 16, 2019

andreimatei assigned knz and unassigned andreimatei Jan 22, 2019

nvanbenschoten mentioned this issue Feb 19, 2019

teamcity: failed test: TestPlannerLogic #34911

Closed

andreimatei mentioned this issue Feb 21, 2019

teamcity: failed test: TestPlannerLogic #35106

Closed

This was referenced Mar 7, 2019

sql: reduce non-determinism in the show_trace logic tests #35521

Merged

sql: DDL has become excessively sensitive to txn retries #35549

Closed

craig bot closed this as completed in #35521 Mar 8, 2019

knz mentioned this issue Mar 9, 2019

sql: remove yet more trace non-determinism #35562

Merged
