SCRUB index check hangs when run concurrently with TPCC #33173

thoszhang · 2018-12-14T22:33:24Z

The new SCRUB roachtests scrub/{all-checks,index-only}/tpcc-1000, which run a series of SCRUB checks on a cluster running TPCC at the same time, have been timing out because the SCRUB query hangs. See #33151, #33149.

The TPCC queries themselves run successfully during this test, and the cluster is able to execute other queries when I ssh into one of the nodes. SCRUB also runs fine when run on the cluster with TPCC data with no other queries running, so it seems like the deadlock occurs when there's contention between TPCC queries and the SCRUB queries that need to do a full table scan. (I also tried running the roachtest with AS OF SYSTEM TIME '-5s' in the SCRUB query to reduce contention, which was successful. See #33152.)

Goroutine dump: goroutines.zip

thoszhang · 2018-12-14T22:53:10Z

Here is the goroutine for the SCRUB query that got stuck. It has been waiting in Flow.Wait for 490 minutes:

goroutine 47409 [select, 490 minutes]:
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Wait(0xc4998d5dc0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:636 +0x12f
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc42098af00, 0xc495f7b680, 0xc4af11c900, 0xc4338a91e0, 0xc445445600, 0xc473853860, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:271 +0x914
github.com/cockroachdb/cockroach/pkg/sql.scrubRunDistSQL(0x345f700, 0xc4c8b6bf50, 0xc495f7b680, 0xc42a89d730, 0xc4338a91e0, 0xc466116e00, 0x6, 0x8, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/scrub.go:566 +0x30d
github.com/cockroachdb/cockroach/pkg/sql.(*indexCheckOperation).Start(0xc44ab16930, 0x345f700, 0xc4c8b6bf50, 0xc473852f00, 0xc42a89d730, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/scrub_index.go:142 +0x517
github.com/cockroachdb/cockroach/pkg/sql.(*scrubNode).Next(0xc48229eaa0, 0x345f700, 0xc4c8b6bf50, 0xc473852f00, 0xc42a89d730, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/scrub.go:127 +0x253
github.com/cockroachdb/cockroach/pkg/sql.(*planNodeToRowSource).Next(0xc4b3f28500, 0xc42a89d730, 0x344f800, 0xc48229eaa0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:206 +0x5d3
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x345f700, 0xc4c8b6bf50, 0x346a880, 0xc4b3f28500, 0x3451bc0, 0xc466116700)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:172 +0x35
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc4b3f28500, 0x345f700, 0xc4c8b6bf50, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:804 +0x98
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).StartSync(0xc49c64dc00, 0x345f700, 0xc4c8b6bf50, 0x2fec1e8, 0xc42d366460, 0x344e940)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:618 +0x191
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc42098af00, 0xc495f7b560, 0xc4af11c900, 0xc43b6428c8, 0xc4a52fa2c0, 0xc42a89d7c8, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:261 +0x868
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun(0xc42098af00, 0x345f700, 0xc4c8b6bc80, 0xc42a89d7c8, 0xc495f7b560, 0xc4af11c900, 0x344f800, 0xc48229eaa0, 0xc4a52fa2c0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:805 +0x24c
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine(0xc42a89d300, 0x345f700, 0xc4c8b6bc80, 0xc42a89d730, 0x3, 0x7f63a6c30748, 0xc4af11c990, 0x3463c00, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:995 +0x27a
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc42a89d300, 0x345f700, 0xc4c8b6bc80, 0xc42b94f07c, 0x3c, 0x3463c40, 0xc460d18880, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:837 +0xaa5
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc42a89d300, 0x345f700, 0xc4c8b6bc80, 0xc42b94f07c, 0x3c, 0x3463c40, 0xc460d18880, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:417 +0xe92
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc42a89d300, 0x345f700, 0xc4c8b6bc80, 0xc42b94f07c, 0x3c, 0x3463c40, 0xc460d18880, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:98 +0x34d
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc42a89d300, 0x345f640, 0xc43283a440, 0xc420716c38, 0x5400, 0x15000, 0xc420716cd0, 0xc420420300, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1136 +0x21c7
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc420aca690, 0x345f640, 0xc43283a440, 0xc42a89d300, 0x5400, 0x15000, 0xc420716cd0, 0xc420420300, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:406 +0xce
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl.func4(0xc420aca690, 0x345f640, 0xc43283a440, 0xc42a89d300, 0x5400, 0x15000, 0xc420716cd0, 0xc420420300, 0xc420420360, 0xc430d53950)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:316 +0x81
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:315 +0x1094

There are also some RowChannel.Push calls blocked on a channel send, which may be relevant:

goroutine 1723451 [chan send, 501 minutes]:
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*RowChannel).Push(0xc4a7b03500, 0xc4cd5df830, 0x3, 0x3, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:426 +0xd3
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*routerBase).start.func1(0x345f640, 0xc4b3a4a240, 0xc4b78ab440, 0xc493c8b200, 0xc4998d5f48)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/routers.go:324 +0x6e1
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*routerBase).start
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/routers.go:278 +0x9f

jordanlewis · 2018-12-14T23:25:42Z

Thanks for filing @lucy-zhang! @asubiotto could you take a look at this? Download the goroutine dump and have a look at some of the blocked threads. It looks suspiciously similar to some of the things you've been investigating recently - for example, some of the blocked threads are waiting on getting network quota from GRPC...

asubiotto · 2018-12-17T16:50:12Z

I took a quick look at this and it seems that the SCRUB is planned as a wrapped plan node, so any and all remote data will be sent over the wire, meaning that there is a high likelihood that the stream window is being taken over similar to #14948.

@lucy-zhang, what is stopping us from planning a scrub in a distributed manner?

knz · 2018-12-17T17:43:43Z

@asubiotto the answer to your latter question is "because it was never implemented so far".

jordanlewis · 2019-02-06T21:49:19Z

This will get done automatically once the delete-local pr is in.

34383: sql: delete local implementations of planNodes r=jordanlewis a=jordanlewis

This PR deletes the remaining users of the planNode execution engine and deletes the duplicate implementations for those planNodes that have DistSQL equivalents.

Closes #33173.

Co-authored-by: Jordan Lewis

thoszhang self-assigned this Dec 14, 2018

thoszhang added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Dec 14, 2018

vivekmenezes added the A-sql-execution Relating to SQL execution. label Dec 18, 2018

vivekmenezes unassigned thoszhang Dec 18, 2018

jordanlewis self-assigned this Feb 6, 2019

jordanlewis mentioned this issue Feb 8, 2019

sql: delete local implementations of planNodes #34383

Merged

craig bot closed this as completed in #34383 Feb 19, 2019
