SCRUB index check hangs when run concurrently with TPCC #33173

Closed
thoszhang opened this issue Dec 14, 2018 · 5 comments
Labels
A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@thoszhang
Contributor

The new SCRUB roachtests scrub/{all-checks,index-only}/tpcc-1000, which run a series of SCRUB checks on a cluster running TPCC at the same time, have been timing out because the SCRUB query hangs. See #33151, #33149.

The TPCC queries themselves run successfully during this test, and the cluster is able to execute other queries when I ssh into one of the nodes. SCRUB also runs fine when run on the cluster with TPCC data with no other queries running, so it seems like the deadlock occurs when there's contention between TPCC queries and the SCRUB queries that need to do a full table scan. (I also tried running the roachtest with AS OF SYSTEM TIME '-5s' in the SCRUB query to reduce contention, which was successful. See #33152.)

Goroutine dump: goroutines.zip

@thoszhang thoszhang self-assigned this Dec 14, 2018
@thoszhang
Contributor Author

The flow for the SCRUB query that got stuck:

goroutine 47409 [select, 490 minutes]:
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Wait(0xc4998d5dc0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:636 +0x12f
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc42098af00, 0xc495f7b680, 0xc4af11c900, 0xc4338a91e0, 0xc445445600, 0xc473853860, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:271 +0x914
github.com/cockroachdb/cockroach/pkg/sql.scrubRunDistSQL(0x345f700, 0xc4c8b6bf50, 0xc495f7b680, 0xc42a89d730, 0xc4338a91e0, 0xc466116e00, 0x6, 0x8, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/scrub.go:566 +0x30d
github.com/cockroachdb/cockroach/pkg/sql.(*indexCheckOperation).Start(0xc44ab16930, 0x345f700, 0xc4c8b6bf50, 0xc473852f00, 0xc42a89d730, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/scrub_index.go:142 +0x517
github.com/cockroachdb/cockroach/pkg/sql.(*scrubNode).Next(0xc48229eaa0, 0x345f700, 0xc4c8b6bf50, 0xc473852f00, 0xc42a89d730, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/scrub.go:127 +0x253
github.com/cockroachdb/cockroach/pkg/sql.(*planNodeToRowSource).Next(0xc4b3f28500, 0xc42a89d730, 0x344f800, 0xc48229eaa0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:206 +0x5d3
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x345f700, 0xc4c8b6bf50, 0x346a880, 0xc4b3f28500, 0x3451bc0, 0xc466116700)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:172 +0x35
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc4b3f28500, 0x345f700, 0xc4c8b6bf50, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:804 +0x98
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).StartSync(0xc49c64dc00, 0x345f700, 0xc4c8b6bf50, 0x2fec1e8, 0xc42d366460, 0x344e940)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:618 +0x191
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc42098af00, 0xc495f7b560, 0xc4af11c900, 0xc43b6428c8, 0xc4a52fa2c0, 0xc42a89d7c8, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:261 +0x868
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun(0xc42098af00, 0x345f700, 0xc4c8b6bc80, 0xc42a89d7c8, 0xc495f7b560, 0xc4af11c900, 0x344f800, 0xc48229eaa0, 0xc4a52fa2c0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:805 +0x24c
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine(0xc42a89d300, 0x345f700, 0xc4c8b6bc80, 0xc42a89d730, 0x3, 0x7f63a6c30748, 0xc4af11c990, 0x3463c00, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:995 +0x27a
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc42a89d300, 0x345f700, 0xc4c8b6bc80, 0xc42b94f07c, 0x3c, 0x3463c40, 0xc460d18880, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:837 +0xaa5
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc42a89d300, 0x345f700, 0xc4c8b6bc80, 0xc42b94f07c, 0x3c, 0x3463c40, 0xc460d18880, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:417 +0xe92
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc42a89d300, 0x345f700, 0xc4c8b6bc80, 0xc42b94f07c, 0x3c, 0x3463c40, 0xc460d18880, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:98 +0x34d
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc42a89d300, 0x345f640, 0xc43283a440, 0xc420716c38, 0x5400, 0x15000, 0xc420716cd0, 0xc420420300, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1136 +0x21c7
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc420aca690, 0x345f640, 0xc43283a440, 0xc42a89d300, 0x5400, 0x15000, 0xc420716cd0, 0xc420420300, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:406 +0xce
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl.func4(0xc420aca690, 0x345f640, 0xc43283a440, 0xc42a89d300, 0x5400, 0x15000, 0xc420716cd0, 0xc420420300, 0xc420420360, 0xc430d53950)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:316 +0x81
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:315 +0x1094

There are also some RowChannel.Push calls blocked on a channel send, which may be relevant:

goroutine 1723451 [chan send, 501 minutes]:
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*RowChannel).Push(0xc4a7b03500, 0xc4cd5df830, 0x3, 0x3, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:426 +0xd3
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*routerBase).start.func1(0x345f640, 0xc4b3a4a240, 0xc4b78ab440, 0xc493c8b200, 0xc4998d5f48)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/routers.go:324 +0x6e1
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*routerBase).start
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/routers.go:278 +0x9f

@thoszhang thoszhang added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Dec 14, 2018
@jordanlewis
Member

Thanks for filing @lucy-zhang! @asubiotto could you take a look at this? Download the goroutine dump and have a look at some of the blocked goroutines. It looks suspiciously similar to some of the things you've been investigating recently. For example, some of the blocked goroutines are waiting to get network quota from gRPC...

@asubiotto
Contributor

I took a quick look at this, and it seems that the SCRUB is planned as a wrapped plan node, so all remote data is sent over the wire. That makes it highly likely that the stream window is being taken over, similar to #14948.
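
One way to picture the "window taken over" hypothesis is a toy model in which several streams draw from one shared send budget. The sketch below assumes such a shared budget and uses hypothetical names (`window`, `tryAcquire`, `release`); it is not gRPC's actual flow-control code, and it does not confirm that this is what happened here.

```go
package main

import "fmt"

// window is a toy flow-control budget shared by all streams on a connection.
// Each token represents permission to send one message.
type window struct {
	tokens chan struct{}
}

// newWindow returns a window holding n tokens.
func newWindow(n int) *window {
	w := &window{tokens: make(chan struct{}, n)}
	for i := 0; i < n; i++ {
		w.tokens <- struct{}{}
	}
	return w
}

// tryAcquire takes one token if any is available, without blocking.
func (w *window) tryAcquire() bool {
	select {
	case <-w.tokens:
		return true
	default:
		return false
	}
}

// release returns one token, as a receiver would after consuming a message.
func (w *window) release() { w.tokens <- struct{}{} }

func main() {
	w := newWindow(3)

	// Stream A keeps sending, but its receiver never consumes, so no tokens
	// come back and A ends up holding the entire window.
	sentA := 0
	for w.tryAcquire() {
		sentA++
	}

	// Stream B now has no budget left, even though B's own receiver is idle.
	fmt.Println("stream A sent:", sentA, "stream B can send:", w.tryAcquire())

	// Only once A's receiver consumes a message can B make progress.
	w.release()
	fmt.Println("stream B can send after release:", w.tryAcquire())
}
```

If the consumer of stream A is itself waiting on data that can only arrive via stream B, neither side makes progress, which would look like the hang described in this issue.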

@lucy-zhang, what is stopping us from planning a scrub in a distributed manner?

@knz
Contributor

knz commented Dec 17, 2018

@asubiotto the answer to your latter question is "because it has never been implemented".

@vivekmenezes vivekmenezes added the A-sql-execution Relating to SQL execution. label Dec 18, 2018
@jordanlewis jordanlewis self-assigned this Feb 6, 2019
@jordanlewis
Member

This will get done automatically once the delete-local PR is in.

craig bot pushed a commit that referenced this issue Feb 19, 2019
34383: sql: delete local implementations of planNodes r=jordanlewis a=jordanlewis

This PR deletes the remaining users of the planNode execution engine and deletes the duplicate implementations for those planNodes that have DistSQL equivalents.

Closes #33173.

Co-authored-by: Jordan Lewis <[email protected]>
@craig craig bot closed this as completed in #34383 Feb 19, 2019