roachtest: loqrecovery/workload=tpcc/rangeSize=16mb failed #98639

Closed
cockroach-teamcity opened this issue Mar 14, 2023 · 1 comment · Fixed by #98851
Labels: A-testing (Testing tools and infrastructure), branch-master (Failures and bugs on the master branch), C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior), C-test-failure (Broken test, automatically or manually discovered), O-roachtest, O-robot (Originated from a bot)

cockroach-teamcity commented Mar 14, 2023

roachtest.loqrecovery/workload=tpcc/rangeSize=16mb failed with artifacts on master @ 10c1e3e01b7da4cced0478e1bfd711a1c9be9afc:

test artifacts and logs in: /artifacts/loqrecovery/workload=tpcc/rangeSize=16mb/run_1
(test_runner.go:990).runTest: test timed out (10h0m0s)

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/replication

This test on roachdash | Improve this report!

Jira issue: CRDB-25384

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and T-kv-replication labels Mar 14, 2023
cockroach-teamcity added this to the 23.1 milestone Mar 14, 2023
aliher1911 self-assigned this Mar 15, 2023
aliher1911 commented

It looks like the test gets stuck while waiting for the connection to the cluster to be reestablished in:

		if err = contextutil.RunWithTimeout(ctx, "wait-for-restart", time.Minute,

Unfortunately, gosql.DB.ExecContext ignores context cancellation while it is waiting to establish a connection and complete the handshake, because the context is dropped at database/sql/sql.go:758.

This is visible in the stack:

goroutine 2039139 [IO wait, 597 minutes]:
internal/poll.runtime_pollWait(0x7f6d9d8eef60, 0x72)
	GOROOT/src/runtime/netpoll.go:305 +0x89
internal/poll.(*pollDesc).wait(0xc00ff4f700?, 0xc005a42000?, 0x0)
	GOROOT/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
	GOROOT/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00ff4f700, {0xc005a42000, 0x1000, 0x1000})
	GOROOT/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc00ff4f700, {0xc005a42000?, 0x495c00?, 0x7958d90?})
	GOROOT/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc0085941a0, {0xc005a42000?, 0x24?, 0xc00ff4f718?})
	GOROOT/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc00680e840, {0xc00f07fb20, 0x5, 0xc003182fe8?})
	GOROOT/src/bufio/bufio.go:237 +0x1bb
io.ReadAtLeast({0x7903a40, 0xc00680e840}, {0xc00f07fb20, 0x5, 0x200}, 0x5)
	GOROOT/src/io/io.go:332 +0x9a
io.ReadFull(...)
	GOROOT/src/io/io.go:351
github.com/lib/pq.(*conn).recvMessage(0xc00f07fb00, 0xc004368198)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1004 +0xca
github.com/lib/pq.(*conn).recv(0xc00f07fb00)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1034 +0x45
github.com/lib/pq.(*conn).startup(0xc00f07fb00, 0xc0000bc060?)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1175 +0x6b6
github.com/lib/pq.(*Connector).open(0xc0037bf1d0, {0x7958d90, 0xc0000bc060})
	github.com/lib/pq/external/com_github_lib_pq/conn.go:378 +0x4c5
github.com/lib/pq.DialOpen({0x7934d28?, 0xc0045ad740}, {0xc00379b3c1?, 0x565a560?})
	github.com/lib/pq/external/com_github_lib_pq/conn.go:328 +0x7b
github.com/lib/pq.Open(...)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:318
github.com/lib/pq.Driver.Open({}, {0xc00379b3c1, 0x32})
	github.com/lib/pq/external/com_github_lib_pq/conn.go:56 +0x85
database/sql.dsnConnector.Connect(...)
	GOROOT/src/database/sql/sql.go:759
database/sql.(*DB).conn(0xc0009720d0, {0x7958dc8, 0xc00491fbc0}, 0x1)
	GOROOT/src/database/sql/sql.go:1393 +0x763
database/sql.(*DB).exec(0x53ec8a?, {0x7958dc8, 0xc00491fbc0}, {0xc001b82300, 0x72}, {0x0, 0x0, 0x0}, 0x72?)
	GOROOT/src/database/sql/sql.go:1655 +0x5d
database/sql.(*DB).ExecContext(0x644d710?, {0x7958dc8, 0xc00491fbc0}, {0xc001b82300, 0x72}, {0x0, 0x0, 0x0})
	GOROOT/src/database/sql/sql.go:1633 +0xe5
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.setDBRangeLimits({0x7958dc8, 0xc00491fbc0}, 0xc00491fbc0?, {0x61f5c91, 0x7}, 0x0)
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/loss_of_quorum_recovery.go:591 +0x12e
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runRecoverLossOfQuorum.func1.1({0x7958dc8, 0xc00491fbc0})
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/loss_of_quorum_recovery.go:279 +0x18f
github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout({0x7958d58?, 0xc004de5c80?}, {0x622707e, 0x10}, 0xdf8475800, 0xc003183c70)
	github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91 +0xed
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runRecoverLossOfQuorum.func1({0x7958d58, 0xc004de5c80})
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/loss_of_quorum_recovery.go:260 +0x1b6b
main.(*monitorImpl).Go.func1()
	main/pkg/cmd/roachtest/monitor.go:105 +0xa9
golang.org/x/sync/errgroup.(*Group).Go.func1()
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:72 +0xa5

We need a better approach to preventing hung connections, because some tests expect the cluster to be unavailable. In this particular case, the liveness range was not able to recover.

aliher1911 added the C-bug and A-testing labels Mar 17, 2023
craig bot closed this as completed in bbc5ec1 Mar 23, 2023