
roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] #78284

Closed
cockroach-teamcity opened this issue Mar 22, 2022 · 9 comments · Fixed by #79977
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team


cockroach-teamcity commented Mar 22, 2022

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 10e0c5d92f8ef953d6b497b448893bb5044cdd31:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
	follower_reads.go:712,follower_reads.go:373,follower_reads.go:72,test_runner.go:875: Post "http://34.145.50.48:26258/ts/query": EOF

	test_runner.go:1006,test_runner.go:905: test timed out (0s)
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage


Jira issue: CRDB-14044

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 22, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 22, 2022

tbg commented Mar 23, 2022


goroutine 16364430 [select, 593 minutes]:
net/http.(*persistConn).roundTrip(0xc002814360, 0xc000ced940)
	GOROOT/src/net/http/transport.go:2614 +0x97d
net/http.(*Transport).roundTrip(0x9575040, 0xc00367db00)
	GOROOT/src/net/http/transport.go:594 +0x7d1
net/http.(*Transport).RoundTrip(0x30, 0x65a6000)
	GOROOT/src/net/http/roundtrip.go:18 +0x19
net/http.send(0xc00367db00, {0x65a6000, 0x9575040}, {0x503df00, 0x591c01, 0x0})
	GOROOT/src/net/http/client.go:252 +0x5d8
net/http.(*Client).send(0xc003e17830, 0xc00367db00, {0x0, 0x10000000c, 0x0})
	GOROOT/src/net/http/client.go:176 +0x9b
net/http.(*Client).do(0xc003e17830, 0xc00367db00)
	GOROOT/src/net/http/client.go:725 +0x908
net/http.(*Client).Do(...)
	GOROOT/src/net/http/client.go:593
github.com/cockroachdb/cockroach/pkg/util/httputil.doJSONRequest({{0x0, 0x0}, 0x0, {0x0, 0x0}, 0x0}, 0xc00367db00, {0x66b9da0, 0xc003254918})
	github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:109 +0x2d1
github.com/cockroachdb/cockroach/pkg/util/httputil.PostJSON({{0x0, 0x0}, 0x0, {0x0, 0x0}, 0x0}, {0xc00273b500, 0x22}, {0x66b9d50, 0xc003e17740}, ...)
	github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:74 +0x16f
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.verifyHighFollowerReadRatios({0x6648bb0, 0xc005a44b00}, {0x674bc28, 0xc0007328c0}, {0x676d3e8, 0xc003e24c80}, 0x1, {0x38b2fcc4, 0xed9cb8640, 0x0}, ...)
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:711 +0x7ef
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runFollowerReadsTest({0x6648bb0, 0xc005a44b00}, {0x674bc28, 0xc0007328c0}, {0x676d3e8, 0xc003e24c80}, {0x1, {0x510203d, 0x8}, {0x50f894b, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:373 +0x15ed
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerFollowerReads.func1.1({0x6648bb0, 0xc005a44b00}, {0x674bc28, 0xc0007328c0}, {0x676d3e8, 0xc003e24c80})
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/follower_reads.go:72 +0x3a6
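
The blocked call at the top of this stack is an HTTP POST of a timeseries query to the node's /ts/query endpoint (via httputil.PostJSON, per the frames above). A minimal sketch of that request shape, using only the standard library; the port, metric name, and JSON field names here are illustrative, not the test's exact payload:

// Sketch: POST a timeseries query to a node's HTTP endpoint, the call that is
// stuck in the goroutine above. The body layout is illustrative only; the real
// test builds a timeseries query request and sends it via httputil.PostJSON.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// Hypothetical one-hour query window over one of the metrics the test checks.
	body, err := json.Marshal(map[string]interface{}{
		"start_nanos": time.Now().Add(-time.Hour).UnixNano(),
		"end_nanos":   time.Now().UnixNano(),
		"queries": []map[string]string{
			{"name": "cr.store.follower_reads.success_count"},
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	// A client-side timeout turns a quorum-less /System/tsd range into a prompt
	// error instead of a hang that lasts until the roachtest timeout.
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Post("http://localhost:26258/ts/query", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err) // e.g. EOF or a timeout when the timeseries range is unavailable
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}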


tbg commented Mar 23, 2022

It's hard to tell from the test setup (since there's a lot going on), but a simple explanation would be that the timeseries query above hangs because we lost quorum: the timeseries data is itself stored in the KV store, so once quorum is gone there is no reason to expect it to remain queryable. Timeseries are 3x replicated and we're killing three out of six nodes. Unless the test is being clever about making sure the timeseries are in the surviving region, this kind of problem is expected.

We're definitely seeing unexpected request durations on those ranges, indicating that they are indeed unavailable:
$ cockroach debug merge-logs logs/*.unredacted | grep -E 'r4[^0-9]' | grep 'have been'
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 09:01:33.477331 10085 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,ts-srv] 1500 slow range RPC: have been waiting 60.14s (58 attempts) for RPC Get [‹/System/tsd/cr.node.sql.select.count/4/10s/2022-03-22T08:00:00Z›,‹/Min›), Get [‹/System/tsd/cr.node.sql.select.count/4/10s/2022-03-22T09:00:00Z›,‹/Min›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 09:01:33.682050 10086 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,ts-srv] 1502 slow range RPC: have been waiting 60.35s (59 attempts) for RPC Get [‹/System/tsd/cr.store.follower_reads.success_count/5/10s/2022-03-22T08:00:00Z›,‹/Min›), Get [‹/System/tsd/cr.store.follower_reads.success_count/5/10s/2022-03-22T09:00:00Z›,‹/Min›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: failed to connect to n1 at 10.142.0.194:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.142.0.194:26257: connect: connection refused"›
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 09:01:34.254075 10089 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,ts-srv] 1504 slow range RPC: have been waiting 60.92s (59 attempts) for RPC Get [‹/System/tsd/cr.node.sql.select.count/6/10s/2022-03-22T08:00:00Z›,‹/Min›), Get [‹/System/tsd/cr.node.sql.select.count/6/10s/2022-03-22T09:00:00Z›,‹/Min›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 09:01:34.254099 10087 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,ts-srv] 1505 slow range RPC: have been waiting 60.92s (59 attempts) for RPC Get [‹/System/tsd/cr.node.sql.select.count/5/10s/2022-03-22T08:00:00Z›,‹/Min›), Get [‹/System/tsd/cr.node.sql.select.count/5/10s/2022-03-22T09:00:00Z›,‹/Min›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 09:01:34.430622 10084 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,ts-srv] 1506 slow range RPC: have been waiting 61.10s (59 attempts) for RPC Get [‹/System/tsd/cr.store.follower_reads.success_count/4/10s/2022-03-22T08:00:00Z›,‹/Min›), Get [‹/System/tsd/cr.store.follower_reads.success_count/4/10s/2022-03-22T09:00:00Z›,‹/Min›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 09:01:34.430677 10088 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,ts-srv] 1507 slow range RPC: have been waiting 61.09s (60 attempts) for RPC Get [‹/System/tsd/cr.store.follower_reads.success_count/6/10s/2022-03-22T08:00:00Z›,‹/Min›), Get [‹/System/tsd/cr.store.follower_reads.success_count/6/10s/2022-03-22T09:00:00Z›,‹/Min›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›
teamcity-4632355-1647926356-44-n6cpu4-geo-0004> W220322 18:55:23.194077 1049484 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n4,client=35.185.76.110:33742,user=root] 152262 slow range RPC: have been waiting 60.45s (59 attempts) for RPC CheckConsistency [‹/System/tsd›,‹/System/"tse"›) to r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›

Replica circuit breakers should've let this test fail "gracefully" with a loss of quorum error. However, the SHA hadn't picked up #76146 yet, so breakers were disabled.
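
For reference, replica circuit breakers are controlled by the kv.replica_circuit_breaker.slow_replication_threshold cluster setting (zero disables them). A minimal sketch of enabling them explicitly; the setting name is assumed for the version under test, and the connection string and threshold value are illustrative:

// Sketch: explicitly enable replica circuit breakers on a build where they
// default to off. Setting name is an assumption about the version under test;
// threshold value and connection string are arbitrary.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// With a nonzero threshold, a replica that cannot replicate a proposal for
	// this long trips its breaker and fails requests fast instead of hanging.
	if _, err := db.Exec(
		"SET CLUSTER SETTING kv.replica_circuit_breaker.slow_replication_threshold = '15s'",
	); err != nil {
		log.Fatal(err)
	}
}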

Going to toss this over to KV for further investigation.

@tbg tbg changed the title roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed [test loses quorum on ts ranges] Mar 23, 2022
@cockroach-teamcity

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 834eaa0e83350486830867b5edd6e8809b52aa55:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
	follower_reads.go:712,follower_reads.go:373,follower_reads.go:72,test_runner.go:875: Post "http://34.82.187.133:26258/ts/query": EOF

	test_runner.go:1006,test_runner.go:905: test timed out (0s)

Same failure on other branches



@cockroach-teamcity

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ 63ea9139e2ca996e38b5fe7c7b43a97e625242f5:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
	follower_reads.go:712,follower_reads.go:373,follower_reads.go:72,test_runner.go:875: Post "http://35.233.229.131:26258/ts/query": EOF

	test_runner.go:1006,test_runner.go:905: test timed out (0s)

Same failure on other branches


@nvanbenschoten

> Unless the test is being clever about making sure the timeseries are in the surviving region, this kind of problem is expected.

We are attempting to be smart about this. See this code:

	if topology.deadPrimaryRegion {
		for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
			// If we're going to be killing nodes in a multi-region cluster, make
			// sure system ranges have all upreplicated as expected as well. Do so
			// using replication reports.
			WaitForUpdatedReplicationReport(ctx, t, db)
			var expAtRisk int
			if topology.survival == zone {
				// Only the 'test' table's range should be at risk of a region
				// failure.
				expAtRisk = 1
			} else {
				// No range should be at risk of a region failure.
				expAtRisk = 0
			}
			const q3 = `
				SELECT
					coalesce(sum(at_risk_ranges), 0)
				FROM
					system.replication_critical_localities
				WHERE
					locality LIKE '%region%'
				AND
					locality NOT LIKE '%zone%'`
			var atRisk int
			require.NoError(t, db.QueryRowContext(ctx, q3).Scan(&atRisk))
			if atRisk == expAtRisk {
				break
			}
			t.L().Printf("rebalancing not complete, expected %d at risk ranges, "+
				"found %d", expAtRisk, atRisk)
		}
	}
}

That logic should wait until all ranges other than the range in the database with ZONE survivability have upreplicated across regions.

But this isn't what we see in the logs you posted. Notice the range descriptor: r4:‹/System{/tsd-tse}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=4] still shows only three replicas, on n1, n2, and n3.

I reproduced this and confirmed that the unavailable timeseries range never achieved region survivability:

[Two screenshots, dated 2022-04-12, showing the timeseries range's replica placement.]

Something must be going wrong here with the replication reports. I wonder if it's related to async span config reconciliation in some form.
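
One way to confirm the same placement without screenshots is to read the range's replicas and localities directly from SQL; a sketch, assuming the crdb_internal.ranges columns named below exist on this version:

// Sketch: inspect where r4 (the /System/tsd range) actually lives. The
// crdb_internal.ranges column names used here are assumptions; check them on
// the version under test.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var replicas, localities string
	if err := db.QueryRow(
		`SELECT replicas::STRING, replica_localities::STRING
		   FROM crdb_internal.ranges WHERE range_id = 4`,
	).Scan(&replicas, &localities); err != nil {
		log.Fatal(err)
	}
	// With region survivability the localities should span three regions; in
	// the failure above the range never upreplicated beyond its original three
	// replicas.
	fmt.Printf("r4 replicas=%s localities=%s\n", replicas, localities)
}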

@cockroach-teamcity

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum failed with artifacts on master @ cc07b5e7e670097560cb8412b380484773df1e96:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/insufficient-quorum/run_1
	follower_reads.go:718,follower_reads.go:379,follower_reads.go:73,test_runner.go:875: Post "http://34.82.76.44:26258/ts/query": EOF

	test_runner.go:1006,test_runner.go:905: test timed out (0s)

Same failure on other branches


@ajwerner ajwerner removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Apr 14, 2022
@ajwerner

I'm pretty sure I've got this one. My suspicion is that it was caused by #76279, which meant that the system config was not immediately available when we first set the cluster setting to trigger the report, but that's just a hunch. I'm not totally clear on why it was okay to only look at one report before that change. Maybe the report generation timing used to be such that we always did one iteration of the retry loop and the report was already there, and now, for whatever reason, some timing around the rangefeed means we have to do more than one iteration.

What I do know is that when I added code to print out the state of several tables inside the critical-localities check, just before we actually did the scan, the test ran 60 times without failing, whereas in the past I was getting 1-2 failures per 20 runs. That led me to wonder whether we simply weren't waiting for the right thing, and indeed it seems we weren't: we were only waiting for one report to be written, but there are 3 reports in total and we write the critical localities report second.

I'm still running it; I'm at 55 successes with #79977 and the fix feels right to me. I've removed the release-blocker label. My working theory is that #76279 just changed the timing and exposed the bug. I don't feel super eager to prove this out further right now, but I'm pretty happy with the answer for the moment.
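
The shape of the fix described above is to wait for all of the replication reports to be (re)generated, not just the first one. A sketch of that kind of wait follows; the system.reports_meta schema and the hard-coded count of three reports are assumptions here, not code taken from #79977:

// Sketch: block until every replication report has been regenerated after a
// reference time, so that system.replication_critical_localities reflects the
// current zone configs before the test queries it. Table/column names and the
// report count are assumptions about this CockroachDB version.
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func waitForFreshReports(ctx context.Context, db *sql.DB, since time.Time) error {
	for {
		var fresh int
		// system.reports_meta holds one row per report with its last generation
		// time. The critical localities report is only the second of the three
		// to be written, so waiting for a single fresh row is not enough.
		if err := db.QueryRowContext(ctx,
			`SELECT count(*) FROM system.reports_meta WHERE generated > $1`, since,
		).Scan(&fresh); err != nil {
			return err
		}
		if fresh >= 3 {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := waitForFreshReports(context.Background(), db, time.Now()); err != nil {
		log.Fatal(err)
	}
}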
